newsfeeds file to figure out which sites are fed by NNTPlink. It expects to see entries of the form:
    site:*:Tc,Wnm:/usr/local/news/bin/nntplink -k -q nntp.host.org

It makes the assumption that all NNTPlink feed sites are up for grabs. This decision is made by looking at the last field of the newsfeeds entry: if it matches the expression in $nntplink_id, then it is a target.
If this is not true at your site, then until I think of something better you can try this: make a dummy newsfeeds file with lines like

    site:xxx:xxx:nntplink nntp.host

and point rebatch at that.
A separate configuration file is NOT an option - I want to think less not more each time I add a feed!
If that does not appeal, you could create a symbolic link giving nntplink another name (e.g., nntplink2), use nntplink2 in your newsfeeds file and as the value of $nntplink_id in rebatch.conf.
I also assume that the last thing on the newsfeeds line is the remote host to send it to. If that is not true, then a dummy newsfeeds file is called for.
rebatch can cope with INN's comments, whitespace and continued lines in the same way as INN does.
One other fun assumption is that one innxmit fits all. That is, it is suitable to send a batch file with

    innxmit -t 300 remotehost batchfile

("-t 300" ensures you don't get hung processes if the remote site is off the air). You can alter the innxmit parameters on a global basis, but not on a case-by-case one.
If a real need for a per-site configuration file arises, I will think about adding it. This was written for my site and (fortunately) I don't need that complexity.
rebatch uses the subroutine read_newsfeeds in rebatch.common to parse the newsfeeds file.
It reads a line at a time from the file, discarding everything after a comment character, then leading and trailing whitespace. If the line ends in a '\' character, it is appended to the previous line, sans the '\', and another line is read. When it finds a line that does not end in a continuation character (and is not blank), it assumes it has a valid newsfeeds line.
That is, the entries

    site1:*:Tc,Wnm:/usr/local/news/bin/nntplink -k -q nntp.host.org
    ## site2 is very fussy about the groups it gets
    site2:!*,\
        hundreds of individual group lines,\
        each ending in,\
        a continuation character,\
        !/local:Tc,Wnm:/usr/local/news/bin/nntplink -k -q nntp2.host.org

are both parsed just as INN parses them.
The line is then split apart on the ':' character.
If there is a '/' in the first (site) portion, it and everything after it is disposed of.
The fourth part is examined. If it matches $nntplink_id (which can be a Perl regular expression), then we have an NNTPlink feed site rebatch can cope with. If we don't see $nntplink_id, then this entry is discarded.
rebatch then splits the fourth part apart on spaces and takes the last portion (nntp.host.org and nntp2.host.org in the above examples) as the host name we send articles to. The sitename and nntphost name are recorded in arrays for later use.
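The parsing just described can be sketched like this. This is a Python illustration of the logic, not the actual Perl code in rebatch.common; the function name and default pattern are assumptions.

```python
import re

def read_newsfeeds(path, nntplink_id=r"nntplink"):
    """Return (sitename, nntphost) pairs for newsfeeds entries whose
    command field matches nntplink_id (a sketch of read_newsfeeds)."""
    pairs = []
    logical = ""
    with open(path) as fh:
        for raw in fh:
            # discard everything after a comment character, then
            # leading and trailing whitespace
            line = raw.split("#", 1)[0].strip()
            if line.endswith("\\"):
                logical += line[:-1]       # continuation: keep reading
                continue
            entry, logical = logical + line, ""
            if not entry:
                continue
            fields = entry.split(":")
            if len(fields) < 4:
                continue
            site = fields[0].split("/", 1)[0]  # drop '/' and what follows
            command = fields[3]
            if not re.search(nntplink_id, command):
                continue                   # not an NNTPlink feed: discard
            pairs.append((site, command.split()[-1]))
    return pairs
```

Run against the two example entries above, this would yield ("site1", "nntp.host.org") and ("site2", "nntp2.host.org").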
Then it reads the newsfeeds file (see above) and works through the sitename/nntphost pairs. This is where things get "interesting".
I have found six types of files that a channel feed NNTPlink can leave behind:

#1. nntphost.link is a status file NNTPlink uses. It contains things like the process ID of the current NNTPlink process. This file is normal and rebatch ignores it.

#2. nntphost.1234 is created when the remote host gets too far behind in its acceptance of articles, or when NNTPlink exited with articles still to send. Files of type #3 get renamed into files with this name.

#3. nntphost.1234.tmp is created when NNTPlink cannot contact the remote host, or the remote host refuses articles for some reason. Files of this form are open for writing by NNTPlink, so care needs to be taken with them. If NNTPlink has one of these files open when it exits, it renames it to the #2 form.

#4. sitename is created when NNTPlink is so far behind that INN notices and starts writing its own batch files. If this file exists, INN may have it open for writing, so the only way to cope with it is to rename the file, then send INN a flush for that site.

#5. sitename.1234 is not created by NNTPlink or INN. One of my beta sites asked for files of this form to be cleaned up (something to do with NNTPlink funnel feeds?) so I do.

#6. nntphost.rebatch files are created by the rebatch program and contain the concatenation of the other five types of files.
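For illustration, the six forms can be told apart with a few patterns. This is a hypothetical helper, not code from rebatch:

```python
import re

def classify(name, sitename, nntphost):
    """Return the file-type number (#1-#6) for an out.going filename,
    or 0 if it is none of them (illustrative only)."""
    if name == nntphost + ".link":
        return 1                                  # NNTPlink status file
    if name == nntphost + ".rebatch":
        return 6                                  # rebatch's own batch
    host, site = re.escape(nntphost), re.escape(sitename)
    if re.fullmatch(host + r"\.\d+\.tmp", name):
        return 3                                  # open for writing!
    if re.fullmatch(host + r"\.\d+", name):
        return 2                                  # closed NNTPlink batch
    if name == sitename:
        return 4                                  # INN's own batch file
    if re.fullmatch(site + r"\.\d+", name):
        return 5                                  # funnel-feed leftover
    return 0
```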
For each site/hostname pair, rebatch calls the subroutine do_site_flush to scan the out.going directory for filenames of the forms #1-#5 above. It does this with a shell glob matching the patterns

    $nntphost.*
    $sitename*

Files of type #1 are ignored.
If it finds files of type #3 or #4 it makes a mental note to flush the site. If it sees a file like #4 it renames it to $nntphost.0, to be picked up by a later glob. Hopefully 0 will never be a valid process ID, so this file will not be clobbered by NNTPlink. To make sure, if rebatch sees it would overwrite a file, it appends 0 onto the name until it gets a unique one. Files like #2, #5 or #6 are ignored at this point, but a note is made that a significant file was found, that is, a batch file that may need transmitting.
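The append-a-zero trick for avoiding clobbers might look like this (a sketch; the helper name is made up):

```python
import os

def rename_unique(src, nntphost, directory):
    """Rename an INN-written batch file (type #4) to $nntphost.0,
    appending zeros until the name is unique (sketch)."""
    dest = os.path.join(directory, nntphost + ".0")
    while os.path.exists(dest):
        dest += "0"              # nntphost.0 -> nntphost.00 -> ...
    os.rename(src, dest)
    return dest
```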
Once the scanning is complete, it sees if it needs to flush the site and issues a ctlinnd command if it does.
If a batch file that needs transmitting is found, it then calls the subroutine rebatch_files to collect the batch files together.
sub rebatch_files takes the output of the shell glob matching these two patterns:

    $nntphost.*
    $site.*

(NOTE the period in the second glob) and works through it. If a file is not of the form nntphost.1234 or sitename.1234 (that is, types #2 and #5), it is ignored. Files of that form are concatenated to the .rebatch file, then unlinked.
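A Python sketch of that collection step, under the assumption that "ends in a dot and digits" is the test for types #2 and #5:

```python
import glob, os, re

def rebatch_files(sitename, nntphost, outdir):
    """Concatenate type #2 and #5 batch files into $nntphost.rebatch,
    unlinking each one after it is copied (a sketch of the idea)."""
    target = os.path.join(outdir, nntphost + ".rebatch")
    wanted = re.compile(r"\.\d+$")            # names ending in .<digits>
    with open(target, "a") as out:
        for pattern in (nntphost + ".*", sitename + ".*"):
            for path in sorted(glob.glob(os.path.join(outdir, pattern))):
                if not wanted.search(path):
                    continue                  # skip .link, .tmp, .rebatch
                with open(path) as batch:
                    out.write(batch.read())
                os.unlink(path)
    return target
```

Note the suffix test also keeps the sketch from swallowing its own .rebatch file, which the first glob would otherwise match.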
Then the batch file is transmitted to the remote host in the subroutine send_batch. sub send_batch does shlock-style locking to ensure only one innxmit process is running at once to a site.
If no other rebatch process is active for that site, it then double forks...
The original parent moves to the next site.
The first child creates the lock file, forks, and then waits for the grandchild to die.
The grandchild reopens STDOUT and STDERR to the progress file, writes the current time in seconds since 1970 to the file, and then execs innxmit.
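The double fork can be sketched as follows. This is Python rather than rebatch's Perl, with the command parameterised so the sketch is self-contained; rebatch itself would exec innxmit -t 300 nntphost batchfile.

```python
import os, signal, time

def send_batch(cmd, progress, lockfile):
    """Double-fork so the transmitter runs detached: the parent moves
    on, the first child holds the lock, the grandchild execs (sketch)."""
    signal.signal(signal.SIGCHLD, signal.SIG_IGN)   # auto-reap first child
    if os.fork():
        return                       # original parent: next site, please
    # first child: create the shlock-style lock file, fork again, and
    # wait for the grandchild to die before removing the lock
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)   # we DO wait for ours
    os.close(os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644))
    pid = os.fork()
    if pid:
        os.waitpid(pid, 0)           # wait for the grandchild to die
        os.unlink(lockfile)          # then release the lock
        os._exit(0)
    # grandchild: reopen stdout/stderr onto the progress file, write the
    # current time in seconds since 1970, then exec the transmit command
    fd = os.open(progress, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    os.dup2(fd, 1)
    os.dup2(fd, 2)
    os.write(1, b"%d\n" % int(time.time()))
    os.execvp(cmd[0], cmd)
```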
And that is about it!
It also begins by reading the newsfeeds file as detailed above, then works through each site/host pair.
If it sees a lock file for the site, it notes the time and counts the number of ihave lines in the progress file. It then does a similar glob to rebatch to find all the batch files for the site and counts the number of lines in them.
It then does a little math to figure out how many articles to send and the estimated time to send them, based on how many have gone before and how long it has taken.
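The arithmetic is simple rate extrapolation; roughly like this sketch, which is not where.to's exact formula:

```python
def estimate(start_time, articles_sent, lines_queued, now):
    """Estimate how many articles remain and how long they will take,
    from the rate achieved since innxmit started (illustrative)."""
    elapsed = now - start_time
    if articles_sent <= 0 or elapsed <= 0:
        return lines_queued, None        # no rate information yet
    rate = articles_sent / elapsed       # articles per second so far
    return lines_queued, lines_queued / rate
```

For example, 100 articles sent in 200 seconds with 50 lines still queued gives an estimate of 100 more seconds.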
There are two flaws in where.to that I know of:
ihave
lines and won't get counted as a 'transmitted' article.
I have no plans to fix these. Earlier versions of where.to did all sorts of clever things (remembering the last message ID, looking for its position in the batch file), but they don't work well with the globbing.
If the site is not running, then it simply counts up the lines in the candidate batch file and tells you. If it sees a .tmp file, it tells you as well.
Answer: NO!
The purpose of running NNTPlink as a channel feed is to pass on the article as soon as it arrives on the server. [Tom Limoncelli calls this the "INN Instant Party" and "a gimmick" in his INN FAQ.]
If you do this, the text of the article will still be in the system's buffer cache, so it won't have to be fetched off disk; the transfer goes much quicker and you get a performance boost.
From the INN FAQ:
Ian Phillipps <ian@unipalm.co.uk>: Compared to running the same set of sites via innxmit, where you might be sending the same article to n sites but not all at the same time, you are going to have to retrieve the same article n times from disk. [Unless you have more buffer space than sense, of course.] (2) More important, if you have a large number of feeds, NNTPlink permits them to be fed simultaneously with the same articles. No big deal, until you think of what's going on in the pagedaemon and the disk cache.
A "ps uaxr" rarely catches NNTPlink in the act ("D"), despite my having 17 of them last time I counted. Our biggest outgoing newsfeed delivered 16398 articles yesterday, using a total of 380 seconds CPU on a Sun IPC, and no disk time :-)
When it comes to transmitting a backlog, all of the gains of the buffer cache go out the window, and we are back to having to retrieve articles off disk again.
To transmit the backlog, the algorithm goes something like...
So far it is a toss up between innxmit and NNTPlink.
Then why did I choose innxmit?
rebatch is modelled after INN's nntpsend program. They share one feature: they both append to a batch file that innxmit is working from. I examined the code of innxmit and saw it took great pains to cope when people do that. I have not examined the code of NNTPlink to see if it does the same; NNTPlink does so much *more* than innxmit that it was difficult to see what it was doing. That is no slur on NNTPlink, I just think innxmit is a better tool for this job.
If anyone has hard evidence one way or the other about this, do let me know. If I am talking through my hat, do have the grace to tell me politely.