The theory behind rebatch and where.to.

Contents

  • Assumptions
  • What rebatch Does
  • What where.to Does
  • innxmit vs NNTPlink

    Assumptions

    rebatch reads your newsfeeds file to figure out which sites are fed by NNTPlink. It expects to see entries of the form:
    site:*:Tc,Wnm:/usr/local/news/bin/nntplink -k -q nntp.host.org
    It makes the assumption that all NNTPlink feed sites are up for grabs. This decision is made by looking at the contents of the last field of the newsfeeds entry. If it matches the expression in $nntplink_id then the site is a target.

    If this is not true at your site then, until I think of something better, you can try this:

    make a dummy newsfeeds file with lines like

    site:xxx:xxx:nntplink nntp.host
    and point rebatch at that.

    A separate configuration file is NOT an option - I want to think less, not more, each time I add a feed!

    If that does not appeal, you could create a symbolic link giving nntplink another name (e.g., nntplink2), use nntplink2 in your newsfeeds file and as the value of $nntplink_id in rebatch.conf.

    I also assume that the last thing on the newsfeeds line is the remote host to send it to. If that is not true, then a dummy newsfeeds file is called for.

    rebatch copes with comments, whitespace and continuation lines in the newsfeeds file the same way INN does.

    One other fun assumption is that one innxmit fits all. That is, every site's batch file can be sent with...

    innxmit -t 300 remotehost batchfile
    ("-t 300" ensures you don't get hung processes if the remote site is off the air). You can alter the innxmit parameters on a global basis, but not on a case-by-case basis.

    If a real need for a per-site configuration file arises, I will think about adding it. This was written for my site and (fortunately) I don't need that complexity.


    What rebatch Does

    This section details the operation of the rebatch script.

    Newsfeeds parsing

    rebatch (and where.to) start by calling the subroutine read_newsfeeds in rebatch.common to parse the newsfeeds file.

    It reads the file a line at a time, discarding everything after a comment character and then stripping leading and trailing whitespace.

    If the line ends in a '\' character, the '\' is dropped and the next line is read and appended. When it finds a line that does not end in a continuation character (and is not blank) it assumes it has a complete newsfeeds entry.

    That is, the entries

    site1:*:Tc,Wnm:/usr/local/news/bin/nntplink -k -q nntp.host.org
    ## site2 is very fussy about the groups it gets
    site2:!*,\
    hundreds of individual group lines,\
    each ending in,\
    a continuation character,\
    !/local:Tc,Wnm:/usr/local/news/bin/nntplink -k -q nntp2.host.org
    are both parsed just as INN would parse them.

    The line is then split apart on the ':' character.

    If there is a '/' in the first (site) portion, it and everything after it are discarded.

    The fourth part is examined. If it contains $nntplink_id (which can be a Perl regular expression) then we have an NNTPlink feed site rebatch can cope with. If we don't see $nntplink_id, the entry is discarded. rebatch then splits the fourth part on spaces and takes the last portion (nntp.host.org and nntp2.host.org in the above examples) as the host name articles are sent to.

    The sitename and nntphost name are recorded in arrays for later use.
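
    To make the above concrete, here is a rough Perl sketch of that parse. It is not the actual read_newsfeeds code (the subroutine name and the way results are returned are mine), but it follows the steps just described:

    sub read_newsfeeds_sketch {
        my ($newsfeeds, $nntplink_id) = @_;
        my (@sites, @hosts);

        open(NF, "< $newsfeeds") || die "can't open $newsfeeds: $!\n";
        my $entry = '';
        while (<NF>) {
            s/#.*//;                    # throw away comments...
            s/^\s+//;                   # ...leading whitespace...
            s/\s+$//;                   # ...and trailing whitespace (and the newline)
            if (s/\\$//) {              # ends in '\': keep collecting pieces
                $entry .= $_;
                next;
            }
            $entry .= $_;
            next if $entry eq '';       # blank line and nothing gathered

            my ($site, $patterns, $flags, $param) = split(/:/, $entry, 4);
            $entry = '';
            $site =~ s|/.*||;           # drop '/' and everything after it
            next unless defined($param) && $param =~ /$nntplink_id/;

            my @words = split(' ', $param);
            push(@sites, $site);        # the site name...
            push(@hosts, $words[-1]);   # ...and the last word as the NNTP host
        }
        close(NF);
        return (\@sites, \@hosts);
    }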

    rebatch's gory details

    rebatch starts by making a lockfile and using shlock-style locking to ensure that only one copy of itself is running at a time. Nothing magic here.
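
    For the curious, "shlock style" means roughly the following. This is an illustrative sketch, not the code in rebatch.common; the subroutine name is made up:

    sub take_lock_sketch {
        my ($lockfile) = @_;
        my $tmp = "$lockfile.$$";       # temporary file named after our PID

        open(TMP, "> $tmp") || die "can't create $tmp: $!\n";
        print TMP "$$\n";               # the lock records its owner's PID
        close(TMP);

        if (!link($tmp, $lockfile)) {   # link() is atomic: it fails if the lock exists
            # Someone else holds the lock; see if that process is still alive.
            if (open(LCK, "< $lockfile")) {
                chomp(my $pid = <LCK>);
                close(LCK);
                unlink($lockfile) if $pid && !kill(0, $pid);   # stale lock, clear it
            }
            unlink($tmp);
            return 0;                   # didn't get the lock
        }
        unlink($tmp);
        return 1;                       # got it
    }

    send_batch (described below) takes a lock in the same style, one per site, so that only one innxmit at a time works on a given site's batch file.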

    Then it reads the newsfeeds file (see above). It works through the sitename/nntphost pairs after that. This is where things get "interesting".

    I have found six types of files that a channel-feed NNTPlink can leave behind:

    1. nntphost.link
    2. nntphost.1234
    3. nntphost.1234.tmp
    4. sitename
    5. sitename.1234
    6. nntphost.rebatch
    They are:
    1. nntphost.link is a status file NNTPlink uses. It contains things like the process ID of the current NNTPlink process. This file is normal and rebatch ignores it.
    2. nntphost.1234 is created when the remote host gets too far behind in its acceptance of articles, or NNTPlink exited with articles to send. Files of type #3 get renamed into files with this name.
    3. nntphost.1234.tmp is created when NNTPlink can not contact the remote host, or the remote host refuses articles for some reason. Files of this form are open for writing by NNTPlink, so care needs to be taken with them. If NNTPlink has one of these files open when it exits, it renames it to be of the #2 form.
    4. sitename is created when NNTPlink is so far behind that INN notices and starts writing its own batch files. If this file exists, INN may have it open for writing, so the only way to cope with it is to rename the file, then send INN a flush for that site.
    5. sitename.1234 is not created by NNTPlink or INN. One of my beta sites asked for files of this form to be cleaned up (something to do with NNTPlink funnel feeds?) so I do.
    6. nntphost.rebatch files are created by the rebatch program and contain the concatenation of the other batch files.
    #5 and #6 are not created by NNTPlink or INN, but they are considered anyway.

    For each site/hostname pair, rebatch calls the subroutine do_site_flush to scan the out.going directory for filenames of the form #1-#5 above. It does this with a shell glob matching the patterns

    $nntphost.*
    $sitename*
    Files of type #1 are ignored.

    If it finds files of type #3 or #4 it makes a mental note to flush the site.

    If it sees a file like #4 it renames it to $nntphost.0, to be picked up by a later glob. Hopefully 0 will never be a valid process ID, so this file will not be clobbered by NNTPlink. To make sure, if the rename would overwrite an existing file, it appends 0 to the name until it gets a unique one.

    Files like #2, #5 or #6 are ignored at this point, but a note is made that a significant file was found ... that is, a batch file that may need transmitting.

    Once the scanning is complete, it sees if it needs to flush the site and issues a ctlinnd command if it does.

    If a batch file that needs transmitting is found, it then calls the subroutine rebatch_files to collect the batch files together.
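
    As a sketch, the per-site scan looks something like this. It is not the real do_site_flush (the out.going path and the subroutine name are assumptions), but it makes the same decisions as described above:

    sub scan_site_sketch {
        my ($site, $nntphost) = @_;
        my $dir = "/usr/spool/news/out.going";       # assumed; yours may differ
        my ($need_flush, $have_batch) = (0, 0);

        foreach my $file (glob("$dir/$nntphost.* $dir/$site*")) {
            my $base = $file;
            $base =~ s|.*/||;                        # strip the directory part

            next if $base eq "$nntphost.link";       # 1: NNTPlink status file, ignore

            if ($base =~ /^\Q$nntphost\E\.\d+\.tmp$/) {
                $need_flush = 1;                     # 3: NNTPlink may still be writing it
            } elsif ($base eq $site) {               # 4: INN's own batch file
                $need_flush = 1;
                my $new = "$dir/$nntphost.0";
                $new .= "0" while -e $new;           # never clobber an existing file
                rename($file, $new) || warn "rename $file: $!\n";
            } else {
                $have_batch = 1;                     # 2, 5 or 6: something that may need sending
            }
        }

        system("ctlinnd flush $site") if $need_flush;
        return $have_batch;                          # caller goes on to rebatch_files
    }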

    sub rebatch_files takes the output of the shell glob matching these two patterns:

    $nntphost.*
    $site.*
    and works through it. (NOTE the period in the second glob.)

    If the file is not of the form

    nntphost.1234
    or
    sitename.1234
    (that is, types #2 and #5) it is ignored. Files of that form are concatenated onto the .rebatch file, then unlinked.
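
    Sketched in the same spirit as above (illustrative only; the out.going path is an assumption):

    sub rebatch_files_sketch {
        my ($site, $nntphost) = @_;
        my $dir = "/usr/spool/news/out.going";       # assumed, as before
        my $target = "$dir/$nntphost.rebatch";

        open(OUT, ">> $target") || die "can't append to $target: $!\n";
        foreach my $file (glob("$dir/$nntphost.* $dir/$site.*")) {    # note the '.'
            my $base = $file;
            $base =~ s|.*/||;
            # Only nntphost.1234 and sitename.1234 style names (types 2 and 5).
            next unless $base =~ /^(\Q$nntphost\E|\Q$site\E)\.\d+$/;

            open(IN, "< $file") || next;
            while (<IN>) { print OUT $_; }           # append its batch lines
            close(IN);
            unlink($file);                           # finished with this piece
        }
        close(OUT);
        return $target;                              # this is what innxmit is given
    }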

    Then the batchfile is transmitted to the remote host in the subroutine send_batch.

    sub send_batch does shlock-style locking to ensure that only one innxmit process is running to a site at any one time.

    If no other rebatch process is active for that site, it then double forks...

    The original parent moves to the next site.

    The first child creates the lock file, forks, and then waits for the grandchild to die.

    The grandchild reopens STDOUT and STDERR to the progress file, writes the current time in seconds since 1970 to the file and then execs innxmit.
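
    In outline, the double fork looks something like the sketch below; the lock and progress file name arguments are my own invention and most error handling is omitted:

    sub send_batch_sketch {
        my ($nntphost, $batchfile, $lockfile, $progress) = @_;

        return if -e $lockfile;         # someone is already transmitting to this site

        my $pid = fork();
        die "fork: $!\n" unless defined $pid;
        return if $pid;                 # original parent: off to the next site

        # First child: create the lock, fork again, and hang around so the
        # lock can be removed when innxmit finishes.
        open(LCK, "> $lockfile") || exit 1;
        print LCK "$$\n";               # (the real thing uses the shlock scheme)
        close(LCK);

        my $kid = fork();
        exit 1 unless defined $kid;
        if ($kid) {
            waitpid($kid, 0);           # wait for the grandchild to die
            unlink($lockfile);
            exit 0;
        }

        # Grandchild: point STDOUT and STDERR at the progress file, record
        # the start time in seconds since 1970, then become innxmit.
        open(STDOUT, "> $progress") || exit 1;
        open(STDERR, ">&STDOUT");
        print STDOUT time(), "\n";
        exec("innxmit", "-t", "300", $nntphost, $batchfile);
        exit 1;                         # only reached if the exec failed
    }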

    And that is about it!


    where.to gory details

    where.to is a considerably simpler program.

    It also begins by reading the newsfeeds file as detailed above, then works through each site/host pair.

    If it sees a lock file for the site it notes the time and counts the number of ihave lines in the progress file. It then does a similar glob to rebatch to find all the batch files for the site and counts the number of lines in them.

    It then does a little math to figure out how many articles are left to send and the estimated time to send them, based on how many have gone before and how long it has taken.
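
    The math amounts to no more than this. The sketch below is illustrative; the variable names are mine, and it assumes the start time is the one written at the top of the progress file:

    # $started is the time recorded at the top of the progress file, $sent the
    # number of ihave lines counted in it, and $queued the total number of
    # lines found in the site's batch files.
    sub estimate_sketch {
        my ($started, $sent, $queued) = @_;

        my $elapsed = time() - $started;             # seconds innxmit has been running
        return ($queued, undef) unless $sent && $elapsed > 0;   # no rate to go on yet

        my $rate = $sent / $elapsed;                 # articles per second so far
        my $eta  = $queued / $rate;                  # seconds left, if the rate holds

        return ($queued, $eta);
    }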

    There are two flaws in where.to that I know of:

    1. where.to will tend to overestimate the time to completion... something about averages and rates and all that.
    2. where.to will overestimate the number of articles left to send in one case: when a batchfile contains expired articles, innxmit silently skips over them. Those articles never produce ihave lines and so never get counted as 'transmitted'.

    I have no plans to fix these. Earlier versions of where.to did all sorts of clever things (remembering the last message ID and looking for its position in the batch file), but they don't work well with the globbing.

    If no transmission is running for the site, it simply counts up the lines in the candidate batch file and tells you. If it sees a .tmp file it tells you about that as well.


    innxmit vs NNTPlink

    I have been asked about the virtues of using innxmit for sending batches of articles. After all, we abandoned nntpsend (which uses innxmit) in favour of NNTPlink for a performance boost, and by gads we got one! Is it a step back to use anything but NNTPlink for this purpose?

    Answer: NO!

    The purpose of running NNTPlink as a channel feed is to pass on the article as soon as it arrives on the server. [Tom Limoncelli calls this the "INN Instant Party" and "a gimmick" in his INN FAQ.]

    If you do this, the text of the article will still be in the system's buffer cache, so it does not have to be fetched off disk; transmission goes much quicker, and that is where the performance boost comes from.

    From the INN FAQ:

    Ian Phillipps <ian@unipalm.co.uk>:

    (2) More important, if you have a large number of feeds, NNTPlink permits them to be fed simultaneously with the same articles. No big deal, until you think of what's going on in the pagedaemon and the disk cache.

    A "ps uaxr" rarely catches NNTPlink in the act ("D"), despite my having 17 of them last time I counted. Our biggest outgoing newsfeed delivered 16398 articles yesterday, using a total of 380 seconds CPU on a Sun IPC, and no disk time :-)

    Compare that to running the same set of sites via innxmit, where you might be sending the same article to n sites but not all at the same time: you are going to have to retrieve the same article n times from disk. [Unless you have more buffer space than sense, of course.]

    When it comes to transmitting a backlog, all of the gains of the buffer cache go out the window, and we are back to having to retrieve articles off disk again.

    To transmit the backlog, the algorithm goes something like: read an entry from the batch file, fetch that article from the spool on disk, offer its message ID to the remote host with IHAVE, send the article if the remote host wants it, and repeat until the batch file is empty. Both programs have to do essentially the same work.

    Excepting any major brain-deadness in innxmit or NNTPlink, I would expect the performance to be identical. I have not tested this theory; it is very difficult to test. You would need two batches of articles of approximately the same composition, but with different message IDs. Life is too short.

    So far it is a toss-up between innxmit and NNTPlink.

    Then why did I choose innxmit?

    rebatch is modelled after INN's nntpsend program. They share one feature: they both append to a batchfile that innxmit is working from. I examined the code of innxmit and saw it takes great pains to cope when people do that. I have not examined the code of NNTPlink to see if it does the same; NNTPlink does so much *more* than innxmit that it was difficult to see what it was doing. That is no slur on NNTPlink, I just think innxmit is the better tool for this job.

    If anyone has hard evidence one way or the other about this, do let me know. If I am talking through my hat, do have the grace to tell me politely.


    Part of the rebatch package
    Russell Street (r.street@auckland.ac.nz)
    Last updated: 12th February 1995