FAQ for "news in cans" v1.31 from Apr. 8th, 1998
Tobias.Hennerich@swabsib.s.bawue.de

Changes from version 1.2 from Mar. 2nd, 1998:
 - Table of contents
 - Complete support of innfeed
 - Misc. new questions:
    - What are the differences relative to timehash and CNFS?
    - What will happen with "News in Cans" with the next release of INN?
    - I want to rebuild the overview database, how?
    - What's the meaning of shelf.01/199805182202!0000019110 ?
    - I would like to quickly read this one article...
    - I would like to use inpaths
 - Misc. other modifications everywhere:
    - Expire, news.daily, crontab, expireover

Changes from version 1.1 from Feb. 9th, 1998:
 - First beta support of innfeed
 - Detailed explanations for crontab entries during installation
 - What should I do to change the can interval?


TABLE OF CONTENTS

General issues
 - What are "News in Cans"?
 - What are its advantages?
 - What are its disadvantages?
 - What are the differences relative to timehash and CNFS?
 - What will happen with "News in Cans" with the next release of INN?
 - Where to look for "News in Cans"?
 - Does a mailing list exist?
 - Does a web page exist?
 - Which operating systems are supported?

Internals
 - How exactly does the system work?
 - What happens with crosspostings?
 - What happens with cancels?
 - Ok, the articles get stored very quickly without many disk head
   movements...
 - How exactly are the cans created?

Installation
 - Short version
 - Long version
    - Building
    - Syslog.conf
    - Crontab
    - Overview.fmt
    - Newsfeeds
    - /var/spool/news/articles

Use and Administration
 - I already have a working INN and would like to use the existing
   articles
 - I would like to change expire times
 - What's then left to do for expire?
 - How does expireover work?
 - I would like to create a new overview database, how?
 - I would like to use an additional partition for the newsspool
 - I would like to join some file systems into one partition
 - The shelves are not filled equally with cans, why?
 - I would like to change the expire interval from 4 to ... hours
 - After an unclean shutdown innd complains like this...
 - What's the meaning of shelf.01/199805182202!0000019110 ?
 - I would like to quickly read this one article...
 - I would like to use inpaths

Misc.
 - I would like to learn more about the internals
 - I still have problems, questions. Who will help?
 - I do have one more question. Why isn't it answered here?
 - Anything more?


GENERAL ISSUES

Q: What are "News in Cans"?

   "News in Cans" (NiC) is a modified version of the USENET news server
   INN that doesn't store articles in individual files but groups them
   together in large ones. All articles in such a file are deleted
   simultaneously at a certain time, which is evident from the name of
   the file itself.

   Each file thus has an expiry date, very much like the cans you buy at
   a store. Hence the name of this news system.

Q: What are its advantages?

 - As it is no longer necessary to continuously create new files for
   news articles, large numbers of articles can be stored to disk very
   quickly. On various test machines NiC was between 5 and 10 times
   faster than standard INN systems. Some systems stored more than 200
   articles per second. Even i486 systems with IDE disks achieved more
   than 30 articles per second without further tuning.
   inn-1.7.2-insync reached only half of these figures.

 - As the articles are grouped together in one file, the slack on the
   disk is reduced, and thus less disk space is needed.

 - Expiring isn't a matter of hours any more, as only one large file
   (or only a few large files in case of high volume news) has to be
   removed.

 - As expiring is very time efficient, it's possible to expire more
   often per day and thus improve disk load balance.

 - Despite this, expire behaves the same as with traditional news
   systems; i.e. articles posted on Monday will be available exactly
   as long as those posted on Friday.
   Administration problems will not lead to random article removal.
   Articles with an "Expire:" header will be treated correctly.

 - As a by-product, a new overchan was created that is much quicker
   than the one in traditional implementations of INN.

Q: What are its disadvantages?

 - Programs that rely upon the storage of articles in individual files
   don't work any more. The most important programs of the INN
   distribution like innxmit, batcher, nnrpd, all control scripts etc.
   and, of course, innd itself were adapted to the new system.
   Meanwhile an adapted version of the feeder innfeed exists as well.

   Any other programs have to be modified before you can use them. A
   small library and a tool for reading articles exist for this
   purpose. Adapting the backend innxmit, for example, required the
   modification or addition of only 38 lines of code out of more than
   1600 lines. Most modifications consisted of substituting "qio" with
   "qci". For innfeed, less than 50 lines of code out of 18,000 needed
   modifications.

 - A quick change of expire times because of a disk shortage is not
   (yet) possible, or rather takes effect only some days later. It is
   however possible to remove a file that is not yet due to be expired
   and thus overcome such shortages.

 - The documentation is not yet up to date. All modifications relative
   to the traditional INN versions are described here or explained in
   more detail in a paper (available only in German).

 - Very large news systems working in master/slave mode and exchanging
   articles via NFS are not yet supported. That's not because
   master/slave mode doesn't work, but because it's not possible to
   directly access a single article on another system, unless this is
   done via NNTP.

 - This FAQ is a very poor translation of its German edition.

Q: What are the differences relative to timehash and CNFS?

   Timehash stores each article in a separate file, placed according to
   a hashing algorithm based on the time the article was received or
   posted.
   The advantage of NiC, needing far fewer files to store articles, is
   not found in timehash. Timehash actually only eases the problems
   with very large newsgroups like misc.jobs.offered or control. NiC
   does that too.

   Like NiC, CNFS works with large files that store articles
   sequentially. But NiC's expiring process isn't possible with CNFS:
   whenever a new article is received, an old one is removed (or, if
   the incoming article is very large, many small ones may be removed).
   "Expire:" headers aren't considered in any way. The file format is
   different from NiC's and not easily accessible by shell scripts.
   Furthermore it's not possible to add new disks to a running system.

Q: What will happen with "News in Cans" with the next release of INN?

   Well, what? NiC will be ported, of course. As the next INN version
   will include a storage concept that allows articles to be stored in
   different ways simultaneously, porting should be even easier than
   with today's code base.

Q: Where to look for "News in Cans"?

   "News in Cans" is available at the following URL:

      ftp://ftp.uni-stuttgart.de/pub/unix/comm/news/nid/inn-nid.tar.gz

   inn-nid.tar.gz is a link to the most recent version of "News in
   Cans". Older versions will still be available. Other files, for
   example this FAQ, will be made available for download in the same
   directory.

Q: Does a mailing list exist?

   Yes, the list too is provided by the University of Stuttgart. The
   list can be joined with a mail to

      majordomo@listserv.uni-stuttgart.de

   with the following line in the message body

      subscribe nid

   The rest comes automagically. If you want more information, send a
   mail with the word "help" in the message body to the above
   mentioned address.

   Keep in mind that the list server doesn't put a "Reply-To:" header
   in its messages, and thus all mail from it should be answered as a
   group reply to let the other list subscribers read your message.

Q: Does a web page exist?

   Not yet. Up to now, I haven't had enough time to do it.
   If somebody wants to create one and mails it to me, then I should be
   able to host the page(s) on the web server of the University of
   Stuttgart.

Q: Which operating systems are supported?

   I developed and tested on NeXTSTEP, FreeBSD and Solaris 2.6. The
   system presently works on some Linux systems too. Generally speaking
   NiC doesn't rely on special features of individual Unices. Thus it
   should work with minimal modifications on all systems the standard
   INN distribution supports.


INTERNALS

Q: How exactly does the system work?

 - When innd starts it reads the file expire.ctl, and for each
   newsgroup present in the active file it stores how long the
   newsgroup should be kept on the system.

 - Upon receipt of a news article, the retention time is added to the
   current date ("Expire:" headers are considered according to the
   settings in expire.ctl) and the article is stored at the end of the
   corresponding can. To avoid constant opening and closing of cans,
   innd keeps a few cans open simultaneously.

 - The history file stores the following information: the article with
   message id x is stored in can y at position z. Furthermore the
   overview database receives an entry per article saying that article
   12345 of newsgroup a.b.c is stored in can y at position z. The
   out.going files for the feeds no longer look like

      message id x, a.b.c/12345

   but like

      message id x, y!z

 - As usual, innd gets hold of the articles through the history file,
   and nnrpd, serving the news readers, through the overview database.
   The backends innxmit and batcher use the references in the
   out.going files.

Q: What happens with crosspostings?

   Articles crossposted to different newsgroups are, as usual, stored
   only once by innd. More precisely, they go into the can that will be
   removed latest, i.e. the can of the newsgroup that is kept longest
   on the system. Thus it doesn't matter if the news spool is spread
   across different disks, and problems with symbolic links don't
   exist.
   Of course a record is stored in the overview database for each
   newsgroup, to ensure that the article can be read from every
   newsgroup it was posted to.

Q: What happens with cancels?

   As it takes much effort to remove articles from the middle of 100 MB
   files, cancelled articles only get marked as such and are not handed
   out to news readers. The article itself is not removed.

   It should be relatively easy to take advantage of the Unix feature
   of creating files with "holes" in them that need no disk space. To
   do so, the cans would have to be copied once a day and all cancelled
   articles replaced by holes. This is not yet implemented.

Q: Ok, the articles get stored very quickly without too much disk head
   movement. But what good is that if overchan has to write a record
   per article (and many more for crosspostings) into a file, and thus
   many disk head movements will still be necessary?

   This isn't directly related to the concept of NiC, but the problem
   was at least somewhat reduced: overchan got a buffer and thus does
   not write every single record into an overview file immediately but
   collects data first. After

    - a timeout, or
    - a certain amount of data, or
    - when overchan records a local article (message id = local
      system)

   overchan begins to write to the overview database. If several
   articles for one newsgroup were received during the buffer time,
   overchan can write their database information using only one
   open/close call and thus needs far fewer disk accesses. Depending on
   the number of newsgroups present and the rate at which articles are
   received, up to 50 % of file accesses can be saved.

Q: How exactly are the cans created?

   Before storing an article innd checks if a can for the needed time
   frame already exists. If no can exists, or if an existing can has
   already grown over a certain limit (e.g. 100 MB), a new can is
   created. Cans belong in a shelf.
   That's why innd searches the directory where the directories alt,
   comp, de, rec, misc etc. usually reside for directories whose names
   begin with "shelf.". Of these, the one with the most available free
   disk space is taken, which allows the cans to be spread across
   different disks.


INSTALLATION

Q: What do I have to take care of when installing "News in Cans"?

   This FAQ can't deal with the installation of INN itself. That will
   always be a lot of work and isn't the right thing to do for
   administrators who haven't installed any other software yet. Please
   look into the general INN FAQs for this purpose. They are included
   with "News in Cans".

   The short version of things to take care of is:

    - Obtain NiC, do "make" and "make install" as usual
    - At system level, adjust syslog.conf and crontab if necessary
    - At INN level, adjust newsfeeds and overview.fmt and create the
      shelves

   The long version:

   Building

    - Obtain the sources and compile them.

    - Parameters for "News in Cans" aren't yet configured in
      config.data but have to be set in the file includes/can.h. In
      most cases the default values therein should be acceptable.

      Unfortunately no standard way exists to retrieve information
      about available disk space. R$ solved the problem by leaving it
      to each user to check the df command and modify innwatch
      accordingly. This is not much of a help in my case. You have to
      check for each operating system how to retrieve the available
      disk space and then modify #define STATFS in canwrite.c
      accordingly. For Solaris, for example, the define is
      #define STATFS statvfs. Depending on the system, different
      #include statements have to be added too.
      Example Solaris: the manpage says:

      --- cut ---
      statvfs(2)            System Calls            statvfs(2)

      NAME
           statvfs, fstatvfs - get file system information

      SYNOPSIS
           #include <sys/types.h>
           #include <sys/statvfs.h>

           int statvfs(const char *path, struct statvfs *buf);
           int fstatvfs(int fildes, struct statvfs *buf);

      DESCRIPTION
           statvfs() returns a "generic superblock" describing a
           file system; it can be used to acquire information about
           mounted file systems. buf is a pointer to a structure
           (described below) that is filled by the function.
      --- cut ---

      -> canwrite.c needs the addition of #include <sys/types.h> and
         <sys/statvfs.h>.

    - Compile INN as usual, configure all files in $INN/site and do
      "make install".

   Configuring

    - If you want to watch closely how the system works, use the debug
      facility of INN and add a line to /etc/syslog.conf (keep in mind
      that syslogd is picky about tabs, so be careful when cutting and
      pasting the following lines):

             news.crit      /var/log/news/news.crit
             news.err       /var/log/news/news.err
             news.notice    /var/log/news/news.notice
      new -> news.debug     /var/log/news/news.debug

      This debug file is not taken care of by news.daily. Therefore it
      will grow forever. Trim it by hand or disable it in syslog.conf.

    - The crontab entry is somewhat different. As expire is extremely
      quick, the system performs an expire every 4 hours to balance
      disk load better. Once per night the overview database has to be
      shortened and the log files have to be trimmed. Therefore the
      crontab entry reads like this:

      5 2 * * *  news  /usr/local/news/bin/news.daily expireover
      5 6,10,14,18,22 * * *  news  /usr/local/news/bin/news.daily notdaily nomail

      With these entries the following issues are important:

      As the overview database gets trimmed only once per night, nnrpd
      has to take care not to offer articles for reading that have
      already been expired. That's possible because nnrpd ignores (for
      performance reasons) all entries of the overview database that,
      judging by the names of their cans, should already have been
      expired.
      That is why it is pointless to run expire less often: articles
      will not stay readable any longer (although they can still be
      sent to your peers for a longer period). If you really want a
      longer expire interval you have to make a change that is
      explained later.

      The times of 2, 6, 10, 14, 18 and 22 o'clock result from the
      fact that the system internally works with GMT, and therefore in
      Germany cans have to be created 1 hour (winter time) or 2 hours
      (daylight saving time) later. Keep in mind that expire should be
      started shortly *after* and not *before* 2 o'clock (or 10, 18
      o'clock), as at 1:55 the files for 2 o'clock aren't yet ready
      for removal.

      The option "notdaily" of news.daily was modified so that all
      daytime tasks are completed very quickly. So no renumber is
      carried out during the day, and expire doesn't rebuild the
      history file from scratch.

    - The overview database is necessary to access articles of
      different newsgroups with news clients. That's why the program
      overchan has to run on all news systems that don't merely
      distribute news. The configuration file overview.fmt requires
      the entry:

         Xcanpos:full

      at its end. See also the file newsfeeds.

    - In the file newsfeeds two things have to be taken care of:

      * Overchan always has to be entered (except on systems that only
        distribute news), e.g. with an entry like:

           overview!:*:Tc,WO:/usr/local/news/bin/overchan

      * Referencing articles with f or n is still possible, but the
        backends innxmit and batcher don't use these entries anymore.
        Instead the articles are referenced with "c" (as in canpos).

        An nntp feed is fed by innxmit as follows:

        # Feed all local non-internal postings to nearnet; sent off-line via
        # nntpsend or send-nntp.
        nic.near.net\
                :!junk/!foo\
                :Tf,Wcm:nic.near.net
                      ^ important! here a "c"!

        A feed for uucp batches requires:

        # A UUCP feed, where we try to keep the "batching" between 4 and 1K.
        ihnp4\
                :!junk,!control/!foo\
                :Tf,Wcb,B4096/1024:
                      ^ important! here a "c"!
        For innfeed the following is necessary:

        innfeed!:!*\
                :Tc,Wcm*:/usr/news/local/startinnfeed
                  ^   ^
                  |   important! here a "c" !
                  |
                  this c here is for channel and isn't needed by NiC
                  (has to be there nevertheless)

    - The directory that holds the articles (often
      /var/spool/news/articles) requires the shelves for the cans.
      Shelves are directories in which the cans are then stored, one
      shelf for every file system. It doesn't matter whether these
      directories are on different disks, symlinked, mounted or
      created as actual directories, as long as their names begin with
      "shelf." and end with any alphanumeric sequence. A possible
      notation would be:

         shelf.01  shelf.02  shelf.03

      But also:

         shelf.seagate  shelf.ibm  shelf.quantum

      If more than one shelf is created on a single disk, INN will
      write only into the first directory on this disk (up to now
      that's a feature, not a bug).

   These are already all the modifications necessary for the system to
   work!


USE AND ADMINISTRATION

Q: I already have a working INN and would like to use the existing
   articles

   That's difficult, as NiC uses a totally different concept than
   usual. There are some possibilities to use at least part of the old
   articles:

 - Remove all articles and begin from scratch (ok, so you don't use
   any old articles, but I've done this many times 8-).

 - Halve the settings in expire.ctl and let half of the disk be
   cleaned. Then install the new innd (but keep the batcher from the
   old innd), build a new history with makehistory and feed the
   articles into the new innd with a small script. E.g.:

      cd /var/spool/news/articles.old
      find . -type f -print | batcher.old -b 1000000 test -p "rnews"

   With this method innd will complain about thousands of articles it
   refuses to accept. The reason is that crosspostings that are on the
   same file system are offered to the new innd more than once but are
   accepted only at the first offer.
   The disadvantage of this method is that the newsreaders will be
   offered the same article twice with different article numbers, and
   the system will be down for some hours.

 - Get a new machine, install "News in Cans" and let the old and the
   new system work in a master/slave configuration. After some time
   the new system will hold nearly the same articles as the old one,
   and you can switch over to it. I have tried this method myself and
   it works great.

 - Don't get a new machine but install "News in Cans" on the present
   news server in a different directory, use different syslog entries,
   start the new innd with a different port number and let both innds
   work in a master/slave configuration for some time. This needs
   enough disk space or a shortened expire, and allows a transparent
   switch to the new system later. I haven't tested this method
   myself, but I know somebody who did it this way.

Q: I would like to change expire times

   As innd reads expire.ctl at startup, innd must reread expire.ctl
   after the expire times have been changed. The easiest way to
   achieve this is by issuing 'ctlinnd reload active "new expire
   times"'.

Q: What's then left to do for expire?

   As before, expire searches the history file and decides which
   articles have to be removed. It does this by looking at the names
   of the cans, which determine the expiry time. Message ids are
   deleted according to the date an article was received. That's the
   reason why expire needs to get hold of expire.ctl: to read the
   remember entry. At the end expire scans all the shelves and removes
   the corresponding cans.

   Traditionally, when expire was invoked with the -x option the
   history file didn't get rebuilt, but history was scanned
   nevertheless to find all articles that would have been expired.
   That's not necessary any more with NiC: the scan is not performed,
   and an expire -x takes only seconds to complete.

Q: How does expireover work?
   As expire could build a list of articles to remove only with
   immense effort, expireover scans all overview files and removes all
   entries of articles that should be removed (according to their can
   name). As this is still a considerable task, it is performed only
   once per night (or even less often, if disk space permits). Nnrpd
   determines by itself whether an article should be offered for
   reading or whether it is expired.

   If the overview db gets corrupted (e.g. by a crash of a disk with a
   shelf on it), the new option "expireover -c" checks explicitly
   whether an article really exists in the cans. This way cancelled
   and prematurely expired articles are discovered and removed. This
   option is slow due to the effort needed and should only be used for
   recovery purposes.

Q: I would like to create a new overview database, how?

   The old option expireover -a doesn't exist any more. Instead there
   is a new tool called makeoverview, which scans through all cans and
   produces output like innd's. Thus "makeoverview | overchan" builds
   a new overview db.

   This overview db may be unsorted, so an "expireover -s" should be
   performed afterwards, as this sorts the .overview files.

   Nnrpd was modified so that it also works with unsorted .overview
   files. I.e. if innd writes an overview! batch into out.going, its
   contents can be put into the overview db by issuing
   "cat overview! | overchan".

Q: I would like to use an additional partition for the newsspool

   In the place where one (or more) shelf.xx directories are already
   installed, a new disk can be mounted, or a symbolic link can point
   to a directory on that disk. The name of this link or mount point
   has to begin with "shelf." (see above).

   As soon as innd has to create a new can (because the can doesn't
   exist yet or an existing can for the given time frame is already
   full) it checks the shelves and takes the one with the most
   available disk space. Hence this can be done on the fly. A restart
   of innd isn't necessary.
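
The selection rule just described (take the shelf with the most available disk space) can be sketched in a few lines. This is only an illustration in Python, not the code from canwrite.c: the names pick_shelf and free_bytes are made up for this sketch, and scanning the shelves in sorted order is an assumption, since the order in which innd looks at them is not stated in this FAQ.

```python
import os


def _free_bytes(path):
    """Free space available to unprivileged users on path's file system."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def pick_shelf(spool_dir, free_bytes=_free_bytes):
    """Return the shelf directory with the most free space, or None.

    A shelf is any directory (or symlink/mount point resolving to a
    directory) in spool_dir whose name begins with "shelf.".
    free_bytes is injectable so the rule can be tried without real disks.
    """
    shelves = [
        os.path.join(spool_dir, name)
        for name in sorted(os.listdir(spool_dir))  # assumed scan order
        if name.startswith("shelf.")
        and os.path.isdir(os.path.join(spool_dir, name))
    ]
    if not shelves:
        return None
    # max() keeps the first of several equally good candidates.
    return max(shelves, key=free_bytes)
```

Because max() keeps the first of several equal candidates, shelves that report the same free space (for example because they sit on one disk) resolve to the first one in the scan order, which fits the behaviour described in this FAQ for several shelves on a single disk.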

Q: I would like to join some file systems into one partition

   To accomplish this, innd has to be shut down and the shelves
   grouped together on one disk, leaving the cans in their shelves.
   After the restart of innd all shelves will have the same amount of
   free disk space, thus innd will always use the first shelf. As time
   goes by the other shelves will be emptied by expire and can then be
   removed manually.

Q: The shelves are not filled equally with cans, why?

   Innd doesn't try to fill up the shelves equally, it tries to leave
   equal amounts of free disk space. If you have relatively small
   shelves (only a few 100 MB), or if a new shelf was added because of
   a disk space shortage, a can size of 100 MB may be way too large to
   ensure a quick filling of the new shelf. For such cases the can
   size can be changed in $INN/includes/can.h as follows:

   Before:
      #define MAXCANSIZE (100*1024*1024) /* max. 100 MB per can */

   After:
      #define MAXCANSIZE (30*1024*1024) /* max. 30 MB per can */

   This way the cans fill up much quicker, and newly arriving articles
   get distributed over different shelves quicker too. The drawback is
   that using more and smaller files increases the system's load for
   opening and closing files.

Q: I would like to change the expire interval from 4 to ... hours

   Change the following define in $INN/includes/can.h from:

      #define CANTIME ((time_t)(60*60*4)) /* every 4 hours a can */

   to:

      #define CANTIME ((time_t)(60*60*8)) /* every 8 hours a can */

   Now the crontab entry can be changed as follows:

   5 2 * * *  news  /usr/local/news/bin/news.daily expireover
   5 10,18 * * *  news  /usr/local/news/bin/news.daily notdaily nomail

   Keep an eye on the other issues regarding crontab in
   "Installation".

Q: After an unclean shutdown innd complains like this...

   innd: found shelf.01/199712032101, delete=881179200 (61193.00)
   innd: truncated can at bytepos 51284764 of 51284764, recovered 50 articles
   innd: SERVER check history for entries of can shelf.01/199712070501!

   The short version: as long as the message shows 2 identical
   bytepos values (as in this example), (nearly) all is well and the
   messages can be ignored.

   The long version: this means that at startup, when innd searches
   for all cans present on the system, the file size recorded in the
   header of a can didn't match the true file size of the can (the
   file size in the header is rewritten each time a can is closed, or
   every few minutes). Innd assumes that the system lost power while
   storing articles and that therefore the last article couldn't be
   stored completely in the can. Innd tries to read the articles
   starting from the file size in the header (the last point at which
   articles were known to be stored safely) and, if necessary, cuts
   off the last article. The example above tells us that the can was
   truncated after its last byte and thus no article was lost.

   But most probably this means that overchan was shut down uncleanly
   as well, and that is not ok: as overchan buffers articles for up to
   60 seconds in memory and writes them into the overview database
   only after that, these articles will not be stored in the overview
   database. As the overview database is the only way for nnrpd to
   read articles, these articles won't be available to the news
   readers.

   Hence the shutdown scripts of the system should ensure a clean
   shutdown of innd, or innd should always be shut down manually using
   'ctlinnd shutdown "reason why"'.

Q: What's the meaning of shelf.01/199805182202!0000019110 ?

   That's the so-called canpos, i.e. the position at which an article
   can be found. In this example the article can be found at byte
   position 19110 in the can named 199805182202, which is the 2nd of
   the cans that will be expired on May 18th, 1998 at 22:00. This can
   is stored in shelf.01.

Q: I would like to quickly read this one article...

   No problem. The new command ccat exists for this: give it the
   canpos as parameter and the article will be sent to stdout. With
   newer shells the ! could be a problem.
   In this case issue the command as
   "ccat shelf.01/199805182202\!0000019110" or change the shell to the
   Bourne sh before issuing ccat.

Q: I would like to use inpaths

   The new command nidhdr was written for this too: nidhdr scans all
   articles of all cans for an arbitrary header and sends the output
   to stdout. So it's possible to gather statistics on the fly and
   very quickly with the command

      nidhdr path | inpaths -p news.domain.de

   With appropriate hardware it handles 1500 articles and more per
   second. If you don't know anything about inpaths, look at:
   http://www.freenix.fr/top1000/


MISC.

Q: I would like to learn more about the internals

   Obtain the appropriate paper at ftp.uni-stuttgart.de (German only)
   and read there what was true in August 1997. After that look at the
   sources. If that still doesn't help, read the next question in this
   FAQ.

Q: I still have problems, questions. Who will help?

   First, check if the problem is really related to "News in Cans".
   This program has been running in production environments on several
   systems for months without severe problems. Whenever problems were
   encountered, NiC was always the first suspect, even when the real
   cause later turned out to be heat problems, incompatibilities
   between SCSI controller and SCSI disks, configuration problems with
   INN, or a connection to the wrong news server.

   Nevertheless it is probable that bugs still exist in my
   modifications (I found some bugs in the INN code too) and that
   improvements are possible.

   If general problems can be ruled out, or if your question is of
   general interest, I suggest sending a mail to the list
   nid@listserv.uni-stuttgart.de. I'm reading this list too. For
   anything else I can be reached by mail at
   Tobias.Hennerich@swabsib.s.bawue.de. I'm trying to read my mail
   daily...

Q: I do have one more question. Why isn't it answered here?

   Please contact me. I'm trying to integrate all questions into this
   FAQ.

Q: Anything more?

   I would appreciate hearing about your experiences with NiC.
   Write to me if you are using this software and whether it gives the
   expected results. If anybody would like to help develop this
   software further, s/he's welcome.

   I would like to thank (in chronological order):

 - Barbara Burr, who supervised the student research project
   (Studienarbeit) and thereby made it possible for me to turn an idea
   into a project

 - Frank Scholz, who played a major part in finding the name "cans"
   ("Dosen") and who also got to hear about all my other problems

 - The admins of BaWue-Net, who made it possible for me to test the
   first installation of NiD on a production system

 - The RUS team, which made a reference installation possible by
   installing NiD on the official news server of the University of
   Stuttgart and also provided resources such as the anonymous FTP
   server and the mailing list

 - Michael Giegerich, who suddenly and without any "advance warning"
   mailed me the English version of the FAQ (that was like Christmas
   and Easter at once).