Standard SysAdmin Help File

ECLIPSE
  1. Operations Procedures
  2. Free Space alarm
  3. CPU Load alarm
  4. Inetd process alarm
  5. Sendmail process alarm
  6. NNTP process alarm
  7. DNS process alarm
  8. Mail Queue alarm
  9. Mail Age alarm
  10. PING alarm
  11. SMTP connection alarm
  12. POP3 connection alarm
  13. HTTP connection alarm

Operations Procedures

Day      Time                   Alarm Color  Action
Weekday  Normal working hours   Yellow       Notify the administrator
Weekday  Normal working hours   Red          Notify the administrator and log occurrence
Any day  Outside working hours  Yellow       Monitor the situation closely
                                             Notify administrator next working day
Any day  Outside working hours  Red          Notify the ON-CALL administrator
                                             Notify administrator next working day
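
The table above can also be read as a simple lookup. The following is a minimal sketch, not part of the original procedures: the function name action_for and the in_working_hours flag are assumptions made for illustration. What counts as normal working hours is site-specific, so the caller decides it.

```python
def action_for(color, in_working_hours):
    """Return the actions listed in the Operations Procedures table.

    color is "yellow" or "red" (case-insensitive).
    in_working_hours is True only on a weekday during normal working
    hours; any other time is treated as outside working hours.
    """
    color = color.lower()
    if color not in ("yellow", "red"):
        raise ValueError("color must be 'yellow' or 'red'")

    if in_working_hours:
        if color == "yellow":
            return ["Notify the administrator"]
        return ["Notify the administrator", "Log the occurrence"]

    if color == "yellow":
        return ["Monitor the situation closely",
                "Notify the administrator next working day"]
    return ["Notify the ON-CALL administrator",
            "Notify the administrator next working day"]
```

For example, action_for("red", False) returns the two actions for a red alarm outside working hours.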

Free Space alarm

All alarms regarding Free Space checks are serious and a report should be logged.

Common causes are a flood of email caused by a mail loop of some kind (/var), too much auditing (/var, /home, even / or /usr), or system logs that have not been cleaned out on a regular basis (/var or /usr). Huge system logs indicate a failure of some kind and should be examined to determine the cause of the problem.

When /home fills up, users are the most common cause. Cleaning out Netscape caches in /home/*/.netscape/cache/*/* can provide temporary relief. For a more permanent solution, find out who is using the most disk space with "du -s /home/*" and ask them by email to clean up their accounts. If all of the disk usage is required for valid reasons, then either expand the filesystem or create a /home2 and move some users to it. Another alternative is to move some users to a different server that has more capacity.
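
The output of "du -s /home/*" is easier to act on when it is sorted. The following is a rough sketch of the same idea, assuming Python is available on the host. The function name usage_by_directory is made up for this example, and it sums apparent file sizes rather than allocated disk blocks, so its numbers will not match du exactly.

```python
import os


def usage_by_directory(root):
    """Return (bytes, path) pairs for each directory directly under root,
    largest first. Sizes are sums of file sizes, not disk blocks."""
    totals = []
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                try:
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    # File vanished or is unreadable; skip it.
                    pass
        totals.append((total, entry.path))
    totals.sort(reverse=True)
    return totals
```

Calling usage_by_directory("/home") and printing the first few entries gives a short list of accounts to contact.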


CPU Load alarm

All RED alarms regarding CPU Load checks are potentially serious and may indicate a problem of some kind that should be investigated. The administrator should be notified immediately so that they can try to pinpoint the process causing the heavy load. A report should be logged only if the administrator has asked for it specifically (such as during an ongoing problem investigation).

Yellow alarms are NOT serious and only serve as a warning so that the situation can be monitored more closely. The administrator does not need to be notified and a report is not required.
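
As an illustration of how a load check might choose between yellow and red, here is a minimal sketch. The thresholds are not specified anywhere in this file, so yellow_at and red_at are hypothetical parameters that each site would have to choose. The names load_color and current_load_color are also assumptions. os.getloadavg() is only available on Unix-like systems.

```python
import os


def load_color(load, yellow_at, red_at):
    """Classify a load average against caller-supplied thresholds."""
    if load >= red_at:
        return "red"
    if load >= yellow_at:
        return "yellow"
    return "green"


def current_load_color(yellow_at, red_at):
    """Classify the current 1-minute load average (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return load_color(one_minute, yellow_at, red_at)
```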


Inetd process alarm

All alarms regarding the Inetd process are serious and a report should be logged. The administrator should be notified immediately since the system may require a reboot.

The inetd daemon is responsible for spawning processes to handle remote connections such as telnet, ftp, remote shells, and pop3 (required for PC-based mail clients). Without the inetd process the system will be effectively dead to any new remote connections and in most cases will require access from the system console.

If the console cannot be logged into, then in most cases the system will require a hard boot. This involves pressing the reset button (if one exists) or powering the system OFF, waiting for at least 30 seconds, and then powering it back on.


Sendmail process alarm

All alarms regarding Sendmail process failures are serious and a report should be logged. The administrator should be notified during normal working hours. If there is an ON-CALL administrator for the server, then they should be notified outside normal working hours.

Sendmail provides email delivery to and from a server and must be running at all times.


NNTP process alarm

All alarms regarding NNTP process failures are NOT serious but a report should be logged. The administrator should be notified during normal working hours. If there is an ON-CALL administrator for the server, then they should be notified outside normal working hours.

NNTP provides news service and must be running for users to be able to both read and send out news postings.


DNS process alarm

All alarms regarding DNS process failures are EXTREMELY serious and a report must be logged. The administrator should be notified IMMEDIATELY during normal working hours. If there is an ON-CALL administrator for the server, then they should be notified outside normal working hours.

DNS provides host name lookup service and must be running 24/7. It is a critical service.


Mail Queue alarm

Mail Queue alarms are not necessarily serious. They could simply be a temporary condition (which will eventually clear itself up) due to a single event where a lot of mail messages have been generated by a process or user. Alternatively, they could indicate a problem with the Postoffice systems, a hostname resolution problem (DNS, NIS, or /etc/hosts), a disk space problem, or a problem with the sendmail daemon itself.

A report should be logged for all red alarms if they persist for more than 30 minutes and it seems like the queues are not clearing up. The administrator should be notified in all cases so that the cause can be determined and steps taken to handle the problem. Perhaps in some cases the alarm threshold will have to be raised.


Mail Age alarm

Mail Age alarms are not necessarily serious but should be checked immediately. They indicate messages in the sendmail queue that are older than a given limit in seconds. The mail queue can be checked by logging onto the host and running mailq or /usr/lib/sendmail -bp.
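
To illustrate what "older than a given limit in seconds" means, the following sketch lists queue files whose modification time exceeds the limit. It rests on assumptions that are not stated in this file: that the queue directory (often /var/spool/mqueue) holds one data file per message whose name starts with "df", and that its modification time approximates when the message arrived. The function name old_queue_files is made up for this example. Reading the queue directory normally requires root.

```python
import os
import time


def old_queue_files(queue_dir, limit_seconds, prefix="df", now=None):
    """Return (age_seconds, filename) pairs for files in queue_dir whose
    names start with prefix and whose age exceeds limit_seconds,
    oldest first."""
    if now is None:
        now = time.time()
    old = []
    for entry in os.scandir(queue_dir):
        if not entry.is_file(follow_symlinks=False):
            continue
        if not entry.name.startswith(prefix):
            continue
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > limit_seconds:
            old.append((age, entry.name))
    old.sort(reverse=True)
    return old
```

An empty result means nothing in the directory is older than the limit. The output of mailq remains the authoritative view of the queue.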

They could indicate a problem with the Postoffice systems, a hostname resolution problem (DNS, NIS, or /etc/hosts), a disk space problem, or a problem with the sendmail daemon itself or its config file.

A report should be logged for all red alarms if they persist for more than 30 minutes and it seems like the queues are not clearing up. The administrator should be notified in all cases so that the cause can be determined and steps taken to avoid the problem in the future.


PING alarm

A PING alarm generally indicates either a host that is down or a router problem. It is useful in combination with a connection type alarm (such as SMTP or HTTP) to show whether the connection type alarm is due to something wrong with the server process on a running system or whether the entire connection is down.

Sometimes a single ping will fail over a slow serial connection, but subsequent ones will show as OK. This is why more than one ping is normally sent and the number of replies is counted. If 0 responses are received, then the alarm will be set to red. If all are returned, then the alarm is set to green. The yellow threshold can be set somewhere in the middle for unreliable connections.
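
The counting rule can be sketched as a small function. This is one interpretation only: the text above fixes red at 0 replies and green at all replies, but does not say exactly how the yellow threshold is applied. Here the hypothetical parameter min_for_green is the smallest reply count still treated as green, and it defaults to the number of pings sent. The function name ping_color is also an assumption.

```python
def ping_color(sent, received, min_for_green=None):
    """Classify a ping check by the number of replies received.

    0 replies is red. min_for_green or more replies is green.
    Anything in between is yellow.
    """
    if sent <= 0 or not 0 <= received <= sent:
        raise ValueError("need sent > 0 and 0 <= received <= sent")
    if min_for_green is None:
        min_for_green = sent  # strict: every ping must be answered
    if not 1 <= min_for_green <= sent:
        raise ValueError("min_for_green must be between 1 and sent")

    if received == 0:
        return "red"
    if received >= min_for_green:
        return "green"
    return "yellow"
```

With 5 pings sent and 3 replies, the strict default gives yellow, while ping_color(5, 3, min_for_green=3) gives green, which may suit a link that is known to drop packets.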


SMTP connection alarm

All alarms regarding SMTP connection failures are serious and a report should be logged. The administrator should be notified during normal working hours. If there is an ON-CALL administrator for the server, then they should be notified outside normal working hours.

SMTP provides mail service and must be running for users to be able to receive and send out mail messages.
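
A connection check of this kind can be approximated by opening a TCP connection and reading the server greeting. The following is a minimal sketch, not the check used by the alarm system itself. It assumes the server listens on port 25 and relies on the SMTP convention that a ready server greets with a reply starting with 220. The function name smtp_greeting_ok and the default timeout are assumptions.

```python
import socket


def smtp_greeting_ok(host, port=25, timeout=10.0):
    """Return True if host accepts a TCP connection on port and sends
    a greeting that starts with the SMTP 220 reply code."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            greeting = conn.recv(512)
    except OSError:
        # Refused, unreachable, timed out, or name lookup failed.
        return False
    return greeting.startswith(b"220")
```

A False result here combined with a green PING alarm for the same host suggests a problem with the mail server process rather than with the host or the network path.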


POP3 connection alarm

All alarms regarding POP3 connection failures are serious only during normal weekday working hours. A report should be logged. POP3 is required for remote users to get access to their email using a PC-based POP3 mail client such as Z-Mail.

These alarms generally indicate a problem with either the system load (usually accompanied by a CPU load alarm), a full or nearly full filesystem (again usually accompanied by a Free Space alarm), a failure of the inetd daemon (which is usually the method by which the POP3 daemon is started), or a more serious problem such as the removal of all or part of the required software. The administrator should be notified so that the problem can be tracked down and fixed. If the POP3 service is not required, then it should be removed from the alarm checks.


HTTP connection alarm

All alarms regarding HTTP connection failures are NOT serious outside of normal working hours. If they occur during normal working hours, a report should be logged and the administrator notified.

HTTP provides web service and must be running during normal working hours.


September 26, 1997 - John van Gulik