FirstServed Tech Blog - FirstServed and the Art of Server Tuning

Monitoring disk usage with Nagios passive checks and SNMP traps

Why passive checks and SNMP traps?

The standard Nagios plugins include both a generic SNMP plugin, check_snmp, and one specifically designed to monitor disk usage, check_snmp_disk.  Setting up disk usage monitoring with those requires that you enable disk usage monitoring in the SNMP agent on the host to be monitored, and set up the appropriate service and command on your Nagios server.

You have two options when configuring Nagios: either you set up a service with active checks, where Nagios polls an SNMP OID regularly to check that free disk space isn’t too low, or you set up your service with passive checks only, which means Nagios will not check disk usage itself, but will be fed with alerts sent out from the monitored host.  Though the second setup is much more complex than the first one, we prefer it because it allows us to delegate the checks to the monitored servers, and helps keep the load on the Nagios server down.

Setting up SNMP on the monitored host

First, create an SNMP v3 user that will be used for snmpd’s internal checks.  This is done by adding a line to SNMP’s persistent data file, which is usually to be found at /var/net-snmp/snmpd.conf.  Do NOT put the line below in your main snmpd configuration file!

createUser disman MD5 asecretpassphrase

Note that this line will disappear from the file the next time the SNMP agent starts: the agent replaces it with a usmUser entry that holds the derived key.  This is nothing to worry about!

Next, add the UCD-SNMP MIB to your system-wide SNMP settings ( found at /etc/snmp/snmp.conf or /usr/local/share/snmp/snmp.conf ).  This will allow you to refer to the needed SNMP OIDs by name instead of by number.

mibs IF-MIB:UCD-SNMP-MIB

After this, add the following information to snmpd’s configuration file ( found at /etc/snmp/snmpd.conf for a Red Hat distro, or /usr/local/share/snmp/snmpd.conf if you installed net-snmp from source ):

# readonly user needed for disman monitoring
rouser disman auth

# security name used for internal checks
agentSecName disman

# the default trapcommunity – better change this from the default ‘public’
trapcommunity public

# choose whether to send SNMP v1 traps, SNMP v2c traps or SNMP v2c informs to your Nagios server
# do not send multiple trap types to the same destination
# change 10.10.10.1 to your Nagios server’s address

#trapsink 10.10.10.1
trap2sink 10.10.10.1
#informsink 10.10.10.1

# add the drives to monitor, and define what constitutes low disk space
# specify the disk space in kilobytes or as a percentage
# you can also use ‘includeAllDisks’ to monitor all available disks

#includeAllDisks 10%
# monitor ‘/’ for 10G available disk space
disk / 10000000
# define the monitoring frequency, what will be monitored, and what information will be added to the SNMP trap
# the line below monitors the dskErrorFlag every 60 seconds, and will send out an SNMP trap when the value reaches its lower or upper threshold – 0 or 1.
monitor -u disman -t -r 60 -o dskPath -o dskAvail -o dskTotal "dskTable" dskErrorFlag 0 1

Restart the snmpd daemon to complete the setup of the agent.  Don’t forget to add snmpd to the services started at boot if you haven’t already done so.
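
Once the agent is back up, it can help to confirm that it actually exposes the disk entry before waiting for a trap.  The sketch below is only an illustration: it assumes the net-snmp command line tools are installed and that snmpwalk prints values in its default symbolic form, and the function name flagged_disks is made up for this example.

```shell
# Hypothetical helper: read snmpwalk output for dskErrorFlag on stdin and
# print the table index of every disk whose error flag is set.
# Assumes snmpwalk's default output format, for example:
#   UCD-SNMP-MIB::dskErrorFlag.1 = INTEGER: error(1)
flagged_disks() {
    awk '/dskErrorFlag\./ && /\(1\)$|: 1$/ {
        n = split($1, parts, ".")    # the index is the part after the last dot
        print parts[n]
    }'
}

# Usage on the monitored host, with the credentials from the createUser line:
#   snmpwalk -v3 -u disman -l authNoPriv -a MD5 -A asecretpassphrase \
#       localhost UCD-SNMP-MIB::dskErrorFlag | flagged_disks
```

No output means none of the monitored disks is currently below its threshold.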

Setting up snmptrapd on the Nagios server

For your freshly configured SNMP traps to reach your Nagios installation, you’ll have to configure the snmp trap handling daemon to send the traps to SNMPTT ( SNMP Trap Translator ), which is a nifty little bit of software that permits you to process SNMP traps in a detailed manner.

First add the following lines to /etc/snmp/snmptrapd.conf.  The access types form a single comma-separated list, without spaces:

authCommunity log,execute,net public

traphandle default /usr/sbin/snmptthandler

Then modify the snmptrapd startup script, so the trap handler daemon no longer translates numeric OIDs to their symbolic form:

#   OPTIONS="-Lsd -p /var/run/snmptrapd.pid"
   OPTIONS="-On -p /var/run/snmptrapd.pid"

Start snmptrapd and add it to the startup services.

Setting up SNMPTT on the Nagios server

SNMPTT can be downloaded at http://www.snmptt.org/.  Follow the readme file to install the software, add snmptt to the startup services, and then edit snmptt.ini as follows:

# enable dns reverse translation of ip addresses to host names
dns_enable = 1
# assuming your nagios hosts are defined as short host names, strip the domain part of the dns answer
strip_domain = 1
# add a new file to the snmptt conf files
snmptt_conf_files = <<END
/etc/snmp/snmptt/disman-event
END

Next create the /etc/snmp/snmptt/disman-event file, with the content below.  Note that the PREEXEC lines normally wouldn’t be necessary, but the version of net-snmp we’re using seems to be buggy when reverse mapping ip addresses, and always passes <UNKNOWN> as the agent’s host name.

EVENT fsDiskSpaceLow .1.3.6.1.2.1.88.2.0.2 "Status Events" Warning
MATCH MODE=and
MATCH $1: (^dskTable$)
MATCH $5: 1
PREEXEC /usr/bin/dig +short -x $aA | awk -F. '{ print $$1 }'
FORMAT Disk Space on $p1 low ( $7 / $8 on disk $6 )
EXEC /usr/lib/nagios/plugins/eventhandlers/submit_check_result $p1 disk-usage 1 "Disk Space low ( $7 / $8 on disk $6 )"
SDESC
Disk Space low
EDESC

EVENT fsDiskSpaceNormal .1.3.6.1.2.1.88.2.0.3 "Status Events" Normal
MATCH MODE=and
MATCH $1: (^dskTable$)
MATCH $5: 0
PREEXEC /usr/bin/dig +short -x $aA | awk -F. '{ print $$1 }'
FORMAT Disk Space on $p1 normal ( $7 / $8 on disk $6 )
EXEC /usr/lib/nagios/plugins/eventhandlers/submit_check_result $p1 disk-usage 0 "Disk Space normal ( $7 / $8 on disk $6 )"
SDESC
Disk Space normal
EDESC

What does all the above mean?  Two events are defined, keyed on the OIDs of the Disman notifications mteTriggerRising and mteTriggerFalling.  Each event additionally matches on the monitor name ‘dskTable’ and on the dskErrorFlag value.  This is necessary because the Disman notification OIDs are generic and are also used for load average monitoring, log file monitoring, etc.  This way, we can define other events with the same OIDs in SNMPTT and map them to a different Nagios service.

The first event generates a passive check result for Nagios for host $p1, service disk-usage and status WARNING.  The second event does the same, but with an OK status.
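
The decision encoded by the two EVENT blocks can be sketched as a small shell function.  This is only a model of the matching logic for illustration, not something SNMPTT runs; the function name disk_trap_status is hypothetical.

```shell
# Model of the two EVENT definitions: print the Nagios return code
# (0 = OK, 1 = WARNING), or fail if the trap is not a disk event.
# Arguments: trap OID, monitor name ($1 in SNMPTT), dskErrorFlag value ($5).
disk_trap_status() {
    oid=$1
    name=$2
    flag=$3
    [ "$name" = "dskTable" ] || return 1       # some other Disman monitor
    case "$oid:$flag" in
        .1.3.6.1.2.1.88.2.0.2:1) echo 1 ;;     # mteTriggerRising, flag set
        .1.3.6.1.2.1.88.2.0.3:0) echo 0 ;;     # mteTriggerFalling, flag cleared
        *) return 1 ;;
    esac
}
```

A trap with the same OID but a different monitor name falls through, which is what leaves room for further events on the same OIDs.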

Restart your snmptt daemon after modifying the file, and proceed to the next step, adding the /usr/lib/nagios/plugins/eventhandlers/submit_check_result file, which looks like this:

#!/bin/sh
# Write a command to the Nagios command file to cause
# it to process a service check result

echocmd="/bin/echo"

CommandFile="/var/spool/nagios/cmd/nagios.cmd"

# get the current date/time in seconds since UNIX epoch
datetime=`date +%s`

# create the command line to add to the command file
cmdline="[$datetime] PROCESS_SERVICE_CHECK_RESULT;$1;$2;$3;$4"

# append the command to the end of the command file
# (quoted, so the shell neither globs "[...]" nor collapses whitespace)
"$echocmd" "$cmdline" >> "$CommandFile"
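
To see exactly what ends up in the command file, the line format can be reproduced in isolation.  The sketch below is an illustration with a made-up function name, format_check_result; the timestamp is passed as an argument only to keep the output reproducible.

```shell
# Build the external command line that submit_check_result writes.
# Arguments: timestamp, host name, service description, return code, output.
# In real use the timestamp would be "$(date +%s)".
format_check_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Manual test on the Nagios server (web01 is a placeholder host name):
#   /usr/lib/nagios/plugins/eventhandlers/submit_check_result \
#       web01 disk-usage 1 "Disk Space low ( manual test )"
```

Running the script by hand like this is a quick way to check the Nagios side of the chain without involving SNMP at all.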

Setting up Nagios for passive checks

Create a service like this:

define service {
        host_name               <your host>
        service_description     disk-usage
        check_command           check-disk-usage-linux!<your host’s ip address>!<snmp community>

        active_checks_enabled           0
        check_freshness                 1
        check_period                    24x7
        contact_groups                  sysad
        event_handler_enabled           0
        freshness_threshold             86400
        is_volatile                     1
        max_check_attempts              1
        normal_check_interval           86400
        notifications_enabled           1
        notification_interval           0
        notification_options            w,u,c,r,f
        notification_period             24x7
        passive_checks_enabled          1
        retry_check_interval            86400
        stalking_options                w,u,c
 
}

The explanation for all options can be found in the Nagios documentation describing passive-only services.  The check_command is used by Nagios when no passive check data have been received within the freshness_threshold, one day in this example.  The command should look like this:

define command {
        command_name    check-disk-usage-linux
        command_line    $USER1$/check_snmp_disk -H $ARG1$ -C $ARG2$
}
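
To make the freshness rule concrete: with freshness_threshold set to 86400, the last result counts as stale once more than a day has passed since it arrived, and only then does Nagios run the check_command.  A minimal sketch of that comparison follows; it is a simplified model with a made-up function name, not what Nagios does internally.

```shell
# Simplified model of the freshness test, not Nagios code.
# Arguments: time of the last result and current time (epoch seconds),
# optionally the threshold in seconds (defaults to 86400).
is_stale() {
    last=$1
    now=$2
    threshold=${3:-86400}
    [ $((now - last)) -gt "$threshold" ]
}

# Example:
#   is_stale "$last_result_epoch" "$(date +%s)" && echo "active check would run"
```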

Verify your Nagios configuration after modifying the files ( nagios -v followed by the path to your main configuration file ), and restart the daemon.  You should now be able to test your disk usage monitoring by raising the low disk space threshold in the target host’s snmpd.conf file above the space currently available, and restarting the snmpd daemon.
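
One way to pick a threshold for such a test is to read the space currently available and go a little above it.  The sketch below assumes POSIX-style df output ( df -Pk ), where the fourth column of the second line is the available space in kilobytes; the function name forced_disk_line and the 1000000 kB margin are arbitrary choices for illustration.

```shell
# Hypothetical helper: read `df -Pk <mount>` output on stdin and print a
# `disk` directive whose threshold lies 1000000 kB above the space that is
# currently available, so the low-space condition is met right away.
forced_disk_line() {
    mount=$1
    awk -v m="$mount" 'NR == 2 { printf "disk %s %d\n", m, $4 + 1000000 }'
}

# Usage on the monitored host:
#   df -Pk / | forced_disk_line /
```

Put the printed line in snmpd.conf in place of the regular disk line for the duration of the test, and restore the original value afterwards to see the matching recovery event.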

Troubleshooting

If the above setup doesn’t work out of the box, which is pretty probable considering how easy SNMP is to set up, try the following:

  • Set up SNMPTT to log traps to syslog.  If you get no syslog output, the problem lies with your target host’s snmpd configuration.
  • Set up SNMPTT to debug to a log file.  You’ll see detailed information about the translation of the traps.
  • Set up SNMPTT to log unknown traps to a log file. If your events are not correctly defined, you’ll see your traps appear in that file.
  • Add ‘traphandle default /usr/bin/traptoemail -s localhost -f <your email> <your email>’ to your snmptrapd.conf file and restart the daemon.  This will send all SNMP traps to your e-mail address so you can analyze them in further detail.
  • Check your Nagios event log.  Do the service check results appear in the log, with the correct hostname and service name?
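
If the host name in the log does not match the one defined in Nagios, it may be worth running the reverse lookup from the PREEXEC line by hand.  The sketch below reproduces the awk step only; the function name short_host and the address 10.10.10.2 are placeholders, not values from the setup above.

```shell
# Same awk step as in the PREEXEC line ($$1 in SNMPTT is $1 here):
# keep everything before the first dot of a DNS name read from stdin.
short_host() {
    awk -F. '{ print $1 }'
}

# Usage on the Nagios server, with the monitored host's address:
#   dig +short -x 10.10.10.2 | short_host
```

The result has to match the host_name of the Nagios service exactly, otherwise the passive check result is not attached to it.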

Good luck!
