Thursday, June 26, 2014

Nagios: Dell Server OpenManage to Monitor Dell Hardware


If you have a Dell server, you would be crazy not to have OpenManage Server Administrator on it.  It provides a web interface for keeping an eye on the system, including hardware and RAID health.  Ideally we would be notified via email when a disk in an array pops.  OpenManage has a built-in tool that can be scripted to do this, but it can get hacky and inconsistent.  After trying a few ways for Nagios to monitor OpenManage, I came across the incredible check_openmanage plugin by Trond Hasle Amundsen.  One look at his detailed tutorial might be intimidating to someone new to Nagios, but it's actually very easy to set up a simple check that covers all of the important stuff.  Soon you will have lovely Dell hardware alerts in your email inbox.

This how-to covers getting check_openmanage working on a Dell server with Windows (Part 2) and a Dell server with CentOS (Part 3).  On the Windows box we will use the NSClient++ program to handle NRPE checks passed from the Nagios server to the Windows client.  On the CentOS server we will also use NRPE, this time the NRPE service installed from a yum repository.

1. nagios server
Since we are using NRPE, all we need to do is set up our configs so the Nagios server can pass a check_nrpe request to the Dell server.  That's it.  Assuming check_nrpe is already installed on the Nagios server and a host config for the Dell server exists, no additional plugins are needed on the Nagios server.  Inside my host config I define the check_openmanage service:
 define service{  
   use generic-service  
   host_name servername.local  
   service_description Dell OpenManage  
   check_command check_nrpe!check_openmanage!30  
 }  
I am already using check_nrpe for other checks, but for reference here is how mine is defined.  You may need to add something similar to your commands.cfg:
 define command{  
   command_name check_nrpe  
   command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -t $ARG2$  
 }  
Remember to run service nagios restart after any config changes to Nagios.
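To make the argument passing concrete, here is a small shell sketch of how the pieces of check_command end up in command_line: the text after each ! becomes $ARG1$, $ARG2$ and so on.  Nagios does this substitution internally; the function name expand_check_nrpe and the plugin path standing in for $USER1$ are assumptions made up for illustration.

```shell
# Illustration only: mimic how Nagios expands
#   check_command check_nrpe!check_openmanage!30
# into the command_line defined for check_nrpe.
# USER1 stands in for the $USER1$ macro; this path is an assumption.
USER1="/usr/lib64/nagios/plugins"

expand_check_nrpe() {
    check_command=$1   # e.g. check_nrpe!check_openmanage!30
    host_address=$2    # stands in for $HOSTADDRESS$

    # Split on "!": the first field is the command name, the rest are ARG1, ARG2...
    old_ifs=$IFS
    IFS='!'
    set -- $check_command
    IFS=$old_ifs

    # $1 = command name, $2 = ARG1 (NRPE command), $3 = ARG2 (timeout in seconds)
    printf '%s/%s -H %s -c %s -t %s\n' "$USER1" "$1" "$host_address" "$2" "$3"
}

# Print the command line Nagios would run for the service defined above.
expand_check_nrpe 'check_nrpe!check_openmanage!30' 192.168.1.100
```

The printed line is the same shape as the manual check_nrpe test used later in this post, which is handy when troubleshooting.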

2. windows server host
Now that the Nagios server is set up we can take care of the Dell servers.  First we will implement a solution for Windows Server.  Here is what we will need on our Windows box:
The plugin requires OpenManage Server Administrator.  If you don't have it installed, go ahead and grab the latest version and install it; the x86 build has always worked for me.  The install takes a while, so be patient.  Once it's done, pull up the panel (https://localhost:1311) to test it out.  You will get cert errors, that's OK.

Install NSClient++.  I wrote a quick and dirty guide here on it.  Next we will install the check_openmanage plugin.  Download the bits from the above link.  Technically, for the Windows box you only need the exe file, but I downloaded the zip so I can store it on my file server for use on the CentOS box; the zip contains the exe you need.  Extract the zip and put the exe in your NSClient++ directory:


With the plugin in place we can open a command prompt and test it.  We can run the plugin by hand, pass it the location of omreport.exe (which is installed with OpenManage), and have it return the omreport data in a pretty console line.  My omreport.exe is located at: C:\Program Files (x86)\Dell\SysMgt\oma\bin\omreport.exe.  If you installed 64-bit OpenManage, use Program Files instead of Program Files (x86) in your omreport.exe location.


As you can see I am getting output from omreport.exe, and it is telling me I am using non-certified hard drives, shame on me!  Now that we know check_openmanage works, we can tell NSClient++ to run it.  Edit the nsclient.ini that is in the NSClient++ folder and add the following:
 [/settings/external scripts/scripts]  
 check_openmanage = check_openmanage.exe -b pdisk_cert=all --omreport "C:\Program Files (x86)\Dell\SysMgt\oma\bin\omreport.exe"  
You will need to make sure the --omreport option points to the location of omreport.exe on your system.  Notice the -b pdisk_cert=all?  "-b" is blacklist, and since I am using non-certified disks in this server I don't want that particular warning to show up in my Nagios panel.

Make sure CheckExternalScripts = 1 in the same file, and restart the nsclient service:


Since our Nagios server is all set up, we should be getting a report on our Nagios admin panel:


If there are any issues at this point, you can test by hand: pass the check_openmanage command through check_nrpe from the Nagios server to the Windows server by running the following on your Nagios server (make sure to browse to the plugins directory that has check_nrpe):
 # ./check_nrpe -H 192.168.1.100 -c check_openmanage -t 30  
You should see the omreport output, which will start with System - OK if there are no OpenManage alerts.  For any other issues you will need to double-check configs and firewall settings, typical Nagios stuff!
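When running checks by hand like this, keep in mind that Nagios decides the state from the plugin's exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), not from the text.  The helper names nagios_status and run_check below are my own, a minimal sketch for wrapping any manual check so the state is printed next to the output.

```shell
# Map a Nagios plugin exit code to its state name.
# 0, 1 and 2 are the standard codes; 3 (and, in this sketch, anything else) is UNKNOWN.
nagios_status() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Run any check command, then print the state Nagios would assign
# followed by the plugin's own output. Returns the plugin's exit code.
run_check() {
    code=0
    output=$("$@") || code=$?
    echo "$(nagios_status "$code"): $output"
    return "$code"
}
```

From the plugins directory on the Nagios server, this could be used as run_check ./check_nrpe -H 192.168.1.100 -c check_openmanage -t 30.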

3. centos host
Since we can't use check_openmanage.exe on our CentOS server, we need to use the check_openmanage script, which is in the zip download from before; it's the one that has NO .exe extension.  Of course we will need OpenManage Server Administrator installed, which comes packaged in a repository provided by Dell, how nice!  Log into your CentOS server and run the following.  This comes right from the Dell page here, which may be updated, so check that first.
 # wget -q -O - http://linux.dell.com/repo/hardware/latest/bootstrap.cgi | bash  
 # yum install srvadmin-all  
 # /opt/dell/srvadmin/sbin/srvadmin-services.sh start  
You can test the Server Administrator panel by visiting https://localhost:1311 or by opening port 1311 in iptables and visiting remotely using the IP address of the CentOS server.

The Dell binaries are stored in /opt/dell/srvadmin/sbin/.  In this directory we will find our omreport script, which is what the check_openmanage script will use.  Now install NRPE and the nagios-plugins-all package, if you haven't already.
 # rpm -Uvh http://epel.mirror.net.in/epel/6/i386/epel-release-6-8.noarch.rpm  
 # yum install nrpe nagios-plugins-all  
First we will put the check_openmanage script into the Nagios plugins directory, located at /usr/lib64/nagios/plugins/.  This directory is owned by root, so copy the file as root (or with sudo) rather than loosening the permissions on plugins/.  The file (check_openmanage) is found in the zip we downloaded before.  Use your favorite method of copying files; if you are not that comfortable with the command line, I recommend trying WinSCP.  To test the plugin, browse to the plugins folder and run it (you may need to chmod 755 the file first so it is executable):


OK... good to know I need to fix some things!
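For the copy step above, one way to place the script and set the execute bit in a single command is install -m 0755.  This is only a sketch: the function name install_plugin is made up, and the default destination is the plugins directory mentioned earlier.

```shell
# Copy a plugin into the plugins directory with mode 755 in one step.
# On the real server, run this as root (or via sudo).
install_plugin() {
    src=$1
    dest_dir=${2:-/usr/lib64/nagios/plugins}

    # install copies the file and applies the mode at the same time.
    install -m 0755 "$src" "$dest_dir/" || return 1

    # Show the result so the owner and permissions can be eyeballed.
    ls -l "$dest_dir/$(basename "$src")"
}
```

After extracting the zip, install_plugin ./check_openmanage would put the script in place.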

Next we need to modify /etc/nagios/nrpe.cfg.  Under the Allowed Host Addresses section add the IP of your Nagios server like so:
 allowed_hosts=127.0.0.1,192.168.1.18  
Way at the bottom of the file are the service check commands.  We will add one for check_openmanage:
 command[check_openmanage]=/usr/lib64/nagios/plugins/check_openmanage -b pdisk_cert=all  
Notice that, just like on the Windows server, I am adding -b pdisk_cert=all, because I don't need to be reminded that I am using uncertified hard drives.
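Typos in these two settings are a common reason for a check to be refused, so a quick sanity check of the file can save some head scratching.  The helpers below (host_allowed and command_defined, both names my own) are a rough sketch that only greps the file; they do not validate it the way the NRPE daemon does.

```shell
# Succeed if the given IP is listed on an uncommented allowed_hosts line.
host_allowed() {
    cfg=$1
    ip=$2
    # Strip the "allowed_hosts=" key and any spaces, split on commas,
    # then require an exact match so 192.168.1.1 does not match 192.168.1.18.
    grep -E '^[[:space:]]*allowed_hosts[[:space:]]*=' "$cfg" |
        sed -e 's/^[^=]*=//' -e 's/[[:space:]]//g' |
        tr ',' '\n' |
        grep -Fxq "$ip"
}

# Succeed if an uncommented command[<name>]= entry exists.
command_defined() {
    cfg=$1
    name=$2
    grep -Eq "^[[:space:]]*command\[$name][[:space:]]*=" "$cfg"
}
```

Pass the path of the NRPE config file you just edited, for example host_allowed /path/to/nrpe/config 192.168.1.18 && echo "Nagios server is allowed".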

Now we need to start the NRPE service and add it to chkconfig so it runs on boot.  Also open port 5666, so the Nagios server can communicate with the CentOS server.
 # service nrpe start  
 # chkconfig nrpe on  
If the service fails to start, there may be an issue with your config file.  Just run tail /var/log/messages and see what it says.
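Opening port 5666 was mentioned without a command, so here is one hedged way to do it with iptables, limited to the Nagios server's address.  The function name allow_from is made up, and it only prints the rule so it can be reviewed before being run as root; 192.168.1.18 is the example Nagios server IP used in allowed_hosts.

```shell
# Print (not run) an iptables rule that lets one source host reach a TCP port.
# Review the output, then run it as root on the CentOS server.
allow_from() {
    src_ip=$1
    port=$2
    echo "iptables -I INPUT -p tcp -s $src_ip --dport $port -j ACCEPT"
}

# 5666 is the NRPE port; 192.168.1.18 is the example Nagios server address.
allow_from 192.168.1.18 5666
```

On CentOS 6, service iptables save should keep the rule across reboots, assuming the stock iptables service is what manages the firewall on the box.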

At this point, with NRPE listening and port 5666 open, we can manually run check_nrpe on the Nagios server to test the connection to our CentOS server:


Now, similar to testing the Windows server above, we can take it one step further and pass check_nrpe with a check_openmanage request:


Since our Nagios server is all set up from Part 1, open your Nagios admin panel and look for some win:


Now I need to get back to fixing servers...
