Thursday, June 26, 2014

Nagios: Dell Server OpenManage to Monitor Dell Hardware


If you have a Dell server you would be crazy not to have OpenManage Server Administrator on it.  It provides a web interface to keep an eye on the system, including hardware and RAID health.  Ideally we want to be notified via email if a disk in an array pops.  OpenManage has a built-in tool that can be scripted to do this, but it can get hacky and inconsistent.  After trying a few ways for Nagios to monitor OpenManage I came across the incredible check_openmanage plugin by Trond Hasle Amundsen.  His detailed tutorial might look intimidating to someone new to Nagios, but it's actually very easy to get a simple check set up that will cover all of the important stuff.  Soon you will have lovely Dell hardware alerts in your email inbox.

This how-to covers getting check_openmanage working on a Dell server with Windows (Part 2) and a Dell server with CentOS (Part 3).  On the Windows box we will use NSClient++ to answer NRPE checks sent from the Nagios server.  On the CentOS server we will also use NRPE, but with the NRPE service from a yum repository.

1. nagios server
Since we are using NRPE, all we need to do is set up our configs so the Nagios server can pass a check_nrpe request to the Dell server.  That's it: assuming check_nrpe is already installed on the Nagios server and a host config for the Dell server exists, no additional plugins are needed on the Nagios server.  Inside my host config I will define the check_openmanage service:
 define service{  
   use generic-service  
   host_name servername.local  
   service_description Dell OpenManage  
   check_command check_nrpe!check_openmanage!30  
 }  
I am already using check_nrpe for other checks, but for reference here is how mine is defined; you may need to add something similar to your commands.cfg:
 define command{  
   command_name check_nrpe  
   command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -t $ARG2$  
 }  
Remember to service nagios restart after any config changes to Nagios.
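A restart with a broken config can leave Nagios down, so it can help to run the built-in pre-flight check (nagios -v) first.  Here is a minimal sketch of that idea.  The function name safe_restart is made up for this example, and the binary and config paths are assumptions based on a typical source install, so adjust them for your system:

```shell
#!/bin/sh
# Assumed paths for a source install; override them for your system.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"
RESTART_CMD="${RESTART_CMD:-service nagios restart}"

# safe_restart is a hypothetical helper name, not part of Nagios.
safe_restart() {
    # "nagios -v <main config>" verifies the configuration without starting anything.
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null 2>&1; then
        $RESTART_CMD
    else
        echo "Config check failed; not restarting Nagios" >&2
        return 1
    fi
}
```

Call safe_restart after editing a config; if the check fails, run the nagios -v command by hand to see which line it is complaining about.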

2. windows server host
Now that the Nagios server is set up we can take care of the Dell servers.  First we will implement a solution for Windows Server.  Here is what we will need on our Windows box:
The plugin requires OpenManage Server Administrator.  If you don't have it installed, go ahead and grab the latest version and install it; the x86 build has always worked for me.  The install takes a while, so be patient, and once it's done pull up the panel (https://localhost:1311) to test it out.  You will get cert errors, that's OK.

Install NSClient++.  I wrote a quick and dirty guide here on it.  Next we will install the check_openmanage plugin.  Download the bits from the above link.  Technically the Windows box only needs the exe file, but I downloaded the zip so I can store it on my file server for use on the CentOS box; the zip contains the exe you need.  Extract the zip and put the exe in your NSClient++ directory:


With the plugin in place we can open a command prompt and test it.  We can run the plugin by hand, pass it the location of omreport.exe (which is installed with OpenManage), and have it return the omreport data in a pretty console line.  My omreport.exe is located at: C:\Program Files (x86)\Dell\SysMgt\oma\bin\omreport.exe.  If you installed 64-bit OpenManage then you will use Program Files and not Program Files (x86) in your omreport.exe location.


As you can see I am getting output from omreport.exe; it is telling me I am using non-certified hard drives, shame on me!  Now that we know check_openmanage works we can tell NSClient++ to run it.  Edit the nsclient.ini that is in the NSClient++ folder.  Add the following:
 [/settings/external scripts/scripts]  
 check_openmanage = check_openmanage.exe -b pdisk_cert=all --omreport "C:\Program Files (x86)\Dell\SysMgt\oma\bin\omreport.exe"  
You will need to make sure the --omreport option points to the location of omreport.exe on your system.  Notice the -b pdisk_cert=all?  "-b" is blacklist, and since I am using non-certified disks in this server I don't want that particular warning to show up in my Nagios panel.

Make sure CheckExternalScripts = 1 in the same file, and restart the nsclient service:


Since our Nagios server is all set up, we should be getting a report on our nagios admin panel:


If there are any issues at this point you can send the check_openmanage request by hand with check_nrpe from the Nagios server to the Windows server.  Run the following on your Nagios server (make sure to browse to the plugins directory that has check_nrpe):
 # ./check_nrpe -H 192.168.1.100 -c check_openmanage -t 30  
You should see the omreport output, which will start with System - OK, if there are no OpenManage alerts.  Any other issues you will need to double check configs and firewall settings, typical Nagios stuff!
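When testing by hand it also helps to look at the exit code, since that is what Nagios uses to decide the service state.  The numeric codes below are the standard Nagios plugin return codes; the helper name state_name is made up for this sketch:

```shell
#!/bin/sh
# state_name is a hypothetical helper; it maps a plugin exit code
# to the state Nagios will show for it.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "OUT OF RANGE ($1)" ;;
    esac
}

# Example, run from the plugins directory on the Nagios server:
#   ./check_nrpe -H 192.168.1.100 -c check_openmanage -t 30
#   state_name $?
```

If the text output looks fine but the state is not what you expect, the exit code is the first thing to check.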

3. centos host
Since we can't use check_openmanage.exe on our CentOS server, we need to use the check_openmanage script, which is in the zip download from before; it's the one that has NO .exe extension.  Of course we will need OpenManage Server Administrator installed, which comes packaged in a repository provided by Dell, how nice!  Log into your CentOS server and run the following.  This comes right from the Dell page here, which may be updated, so check that first.
 # wget -q -O - http://linux.dell.com/repo/hardware/latest/bootstrap.cgi | bash  
 # yum install srvadmin-all  
 # /opt/dell/srvadmin/sbin/srvadmin-services.sh start  
You can test the Server Administrator panel by visiting https://localhost:1311 or by opening port 1311 in iptables and visiting remotely using the IP address of the CentOS server.

The Dell binaries are stored in /opt/dell/srvadmin/sbin/.  In this directory we will find our omreport script, which is what the check_openmanage script will use.  Now install NRPE and the nagios-plugins package, if you haven't already.
 # rpm -Uvh http://epel.mirror.net.in/epel/6/i386/epel-release-6-8.noarch.rpm  
 # yum install nrpe nagios-plugins-all  
First we will put the check_openmanage script into the Nagios plugins directory, located at /usr/lib64/nagios/plugins/.  This directory is owned by root, so you may need to chmod 777 plugins/ in order to copy the file.  The file (check_openmanage) is found in the zip we downloaded before.  Use your favorite method of copying files, but if you are not that comfortable then I recommend WinSCP.  To test the plugin browse to the plugins folder and run it (you may need to chmod 777 the file first):


OK... good to know I need to fix some things!
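Going back to the copy step for a moment: if you would rather not leave the plugins directory or the script world-writable, one alternative is to copy the file as root with install, which sets the mode in the same step.  This is only a sketch of that alternative, not what I ran above; install_plugin is a made-up helper name and the default directory is the one used in this post:

```shell
#!/bin/sh
# install_plugin is a hypothetical helper name.
# Usage: install_plugin <path-to-check_openmanage> [plugin-dir]
install_plugin() {
    src="$1"
    dest_dir="${2:-/usr/lib64/nagios/plugins}"
    # install(1) copies the file and sets its mode in one step.
    # 0755 = owner may write, everyone may read and execute.
    install -m 0755 "$src" "$dest_dir/"
}

# Example (as root), assuming the script was uploaded to /tmp:
#   install_plugin /tmp/check_openmanage
```

NRPE only needs to read and execute the script, so 0755 should be enough.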

Next we need to modify /etc/nagios/nrpe.cfg.  Under the Allowed Host Addresses section add the IP of your Nagios server like so:
 allowed_hosts=127.0.0.1,192.168.1.18  
Way at the bottom of the file are the service check commands, we will add one for check_openmanage:
 command[check_openmanage]=/usr/lib64/nagios/plugins/check_openmanage -b pdisk_cert=all  
Notice that, just like on the Windows server, I am adding -b pdisk_cert=all, because I don't need to be reminded that I am using uncertified hard drives.

Now we need to start the NRPE service and add it to chkconfig so it runs on boot.  Also open port 5666 so the Nagios server can communicate with the CentOS server.
 # service nrpe start  
 # chkconfig nrpe on  
If the service fails to start there may be an issue with your config file, just run a tail /var/log/messages and see what it says.
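For the port 5666 step, here is a small sketch that prints an iptables rule allowing only the Nagios server in.  It prints the command rather than running it, so you can review it first.  The helper name nrpe_fw_rule is made up, and 192.168.1.18 is the Nagios server IP from the allowed_hosts example above:

```shell
#!/bin/sh
# nrpe_fw_rule is a hypothetical helper that only PRINTS the command.
# Usage: nrpe_fw_rule <nagios-server-ip> [port]
nrpe_fw_rule() {
    nagios_ip="$1"
    port="${2:-5666}"
    printf 'iptables -I INPUT -p tcp -s %s --dport %s -j ACCEPT\n' "$nagios_ip" "$port"
}

nrpe_fw_rule 192.168.1.18
# prints: iptables -I INPUT -p tcp -s 192.168.1.18 --dport 5666 -j ACCEPT
```

Run the printed command as root.  On a CentOS release that uses the iptables service, service iptables save should keep the rule across reboots; treat that as an assumption about your setup.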

At this point, with nrpe listening and port 5666 open we can manually run check_nrpe on the Nagios server to test the connection to our CentOS server:


Now, similar to testing the Windows server above, we can take it one step further and pass check_nrpe with a check_openmanage request:


Since our Nagios server is all set up from part 1, open your Nagios admin panel and look for some win:


Now I need to get back to fixing servers...

resources:

Monday, June 16, 2014

Nagios: Monitor Active Directory and Exchange 2010 Services


We have an SBS 2011 box that we want to monitor for a client.  Nagios was set up for them before to notify the managers of low disk space on their servers, and now they can be notified of any weird AD or Exchange issues on their SBS server.  We are going to use a cocktail of technologies to talk to the SBS box, plus a host of PowerShell plugins and default Nagios commands that will report back on the health of the system.
  • check_nt will allow us to check important processes and services.
  • check_nrpe will allow us to run PowerShell monitoring scripts on the client machine.
  • check_smtp will make sure Exchange is listening for email.
check_nt
First we will tackle setting up everything on the client machine for check_nt.  Download the latest NSClient++ (the 64-bit build works for me) and run it.  Accept the licence.  Select Typical install.  On the NSClient++ Configuration window I like to check the box to Allow all users to write config file.  On the next window put in your Nagios server IP; I select everything except NSCA (not needed for this server).


Next Next (or 1-3 Next's) then Install.

The first thing you will want to do is add ports 5666 (NRPE) and 12489 (check_nt) to the Windows firewall so Nagios can talk to it.  Next, open C:\Program Files\NSClient++\nsclient.ini and double check a couple of things:
  • CheckExternalScripts = 1
  • allowed hosts = 192.168.1.18 (your Nagios server IP)
  • NRPEServer = 1
  • NSClientServer = 1
Any changes made to nsclient.ini require the NSClient++ service to be restarted in services.msc.  Restart it for good measure.


After the NSClient++ install and opening of the firewall we should now be able to see if check_nt is working, so let's log into the Nagios server and test the check_nt checks.  Browse to your plugin folder, in my case /usr/local/nagios/libexec, and run the following, looking for the proper output:
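A minimal sketch of such a test, assuming check_nt lives in /usr/local/nagios/libexec and NSClient++ is listening on the default check_nt port 12489.  The helper name nt_smoke_test is made up, and the IP in the example is a placeholder for your Windows server:

```shell
#!/bin/sh
# Assumed plugin location; override CHECK_NT if yours differs.
CHECK_NT="${CHECK_NT:-/usr/local/nagios/libexec/check_nt}"

# nt_smoke_test is a hypothetical helper name.
# Usage: nt_smoke_test <windows-host> [port]
nt_smoke_test() {
    host="$1"
    port="${2:-12489}"
    # CLIENTVERSION and UPTIME need no thresholds, so they make easy first tests.
    for var in CLIENTVERSION UPTIME; do
        printf '%s: ' "$var"
        "$CHECK_NT" -H "$host" -p "$port" -v "$var"
    done
}

# Example (placeholder IP):
#   nt_smoke_test 192.168.1.100
```

If both lines come back with real values, the firewall rules and the allowed hosts setting are doing their job.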


Notice that is a capital H for host.  You can always run ./check_nt -h to help troubleshoot.  Now let's open our Nagios configs and add the checks.

Full disclosure: the way I organize my servers is to have a separate .cfg for each server sitting in the Nagios config directory, in my case /usr/local/nagios/etc/servers.  Inside the folder there are configs called servername.cfg, and inside each config I define the hostname and the service definitions to run against the client server.  There is more than one way to skin a grape, and I prefer this way because some checks I want to run against some servers and some checks I don't, so breaking it out by server makes sense to me.  If I had 1000 servers to monitor I would probably not do it this way, but I digress.  We will open our servername.cfg and add the check_nt checks:

 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description Uptime  
      check_command check_nt!UPTIME  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description NSClient++ Version  
      check_command check_nt!CLIENTVERSION  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description CPU Load  
      check_command check_nt!CPULOAD!-l 5,80,90  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description Memory Usage  
      check_command check_nt!MEMUSE!-w 80 -c 90  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description C:\ Drive Space  
      check_command check_nt!USEDDISKSPACE!-l c -w 80 -c 90  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description D:\ Drive Space  
      check_command check_nt!USEDDISKSPACE!-l d -w 80 -c 90  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description Drive Space H:\ Exchange Logs  
      check_command check_nt!USEDDISKSPACE!-l h -w 80 -c 90  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description Drive Space I:\ Mailbox DBs  
      check_command check_nt!USEDDISKSPACE!-l i -w 80 -c 90  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description Microsoft Exchange Active Directory Topology  
      check_command check_nt!PROCSTATE!-d SHOWALL -l MSExchangeADTopologyService.exe  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description Microsoft Exchange Protected Service Host  
      check_command check_nt!PROCSTATE!-d SHOWALL -l Microsoft.Exchange.ProtectedServiceHost.exe  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description Microsoft Exchange Service Host  
      check_command check_nt!PROCSTATE!-d SHOWALL -l Microsoft.Exchange.ServiceHost.exe  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description Microsoft Exchange System Attendant  
      check_command check_nt!PROCSTATE!-d SHOWALL -l mad.exe  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description Active Directory Domain Services  
      check_command check_nt!PROCSTATE!-d SHOWALL -l lsass.exe  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description DNS Server Service  
      check_command check_nt!PROCSTATE!-d SHOWALL -l dns.exe  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description DFS Namespace Service  
      check_command check_nt!PROCSTATE!-d SHOWALL -l dfssvc.exe  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description DFS Replication Service  
      check_command check_nt!PROCSTATE!-d SHOWALL -l DFSRs.exe  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description Intersite Messaging Service  
      check_command check_nt!PROCSTATE!-d SHOWALL -l ismserv.exe  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description Microsoft Exchange Forms Based Authentication Service  
      check_command check_nt!SERVICESTATE!-d SHOWALL -l MSExchangeFBA  
 }  
 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description Microsoft Exchange Information Store  
      check_command check_nt!SERVICESTATE!-d SHOWALL -l MSExchangeIS  
 }

Yikes, that is a lot of content.  We are checking system uptime, NSClient++ version, CPU load, memory usage, and drive space.  We are also checking AD and Exchange services and processes.  Some of this you will need, some of it you won't, so customize for your environment.  I found these checks from several different sources, see below for those.  Next we need to add the check_nt command definition to commands.cfg; this may already be done, that's OK.

Open /usr/local/nagios/etc/objects/commands.cfg and add the following:

 # 'check_nt' command definition  
 define command{  
      command_name     check_nt  
      command_line     $USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -v $ARG1$ $ARG2$  
 }

At this point our check_nt checks should be working.  Do a service nagios restart on the Nagios server and make sure the configs are good and the service starts.  Back in your Nagios admin panel you will see a sea of new checks under your host, which will eventually be crunched by the Nagios server.  If you have any issues, check out the links below, which go into more depth on Nagios, check_nt and how it all works together.
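The service definitions earlier differ only in their description and check command, so if typing them out gets tedious they can be generated.  This is just a sketch: emit_service is a made-up helper, and it assumes the same generic-service template and sbsserver.local host name used above:

```shell
#!/bin/sh
# emit_service is a hypothetical helper, not part of Nagios.
# Usage: emit_service <host_name> <service_description> <check_command>
emit_service() {
    cat <<EOF
define service{
     use generic-service
     host_name $1
     service_description $2
     check_command $3
}
EOF
}

# Example: one disk space check per drive letter.
for drive in c d h i; do
    emit_service sbsserver.local "Drive Space ${drive}:" \
        "check_nt!USEDDISKSPACE!-l ${drive} -w 80 -c 90"
done
```

Redirect the output into your servername.cfg (or a scratch file first) and restart Nagios as usual.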

check_nrpe
With check_nrpe we can run some custom PowerShell scripts against the SBS box to help monitor Active Directory and Exchange.  Back on our SBS server, let's download the scripts and put them in their new home.  I am using the following scripts I got from the very helpful telnetport25.com.
  • Exchange2010BackupMonitoring.ps1
  • Exchange2010ContentIndexMonitor.ps1
Drop them into C:\Program Files\NSClient++\scripts.  Please open them up and read the comments; you may need to change stuff to match your environment.  For example, with the Backup script you may need to change how many days old you expect a backup to be.  If you back up Exchange once a week, then change it to 7 or 8 days.

You will also want to set the PowerShell script execution policy to Bypass so that NSClient++ can run the scripts.  Once the scripts are in place open PowerShell, browse to the scripts folder and execute the scripts as a test:


If you have issues here make sure your user has access to the Exchange shell plugin and Exchange cmdlets.  Once that is done it's time to edit nsclient.ini.  Browse to C:\Program Files\NSClient++, open nsclient.ini in your favorite editor and add the following to the end of the file:

 [/settings/external scripts/scripts]  
 check_exbackup=cmd /c echo scripts\Exchange2010BackupMonitoring.ps1 | PowerShell.exe -Command -  
 check_exindex=cmd /c echo scripts\Exchange2010ContentIndexMonitor.ps1 | PowerShell.exe -Command -  

With these entries in place, when the Nagios server calls check_exbackup (or check_exindex), NSClient++ knows which script to run.  Save, close, and restart the NSClient++ service.


Back on the Nagios server, let's test our new checks out.  Once again browse to where your plugins are, /usr/local/nagios/libexec in my case, and run the following:


We want to edit the servername.cfg and add the checks for these new scripts.  Add the following:

 define service{  
      use generic-service  
      host_name sbsserver.local  
      service_description Exchange DB Content Indexing  
      check_command check_nrpe!check_exindex!60  
 }
 define service{  
      use generic-service  
      host_name sbsserver.local
      service_description Microsoft Exchange Backups  
      check_command check_nrpe!check_exbackup!60  
 }

You will notice we are calling check_nrpe, and it might be added to commands.cfg by default, so let's check by hand.  Browse to /usr/local/nagios/etc/objects, open commands.cfg and add/look for the following:

 define command{  
      command_name check_nrpe  
      command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -t $ARG2$  
 }

Notice we set a timeout (-t) of 60, but with the Exchange scripts we may need to set it to 120.  PowerShell has to load the Exchange cmdlets, which can take extra time, so increasing the wait time might be needed.  Save and do a service nagios restart to check the configs.
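To pick between 60 and 120, it can help to measure how long a check really takes when run by hand.  Here is a rough sketch; time_check is a made-up helper, it only has one-second resolution, and the IP in the example is a placeholder:

```shell
#!/bin/sh
# time_check is a hypothetical helper name.
# Usage: time_check <command> [args...]
# Runs the command, prints how many seconds it took, and keeps its exit code.
time_check() {
    start=$(date +%s)
    status=0
    "$@" || status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start))s"
    return "$status"
}

# Example, run from the plugins directory on the Nagios server:
#   time_check ./check_nrpe -H <sbs-server-ip> -c check_exbackup -t 120
```

If the elapsed time regularly creeps close to the timeout passed in the service definition, bump the timeout up.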


Our Nagios panel is looking nice (see title image!)  Now that you are comfortable with check_nt and check_nrpe you can go crazy with plugins, there are a lot of options.  Check out the Exchange and Windows Server plugin sections of the Nagios Exchange for more goodness.  There are also some Active Directory scripts in the Windows Server section if you want more monitoring than the check_nt services listed above.

check_smtp
The vanilla nagios-plugins package has a nice check_smtp plugin we are going to use to say "helo" to our Exchange box.  Let's go back to our Nagios server and into the Nagios plugins folder (/usr/local/nagios/libexec) and test it out:


OK, looks good to me.  Once again let's add it to our servername.cfg and make sure it's in commands.cfg:

 define service{  
    use generic-service  
    host_name sbsserver.local
    service_description Check SMTP
    check_command check_smtp!60  
 }

And command definition:

 # 'check_smtp' command definition  
 define command{  
   command_name  check_smtp  
   command_line  $USER1$/check_smtp -H $HOSTADDRESS$ -t $ARG1$  
 }  

Save everything and do a service nagios restart.  With this we are running the plugin locally on the Nagios server and simply asking Exchange if it's up.  There is no need for check_nt, check_nrpe or NSClient++; it's all happening on the Nagios server.  Once Nagios gets around to it we should have a healthy reply:


thanks and resources:
Thanks to those who helped me get this going: