Strange Nagios problem

windependence · Jul 23, 2013

I'm running 9.1-RELEASE and just installed Nagios because Zabbix has a bug and won't work correctly with my new version of PHP. I'm new to Nagios but not to FreeBSD; all my production stuff runs FreeBSD.

Nagios is running fine except for two problems that are driving me crazy. First off, it is falsely reporting HTTP is down on my main server. The other two servers are using the same plugin and configuration so I am baffled. It reports:

Code:

Connection refused HTTP CRITICAL - Unable to open TCP socket.

I don't get it because the other two are using the same setup except of course Apache22 might be slightly different on the other boxes as they are running different web sites. The Nagios log has this one line in it:

Code:

CURRENT SERVICE STATE: localhost;HTTP;CRITICAL;HARD;4;Connection refused

But I know httpd is running just fine and the web server is up and accessible.

The other problem is that on server number 2 Nagios is giving me a critical alert but says the plugin isn't installed. The error is:

Code:

(Return code of 127 is out of bounds - plugin may be missing)

I have googled this until I'm blue in the face, but the problem is that not all of the servers are doing this. If it were all of them, I'm sure I could find the problem. It's just this one server, and this is the only alert on that server; all the other plugins are working fine. Here is the log entry for that error:

Code:

CURRENT SERVICE STATE: sombrero.mexicanstotheusa.com;Root Partition;CRITICAL;HARD;3;(Return code of 127 is out of bounds - plugin may be missing)
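
For readers hitting the same message: Nagios expects a plugin to exit with 0, 1, 2, or 3, and 127 is the status a POSIX shell returns when it cannot find the command it was asked to run. A minimal sketch of that behaviour follows; the command name `no_such_plugin_for_demo` is made up for illustration and is assumed not to exist on the machine.

```shell
# Ask a shell to run a command that is assumed not to exist
# (hypothetical name, chosen only for this demonstration).
status=0
sh -c 'no_such_plugin_for_demo -w 20% -c 10%' 2>/dev/null || status=$?
echo "exit status: $status"

# Nagios only understands 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN).
case "$status" in
    0|1|2|3) verdict="valid plugin return code" ;;
    *)       verdict="out of bounds" ;;
esac
echo "$verdict"
```

With NRPE involved, this may point at the command being executed on the remote side rather than at the Nagios server itself, though the message alone does not say which.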

I'm about to remove all monitoring from the server and forget it. I had Zabbix running, and when I upgraded everything to 9.1 the PHP bug got me; Munin for some reason doesn't want to work at all; and now Nagios is being totally strange. I will post any details here if you guys can help me figure this out. I'm hoping someone with Nagios knowledge is roaming the forums.

Thanks in advance,

-Tim

SirDice · Jul 23, 2013

It seems it gets a "connection refused", which means there's a RST being sent back. As this check is against localhost, is the web server actually listening on localhost? Or is it perhaps bound only to the IP address?

windependence · Jul 23, 2013

That was it for the HTTP! I added the loopback to the Apache configuration and restarted Apache and it cleared. Thank you so much @SirDice!

Now on to problem #2.

-Tim

da1 · Jul 23, 2013

Is Nagios running on server number 2 and doing a local check or is it installed on another machine and doing the checks via nrpe(2)?

SirDice · Jul 23, 2013

windependence said:
That was it for the HTTP! I added the loopback to the Apache configuration and restarted Apache and it cleared.

It's probably better if you run the check against the IP address. That's the important bit of your web server. With the check you now have it's possible the httpd daemon on localhost is still running while the one bound to the IP address is dead. That will probably never happen but it's something to keep in mind. If you really want to check for availability you should also run the check remotely.

windependence · Jul 23, 2013

da1 said:
Is Nagios running on server number 2 and doing a local check or is it installed on another machine and doing the checks via nrpe(2)?

It's running on the main box and checking via nrpe2, as you said. There are two remotes and the one main server. The other remote (server3) is just fine with the root filesystem check; that's what makes it so weird.

-Tim

da1 · Jul 24, 2013

Some things I would check:

1. Does nrpe have the "sudo" command enabled and does the sudoers file contain the correct executable line?
2. pkg_libchk from ports-mgmt/bsdadminscripts
3. /var/log/messages - does the sudo command execute correctly? Any other relevant errors?
4. If you run the checks locally, does everything appear to be in order?

windependence · Jul 24, 2013

Thanks for the quick feedback, @da1.

To address your points:

1. I don't have sudo installed on any of the boxes. Not sure where you were going here.
2. Do you mean I should install this and run the command? Not familiar.
3. No significant messages in the log.
4. Yes, it works locally. Here is the output of the command:

Code:

# /usr/local/libexec/nagios/check_disk -w 20% -c 10% -p /
DISK OK - free space: / 337 MB (73% inode=97%);| /=118MB;396;445;0;495
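
As an aside on reading that line: the part after the `|` is plugin performance data, which the Nagios plugin guidelines lay out as `label=value;warn;crit;min;max`. A small sketch, assuming that layout, that splits the output above into its fields (the variable names are only illustrative):

```shell
# Output line copied from the post above.
line='DISK OK - free space: / 337 MB (73% inode=97%);| /=118MB;396;445;0;495'

# Everything after the "|" is performance data; drop the leading space.
perfdata=${line#*'|'}
perfdata=${perfdata# }

# Split "label=value;warn;crit;min;max" on "=" and then on ";".
label=${perfdata%%=*}
rest=${perfdata#*=}
IFS=';' read -r value warn crit min max <<EOF
$rest
EOF

echo "label=$label value=$value warn=$warn crit=$crit min=$min max=$max"
```

Read this way, the warning and critical values (396 and 445) look like the `-w 20%` and `-c 10%` free-space thresholds expressed as used megabytes of a 495 MB filesystem (80% and 90% of 495), though that is an inference from the numbers.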

Thanks again for the help, I'm pretty much a n00b at Nagios.

-Tim

da1 · Jul 24, 2013

Hi,

1. If you check /usr/local/etc/nrpe.conf there is a section about sudo.
2. pkg_libchk will do an OS-wide library check.

Have you tried running the check command from the Nagios machine manually?

Example:

nagios# /usr/local/libexec/nagios/check_nrpe2 -H <host> -c check_root
Or check_load or any other check_ cmd you have in nrpe.conf.

Is this the only command that fails BTW?

windependence · Jul 26, 2013

I finally tracked this down to a bad stanza in my hosts.cfg file. For anyone else with this problem, here is the bad code:

Code:

define service{
        use                     generic-service
        service_description     Root Partition
        contact_groups          admins
        host_name               cloudserver,sombrero.mexicanstotheusa.com
        check_command           check_nrpe!check_disk
        }

I got this from someone's how-to site, so be forewarned: apparently not everyone checks their code when they write a blog post.

@da1, thanks for all your help on this!

-Tim
