System metrics - what tools do you use?

dvl@ · Jul 23, 2015

What tools are you using to collect system metrics? I am asking about IO throughput, temperature, disk space used, etc. I want to have the pretty graphs too.

To be clear, I'm not asking about monitoring tools, such as Nagios, which I'm already using.

phoenix · Jul 23, 2015

Not sure what you're asking for.

We monitor just about everything via SNMP. Originally using a semi-automated process using MRTG (to gather the stats and update the rrd databases) and Routers2 (to display all the pretty graphs). We're in the process of moving most of the monitoring over to LibreNMS (a fork of Observium).

LibreNMS uses SNMP to monitor just about everything under the sun, stores it in rrd databases, and then presents it all in pretty graphs, bar charts, and alerts. The downside to LibreNMS is that it's all done on a fixed 5-minute interval. Our MRTG setup lets us use any time interval for the monitoring (1 minute for network and disk I/O, and 1 minute load avg; 5 minutes for disk, CPU, and RAM usage; 15 minutes for other things, etc). The upside to LibreNMS is that it's a lot more automated, and supports distributed polling setups.

We also use Nagios, but that's strictly for host status/alerts (is this system up, is that network reachable, is the web server online, that kind of thing).

jrm@ · Jul 23, 2015

Oko gave me a tour of his Observium setup and I have to say, it does have a very nice interface. The downside (for me) with Observium/LibreNMS are the hard dependencies on things like MySQL, Apache, and Bash. Hopefully the LibreNMS developers are more open to generalizing those requirements. Oko also had some interesting things to say about net-mgmt/collectd5. Hopefully he'll chime in, since he's spent a lot of time looking into this and has quite a nice setup.

I'm using sysutils/ganglia-monitor-core / sysutils/ganglia-webfrontend and I like them, but I'm not doing anything too sophisticated. You can check out our setup here. Apparently you can hook in to SNMP, but I haven't messed with that, so I can't say much.

junovitch@ · Jul 23, 2015

Since net-mgmt/collectd5 was brought up, it would be worth mentioning that the write_graphite plugin is available by default and Sebulon's how to is a good starting point for setting it up.
https://forums.FreeBSD.org/threads/enterprise-class-reporting-monitoring-with-graphite.45652
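
For anyone who has not looked at Graphite before: its plaintext listener takes one line per data point in the form "metric.path value timestamp", and write_graphite ends up sending lines of roughly that shape. A minimal sketch for sending a test value by hand, where the function name `graphite_line`, the metric path, and the host `graphite.example.org` are all made up for illustration, and port 2003 is assumed to be the plaintext listener:

```shell
#!/bin/sh
# Sketch only: format a single data point in Graphite's plaintext format.
# $1 = metric path, $2 = value, $3 = Unix timestamp (defaults to "now")
graphite_line() {
  printf '%s %s %s\n' "$1" "$2" "${3:-$(date +%s)}"
}

# Hypothetical usage (host and metric path are placeholders):
# graphite_line "servers.host1.load.shortterm" 0.42 | nc graphite.example.org 2003
```

If a test point shows up in the Graphite web UI, the receiving side is working and any remaining problem is likely in the collectd configuration.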

dvl@ · Jul 23, 2015

junovitch said:
Since net-mgmt/collectd5 was brought up, it would be worth mentioning that the write_graphite plugin is available by default and Sebulon's how to is a good starting point for setting it up.
https://forums.FreeBSD.org/threads/enterprise-class-reporting-monitoring-with-graphite.45652

Is anyone using collectd for disk bandwidth metrics? If so, I would like to see your graphs.

dvl@ · Jul 23, 2015

phoenix said:
Originally using a semi-automated process using MRTG (to gather the stats and update the rrd databases) and Routers2 (to display all the pretty graphs).

By Routers2, do you mean http://blog.webernetz.net/tag/routers2-cgi/ ?

Oko · Jul 24, 2015

dvl@ said:
What tools are you using to collect system metrics? I am asking about IO throughput, temperature, disk space used, etc. I want to have the pretty graphs too.

Hi Dan,

I use two tools.

The first tool, Observium, is SNMP based. I am in the process of migrating from TurnKey Observium to its fork LibreNMS, which was recently ported to OpenBSD and FreeBSD. Naturally, since I am primarily an OpenBSD user, I will try to run LibreNMS on that platform first. The second tool is collectd, which uses its own daemon. Observium has a plug-in to display collectd graphs. I will give you temporary read access to part of my lab so that you can see a real live demo. Credentials are sent via PM.

Some comments which you can read at your leisure.

Observium started as a simple PHP interface for an SNMP walk tool to collect metrics from servers, routers, switches, UPSs, and PDUs. At this time I am not monitoring my switches, UPSs, and PDUs, so you are not seeing some of the best of what Observium has to offer. SNMP is Observium's biggest asset but also its weakest point. SNMP is a polling protocol. It is very difficult to poll devices behind firewalls/VPNs if you don't have a proxy. There is no proxy for Observium.

When you open Observium and click on one of the servers one of the tabs will be collectd.

Collectd works as a push mechanism, which makes it very suitable for monitoring servers behind firewalls and on VPNs. You install a client, which is trivial to configure (it takes about a minute). The biggest weakness of collectd is the lack of a decent front end. Actually, IMHO Observium is the best front end for collectd. Most of what you see here

https://collectd.org/wiki/index.php/List_of_front-ends

are half-baked tools, collectd-web (Perl based) being the best. I really like collectd for metrics.

Observium can display log files. Since I am using the turnkey Observium based on Debian, it uses rsyslog as a backend. That causes trouble with OpenBSD syslog and even more with FreeBSD native syslog. (BTW, native OpenBSD syslog is capable of TLS and many other things and is light years ahead of FreeBSD native syslog. The situation is the same with sensorsd, the OpenBSD native sensor framework. OpenBSD native snmp also shines vs bsd-snmp (from FreeBSD) vs net-snmp, and OpenBSD has a native daemon for monitoring UPSs.) However, Observium supports syslog-ng. I came to the conclusion that log files should be displayed by a specialized data mining tool. I am in the process of putting together a syslog-ng + ELK OpenBSD server.

Observium used to have a plugin for NfSen but it is luckily unmentioned. I hate those kinds of kitchen-sink tools. By the way, I really like NfSen for netflow monitoring.

jrm mentioned MySQL as something he dislikes about Observium. I agree. I'd add a few more things. I dislike that it is difficult to use custom MIBs. PF, at least on OpenBSD, has magnificent MIBs for SNMP, but it is not possible to walk them from Observium. The Observium people insist on using Debian or Ubuntu. They don't even officially support Red Hat.

Recently Observium got a multimillion dollar grant. They will be using the money to add more to that kitchen sink. Namely, they want to add an early alerting system (e-mail notification) and daemon correction, which I already have. I use Monit, and its centralized server M/Monit in particular.

I am really happy about the LibreNMS fork of Observium; it has been long overdue. The whole fork thing reminds me of the Foswiki fork of TWiki.

There is a native OpenBSD monitoring infrastructure symon

http://wpd.home.xs4all.nl/symon/

It is showing a little bit of age but it is usable.

I am also very familiar with M/Monit, Nagios, Munin, Ganglia, Cacti, and Zabbix. I could probably write a small article about monitoring or logging tools for that matter.

Obviously, on the above list only Munin and Cacti are, strictly speaking, metric monitoring. jrm is gaga about Ganglia, which I used to have in the lab while we had a ROCKS cluster. The cluster has been decommissioned and Ganglia with it. However, I might be getting a new 42U node cluster and it will be monitored with Ganglia.

phoenix · Jul 24, 2015

dvl@ said:
By Routers2, do you mean http://blog.webernetz.net/tag/routers2-cgi/ ?

Yeah, that's the one. It's a nicer, more featureful web interface to rrd files created by MRTG. It's especially nice for creating aggregate or compound graphs using data from multiple rrd files (e.g., show total network throughput for all internet connections in a single graph).

This is the main website for it.

junovitch@ · Jul 25, 2015

dvl@ said:
Is anyone using collectd for disk bandwidth metrics? If so, I would like to see your graphs.

I am using the native disk utilization but not any bandwidth type metrics. Keep in mind anything that doesn't have a native plug-in can still be easily grabbed with a shell script and the collectd-exec(5) plug-in.

Here's a few that I've used in the past to give you an idea on where to start.

PF states via bsnmpd(1)

Code:

#!/bin/sh
# Exec script for collectd to read bsnmpd stats
# Falls back to the short hostname and a 60 s interval when not run by collectd
HOSTNAME="${COLLECTD_HOSTNAME:-$(hostname -s)}"
INTERVAL="${COLLECTD_INTERVAL:-60}"

while sleep "$INTERVAL"
do
  # Walk the PF state table count and emit it as a collectd gauge
  /usr/bin/bsnmpwalk \
  -s my_password@localhost -n 1.3.6.1.4.1.12325.1.200.1.3.1 \
  | awk -v HOSTNAME="$HOSTNAME" -v INTERVAL="$INTERVAL" \
  '{ print "PUTVAL", HOSTNAME"/exec-snmp/gauge-pfCounter/pfStateTableCount interval="INTERVAL, "N:"$3 }'
done

www/squid

Code:

#!/bin/sh
# Exec script for collectd to read Squid stats
SQUID_HOST="localhost"
HOSTNAME="${COLLECTD_HOSTNAME:-$(hostname -s)}"
INTERVAL="${COLLECTD_INTERVAL:-60}"

while sleep "$INTERVAL"
do
  # Request counters, plus the server.* and client.* counters
  squidclient -h "$SQUID_HOST" cache_object://localhost/counters \
  | awk -F ' = ' -v HOSTNAME="$HOSTNAME" -v INTERVAL="$INTERVAL" \
  '/requests|^(server|client)/ \
  { print "PUTVAL", HOSTNAME"/exec-squid/counter-squid/"$1, "interval="INTERVAL, "N:"$2 }'

  # IP cache requests, hits, and misses (spaces stripped from names and values)
  squidclient -h "$SQUID_HOST" cache_object://localhost/ipcache \
  | awk -F ':' -v HOSTNAME="$HOSTNAME" -v INTERVAL="$INTERVAL" \
  '/IPcache (Requests|Hits|Misses)/ \
  { gsub(/ /, "", $1); gsub(/ /, "", $2);
  print "PUTVAL", HOSTNAME"/exec-squid/counter-squid/"$1, "interval="INTERVAL, "N:"$2 }'

  # Store directory sizes, reported as gauges with the "KB" suffix removed
  squidclient -h "$SQUID_HOST" cache_object://localhost/storedir \
  | awk -F ':' -v HOSTNAME="$HOSTNAME" -v INTERVAL="$INTERVAL" \
  '/(Maximum Swap Size|Current Store Swap Size)/ \
  { gsub(/KB/, ""); gsub(/ /, "", $1); gsub(/ /, "", $2);
  print "PUTVAL", HOSTNAME"/exec-squid/gauge-squid/"$1, "interval="INTERVAL, "N:"$2 }'
done

dns/unbound. This could work with the Unbound in the base system as well, with a path change.

Code:

#!/bin/sh
# Exec script for collectd to read Unbound stats
# Requires that user running the script has permissions to the Unbound key
HOSTNAME="${COLLECTD_HOSTNAME:-$(hostname -s)}"
INTERVAL="${COLLECTD_INTERVAL:-60}"

while sleep "$INTERVAL"
do
  # Keep the total.*, num.*, and mem.* lines; turn "a.b=1" into "a_b 1"
  /usr/local/sbin/unbound-control stats \
  | awk -v HOSTNAME="$HOSTNAME" -v INTERVAL="$INTERVAL" \
  '/^(total|num|mem)\./ \
  { gsub(/\./, "_"); gsub(/=/, " ");
  print "PUTVAL", HOSTNAME"/exec-unbound/gauge-unbound/"$1, "interval="INTERVAL, "N:"$2 }'
done
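
The three scripts above share one pattern: run a command, reshape its output with awk, and print PUTVAL lines for collectd to pick up. As a further sketch of the same pattern, here is how disk space used could be pulled out of `df -kP`. collectd already has a native df plug-in, so this is only to show the shape. The function name `df_to_putval` and the `exec-df` / `gauge-..._used_kb` identifiers are made up for this example, and it assumes mount points contain no spaces.

```shell
#!/bin/sh
# Sketch only: turn POSIX `df -kP` output into collectd PUTVAL lines.

# Reads `df -kP` output on stdin; $1 = hostname, $2 = interval in seconds.
# Prints one PUTVAL line per filesystem with the used space in KB.
df_to_putval() {
  awk -v host="$1" -v interval="$2" '
    NR > 1 {                         # skip the header line
      name = $6                      # mount point (assumed to have no spaces)
      gsub(/\//, "_", name)          # "/" separates identifier parts
      if (name == "_") name = "root" # the root filesystem
      else sub(/^_/, "", name)       # drop the leading underscore
      print "PUTVAL", host "/exec-df/gauge-" name "_used_kb", "interval=" interval, "N:" $3
    }'
}

# Loop as the exec plug-in would run it; defined but not called here.
main() {
  HOSTNAME="${COLLECTD_HOSTNAME:-$(hostname -s)}"
  INTERVAL="${COLLECTD_INTERVAL:-60}"
  while sleep "$INTERVAL"
  do
    df -kP | df_to_putval "$HOSTNAME" "$INTERVAL"
  done
}
```

Keeping the parsing in a function that reads stdin makes it easy to try against saved command output before handing the script to collectd.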

dvl@ · Aug 13, 2015

I have installed LibreNMS and added a host. Here are some example graphs.

Of note, the number of logged in users is wrong (see other post).

The following is disk IO on a system with a 6-disk raidz2 (ada2-ada7; ada0 & ada1 are unused). It would be nice to remove pass* from the following chart.