Solved zpool health monitoring

I have Zabbix too; maybe I'll write some scripts to integrate it via SNMP (I don't use Zabbix agents), and also some scripts for Monit.
 
I do use agents on my servers and only use SNMP for network equipment. I really need to rewrite my ZFS template; I was using a specific patch to get specific information. But the basic idea was to use Zabbix's LLD (low-level discovery) to find all pools and all datasets within each pool, then use zfs-get(8) or zpool-get(8) to get the interesting properties for each pool and dataset. You could then put those in graphs or add alerts, thresholds, etc.
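To make the discovery idea concrete, here is a minimal sketch of what the pool discovery side could look like. It is not the template described above: the function name pools_to_lld and the macro name {#POOL} are made up for illustration, and it assumes pool names need no JSON escaping.

```shell
#!/bin/sh
# Sketch only: turn pool names (one per line on stdin) into Zabbix LLD JSON.
# {#POOL} is a hypothetical macro name; use whatever your template expects.
# Assumes pool names contain no double quotes or backslashes.
pools_to_lld() {
  first=1
  printf '{"data":['
  while IFS= read -r pool; do
    [ -n "$pool" ] || continue          # skip empty lines
    [ "$first" -eq 1 ] || printf ','    # comma between entries
    printf '{"{#POOL}":"%s"}' "$pool"
    first=0
  done
  printf ']}\n'
}
```

It would be fed with something like `zpool list -H -o name | pools_to_lld`; the same approach with `zfs list -H -o name` should work for datasets.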

With Zabbix 5 (and 6) you can do a lot more processing on the Zabbix server itself, so you don't need to gather each property individually; you can 'bulk' transfer a bunch of values in one go. My templates still stem from a time when this wasn't possible.
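As a rough sketch of the 'bulk' idea (an illustration, not the actual template): collect several pool properties in one call and hand them over as a single JSON object, which dependent items with JSONPath preprocessing (e.g. `$.capacity`) could then split up on the Zabbix side. The function name props_to_json is hypothetical, and the chosen properties are assumed to have values without quotes or backslashes.

```shell
#!/bin/sh
# Sketch only: convert tab-separated "property<TAB>value" lines on stdin
# into one flat JSON object. No JSON escaping is done, so stick to
# properties with simple values (numbers, health states).
props_to_json() {
  awk -F '\t' '
    BEGIN { printf "{" }
    NF >= 2 {
      printf "%s\"%s\":\"%s\"", sep, $1, $2
      sep = ","                 # comma before every entry after the first
    }
    END { print "}" }
  '
}
```

A possible invocation would be `zpool get -Hp -o property,value size,allocated,free,capacity,health zroot | props_to_json`, so one item fetches everything and the rest is derived from it.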
 
I finally wrote a script for Zabbix zpool capacity monitoring and also one for Monit, which checks capacity as well as zpool status.
 
How do I find your scripts so I don't have to reinvent the wheel? :)
 
For Zabbix I only check the % capacity using this script:

Code:
#!/usr/local/bin/bash

zpool list -Ho capacity zroot | awk -F"%" '{print $1}'

And my snmpd.conf contains this line:

Code:
extend .1.3.6.1.4.1.2024.50 zroot /usr/local/bin/bash /usr/local/share/snmp/zroot_capacity.sh

Then in Zabbix I use .1.3.6.1.4.1.2024.50 to fetch the data.

For Monit I use this in my monitrc:

Code:
check program zfs_health with path "/usr/local/etc/monit/zfs_health_check.sh 50"
  if status != 0 then alert

and the zfs_health_check.sh contains:

Code:
#!/bin/sh

maxCapacity=$1 # in percentages

usage="Usage: $0 maxCapacityInPercentages\n"

if [ ! "${maxCapacity}" ]; then
  printf "Missing arguments\n"
  printf "${usage}"
  exit 1
fi

# Output for monit user interface

printf "==== ZPOOL STATUS ====\n"
printf "%s" "$(/sbin/zpool status)"
printf "\n\n==== ZPOOL LIST ====\n"
printf "%s\n" "$(/sbin/zpool list)"

condition=$(/sbin/zpool status | grep -E 'DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED|corrupt|cannot|unrecover')

if [ "${condition}" ]; then
  printf "\n==== ERROR ====\n"
  printf "One of the pools is in one of these statuses: DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED|corrupt|cannot|unrecover!\n"
  printf "%s\n" "$condition"
  exit 1
fi

capacity=$(/sbin/zpool list -H -o capacity | cut -d'%' -f1)

for line in ${capacity}
  do
    if [ "$line" -ge "$maxCapacity" ]; then
      printf "\n==== ERROR ====\n"
      printf "One of the pools has reached its max capacity!\n"
      exit 1
    fi
  done

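# ONLINE device lines whose READ, WRITE or CKSUM counter is non-zero.
# Comparing each field avoids missing counters such as "10 0 0", which
# concatenate to "1000" and would slip past a 'grep -v 000' filter.
errors=$(/sbin/zpool status | awk 'NF >= 5 && $2 == "ONLINE" && ($3 != 0 || $4 != 0 || $5 != 0)')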

if [ "${errors}" ]; then
  printf "\n==== ERROR ====\n"
  printf "One of the pools contains errors!\n"
  printf "%s\n" "$errors"
  exit 1
fi

# Finish - If we made it here then everything is fine
exit 0
 