Hi,
I have a FreeBSD 9.0 ZFS storage server that has suddenly started printing "swap_pager" errors to the console. The system still responds to pings during these lockups, but the "swap_pager" messages do not make it to /var/log/messages.
The server has always been under heavy load, but the load hasn't increased lately. The only change is a move to another rack. I'm only using around 800K of swap, so it's not as if the system is swapping heavily.
According to http://www.freebsd.org/doc/faq/troubleshoot.html#indefinite-wait-buffer, this points to hardware: disk, cabling, etc. The SMART logs of the two mirrored system disks showed "Timeout errors", so both system disks have been replaced, one at a time. The SATA and power cables have also been replaced, but without any luck.
The problem occurs daily, but not at the same time of day. We have Nagios monitoring the host, but it never changes state to warning or critical. From the moment the problem occurs, the Nagios performance graphs go flat (see the attached picture). The only thing that resolves it is a hard reboot.
Since the console is frozen when the problem occurs, I have no way to view logs or other system information. I have considered installing more RAM or moving the swap partition to another disk. Reinstalling the system is of course a last resort, but I don't want to go there before I know whether this is a software or a hardware problem.
Hopefully all the needed information is below; otherwise just ask.
System Hardware:
- Supermicro X8DTL-iF motherboard
- 2*Intel Xeon E5420 CPUs
- 4*16GB RAM modules (64GB)
- 2*160GB system disks in raid1 (geom raid)
Storage backend:
- 1*LSI 9207-8e
- 4*36-bay JBODs connected via SAS
swapinfo:
Code:
Device 1K-blocks Used Avail Capacity
/dev/raid/r0p3 8388608 708 8387900 0%
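Since the figures above are only a snapshot, it may help to record swap usage at regular intervals so the numbers leading up to a freeze are preserved. Below is a minimal sketch of that idea, assuming `swapinfo -k` output in the layout shown above; the function name `swap_usage` is hypothetical and not part of any tool.

```shell
# Hypothetical helper: reads `swapinfo -k` output on stdin and prints
# "<device> <used_kb> <total_kb>" for each swap device.
# It skips the header line and the "Total" summary line that swapinfo
# prints when more than one swap device is configured.
swap_usage() {
    awk 'NR > 1 && $1 != "Total" { print $1, $3, $2 }'
}
```

Run from cron, this could be used as `swapinfo -k | swap_usage | logger -t swapwatch -h loghost`, where the tag `swapwatch` and the remote syslog host `loghost` are placeholders. Sending to a remote host is an assumption on my part that local disk writes may stall during the lockup, which would also explain why nothing reaches /var/log/messages.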
mount:
Code:
/dev/raid/r0p2 on / (ufs, local, journaled soft-updates)
devfs on /dev (devfs, local, multilabel)
/dev/raid/r0p4 on /var (ufs, local, journaled soft-updates)
/dev/raid/r0p5 on /tmp (ufs, local, journaled soft-updates)
/dev/raid/r0p6 on /usr (ufs, local, journaled soft-updates)
zdata on /zdata (zfs, local, nfsv4acls)
ztest on /ztest (zfs, local, nfsv4acls)
x.x.x.x:/fs2/JOBS/ndata on /ndata (nfs)
uname:
Code:
FreeBSD HOSTNAME 9.0-RELEASE-p3 FreeBSD 9.0-RELEASE-p3 #0: Tue Jun 12 02:52:29 UTC 2012 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
df:
Code:
/dev/raid/r0p2 on / (ufs, local, journaled soft-updates)
devfs on /dev (devfs, local, multilabel)
/dev/raid/r0p4 on /var (ufs, local, journaled soft-updates)
/dev/raid/r0p5 on /tmp (ufs, local, journaled soft-updates)
/dev/raid/r0p6 on /usr (ufs, local, journaled soft-updates)
zdata on /zdata (zfs, local, nfsv4acls)
ztest on /ztest (zfs, local, nfsv4acls)
x.x.x.x:/fs2/JOBS/ndata on /ndata (nfs)
/etc/rc.conf
Code:
defaultrouter="x.x.x.x"
hostname="xxxx"
keymap="danish.iso.kbd"
ifconfig_em0="up"
ifconfig_em1="up"
ifconfig_em2="up"
ifconfig_em3="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport em0 laggport em1 laggport em2 laggport em3 x.x.x.x netmask x.x.x.x"
sshd_enable="YES"
ntpd_enable="YES"
ntpd_flags="${ntpd_flags} -g"
powerd_enable="YES"
# Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
dumpdev="NO"
nrpe2_enable="YES"
samba_enable="YES"
zfs_enable="YES"
ftpd_enable="YES"
/boot/loader.conf
Code:
geom_raid_load="YES"
kern.maxfiles="16384"
mpslsi_load="YES"
hw.intr_storm_threshold=20000
net.inet.tcp.sendspace=1048576
net.inet.tcp.recvspace=1048576
kern.ipc.maxsockbuf=4194304
kern.ipc.nmbjumbop=262144
http://picpaste.com/nagios_swap_pager_error-6WQnx9QU.png