swap_pager - indefinite wait buffer

Hi,

I have a FreeBSD 9.0 ZFS storage server that has suddenly started printing "swap_pager" errors to the console. The system still responds to pings during these lockups, and the "swap_pager" messages do not make it to /var/log/messages.

The server has always been under heavy load, but the load hasn't been any higher lately. The only change is a move to another rack. I'm only using about 800K of swap, so it's not as if I'm swapping heavily.

http://www.freebsd.org/doc/faq/troubleshoot.html#indefinite-wait-buffer tells me that it's hardware: disk, cabling, etc. The SMART logs of the two mirrored system disks showed "Timeout errors", so the system disks have been replaced, one at a time. The SATA and power cables have also been replaced, but without any luck.

The problem occurs daily, but not at the same time of day. We have Nagios monitoring the host, but it never changes state to warning or critical. From the moment the problem occurs, the Nagios performance graphs go flat; see the attached picture. The problem is resolved with a hard reboot.

Since the console is frozen when the problem occurs, I have no way to view logs or other system information. I have thought of installing more RAM or moving the swap partition to another disk. Reinstalling the system is of course the last resort, but I don't want to go there before I know whether it is a software or a hardware problem.

Hopefully all the needed information is below; otherwise just ask.

System Hardware:
  • Supermicro X8DTL-iF motherboard
  • 2*Intel Xeon E5420 CPUs
  • 4*16GB RAM modules (64GB)
  • 2*160GB system disks in raid1 (geom raid)

Storage backend:
  • 1*LSI 9207-8e
  • 4*36 bays JBODs connect with SAS

swapinfo:
Code:
Device          1K-blocks     Used    Avail Capacity
/dev/raid/r0p3    8388608      708  8387900     0%

mount:
Code:
/dev/raid/r0p2 on / (ufs, local, journaled soft-updates)
devfs on /dev (devfs, local, multilabel)
/dev/raid/r0p4 on /var (ufs, local, journaled soft-updates)
/dev/raid/r0p5 on /tmp (ufs, local, journaled soft-updates)
/dev/raid/r0p6 on /usr (ufs, local, journaled soft-updates)
zdata on /zdata (zfs, local, nfsv4acls)
ztest on /ztest (zfs, local, nfsv4acls)
x.x.x.x:/fs2/JOBS/ndata on /ndata (nfs)

uname:
Code:
FreeBSD HOSTNAME 9.0-RELEASE-p3 FreeBSD 9.0-RELEASE-p3 #0: Tue Jun 12 02:52:29 UTC 2012     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

/etc/rc.conf
Code:
defaultrouter="x.x.x.x"
hostname="xxxx"
keymap="danish.iso.kbd"
ifconfig_em0="up"
ifconfig_em1="up"
ifconfig_em2="up"
ifconfig_em3="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport em0 laggport em1 laggport em2 laggport em3 x.x.x.x  netmask x.x.x.x"
sshd_enable="YES"
ntpd_enable="YES"
ntpd_flags="${ntpd_flags} -g"
powerd_enable="YES"
# Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
dumpdev="NO"
nrpe2_enable="YES"
samba_enable="YES"
zfs_enable="YES"
ftpd_enable="YES"

/boot/loader.conf
Code:
geom_raid_load="YES"
kern.maxfiles="16384"
mpslsi_load="YES"
hw.intr_storm_threshold=20000
net.inet.tcp.sendspace=1048576
net.inet.tcp.recvspace=1048576
kern.ipc.maxsockbuf=4194304
kern.ipc.nmbjumbop=262144

http://picpaste.com/nagios_swap_pager_error-6WQnx9QU.png
 
Have a look through this thread, install sysutils/smartmontools, then read smartctl(8). smartctl needs to be re-set for diagnostics before debugging, so be mindful of that. After running smartctl, if you get no errors in <Hardware_ECC_Recovered>, then the problem is probably not hardware (I think). If you do get ECC errors, then it's probably hardware.
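
For anyone following along, here is a minimal sketch of pulling just that counter out of the smartctl attribute table. It assumes the usual 10-column layout printed by "smartctl -A", where the raw value is the last column; the helper name ecc_raw and the device /dev/ada0 are placeholders, not anything from this thread. Keep in mind that some drive models report large or constantly changing raw values for this attribute even when healthy, so treat the number as a hint rather than a verdict.

```sh
# Print the RAW_VALUE column of the Hardware_ECC_Recovered attribute
# from `smartctl -A` output supplied on standard input.
# Assumes the usual 10-column attribute table, where RAW_VALUE is field 10.
ecc_raw() {
    awk '$2 == "Hardware_ECC_Recovered" { print $10 }'
}

# Usage (the device name /dev/ada0 is an assumption; substitute your own):
#   smartctl -A /dev/ada0 | ecc_raw
```

If the attribute is not reported by the drive at all, the function prints nothing.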

In the above referenced thread, I solved the problem when I realized that the power cable connection for the sata drive was very shitty and intermittently lost connection. This, after changing the cable several times. You can never under-estimate the amount of shit product out there. After changing to the well-connecting cable, I have not had any other timeout problems (I was too embarrassed to post the answer in the thread referred to above).
 
From "smartctl -a /dev/xx" I can see that the <Hardware ECC Recovered> error count on the disks is still increasing, so I will try swapping the SATA and power cables again.
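
A rough way to see when the counter moves, rather than only that it moves, is to keep a small log of readings and print the steps where the value went up. The sketch below assumes a hypothetical log format of one "label raw_value" pair per line (for example a timestamp followed by the raw value copied from smartctl); the function name ecc_increases is made up for illustration.

```sh
# Read "label raw_value" lines on standard input and print, for every
# line whose value is higher than the previous line's, the label and
# the size of the increase. The log format is an assumption.
ecc_increases() {
    awk 'NR > 1 && $2 > prev { print $1, $2 - prev } { prev = $2 }'
}

# Usage (ecc.log is a hypothetical file of collected readings):
#   ecc_increases < ecc.log
```

Lining the reported labels up against the times of the lockups would show whether the two actually coincide.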

Maybe the enclosure was twisted during the move, causing trouble on the motherboard. So if swapping the cables doesn't do the trick, the problem could be in the motherboard.

To be continued...
 
Not every connector on the sata power and data cable connects well with the connection point on the device! That's where you need to look first. Stop wasting your time and buy a pair of good quality power and data cables. Use those to test the timeout problem. If it gets corrected, problem solved - if not, you wasted $20 instead of 5 days.

To be clear, a cable is a cable but all connectors are not created equal
 
After swapping both the SATA and power cables twice (and yes, I went for the quality cables), no luck. Same problem.

I then decided to swap the motherboard for one of the same model. Now the swap_pager error is no longer posted on the console, but after a short period of time (10 minutes to 2 hours) the system starts to lag and becomes unresponsive.

If I am in a screen session, I can still jump between the screens. From top I can see that everything is "frozen" and not doing anything, except that CPU interrupts are high. So I try to run "systat -vmstat" to find out what is going on, but the process just hangs, waiting on "vnread".

Still nothing is being posted to /var/log/messages.

Code:
load: 0.37  cmd: csh 2241 [vnread] 1.39r 0.00u 0.00s 0% 16k
load: 0.37  cmd: csh 2241 [vnread] 2.28r 0.00u 0.00s 0% 16k
load: 0.37  cmd: csh 2241 [vnread] 2.58r 0.00u 0.00s 0% 16k

Following that, I swapped all the hardware in the server for the same models and firmware. Sadly, I get the same problem.

So now I have decided to reinstall with FreeBSD 9.1 and see how that works out. I will post an update when I have switched to the new system.
 
Test whether the power supply's voltage and amperage output is consistent or whether it fluctuates. Make sure there are no other power-hungry devices on the same line feeding the server.
 
Long time no see, sorry about that. Here is how the problem got resolved: the server was swapped for an identical model and everything worked just fine. Each component has been tested with other components and was found to be working. So, for sure, a hardware problem! Lesson learned: have no faith in cabling!

Thanks for the help :)
 