I'm running FreeBSD 11.2-RELEASE-p4 as an iSCSI storage server for two VMware hosts.
Specs:
- Supermicro X11SSH-LN4F
- Xeon E3-1220 v6
- 64GB DDR4 ECC
- 8 x 3TB HDDs + 240GB SSD
- 2 x IBM M1015 HBAs
- Root on ZFS
Code:
[root@stor01 ~]# zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 5h39m with 0 errors on Tue Oct 2 16:48:14 2018
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da2p3   ONLINE       0     0     0
            da7p3   ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            da0p3   ONLINE       0     0     0
            da5p3   ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            da1p3   ONLINE       0     0     0
            da6p3   ONLINE       0     0     0
          mirror-3  ONLINE       0     0     0
            da3p3   ONLINE       0     0     0
            da4p3   ONLINE       0     0     0
        cache
          da8       ONLINE       0     0     0

errors: No known data errors
[root@stor01 ~]# gmirror status
       Name    Status  Components
mirror/swap  COMPLETE  da0p2 (ACTIVE)
                       da1p2 (ACTIVE)
                       da2p2 (ACTIVE)
                       da3p2 (ACTIVE)
                       da4p2 (ACTIVE)
                       da5p2 (ACTIVE)
                       da6p2 (ACTIVE)
                       da7p2 (ACTIVE)
This morning the host became unresponsive and had to be hard rebooted.
This is what I found in /var/log/messages:
Code:
Oct 7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 38628, size: 32768
Oct 7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 282715, size: 4096
Oct 7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 37724, size: 24576
Oct 7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 37423, size: 4096
Oct 7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 38536, size: 12288
Oct 7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 12136, size: 4096
Oct 7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 136691, size: 4096
Oct 7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 101, size: 20480
Oct 7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 38544, size: 8192
Oct 7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 38578, size: 4096
Oct 7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 40832, size: 4096
Oct 7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 17781, size: 4096
Oct 7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 334391, size: 4096
Oct 7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 16786, size: 4096
Oct 7 07:07:29 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 17763, size: 12288
Oct 7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 136691, size: 4096
Oct 7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 38837, size: 32768
Oct 7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 16786, size: 4096
Oct 7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 36574, size: 4096
Oct 7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 37724, size: 24576
Oct 7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 37423, size: 4096
Oct 7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 37462, size: 24576
Oct 7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 40832, size: 4096
Oct 7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 38827, size: 40960
Oct 7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 334391, size: 4096
Oct 7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 37949, size: 8192
Oct 7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 37911, size: 4096
Oct 7 07:08:38 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct 7 07:08:43 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct 7 07:09:21 stor01 kernel: WARNING: 10.69.102.2 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct 7 07:09:30 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct 7 07:10:12 stor01 last message repeated 2 times
Oct 7 07:10:12 stor01 last message repeated 2 times
Oct 7 07:10:12 stor01 kernel: WARNING: 10.69.102.2 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct 7 07:10:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 51635, size: 4096
Oct 7 07:10:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 53877, size: 4096
Oct 7 07:10:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 53728, size: 8192
Oct 7 07:10:12 stor01 kernel: pid 917 (telegraf), uid 0, was killed: out of swap space
Oct 7 07:10:28 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct 7 07:10:49 stor01 last message repeated 2 times
Oct 7 07:11:00 stor01 kernel: WARNING: 10.69.102.2 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct 7 07:11:00 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct 7 07:11:00 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct 7 07:11:00 stor01 ctld[28691]: child process 41534 terminated with signal 13
Oct 7 07:11:01 stor01 ctld[28691]: child process 41535 terminated with signal 13
Oct 7 07:11:01 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): connection error; dropping connection
Oct 7 07:11:08 stor01 kernel: WARNING: 10.69.102.2 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct 7 07:11:55 stor01 kernel: WARNING: 10.69.102.2 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct 7 07:11:55 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct 7 07:12:12 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
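To make the timeline in the log above easier to follow, here is a minimal sketch that tallies the three kinds of kernel messages per minute. The function name `summarize_log` and the category labels are made up for illustration, and it assumes the usual syslog layout where the third field is the time. Lines like "last message repeated 2 times" are not counted, so the numbers are lower bounds.

```sh
#!/bin/sh
# Hypothetical helper (not a FreeBSD tool): count selected kernel
# messages per minute in a syslog-style file such as /var/log/messages.
# Assumes field 3 is the HH:MM:SS timestamp.
summarize_log() {
    awk '
        # Key each counter by "HH:MM <label>" so the output sorts by time.
        /swap_pager: indefinite wait buffer/ { n[substr($3, 1, 5) " swap_pager_wait"]++ }
        /no ping reply/                      { n[substr($3, 1, 5) " iscsi_nop_timeout"]++ }
        /was killed: out of swap space/      { n[substr($3, 1, 5) " oom_kill"]++ }
        END { for (k in n) print k, n[k] }
    ' "$@" | sort
}
```

It would be called as `summarize_log /var/log/messages`. Each output line has the form `HH:MM label count`, which shows at a glance that the swap_pager waits start a few minutes before the iSCSI timeouts and the telegraf kill.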
The logs say "kernel: pid 917 (telegraf), uid 0, was killed: out of swap space", but according to my monitoring graphs, swap space was barely used at the time.
Any idea what could have caused this?