HELP NEEDED: kernel: swap_pager: indefinite wait buffer

I'm running FreeBSD 11.2-RELEASE-p4 as a storage server (iSCSI) for 2 VMware hosts.

Specs:
- Supermicro X11SSH-LN4F
- Xeon E3-1220 v6
- 64GB DDR4 ECC
- 8 x 3TB HDDs + 240GB SSD
- 2 x IBM M1015 HBAs
- Root on ZFS

Code:
[root@stor01 ~]# zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 5h39m with 0 errors on Tue Oct  2 16:48:14 2018
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da2p3   ONLINE       0     0     0
            da7p3   ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            da0p3   ONLINE       0     0     0
            da5p3   ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            da1p3   ONLINE       0     0     0
            da6p3   ONLINE       0     0     0
          mirror-3  ONLINE       0     0     0
            da3p3   ONLINE       0     0     0
            da4p3   ONLINE       0     0     0
        cache
          da8       ONLINE       0     0     0

errors: No known data errors
[root@stor01 ~]# gmirror status
       Name    Status  Components
mirror/swap  COMPLETE  da0p2 (ACTIVE)
                       da1p2 (ACTIVE)
                       da2p2 (ACTIVE)
                       da3p2 (ACTIVE)
                       da4p2 (ACTIVE)
                       da5p2 (ACTIVE)
                       da6p2 (ACTIVE)
                       da7p2 (ACTIVE)

This morning the host became unresponsive and had to be hard rebooted.

This is what I found in /var/log/messages:
Code:
Oct  7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 38628, size: 32768
Oct  7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 282715, size: 4096
Oct  7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 37724, size: 24576
Oct  7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 37423, size: 4096
Oct  7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 38536, size: 12288
Oct  7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 12136, size: 4096
Oct  7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 136691, size: 4096
Oct  7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 101, size: 20480
Oct  7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 38544, size: 8192
Oct  7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 38578, size: 4096
Oct  7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 40832, size: 4096
Oct  7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 17781, size: 4096
Oct  7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 334391, size: 4096
Oct  7 07:07:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 16786, size: 4096
Oct  7 07:07:29 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 17763, size: 12288
Oct  7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 136691, size: 4096
Oct  7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 38837, size: 32768
Oct  7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 16786, size: 4096
Oct  7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 36574, size: 4096
Oct  7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 37724, size: 24576
Oct  7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 37423, size: 4096
Oct  7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 37462, size: 24576
Oct  7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 40832, size: 4096
Oct  7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 38827, size: 40960
Oct  7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 334391, size: 4096
Oct  7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 37949, size: 8192
Oct  7 07:07:48 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 37911, size: 4096
Oct  7 07:08:38 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct  7 07:08:43 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct  7 07:09:21 stor01 kernel: WARNING: 10.69.102.2 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct  7 07:09:30 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct  7 07:10:12 stor01 last message repeated 2 times
Oct  7 07:10:12 stor01 last message repeated 2 times
Oct  7 07:10:12 stor01 kernel: WARNING: 10.69.102.2 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct  7 07:10:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 51635, size: 4096
Oct  7 07:10:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 53877, size: 4096
Oct  7 07:10:12 stor01 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 53728, size: 8192
Oct  7 07:10:12 stor01 kernel: pid 917 (telegraf), uid 0, was killed: out of swap space
Oct  7 07:10:28 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct  7 07:10:49 stor01 last message repeated 2 times
Oct  7 07:11:00 stor01 kernel: WARNING: 10.69.102.2 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct  7 07:11:00 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct  7 07:11:00 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct  7 07:11:00 stor01 ctld[28691]: child process 41534 terminated with signal 13
Oct  7 07:11:01 stor01 ctld[28691]: child process 41535 terminated with signal 13
Oct  7 07:11:01 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): connection error; dropping connection
Oct  7 07:11:08 stor01 kernel: WARNING: 10.69.102.2 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct  7 07:11:55 stor01 kernel: WARNING: 10.69.102.2 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct  7 07:11:55 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection
Oct  7 07:12:12 stor01 kernel: WARNING: 10.69.101.11 (iqn.1998-01.com.vmware:esxi01-5c2577bd): no ping reply (NOP-Out) after 5 seconds; dropping connection

The logs say "kernel: pid 917 (telegraf), uid 0, was killed: out of swap space", but according to the monitoring graphs, swap was barely used at the time:

[Attachment: Screenshot 2018-10-07 at 22.07.26.png]

Any idea what could have caused this?
 
https://www.freebsd.org/doc/en/books/faq/troubleshoot.html#idp59131080

What does the error swap_pager: indefinite wait buffer: mean?

This means that a process is trying to page memory to disk, and the page attempt has hung trying to access the disk for more than 20 seconds. It might be caused by bad blocks on the disk drive, disk wiring, cables, or any other disk I/O-related hardware. If the drive itself is bad, disk errors will appear in /var/log/messages and in the output of dmesg. Otherwise, check the cables and connections.
_______________________________________________
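
Following the FAQ's pointer, this is roughly the sort of check it suggests; the grep patterns are just my guess at what to look for, nothing authoritative:
Code:
# look for the disk I/O errors the FAQ mentions
dmesg | egrep -i '(error|timeout|retr)'
egrep -i 'da[0-9].*(error|timeout)' /var/log/messages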
 
Same thing happened this morning:

[Attachment: Screenshot 2018-10-08 at 10.01.19.png]

Sadly, the remote console is not responding to keystrokes and SSH is not responding at all. The interesting thing is that ctld is running fine and iSCSI is still serving datastores to my ESXi hosts.

How is this even possible considering the swap space is spread across the same drives, connected with the same cables, HBAs etc.?

BTW, I've already swapped the mini SAS cables between the backplane and the HBA; I'm going to swap the HBA next and, if there's no joy, the PSU.
 
Sadly, the problem returned after 9 days of normal operation:

[Attachment: Screenshot 2018-10-18 at 11.30.38.png]

Is it a good practice to have swap mirrored across 8 drives?

Code:
[root@stor01 ~]# gmirror status
       Name    Status  Components
mirror/swap  COMPLETE  da0p2 (ACTIVE)
                       da1p2 (ACTIVE)
                       da2p2 (ACTIVE)
                       da3p2 (ACTIVE)
                       da4p2 (ACTIVE)
                       da5p2 (ACTIVE)
                       da6p2 (ACTIVE)
                       da7p2 (ACTIVE)

If not, what else would you recommend here?

I'm wondering if gmirror gets into a weird state or something; it's interesting that it works fine after a hard reboot. Surely that rules out any problems with physical connections etc., no?

I have already replaced the PSU and the mini SAS cables between the HBA controller and the SAS backplane.

The short SMART tests show no issues with the drives; I've now initiated long tests on all of them (roughly as shown below).
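
For reference, the long tests were started roughly like this (smartctl comes from sysutils/smartmontools; adjust device names to taste):
Code:
# kick off a long self-test on each drive
for d in da0 da1 da2 da3 da4 da5 da6 da7; do smartctl -t long /dev/$d; done
# once the tests complete, inspect the results per drive:
smartctl -l selftest /dev/da0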

Any advice on what else I can do to diagnose this further?

TIA
 
How much swap is that in total? You would probably do fine with a maximum of 16GB of swap in total; if your system ends up needing that much swap, you have other, more serious problems, and the amount of swap is not going to do a thing to help the situation.
 
It's a 2 GB partition on each drive in an 8-way mirror, totalling 2 GB of swap space, which, according to the attached graph, is not being utilised much (bottom line):

[Attachment: 1539892265476.png]

You can tell from the above graph when the server crashed as snmpd wasn't running and no stats were collected.

Interestingly, "Shared real memory", "Physical memory" and "Real memory" climbed up to 100% just before the crash.

Perhaps there is something wrong with my swap setup and the system cannot use it for some reason?

Code:
[root@stor01 ~]# swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/mirror/swap   2097148        0  2097148     0%
[root@stor01 ~]# cat /etc/fstab
# Device                Mountpoint      FStype  Options         Dump    Pass#
/dev/mirror/swap                none    swap    sw              0       0
[root@stor01 ~]# gmirror status
       Name    Status  Components
mirror/swap  COMPLETE  da0p2 (ACTIVE)
                       da1p2 (ACTIVE)
                       da2p2 (ACTIVE)
                       da3p2 (ACTIVE)
                       da4p2 (ACTIVE)
                       da5p2 (ACTIVE)
                       da6p2 (ACTIVE)
                       da7p2 (ACTIVE)
 
Long SMART tests have now completed showing no errors.

I have just disabled powerd, just in case.

Below is system memory information:

Code:
[root@stor01 ~]# freecolor -mt
Physical  : [...................................] 0%    (569/63643)
Swap      : [##################################.] 98%   (2024/2047)
Total     : [#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%] (65691=2594+63097)

[root@stor01 ~]# sysctl hw | egrep 'hw.(phys|user|real)'
hw.physmem: 68493099008
hw.usermem: 2371002368
hw.realmem: 68719476736

SYSTEM MEMORY INFORMATION:
mem_wire:       66122309632 (  63059MB) [ 99%] Wired: disabled for paging out
mem_active:  +      8253440 (      7MB) [  0%] Active: recently referenced
mem_inactive:+      3682304 (      3MB) [  0%] Inactive: recently not referenced
mem_cache:   +            0 (      0MB) [  0%] Cached: almost avail. for allocation
mem_free:    +    601030656 (    573MB) [  0%] Free: fully available for allocation
mem_gap_vm:  +      -217088 (      0MB) [  0%] Memory gap: UNKNOWN
-------------- ------------ ----------- ------
mem_all:     =  66735058944 (  63643MB) [100%] Total real memory managed
mem_gap_sys: +   1758040064 (   1676MB)        Memory gap: Kernel?!
-------------- ------------ -----------
mem_phys:    =  68493099008 (  65320MB)        Total real memory available
mem_gap_hw:  +    226377728 (    215MB)        Memory gap: Segment Mappings?!
-------------- ------------ -----------
mem_hw:      =  68719476736 (  65536MB)        Total real memory installed

SYSTEM MEMORY SUMMARY:
mem_used:       68114763776 (  64959MB) [ 99%] Logically used memory
mem_avail:   +    604712960 (    576MB) [  0%] Logically available memory
-------------- ------------ ----------- ------
mem_total:   =  68719476736 (  65536MB) [100%] Logically total memory

I'm guessing the majority of RAM has been "consumed" by the ZFS cache (ARC):
Code:
[root@stor01 ~]# sysctl -a | grep vfs.zfs.arc_
vfs.zfs.arc_meta_limit: 16415329280
vfs.zfs.arc_free_target: 113014
vfs.zfs.arc_grow_retry: 60
vfs.zfs.arc_shrink_shift: 7
vfs.zfs.arc_average_blocksize: 8192
vfs.zfs.arc_no_grow_shift: 5
vfs.zfs.arc_min: 8207664640
vfs.zfs.arc_max: 65661317120

Is it possible that the system is not releasing ARC memory quickly enough and starts killing other processes instead? (A quick check is shown below.)

Considering the system only acts as an iSCSI server (ctld), is it worth tweaking any memory/ZFS tunables?
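
A quick way to watch whether the ARC actually shrinks under pressure is the arcstats sysctls (standard FreeBSD counters; values are in bytes):
Code:
# current ARC size vs. its target and configured maximum
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c kstat.zfs.misc.arcstats.c_max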
 
ZFS is supposed to consume up to 75% of the available free memory and then release it, ahead of anything else, as soon as additional memory is needed by the system. But I have never seen this work in practice on any of my systems that run ZFS: I always hit the same problem, where swap fills up to 100% and the errors start.

So for me, setting those sysctls to limit the ARC size is a must. I usually limit it to 50% of memory rather than 75%, and that works better for me.
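
For example, this is the sort of thing I put in /boot/loader.conf (the 32G figure is just an illustration for a 64GB box, not a recommendation for your workload):
Code:
# /boot/loader.conf: cap the ARC at half of RAM; takes effect after a reboot
vfs.zfs.arc_max="32G"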
 
ZFS is supposed to consume up to 75% of the available free memory
Actually, by default it uses the total RAM minus 1GB, which, if you have 4GB, would indeed be 75%. But if you have 96GB, it would use 98% (95GB).
 
To me the gmirror swap arrangement seems awkward. You are using ZFS-on-root but GEOM for swap.
Have you thought about throwing a whole drive at swap?
For instance, throw in a small SSD as a test swap drive and eliminate partitions/gmirror as a culprit.
There was a mailing list post about swap interleaving that makes me think that multi-drive swap is bad.
Note this comment in the code (line 493 of swap_pager.c):
"Also be aware that swap ops are constrained by the swap device interleave stripe size."
 
Phishfry, this sort of mirrored swap arrangement is offered by the installer. I've had 4 drives in the gmirror swap running absolutely fine until last month, when I built this server, moved the drives over and added 4 new ones (I was previously using an HP MicroServer Gen8).

It did cross my mind to get rid of gmirror and stick to a single swap partition/drive. I'm assuming that if said swap drive died, the system could potentially crash, right?
 
Right, I got rid of the gmirror swap and instead added the same 8 partitions as separate swap devices (rough dismantling steps at the end of this post):

Code:
[root@stor01 ~]# swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/da0p2        2097152        0  2097152     0%
/dev/da1p2        2097152        0  2097152     0%
/dev/da2p2        2097152        0  2097152     0%
/dev/da3p2        2097152        0  2097152     0%
/dev/da4p2        2097152        0  2097152     0%
/dev/da5p2        2097152        0  2097152     0%
/dev/da6p2        2097152        0  2097152     0%
/dev/da7p2        2097152        0  2097152     0%
Total            16777216        0 16777216     0%
[root@stor01 ~]# cat /etc/fstab
# Device                Mountpoint      FStype  Options         Dump    Pass#
/dev/da0p2              none    swap    sw              0       0
/dev/da1p2              none    swap    sw              0       0
/dev/da2p2              none    swap    sw              0       0
/dev/da3p2              none    swap    sw              0       0
/dev/da4p2              none    swap    sw              0       0
/dev/da5p2              none    swap    sw              0       0
/dev/da6p2              none    swap    sw              0       0
/dev/da7p2              none    swap    sw              0       0
[root@stor01 ~]# freecolor -mt
Physical  : [###########........................] 34%   (21664/63643)
Swap      : [###################################] 100%  (16384/16384)
Total     : [################%%%%%%%%%%%%%%%%%%%] (80027=38048+41979)

Let's see how this goes...
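
For the record, the dismantling went roughly like this (from memory, so treat it as a sketch and check gmirror(8) first):
Code:
swapoff /dev/mirror/swap      # stop swapping to the mirror
gmirror stop swap             # tear the mirror down
gmirror clear da0p2 da1p2 da2p2 da3p2 da4p2 da5p2 da6p2 da7p2   # wipe gmirror metadata
# edit /etc/fstab as above, then:
swapon -a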
 
Sorry guys, yes of course. When I said 75%, that's because I have 4GB in my laptop, and 4GB minus 1GB is 75%.

I believe the actual algorithm is 1GB less than total RAM, or 50% of total RAM, whichever is higher (so with 64GB the default cap would be about 63GB, leaving only 1GB of headroom). But in my opinion leaving 1GB free isn't enough, and the system quite often ends up using swap space. When I've limited it so that 2GB stays free, I've never seen that problem.
 
I am very underqualified in storage principles; not sure I should have even commented.
Upon reading more, I see many different views on the subject.
https://forums.freebsd.org/threads/the-use-of-swap.56799/

The thing with swap is that you probably don't need it with 64GB of RAM, but if you do use it and a swap disk fails, it can crash the system.
At the very least, it will fail to boot next time: fstab will list a drive that is broken and FreeBSD will barf.
 
I don't want to drive you off a cliff, but if an application is gobbling up memory, all swap is going to do is delay the problem.
Have you considered your monitoring program? Above you show:
"kernel: pid 917 (telegraf), uid 0, was killed: out of swap space"
 
Personally I run with no swap.
But I feel like you are still interleaving the swap by spreading it over many drives (although not mirrored this time).
How about one swap partition on one drive only? That would eliminate the interleaving as a problem.
But back to redundancy: there is none with that arrangement.
And back to my earlier point, you shouldn't need much swap anyway.
Supposedly you can have too much swap space too.
 
Supposedly you can have too much swap space too.
Indeed. I'm re-purposing an old Juniper box, and got this with 2G of swap:
Code:
warning: total configured swap (466033 pages) exceeds maximum recommended amount (111776 pages).
warning: increase kern.maxswzone or reduce amount of swap.
Upon further inspection I realized this has 256MB of memory, not the 2GB I misread it to be!

Personally I prefer to put swap onto a GMIRROR on top of two authenticating GELI providers.
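
The simplest variant of that idea is the .eli suffix in fstab, which has rc(8) create a one-time geli device for swap at boot; here is a sketch against the OP's existing mirror rather than my two-device setup:
Code:
# /etc/fstab: one-time encrypted swap via the .eli suffix (see geli(8))
/dev/mirror/swap.eli    none    swap    sw    0    0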
 
Personally I prefer to put swap onto a GMIRROR on top of two authenticating GELI providers.
To me, spreading it out over 8 drives like the OP does seems like a problem waiting to happen.

Just for my info, how many drives are you spreading your gmirror swap over?

I think ShelLuser covered good points in the post from 2011 that I linked to.
If I were using only ZFS, I would consider adding a zvol for swap.
To me that seems logical; a rough sketch follows.
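
Something like this, I believe (untested; the property choices follow commonly given swap-on-zvol advice, and note that swapping to a zvol has known deadlock risks under heavy memory pressure):
Code:
# a 2G zvol for swap on the existing pool
zfs create -V 2G -o compression=off -o primarycache=metadata tank/swap
swapon /dev/zvol/tank/swap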

I am still back to my first thought: are these filesystem/swap issues, or is the real problem a program running amok?
My true thought is the latter.

Maybe telegraf is the issue. I don't hear many people speak about it. A monitoring tool is the last thing you might suspect.

I just built a rig with a UFS geom mirror/stripe over 4 NVMe drives on two SuperMicro M.2 paddle cards, plus a 24-disk ZFS array in a RAID60 arrangement, in a Chenbro RM23524 on an SM X10DRi / 2608LV3 / 3x LSI 3008 HBAs, with 64GB RAM, booting off a SATADOM.
 
peterpakos
I was reviewing your graphs and this sticks out to me:
physical memory is at 33% used. Not bad, but the amount seems concerning: 22GB of RAM consumed.
How much of that is allocated to your two VMs?
The reason I ask is that 22GB seems like a huge number compared to what I am seeing in use.

Perhaps you need to investigate memory usage with top and see what processes are doing with ps -ax.
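
Something along these lines, using nothing beyond the base tools:
Code:
top -o res                                          # interactive view sorted by resident size
ps -axo pid,rss,vsz,command | sort -k2 -rn | head   # biggest consumers by RSS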
 
I have to ask this question too: why two M1015 controllers? With an 8-drive arrangement, one card would do.
Then put your SSD on the motherboard SATA3.
If you were running an array of SSDs I would understand, and a dual-path backplane too.
The M1015 is only PCIe 2.0 with an x8 interface; 8 SSDs will saturate it. Been there, done that.
With today's rotating disks you cannot saturate that interface.

To me it seems sacrilegious to put a PCIe 2.0 card in an SM X11 board. The SAS3008 cards are not much more.
 
I don't want to drive you off a cliff, but if an application is gobbling up memory, all swap is going to do is delay the problem.
Have you considered your monitoring program? Above you show:
"kernel: pid 917 (telegraf), uid 0, was killed: out of swap space"
The same thing also happened after disabling telegraf, so it's not that.
 
Personally I run with no swap.
But I feel like you are still interleaving the swap by spreading it over many drives (although not mirrored this time).
How about one swap partition on one drive only? That would eliminate the interleaving as a problem.
But back to redundancy: there is none with that arrangement.
And back to my earlier point, you shouldn't need much swap anyway.
Supposedly you can have too much swap space too.

Touch wood, the system has been very stable since I got rid of the gmirror swap. Time will tell if it stays this way...
 