Rel.13 ZFS woes

After upgrading the server from 12.3 to stable/13, the very first time periodic daily started at 03:01 there was already a full system freeze: no command-line reaction (except in the guests), no login possible, and all 800+ processes blocked in "D" state. Pushbutton service was needed; all guests and jails were killed:

Code:
38378  -  DJ       0:03.36 find -sx / /ext /var /usr/local /usr/ports /usr/obj 
39414  -  DJ       0:00.00 sendmail: running queue: /var/spool/mqueue (sendmail
39415  -  DJ       0:00.00 sendmail: running queue: /var/spool/clientmqueue (se
39416  -  DJ       0:00.00 /usr/local/www/cgit/cgit.cgi
39417  -  D<       0:00.00 /usr/local/bin/ruby /ext/libexec/heatctl.rb (ruby27)
39418  -  DJ       0:00.00 sendmail: running queue: /var/spool/clientmqueue (se
39419  -  DJ       0:00.00 sendmail: running queue: /var/spool/mqueue (sendmail
39420  -  DJ       0:00.00 sendmail: running queue: /var/spool/clientmqueue (se
39421  -  DJ       0:00.00 sendmail: accepting connections (sendmail)
39426  -  D        0:00.00 sendmail: running queue: /var/spool/mqueue (sendmail
39427  -  D        0:00.00 sendmail: running queue: /var/spool/clientmqueue (se
39428  -  DJ       0:00.00 sendmail: Queue runner@00:03:00 for /var/spool/clien
39429  -  DJ       0:00.00 sendmail: accepting connections (sendmail)
39430  -  DJ       0:00.00 sendmail: running queue: /var/spool/clientmqueue (se
39465  -  Ds       0:00.01 newsyslog
39466  -  Ds       0:00.01 /bin/sh /usr/libexec/save-entropy
59365  -  DsJ      0:00.09 /usr/sbin/cron -s

Apparent reason: ZFS.

Code:
last pid: 39657;  load averages:  0.27,  1.24,  4.55    up 0+04:05:42  04:11:54
805 processes: 1 running, 804 sleeping
CPU:  0.1% user,  0.0% nice,  0.9% system,  0.0% interrupt, 99.0% idle
Mem: 16G Active, 5118M Inact, 1985M Laundry, 7144M Wired, 462M Buf, 905M Free
ARC: 1417M Total, 326M MFU, 347M MRU, 8216K Anon, 30M Header, 706M Other
     119M Compressed, 546M Uncompressed, 4.57:1 Ratio
Swap: 36G Total, 995M Used, 35G Free, 2% Inuse, 76K In

In 12.3 this showed about 6G ARC, 11G wired and 5G swap. Now the ARC varies between 700 and 1500M, and "compressed" stays around 100M - except when no work is done, then it grows. That may be nice for a desktop that is mostly idle and then reacts quickly - but for a server that normally runs some workload, caching will always stay at the bare minimum, so in effect the ARC just does not work:

Code:
last pid: 38718;  load averages:  2.12,  2.93,  2.88    up 0+01:09:08  05:30:25
625 processes: 1 running, 624 sleeping
CPU:  0.0% user,  0.1% nice,  6.3% system,  0.0% interrupt, 93.6% idle
Mem: 12G Active, 1433M Inact, 9987M Wired, 50M Buf, 8237M Free
ARC: 749M Total, 116M MFU, 254M MRU, 2457K Anon, 42M Header, 334M Other
     84M Compressed, 396M Uncompressed, 4.70:1 Ratio
Swap: 36G Total, 36G Free

At the time of the stall, there was a fat compile running on 16 cores, plus the finds from periodic daily walking over vast trees (for the first time after boot, so not yet in l2arc), requiring lots of inode caching. And with this new philosophy of always shrinking, the ARC apparently shrank a bit too much, and deadlocked.

So, after 15 years of tuning to not overgrow small memory, one can now start tuning to not undershrink with big memory. :/
 
If you have many jails, look at anticongestion, defined in /etc/defaults/periodic.conf.
You can probably invoke it from /etc/periodic.conf.local in the jails.
Avoid find storms.
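A minimal sketch, assuming a custom periodic script inside the jail; the script name, its location and the final task are made up, while source_periodic_confs and anticongestion() come from /etc/defaults/periodic.conf:
Code:
#!/bin/sh
# Hypothetical /usr/local/etc/periodic/daily/099.stagger inside a jail:
# delay this jail's daily run by a random amount so that many jails do not
# hammer the disks at the same moment.

# Standard periodic(8) boilerplate: load the defaults so anticongestion()
# and anticongestion_sleeptime are defined, then the local overrides.
if [ -r /etc/defaults/periodic.conf ]; then
    . /etc/defaults/periodic.conf
    source_periodic_confs
fi

# Sleep a random time of up to $anticongestion_sleeptime seconds;
# skipped when periodic is run interactively.
anticongestion

echo "proceeding with this jail's daily maintenance"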
 
Would specifying the sysctls
vfs.zfs.arc_min &
vfs.zfs.arc_max
help?
I have in loader.conf
Code:
vfs.zfs.arc_max="10240M"
vfs.zfs.arc_min="1024M"
and it shows in sysctl as
Code:
vfs.zfs.arc.max: 10737418240
vfs.zfs.arc.min: 0

And, unrelated to this (but apparently of similar coding quality), I am seeing
mountd[7727]: WARNING: No mask specified for 192.168.99.1, using out-of-date default
from /etc/zfs/exports:
/var/sysup/mnt/tmp.3.53809 -maproot=root:wheel -network=192.168.99.1 -mask 255.255.255.240
Changing it to
/var/sysup/mnt/tmp.3.53809 -maproot=root:wheel -network=192.168.99.1/27
makes the warning go away. (Read the exports manpage for entertainment. I seem to remember the second form did not work in 12.3, but the first one did.)

Answering your question: no, it does not help. Specifying arc_min = arc_max might give a statically sized cache, but that is not the point of the ARC.
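For illustration only, pinning would look roughly like this in /boot/loader.conf (the 8G figure is made up, and which spelling of the tunables is honoured depends on the release, see further down):
Code:
# Example only: pin the ARC to a fixed 8 GiB by setting min = max.
vfs.zfs.arc_max="8192M"
vfs.zfs.arc_min="8192M"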
 
if you have many jails look at anticongestion defined in /etc/defaults/periodic.conf
you can probably invoke it in /etc/periodic.conf.local in jails
avoid find storms
You could also just not sail in bad weather to avoid leaks on the ship. America would not have been discovered.
 
Do you use any restrictions/settings for vfs.zfs.arc.max & min inside your jails?

On anything < 13, and in relation to the arc max & min settings, you only had the following two tunables:
  • vfs.zfs.arc_max
  • vfs.zfs.arc_min
At 13 (probably due to the change to OpenZFS) that changed into:
  • vfs.zfs.arc.max
  • vfs.zfs.arc.min
That means that the old spellings of the min/max tunables, with the underscore, no longer work as intended. There is an effort underway to re-introduce the tunables vfs.zfs.arc_max and vfs.zfs.arc_min as legacy versions; that may already have happened for 13-STABLE (I don't know). I'm not sure what works with your (very recent) 13-STABLE. Requesting the sysctl values of all four possibilities should make clear what is supported on your 13-STABLE (a quick check is sketched at the end of this post). Looking at your output:

I have in loader.conf
Code:
vfs.zfs.arc_max="10240M"
vfs.zfs.arc_min="1024M"
and it shows in sysctl as
Code:
vfs.zfs.arc.max: 10737418240
vfs.zfs.arc.min: 0

The output of vfs.zfs.arc.min: 0 suggests that the default value is being used; that seems not to be in line with the setting in /boot/loader.conf: vfs.zfs.arc_min="1024M".
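As a quick check, querying all four spellings in one go shows what the running kernel knows about; any OID that does not exist is reported as unknown:
Code:
# New (dotted) OpenZFS names:
sysctl vfs.zfs.arc.max vfs.zfs.arc.min
# Legacy (underscore) names; if absent, sysctl reports "unknown oid":
sysctl vfs.zfs.arc_max vfs.zfs.arc_min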
 
Do you use any restrictions/settings for vfs.zfs.arc.max & min inside your jails?
No. Jails have WITHOUT_ZFS= in /etc/src.conf

On anything < 13, and in relation to the arc max & min settings, you only had the following two tunables:
  • vfs.zfs.arc_max
  • vfs.zfs.arc_min
At 13 (probably due to the change to OpenZFS) that changed into:
  • vfs.zfs.arc.max
  • vfs.zfs.arc.min
Yes, I noticed that. Most tunables have a slightly changed format.

That means that the old spellings of the min/max tunables, with the underscore, no longer work as intended. There is an effort underway to re-introduce the tunables vfs.zfs.arc_max and vfs.zfs.arc_min as legacy versions; that may already have happened for 13-STABLE (I don't know). I'm not sure what works with your (very recent) 13-STABLE.
That's fine with me, I can change variable names as req'd.
These min and max are only advisory, anyway, and I can see what the ARC does: it is now very reluctant to grow under workload, and that's the real problem.

It might be that the tuning was done by people who had mostly desktops in mind. I have another such case, where some stuff was put into 13 to better support DHCP, but that cannot work with a firewall - and one is now greatly surprised that in two years nobody noticed. So probably I am either the only one running a server or the only one using firewalls. ;)

The output of vfs.zfs.arc.min: 0 suggests that the default value is being used; that seems not to be in line with the setting in /boot/loader.conf: vfs.zfs.arc_min="1024M".
What is certain is that it is inconsistent between arc_min and arc_max; strangely, zfs-stats reports something close to the correct values:
Code:
        Min Size (Hard Limit):          9.96%   1020.02 MiB
        Max Size (High Water):          10:1    10.00   GiB
But this is just another package of perl code on top of it all.

The bottom line is: in the beginning, around 2007, when I tried to get ZFS running (on 384M of installed memory), I went into arc.c and fixed what was necessary. I did that a lot until about Rel.10; then it started to work properly with sysctl tuning only (and I got more memory).
Now with 32G I might expect it to run properly on its own and only need fine-tuning. But it seems I have to go into arc.c again (I am reluctant).
 
I have a question: was the system wedged (no forward progress, de-facto deadlock), or was it just slowed down unacceptably?

D state means that the process is waiting for disk IO, right? I know the technical definition is "uninterruptible IO wait" or something like that, but in practice that usually means disk. In that case, with 800 processes stuck in D state, you should have seen slow progress: a typical disk can do roughly 100 IOs per second (worst case, random IOs), so each process should have made a little bit of progress every 8 seconds. If you ran "top", you should have seen some CPU usage, as roughly 100 times per second some random process gets an IO completed and can make a little forward progress. With sufficient patience, the system would eventually have finished its tasks and dug itself out.
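As an aside, a hypothetical one-liner to watch whether such D-state processes ever make progress; if the count never drops across repeated runs, nothing is moving:
Code:
# Count processes in an uninterruptible (D) sleep state.
ps -axo state | grep -c '^D'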

On the other hand, if the large number of outstanding IOs and waiting processes triggers some bug, then the system might have been completely hung. That would be VERY VERY bad: user processes must never be able to cause the OS to break, and a hang is the worst form of breakage (because the system doesn't even reboot and requires manual intervention to come back). I'm not saying that unacceptably bad performance is OK (it is not), only that there is a difference in severity between a massive performance regression and a hung system.

From reading the above messages, it seems to me that the root cause could be either overly aggressive ARC shrinkage, or a bug where the minimum ARC size gets stuck at 0, allowing the cache shrinkage to go insane. From a root cause analysis point of view, there is a huge difference between a bug (introduced when making OpenZFS more old-style-FreeBSD-ZFS compatible) and a deliberate choice to optimize the ZFS ARC management for the desktop use case.
 
I have a question: was the system wedged (no forward progress, de-facto deadlock), or was it just slowed down unacceptably?
It was dead. No password prompt on the console, no reaction whatsoever on most terminals, for anything accessing files or a current working directory. The LAN did still work.

D state means that the process is waiting for disk IO, right? I know the technical definition is "uninterruptible IO wait" or something like that, but in practice that usually means disk. In that case, with 800 processes stuck in D state, you should have seen slow progress:
There was no progress. However, the disks were fully accessible: on a still-functional terminal I did dd if=/dev/adaN of=/dev/null, and at least all the SSDs responded normally; I didn't check the others.
I noticed the problem when, in lldb, a print command suddenly did not return. Then I noticed that while compiling on 16 cores, the CPU was 99% idle. This is clearly a deadlock. It seems to have progressed gradually for about an hour, while I was debugging inside a bhyve (which has its own ZFS and isn't concerned until it needs disk access from the host).
I hadn't noticed it earlier because the fans were all still running: heatctl was also blocked and couldn't switch them off. (heatctl runs at rtprio - that didn't help.)

The deadlock was most likely in ZFS. You can try it yourself: create a ZFS filesystem, mount it, unpack some archive with many files, then unmount it. The unmount will take a while because everything has to be flushed first, and commands like df will not return during that time. That in turn may keep other things locked, and the stall spreads - until finally the initiating command gets through. (I didn't bother to search for where it had started - a reset was necessary anyway.)
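Roughly, those steps would look like this (the pool, dataset and archive names are placeholders, not from my actual setup):
Code:
zfs create tank/stresstest                   # create and auto-mount a fresh dataset
tar -xf /path/to/big-archive.tar -C /tank/stresstest    # unpack many small files
zfs unmount tank/stresstest &                # the unmount flushes all dirty data first
df -h                                        # may block until the unmount completes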

On the other hand, if the large number of outstanding IOs and waiting processes triggers some bug, then the system might have been completely hung. That would be VERY VERY bad: user processes must never be able to cause the OS to break, and a hang is the worst form of breakage (because the system doesn't even reboot and requires manual intervention to come back).
Yes, that's called pushbutton-service-requirement, and it happens.
From reading the above messages, it seems to me that the root cause could be either overly aggressive ARC shrinkage, or a bug where the minimum ARC size gets stuck at 0, allowing the cache shrinkage to go insane.
Or whatever deadlock can happen there.

From a root cause analysis point of view, there is a huge difference between a bug (introduced when making OpenZFS more old-style-FreeBSD-ZFS compatible) and a deliberate choice to optimize the ZFS ARC management for the desktop use case.
Exactly my point. Bad tuning is one thing, and mendable. But hitting the Amtal rule is a different thing, and that worries me because it is spring and our garden is 300 miles away...
 
I have now observed the phenomenon more closely: the ARC may still reach 10G, even while the first 16G guest is fully working.
But then, during the start of a second guest, it shrinks to 500M within a second, and then takes about an hour to grow again. Starting the guest involves creating a few filesystems and volumes, and that triggers the behaviour.

The problem seems to be arc.meta_limit. In 12.3 this was set to 1/4 of arc_max by default. I had reduced it to half of that, because otherwise on rarely used pools only metadata would be moved to l2arc, and then the disks would start spinning with every access (like postgres, which wakes up every 5 minutes only to read its pidfile, for whatever kind of "safety").
Now this should no longer be necessary, because the l2arc is persistent.

And now the default meta_limit is 75%, and there seems to be a reason for that. Setting it to 25% or lower gives exactly the described effects. It doesn't really make sense, because 1.25G is still more than 500M, but that is what happens. And there is probably no need anymore to change the default.
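To see what is in effect, something like the following should do; the OID names are my assumption following the new vfs.zfs.arc.* scheme, so verify them with sysctl -a | grep meta_limit:
Code:
sysctl vfs.zfs.arc.meta_limit           # absolute limit in bytes (0 = derive from percent)
sysctl vfs.zfs.arc.meta_limit_percent   # percentage of arc_max, default 75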

Code:
last pid:  2934;  load averages: 15.81, 15.13, 14.65    up 0+03:29:55  22:54:02
532 processes: 2 running, 530 sleeping
CPU:  0.4% user,  0.0% nice, 88.2% system,  0.2% interrupt, 11.2% idle
Mem: 12G Active, 1245M Inact, 1938M Laundry, 12G Wired, 602M Buf, 1415M Free
ARC: 7482M Total, 1973M MFU, 4679M MRU, 29M Anon, 258M Header, 536M Other
     6016M Compressed, 8467M Uncompressed, 1.41:1 Ratio
Swap: 36G Total, 3964M Used, 32G Free, 10% Inuse, 316K In, 67M Out
 
… hidden behind man zfs there is also man 4 zfs.

Probably included with 13.1-RC1; not with 13.0-RELEASE.


 