ZFS zpool crashing: history full

From periodic/daily:

Code:
Scrubbing of zfs pools:
Assertion failed: (nvlist_add_nvlist_array(*nvhisp, "history record", records, numrecords) == 0), file /usr/src/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c, line 3957.
Abort trap
Assertion failed: (nvlist_add_nvlist_array(*nvhisp, "history record", records, numrecords) == 0), file /usr/src/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c, line 3957.
Abort trap
   skipping scrubbing of pool 'backup':
      can't get last scrubbing date

From syslog at the same time:

Code:
kernel: pid 44097 (zpool), uid 0 inumber 47400 on /: filesystem full
kernel: Failed to write core file for process zpool (error 28)
kernel: pid 44097 (zpool), uid 0: exited on signal 6
kernel: pid 44477 (zpool), uid 0 inumber 47400 on /: filesystem full
kernel: Failed to write core file for process zpool (error 28)

And from the commandline:
Code:
# zpool history backup
History for 'backup':
Assertion failed: (nvlist_add_nvlist_array(*nvhisp, "history record", records, numrecords) == 0), file /usr/src/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c, line 3957.
Abort (core dumped)

Very well. There is no option to recycle or remove that history. :(
 
The system is telling you it's full:

kernel: pid 44097 (zpool), uid 0 inumber 47400 on /: filesystem full

Maybe cleanup some snapshots and files?
 
The system is telling you it's full:

kernel: pid 44097 (zpool), uid 0 inumber 47400 on /: filesystem full

Maybe cleanup some snapshots and files?

Haha, not really, not that easy! :) See here:

Code:
# mount
/dev/ada3s1a on / (ufs, local, soft-updates)
devfs on /dev (devfs, local)
/dev/ada3s1d on /usr (ufs, local, soft-updates)
/dev/ada3s1e on /var (ufs, local, soft-updates)
procfs on /proc (procfs, local)
im/ext on /ext (zfs, NFS exported, local, nfsv4acls)
[...]

The base OS is (deliberately) not on ZFS. (And the / filesystem is deliberately kept separate, so that exactly this kind of mishap cannot do real harm.)
(BTW, the first line seems to be missing from my syslog output; it should also read "signal 6", but I couldn't find it, so the excerpt above is a little misleading.)

What actually happens is:
  1. daily zfs-scrub is run from /etc/periodic/daily/800.scrub-zfs
  2. from there, zpool history is called.
  3. if that fails, zpool history is called a second time. (That's why we see "filesystem full" twice and "error 28" twice; there should also have been "signal 6" twice beforehand, but that seems to have been lost somewhere.)
  4. in each invocation, zpool runs into an "assertion failed" error (so this also appears twice.)
  5. due to this, zpool writes core.
  6. the core goes into / (because periodic/daily runs from there), and so / fills up (it is not sized to catch core files, and we don't want core files anyway).
That's the first part of the analysis: identify the first failure. (The script logic involved is sketched below.)
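For context, the scrub date check in /etc/periodic/daily/800.scrub-zfs extracts the time of the last scrub from the pool history, so the whole history must be readable for the check to work at all. Roughly, the logic looks like this (a paraphrased sketch, not the literal script; assume the pool name is in ${pool}):

Code:
# paraphrased sketch of the scrub-date check (not the literal script)
last_scrub=$(zpool history "${pool}" | \
    grep " zpool scrub ${pool}\$" | tail -1 | cut -d. -f1)
if [ -z "${last_scrub}" ]; then
    echo "   skipping scrubbing of pool '${pool}':"
    echo "      can't get last scrubbing date"
fi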

Now the next part: why does zpool hit a failed assertion? For that we can look at the place in the source that the message points to. There we see this:
  1. The full history is read into memory.
  2. After each and every record, there is a check for ENOMEM, i.e.: do we still have enough core memory to load all this data in?
  3. After all the history is in memory, the whole data is handed to nvlist_add_nvlist_array(9). Here there is no check for failure: this function is expected to return 0, otherwise zpool aborts and crashes.
  4. From the manpage of that function we see that it duplicates the received data.
  5. So, after we have read in gigabytes of history, and carefully checked after each record that memory will not get exhausted, we now duplicate all this data in memory without any check at all?!
  6. This appears to be a perfect example of what is not German engineers' work.
Well, I hacked a printf into that place to show what value comes back from nvlist_add_nvlist_array(), as it obviously is not 0. And, as was to be expected, it is 12, i.e. ENOMEM (see /usr/include/sys/errno.h).
Corollary: according to the documentation, nvlist_add_nvlist_array(9) returns void. Nevertheless, an integer return value is expected and received.
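For reference, testing such a change only requires rebuilding libzfs. A rough sketch, assuming the usual /usr/src layout of this FreeBSD version (paths may differ between releases, and a "make obj" may be needed first depending on the setup):

Code:
# rebuild and reinstall libzfs after editing libzfs_pool.c
cd /usr/src/cddl/lib/libzfs
make && make install
# then re-run the failing command to see the added printf output
zpool history backup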

So, now comes the next step. The actual problem is: on this specific pool the zpool history has gotten bigger than core memory.

That is not really a surprise. After I put a new disk into the desktop, I had two disks there, both spinning at 7.2k rpm, with an offset of about 1 rpm, resulting in an annoying oscillating noise. So I moved one of the disks to the other machine in the hallway and do remote mirroring now. This means this specific pool "backup" receives a couple of zfs receive operations and snapshots per minute. And that fills up the history.

Questions:
  1. how can the zfs history be limited in size, or periodically shortened, so that it doesn't outgrow core memory?
  2. how can the memory allowance for a process be increased? (because there is enough virtual memory available). On i386 architecture, that is.
    It is not ulimit, and it is not sysctl kern.maxdsiz - these are both increased far enough. There must be another parameter that limits memory usage for a process to about 500MB.
    Addendum: figured that one out: the apparent problem is that maxdsiz is set to a big value, so malloc() then gets only what remains of the address space. Nevertheless, if that history can grow to an arbitrarily big value, this would just postpone the effect for a couple of weeks. (The relevant knobs are shown below.)
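For reference, the limits mentioned above can be inspected like this (a small sketch; note, per the addendum, that raising kern.maxdsiz on i386 only shrinks what is left over for mmap-backed malloc()):

Code:
# per-process data segment limit of the current shell (in kbytes)
ulimit -d
# kernel-wide maximum and default data segment sizes (loader tunables)
sysctl kern.maxdsiz kern.dfldsiz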
 
A: You've found a bug. Actually a few of them. If the history can grow arbitrarily big, then reading it by assuming that all of it fits into memory will eventually lead to a bug. That piece of code needs to be redesigned in some fashion, for example reading it in installments, or limiting the size of history, or something else. You could do the community a service by opening a PR against the correct component. In the case of ZFS, I don't know whether a normal FreeBSD PR is the right mechanism or not, but it is a starting point.

B: Are you running on an x86 machine? If yes, one option might be to move to a 64-bit machine and try again. The on-disk ZFS format should be compatible. In a 64-bit address space the limitations for malloc() are more relaxed, so this might work for you.

C: On both 32- and 64-bit platforms, there are ways to allocate more memory than the standard library gives you, but they often do not involve calling the library's native malloc. Instead, one can get memory by calling either sbrk() or mmap(), and then implement one's own malloc() (or operator new for C++) using the returned memory block as an arena. But it would be a very large amount of work for you to do that to the ZFS userland utilities.
 
Problem solved.
There is conflicting information about the zpool history circulating on the net. This one seems accurate for FreeBSD: https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSZpoolHistorySizeLimit

Bottom line: the history has a size limit, namely 1/1000 of the pool size, and it cannot be changed. For the pool in question that is 979252543 bytes, of which about 23% are currently used:
Code:
# zdb -dddd backup 32
[...]
        phys_max_off = 979252543
        bof = 936
        eof = 225552208
        records_lost = 0
Which means that, on i386 architecture, any pool bigger than ~700GB will sooner or later run into this error (once its history outgrows what malloc() can still provide).
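The fill level follows directly from the zdb numbers above (a quick back-of-the-envelope check):

Code:
# values taken from the zdb output of pool 'backup' above
echo $((979252543 / 1024 / 1024))   # ~933 MB maximum history size
echo $((225552208 / 1024 / 1024))   # ~215 MB used so far, i.e. about 23%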

Workaround: create the pool with a small size, then grow it by means of autoexpand=on and zpool online -e:

Code:
# gpart add -t freebsd-zfs -a 4096b -s 10g ada2
ada2p5 added
# zpool create xxx ada2p5
# zdb -dddd xxx 32
[...]
        phys_max_off = 10200547
        bof = 576
        eof = 576
        records_lost = 0
[...]
# gpart resize -a 4096b -s 900g -i 5 ada2
# zpool set autoexpand=on xxx
# zpool online -e xxx ada2p5
# zpool get all xxx
NAME  PROPERTY                       VALUE                          SOURCE
xxx   size                           945G                           -
[...]
# zdb -dddd xxx 32
[...]
                phys_max_off = 10200547
                bof = 576
                eof = 1608
                records_lost = 0
 
A: You've found a bug. Actually a few of them. If the history can grow arbitrarily big, then reading it by assuming that all of it fits into memory will eventually lead to a bug. That piece of code needs to be redesigned in some fashion, for example reading it in installments, or limiting the size of history, or something else. You could do the community a service by opening a PR against the correct component. In the case of ZFS, I don't know whether a normal FreeBSD PR is the right mechanism or not, but it is a starting point.

The only apparent use-case for the history is that the daily scrub reads it to determine the time of the last scrub. So this may be considered of minor relevance.
Concerning bug reports, the story is quite simple: when using FreeBSD, I was promised I would get a daemonette as girl Friday who would do that ITIL stuff (like blazoning bug reports). This appeared not to be fully true, so no ITIL stuff here. :'-(

B: Are you running on an x86 machine? If yes, one option might be to move to a 64-bit machine and try again. The on-disk ZFS format should be compatible. In a 64-bit address space the limitations for malloc() are more relaxed, so this might work for you.

On amd64 there appear to be no such limits. I just watched my little malloc() tester happily pass 500GB (before I hit ^C). And yes, new hardware has appeared and is on the shelf. But then, I did a lot of work to get that old server machine to run properly (as it does now), and when switching I would have to throw away all my beautiful 64-bit PCI-X cards and buy new PCIe ones. (Currently I don't even have a proper source for used, written-off hardware.)

And overall I think that, for a single person living in a 45 sqm apartment, i386 architecture should provide enough room for the housekeeping tasks. (My desktop for the real work is a bit newer.)

C: On both 32- and 64-bit platforms, there are ways to allocate more memory than the standard library gives you, but they often do not involve calling the library's native malloc.

With ZFS this is an ambitious undertaking, because it is not easy to figure out which libraries (and which header files, to begin with) it actually uses. I do understand that, since integrating such a beast is really hard work.
OTOH, with the old machine the math now appears quite simple: there are 4G of address space, of which 2G are needed for the kernel to do ZFS properly. Then, the data seg size (ulimit -d) is 500M by default, but I need this significantly above 1G, because postgres can put 1G into a single database field and that seems to depend on this value (not perfectly sure, the matter is still under investigation; when trying to upload an audio CD from a Windows machine with no other connectivity than a web browser, it blew up at 500M). And finally, malloc() gets what remains.

But the most important point here is the value of having accurate sources on-site. This thing will be fixed by 3AM when the next daily runs, and that wouldn't be possible otherwise.
 
Addendum:
Adjusting the history size (or rather, the pool size) to your needs is a bit more difficult when there is geli encryption between gpart and zfs.
Here is a nice article on how that works in principle.

So the best plan is to create the pool small right from the start, with geli already underneath. Then:
* expand the gpart partition
* expand the geli provider
* expand the zpool (sketched below).
It's good to do this anyway, because once you have done it, you know how to do it later when it becomes necessary to expand some pool with vital data.
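A sketch of those three steps, reusing the example names from above (assuming the pool 'xxx' now sits on the geli provider ada2p5.eli and the partition started out at 10g; device names and sizes are only examples):

Code:
# 1. grow the partition underneath the geli provider
gpart resize -a 4096b -s 900g -i 5 ada2
# 2. tell geli that the underlying provider grew (-s takes the OLD size)
geli resize -s 10g ada2p5
# 3. let the pool grow into the new space
zpool set autoexpand=on xxx
zpool online -e xxx ada2p5.eli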
 