This is so closely related, I don't want to start a new thread, yet it has been 3 months....
So, by way of background: I have a bunch of 11.2 production machines that fall over (hang, unresponsive) at semi-regular intervals, after being upgraded from earlier 10.x configurations. The hangs were always preceded by "out of swap" messages in the logs, even though no significant amount of swap was ever actually in use. All are booted off a UFS-formatted SSD and run varying ZFS configurations for backup. Most backup traffic is rsync (rsnapshot) jobs on the FreeBSD boxes, plus NFSv3 exports to support ghettoVCB writes from ESXi servers. I spent a lot of time troubleshooting, with no joy, so eventually I built some scripts that watched for the first precursor to a hang (the out-of-swap messages) and rebooted the servers, with logging. I didn't learn much, except that on some of the machines it seemed to happen during the periodic maintenance/checking tasks. The hangs/out-of-swap messages never happened during rsnapshot runs or while NFS writes were in progress, and no update within the 11.2 stream fixed the issue.
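For what it's worth, the watchdog was nothing fancy. A minimal sketch of the idea (the match string and actions here are illustrative assumptions, not my exact production script):

```shell
#!/bin/sh
# Hypothetical sketch of the out-of-swap watchdog; run from cron every minute.
# PATTERN matches the kernel's out-of-swap kill messages.
PATTERN="out of swap"

# saw_oom <logfile>: succeed (exit 0) if the precursor message is present.
saw_oom() {
    grep -q "$PATTERN" "$1"
}

# Demo against a synthetic log line rather than the live /var/log/messages:
TMP=$(mktemp)
echo "kernel: pid 1234 (rsync), uid 0, was killed: out of swap space" > "$TMP"
if saw_oom "$TMP"; then
    echo "precursor detected"
    # In production this is where the logging + reboot happened, e.g.:
    # logger "oom-watch: out-of-swap message seen, rebooting"
    # shutdown -r now
fi
rm -f "$TMP"
```

The real script also recorded what was running at the time, which is how I noticed the correlation with the maintenance tasks.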
To get to the point of this thread: I've successfully upgraded my way out of *that* morass on a couple of boxes, going to 12.0-RELEASE p1 and p3. After waiting more than 3x the longest previous hang-free interval, I concluded I was back on solid ground. Now I've bumped into this issue, best described by pva (although I also have an older machine that behaves like the OP reported). For this new machine (only in production for about 6 months), I decided to see if the issue was hardware related; I'm thinking not. After upgrading the kernel (freebsd-update -r 12.0-RELEASE upgrade) and rebooting, the boot loader stops during the enumeration of the drives. There are 5 drives total (all SATA).
If I unplug one of the drives, the system gets to the boot menu, and will even boot (the zpool is exported and ZFS is turned off in rc.conf). It doesn't matter which drive I unplug, or which SATA port the drives are plugged into. These are all WD Red Pro 4TB drives (except the SSD, of course).
Only the kernel has been upgraded so far.
I did notice that the upgrade changed /etc/defaults/rc.conf without giving me any chance to review it beforehand (no "Does this look reasonable (y/n)?" prompt), so I'm wondering if one of those changes relates to the number of drives, especially the cfumass_ entries. I've attached the output of that file.
The earlier boxes that upgraded successfully were all raidz1 or mirrors (only 3 or 4 drives total).
So, since I have backups and can boot 11.2 from an external USB stick to import/export the zpool and recover all the config info, I'm going to try replacing the apparently-faulty bootloader in 12.0-RELEASE with the working one from 11.2.
My biggest question at the moment is whether it's worthwhile filing a problem report, or whether the other open issues already cover it. This is not a machine I can experiment with -- it needs to go back into production tomorrow. I'm betting I can reproduce the issue on another machine, however.
---EDIT---
So this is really odd... (OK -- more like WTF?!) When I disconnect the drives that make up the ZFS pool and boot with just the SSD...
Code:
$ uname -a
FreeBSD nas0.domainnme.tld 12.0-RELEASE-p3 FreeBSD 12.0-RELEASE-p3 GENERIC i386
Clearly something really funky happened with the upgrade process. This *was* an amd64 install when it was 11.2. I'll just mention that I have a recipe for these upgrades, and I copy/pasted the same update command string that worked the two prior times.
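Given that surprise, the lesson for me is to sanity-check the architecture as part of the upgrade recipe, before and after freebsd-update runs. A minimal sketch (amd64 as the expected value is an assumption for this particular box):

```shell
#!/bin/sh
# check_arch <expected> [actual]: fail loudly if the running architecture
# doesn't match what the upgrade recipe expects. "actual" defaults to uname -m.
check_arch() {
    expected=$1
    actual=${2:-$(uname -m)}
    if [ "$actual" != "$expected" ]; then
        echo "arch mismatch: expected $expected, got $actual" >&2
        return 1
    fi
    echo "arch ok: $actual"
}

# Deterministic demo; on the real box you'd just call: check_arch amd64
check_arch amd64 amd64
```

Had something like that run before the reboot, the i386 kernel would have been caught while the box was still up.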