FreeBSD fails to boot after upgrade to 8.4

Hello everyone,

I'm really hoping someone has some advice here. I've got a FreeBSD file server running with root on ZFS. The original how-to I followed was https://wiki.freebsd.org/RootOnZFS/GPTZFSBoot/RAIDZ1

The setup worked for years. I upgraded recently from 8.2 to 8.4, and during a large file copy the system went unresponsive. Thinking it was just slow from memory pressure, I let it sit for about a day. When it was clear that it wasn't coming back, I rebooted it.
And then the problems started. It wouldn't boot. "Okay", I thought, "I've had problems like this before, I can fix this".

The system was coming up to a black screen with a single "/" in the corner, so it looked like a bootloader problem. No biggie; I must have forgotten to update the boot code after the upgrade. From the livefs the zpools are fine, and a full scrub of both returned no errors. I went ahead and reinstalled the boot code on each disk in the RAIDZ1:

gpart bootcode -b /mnt2/boot/pmbr -p /mnt2/boot/gptzfsboot -i 1 mfid0
gpart bootcode -b /mnt2/boot/pmbr -p /mnt2/boot/gptzfsboot -i 1 mfid1
gpart bootcode -b /mnt2/boot/pmbr -p /mnt2/boot/gptzfsboot -i 1 mfid2
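To double-check that index 1 really is the freebsd-boot partition on each of those disks, a quick look at the layout is enough (this is where the gpart show output mentioned further down comes from):

gpart show mfid0 mfid1 mfid2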


Still nothing. Okay, maybe something went wrong in the update; it did update ZFS quite a bit. Still, the CD itself boots fine, so (with the zpool mounted under /mnt) I swapped in its /boot:
mkdir /mnt/boot/BROKE
mv /mnt/boot/* /mnt/boot/BROKE/
cp -pr /mnt2/boot/* /mnt/boot/

Still nothing. In the middle of this, the server hardware itself decided it was done, so I had to move to a new server. No problem. I had to recreate the GPT partitioning, but at least I know it all shows up properly. I did another full scrub of zroot in case something else was funky; still no errors. I then retried everything above, multiple times, in case I had a typo.
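For the record, recreating that layout from the livefs went roughly like this per disk (mfid0 shown; device names and sizes are just from my setup, and the freebsd-zfs partition has to land at the same offset as before since the pool data is still on it):

# if an old table is still in the way: gpart destroy -F mfid0
gpart create -s gpt mfid0
gpart add -t freebsd-boot -s 64k mfid0
gpart add -t freebsd-swap -s 12G mfid0
gpart add -t freebsd-zfs mfid0

Then the bootcode step from above again on each disk.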

I've found and followed:
http://forums.freebsd.org/showthread.php?t=42361

I was pretty sure I installed the correct boot code, but went ahead, wrote down exactly what they used in that thread and did it again. Once again, no dice.

Hoping to get any output, I added boot_verbose="YES" to /boot/loader.conf on the ZFS filesystem, but it doesn't even appear to get that far.
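For reference, that's just this one line, edited from the livefs with zroot mounted under /mnt:

# /mnt/boot/loader.conf
boot_verbose="YES"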

What I have:
  • I can boot off of USB and get into the livefs environment without any problems.
  • The zpool imports just fine, and all the file systems appear to be fine
  • gpart show shows the correct layout: 64k freebsd-boot, 12GB freebsd-swap, and the rest freebsd-zfs (what I run from the livefs is sketched just below this list)
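To be clear about what "fine" means here, this is roughly what I do from the livefs each time. The -f is only because the pool was last used by the installed system, and the altroot keeps it out of the way of the live CD's own filesystems:

zpool import -f -o altroot=/mnt zroot
zpool status -v zroot
gpart show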

Another thread here suggested that vfs.root.mountfrom="zfs:zroot" was no longer needed, so I tried taking that out to see if it made any difference (it didn't). I also attempted recreating the zpool.cache file in case it needed a new one after all the updates. That didn't help either.
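By "recreating the zpool.cache file" I mean roughly the usual export/re-import dance from the livefs (the paths here are just an example):

zpool export zroot
zpool import -o cachefile=/tmp/zpool.cache -o altroot=/mnt zroot
cp /tmp/zpool.cache /mnt/boot/zfs/zpool.cache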

My knowledge of how FreeBSD boots is limited. I've let it sit on the black screen with the / overnight before in case it was being slow, but to no avail. I'm sure I'm missing something small, but I've been tearing my hair out trying to figure out what. Can anyone shed some light on what I can do to figure this out?
 
Okay. One of the following fixed it:

  • Removing boot.config (see the note after this list)
  • Setting zroot failmode to continue with zpool set failmode=continue zroot
  • Clearing and resetting zroot's bootfs with zpool set bootfs='' zroot ; zpool set bootfs=zroot zroot
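"Removing" boot.config really just meant moving it aside, with zroot mounted under /mnt in the livefs, which is why it shows up as .old below:

mv /mnt/boot.config /mnt/boot.config.old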

I took a shotgun approach, sat down in the basement, and powered through until it would boot. Those were the last few things I tried, all in the same livefs session, so I don't know exactly which one fixed it. boot.config had some old configuration in it from when I still had a serial console (RIP). I wouldn't have expected it to break booting; the only line in it was:
# cat /boot.config.old
-S115200 -Dh


As for the others, I don't know. The data zpool still isn't mounting on boot, so now I'm working on that. However, it boots. That has made me immensely happy. Were I not on call, I would probably be getting hammered in celebration. :)
 
Well, I don't quite understand what's going on here. Running zpool set failmode=continue data allowed that pool to mount at boot time as well.

Both zpools have all their disks online, and as far as the documentation goes, that setting should only come into play when all the disks in a pool (or more than the maximum number allowed) have failed. From http://docs.oracle.com/cd/E19253-01/819-5461/gftgp/index.html:
The failmode property – This property determines the behavior of a catastrophic pool failure due to a loss of device connectivity or the failure of all devices in the pool. The failmode property can be set to these values: wait, continue, or panic. The default value is wait, which means you must reconnect the device or replace a failed device, and then clear the error with the zpool clear command.

Does anyone have any ideas why the pools wouldn't mount without this set to continue, even though all the disks show as online?
The only errors I see related to the disks in dmesg are that they're not certified drives and that the battery on the RAID card is bad (I'm ordering a replacement for that).
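For completeness, this is how I'm checking the pool state and those properties (plain zpool commands; both pools report everything ONLINE):

zpool status zroot data
zpool get failmode,bootfs zroot data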
 