Solved: 11.0 ZFS panic on boot

Recently I installed an 11.0 server with two drives and chose the install option that mirrored them using ZFS. The machine ran fine in the colo rack for a few weeks, then became unresponsive. Plugging in a console showed it was stuck in a panic / boot loop due to ZFS.

I'd appreciate suggestions on what went wrong to cause this, as well as how to proceed with recovery.

I took a picture of the console screen (attached: ZFS_crash.8x6.jpg).
 
Panic occurs when "trying to mount root"

I would suggest booting from install media and then going to a shell. From the shell, try to find the zpool:
Code:
# zpool import
This should hopefully show something like "zj6". If so, try to import it (beware not to remount /!):
Code:
# zpool import -R /mnt zj6
If this works, then export the zpool:
Code:
# zpool export zj6
and reboot from the disks.
One possible cause is that recovering from such an error requires more stack space than the boot environment gives you. I've seen this in 32-bit boot environments. When running the installer CD, the kernel has finished booting and has much more room to resolve ZFS problems.
 
So this points to on-disk data corruption. I see two hurdles to jump. Some of this is mentioned in Thread 44949.
First you need to get around the zfs_panic_recover call. For this I suggest setting the tunable vfs.zfs.recover to 1 - and while you're at it, set vfs.zfs.debug too.
When booting from the install media and entering the shell, the commands are
Code:
# sysctl vfs.zfs.recover=1
vfs.zfs.recover: 0 -> 1
# sysctl vfs.zfs.debug=1
vfs.zfs.debug: 0 -> 1
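As far as I know, these can also be set from the loader prompt (escape to the loader from the boot menu), so the workaround applies when booting from the disks themselves and not only from the install media. Something like this should work, though I haven't verified it on 11.0 specifically:
Code:
OK set vfs.zfs.recover=1
OK set vfs.zfs.debug=1
OK boot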
With the earlier panic disabled, try a dry-run import of the pool in recovery mode:
Code:
# zpool import -fFn -R /mnt -o failmode=continue zj6
See zpool(8) for the -F option (recovery), -n (dry run: don't do it, just test it) and failmode.
For repairing data corruption there is also a command to clear the on-disk ZFS logs. By issuing this you will lose the last couple of transactions on your pool.
Code:
# zpool clear -Fn zj6
Based on the results you will need to decide whether to run the above commands for real (leaving out the -n option). This involves weighing the value of whatever data is supposed to be in the pool.
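For reference, going for real would be something along these lines (the same commands with -n dropped, plus a status check afterwards); adjust to whatever the dry runs actually report:
Code:
# zpool import -fF -R /mnt -o failmode=continue zj6
# zpool clear -F zj6
# zpool status -v zj6
If the import succeeds, zpool status -v should show whether anything is still flagged as corrupted.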

Legal Notice:
To be on the safe side, it is possible to simply copy the disks over to another storage location first. Something along the lines of:
Code:
# dd if=/dev/ada0p3 bs=1048576 | ssh remoteuser@remote.storage.location dd bs=1048576 of=ada0p3.safeimage
Do this for both disks and wait for someone smarter than me to give you better advice ;)
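If you go that route, it may also be worth checksumming both ends to be sure the images are intact; a rough sketch, using the same placeholder host and file names as above:
Code:
# sha256 /dev/ada0p3
# ssh remoteuser@remote.storage.location sha256 ada0p3.safeimage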
 
Progress. I can live with losing the data. I have grave concerns about how the pool got corrupted.
Code:
# zpool import -fFn -R /mnt -o failmode=continue zj6
complained with "failed to create mountpoints"
Code:
# zpool clear -Fn zj6
complained "no such pool"
So I ran the first command without -n and saw a lot of problems:
(screenshots attached: 20161230_zfs_3.8x6.jpg, 20161230_zfs_4.8x6.jpg)

but then the second command ran quietly.

Rebooted to single-user mode and did a
Code:
zfs mount -a
and things looked OK. Rebooted to multi-user mode, got to the point where my jails are mounted, then got another panic (the panic was in dva_get_dsize_nc):
(screenshot attached: 20161230_zfs_5.8x6.jpg)


As I mentioned above, I can live with losing this pool, but the cause of the corruption worries me greatly. I could easily recreate the pool and put the file systems back from backups, but what can I do to prevent a recurrence?
 
Hard one. It looks like a hardware failure. I would start by checking the RAM and then move on to the controller. Can you describe what type of hardware you are using?
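For the disks and controller, smartmontools from ports (sysutils/smartmontools) can at least rule out obvious drive errors; a quick sketch, assuming the drives show up as ada0 and ada1:
Code:
# pkg install smartmontools
# smartctl -a /dev/ada0
# smartctl -a /dev/ada1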
 
It's an Asus server. I read your response and said "Well, it can't be the RAM since I ran Memtest on it about a month ago". For the sake of completeness though, I ran it again and got errors. Thank you! I would not have tried that without another perspective.
Replacement RAM in the box right now, running Memtest again. Will just reformat and reinstall. Gives me a chance to test recovery from backups.
 
If the RAM is ECC then your data should be fine. It would be worth booting with the new RAM and running a scrub, assuming you can boot!
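Something along the lines of (pool name from your earlier posts):
Code:
# zpool scrub zj6
# zpool status -v zj6
zpool status -v will show the scrub progress and list any files found to be in error.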
 
Motherboard is a P5MT-R (RAID disabled).
RAM is labeled DDRII 128X4(2) P667 2GAUM1709 which, AFAICT, is not ECC, so it will be replaced.
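For completeness I may double-check with dmidecode from ports (sysutils/dmidecode) once the box is back up; as far as I know it reports the error-correction type of the installed modules:
Code:
# pkg install dmidecode
# dmidecode -t memory | grep -i 'error correction'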
 