Solved: 11.0 ZFS panic on boot

Recently I installed an 11.0 server with two drives and chose the install option that mirrored them using ZFS. The machine ran fine in the colo rack for a few weeks, then became unresponsive. Plugging in a console showed it was stuck in a panic / boot loop due to ZFS.

I'd appreciate suggestions on what went wrong to cause this, as well as how to proceed with recovery.

I took a picture of the console screen (attached: ZFS_crash.8x6.jpg).
 
Panic occurs when "trying to mount root"

I would suggest booting from install media and then going to a shell. From the shell, try to find the zpool:
Code:
# zpool import
This should hopefully show something like "zj6". If so, try to import it (beware not to remount /!):
Code:
# zpool import -R /mnt zj6
If this works, then export the zpool:
Code:
# zpool export zj6
and reboot from the disks.
One possible cause is that recovering from such an error requires more stack space than the boot environment gives you. I've seen this in 32-bit boot environments. When running the installer CD, the kernel has finished booting and has much more room to resolve ZFS problems.
 
So this points to on-disk data corruption. I see two hurdles to jump. Some of this is mentioned in Thread 44949.
First you need to get around the zfs_panic_recover call. For this I suggest setting the tunable vfs.zfs.recover to 1 - and while you're at it, set vfs.zfs.debug too.
When booting from the install media and entering the shell, the commands are
Code:
# sysctl vfs.zfs.recover=1
vfs.zfs.recover: 0 -> 1
# sysctl vfs.zfs.debug=1
vfs.zfs.debug: 0 -> 1
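As far as I know, these can also be set from the loader prompt (escape to the loader from the boot menu), so the workaround applies when booting from the disks themselves and not only from the install media. Something like this should work, though I haven't verified it on 11.0 specifically:
Code:
OK set vfs.zfs.recover=1
OK set vfs.zfs.debug=1
OK boot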
With the earlier panic disabled, try a dry-run import of the pool in recovery mode:
Code:
# zpool import -fFn -R /mnt -o failmode=continue zj6
See zpool(8) for the -F option (recovery), -n (dry run: don't do it, just test it) and failmode.
For repairing data corruption there is also a command to clear the on-disk ZFS logs. By issuing this you will lose the last couple of transactions on your pool.
Code:
# zpool clear -Fn zj6
Based on the results you will need to decide whether to run the above commands for real (leaving out the -n option). This involves weighing the value of whatever data is supposed to be in the pool.
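For reference, going for real would be something along these lines (the same commands with -n dropped, plus a status check afterwards); adjust to whatever the dry runs actually report:
Code:
# zpool import -fF -R /mnt -o failmode=continue zj6
# zpool clear -F zj6
# zpool status -v zj6
If the import succeeds, zpool status -v should show whether anything is still flagged as corrupted.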

Legal Notice:
To be on the safe side, it is possible to simply copy the disks over to another storage location first. Something along the lines of:
Code:
# dd if=/dev/ada0p3 bs=1048576 | ssh remoteuser@remote.storage.location dd bs=1048576 of=ada0p3.safeimage
Do this for both disks and wait for someone smarter than me to give you better advice ;)
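If you go that route, it may also be worth checksumming both ends to be sure the images are intact; a rough sketch, using the same placeholder host and file names as above:
Code:
# sha256 /dev/ada0p3
# ssh remoteuser@remote.storage.location sha256 ada0p3.safeimage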
 
Progress. I can live with losing the data. I have grave concerns about how the pool got corrupted.
Code:
# zpool import -fFn -R /mnt -o failmode=continue zj6
complained with "failed to create mountpoints"
Code:
# zpool clear -Fn zj6
complained "no such pool"
So I ran the first command without -n and saw a lot of problems:
(screenshots attached: 20161230_zfs_3.8x6.jpg, 20161230_zfs_4.8x6.jpg)

but then the second command ran quietly.

Rebooted to single-user mode and did a
Code:
zfs mount -a
and things looked OK. Rebooted to multi-user mode, got to the point where my jails are mounted, then got another panic (the panic was in dva_get_dsize_nc):
(screenshot attached: 20161230_zfs_5.8x6.jpg)


As I mentioned above, I can live with losing this pool, but the cause of the corruption worries me greatly. I could easily recreate the pool and put the file systems back from backups, but what can I do to prevent a recurrence?
 
Hard one. It looks like a hardware failure. I would start by checking the RAM and then move on to the controller. Can you describe what type of hardware you are using?
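For the disks and controller, smartmontools from ports (sysutils/smartmontools) can at least rule out obvious drive errors; a quick sketch, assuming the drives show up as ada0 and ada1:
Code:
# pkg install smartmontools
# smartctl -a /dev/ada0
# smartctl -a /dev/ada1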
 
It's an Asus server. I read your response and said "Well, it can't be the RAM since I ran Memtest on it about a month ago". For the sake of completeness though, I ran it again and got errors. Thank you! I would not have tried that without another perspective.
Replacement RAM in the box right now, running Memtest again. Will just reformat and reinstall. Gives me a chance to test recovery from backups.
 
If the RAM is ECC then your data should be fine. It would be worth booting with the new RAM and running a scrub, assuming you can boot!
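Something along the lines of (pool name from your earlier posts):
Code:
# zpool scrub zj6
# zpool status -v zj6
zpool status -v will show the scrub progress and list any files found to be in error.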
 
Motherboard is a P5MT-R (RAID disabled).
RAM is labeled DDRII 128X4(2) P667 2GAUM1709 which, AFAICT, is not ECC, so it will be replaced.
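For completeness I may double-check with dmidecode from ports (sysutils/dmidecode) once the box is back up; as far as I know it reports the error-correction type of the installed modules:
Code:
# pkg install dmidecode
# dmidecode -t memory | grep -i 'error correction'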
 