ZFS System panic with zpool corruption

My system is unable to boot (it panics during boot) because zpool shows some corruption in zroot.

Here's the output
Code:
zpool status -v
  pool: zroot
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:20:19 with 2 errors on Sun Dec 11 17:51:59 2022

config:

        NAME          STATE     READ WRITE CKSUM
        zroot         ONLINE       0     0     0
          ada0p3.eli  ONLINE       0     0     4

errors: Permanent errors have been detected in the following files:

        zroot/tmp:<0x3>


How do I fix this error?

I tried zpool scrub zroot multiple times now, but that hasn't helped

PS: The system boots into multi-user mode just fine when zfs_enable="YES" is commented out in rc.conf, but that's obviously of no use without the underlying file system.
 

Attachments: 1670763282370.jpg (534.5 KB)
zpool status output in your message has some weird formatting, but it does look like you have just one vdev in your pool. So, zpool scrub cannot do much for you, because the configuration is not redundant.

Also, you did not say anything about the actual crash, so I have to trust you that it is caused by the checksum error(s) in the pool.

Finally, the error seems to be in a special object in the zroot/tmp filesystem. So you can try to destroy (and re-create) that filesystem, or mark it as unmountable and rename it, just in case.
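For reference, the mark-unmountable-and-rename option might look roughly like this (the `tmp.broken` name is just illustrative, not something from the thread):

```shell
# Stop the corrupted dataset from mounting, then rename it out of the way.
zfs set canmount=off zroot/tmp
zfs set mountpoint=none zroot/tmp
zfs rename zroot/tmp zroot/tmp.broken
```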
 
zpool status output in your message has some weird formatting, but it does look like you have just one vdev in your pool. So, zpool scrub cannot do much for you, because the configuration is not redundant.
You mean in the image? Yes it's a laptop with just a hard drive.

Interesting to know that scrub can't do much - does scrub only work in redundant settings?
Also, you did not say anything about the actual crash, so I have to trust you that it is caused by the checksum error(s) in the pool.
The crash happens only when booting and only when I enable zfs_enable="YES" in rc.conf

I'm still figuring out how to share the crash details but difficult from my mobile device.
Finally, the error seems to be in a special object in zroot/tmp filesystem. So, you can try to destroy (and re-create) that filesystem. Or mark it as unmountable and rename it just in case too.
Not sure I understand this part at all and what I need to do here. What do you mean by destroy and recreate? And the mountable/rename part?

I had tried to list the contents of /tmp/ but it didn't show any such file. There was also a link that scrub pointed to; I'm not sure I have it handy, but the other thread about this issue mentions it. I gather it's probably metadata corruption, but I'm not sure about that or what to do about it.
 
Btw just want to report a strange but somewhat documented behavior:

When I ran a full scrub the error didn't go away, but when I started a scrub and stopped it after a few minutes, the zroot/tmp:<0x3> error was gone when I checked the pool status again.

However that doesn't solve the issue because rebooting causes panic again. Just thought maybe it's relevant and I should jot it down.
 
However that doesn't solve the issue because rebooting causes panic again.
It's /tmp/, anything that's in there is not worth saving, just destroy the whole zroot/tmp dataset and create a new one. Boot to single user mode to do this.
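In single-user mode, the destroy-and-recreate step could look roughly like this (the property values assume a default-style zroot layout; adjust them to your install):

```shell
# Destroy the corrupted /tmp dataset and recreate it.
zfs destroy zroot/tmp
zfs create -o mountpoint=/tmp -o exec=on -o setuid=off zroot/tmp
chmod 1777 /tmp   # restore the sticky, world-writable mode
```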
 
It's /tmp/, anything that's in there is not worth saving, just destroy the whole zroot/tmp dataset and create a new one. Boot to single user mode to do this.
Yes - this is what I ended up doing, as _martin suggested yesterday. I set the mountpoint to `none`, which let me boot into multi-user mode. But I still can't get some applications to run properly (Chrome doesn't run; Firefox wants me to start with a new profile).

The original zpool error still persists, though - no fix for that yet.
 
I set mountpoint to `none` - which let me boot into multiuser mode.
This doesn't remove the dataset, it's still there. It's just not being accessed anymore. Accessing that corrupted dataset seems to elicit a nice panic(9), which is never good. It shouldn't panic, no matter how corrupted it is.

The original zpool error however persists though - no fix for that yet.
I don't think it's fixable; the panic might be, but the data in that dataset is probably lost forever. Which is fine - as I said, it's /tmp, files there aren't even guaranteed to exist after a reboot.
Code:
     /tmp/      temporary files that are not guaranteed to persist across
                system reboots
hier(7)
 
This doesn't remove the dataset, it's still there. It's just not being accessed anymore. Accessing that corrupted dataset seems to elicit a nice panic(9), which is never good. It shouldn't panic, no matter how corrupted it is.
It's a ZFS self-induced panic caused by metadata corruption; it's not triggered by a bug.
 
Tracker This is what I mentioned in the other thread you posted: don't fork this issue into separate threads. The main thread you started has all the information.

This doesn't remove the dataset, it's still there. It's just not being accessed anymore. Accessing that corrupted dataset seems to elicit a nice panic(9), which is never good. It shouldn't panic, no matter how corrupted it is.
Correct, and it was done intentionally, to do as little as possible to the pool to make it bootable. When I looked at his crashdump I saw that the crash was happening when /tmp was cleaned up. It was worth a shot to disable just /tmp and hope the corruption "was not spread" to other datasets too.

I don't want to re-paste all the information I posted there: PR 268333, Oracle KB, ZFS-8000-8A, original thread.
edit: that Oracle KB was hinted at by covacat
 
Tracker This is what I mentioned in the other thread you posted: don't fork this issue into separate threads. The main thread you started has all the information.


Correct, and it was done intentionally, to do as little as possible to the pool to make it bootable. When I looked at his crashdump I saw that the crash was happening when /tmp was cleaned up. It was worth a shot to disable just /tmp and hope the corruption "was not spread" to other datasets too.

I don't want to re-paste all the information I posted there: PR 268333, Oracle KB, ZFS-8000-8A, original thread.
Yes, thanks a ton for all the help and assistance you've provided. Just want to point out for the PR - the error _temporarily_ vanishes if I stop the scrub (using -s) after a couple of minutes (vs. running the whole process), but then it somehow comes back again - that's what I remember from when I tried it.
 
I'd trust that pool enough to grab whatever data you need (data that isn't reported corrupted is OK), but I wouldn't trust that pool any further. You should definitely do a fresh install of the system.
Once you have a backup (which you don't), you can experiment. You can then remove zroot/tmp and create a new one. Note, though, that corruption was also reported under your ~ path, so another dataset is corrupted too.
As I said, you'll save yourself a headache if you do a fresh install.
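A minimal sketch of that backup step, assuming a second machine reachable over SSH (the host name and the target pool `tank` are illustrative, not from the thread):

```shell
# Snapshot recursively, then stream datasets to another machine.
# Sending datasets individually lets you skip the corrupted zroot/tmp.
zfs snapshot -r zroot@rescue
zfs send zroot/usr/home@rescue | \
    ssh backup@example-host "zfs receive -u tank/rescue/home"
```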
 
Yes, thanks a ton for all the help and assistance you've provided. Just want to point out for the PR - the error _temporarily_ vanishes if I stop the scrub (using -s) after a couple of minutes (vs. running the whole process), but then it somehow comes back again - that's what I remember from when I tried it.
That doesn't matter too much - if you terminate the scrub mid-flight, it may not log proper status info.

Then, yes, it is possible that a kernel crash happens from defective metadata - I have seen that before. I'm not sure if this is a bug, as Unix was never supposed to run with a defective disk (that's why extensive surface analysis was done in former times).

So, as I understand this, the issue is to get the pool intact. And if there is important data in the pool, I would remove that disk and not do any further experiments, get a fresh install on a replacement disk, and then analyze that broken pool in read-only.
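Assuming the fresh install also ends up with a pool named zroot, the read-only analysis step could be sketched like this (the GUID and alternate name are placeholders):

```shell
# List importable pools to find the old pool's numeric GUID, then import
# it read-only under an alternate root and a new name, so it neither
# collides with the new zroot nor gets written to.
zpool import
zpool import -o readonly=on -R /mnt/oldzroot <pool-guid> zroot-old
```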

The next question then is, how could this happen? I.e. is there some defective hardware involved?
 