ZFS System panic with zpool corruption

My system is unable to boot (it panics during boot) because zpool shows some corruption in zroot.

Here's the output
Code:
zpool status -v
  pool: zroot
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:20:19 with 2 errors on Sun Dec 11 17:51:59 2022

config:

        NAME          STATE     READ WRITE CKSUM
        zroot         ONLINE       0     0     0
          ada0p3.eli  ONLINE       0     0     4

errors: Permanent errors have been detected in the following files:

        zroot/tmp:<0x3>


How do I fix this error?

I tried zpool scrub zroot multiple times now, but that hasn't helped

PS: The system boots into multi-user mode just fine when zfs_enable="YES" is commented out in rc.conf, but that's obviously of no use without the underlying file system.
 

Attachments: 1670763282370.jpg (534.5 KB)
zpool status output in your message has some weird formatting, but it does look like you have just one vdev in your pool. So, zpool scrub cannot do much for you, because the configuration is not redundant.

Also, you did not say anything about the actual crash, so I have to trust you that it is caused by the checksum error(s) in the pool.

Finally, the error seems to be in a special object in the zroot/tmp filesystem. So you can try to destroy (and re-create) that filesystem, or mark it as unmountable and rename it, just in case.
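For reference, the mark-unmountable-and-rename option might look roughly like this (the `tmp.broken` name is just illustrative, not something from the thread):

```shell
# Stop the corrupted dataset from mounting, then rename it out of the way.
zfs set canmount=off zroot/tmp
zfs set mountpoint=none zroot/tmp
zfs rename zroot/tmp zroot/tmp.broken
```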
 
zpool status output in your message has some weird formatting, but it does look like you have just one vdev in your pool. So, zpool scrub cannot do much for you, because the configuration is not redundant.
You mean in the image? Yes it's a laptop with just a hard drive.

Interesting to know that scrub can't do much - does scrub only work in redundant settings?
Also, you did not say anything about the actual crash, so I have to trust you that it is caused by the checksum error(s) in the pool.
The crash happens only when booting and only when I enable zfs_enable="YES" in rc.conf

I'm still figuring out how to share the crash details but difficult from my mobile device.
Finally, the error seems to be in a special object in zroot/tmp filesystem. So, you can try to destroy (and re-create) that filesystem. Or mark it as unmountable and rename it just in case too.
Not sure I understand this part at all and what I need to do here. What do you mean by destroy and recreate? And the mountable/rename part?

I had tried to list the contents of /tmp/ but it didn't show any such file. There was also a link that scrub pointed to; I'm not sure I have it handy, but the other thread about this issue mentions it. I gather it's probably metadata corruption, but I'm not sure about that or what to do about it.
 
Btw just want to report a strange but somewhat documented behavior:

When I ran a full scrub the error didn't go away, but when I started a scrub and stopped it after a few minutes, the zroot/tmp:<0x3> error was gone when I checked the pool status again.

However that doesn't solve the issue because rebooting causes panic again. Just thought maybe it's relevant and I should jot it down.
 
However that doesn't solve the issue because rebooting causes panic again.
It's /tmp/, anything that's in there is not worth saving, just destroy the whole zroot/tmp dataset and create a new one. Boot to single user mode to do this.
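In single-user mode, the destroy-and-recreate step could look roughly like this (the property values assume a default-style zroot layout; adjust them to your install):

```shell
# Destroy the corrupted /tmp dataset and recreate it.
zfs destroy zroot/tmp
zfs create -o mountpoint=/tmp -o exec=on -o setuid=off zroot/tmp
chmod 1777 /tmp   # restore the sticky, world-writable mode
```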
 
It's /tmp/, anything that's in there is not worth saving, just destroy the whole zroot/tmp dataset and create a new one. Boot to single user mode to do this.
Yes - this is what I ended up doing, as _martin suggested yesterday. I set the mountpoint to `none`, which let me boot into multi-user mode. But I still can't get some applications to run properly (Chrome doesn't run; Firefox wants me to start with a new profile).

The original zpool error still persists, though - no fix for that yet.
 
I set mountpoint to `none` - which let me boot into multiuser mode.
This doesn't remove the dataset, it's still there. It's just not being accessed anymore. Accessing that corrupted dataset seems to elicit a nice panic(9), which is never good. It shouldn't panic, no matter how corrupted it is.

The original zpool error however persists though - no fix for that yet.
I don't think it's fixable; the panic might be, but the data in that dataset is probably lost forever. Which is fine - as I said, it's /tmp, files there aren't even guaranteed to exist after a reboot.
Code:
     /tmp/      temporary files that are not guaranteed to persist across
                system reboots
hier(7)
 
This doesn't remove the dataset, it's still there. It's just not being accessed anymore. Accessing that corrupted dataset seems to elicit a nice panic(9), which is never good. It shouldn't panic, no matter how corrupted it is.
It's a ZFS self-induced panic caused by metadata corruption; it's not triggered by a bug.
 
Tracker This is what I mentioned in the other thread you posted: don't fork this issue into separate threads. The main thread you started has all the information.

This doesn't remove the dataset, it's still there. It's just not being accessed anymore. Accessing that corrupted dataset seems to elicit a nice panic(9), which is never good. It shouldn't panic, no matter how corrupted it is.
Correct, and it was done intentionally, to do as little as possible to the pool to make it bootable. When I looked at his crashdump I saw that the crash was happening when /tmp was cleaned up. It was worth a shot to disable just /tmp and hope the corruption "was not spread" to other datasets too.

I don't want to re-paste all the information I posted there: PR 268333, Oracle KB, ZFS-8000-8A, original thread.
edit: that Oracle KB was hinted at by covacat
 
Tracker This is what I mentioned in the other thread you posted: don't fork this issue into separate threads. The main thread you started has all the information.


Correct, and it was done intentionally, to do as little as possible to the pool to make it bootable. When I looked at his crashdump I saw that the crash was happening when /tmp was cleaned up. It was worth a shot to disable just /tmp and hope the corruption "was not spread" to other datasets too.

I don't want to re-paste all the information I posted there: PR 268333, Oracle KB, ZFS-8000-8A, original thread.
Yes, thanks a ton for all the help and assistance you've provided. Just want to point out for the PR - the error _temporarily_ vanishes if I stop the scrub (using -s) after a couple of minutes (vs. running the whole process), but then it somehow comes back again - that's what I remember from when I tried it.
 
I'd trust that pool enough to grab whatever data you need (data that isn't reported corrupted is OK), but I wouldn't trust that pool any further. You should definitely do a fresh install of the system.
Once you have a backup (which you don't), you can experiment. You can then remove zroot/tmp and create a new one. Note, though, that corruption was also reported under your ~ path, so another dataset is corrupted too.
As I said, you'll save yourself a headache if you do a fresh install.
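A minimal sketch of that backup step, assuming a second machine reachable over SSH (the host name and the target pool `tank` are illustrative, not from the thread):

```shell
# Snapshot recursively, then stream datasets to another machine.
# Sending datasets individually lets you skip the corrupted zroot/tmp.
zfs snapshot -r zroot@rescue
zfs send zroot/usr/home@rescue | \
    ssh backup@example-host "zfs receive -u tank/rescue/home"
```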
 
Yes, thanks a ton for all the help and assistance you've provided. Just want to point out for the PR - the error _temporarily_ vanishes if I stop the scrub (using -s) after a couple of minutes (vs. running the whole process), but then it somehow comes back again - that's what I remember from when I tried it.
That doesn't matter too much - if you terminate the scrub mid-flight, it may not log proper status info.

Then, yes, it is possible that a kernel crash happens from defective metadata - I have seen that before. I'm not sure if this is a bug, as Unix was never supposed to run with a defective disk (that's why extensive surface analysis was done in former times).

So, as I understand this, the issue is to get the pool intact. And if there is important data in the pool, I would remove that disk and not do any further experiments, get a fresh install on a replacement disk, and then analyze that broken pool in read-only.
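Assuming the fresh install also ends up with a pool named zroot, the read-only analysis step could be sketched like this (the GUID and alternate name are placeholders):

```shell
# List importable pools to find the old pool's numeric GUID, then import
# it read-only under an alternate root and a new name, so it neither
# collides with the new zroot nor gets written to.
zpool import
zpool import -o readonly=on -R /mnt/oldzroot <pool-guid> zroot-old
```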

The next question then is, how could this happen? I.e. is there some defective hardware involved?
 