System panic

cracauer@ · Dec 10, 2022

Tracker said:
This is interesting. Will try to keep in mind. Hopefully I solve this issue for now by scrubbing but yes crashing upon compiling earlier does make me still wonder if it could be a RAM issue.

Could also be the CPU mixing up a bit here and there. That's what mprime/prime95 tests.

Even if you clear the ZFS errors the question remains how it got corrupted in the first place.

Tracker · Dec 10, 2022

elgrande said:
I am not a zfs pro, but from the link I sent you, I have the impression that the 'clear' is required to reset the error stats.
Anyhow if it is just a Chrome directory, why not delete the whole directory and scrub/clear once after this?

Deleted the files that were showing up as permanent errors under scrub - then did zpool clear zroot.

Rebooted. Same panic error, unable to reboot.

Will get some sleep now and try again on few other suggestions and report back.

Thanks everyone.
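For reference, the scrub-then-clear cycle discussed here might look like the sketch below. The pool name zroot comes from this thread; the `run` helper and the `DRY_RUN` variable are made up for illustration, so the commands are only printed unless DRY_RUN=0 is set. As far as I know, `zpool clear` only resets the error counters, and the list of permanent errors may not shrink until a later scrub has completed.

```sh
#!/bin/sh
# Hypothetical helper: print each command, execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Scrub the pool, look at the result, then reset the error counters.
scrub_and_clear() {
    pool="$1"
    run zpool scrub "$pool"        # starts the scrub in the background
    run zpool status -v "$pool"    # repeat until the scrub is reported as finished
    run zpool clear "$pool"        # resets error counters; it does not repair data
}

scrub_and_clear zroot
```

Run it once as-is to review the plan, then with `DRY_RUN=0` as root if the commands look right for your pool.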

_martin · Dec 10, 2022

In my opinion you should not be chasing stale issues after a scrub on files you most likely don't care about, as they are in tmp or are not important (Chrome related).

If you are able to panic the system without the graphics driver, that panic is what you need to be after. Show us what the panic is about.
Yes, passing memtest is not 100% assurance all is ok but it's always a good place to start. Other tests that stress the system are always a good idea. But here you can always reliably trigger a panic.
RAM is usually the best way to start. CPU is next. That's why I asked about the MCE errors in the syslog.
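A quick way to look for machine-check reports is to filter the kernel messages. This is a minimal sketch: the function name `find_mca` is hypothetical, and the search pattern is an assumption about how such lines are worded (on FreeBSD they usually start with "MCA:").

```sh
#!/bin/sh
# Filter log text for machine-check reports.
# With no file arguments it reads standard input.
find_mca() {
    grep -iE 'MCA:|machine check' "$@"
}

# Check the kernel message buffer; print a note when nothing matches.
dmesg 2>/dev/null | find_mca || echo "no machine-check lines found in dmesg"
```

The same filter can be pointed at the saved log, for example `find_mca /var/log/messages`.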

Tracker · Dec 11, 2022

cracauer@ said:
Could also be the CPU mixing up a bit here and there. That's what mprime/prime95 tests.

Even if you clear the ZFS errors the question remains how it got corrupted in the first place.

Ok so I installed mprime and trying stress testing with default options now (2 cores instead of 4 i believe - this machine is pretty old so I guess that should be reasonable? Was already having temp issues when overloaded). Just hoping CPU is running reasonably fine.

How long does this usually run?

And what should I do to test RAM?

Tracker · Dec 11, 2022

Alain De Vos said:
It's a good idea to remove all "browser-related-data".

elgrande said:
I am not a zfs pro, but from the link I sent you, I have the impression that the 'clear' is required to reset the error stats.
Anyhow if it is just a Chrome directory, why not delete the whole directory and scrub/clear once after this?

Tried this I think - by deleting the files scrub was complaining about manually under Chromium config. Didn't seem to work and still causing issues. Panic at boot.

So right now I need to fix the zpool errors with scrub - maybe run it a couple more times? Already ran it twice I think.

And need to test hardware - running mprime for CPU currently.

_martin said:
Third point: configure dump device. Do you have swap partition on your system? If you're not sure please show us the gpart show so we can check. If yes we need to do what I mentioned above. Once you have a crash in /var/crash we are ready to open PR.

How do I configure dump device? I have swap on the system, yes.

FYI any output I share might have typos because I'm not able to do it correctly from my phone.

Tracker · Dec 11, 2022

Update: Not sure how long this mprime CPU test is supposed to run, but it has been running for a couple of hours and the (rolling, difficult to follow) output doesn't seem to indicate errors whenever I glance at it.

See image below. How long do I let this run?

I'm assuming couple of hours should have been good enough to catch glaring errors?

UPDATE 2 : Closing the mprime testing now, let it run for 3 hrs I think. Wasn't indicating any failures whenever I looked at it. Attaching a second image below to show how it was going, please let me know if there's any errors you spot (mprime noob here). I reckon this should be sufficient to conclude that CPU isn't the primary fault here?

Oh also including a 3rd image which says 0 errors 0 warnings after 2:45 hrs run

Tracker · Dec 11, 2022

I *think* this is the root problem now. When I do `zpool status -v` it's again showing me the permanent errors to be in the following files
`zroot/tmp:<0x3>` like I mentioned earlier.

I can't see any such file when I try `ls /tmp`; it outputs some other files (that seem temporary in nature). I think maybe if I could remove this file then things could possibly work?

The output of `zpool status -v` also points to this link https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A/ . When I read it, it makes me think this might be a metadata level corruption of the zpool? Just not sure how to fix this.

Anyone?

Alain De Vos · Dec 11, 2022

That's not a nice message. It means you need to start from scratch ...

_martin · Dec 11, 2022

Tracker said:
How do I configure dump device? I have swap on the system, yes.

Please look at my comments above, I've already mentioned it twice. /etc/rc.conf: dumpdev="AUTO" is sufficient. Check with dumpdev -l if you have dump device enabled. If not execute dumpon /dev/diskXpN to use that swap.

If this is the only computer you have and you are posting from a cell phone I get it, that's a pain to do. But you never showed the stack trace of the crash when you had only zfs_enable in rc.conf.

For the sake of clarity I'll reiterate: HW stressing is a good way to figure out if your HW is ok. The more tests the better. In your case testing RAM and checking syslog for MCE should be enough. You are reliably able to reproduce the issue.
Is it ZFS related? Could be. The only crash you showed is not complete (screen scroll), and it seems to have nested issues (vpanic already happening when drm ran into another issue). That it was caused by an rm process is also a bit weird (maybe pointing back to the fs?).

In either case (ZFS or graphics driver) you'd need a PR, as those projects are really big and informing the developers about the issue is the best way to go.

Tracker · Dec 11, 2022

Alain De Vos said:
That's not a nice message. It means you need to start from scratch ...

Holy cow. Do you mean _ALL_ my data is lost? Is there no way to recover? (Snapshots/BEs?)

PS: I am able to see the files in single user mode

with ls on home directory

Tracker · Dec 11, 2022

_martin said:
Please look at my comments above, I've already mention it twice. /etc/rc.conf: dumpdev="AUTO" is sufficient. Check with dumpdev -l if you have dump device enabled. If not execute dumpon /dev/diskXpN to use that swap.

If this is the only computer you have and you are posting from cell phone I get it, that's pain to do. But you never showed the stack trace of the crash when you had only zfs_enable in rc.conf.

For the sake of clarity I'll reiterate: HW stressing is a good way to figure out if your HW is ok. More tests the better. In your case testing RAM and checking syslog or MCE should be enough. You are reliably able to reproduce the issue.
Is it ZFS related? Could be. The only crash you showed is not complete (screen scroll), it seems to have nested issues (vpanic already happening when drm ran into another issue). Also caused by a rm process is a bit weird (maybe pointing back to fs ?).

in either case (ZFS or graphics driver) you'd need PR as those projects are really big and informing developers about the issue is the best way to go.

Thanks for walking me through this and sorry for being unable to pay enough attention. I only have a mobile device now - so I'm not sure how I can accomplish this. I think I can say with some confidence though that the panic is reproduced when zfs_enable is present uncommented in the rc.conf file. Whenever it's commented out the system boots normally like before (but unable to login due to fs missing, even if I manually try to mount the zfs system with readonly=off the GUI login doesn't work and goes blank)

Regarding stacktrace - the pictures I posted earlier too were at the bottom of the panic apparently. It's difficult to get the scrolling output given I only have a phone. If there's something easy that I can do then please let me know.

Thanks again.

_martin · Dec 11, 2022

If you have important data on this pool, doing less is better. You can always boot from USB/CD and recover the data from there (i.e. activate the pool in recovery/live mode, mount the fs, copy the data to an external disk, etc.).
As it's not that hard to enable those dumps, please do that. Without being able to see the crash and/or logs it's pretty much guesswork.
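The live-media recovery outlined above could be sketched roughly as follows. Everything device- or path-specific is an assumption: /dev/ada0p4 stands in for the GELI partition, /mnt/usr/home for wherever the home dataset lands under the alternate root, and /backup/home for the external disk. The `run` helper and `DRY_RUN` are hypothetical and only print the commands by default.

```sh
#!/bin/sh
# Hypothetical helper: print each command, execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Attach the encrypted partition, import the pool read-only, copy files off.
rescue_copy() {
    pool="$1"; geli_part="$2"; dest="$3"
    run geli attach "$geli_part"                         # asks for the GELI passphrase
    run zpool import -f -o readonly=on -R /mnt "$pool"   # read-only, mounted under /mnt
    run zfs list -r -o name,mountpoint,mounted "$pool"   # see which datasets got mounted
    run rsync -a /mnt/usr/home/ "$dest"/                 # source path is a guess at the layout
}

rescue_copy zroot /dev/ada0p4 /backup/home
```

rsync is not part of the FreeBSD base system, so on stock install media `cp -Rp` or `tar` would have to stand in for it.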

cracauer@ · Dec 11, 2022

Tracker said:
Ok so I installed mprime and trying stress testing with default options now (2 cores instead of 4 i believe - this machine is pretty old so I guess that should be reasonable? Was already having temp issues when overloaded). Just hoping CPU is running reasonably fine.

How long does this usually run?

And what should I do to test RAM?

I run mprime for 24 hours, but 3 should be fine, too.

For RAM I use the Linux binary of SuperPi.
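A time-boxed mprime run could be sketched like this. `mprime -t` starts the torture test and `timeout 3h` stops it after three hours. The `run` helper, `DRY_RUN`, and the `bad_summaries` filter are hypothetical, and the filter assumes summary lines worded like "0 errors, 0 warnings", which may not match every mprime version.

```sh
#!/bin/sh
# Hypothetical helper: print each command, execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Print every summary line that reports a nonzero error or warning count.
# Reads the named files, or standard input when none are given.
# Exit status 1 means no such line was found.
bad_summaries() {
    grep -E '[0-9]+ errors?,? [0-9]+ warnings?' "$@" |
        grep -vE '(^|[^0-9])0 errors?,? 0 warnings?'
}

run timeout 3h mprime -t      # torture test, stopped after three hours
```

If the run's output was saved to a file, `bad_summaries that-file` makes the rolling output easier to check than glancing at the screen.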

cracauer@ · Dec 11, 2022

Tracker said:
I *think* this is the root problem now. When I do `zpool status -v` it's again showing me the permanent errors to be in the following files
`zroot/tmp:<0x3>` like I mentioned earlier.

Try this:

Code:

zdb -c poolname

Tracker · Dec 11, 2022

cracauer@ said:
Try this:

Code:

zdb -c poolname

Seems to show error counts:
Error No   Count
97         1

Please check image below for full screen output

Tracker · Dec 11, 2022

_martin said:
If you have important data on this pool doing less is better. You can always boot the USB/cd and recover the data from there (i.e. activate pool in recovery/live mode, mount fs, copy data to an external disk,etc.).
As it's not that hard to enable those dumps please do that. We can only guess what's happening but without being able to see the crash and/or logs it's pretty much a guess work.

Ok, makes sense, I'm trying to figure out the basics of how to use zfs snapshots/BEs to recover data to another disk. Need to buy that too. Was hoping this would get fixed by software without need for additional hardware.

I just checked 'dumpdev="AUTO"' was present all this while! However there's no dumpdev installed on the system. Assuming I have to install it and it works - how will I possibly share it here? On mobile.

cracauer@ said:
I run mprime for 24 hours, but 3 should be fine, too.

For RAM I use the Linux binary of SuperPi.

Thanks. Somehow pkg search SuperPi doesn't return any results.

elgrande · Dec 11, 2022

It cannot be emphasized enough.
If you have important data only on this device, NOW is the time to backup as much as you can before fiddling further with ZFS.
Since you can still read the data a backup should be possible.

_martin · Dec 11, 2022

Tracker said:
However there's no dumpdev installed on the system.

Oops, my typo. dumpon -l to list the dump device, dumpon /dev/diskNpX to use the swap as the dump device. The AUTO part of rc.conf should then automatically use that device.
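Put together, enabling crash dumps might look like the sketch below. /dev/ada0p3 is only a placeholder for the swap partition (check `gpart show` for the real one), and the `run` helper with its `DRY_RUN` switch is hypothetical, printing the commands by default instead of running them.

```sh
#!/bin/sh
# Hypothetical helper: print each command, execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Persist the rc.conf setting and point the kernel at the swap partition.
enable_crash_dumps() {
    swapdev="$1"
    run sysrc dumpdev=AUTO     # writes dumpdev="AUTO" to /etc/rc.conf
    run dumpon -l              # list the currently configured dump device
    run dumpon "$swapdev"      # use this swap partition for kernel dumps from now on
}

enable_crash_dumps /dev/ada0p3
```

After the next panic and a normal boot, the saved dump should show up under /var/crash.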

Tracker · Dec 11, 2022

Than

elgrande said:
It cannot be emphasized enough.
If you have important data only on this device, NOW is the time to backup as much as you can before fiddling further with ZFS.
Since you can still read the data a backup should be possible.

Good point. Just trying to figure out how to save the data using snapshots? onto another hard drive. Never really faced this situation.

A) Is it possible to use snapshots/BE to get the exact replica of current data?

B) If A Is possible, would such a setup require an exact set-up as the current machine on the second hard drive? ( It's GELI encrypted)

elgrande · Dec 11, 2022

Tracker said:
Than

Good point. Just trying to figure out how to save the data using snapshots? onto another hard drive. Never really faced this situation.

A) Is it possible to use snapshots/BE to get the exact replica of current data?

B) If A Is possible, would such a setup require an exact set-up as the current machine on the second hard drive? ( It's GELI encrypted)

Since you can mount the zfs volume, you can just copy the data to another device I guess?

free-and-bsd · Dec 11, 2022

Tracker said:
I was actually using `zfs list` output, all this while, to set readonly=off variable to be able to edit files in single user mode.

This command actually gives some errors that are related to Chrome!!! Had a sneaky feeling something had to do with Chromium

See attached image below
zpool status -v
It asks to restore the file on question if possible or to restore entire pool from backup. What should I do?

No I had the standard zfs, might have switched automatically if freebsd changed it with versions 12.x to 13.1.

I however remember doing some operations with zfs and asking about it when it wasn't working earlier. Maybe I messed up something then (however it worked fine for a couple of months) that's come back to bite me now

What kind of hard drive are you using? Not SSD perchance?

Well anyway, I have experienced these permanent errors in a ZFS pool. This has nothing to do with RAM, evidently. In my case (2 cases actually) it was failing hard drive. Yes, this happens to hard drives and nothing can be done about it.

You can very well copy whatever files you need from your old pool onto a new hard drive, either with rsync or by using zfs send to send a snapshot of your entire pool. Read-only mode is no problem in that case, as "read" is all you will need. If you choose the zfs send command, then your pool errors will, of course, be copied over as well, but on a NEW hard drive they will be easily fixed. That they can't be fixed now is likely caused by the failing hard drive.

free-and-bsd · Dec 11, 2022

Tracker said:
Than

Good point. Just trying to figure out how to save the data using snapshots? onto another hard drive. Never really faced this situation.

A) Is it possible to use snapshots/BE to get the exact replica of current data?

B) If A Is possible, would such a setup require an exact set-up as the current machine on the second hard drive? ( It's GELI encrypted)

Yes, that's the beauty of zfs & snapshots, it will be exactly all the data you have there, including your setup. Not sure about GELI or other kind of encryption though, never tried that. However, you can mount your decrypted stuff read-only and then copy over all the data using rsync. Then you will use whatever encryption method you prefer on the new hard drive.

But if I were you, I would first get things settled with the data, then worry about encryption etc. Unless, of course, we're talking here about TBs of data, which you never mentioned in your messages ))

free-and-bsd · Dec 11, 2022

BTW, you can use that Ubuntu USB stick you've mentioned earlier and use the HDD diagnostic tool (forgot its name, sorry) they have in every distro. At least you'll see if your disk shows errors...
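One tool that can do this kind of check is smartctl from smartmontools, though it may not be the one meant above. A minimal sketch, with /dev/sda as a placeholder device name (it would be something like /dev/ada0 on FreeBSD) and a hypothetical `run` helper that only prints the commands unless DRY_RUN=0:

```sh
#!/bin/sh
# Hypothetical helper: print each command, execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Query SMART data and start an extended self-test on one disk.
check_disk() {
    disk="$1"
    run smartctl -H "$disk"             # overall health verdict
    run smartctl -A "$disk"             # vendor attributes (reallocated sectors, wear level)
    run smartctl -t long "$disk"        # start an extended self-test in the background
    run smartctl -l selftest "$disk"    # read the self-test log once the test has finished
}

check_disk /dev/sda
```

These commands talk to the drive itself, so they work the same whether the disk holds ZFS, GELI, or anything else.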

Tracker · Dec 11, 2022

free-and-bsd said:
What kind of hard drive are you using? Not SSD perchance?

Yes, SSD, Samsung's 860. And I was also using swap on it a fair bit due to system load (8 gb ram which used to fall short so kept 8gb swap- used to run near capacity most of the time). Possibly that might have caused the hard drive to wear down faster? I mean it's still not clear definitely that it's a hard drive issue but I used to always think about the stress on hard drive with swap being almost fully used.

free-and-bsd said:
You can very well copy whatever files you need from your old pool onto a new hard drive by either rsync or by zfs send sending a snapshot of your entire pool. Read-only mode is no problem in that case as "read" is all you will need. If you choose zfs send command, then your pool errors will, of course, be copied over as well, but on a NEW hard drive they will be easily fixed. Because their not being fixed is likely caused by failing hard drive.

Interesting. So IF it's due to a failing hard drive then the error on the new one should be fixable? How would I go about fixing it? Using scrub?

free-and-bsd said:
Not sure about GELI or other kind of encryption though, never tried that. However, you can mount your decrypted stuff read-only and then copy over all the data using rsync. Then you will use whatever encryption method you prefer on the new hard drive.

Yes so I guess I'll have to install a vanilla Freebsd 13.1 , with zfs, then go about copying using zfs send?

free-and-bsd said:
BTW, you can use that Ubuntu USB stick you've mentioned earlier and use the HDD diagnostic tool (forgot its name, sorry) they have in every distro. At least you'll see if your disk shows errors.

I'll try to boot using that stick. I'm not sure if it would be able to properly check hard drive for errors given it's zfs+encrypted, would it?

free-and-bsd · Dec 11, 2022

Tracker said:
Yes, SSD, Samsung's 860. And I was also using swap on it a fair bit due to system load. Possibly that might have caused the hard drive to wear down faster? I mean it's still not clear definitely that it's a hard drive issue but I used to always think about the stress on hard drive with swap being almost fully used.

Interesting. So IF it's due to a failing hard drive then the error on the new one should be fixable? How would I go about fixing it? Using scrub?

Yes so I guess I'll have to install a vanilla Freebsd 13.1 , with zfs, then go about copying using zfs send?

I'll try to boot using that stick. I'm not sure if it would be able to properly check hard drive for errors given it's zfs+encrypted, would it?

Ok, point by point)))

1) Actually, Samsung is an SSD manufacturer with a good reputation. Still, things happen...

2) Yes, the error will be copied over to the new pool, but there it WILL be fixed with scrub, and you also WILL be able to delete the unfortunate files.

3) Actually, if you're "expert" enough, you may boot from a 13.1 installation media, create a zpool on your new HDD, then use zfs send command to send your old pool (snapshot) to the new zpool. That will restore ALL you have in the old pool to the new pool. You will then be able to boot from that stuff same as you booted from the old one. Well, some basic setup for booting will be necessary, of course.

4) That tool checks the HDD at the low hardware level. It cares nothing about the stuff that's written to the disk. We're supposedly dealing here with hardware level problems. And ZFS + its encryption is, as you would know, software level.
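Points 2 and 3 could be sketched roughly like this. The names newpool, /dev/ada1 and the snapshot name rescue are placeholders, and the `run` helper with `DRY_RUN` is hypothetical, so nothing is executed by default. Two caveats, as far as I know: taking the snapshot needs the old pool imported read-write, and `zfs send` may abort with an I/O error if it hits a damaged block, in which case a file-level copy is the fallback.

```sh
#!/bin/sh
# Hypothetical helper: print each command, execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Create a pool on the new disk, replicate the old pool into it, then scrub.
replicate_pool() {
    old="$1"; new="$2"; disk="$3"; snap="${4:-rescue}"
    run zpool create "$new" "$disk" || return 1     # destroys whatever is on $disk
    run zfs snapshot -r "$old@$snap" || return 1    # recursive snapshot of the old pool
    echo "+ zfs send -R $old@$snap | zfs receive -u -F $new"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        zfs send -R "$old@$snap" | zfs receive -u -F "$new" || return 1
    fi
    run zpool scrub "$new" || return 1              # verify the copy on the new disk
    run zpool status -v "$new"                      # list anything still reported as damaged
}

replicate_pool zroot newpool /dev/ada1
```

Boot code and the bootfs property on the new disk are left out here; that is the "basic setup for booting" mentioned in point 3.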
