[Solved] ZFS pool got corrupted, kernel panic after import

I'm just guessing at this point, but I probably had a hardware issue with my main storage machine, most likely a faulty power supply.
The issue is that it rebooted multiple times despite being connected to a UPS.
I believe my pool got corrupted: the machine just constantly rebooted, tried to import/mount the pool, panicked, then rebooted again.
I tried some zdb commands I found on the net that seem to show the pool data, and I read that there might be a way to roll back the transaction log that could help me.
What would be the best way to start rescuing such a faulty pool?
It was running 12.1.
This is a two-disk mirror. I'm backing up one of the disks just to be safe, and I've set the other side of the mirror aside for now.
I took a screenshot, because I could not stop the box from rebooting after the panic in the rescue shell.
 

Attachments

  • zpool_import_web.jpg
After ensuring that you've got a backup, the next thing to try would probably be something along the lines of zpool import -Fn, which should tell you whether discarding the last few transactions would make the pool importable. Obviously, any data represented by those transactions would be permanently lost, but I'm guessing there probably isn't anything of importance in them.
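Spelled out, that suggestion would look something like this (the pool name zroot is an assumption from the later posts; nothing here is written to disk while -n is present):

```shell
# Dry run: -F tries recovery by discarding the last few transactions,
# -n only reports whether that would make the pool importable,
# without actually modifying anything on disk.
zpool import -F -n zroot

# If the dry run looks acceptable, repeat without -n to really roll back:
# zpool import -F zroot
```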

The other thing to try would probably be to use a different install.
 
Thanks for your reply, but that also made the system panic, even on a fresh install with all the latest updates. Some of my uberblocks got damaged during the endless restarting.
I made it work, and I will write down the process here later, but now I'm still backing up my stuff.
 
Some background info on my issue:
I observed the hardware issue I had during the restore operations as well, namely my machine just did a hard reset out of nowhere, no panic no alerts, just a reset, probably a power supply issue, or the start of some kind of fault in my motherboard. As I already wrote the initial state was that the system did a boot tried to import zroot that did a panic did a reboot and it got back to square one, at this point zpool import was not possible.

Importing the zroot pool (while it was still working) and then getting reset, over and over, took a toll on the pool. My part in the fault was that I did not know this was happening; if I had had some kind of watchdog that sent a mail when the system booted, or when the last reboot was not initiated by a user, this could have been avoided.
I did some research on zdb and looked into rescuing zfs pools in the last couple of days, and hopefully that was enough to bring back my data.
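The watchdog idea mentioned above can be as simple as a cron entry that mails on every boot, so a reboot loop gets noticed after its first iteration. A minimal sketch, assuming working local mail delivery (the recipient address is a placeholder):

```shell
# Hypothetical /etc/crontab entry: send a mail every time the system boots.
# A flood of these in the inbox would have revealed the reboot loop early.
@reboot root echo "$(hostname) booted at $(date)" | mail -s "BOOT ALERT: $(hostname)" admin@example.com
```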

I just copied my most valuable stuff to safety, and I'm running an all-inclusive sync to another pool just to keep me sane. I'm decommissioning the system that had this issue, though I'll probably keep the disks, because they operate just fine and show no issues in smartctl.

The rescue process:
I have a standard ZFS auto installation, which has a freebsd-zfs type partition on ada#p3.
I exported the uberblocks to a text file:
zdb -ul /dev/ada0p3 > /tmp/uberblocks.txt
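The dump can be skimmed by hand, or the txg/timestamp pairs can be pulled out with a little awk. The sample input below is fabricated, but it follows the `txg = N` / `timestamp = N UTC = ...` line layout of the uberblock dump:

```shell
# Fabricated sample in the shape of zdb -ul output; in the real case the
# file came from: zdb -ul /dev/ada0p3 > /tmp/uberblocks.txt
cat > /tmp/uberblocks.txt <<'EOF'
Uberblock[0]
    magic = 0000000000bab10c
    txg = 21347495
    timestamp = 1596425340 UTC = Mon Aug  3 03:29:00 2020
Uberblock[1]
    magic = 0000000000bab10c
    txg = 21347601
    timestamp = 1596432120 UTC = Mon Aug  3 05:22:00 2020
EOF

# Print each uberblock's txg next to its human-readable timestamp,
# to make picking a rollback target easier.
awk '/txg =/ { txg = $3 }
     /timestamp =/ { sub(/^[ \t]*timestamp = /, ""); print txg "\t" $0 }' \
    /tmp/uberblocks.txt
```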
Looking at the timestamps, there were mostly two: some from 05:22 AM and some from 03:29 AM. Because I had no big disk operations during the last day, I went with the transaction group that showed 03:29 AM, on the theory that the further it was from the time of the issue, the healthier the pool would be. The time difference probably matters, because it took almost a day to roll the pool back to that point. So I picked a txg to use with the pool import (-T txg#) and it started the process; my command was:

zpool import -N -o readonly=on -f -R /mnt -F -T 21347495 zroot

-N : import without mounting
-o readonly=on : pool will be imported in readonly mode
-f : force the import; this is needed if there's a sign the pool is in use, or if a crash meant it was never exported properly. AFAIK it's also needed if devices are missing from the pool
-R /mnt : altroot; also disables the cachefile
-F : recovery mode for a non-importable pool; ZFS will try to discard the last few transactions
-T : I found no entry for this switch in the manual, but according to some articles and mailing list threads, ZFS will try to roll back to the given transaction group
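Since -N skips mounting, the datasets have to be inspected and mounted by hand afterwards. Something along these lines; the dataset names are what a standard FreeBSD auto install creates, so treat them as assumptions:

```shell
# With the pool imported read-only and unmounted (-N), see what's there:
zfs list -r zroot

# Then mount individual datasets as needed; with -R /mnt they land
# under the altroot, e.g. /mnt for the root dataset:
zfs mount zroot/ROOT/default
zfs mount zroot/usr/home
```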

I monitored disk usage with gstat; most of the time it was at 100%, and no other ZFS-related command went through, which is understandable (I tried to create a new volume on another pool just for kicks). The whole thing took around 16 or 17 hours on a 3TB disk.

And voilà, my pool was there in read-only mode (it's degraded because I only plugged in half of the mirror, so if I mess up I still have a chance):

Code:
root@back1:~ # zpool status
  pool: zroot
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 0 days 06:57:25 with 0 errors on Sun Aug  2 10:10:54 2020
config:

        NAME                      STATE     READ WRITE CKSUM
        zroot                     DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            gpt/zfs2              ONLINE       0     0     0
            15526455252259167746  UNAVAIL      0     0     0  was /dev/gpt/zfs1


errors: No known data errors


I'm pretty impressed with how this came out; I thought my pool was toast. I've had multiple disk failures since I started using ZFS (usually every 2 or 3 years), and I'd just pop in new, bigger disks, resilver the mirror, and it was good to go. I never had a catastrophic failure like this before. I've learned my lesson: I will categorize my data and set up some kind of offsite or cloud backup for the stuff I value most. I will also monitor my machine for any strange hardware issues.
 
Oh, I almost forgot: I tried to export the pool's data structures into a file with zdb before importing (lots of data on a 3TB disk), but it dumped core after the first 80MB of text output, probably for the same reasons the import did not work.
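For anyone trying the same thing: the exact zdb invocation wasn't given, but dumping the structures of a pool that isn't imported would presumably look something like this (flags are an assumption on my part):

```shell
# -e reads an exported/not-imported pool by scanning devices;
# -dddd raises verbosity to dump datasets and their objects.
# On a large pool this produces a huge amount of text.
zdb -e -dddd zroot > /tmp/zroot_structure.txt
```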
 
I don't understand how a sequence of resets could have damaged the pool metadata that badly. But we'll probably never find out what really happened.
 
Do we know what can cause such an issue? I have the data from /var/crash now: three dumps with "Panic String: page fault" from last night, which means the system managed to import the pool at least three times that night.
The times I see are:
Aug 3 02:43
Aug 3 03:22
Aug 3 03:26
So 39 minutes passed between the first and the second, and the third panic came just 4 minutes after that.

I had set up multiple striped pools alongside the faulty mirror disk while I was trying to save it, and when the system went into a kernel panic after an import attempt on the corrupted one, the other, newly created stripe could only be imported with the force option due to inconsistency; I got the message that it was still in use by the same system. I can imagine some kind of operation was going on in the background during the restarts. This is a desktop configuration with non-ECC memory, so my first hunch was a memory issue, but memtest ran fine for multiple passes, although it's possible that some other underlying hardware problem caused this...
 
Do we know what can cause such an issue?
On the contrary. Your story starts with the system crashing several times in a row. Disks, the IO stack between the disks and the host, and file systems are carefully designed to not corrupt anything when the system crashes, even if it stops in the middle of an operation. So I have to assume that the data on disk may be a little stale (if a user operation was in progress or had recently happened when the crash happened, that operation may not have been hardened to disk), but the content of the disk should be consistent.

Underlying this thought is the question of what caused your kernel panics. I'm assuming that the panics happened after the reset, and that they happened because some data on disk is bad. The original kernel panic, of which you posted a photograph, shows ZFS reading the log (the ZIL) during boot, and that read leading to a memory error, probably due to an invalid address. Since we know that ZIL reading works correctly when the on-disk ZIL is formatted correctly, my assumption is that your ZIL is corrupted, so badly that it exposed a bug in the ZFS code. What is the bug? Normally, bad data from disk should never cause a memory error.

But if you read these two things carefully, you see that they contradict each other. The whole storage stack does NOT corrupt data on crash. Yet data was found corrupted. So you must have found a rare bug that makes the first half of the statement untrue.
 
When you put it like that, it reminds me of a similar problem I had ages ago, where I was trying to import a pool that hadn't been offlined and the computer kept panicking. IIRC, what I wound up doing was having it ignore the previous cache, and when the pool loaded, sending it to a new pool. It was ages ago, and I didn't think to take any notes about what I was doing or why.
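The "ignore the previous cache" step probably maps to something like this (pool name is a placeholder):

```shell
# Scan the devices directly instead of trusting the stale zpool.cache,
# and don't record this import in a cachefile either:
zpool import -d /dev -o cachefile=none tank
```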
 