Major ZFS data corruption

Hi. I'm experiencing pretty major corruption on a large-ish ZFS pool, and I'm wondering if anybody has seen anything similar or has some tips to aid recovery.

I have rolling snapshots set up using zfsnap, but I only have backups of the most critical data from this pool.

I'm running FreeBSD 12.2-RELEASE (only just upgraded while migrating the OS and disks to new hardware, to test whether a SATA controller was causing the disk read errors).


I have a 9-disk RAID-Z2 array and one of my drives has been failing (it powers on and can be added to the pool, but during a resilver there are so many read errors that it gets kicked out of the pool). Today the resilver hung for so long that I unplugged the faulted drive.

Afterwards I performed a scrub on the remaining 8 drives, which produced well over 100,000 data errors. The verbose output showed that these were mostly in snapshots. I destroyed some of those snapshots to help clear the errors (which was likely a mistake) and re-scrubbed.
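
For reference, the sequence I ran was roughly the following (the snapshot name is just illustrative of zfsnap's naming):

Code:
# scrub the degraded pool and list the affected files
zpool scrub pool
zpool status -v pool

# destroy one of the snapshots that held corrupt blocks (example name)
zfs destroy pool/domain/storage@2020-10-01_00.00.00--1m

# clear the error counters and scrub again
zpool clear pool
zpool scrub pool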

I’ve never seen more than a few corrupt files after a scrub before, but since most of these errors were in snapshots I didn’t think it was too big a deal. I have previously seen errors in a degraded pool disappear once all drives were present again, so I assumed that's what would happen here.



I also had a few whole directories that were corrupt, and I tried replacing them with the contents of their snapshots, but during the “cp” operation cp gave an “integrity check failure” for each file, so I wasn’t able to restore any files out of the snapshots.
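
What I was attempting was essentially this (dataset, snapshot and directory names are just examples):

Code:
# restore a directory from a snapshot via the hidden .zfs tree
cp -Rp /pool/domain/storage/.zfs/snapshot/2020-10-01_00.00.00--1m/imagery \
       /pool/domain/storage/imagery.restored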


I had to reboot my machine, and afterwards another resilver was triggered. This seems to have turned a bad situation disastrous, because I now seem to have corruption in the main/root directories of every ZFS dataset, meaning that I can’t see ANY of the files in my pool; the datasets just appear to be empty directories.

I also now can't list the ".zfs/snapshot" directory in many of the ZFS datasets, so it doesn't seem as if I can even restore from the snapshots.



I’m hoping that if I can somehow clone my failed disk to a replacement and online that cloned disk, the pool can rebuild itself, but I’d like to hear some expert opinions before I do anything worse and irreversible.
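
Something like this is what I have in mind for the clone, assuming the failed disk shows up as /dev/ada7 and the replacement as /dev/ada9 (device names are placeholders):

Code:
# block-level copy that skips unreadable sectors instead of aborting,
# padding them with zeros so offsets stay aligned
dd if=/dev/ada7 of=/dev/ada9 bs=1m conv=noerror,sync

# FreeBSD's recoverdisk(1) is an alternative; it retries bad ranges
# with progressively smaller block sizes
recoverdisk /dev/ada7 /dev/ada9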




Here's the current state of the pool...


Code:
        NAME                     STATE     READ WRITE CKSUM
        pool                     DEGRADED     0     0 33.6K
          raidz2-0               DEGRADED     0     0 88.3K
            gpt/badger0          ONLINE       0     0     0
            gpt/rabbit0          ONLINE       0     0     0
            gpt/mastadon0        ONLINE       0     0     0
            gpt/panda0           ONLINE       0     0     0
            gpt/pony0            ONLINE       0     0     0
            gpt/poodle0          ONLINE       0     0     0
            2064227176711486949  REMOVED      0     0     0  was /dev/gpt/swine0
            gpt/zebra0           ONLINE       0     0     0
            gpt/opossum0         ONLINE       0     0     0

errors: 23889 data errors, use '-v' for a list



zfs list shows the data still on disk (with 17.8T used)

Code:
NAME                                AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
pool                                 289G  17.8T         0    219K              0      17.8T
pool/assets                          289G  57.8G         0    201K              0      57.8G
pool/assets/stacks               289G  57.8G         0   57.8G              0          0
pool/domain                        289G  2.23T         0    201K              0      2.23T
pool/domain/storage          289G  2.23T     4.83M   2.02T              0       222G
pool/flatdata                        289G  15.2T         0    201K              0      15.2T
pool/project                         289G  14.2G         0    201K              0      14.2G
pool/project/docs                289G  14.2G         0   14.2G              0          0
pool/tablespace                  289G   282G         0    265K              0       282G




Here's an example of a corrupt directory root (no files or directories are visible):

Code:
errors: Permanent errors have been detected in the following files:

   pool/domain/storage:<0x0>
   pool/domain/storage:<0xbe2f>
   pool/domain/storage/imagery:<0x0>
   pool/domain/storage/imagery:<0x1b208>
   pool/domain/storage/imagery:<0x18890>
   pool/domain/storage/imagery:<0x68db8>

   <metadata>:<0x12be>
   <metadata>:<0x1bf>
   <metadata>:<0x18cb>

   etc.



Any help with this would be greatly appreciated. Sorry for the essay. I don't really understand how the corruption can get so bad with only one lost disk.
 
You are using RAID-Z2. That can tolerate any two faults. You have one failed disk, now removed, the one called "swine". If that is the only failure, what you are observing makes no sense.

Question: Are you sure there are no I/O errors on the other drives? Do you know the state of the failed drive? Have you run smartctl on it? Do you understand the failure mechanism? There has to be more to the story.

I would stop worrying about directories, snapshots, and files. Copying between a snapshot and the file is pointless ... in most cases they share the same underlying data anyway (if the content hasn't changed). To begin with, I would stop doing any operations that change the state of the file system, because right now it is clearly very sick. I would begin by diagnosing the actual faults. If you find that you really had only one disk with errors (the one that is now gone), then what you are describing indicates severe bugs in ZFS, which is nearly unfathomable. If you find that you had widespread errors on multiple disks (3 or more), then you have a problem. Also, begin trying to obtain a spare disk to resilver onto (to restore full redundancy), in case there is something to be saved.
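
To diagnose, I would start with something along these lines on each member disk (device names below are placeholders):

Code:
# overall SMART health, attributes, and the drive's own error log
smartctl -H /dev/ada0
smartctl -A /dev/ada0
smartctl -l error /dev/ada0

# per-device read/write/checksum counters as ZFS sees them
zpool status -v pool

# kernel-level CAM/ATA errors that never make it into SMART
dmesg | grep -iE 'cam|ata|timeout|error'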
 
Thanks so much. Yeah it seems pretty insane to me. This seems like exactly the sort of thing ZFS is supposed to protect against.

It almost seems that a scrub with one drive missing, then another scrub with a (possibly) different drive missing has led to metadata corruption.

I would have thought that any faults would be resolved by the scrubbing process rather than made worse. I think (and hope) that you're right about these error messages being misleading.




On my previous machine, the symptom was that 1 drive was frequently being kicked out of the pool (and sometimes 2 drives were lost). It wasn't always the same drive, but it was difficult to isolate the fault on that box, so I'm not 100% sure.


I was usually able to complete full scrubs or resilvers after adding the drives back, though, and it might be weeks or months before I saw the issue again.
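
By "adding the drives back" I mean roughly this (using swine0 as the example device):

Code:
# bring the previously kicked-out device back into the vdev
zpool online pool gpt/swine0

# then watch the resilver progress
zpool status -v pool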


I ran smartctl -t long on those drives and that didn't pick up any errors. The problem wasn’t always producing SMART errors in the disk logs when running smartctl -a, either.


At first I thought it was a controller issue, because the previous machine was giving similar read faults across a couple of drives, and since moving those drives to a different box they haven't given any issues. I've had overheating issues with my (old) add-on controller before (but installing an overkill fan seemed to fix that). I replaced/swapped cables as well. The drives themselves have had good airflow and weren't overheating.


I never managed to fully diagnose the issue, but thought the best thing was to move all drives to a new host for testing. The new machine has 10 on-board SATA ports and much better cooling.



Running smartctl -t short /dev/{failed-drive} results in an inability to complete the test, and there are numerous recent errors listed on that drive.
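
For completeness, this is how I've been checking the results (device name is a placeholder):

Code:
# self-test history (shows whether the short/long tests completed)
smartctl -l selftest /dev/ada7

# the drive's internal error log with the recent errors I mentioned
smartctl -l error /dev/ada7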


I’ve also found a second drive with similar recent errors, but that drive seems to be participating in the pool without issue.



I have an additional drive with a similar set of errors from 80 days ago, and another from even longer ago. So it’s not the same drive(s) causing the issue; that’s why I thought it was the controller, not the drives.



Anyway, I’m running tests on my remaining drives and am planning to try to clone the failed drive (if possible).

I still really don't understand how there can be so much widespread corruption.
 
UPDATE.

Okay so it seems pretty obvious now that I have 2 faulty drives.

When I mount the zpool with the 7 working drives I still see the mass corruption. The thing that worries me most is that during a zpool import it says:

Code:
cannot mount 'pool/domain/storage': Input/output error
cannot mount 'pool/flatdata': Input/output error

etc.

So effectively I can no longer mount 95% of the pool’s data.

When I run a zpool scrub with 7 drives it fixes some errors BUT it only scrubs a small portion of the pool because most of the data can’t be mounted (i.e. it won’t scrub through the datasets it refuses to mount).



I’m worried that if I now add 2 new empty drives then it’s just going to resilver this corrupt state across to the new drives with no chance of recovery.



Is there a way to manually or force-mount a dataset, or are there any debugging or other tools that I can use to get better visibility into what’s happening?
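
For example, I'm wondering whether something like a read-only import or poking around with zdb would help here (I haven't tried these yet):

Code:
# re-import the pool read-only so nothing further gets written
zpool export pool
zpool import -o readonly=on pool

# dump dataset/object information without mounting anything
zdb -d pool
zdb -d pool/domain/storage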
 
Just for more info: when I attempt a "zfs mount", these messages show up in my logs.

Code:
Nov  5 05:18:40 HOSTNAME ZFS[56447]: pool I/O failure, zpool=$pool error=$97
Nov  5 05:18:40 HOSTNAME ZFS[56448]: checksum mismatch, zpool=$pool path=$/dev/gpt/poodle0 offset=$2436279345152 size=$4096
Nov  5 05:18:40 HOSTNAME ZFS[56449]: checksum mismatch, zpool=$pool path=$/dev/gpt/pony0 offset=$2436279345152 size=$4096
Nov  5 05:18:40 HOSTNAME ZFS[56450]: checksum mismatch, zpool=$pool path=$/dev/gpt/panda0 offset=$2436279345152 size=$8192
Nov  5 05:18:40 HOSTNAME ZFS[56451]: checksum mismatch, zpool=$pool path=$/dev/gpt/mastadon0 offset=$2436279345152 size=$8192
Nov  5 05:18:40 HOSTNAME ZFS[56452]: checksum mismatch, zpool=$pool path=$/dev/gpt/rabbit0 offset=$2436279345152 size=$8192
Nov  5 05:18:40 HOSTNAME ZFS[56453]: checksum mismatch, zpool=$pool path=$/dev/gpt/badger0 offset=$2436279345152 size=$8192
Nov  5 05:18:40 HOSTNAME ZFS[56454]: checksum mismatch, zpool=$pool path=$/dev/gpt/poodle0 offset=$1654562660352 size=$4096
Nov  5 05:18:40 HOSTNAME ZFS[56455]: checksum mismatch, zpool=$pool path=$/dev/gpt/pony0 offset=$1654562660352 size=$4096
Nov  5 05:18:40 HOSTNAME ZFS[56456]: checksum mismatch, zpool=$pool path=$/dev/gpt/panda0 offset=$1654562660352 size=$8192
Nov  5 05:18:40 HOSTNAME ZFS[56457]: checksum mismatch, zpool=$pool path=$/dev/gpt/mastadon0 offset=$1654562660352 size=$8192
Nov  5 05:18:40 HOSTNAME ZFS[56458]: checksum mismatch, zpool=$pool path=$/dev/gpt/rabbit0 offset=$1654562660352 size=$8192
Nov  5 05:18:40 HOSTNAME ZFS[56459]: checksum mismatch, zpool=$pool path=$/dev/gpt/badger0 offset=$1654562660352 size=$8192
Nov  5 05:18:40 HOSTNAME ZFS[56460]: checksum mismatch, zpool=$pool path=$/dev/gpt/poodle0 offset=$2436279345152 size=$4096
Nov  5 05:18:40 HOSTNAME ZFS[56461]: checksum mismatch, zpool=$pool path=$/dev/gpt/pony0 offset=$2436279345152 size=$4096
Nov  5 05:18:40 HOSTNAME ZFS[56462]: checksum mismatch, zpool=$pool path=$/dev/gpt/panda0 offset=$2436279345152 size=$8192
Nov  5 05:18:40 HOSTNAME ZFS[56463]: checksum mismatch, zpool=$pool path=$/dev/gpt/mastadon0 offset=$2436279345152 size=$8192
Nov  5 05:18:40 HOSTNAME ZFS[56464]: checksum mismatch, zpool=$pool path=$/dev/gpt/rabbit0 offset=$2436279345152 size=$8192
Nov  5 05:18:40 HOSTNAME ZFS[56465]: checksum mismatch, zpool=$pool path=$/dev/gpt/badger0 offset=$2436279345152 size=$8192
Nov  5 05:18:40 HOSTNAME ZFS[56466]: checksum mismatch, zpool=$pool path=$/dev/gpt/poodle0 offset=$1654562660352 size=$4096
Nov  5 05:18:40 HOSTNAME ZFS[56467]: checksum mismatch, zpool=$pool path=$/dev/gpt/pony0 offset=$1654562660352 size=$4096
Nov  5 05:18:40 HOSTNAME ZFS[56468]: checksum mismatch, zpool=$pool path=$/dev/gpt/panda0 offset=$1654562660352 size=$8192
Nov  5 05:18:40 HOSTNAME ZFS[56469]: checksum mismatch, zpool=$pool path=$/dev/gpt/mastadon0 offset=$1654562660352 size=$8192
Nov  5 05:18:40 HOSTNAME ZFS[56470]: checksum mismatch, zpool=$pool path=$/dev/gpt/rabbit0 offset=$1654562660352 size=$8192
Nov  5 05:18:40 HOSTNAME ZFS[56471]: checksum mismatch, zpool=$pool path=$/dev/gpt/badger0 offset=$1654562660352 size=$8192
 
RAID-Z2 should handle 2 faulty drives (or any two faulty sectors). I suspect that you really have 3 faults. Perhaps some of the faults are moving around? At any given moment, you only see 1 or 2, but there is other data that was unreadable before and has been declared faulty?
 
Thanks again.

Yeah, my other drives really do “seem” to be okay, but just because they're passing SMART tests doesn't guarantee that.

I’m beginning to think that I actually did have a controller issue, which was causing zpool faults but not disk faults. BUT then I also had one disk that hard-failed recently and another that was soft-failing for a while (i.e. it mostly works but seems to fail consistently near the end of the drive). I think this is why my recent scrubs seemed to hang at around 93%.



The fact remains that I currently have a degraded but working pool that can only be partially mounted. It really does seem that the zpool scrubbing or resilver process introduced the corruption rather than healing it. Hopefully with all 9 drives working it will have the ability to recover itself.

I’ve read some other posts about ZFS metadata corruption, and it does seem possible but rare.



I’m going to clone all my drives and work from copies before I even try to perform a recovery, but that will take several days.
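
The rough plan for working from copies is something like this (device names and paths are placeholders):

Code:
# image each pool member (the GPT partition, so the ZFS label sits at
# the start of the file) onto a separate scratch volume
dd if=/dev/gpt/badger0 of=/scratch/badger0.img bs=1m conv=noerror,sync

# ...repeat for the other members, then try importing read-only from
# the directory of image files instead of the real disks
zpool import -d /scratch -o readonly=on pool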

I’ll report back if I find anything that might be of interest to the community.
 