Hi. I am experiencing pretty major corruption on a large-ish ZFS pool, and I'm wondering if anybody has seen anything similar or has some tips to aid recovery.
I have rolling snapshots set up using zfsnap, but I only have backups of my most critical data from this pool.
I'm running FreeBSD 12.2-RELEASE (I only just upgraded while migrating the OS and disks to new hardware, to test whether a SATA controller was causing the disk read errors).
I have a 9-disk RAIDz2 array, and one of my drives has been failing: it powers on and can be added to the pool, but during a resilver there were so many read errors that it got kicked out of the pool again. Today the resilver hung for so long that I unplugged the faulted drive.
Afterwards I performed a scrub on the remaining 8 drives, and it reported well over 100,000 data errors. The verbose output showed that these were mostly in snapshots, so I destroyed some of those snapshots to help clear the errors (which was likely a mistake) and re-scrubbed.
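For reference, these are roughly the commands I ran (the snapshot name below is only an illustrative zfsnap-style name, not the exact one I destroyed):

Code:
zpool scrub pool
zpool status -v pool        # listed the affected files, mostly in snapshots
zfs destroy pool/domain/storage@2021-01-10_00.00.00--1m    # example snapshot name only
zpool scrub pool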
I've never seen more than a few corrupt files after a scrub before, but since most of these errors were in snapshots I didn't think it was too big a deal. I have also seen errors in a degraded pool disappear once all drives were present again, so I assumed the same would happen here.
I also had a few whole directories that were corrupt. I tried replacing them with the contents of their snapshots, but cp reported an “integrity check failure” for each file, so I wasn't able to restore anything from the snapshots.
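The restore attempt looked roughly like this (the dataset, snapshot, and directory names are just placeholders):

Code:
cd /pool/domain/storage
# try to pull a corrupt directory back out of the most recent snapshot
cp -Rp .zfs/snapshot/2021-01-10_00.00.00--1m/imagery ./imagery.restored
# every single file failed with the integrity check error instead of copying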
I had to reboot the machine, and afterwards another resilver was triggered. This seems to have turned a bad situation disastrous, because I now have corruption in the root directory of every ZFS dataset, meaning I can't list ANY files in my pool: the dataset roots just appear to be empty directories.
I also now can't list the ".zfs/snapshot" directory in many of the datasets, so it doesn't look like I can even restore from those.
I'm hoping that if I can somehow clone my failed disk to a replacement and bring that cloned disk online, the pool can rebuild itself, but I'd like to hear some expert opinions before I do anything worse and irreversible.
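Concretely, the plan I have in mind looks something like this (device names are placeholders; recoverdisk(1) is just the tool I was thinking of using for the clone, because it retries and skips unreadable blocks instead of aborting):

Code:
# clone the failing disk (ada8, placeholder) onto a same-size replacement (ada9, placeholder),
# skipping whatever sectors can't be read
recoverdisk /dev/ada8 /dev/ada9

# the clone carries the same GPT label, so try to bring it back online in the pool
zpool online pool gpt/swine0
zpool status -v pool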
Here's the current state of the pool...
Code:
        NAME                     STATE     READ WRITE CKSUM
        pool                     DEGRADED     0     0 33.6K
          raidz2-0               DEGRADED     0     0 88.3K
            gpt/badger0          ONLINE       0     0     0
            gpt/rabbit0          ONLINE       0     0     0
            gpt/mastadon0        ONLINE       0     0     0
            gpt/panda0           ONLINE       0     0     0
            gpt/pony0            ONLINE       0     0     0
            gpt/poodle0          ONLINE       0     0     0
            2064227176711486949  REMOVED      0     0     0  was /dev/gpt/swine0
            gpt/zebra0           ONLINE       0     0     0
            gpt/opossum0         ONLINE       0     0     0

errors: 23889 data errors, use '-v' for a list
zfs list shows the data still on disk (with 17.8T used)
Code:
NAME                  AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
pool                   289G  17.8T         0    219K              0      17.8T
pool/assets            289G  57.8G         0    201K              0      57.8G
pool/assets/stacks     289G  57.8G         0   57.8G              0          0
pool/domain            289G  2.23T         0    201K              0      2.23T
pool/domain/storage    289G  2.23T     4.83M   2.02T              0       222G
pool/flatdata          289G  15.2T         0    201K              0      15.2T
pool/project           289G  14.2G         0    201K              0      14.2G
pool/project/docs      289G  14.2G         0    14.2G              0          0
pool/tablespace        289G   282G         0    265K              0       282G
Here's an example of a corrupt directory root (no files or directories are visible):
Code:
errors: Permanent errors have been detected in the following files:

        pool/domain/storage:<0x0>
        pool/domain/storage:<0xbe2f>
        pool/domain/storage/imagery:<0x0>
        pool/domain/storage/imagery:<0x1b208>
        pool/domain/storage/imagery:<0x18890>
        pool/domain/storage/imagery:<0x68db8>
        <metadata>:<0x12be>
        <metadata>:<0x1bf>
        <metadata>:<0x18cb>
etc.
Any help with this would be greatly appreciated. Sorry for the essay. I don't really understand how the corruption can get so bad with only one lost disk.