ZFS Permanent Errors on Disk Replacement Resilvering

dave · Sep 5, 2020

Code:

pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 1.67T in 2 days 02:43:35 with 1 errors on Fri Sep  4 17:13:53 2020
config:

    NAME                        STATE     READ WRITE CKSUM
    tank                        DEGRADED     0     0     1
      raidz1-0                  DEGRADED     0     0     2
        label/zdisk1            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk2            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk3            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk4            ONLINE       0     0     0  block size: 512B configured, 4096B native
        replacing-4             UNAVAIL      0     0     0
          13239389112982662359  UNAVAIL      0     0     0  was /dev/label/zdisk5/old
          label/zdisk5          ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk6            ONLINE       0     0     0  block size: 512B configured, 4096B native

errors: Permanent errors have been detected in the following files:

        tank/video@autosnap-weekly.2020-08-09.00:00:00:/tank/Blah/Blah/Blah/FooBar.mp4

What is the correct way forward here? I have read the illumos link, but it leaves me scratching my head. Questions:

Can I try deleting without breaking things?
Should I try to rm the file, or destroy the snapshot?
Or am I looking at restoring from backup either way?

Lamia · Sep 5, 2020

In my experience, the culprits have been snapshots, MySQL data files and, at times, images. Destroying the affected snapshot with zfs destroy and restarting the server should bring the pool back online.

dave · Sep 5, 2020

I destroyed the snapshot and rebooted. Now I have this:

Code:

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Sep  4 18:44:08 2020
    42.1G scanned at 1.00G/s, 19.4M issued at 472K/s, 10.0T total
    0 resilvered, 0.00% done, no estimated completion time
config:

    NAME                        STATE     READ WRITE CKSUM
    tank                        DEGRADED     0     0     0
      raidz1-0                  DEGRADED     0     0     0
        label/zdisk1            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk2            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk3            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk4            ONLINE       0     0     0  block size: 512B configured, 4096B native
        replacing-4             DEGRADED     0     0     0
          13239389112982662359  UNAVAIL      0     0     0  was /dev/label/zdisk5/old
          label/zdisk5          ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk6            ONLINE       0     0     0  block size: 512B configured, 4096B native

errors: Permanent errors have been detected in the following files:

        <0x159>:<0x11bb9>

So I guess I will see how the resilver goes... ?

Thanks for the advice!

Lamia · Sep 5, 2020

The partitions will come back online, though the errors may remain. If you moved the server, you may need to unplug the drive and plug it back in.

dave · Sep 5, 2020

I assume I should wait for the resilver to finish...?

FYI, I still have the drive that is being replaced. I don't know if that could help or not.

Lamia · Sep 5, 2020

dave said:
I assume I should wait for the resilver to finish...?

FYI, I still have the drive that is being replaced. I don't know if that could help or not.

Wait and see.

dave · Sep 7, 2020

sudo zpool status -v tank

Code:

  pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 1.66T in 2 days 07:35:28 with 1 errors on Mon Sep  7 02:19:36 2020
config:

    NAME                        STATE     READ WRITE CKSUM
    tank                        DEGRADED     0     0     1
      raidz1-0                  DEGRADED     0     0     2
        label/zdisk1            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk2            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk3            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk4            ONLINE       0     0     0  block size: 512B configured, 4096B native
        replacing-4             DEGRADED     0     0     0
          13239389112982662359  UNAVAIL      0     0     0  was /dev/label/zdisk5/old
          label/zdisk5          ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk6            ONLINE       0     0     0  block size: 512B configured, 4096B native

errors: Permanent errors have been detected in the following files:

        tank/video@autosnap-weekly.2020-08-16.00:00:00:/tank/Blah/Blah/Blah/FooBar.mp4

I.e. same file, different snapshot. So...
sudo zfs destroy tank/video@autosnap-weekly.2020-08-16.00:00:00
sudo zpool status -v tank

Code:

  pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 1.66T in 2 days 07:35:28 with 1 errors on Mon Sep  7 02:19:36 2020
config:

    NAME                        STATE     READ WRITE CKSUM
    tank                        DEGRADED     0     0     1
      raidz1-0                  DEGRADED     0     0     2
        label/zdisk1            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk2            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk3            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk4            ONLINE       0     0     0  block size: 512B configured, 4096B native
        replacing-4             DEGRADED     0     0     0
          13239389112982662359  UNAVAIL      0     0     0  was /dev/label/zdisk5/old
          label/zdisk5          ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk6            ONLINE       0     0     0  block size: 512B configured, 4096B native

errors: Permanent errors have been detected in the following files:

        <0x73b>:<0x11bb9>

Finally,
sudo reboot

And...
sudo zpool status -v tank

Code:

  pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Sep  7 06:21:34 2020
    27.5G scanned at 783M/s, 1000K issued at 27.8K/s, 10.0T total
    0 resilvered, 0.00% done, no estimated completion time
config:

    NAME                        STATE     READ WRITE CKSUM
    tank                        DEGRADED     0     0     0
      raidz1-0                  DEGRADED     0     0     0
        label/zdisk1            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk2            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk3            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk4            ONLINE       0     0     0  block size: 512B configured, 4096B native
        replacing-4             DEGRADED     0     0     0
          13239389112982662359  UNAVAIL      0     0     0  was /dev/label/zdisk5/old
          label/zdisk5          ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk6            ONLINE       0     0     0  block size: 512B configured, 4096B native

errors: Permanent errors have been detected in the following files:

        <0x73b>:<0x11bb9>

So....

Lamia said:
Wait and see.

Should I go ahead and remove remaining snapshots on that dataset, or just go one by one?

Lamia · Sep 7, 2020

dave said:

Should I go ahead and remove remaining snapshots on that dataset, or just go one by one?

Code:

zfs list -H -rt snapshot -s creation -o name tank | xargs -n 1 zfs destroy -r

Try it without the zfs destroy part first, so you can check which snapshots it would remove.
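
The advice to try the pipeline without zfs destroy first can also be followed with a small wrapper that prints the commands instead of running them. This is only a sketch: the function name plan_snapshot_destroy is made up for illustration, and it assumes snapshot names arrive one per line, as zfs list -H -o name prints them.

```shell
# Hypothetical helper: read snapshot names on stdin (one per line) and print
# the destroy command for each one instead of running it.
# Only names containing '@' are accepted, so a stray dataset name is skipped.
plan_snapshot_destroy() {
    while IFS= read -r snap; do
        case "$snap" in
            *@*) printf 'zfs destroy %s\n' "$snap" ;;
            *)   printf 'skipping non-snapshot: %s\n' "$snap" >&2 ;;
        esac
    done
}

# Example (assumes the affected dataset is tank/video, as in the posts above):
#   zfs list -H -r -t snapshot -s creation -o name tank/video | plan_snapshot_destroy
```

zfs destroy also has its own dry-run mode: with -n it deletes nothing, and with -v added it reports what would be destroyed.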

Matlib · Sep 7, 2020

How did a file get corrupted on RAID5 when only 1 disk failed? Are you sure the pool was alright before the disk failed? Why did resilvering start again after the reboot?

After the resilver completes, deleting the broken files and running zpool scrub should clear the errors.

However, I played around a little with my test VM and actually managed to create an error like this:

Code:

root@freebsd12:/tank# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0 days 00:00:04 with 1 errors on Mon Sep  7 20:16:18 2020
config:

    NAME        STATE     READ WRITE CKSUM
    tank        ONLINE       0     0     1
      raidz1-0  ONLINE       0     0     2
        vtbd1   ONLINE       0     0     0
        vtbd2   ONLINE       0     0     0
        vtbd4   ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x48>

Removing the file linked to that metadata fixed the problem, but there is somehow less space available (the same files don't fit again).

Lamia · Sep 8, 2020

Matlib said:
Removing the file linked to that metadata fixed the problem,

If you don't mind me asking, how did you locate the file linked to the metadata? No filename is shown in the 'zpool status -v' output above.

dave · Sep 8, 2020

The filename was shown originally in both cases, and in both cases the affected file was in a snapshot: same file path, two different snapshots. I have destroyed both snapshots. Once a snapshot is destroyed, the filename is no longer shown. Also, destroying the snapshot and rebooting seems to trigger a new resilver after the reboot.

The pool was scrubbing without error once a month before I started this drive replacement. It was the last of 6 replacements from 2TB to 4TB drives in order to expand the pool. I started those replacements a couple of years ago. It's an old pool and, as you can see, it is not 4K-aligned, so I will have to migrate the data and start again anyway. Meanwhile, I'm waiting for the current resilver, which will take another day or so...
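
The "block size: 512B configured, 4096B native" note in the status output corresponds to the vdev's ashift value, which is the base-2 logarithm of the sector size ZFS uses for that vdev. A minimal sketch of that relationship (the function name ashift_to_bytes is made up for this example):

```shell
# Hypothetical helper: convert an ashift value to a block size in bytes.
# ashift is log2 of the sector size, so the size is 2^ashift.
ashift_to_bytes() {
    printf '%d\n' "$((1 << $1))"
}

# ashift_to_bytes 9    prints 512  (the "configured" size in this pool)
# ashift_to_bytes 12   prints 4096 (the "native" size of the drives)
```

ashift is fixed when a vdev is created, which is why the pool has to be rebuilt to change it. On FreeBSD, setting the sysctl vfs.zfs.min_auto_ashift to 12 before creating the new pool is the usual way to get 4K alignment, and zdb -C tank typically shows the value currently in use.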

Matlib · Sep 8, 2020

Lamia said:
If you don't mind me asking how did you locate the file linked to the metadata.

I kept removing files until the error disappeared.

Doesn't zdb retrieve such information though?
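
On the zdb question: once a snapshot is gone, zpool status falls back to a pair such as <0x159>:<0x11bb9>, which is the dataset ID and the object number, both in hexadecimal. zdb works with decimal object numbers, so a first step is converting them. The sketch below does only that conversion. The names to_dec and decode_error_ref are hypothetical, and whether zdb can still resolve the object depends on the dataset still existing.

```shell
# Hypothetical helper: convert a hex string like 0x159 to decimal,
# and pass anything else (e.g. "metadata") through unchanged.
to_dec() {
    case "$1" in
        0x*|0X*) printf '%d' "$(($1))" ;;
        *)       printf '%s' "$1" ;;
    esac
}

# Hypothetical helper: turn "<0x159>:<0x11bb9>" into "345 72633"
# (dataset ID and object number, both decimal).
decode_error_ref() {
    ref=$1
    ds=${ref%%:*}                     # "<0x159>"
    obj=${ref#*:}                     # "<0x11bb9>"
    ds=${ds#"<"};   ds=${ds%">"}      # strip the angle brackets
    obj=${obj#"<"}; obj=${obj%">"}
    printf '%s %s\n' "$(to_dec "$ds")" "$(to_dec "$obj")"
}
```

With the decimal values in hand, zdb -d tank lists the pool's datasets together with their IDs, and zdb -dddd followed by a dataset name and an object number dumps that object, including its path if it is a plain file.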

Lamia · Sep 9, 2020

Thanks; scrubbing did not help remove it here. And zpool status now shows the disk as removed, even though the hard disk is still in the machine and has not been touched for a long time.

dave · Sep 9, 2020

Code:

  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 1.64T in 2 days 05:56:39 with 1 errors on Wed Sep  9 12:18:13 2020
config:

    NAME                        STATE     READ WRITE CKSUM
    tank                        DEGRADED     0     0     1
      raidz1-0                  DEGRADED     0     0     2
        label/zdisk1            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk2            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk3            ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk4            ONLINE       0     0     0  block size: 512B configured, 4096B native
        replacing-4             DEGRADED     0     0     0
          13239389112982662359  UNAVAIL      0     0     0  was /dev/label/zdisk5/old
          label/zdisk5          ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk6            ONLINE       0     0     0  block size: 512B configured, 4096B native

errors: Permanent errors have been detected in the following files:

        /tank/Blah/Blah/Blah/FooBar.mp4

Notice that the problem is now in the file itself and not a snapshot.

Will attempting to delete this file cause the zpool or dataset to become unavailable?

(Sure would've been nice to get a list of all these issues instead of having to do a 2.5 day resilver for each one.)
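
zpool status -v only shows the errors ZFS has logged so far, so keeping a record between resilvers has to be done by hand. As a minimal sketch, the list under "errors:" can be pulled out of saved status output with a small filter. The function name list_permanent_errors is made up for this example, and it assumes the output layout shown in this thread.

```shell
# Hypothetical helper: print only the entries listed under
# "errors: Permanent errors ..." in `zpool status -v` output read from stdin.
list_permanent_errors() {
    awk '
        /^errors: Permanent errors/ { in_list = 1; next }   # list starts here
        /^[ \t]*pool:/              { in_list = 0 }         # next pool begins
        in_list && NF               { sub(/^[ \t]+/, ""); print }
    '
}

# Example:
#   zpool status -v tank | list_permanent_errors >> tank-errors.log
```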

VladiBG · Sep 9, 2020

zpool detach tank 13239389112982662359

If you need this file, you have to restore it from backup (the entire pool, if the file cannot be restored on its own). Otherwise you can delete the file and scrub the pool, but before that the pool must be healthy and free of cksum errors.
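
The order described in this post (detach the stale device, delete the file, scrub) can be sketched as a small script. This is an illustration under assumptions, not a tested procedure: run and cleanup_after_replace are hypothetical names, and by default the commands are only printed.

```shell
# With DRY_RUN=1 (the default here) commands are only printed;
# set DRY_RUN=0 to actually run them (as root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper; arguments: pool, stale device GUID, damaged file path.
cleanup_after_replace() {
    pool=$1 guid=$2 file=$3
    run zpool detach "$pool" "$guid" &&   # drop the UNAVAIL half of replacing-N
    run rm -- "$file" &&                  # remove the file with the permanent error
    run zpool scrub "$pool"               # re-verify the whole pool
}

# Example (dry run): cleanup_after_replace tank 13239389112982662359 /path/to/file
```

Printing first makes it easy to confirm the GUID and path before anything is changed, since zpool detach and rm cannot be undone.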

Lamia · Sep 12, 2020

Scrubbing indeed cleared it long ago.

dave · Sep 12, 2020

sudo zpool detach tank 13239389112982662359

Code:

sudo zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 1.64T in 2 days 05:56:39 with 1 errors on Wed Sep  9 12:18:13 2020
config:

    NAME              STATE     READ WRITE CKSUM
    tank              ONLINE       0     0     1
      raidz1-0        ONLINE       0     0     2
        label/zdisk1  ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk2  ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk3  ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk4  ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk5  ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk6  ONLINE       0     0     0  block size: 512B configured, 4096B native

errors: Permanent errors have been detected in the following files:

        /tank/video/Blah/Blah/Blah/FooBar.mp4

sudo rm /tank/video/Blah/Blah/Blah/FooBar.mp4

sudo zpool scrub tank

Code:

  pool: tank
 state: ONLINE
status: One or more devices are configured to use a non-native block size.
    Expect reduced performance.
action: Replace affected devices with devices that support the
    configured block size, or migrate data to a properly configured
    pool.
  scan: scrub repaired 0 in 0 days 10:11:03 with 0 errors on Fri Sep 11 21:50:04 2020
config:

    NAME              STATE     READ WRITE CKSUM
    tank              ONLINE       0     0     1
      raidz1-0        ONLINE       0     0     2
        label/zdisk1  ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk2  ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk3  ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk4  ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk5  ONLINE       0     0     0  block size: 512B configured, 4096B native
        label/zdisk6  ONLINE       0     0     0  block size: 512B configured, 4096B native

errors: No known data errors

So that worked! Thanks, everyone, for your input. Now I'm going to destroy the pool and restore the data anyway to fix the block size.

VladiBG · Sep 12, 2020

No, it's NOT fixed: you still have cksum errors. Can you check your log using zpool history -i tank and see what is logged there? Look for read/write errors on one of the hard disks (maybe the old one that was already removed) and, if you can, run a memory test to make sure there are no bad RAM modules (I hope you are using ECC RAM).

Do NOT clear the cksum errors (zpool clear tank raidz1-0) until you have figured out where they come from.
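
The point here is that a scrub can finish with "No known data errors" while the CKSUM column still shows non-zero counters for the pool and the raidz1-0 vdev. A minimal sketch for spotting such rows in zpool status output follows. The function name nonzero_error_counters is made up, and the column layout is assumed to match the output shown in this thread.

```shell
# Hypothetical helper: read `zpool status` output on stdin and print every
# config row whose READ, WRITE or CKSUM counter is not zero.
nonzero_error_counters() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }   # table header
        /^errors:/                    { in_cfg = 0 }         # end of table
        in_cfg && NF >= 5 && ($3 != 0 || $4 != 0 || $5 != 0) {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

# Example:
#   zpool status tank | nonzero_error_counters
```

Saving this output before any zpool clear keeps a record of which vdevs reported errors.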

dave · Sep 12, 2020

Ah, yes, I see, you're right. Thanks for the heads up. Anyway, unfortunately, I didn't have any more time to mess with it. I destroyed the pool and am restoring the data from backup now.