ZFS strange zpool IO errors

Any suggestions on this?

The zpool clear won't work and I am just getting vague reports about I/O errors, but dmesg is clean.

Recreating the pool from scratch is sadly not an option if I can avoid it, given how long it would take to restore everything.

Code:
[root@nibbles1]0|/root>zpool status -v storage
  pool: storage
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Recovery is possible, but will result in some data loss.
        Returning the pool to its state as of Thu Dec 16 02:30:52 2021
        should correct the problem.  Approximately 5 seconds of data
        must be discarded, irreversibly.  Recovery can be attempted
        by executing 'zpool clear -F storage'.  A scrub of the pool
        is strongly recommended after recovery.
   see: [URL]http://illumos.org/msg/ZFS-8000-72[/URL]
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        storage     FAULTED      0     0     1
          raidz2-0  ONLINE       0     0     6
            da2     ONLINE       0     0     0  block size: 512B configured, 4096B native
            da3     ONLINE       0     0     1  block size: 512B configured, 4096B native
            da4     ONLINE       0     0     0  block size: 512B configured, 4096B native
            da5     ONLINE       0     0     0  block size: 512B configured, 4096B native
            da6     ONLINE       0     0     0  block size: 512B configured, 4096B native
            da7     ONLINE       0     0     1  block size: 512B configured, 4096B native
[root@nibbles1]0|/root>zpool clear -F storage
cannot clear errors for storage: I/O error
 
Fixed via...

Code:
zpool export storage
zpool import -F storage
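Afterwards it is worth verifying the recovered pool; a rough sketch of that, assuming the pool name storage as above:

Code:
zpool scrub storage        # re-read and checksum every block in the pool
zpool status -v storage    # watch scrub progress and any new READ/WRITE/CKSUM counts
zpool clear storage        # once the scrub comes back clean, reset the error counters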

I wish I knew why this occurred, though. That said, I'm replacing the spinning rust with SSDs.
 
Maybe you want to run a zpool scrub; nothing kills the mood more than a pool filled to the brim with files that has underlying hardware issues.
 
Yeah, I was wondering about hardware issues, but every test has sadly come back all good.

That said, I'm feeling paranoid and replacing the spinning rust with SSDs. It has been on my todo list for a while.
 
Something went wrong, and that's a worry.
  1. Look in /var/log/messages for clues.
  2. Run smartctl -a /dev/da? for each disk.
  3. Scrub the pool after you have checked the smartctl(8) results.
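Roughly, those three steps might look like this (just a sketch; the da2 through da7 device names and the grep patterns are assumptions based on the pool above):
Code:
grep -iE 'error|timeout|retry' /var/log/messages     # step 1: any disk or controller noise?
for d in da2 da3 da4 da5 da6 da7; do                 # step 2: SMART attributes for each pool disk
    echo "== $d =="
    smartctl -a /dev/$d | grep -E 'Reallocated|Pending|Uncorrectable|CRC'
done
zpool scrub storage                                  # step 3: scrub once SMART looks clean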
A healthy disk looks approximately like this (197 and 198 really matter):
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   181   170   021    Pre-fail  Always       -       5908
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       346
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   034   034   000    Old_age   Always       -       48470
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       346
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       290
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       55
194 Temperature_Celsius     0x0022   112   097   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
 
I agree with gpw928 and cmoerz. zpool export on a pool effectively does a 'scrub' before it completes (OK, maybe a mini-scrub); basically it makes sure everything is consistent.
I try to run zpool scrub every three or four months or so on consumer-grade hardware (there is a periodic.conf sketch for scheduling that at the end of this post).
I'd guess that running smartctl on the disks that make up the pool would show errors or failures.
I've always put the following in my /etc/periodic.conf (obviously change it for your devices; multiple devices go on the same line, separated by a space):
Code:
daily_status_smart_enable="YES"
daily_status_smart_devices="/dev/ada0"
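For a pool like the one in this thread (da2 through da7 assumed), the multi-device form would simply be:
Code:
daily_status_smart_enable="YES"
daily_status_smart_devices="/dev/da2 /dev/da3 /dev/da4 /dev/da5 /dev/da6 /dev/da7"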
It gives you this in the daily email or log file:
Code:
SMART status:
Checking health of /dev/ada0: OK
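On the scrub schedule, the stock periodic(8) scrub script can handle that too; a sketch, assuming the pool name storage and roughly the three-to-four-month interval mentioned above:
Code:
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_pools="storage"
daily_scrub_zfs_default_threshold="100"   # days between scrubs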
 
Yeah, I'm not finding anything of note prior to it going down. The logs are normal and quiet until it suddenly reboots.

No ECC errors or anything else are logged in the IPMI for when this happened.

And SMART all looks good. That said, I'm replacing the drives anyway, since swapping the spinning rust for SSDs was already on my todo list.

That said, I found the following once I started poking at it this morning.

Code:
Dec 16 08:38:13 nibbles1 ZFS[10497]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$2604357899111283597
Dec 16 08:38:13 nibbles1 ZFS[10498]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$12216455470661598049
Dec 16 08:38:13 nibbles1 ZFS[10499]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$13482611975092723659
Dec 16 08:38:13 nibbles1 ZFS[10500]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$3798580156941754525
Dec 16 08:38:13 nibbles1 ZFS[10501]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$9999167707844717775
Dec 16 08:38:13 nibbles1 ZFS[10502]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$15763265354756365723
Dec 16 08:38:13 nibbles1 ZFS[10503]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$2604357899111283597
Dec 16 08:38:14 nibbles1 ZFS[10504]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$12216455470661598049
Dec 16 08:38:14 nibbles1 ZFS[10505]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$13482611975092723659
Dec 16 08:38:14 nibbles1 ZFS[10506]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$3798580156941754525
Dec 16 08:38:14 nibbles1 ZFS[10507]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$9999167707844717775
Dec 16 08:39:35 nibbles1 ZFS[10522]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$15763265354756365723
Dec 16 08:39:35 nibbles1 ZFS[10523]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$2604357899111283597
Dec 16 08:39:35 nibbles1 ZFS[10524]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$12216455470661598049
Dec 16 08:39:35 nibbles1 ZFS[10525]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$13482611975092723659
Dec 16 08:39:35 nibbles1 ZFS[10526]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$3798580156941754525
Dec 16 08:39:35 nibbles1 ZFS[10527]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$9999167707844717775
Dec 16 08:39:35 nibbles1 ZFS[10529]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$15763265354756365723
Dec 16 08:39:35 nibbles1 ZFS[10530]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$2604357899111283597
Dec 16 08:39:35 nibbles1 ZFS[10531]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$12216455470661598049
Dec 16 08:39:35 nibbles1 ZFS[10532]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$13482611975092723659
Dec 16 08:39:35 nibbles1 ZFS[10533]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$3798580156941754525
Dec 16 08:39:35 nibbles1 ZFS[10534]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$9999167707844717775
Dec 16 08:39:36 nibbles1 ZFS[10535]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$15763265354756365723
Dec 16 08:39:36 nibbles1 ZFS[10536]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$2604357899111283597
Dec 16 08:39:36 nibbles1 ZFS[10537]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$12216455470661598049
Dec 16 08:39:36 nibbles1 ZFS[10538]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$13482611975092723659
Dec 16 08:39:36 nibbles1 ZFS[10539]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$3798580156941754525
Dec 16 08:39:36 nibbles1 ZFS[10540]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$9999167707844717775
Dec 16 08:39:40 nibbles1 ZFS[10542]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$15763265354756365723
Dec 16 08:39:40 nibbles1 ZFS[10543]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$2604357899111283597
Dec 16 08:39:40 nibbles1 ZFS[10544]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$12216455470661598049
Dec 16 08:39:40 nibbles1 ZFS[10545]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$13482611975092723659
Dec 16 08:39:40 nibbles1 ZFS[10546]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$3798580156941754525
Dec 16 08:39:41 nibbles1 ZFS[10547]: vdev state changed, pool_guid=$2689154023643674564 vdev_guid=$9999167707844717775

Followed by this once I got it back up after re-importing it...

Code:
Dec 16 14:21:39 nibbles1 ZFS[23033]: checksum mismatch, zpool=$storage path=$/dev/da3 offset=$42098257408 size=$1536
Dec 16 14:21:39 nibbles1 ZFS[23036]: checksum mismatch, zpool=$storage path=$/dev/da3 offset=$42098173952 size=$1536
Dec 16 14:21:39 nibbles1 ZFS[23040]: checksum mismatch, zpool=$storage path=$/dev/da3 offset=$42098343424 size=$1536
Dec 16 14:21:39 nibbles1 ZFS[23045]: checksum mismatch, zpool=$storage path=$/dev/da3 offset=$42098280448 size=$1536
Dec 16 14:21:39 nibbles1 ZFS[23049]: checksum mismatch, zpool=$storage path=$/dev/da3 offset=$42098178560 size=$2560
Dec 16 14:21:39 nibbles1 ZFS[23051]: checksum mismatch, zpool=$storage path=$/dev/da3 offset=$42098187776 size=$2048
Dec 16 14:21:39 nibbles1 ZFS[23053]: checksum mismatch, zpool=$storage path=$/dev/da3 offset=$42098591744 size=$1536
Dec 16 14:21:39 nibbles1 ZFS[23054]: checksum mismatch, zpool=$storage path=$/dev/da3 offset=$42098281984 size=$1536
Dec 16 14:21:39 nibbles1 ZFS[23056]: checksum mismatch, zpool=$storage path=$/dev/da3 offset=$42098184704 size=$1536
Dec 16 14:21:39 nibbles1 ZFS[23057]: checksum mismatch, zpool=$storage path=$/dev/da3 offset=$42098181120 size=$2048
Dec 16 14:21:39 nibbles1 ZFS[23063]: checksum mismatch, zpool=$storage path=$/dev/da3 offset=$42098177024 size=$1536
Dec 16 14:21:39 nibbles1 ZFS[23067]: checksum mismatch, zpool=$storage path=$/dev/da3 offset=$42098269696 size=$1536
Dec 16 14:21:39 nibbles1 ZFS[23070]: checksum mismatch, zpool=$storage path=$/dev/da3 offset=$42098287616 size=$1536
Dec 16 14:21:39 nibbles1 ZFS[23074]: checksum mismatch, zpool=$storage path=$/dev/da3 offset=$42098285568 size=$1024
Dec 16 14:21:40 nibbles1 ZFS[23402]: checksum mismatch, zpool=$storage path=$/dev/da6 offset=$42128605184 size=$1536
Dec 16 14:21:40 nibbles1 ZFS[23404]: checksum mismatch, zpool=$storage path=$/dev/da6 offset=$42128592384 size=$1536
Dec 16 14:21:40 nibbles1 ZFS[23405]: checksum mismatch, zpool=$storage path=$/dev/da6 offset=$42128640000 size=$2560
Dec 16 14:21:40 nibbles1 ZFS[23407]: checksum mismatch, zpool=$storage path=$/dev/da6 offset=$42128627200 size=$3072
Dec 16 14:21:40 nibbles1 ZFS[23408]: checksum mismatch, zpool=$storage path=$/dev/da6 offset=$42128602112 size=$3072
Dec 16 14:21:40 nibbles1 ZFS[23410]: checksum mismatch, zpool=$storage path=$/dev/da6 offset=$42128637952 size=$2048
Dec 16 14:21:40 nibbles1 ZFS[23413]: checksum mismatch, zpool=$storage path=$/dev/da6 offset=$42128598528 size=$3584
Dec 16 14:21:40 nibbles1 ZFS[23415]: checksum mismatch, zpool=$storage path=$/dev/da6 offset=$42128590848 size=$1536
Dec 16 14:21:40 nibbles1 ZFS[23417]: checksum mismatch, zpool=$storage path=$/dev/da6 offset=$42128596992 size=$1536
Dec 16 14:21:40 nibbles1 ZFS[23420]: checksum mismatch, zpool=$storage path=$/dev/da6 offset=$42128642560 size=$3584
Dec 16 14:21:40 nibbles1 ZFS[23423]: checksum mismatch, zpool=$storage path=$/dev/da6 offset=$42128630272 size=$7680
Dec 16 14:21:40 nibbles1 ZFS[23425]: checksum mismatch, zpool=$storage path=$/dev/da6 offset=$42128646144 size=$11264
Dec 16 14:21:40 nibbles1 ZFS[23426]: checksum mismatch, zpool=$storage path=$/dev/da6 offset=$42128657408 size=$3584
Dec 16 14:21:40 nibbles1 ZFS[23428]: checksum mismatch, zpool=$storage path=$/dev/da6 offset=$42128667136 size=$1536
 
So as a follow-up, I found two of the drives had strangely terrible performance, despite all six being the same make and model.

That said, nothing in SMART shows them as being an issue.
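If anyone wants to compare drives the same way, one rough way to spot a slow disk (a sketch; diskinfo -t is a read-only transfer test, gstat shows live per-disk latency):

Code:
# quick read benchmark per member disk (read-only, safe on a live pool)
for d in da2 da3 da4 da5 da6 da7; do
    echo "== $d =="
    diskinfo -t /dev/$d | grep -A 3 'Transfer rates'
done
gstat -p    # live per-disk latency while the pool is busy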
 
That does not mean much. Run a SMART long self-test and look again. I have had drives that reported all OK in SMART but hit an error at 90% of the test. ZFS had told me beforehand that something fishy was up.
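For reference, starting and checking a long test on one of the suspect drives (da3 here, purely as an example) looks like:

Code:
smartctl -t long /dev/da3       # kick off the extended self-test; it can take hours on large disks
smartctl -l selftest /dev/da3   # check the self-test log once it has finished
smartctl -a /dev/da3            # then look at the attributes again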
 