ZFS too many errors ... but

Not sure where to go with this..

I have a 12-disk pool in raidz3 that has been running for a few years without a single issue.
Then today my monitoring picks up a zpool error:

Code:
$ zpool status
  pool: abyss
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 07:30:32 with 0 errors on Sun Jul 18 06:51:13 2021
config:

        NAME            STATE     READ WRITE CKSUM
        abyss           DEGRADED     0     0     0
          raidz3-0      DEGRADED     0     0     0
            da3p3.eli   ONLINE       0     0     0
            da7p3.eli   FAULTED     40    32     0  too many errors

OK, no problem. So I run smartctl tests on the drive and check the logs... and everything looks fine. I figured it just had some sort of hiccup... reboot.
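
For reference, the checks were roughly along these lines (da7 as the example device; exact commands from memory):

Code:
# SMART health summary and error counters for the flagged drive
smartctl -a /dev/da7
# short self-test, then read back the self-test log a few minutes later
smartctl -t short /dev/da7
smartctl -l selftest /dev/da7
# and a quick look through the kernel log for CAM/mpr noise around the same time
grep da7 /var/log/messages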

The pool comes back up, completes a resilver, and presto, fixed, right?
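
(For anyone following along: clearing the fault manually would look roughly like the status output above suggests.)

Code:
# mark the faulted vdev as repaired and let ZFS resilver it
zpool clear abyss da7p3.eli
# then watch the resilver progress
zpool status abyss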

Nope. Now the exact same thing happens on another drive... then another drive... but they all pass smartctl tests.

OK, fine: unplug everything, remove a tiny bit of dust, reseat all the connections to both the controller cards and the drives,
and fire it up.

The pool comes online, no problem.

Fine, let the pool resilver and fix itself.
Time for a scrub, right?
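
That is, nothing fancy:

Code:
# kick off the scrub and keep an eye on the scan: line in the status output
zpool scrub abyss
zpool status abyss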

About two minutes into the scrub I get errors on yet another drive.
It is immediately removed from the pool and the zpool goes into a degraded state.

Code:
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): Retrying command (per sense data)
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): READ(10). CDB: 28 00 00 80 0a 10 00 00 10 00
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): CAM status: SCSI Status Error
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): SCSI status: Check Condition
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): Error 5, Retries exhausted
Sep  5 09:46:16 abyss kernel: [490] GEOM_ELI: g_eli_read_done() failed (error=5) da7p3.eli[READ(offset=270336, length=8192)]
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): READ(16). CDB: 88 00 00 00 00 03 a3 81 22 10 00 00 00 10 00 00
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): CAM status: SCSI Status Error
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): SCSI status: Check Condition
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): Error 5, Retries exhausted
Sep  5 09:46:16 abyss kernel: [490] GEOM_ELI: g_eli_read_done() failed (error=5) da7p3.eli[READ(offset=7997266075648, length=8192)]
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): READ(16). CDB: 88 00 00 00 00 03 a3 81 24 10 00 00 00 10 00 00
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): CAM status: SCSI Status Error
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): SCSI status: Check Condition
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Sep  5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): Error 5, Retries exhausted
Sep  5 09:46:16 abyss kernel: [490] GEOM_ELI: g_eli_read_done() failed (error=5) da7p3.eli[READ(offset=7997266337792, length=8192)]
Sep  5 09:46:16 abyss ZFS[28685]: vdev probe failure, zpool=abyss path=/dev/da7p3.eli
Sep  5 09:46:16 abyss ZFS[29643]: vdev state changed, pool_guid=12071249439363906691 vdev_guid=1501231572552313979

smartctl -t short on /dev/da1 through da11 all report 0 problems, and everything passes...

currently running a long test ...
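
In case it's useful, the long tests are just a loop over all the data disks (results get checked later with smartctl -l selftest):

Code:
# start a long SMART self-test on every data disk
for n in 0 1 2 3 4 5 6 7 8 9 10 11; do
    smartctl -t long /dev/da$n
done
# check the self-test log once the tests have had time to finish (hours on big drives)
smartctl -l selftest /dev/da7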

Code:
root@abyss:/var/log # zpool status
  pool: abyss
 state: ONLINE
  scan: resilvered 129M in 00:00:19 with 0 errors on Sun Sep  5 10:39:28 2021
config:

        NAME            STATE     READ WRITE CKSUM
        abyss           ONLINE       0     0     0
          raidz3-0      ONLINE       0     0     0
            da3p3.eli   ONLINE       0     0     0
            da7p3.eli   ONLINE       0     0     0
            da11p3.eli  ONLINE       0     0     0
            da8p3.eli   ONLINE       0     0     0
            da6p3.eli   ONLINE       0     0     0
            da4p3.eli   ONLINE       0     0     0
            da5p3.eli   ONLINE       0     0     0
            da0p3.eli   ONLINE       0     0     0
            da2p3.eli   ONLINE       0     0     0
            da1p3.eli   ONLINE       0     0     0
            da9p3.eli   ONLINE       0     0     0
            da10p3.eli  ONLINE       0     0     0
        logs
          mirror-1      ONLINE       0     0     0
            ada1        ONLINE       0     0     0
            ada2        ONLINE       0     0     0

Not sure how long it will stay up, but the fact that it's all over the map really makes it hard to pinpoint the problem.

Notes:
I used /dev/da7 as an example; I got the exact same errors with /dev/da4 and /dev/da6. I'm guessing 3+ drives didn't all magically fail at once, so there is something weird going on... maybe the HBA card? I don't know. I pulled and reseated everything, so it's currently running, but I'm guessing that if I do a scrub on the pool it will error out in the same way and eject another drive into a degraded state.
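
When I rerun the scrub I'll watch the kernel log at the same time to see which drive and which controller complain first, something like:

Code:
# terminal 1: start the scrub
zpool scrub abyss
# terminal 2: watch for CAM/mpr/GELI errors as they happen
tail -f /var/log/messages | grep -E 'mpr|da[0-9]|GEOM_ELI'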

Thanks!
 
I had similar issues once and it turned out to be faulty SATA cables. Then again, your system has been operating fine for a while, so I kind of doubt it's that easy.

Do I read this correctly - you are running with multiple controller cards? Were the failures all on the same one or spread across multiple ones?
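
If it helps, camcontrol can map each da device back to its controller; the bus lines show which mpr instance each disk hangs off, so you can group the failures by HBA:

Code:
# verbose device list: prints the bus adapters (mpr0, mpr1, ...) along with
# the daN devices attached to each
camcontrol devlist -v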

You'll probably have to start eliminating things one by one; the simplest thing to try is running a memory check first; a faulty memory module can easily wreak havoc.
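
For the memory check, booting memtest86+ from a USB stick is the simplest route; writing the image is just dd (the image filename and USB device below are placeholders, adjust to whatever you download):

Code:
# write a memtest86+ bootable image to a spare USB stick
# (memtest86plus-usb.img and /dev/da12 are placeholders -- double-check the target device!)
dd if=memtest86plus-usb.img of=/dev/da12 bs=1M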
 
First: Don't trust smartctl to tell you whether a disk is healthy or sick. It does not reliably report problems with disk drives. Note that I didn't say that smartctl lies, or that it is useless: its error output is correlated with real errors. But not all disk errors can be detected by SMART, and a drive can pass its self-tests while still misbehaving.
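
If you do want more out of smartctl than the pass/fail self-test verdict, the raw counters are usually more telling, e.g. (da7 as the example):

Code:
# full device report: on SAS drives look at the "Error counter log" and
# "Elements in grown defect list"; on SATA look at reallocated/pending
# sector counts and the CRC error count
smartctl -x /dev/da7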

So you're saying that you are getting errors, but not always on the same drive, right? The errors go away and fix themselves on one drive, just to reappear elsewhere? And you are getting detailed error reports (the SCSI ASC/ASCQ is decoded in the logs above), but they don't point to a sensible cause? Actually, the error reports do give you a lot of information: the drive is "not ready" (it can't do I/O right now), but it can't tell you why.

I would start by looking at the infrastructure that's shared by all 12 disks. I'm assuming all the disks are in a single box and share a power supply? I would start there: the power supply might be old, flaky, overheated, have a fan full of dust, or something like that. Switch to a different power supply. If that doesn't help, look at the rest of the shared infrastructure (HBA cards, SATA cabling, ...).
 
Yes, the pool is spread across two LSI 9311s flashed to HBA (IT) mode. I'm looking into the power supply and those cards as well... so far all of the drives that have reported errors are on the same card, so I'm guessing that's most or all of the issue.

The RCA of this event boils down to a plastic SATA connector that was actually broken. I believe this caused either a loose connection, or perhaps a small piece of dust got in the way. A little crazy glue on the connector and everything is fine until the replacement arrives.

Thanks for the notes...
 