Not sure where to go with this..
I have a 12-disk pool in raidz3 that has been running for a few years without a single issue...
then today my monitoring picks up a zpool error:
Code:
$ zpool status
  pool: abyss
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 07:30:32 with 0 errors on Sun Jul 18 06:51:13 2021
config:

        NAME           STATE     READ WRITE CKSUM
        abyss          DEGRADED     0     0     0
          raidz3-0     DEGRADED     0     0     0
            da3p3.eli  ONLINE       0     0     0
            da7p3.eli  FAULTED     40    32     0  too many errors
ok, no problem, so I run smartctl tests on the drive and check the logs... and everything works fine. I figured it had some sort of hiccup... reboot..
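(For reference, the checks were along these lines, with da7 as the example device; exact log path may differ on your setup:)

Code:
# quick SMART self-test, then the full health summary
smartctl -t short /dev/da7
smartctl -a /dev/da7
# and the kernel log for anything touching that device
grep da7 /var/log/messages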
pool comes back up, completes a resilver, and presto, fixed... right?
nope... now the exact same thing happens on another drive... then another drive... but they all pass smartctl tests...
ok fine, unplug everything, remove a tiny bit of dust, reseat all the connections, both at the HBA cards and at the drives...
fire it up..
pool comes online no problem..
fine, let the pool resilver, it fixes itself...
time for a scrub, right?
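(Kicked off the usual way:)

Code:
zpool scrub abyss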
about 2 mins into the scrub... I get errors on another drive...
one of the other drives is immediately removed from the pool and the zpool goes into a degraded state...
Code:
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): Retrying command (per sense data)
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): READ(10). CDB: 28 00 00 80 0a 10 00 00 10 00
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): CAM status: SCSI Status Error
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): SCSI status: Check Condition
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): Error 5, Retries exhausted
Sep 5 09:46:16 abyss kernel: [490] GEOM_ELI: g_eli_read_done() failed (error=5) da7p3.eli[READ(offset=270336, length=8192)]
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): READ(16). CDB: 88 00 00 00 00 03 a3 81 22 10 00 00 00 10 00 00
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): CAM status: SCSI Status Error
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): SCSI status: Check Condition
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): Error 5, Retries exhausted
Sep 5 09:46:16 abyss kernel: [490] GEOM_ELI: g_eli_read_done() failed (error=5) da7p3.eli[READ(offset=7997266075648, length=8192)]
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): READ(16). CDB: 88 00 00 00 00 03 a3 81 24 10 00 00 00 10 00 00
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): CAM status: SCSI Status Error
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): SCSI status: Check Condition
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Sep 5 09:46:16 abyss kernel: [490] (da7:mpr0:0:7:0): Error 5, Retries exhausted
Sep 5 09:46:16 abyss kernel: [490] GEOM_ELI: g_eli_read_done() failed (error=5) da7p3.eli[READ(offset=7997266337792, length=8192)]
Sep 5 09:46:16 abyss ZFS[28685]: vdev probe failure, zpool=abyss path=/dev/da7p3.eli
Sep 5 09:46:16 abyss ZFS[29643]: vdev state changed, pool_guid=12071249439363906691 vdev_guid=1501231572552313979
smartctl -t short on /dev/da1 - da11 all report 0 problems and everything passes...
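(Run across the drives in a loop, roughly like this:)

Code:
# kick off a short self-test on each drive
for i in $(seq 1 11); do smartctl -t short /dev/da$i; done
# a few minutes later, read back each drive's self-test log
for i in $(seq 1 11); do smartctl -l selftest /dev/da$i; done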
currently running a long test ...
Code:
root@abyss:/var/log # zpool status
  pool: abyss
 state: ONLINE
  scan: resilvered 129M in 00:00:19 with 0 errors on Sun Sep 5 10:39:28 2021
config:

        NAME            STATE     READ WRITE CKSUM
        abyss           ONLINE       0     0     0
          raidz3-0      ONLINE       0     0     0
            da3p3.eli   ONLINE       0     0     0
            da7p3.eli   ONLINE       0     0     0
            da11p3.eli  ONLINE       0     0     0
            da8p3.eli   ONLINE       0     0     0
            da6p3.eli   ONLINE       0     0     0
            da4p3.eli   ONLINE       0     0     0
            da5p3.eli   ONLINE       0     0     0
            da0p3.eli   ONLINE       0     0     0
            da2p3.eli   ONLINE       0     0     0
            da1p3.eli   ONLINE       0     0     0
            da9p3.eli   ONLINE       0     0     0
            da10p3.eli  ONLINE       0     0     0
        logs
          mirror-1      ONLINE       0     0     0
            ada1        ONLINE       0     0     0
            ada2        ONLINE       0     0     0
not sure how long it will stay up... but the fact that it's all over the map really makes it hard to pinpoint the problem...
notes:
I used /dev/da7 as an example; I got the exact same errors with /dev/da4 and /dev/da6. I'm guessing 3+ drives didn't all magically fail at once, so something weird is going on... maybe the HBA card? idk... I pulled and reseated everything, so it's currently running, but I'm guessing if I do a scrub on the pool it will error out the same way and eject another drive into a degraded state...
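(If it does, a rough plan for poking at the controller side before blaming more disks; mprutil applies since the log shows the mpr driver, and the device name is just an example:)

Code:
# what CAM currently sees; drives vanishing here points away from ZFS
camcontrol devlist
# controller and firmware details for the SAS3 HBA behind mpr(4)
mprutil show adapter
mprutil show devices
# per-drive error counter log on the SAS side
smartctl -l error /dev/da7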
Thanks!