ZFS: Do my disks really have errors, and should they be tossed out?

I received an alert that one of my pools was degraded:
Code:
  pool: stargate
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: scrub in progress since Wed May  1 01:00:00 2024
    27.7T scanned at 1.31G/s, 22.5T issued at 1.07G/s, 34.6T total
    0B repaired, 64.99% done, 03:13:54 to go
config:

    NAME                        STATE     READ WRITE CKSUM
    stargate                    DEGRADED     0     0     0
      raidz2-0                  DEGRADED     0     0     0
        gpt/ST6000VN001-0321-0  ONLINE       0     0     0
        gpt/S6-JAN21-ZR12FZGT   ONLINE       0     0     0
        gpt/S6-JAN21-ZR12HQRR   FAULTED     35    77     0  too many errors
        gpt/S6-JAN21-ZR12HQZ0   ONLINE       0     0     0
        gpt/S6-JAN21-ZR12JAC5   ONLINE       0     0     0
        gpt/S6-JAN21-ZR12KB1A   ONLINE       0     0     0
        gpt/S6-JAN21-ZR12KBZB   ONLINE       0     0     0
        gpt/S6-JAN21-ZR12KC1A   FAULTED     35    77     0  too many errors
        gpt/S6-JAN21-ZR12KCEV   ONLINE       0     0     0

I replaced both disks; all good.

What is weird is that both disks showed the exact same number of errors. I ran a SMART test on one of the "failed" drives and everything looks fine.
Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     20018         -
# 2  Extended offline    Completed without error       00%     20002         -

These drives never had an issue before (I scrub twice a month). I am trying to decide whether they should go back into service, or whether there is too much risk and they should be thrown out.

Any advice or further tests I can run?

Thanks
 
Error logs? /var/log/messages or dmesg?

My suspicion: the disks themselves are fine, but some common subsystem had errors. That could be the power distribution or the I/O interconnect (for example SAS or SATA).
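
If it is a shared subsystem, both disks should have logged errors at the same timestamps. Something along these lines should pull out the relevant bits (daN/daM are placeholders for whatever devices sit behind the two faulted gpt labels; glabel status shows the mapping):
Code:
# map the gpt labels to their daN devices
glabel status

# everything logged against the two suspect devices
grep -iE 'daN|daM' /var/log/messages

# plus the kernel ring buffer, in case messages already rotated
dmesg

# and the error log the drive itself keeps
smartctl -l error /dev/daN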
 
The logs in /var/log/messages show the same errors repeating for both disks:
Code:
May  1 06:00:02 babylon5 ZFS[13073]: vdev probe failure, zpool=stargate path=/dev/gpt/S6-JAN21-ZR12HQRR
May  1 06:00:02 babylon5 kernel: (da10:mps1:0:15:0): READ(6). CDB: 08 00 00 80 10 00
May  1 06:00:02 babylon5 kernel: (da10:mps1:0:15:0): CAM status: SCSI Status Error
May  1 06:00:02 babylon5 kernel: (da10:mps1:0:15:0): SCSI status: Check Condition
May  1 06:00:02 babylon5 kernel: (da10:mps1:0:15:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
May  1 06:00:02 babylon5 kernel: (da10:mps1:0:15:0): Retrying command (per sense data)
May  1 06:00:02 babylon5 kernel: (da10:mps1:0:15:0): READ(6). CDB: 08 00 00 80 10 00
May  1 06:00:02 babylon5 kernel: (da10:mps1:0:15:0): CAM status: SCSI Status Error
May  1 06:00:02 babylon5 kernel: (da10:mps1:0:15:0): SCSI status: Check Condition
May  1 06:00:02 babylon5 kernel: (da10:mps1:0:15:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)

And then finally:
Code:
May  1 06:00:02 babylon5 kernel: GEOM_PART: da10 was automatically resized.
May  1 06:00:02 babylon5 kernel:   Use `gpart commit da10` to save changes or `gpart undo da10` to revert them.
May  1 06:00:02 babylon5 kernel: GEOM_PART: integrity check failed (da10, GPT)
May  1 06:00:02 babylon5 ZFS[13085]: vdev state changed, pool_guid=3969311604504026462 vdev_guid=12122461168081551195

After the failure, the device node in /dev/gpt was gone. It came back after a reboot.
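
Next time I will check whether the whole disk dropped off the bus or just the label node before reaching for a reboot, something like:
Code:
# is the disk itself still attached to CAM?
camcontrol devlist | grep da10

# is the gpt label provider still there?
glabel status | grep ZR12HQRR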
 
The first log output makes sense: some common root cause made both disks report "not ready", which for traditional disks means they are not spinning. I have no idea what that common root cause was, though. It could have been a power glitch, or some piece of software or firmware ordering them to spin down.

The second part is harder to understand: why did the kernel think the partition table had changed, causing the ZFS partition labels to vanish? Strange.
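
If that happens again, I would check what GEOM currently sees on the provider before rebooting; when only one copy of the GPT got clobbered, gpart can usually repair it in place, and ZFS will then resilver the gap (da10 and the label are taken from your log):
Code:
# inspect the partition table the kernel currently sees
gpart show da10

# if the GPT is flagged corrupt, rebuild the damaged copy from the good one
gpart recover da10

# bring the vdev back online and let the pool catch it up
zpool online stargate gpt/S6-JAN21-ZR12HQRR
zpool clear stargate
zpool status -v stargate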

What kind of disks are these? The first one is a Seagate spinning-rust disk, judging by the model number. The rest I can't tell from the labels.
 
I would fault their connection to the system if SMART is showing nothing and all parameters are within spec (they weren't posted).
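
For completeness, the full dump from something like this would show them, including the drive's own error log and device statistics:
Code:
smartctl -x /dev/da10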
 
Here is the part that showed up right after the drive serial number and before the self-test log I posted earlier:
Code:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   064   006    Pre-fail  Always       -       5062
  3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       19
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   083   060   045    Pre-fail  Always       -       188124720
  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       20090
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       19
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   064   056   040    Old_age   Always       -       36 (Min/Max 32/42)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       795
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1520
194 Temperature_Celsius     0x0022   036   044   000    Old_age   Always       -       36 (0 21 0 0 0)
195 Hardware_ECC_Recovered  0x001a   100   064   000    Old_age   Always       -       5062
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       19889h+47m+24.537s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       9353468262
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       463302808955

SMART Error Log Version: 1
No Errors Logged
 
That Seek_Error_Rate is not looking good, but it's within spec I guess. I would still look at the connections to the drive.
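
If I remember correctly, Seagate packs two counters into that raw value: the low 32 bits are the total number of seeks and the high 16 bits are the actual seek errors, so the normalized VALUE/WORST/THRESH columns are the ones to judge by. A quick check, assuming that encoding:
Code:
# assumed Seagate layout: high 16 bits = seek errors, low 32 bits = total seeks
raw=188124720
echo "seek errors: $(( raw >> 32 ))"         # prints 0
echo "total seeks: $(( raw & 0xFFFFFFFF ))"  # prints 188124720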
 