Failing SAS drive not removed from ZFS mirror

Hi,

I have a ZFS mirror with two drives. Nagios checks the SMART health of the drives and reports the following error:
Code:
SMART Health Status: FAILURE PREDICTION THRESHOLD EXCEEDED: ascq=0x5 [asc=5d, ascq=5]

The following is the SMART information for the drive.

Code:
smartctl -x /dev/da0
smartctl 7.3 2022-02-28 r5338 [FreeBSD 12.3-RELEASE amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST1200MM0129
Revision:             C004
Compliance:           SPC-4
User Capacity:        1,200,243,695,616 bytes [1.20 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        10500 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000c500dd65f097
Serial number:        WFK8TTBF0000K1308ZH0
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu Jan 19 13:23:41 2023 GMT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: FAILURE PREDICTION THRESHOLD EXCEEDED: ascq=0x5 [asc=5d, ascq=5]

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 7875
Power on minutes since format <not available>
Current Drive Temperature:     31 C
Drive Trip Temperature:        60 C

Manufactured in week 05 of year 2021
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  33
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  354
Elements in grown defect list: 8162

Vendor (Seagate Cache) information
  Blocks sent to initiator = 186240
  Blocks received from initiator = 18136392
  Blocks read from cache and sent to initiator = 15298
  Number of read and write commands whose size <= segment size = 343195
  Number of read and write commands whose size > segment size = 17

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 4363.88
  number of minutes until next internal SMART test = 51

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      63039        0         0     63039          0          0.045           0
write:         0        0         0         0          0          9.701           0
verify:    10580       61         0     10641         61          0.050           0

Non-medium error count:        0


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -    4150                 - [-   -    -]

Long (extended) Self-test duration: 6723 seconds [112.0 minutes]

Background scan results log
  Status: no scans active
    Accumulated power on time, hours:minutes 4363:53 [261833 minutes]
    Number of background scans performed: 0,  scan progress: 0.00%
    Number of background medium scans performed: 0

   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1 4116:01  000000004f0361f8  [1,17,1]   Recovered via rewrite in-place
 ----removed lines for brevity----  
 671 4118:22  000000004f11a960  [1,17,1]   Recovered via rewrite in-place
 >>>> log truncated, fetched 16124 of 49172 available bytes


Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: SAS or SATA device
    attached reason: loss of dword synchronization
    reason: power on
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=1 stp=1 smp=1
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c500dd65f095
    attached SAS address = 0x52cea7f059305600
    attached phy identifier = 6
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 0
    Phy reset problem count = 0
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c500dd65f096
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 0
    Phy reset problem count = 0

The services on the machine that access data on the mirror were all crawling, presumably because the drive was busy handling errors.

Is it possible to configure the zpool so that it recognizes the drive as bad and removes it from the pool?
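
As far as I understand, ZFS only faults a disk when read/write/checksum errors actually reach it; as long as the controller keeps retrying and the commands eventually complete, the pool stays ONLINE. In the meantime I assume I could take the suspect disk out of service by hand, something like the following (the replacement label at the end is just a placeholder):
Code:
# drop the failing mirror member; the pool keeps running on the other disk
zpool offline nsgroot gpt/zroot_pooldisk_/dev/da0

# once a replacement disk is partitioned and labelled, resilver it in
zpool replace nsgroot gpt/zroot_pooldisk_/dev/da0 gpt/zroot_pooldisk_new

I've also seen mention of zfsd(8) (zfsd_enable="YES" in /etc/rc.conf), which is supposed to degrade a vdev once it sees enough I/O errors, but I guess it can't act on errors that never bubble up past the controller's retries.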

I've read about "SCT Error Recovery Control", but the drives don't allow reading or writing that value via smartctl.

Code:
smartctl -l scterc /dev/da0
smartctl 7.3 2022-02-28 r5338 [FreeBSD 12.3-RELEASE amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

SCSI device successfully opened

Use 'smartctl -a' (or '-x') to print SMART (and more) information

I've also read about using camcontrol to alter the pertinent settings, but I'm unsure which ones apply here.

Code:
camcontrol modepage /dev/da0 -m 0x07
EER:  0
PER:  0
DTE:  0
DCR:  0
Verify Retry Count:  20
Verify Correction Span:  255
Verify Recovery Time Limit:  65535
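
From what I can tell, SCT Error Recovery Control is an ATA feature, which would explain why smartctl prints nothing useful for these SAS drives. Page 0x07 above is the Verify Error Recovery page, which only covers VERIFY commands; if I understand the SCSI specs correctly, normal READ/WRITE retry behaviour lives in mode page 0x01 (Read-Write Error Recovery), so that is probably the page to look at, e.g.:
Code:
camcontrol modepage /dev/da0 -m 0x01         # current Read-Write Error Recovery values
camcontrol modepage /dev/da0 -m 0x01 -P 2    # drive defaults, for comparison
camcontrol modepage /dev/da0 -m 0x01 -e      # edit (retry counts, Recovery Time Limit) in $EDITOR

Whether the drive firmware actually honours a shorter Recovery Time Limit is another question, of course.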
 
Some additional information from /var/log/messages:
Code:
Jan 18 21:45:11 myserver kernel:    (da0:mpr0:0:0:0): WRITE(10). CDB: 2a 00 4f 02 f5 40 00 00 08 00 length 4096 SMID 658 Command timeout on target 0(0x000a), 60000 set, 60.285631707 elapsed
Jan 18 21:45:11 myserver kernel: mpr0: At enclosure level 1, slot 0, connector name (    )
Jan 18 21:45:11 myserver kernel: mpr0: Sending abort to target 0 for SMID 658
Jan 18 21:45:11 myserver kernel:    (da0:mpr0:0:0:0): WRITE(10). CDB: 2a 00 4f 02 f5 40 00 00 08 00 length 4096 SMID 658 Aborting command 0xfffffe00fc4c2670
Jan 18 21:45:11 myserver kernel: (da0:mpr0:0:0:0): WRITE(10). CDB: 2a 00 4f 02 f5 40 00 00 08 00 
Jan 18 21:45:11 myserver kernel: mpr0: (da0:mpr0:0:0:0): CAM status: Command timeout
Jan 18 21:45:11 myserver kernel: Finished abort recovery for target 0
Jan 18 21:45:11 myserver kernel: (da0:mpr0:0:0:0): Retrying command, 3 more tries remain
Jan 18 21:45:11 myserver kernel: mpr0: Unfreezing devq for target ID 0
Jan 18 21:45:11 myserver kernel: (da0:mpr0:0:0:0): WRITE(10). CDB: 2a 00 4f 02 f5 40 00 00 08 00 
Jan 18 21:45:11 myserver kernel: (da0:mpr0:0:0:0): CAM status: SCSI Status Error
Jan 18 21:45:11 myserver kernel: (da0:mpr0:0:0:0): SCSI status: Check Condition
Jan 18 21:45:11 myserver kernel: (da0:mpr0:0:0:0): SCSI sense: UNIT ATTENTION asc:29,cd (Vendor Specific ASCQ)
Jan 18 21:45:11 myserver kernel: (da0:mpr0:0:0:0): Field Replaceable Unit: 189
Jan 18 21:45:11 myserver kernel: (da0:mpr0:0:0:0): Retrying command (per sense data)
Jan 19 02:07:26 myserver kernel:    (da0:mpr0:0:0:0): WRITE(10). CDB: 2a 00 4f 40 a5 50 00 00 20 00 length 16384 SMID 847 Command timeout on target 0(0x000a), 60000 set, 60.865657661 elapsed
Jan 19 02:07:26 myserver kernel: mpr0: At enclosure level 1, slot 0, connector name (    )
Jan 19 02:07:26 myserver kernel: mpr0: Sending abort to target 0 for SMID 847
Jan 19 02:07:26 myserver kernel:    (da0:mpr0:0:0:0): WRITE(10). CDB: 2a 00 4f 40 a5 50 00 00 20 00 length 16384 SMID 847 Aborting command 0xfffffe00fc4d3c08
Jan 19 02:07:26 myserver kernel: mpr0: mprsas_action_scsiio: Freezing devq for target ID 0
Jan 19 02:07:26 myserver kernel: (da0:mpr0:0:0:0): WRITE(10). CDB: 2a 00 6f 5a 8c d8 00 00 08 00 
Jan 19 02:07:26 myserver kernel: (da0:mpr0:0:0:0): CAM status: CAM subsystem is busy
Jan 19 02:07:26 myserver kernel: (da0:mpr0:0:0:0): Retrying command, 3 more tries remain
Jan 19 02:07:26 myserver kernel: (da0:mpr0:0:0:0): WRITE(10). CDB: 2a 00 4f 40 a5 50 00 00 20 00 
Jan 19 02:07:26 myserver kernel: (da0:mpr0:0:0:0): CAM status: Command timeout
Jan 19 02:07:26 myserver kernel: (da0:mpr0:0:0:0): Retrying command, 3 more tries remain
Jan 19 02:07:26 myserver kernel: mpr0: Finished abort recovery for target 0
Jan 19 02:07:26 myserver kernel: mpr0: Unfreezing devq for target ID 0
Jan 19 02:07:26 myserver kernel: (da0:mpr0:0:0:0): WRITE(10). CDB: 2a 00 4f 40 a5 50 00 00 20 00 
Jan 19 02:07:26 myserver kernel: (da0:mpr0:0:0:0): CAM status: SCSI Status Error
Jan 19 02:07:26 myserver kernel: (da0:mpr0:0:0:0): SCSI status: Check Condition
Jan 19 02:07:26 myserver kernel: (da0:mpr0:0:0:0): SCSI sense: UNIT ATTENTION asc:29,cd (Vendor Specific ASCQ)
Jan 19 02:07:26 myserver kernel: (da0:mpr0:0:0:0): Field Replaceable Unit: 189
Jan 19 02:07:26 myserver kernel: (da0:mpr0:0:0:0): Retrying command (per sense data)
Jan 19 02:12:25 myserver kernel:    (da0:mpr0:0:0:0): WRITE(10). CDB: 2a 00 4f 40 aa 30 00 00 40 00 length 32768 SMID 361 Command timeout on target 0(0x000a), 60000 set, 60.590219633 elapsed
Jan 19 02:12:25 myserver kernel: mpr0: At enclosure level 1, slot 0, connector name (    )
Jan 19 02:12:25 myserver kernel: mpr0: Sending abort to target 0 for SMID 361
Jan 19 02:12:25 myserver kernel:    (da0:mpr0:0:0:0): WRITE(10). CDB: 2a 00 4f 40 aa 30 00 00 40 00 length 32768 SMID 361 Aborting command 0xfffffe00fc4a7238
Jan 19 02:12:25 myserver kernel: mpr0: mprsas_action_scsiio: Freezing devq for target ID 0
Jan 19 02:12:25 myserver kernel: (da0:mpr0:0:0:0): WRITE(10). CDB: 2a 00 6f 5a 95 28 00 00 08 00 
Jan 19 02:12:25 myserver kernel: (da0:mpr0:0:0:0): CAM status: CAM subsystem is busy
Jan 19 02:12:25 myserver kernel: (da0:mpr0:0:0:0): Retrying command, 3 more tries remain
Jan 19 02:12:25 myserver kernel: (da0:mpr0:0:0:0): WRITE(10). CDB: 2a 00 4f 40 aa 30 00 00 40 00 
Jan 19 02:12:25 myserver kernel: (da0:mpr0:0:0:0): CAM status: Command timeout
Jan 19 02:12:25 myserver kernel: (da0:mpr0:0:0:0): Retrying command, 3 more tries remain
Jan 19 02:12:25 myserver kernel: mpr0: Finished abort recovery for target 0
Jan 19 02:12:25 myserver kernel: mpr0: Unfreezing devq for target ID 0
Jan 19 02:12:25 myserver kernel: (da0:mpr0:0:0:0): WRITE(10). CDB: 2a 00 4f 40 aa 30 00 00 40 00 
Jan 19 02:12:25 myserver kernel: (da0:mpr0:0:0:0): CAM status: SCSI Status Error
Jan 19 02:12:25 myserver kernel: (da0:mpr0:0:0:0): SCSI status: Check Condition
Jan 19 02:12:25 myserver kernel: (da0:mpr0:0:0:0): SCSI sense: UNIT ATTENTION asc:29,cd (Vendor Specific ASCQ)
Jan 19 02:12:25 myserver kernel: (da0:mpr0:0:0:0): Field Replaceable Unit: 189
Jan 19 02:12:25 myserver kernel: (da0:mpr0:0:0:0): Retrying command (per sense data)
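
One thing that strikes me in the log: each WRITE gets a 60000 ms timeout and then up to 3 more tries, so a single misbehaving command can stall I/O for several minutes without ZFS ever seeing a failure. If I read the CAM tunables right, these knobs could be lowered so that a dying drive fails fast enough for zpool status (or zfsd) to notice, though I haven't tested the side effects on healthy load peaks:
Code:
sysctl kern.cam.da.default_timeout    # per-command timeout for da(4) devices (seconds, if I read the source right)
sysctl kern.cam.da.retry_count        # number of retries before the error is passed up

# example: fail faster (can also be set in /boot/loader.conf)
sysctl kern.cam.da.default_timeout=30
sysctl kern.cam.da.retry_count=2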
 
Hmmm, from the log it isn't clear to me whether the retry of the failed command succeeds.

If not, this would be a serious bug in ZFS.

What does `zpool status` say about the array?
 
Hmmm, from the log it isn't clear to me whether the retry of the failed command succeeds.

If not, this would be a serious bug in ZFS.

What does `zpool status` say about the array?
At the time, the zpool was reporting no issues at all.

I've since onlined the drive again to do some investigation. Here is the current output of zpool status.
Code:
  pool: nsgroot
 state: ONLINE
  scan: resilvered 4.72G in 0 days 00:02:17 with 0 errors on Wed Jan 18 21:45:58 2023
config:

        NAME                             STATE     READ WRITE CKSUM
        nsgroot                          ONLINE       0     0     0
          mirror-0                       ONLINE       0     0     0
            gpt/zroot_pooldisk_/dev/da0  ONLINE       0     0     0
            gpt/zroot_pooldisk_/dev/da1  ONLINE       0     0     0

errors: No known data errors
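
Next I'll probably run a scrub so that ZFS reads every allocated block on both sides of the mirror; if da0 is still remapping sectors, the READ/CKSUM counters in zpool status should finally show something:
Code:
zpool scrub nsgroot
zpool status -v nsgroot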
 