Hi,
I have a ZFS mirror with two drives. Nagios checks the SMART health of the drives and reports the following error:
Code:
SMART Health Status: FAILURE PREDICTION THRESHOLD EXCEEDED: ascq=0x5 [asc=5d, ascq=5]
The following shows the SMART information for the drive.
Code:
smartctl -x /dev/da0
smartctl 7.3 2022-02-28 r5338 [FreeBSD 12.3-RELEASE amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST1200MM0129
Revision: C004
Compliance: SPC-4
User Capacity: 1,200,243,695,616 bytes [1.20 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 10500 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c500dd65f097
Serial number: WFK8TTBF0000K1308ZH0
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Thu Jan 19 13:23:41 2023 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
Read Cache is: Enabled
Writeback Cache is: Disabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: FAILURE PREDICTION THRESHOLD EXCEEDED: ascq=0x5 [asc=5d, ascq=5]
Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 7875
Power on minutes since format <not available>
Current Drive Temperature: 31 C
Drive Trip Temperature: 60 C
Manufactured in week 05 of year 2021
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 33
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 354
Elements in grown defect list: 8162
Vendor (Seagate Cache) information
Blocks sent to initiator = 186240
Blocks received from initiator = 18136392
Blocks read from cache and sent to initiator = 15298
Number of read and write commands whose size <= segment size = 343195
Number of read and write commands whose size > segment size = 17
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 4363.88
number of minutes until next internal SMART test = 51
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      63039        0         0     63039          0          0.045           0
write:         0        0         0         0          0          9.701           0
verify:    10580       61         0     10641         61          0.050           0
Non-medium error count: 0
[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -       4150                 - [-   -    -]
Long (extended) Self-test duration: 6723 seconds [112.0 minutes]
Background scan results log
Status: no scans active
Accumulated power on time, hours:minutes 4363:53 [261833 minutes]
Number of background scans performed: 0, scan progress: 0.00%
Number of background medium scans performed: 0
# when lba(hex) [sk,asc,ascq] reassign_status
1 4116:01 000000004f0361f8 [1,17,1] Recovered via rewrite in-place
----removed lines for brevity----
671 4118:22 000000004f11a960 [1,17,1] Recovered via rewrite in-place
>>>> log truncated, fetched 16124 of 49172 available bytes
Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 0
number of phys = 1
phy identifier = 0
attached device type: SAS or SATA device
attached reason: loss of dword synchronization
reason: power on
negotiated logical link rate: phy enabled; 12 Gbps
attached initiator port: ssp=1 stp=1 smp=1
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5000c500dd65f095
attached SAS address = 0x52cea7f059305600
attached phy identifier = 6
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization count = 0
Phy reset problem count = 0
relative target port id = 2
generation code = 0
number of phys = 1
phy identifier = 1
attached device type: no device attached
attached reason: unknown
reason: unknown
negotiated logical link rate: phy enabled; unknown
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5000c500dd65f096
attached SAS address = 0x0
attached phy identifier = 0
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization count = 0
Phy reset problem count = 0
The services running on the machine that access data on the mirror were all crawling, presumably because the drive was busy handling errors.
Is it possible to configure the zpool to treat the drive as bad and remove it from the pool automatically?
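For now I can only do it by hand. As far as I understand (the pool name "tank" below is just a placeholder for mine), something like this would take the disk out of service until I can replace it:
Code:
# stop ZFS from issuing I/O to the suspect disk
zpool offline tank da0
# after physically swapping in a new disk at the same location, resilver
zpool replace tank da0
But I'd rather the pool reacted to the failing drive on its own instead of me intervening.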
I've read about "SCT Error Recovery Control", but the drives don't allow reading or writing that value via smartctl.
Code:
smartctl -l scterc /dev/da0
smartctl 7.3 2022-02-28 r5338 [FreeBSD 12.3-RELEASE amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
SCSI device successfully opened
Use 'smartctl -a' (or '-x') to print SMART (and more) information
I've also read about using camcontrol to alter pertinent settings but I'm unsure which ones.
Code:
camcontrol modepage /dev/da0 -m 0x07
EER: 0
PER: 0
DTE: 0
DCR: 0
Verify Retry Count: 20
Verify Correction Span: 255
Verify Recovery Time Limit: 65535
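If it's the read-side recovery settings that matter, my guess is they live in mode page 0x01 (Read-Write Error Recovery) rather than in page 0x07 above. Untested on this drive, but I believe the page can be inspected and edited like this:
Code:
# show the read-write error recovery mode page
camcontrol modepage /dev/da0 -m 0x01
# open the page in $EDITOR to change values
camcontrol modepage /dev/da0 -m 0x01 -e
Does anyone know which of those fields (e.g. Read Retry Count or Recovery Time Limit) would make the drive give up on a bad sector faster instead of stalling the whole pool?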