OS: FreeBSD 10.3-RELEASE amd64
I am running several raidz3 zpools, each consisting of several 15-drive vdevs. The drives are all HGST HUS726060ALS640 (Hitachi 6 TB SAS), installed in Supermicro JBODs and attached to an Avago Technologies (LSI) SAS2308 HBA (mps driver).
I am seeing random, repeated instances across similarly configured machines where a single drive seems to drag down the performance of the entire zpool. smartctl and zpool status report no issues, and there's nothing on the console indicating a problem.
The only ways to detect and identify a problem, aside from the lowered I/O performance, are things like:
Code:
time smartctl -i [every other drive]
0.01 real 0.00 user 0.00 sys
time smartctl -i [bad drive]
7.77 real 0.00 user 0.00 sys
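To make that check systematic, I've been thinking of a sweep along these lines (a minimal sketch; the /dev/da* glob for whole disks and the ~0.01s healthy baseline are assumptions from my setup):
Code:
#!/bin/sh
# Minimal sketch: time "smartctl -i" against every whole da disk.
# Healthy drives here answer in ~0.01s; a "bad" one takes seconds.
for d in /dev/da*; do
    case "$d" in *p[0-9]*|*s[0-9]*) continue ;; esac  # skip partitions/slices
    t=$( { /usr/bin/time -p smartctl -i "$d" >/dev/null; } 2>&1 |
        awk '/^real/ { print $2 }' )
    printf '%s\t%ss\n' "$d" "$t"
done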
And this is repeatable: the "bad" drive is always very slow to return smartctl data. The output comes out a few lines at a time with pauses in between, and it doesn't just pause when it gets to the read/write stats; the delay starts immediately:
Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-RELEASE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: HGST
Product: HUS726060ALS640
Revision: A280
Compliance: SPC-4
[4sec pause]
User Capacity: 6,001,175,126,016 bytes [6.00 TB]
Logical block size: 512 bytes
[4sec pause]
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
[4sec pause]
Logical Unit id: 0x5000cca232653c04
Serial number: 1EHTNY1C
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Thu Sep 7 14:45:39 2017 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
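Each pause appears to correspond to an individual SCSI command rather than anything smartmontools computes, so the same timing can be done against a bare INQUIRY with camcontrol (a sanity check, assuming the delay lives in the drive's command handling):
Code:
/usr/bin/time -p camcontrol inquiry da24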
The other thing I note is that gstat generally shows all disks idle (whereas normal activity is busy and uniform):
Code:
dT: 0.002s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0     0.0| da13
    0      0      0      0    0.0      0      0    0.0     0.0| da14
    0      0      0      0    0.0      0      0    0.0     0.0| da15
    0      0      0      0    0.0      0      0    0.0     0.0| da0
    0      0      0      0    0.0      0      0    0.0     0.0| da1
    0      0      0      0    0.0      0      0    0.0     0.0| da2
    0      0      0      0    0.0      0      0    0.0     0.0| da3
    0      0      0      0    0.0      0      0    0.0     0.0| da4
    0      0      0      0    0.0      0      0    0.0     0.0| da5
    0      0      0      0    0.0      0      0    0.0     0.0| da7
    0      0      0      0    0.0      0      0    0.0     0.0| da8
    0      0      0      0    0.0      0      0    0.0     0.0| da9
    0      0      0      0    0.0      0      0    0.0     0.0| da10
    0      0      0      0    0.0      0      0    0.0     0.0| da11
    0      0      0      0    0.0      0      0    0.0     0.0| da12
    0      0      0      0    0.0      0      0    0.0     0.0| da16
    0      0      0      0    0.0      0      0    0.0     0.0| da17
    0      0      0      0    0.0      0      0    0.0     0.0| da18
    0      0      0      0    0.0      0      0    0.0     0.0| da19
    0      0      0      0    0.0      0      0    0.0     0.0| da20
    0      0      0      0    0.0      0      0    0.0     0.0| da22
    0      0      0      0    0.0      0      0    0.0     0.0| da23
   21      0      0      0    0.0      0      0    0.0     0.0| da24
This idle state alternates with brief bursts (a single gstat iteration) in which the "bad" drive (da24) is *very* busy (>200% busy):
Code:
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0     21     21    259    1.3      0      0    0.0     1.3| da13
    0     18     18    239    2.1      0      0    0.0     1.8| da14
    0     20     20    148    5.3      0      0    0.0     6.7| da15
    1     21     21    243    3.5      0      0    0.0     4.4| da0
    1     22     22    224    3.4      0      0    0.0     4.6| da1
    0     25     25    251    3.2      0      0    0.0     6.1| da2
    0     22     22    255    4.8      0      0    0.0     8.5| da3
    0     18     18    220    1.9      0      0    0.0     3.3| da4
    0     17     17    212    3.8      0      0    0.0     4.3| da5
    2     21     21    259    4.1      0      0    0.0     6.9| da7
    0     19     19    259    2.4      0      0    0.0     3.0| da8
    0     20     20    212    2.8      0      0    0.0     4.2| da9
    0     24     24    224    5.3      0      0    0.0     9.6| da10
    1     21     21    220    4.1      0      0    0.0     5.5| da11
    0     22     22    243    4.4      0      0    0.0     5.2| da12
    1     18     18    144    6.1      0      0    0.0     8.6| da16
    0     20     20    148    6.7      0      0    0.0     8.3| da17
    0     17     17    136    7.4      0      0    0.0     8.5| da18
    0     16     16    132    6.7      0      0    0.0     7.3| da19
    1     17     17    144    5.1      0      0    0.0     7.0| da20
    2     22     22    160    4.7      0      0    0.0     7.1| da22
    2     22     22    156    5.5      0      0    0.0     8.6| da23
   23     46     35    224   1620     11    307   5698   223.1| da24
So the drives sit essentially idle most of the time, with brief bursts of very busy activity concentrated on the one (bad) drive.
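Since the tell is a deep queue (L(q)) on a drive completing zero ops/s, a watcher along these lines could catch it without staring at gstat (a rough sketch; the L(q) >= 10 threshold and the field positions in gstat's batch output are assumptions):
Code:
#!/bin/sh
# Sketch: flag any da drive showing a deep queue (field 1) with zero
# completed ops/s (field 2) -- the da24 signature above. Thresholds are guesses.
gstat -b -I 5s -f '^da[0-9]+$' |
awk '$1 + 0 >= 10 && $2 == 0 && $NF ~ /^da/ { print "stuck queue on", $NF, "L(q)=" $1 }'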
In the past, and this has happened at least five times now, I have resolved the situation by removing the drive from the pool:
Code:
zpool offline tank da24
After the zpool flushes writes to the drive (2-5 minutes), the moment the drive is REMOVED the zpool *instantly* resumes normal, busy, fast activity. Interestingly, the smartctl -i timing also instantly returns to normal, quick output on the "bad" drive, yet badblocks tests against it turn up nothing.
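Once the drive is out of the pool, a longer exercise than badblocks might reproduce the problem, since the symptom is latency rather than bad data (a sketch, assuming da24 is the offlined drive; on 10.3, iostat -x reports per-command latency as svc_t):
Code:
# In-drive long self-test; results show up later under "smartctl -a":
smartctl -t long /dev/da24
# Full-surface sequential read while watching command latency:
dd if=/dev/da24 of=/dev/null bs=1m &
iostat -x -w 10 da24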
Only once has a drive thrown an error to the console after removal:
Code:
(da65:mps4:0:25:0): READ(16). CDB: 88 00 00 00 00 02 ba a0 f0 20 00 00 00 e0 00 00
(da65:mps4:0:25:0): CAM status: SCSI Status Error
(da65:mps4:0:25:0): SCSI status: Check Condition
(da65:mps4:0:25:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da65:mps4:0:25:0):
(da65:mps4:0:25:0): Field Replaceable Unit: 0
(da65:mps4:0:25:0): Command Specific Info: 0
(da65:mps4:0:25:0): Actual Retry Count: 153
(da65:mps4:0:25:0): Descriptor 0x80: f7 2d
(da65:mps4:0:25:0): Descriptor 0x81: 04 b3 15 00 0
(da65:mps4:0:25:0): Error 5, Unretryable error
but that is rare.
HGST has brushed this off, saying the drive is deprecated and that we should move to a newer model. That's fine, but with this many in service it isn't so simple to replace them all.
What I'm trying to do here is:
- understand what's going on
- devise alternate methods to diagnose, predict, test, and identify the bad drive while in/out of production (see the cron sketch after this list)
- prevent it from happening, if possible
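For the in-production side, I'm considering wiring the smartctl timing sweep above into cron so a slow drive gets flagged before the pool bogs down (a sketch; the script path is a hypothetical placeholder for the sweep sketched earlier):
Code:
# /etc/crontab -- the script path is a placeholder, not an existing tool
*/15    *       *       *       *       root    /usr/local/sbin/smart_timing_sweep.sh | logger -t smartsweep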
I appreciate your thoughts!