Odd drive failures: zfs + mps + HGST HUS726060ALS640

OS: FreeBSD 10.3-RELEASE amd64

I am running several raidz3 zpools, each consisting of several 15-drive vdevs. The drives are all HGST (Hitachi) 6 TB SAS drives, model HUS726060ALS640, installed in Supermicro JBODs and attached to Avago Technologies (LSI) SAS2308 HBAs (the mps driver).

I am seeing repeated, seemingly random instances across similarly configured machines where a single drive seems to drag down the performance of the entire zpool. smartctl and zpool status report no issues, and there is nothing on the console indicating a problem.

The only ways to detect and identify a problem, aside from the lowered I/O performance, are things like:

Code:
time smartctl -i [every other drive] 
        0.01 real         0.00 user         0.00 sys

time smartctl -i [bad drive]
        7.77 real         0.00 user         0.00 sys

And this is repeatable: the "bad" drive is always very slow to return smartctl data. The output comes out a few lines at a time with pauses in between, and it doesn't just pause when it gets to the read/write statistics; the delay starts immediately:

Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-RELEASE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS726060ALS640
Revision:             A280
Compliance:           SPC-4

[4sec pause]

User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes

[4sec pause]

LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches

[4sec pause]

Logical Unit id:      0x5000cca232653c04
Serial number:        1EHTNY1C
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Thu Sep  7 14:45:39 2017 MDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
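
For reference, the same timing check can be looped over every drive with something like the sketch below; it assumes the disks all show up as daN and that sysutils/smartmontools is installed.

Code:
#!/bin/sh
# Time 'smartctl -i' against every da(4) device so the slow responder
# stands out; /usr/bin/time prints the real/user/sys line for each drive.
for disk in $(sysctl -n kern.disks | tr ' ' '\n' | grep '^da' | sort); do
    echo "== ${disk} =="
    /usr/bin/time smartctl -i "/dev/${disk}" > /dev/null
done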

The other thing I note is that gstat generally shows all disks idle (whereas normal activity is busy and uniform):
Code:
dT: 0.002s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0| da13
    0      0      0      0    0.0      0      0    0.0    0.0| da14
    0      0      0      0    0.0      0      0    0.0    0.0| da15
    0      0      0      0    0.0      0      0    0.0    0.0| da0
    0      0      0      0    0.0      0      0    0.0    0.0| da1
    0      0      0      0    0.0      0      0    0.0    0.0| da2
    0      0      0      0    0.0      0      0    0.0    0.0| da3
    0      0      0      0    0.0      0      0    0.0    0.0| da4
    0      0      0      0    0.0      0      0    0.0    0.0| da5
    0      0      0      0    0.0      0      0    0.0    0.0| da7
    0      0      0      0    0.0      0      0    0.0    0.0| da8
    0      0      0      0    0.0      0      0    0.0    0.0| da9
    0      0      0      0    0.0      0      0    0.0    0.0| da10
    0      0      0      0    0.0      0      0    0.0    0.0| da11
    0      0      0      0    0.0      0      0    0.0    0.0| da12
    0      0      0      0    0.0      0      0    0.0    0.0| da16
    0      0      0      0    0.0      0      0    0.0    0.0| da17
    0      0      0      0    0.0      0      0    0.0    0.0| da18
    0      0      0      0    0.0      0      0    0.0    0.0| da19
    0      0      0      0    0.0      0      0    0.0    0.0| da20
    0      0      0      0    0.0      0      0    0.0    0.0| da22
    0      0      0      0    0.0      0      0    0.0    0.0| da23
   21      0      0      0    0.0      0      0    0.0    0.0| da24

And the pool alternates between this idle state and brief periods (a single gstat iteration) in which the "bad" drive (da24) is *very* busy (>200% busy):

Code:
L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0     21     21    259    1.3      0      0    0.0    1.3| da13
    0     18     18    239    2.1      0      0    0.0    1.8| da14
    0     20     20    148    5.3      0      0    0.0    6.7| da15
    1     21     21    243    3.5      0      0    0.0    4.4| da0
    1     22     22    224    3.4      0      0    0.0    4.6| da1
    0     25     25    251    3.2      0      0    0.0    6.1| da2
    0     22     22    255    4.8      0      0    0.0    8.5| da3
    0     18     18    220    1.9      0      0    0.0    3.3| da4
    0     17     17    212    3.8      0      0    0.0    4.3| da5
    2     21     21    259    4.1      0      0    0.0    6.9| da7
    0     19     19    259    2.4      0      0    0.0    3.0| da8
    0     20     20    212    2.8      0      0    0.0    4.2| da9
    0     24     24    224    5.3      0      0    0.0    9.6| da10
    1     21     21    220    4.1      0      0    0.0    5.5| da11
    0     22     22    243    4.4      0      0    0.0    5.2| da12
    1     18     18    144    6.1      0      0    0.0    8.6| da16
    0     20     20    148    6.7      0      0    0.0    8.3| da17
    0     17     17    136    7.4      0      0    0.0    8.5| da18
    0     16     16    132    6.7      0      0    0.0    7.3| da19
    1     17     17    144    5.1      0      0    0.0    7.0| da20
    2     22     22    160    4.7      0      0    0.0    7.1| da22
    2     22     22    156    5.5      0      0    0.0    8.6| da23
   23     46     35    224   1620     11    307   5698  223.1| da24

So the drives are essentially idle most of the time, with brief bursts of very high activity concentrated on the one (bad) drive.
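
One rough way to catch those brief spikes unattended is to sample gstat in batch mode and log anything above a threshold. This is only a sketch: the threshold and log path are arbitrary.

Code:
#!/bin/sh
# Take one 1-second gstat batch sample per loop iteration and log any daN
# provider whose %busy exceeds the threshold, with a timestamp, so the
# intermittent spikes on the suspect drive are captured for later review.
THRESHOLD=100    # percent; the "bad" drive here spiked well past 100
while true; do
    ts=$(date '+%F %T')
    gstat -b -I 1s | awk -v t="$THRESHOLD" -v ts="$ts" '
        $NF ~ /^da[0-9]+$/ {
            busy = $(NF - 1) + 0          # "223.1|" coerces to 223.1
            if (busy > t) print ts, $NF, busy "% busy"
        }'
done >> /var/log/gstat-spikes.log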

In the past (and this has happened at least five times now), I have resolved the situation by removing the drive from the pool:

Code:
zpool offline tank da24

After the zpool flushes writes to the drive (2-5 minutes), once the drive is actually removed from the pool the zpool *instantly* resumes normal, busy, fast activity. Interestingly, the smartctl -i output speed on the "bad" drive also instantly returns to normal, and badblocks tests against it turn up nothing.
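
For completeness, the follow-up once the suspect drive has been tested or physically swapped is the standard ZFS procedure; daXX below is a placeholder for whatever replacement device appears, not a name from the output above.

Code:
zpool status tank              # the vdev should show da24 as OFFLINE
zpool online tank da24         # put the original drive back; ZFS resilvers the delta
zpool replace tank da24 daXX   # ...or replace it with a new drive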

Only once has a drive thrown an error to console post removal:

Code:
(da65:mps4:0:25:0): READ(16). CDB: 88 00 00 00 00 02 ba a0 f0 20 00 00 00 e0 00 00
(da65:mps4:0:25:0): CAM status: SCSI Status Error
(da65:mps4:0:25:0): SCSI status: Check Condition
(da65:mps4:0:25:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da65:mps4:0:25:0):
(da65:mps4:0:25:0): Field Replaceable Unit: 0
(da65:mps4:0:25:0): Command Specific Info: 0
(da65:mps4:0:25:0): Actual Retry Count: 153
(da65:mps4:0:25:0): Descriptor 0x80: f7 2d
(da65:mps4:0:25:0): Descriptor 0x81: 04 b3 15 00 0
(da65:mps4:0:25:0): Error 5, Unretryable error

but that is rare.

HGST has brushed this off, saying the drive model is deprecated and that we should go buy the newer model. That's fine, but with this many of them in use it's not so simple to just replace them all.

What I'm trying to do here is:

- understand what's going on
- devise alternate methods to diagnose, predict, test, and identify the bad drive, both in and out of production
- prevent it from happening, if possible

I appreciate your thoughts!
 
Yikes. This is very hard.

To begin with: are you sure the problem is the drive itself? It could also be in the LSI (= Avago = Broadcom) HBA firmware, which reacts "in an interesting fashion" to some unusual but legal behavior ("interesting" being a euphemism for de-facto broken, but we don't want to insult anyone). The problem could even be the firmware in the SAS expander provided by Supermicro. Why do I say that? Because "usually" (meaning 90% of the time) slow drives are caused by platter/head/actuator problems, but the operation on which you happen to measure the slowness (smartctl) probably doesn't require accessing the mechanism at all, only talking to the drive firmware and getting cached data. Therefore the slowdown could be caused by something on the SAS network. By the way, I've been on way too many conference calls with Broadcom and XXX (some vendor of enclosures that include SAS expanders) to debug issues like that, and too often the cause is some subtle incompatibility along the chain of OS driver -> SAS HBA firmware -> SAS expander firmware -> drive.

Unfortunately, you are not a large enough customer to bang on the table and get the support engineering teams from all vendors onto the phone (BTDT, got the T-shirt), and you have already struck out with WD/Hitachi. Even more unfortunately, you are a large enough customer that throwing many dozens of drives into the trash over one undebuggable problem is a very painful financial hit. Sorry, can't help there. My only suggestions: keep beating on the support organizations of all the vendors involved, and try to debug it yourself by moving or replacing pieces, one at a time, until it gets better. For example, PMC-Sierra also makes HBAs, NetApp and Seagate make SAS enclosures, and Seagate makes SAS drives. You might want to procure one of each; then, when you call vendor X's technical support, you can point out that the problem does *not* occur with hardware from vendor Y, which makes your argument somewhat stronger.

Next observation: you find that most drives are idle. That makes perfect sense: in RAID, most IO operations require accessing many (or all) drives at once, so if one drive is slow, it causes a convoy effect and a pileup, and every thread that could start one of these group IO operations ends up stuck waiting for that one slow drive. In large storage systems (with several hundred drives per host), I've seen most drives at an average queue depth below 1 and one drive with an average queue depth of several hundred. This is obviously performance-killing. Fortunately, it makes debugging the problem quite easy: identify the drive that is occasionally or always busy (let's hope it is *exactly* one drive), disconnect it, and the rest should start working much better (perhaps in degraded RAID mode, requiring a resilver). But before you pull drives, make sure you have *exactly* one drive causing trouble. I've seen situations where various drives take turns being the slowpoke that holds up traffic, and there are even sensible reasons from queueing theory for that.

How to debug? One hint is above: the drive on which smartctl is slow, or the drive that is occasionally busy while the others are idle. Here is another idea: find a reasonable workload generator and drive performance measurement tool and quickly iterate over all the drives; if you get lucky, one drive sticks out with laughably bad performance. Unfortunately, a straight sequential dd is about the worst performance tool for this, since even quite broken drives can show good sequential performance until they hit the area with read errors. I don't have specific advice on which performance measurement tool to use, since I used to have one of my own (which I cannot legally share, and even if I could, it is known not to even compile on *BSD). Another reasonable (but not perfect) predictor of drive problems is to look inside the SMART data for read or write operations that were completed "without error but with delay", or with corrected errors; often (but not always) drives begin failing slowly, with internal errors that they can still correct. Again, this doesn't match the observation that smartctl invocations themselves are slow, but it is worth checking. You obviously also need to check that the drives are not reporting a PFA (predictive failure analysis) condition via SMART. And again, make sure you find *the* problem drive; it would be very embarrassing to identify the wrong drive and throw a perfectly good or merely marginal drive in the trash while leaving the really bad one in the system. And make sure you have spare drives on hand before you start removing suspect drives (duh).
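
As a starting point for that SMART check, here is a crude sketch that dumps the SCSI/SAS health status, grown defect list, and error counter log rows for every drive; the exact field names come from smartctl's SAS output and may vary slightly between drive models and smartctl versions.

Code:
#!/bin/sh
# Print the SAS health status, grown defect list size, and the read/write/
# verify rows of the error counter log for every da(4) device, so a drive
# with delayed or corrected errors stands out.
for disk in $(sysctl -n kern.disks | tr ' ' '\n' | grep '^da' | sort); do
    echo "=== ${disk} ==="
    smartctl -a "/dev/${disk}" | egrep -i \
        'health status|grown defect|^read:|^write:|^verify:'
done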

You say you've seen this multiple times. Given that you are using several dozen disks, that is an unusually high failure rate; normally the AFR (annual failure rate) of drives should be below a percent, which is equivalent to an MTBF of over a million hours. I don't know how to explain this. Maybe you got a bad batch of drives (in which case Hitachi would probably know about it, and your serial numbers would raise alarms in their support system, though they might not be willing to talk about it). Maybe you have a common failure cause, like overheating, vibration, or some bug in the SAS and firmware stack. Or maybe you're just a really unlucky guy (in which case I recommend buying a lottery ticket or going on a date tonight; statistics tells us you're due for some good luck. I kid, I kid).
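
For reference, the arithmetic behind that equivalence (assuming roughly 8766 power-on hours per year, and approximating AFR as hours-per-year divided by MTBF for small rates):

$$\mathrm{MTBF} \approx \frac{8766\ \mathrm{h/yr}}{\mathrm{AFR}} = \frac{8766}{0.0087} \approx 1.0\times 10^{6}\ \mathrm{h}$$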

The ideal solution would be to measure disk performance during normal file system and RAID operation, by monitoring continuously and comparing drives against each other or against known baselines. Unfortunately, ZFS on FreeBSD has no built-in "disk hospital" or "slow disk detection" functionality that would do this. In large systems, identifying functioning-but-slow drives is not easy. I've spent a lot of time on it, and in my experience a significant fraction of drive "failures" are exactly what you're observing: drives that work (they can read and write) and don't report imminent death via SMART, yet are so slow that they make the whole system de-facto unusable.
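
A crude sketch of that idea, using nothing but gstat and awk, is below; the 5x threshold and the 10-second sample window are arbitrary, and a real monitor would sample continuously and keep per-drive history rather than looking at a single snapshot.

Code:
#!/bin/sh
# Take one 10-second gstat sample and flag any daN drive whose average read
# latency (ms/r, field 5) is more than 5x the average across all drives.
gstat -b -I 10s | awk '
    $NF ~ /^da[0-9]+$/ { msr[$NF] = $5 + 0; sum += $5 + 0; n++ }
    END {
        if (n == 0) exit
        avg = sum / n
        for (d in msr)
            if (msr[d] > 5 * avg && msr[d] > 1)
                printf "%s: %.1f ms/r vs %.1f ms/r average\n", d, msr[d], avg
    }'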
 
%busy in gstat spiking above 100% is often a good indicator that something is wrong (there are exceptions, such as disk spin-up or other legitimately long operations). It means that some commands take longer than a single gstat update period, which makes the value jump between 0 and >100%. In reality there are no spikes: the drive is busy all the time, but because individual commands take so long relative to the sampling interval, the statistics just look that odd.
 