Solved Is this HDD usable?

Hello,

I ran badblocks on a spare HDD I had and got this:
Code:
# badblocks -wsv -b 4096 /dev/da11
Checking for bad blocks in read-write mode
From block 0 to 146515445
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: 4089366 done, 5:11:11 elapsed. (0/0/0 errors)
4089974 done, 5:11:14 elapsed. (1/0/0 errors)
4090171 done, 5:11:17 elapsed. (2/0/0 errors)
4090367 done, 5:11:18 elapsed. (3/0/0 errors)
4090564 done, 5:11:20 elapsed. (4/0/0 errors)
4146285 done, 5:11:25 elapsed. (5/0/0 errors)
4146481
4147071 done, 5:11:27 elapsed. (7/0/0 errors)
4148073 done, 5:11:28 elapsed. (8/0/0 errors)
4148662 done, 5:11:29 elapsed. (9/0/0 errors)
4148859
4149055 done, 5:11:31 elapsed. (11/0/0 errors)
4149251
4149448 done, 5:11:33 elapsed. (13/0/0 errors)
4149644
4149841 done, 5:11:35 elapsed. (15/0/0 errors)
4150253
4150450 done, 5:11:37 elapsed. (17/0/0 errors)
4150646
4150843 done, 5:11:38 elapsed. (19/0/0 errors)
done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 20 bad blocks found. (20/0/0 errors)

smartctl returns
Code:
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     44 C
Drive Trip Temperature:        85 C

Manufactured in week 33 of year 2011
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  194
Elements in grown defect list: 10

Vendor (Seagate) cache information
  Blocks sent to initiator = 22299609736937472

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0   926754         0    926754    2851834      17772.447         143
write:         0  8543488         0   8543488    6976216      63413.454           0
verify:        0        0         0         0      90375          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->      24   46593          34090189 [0x3 0x5d 0x1]
# 2  Background short  Completed                  16   46592                 - [-   -    -]
# 3  Background short  Completed                  16       0                 - [-   -    -]
# 4  Background short  Completed                  16       0                 - [-   -    -]

Long (extended) Self Test duration: 4981 seconds [83.0 minutes]

Can I use this disk in a mirrored zpool?

Thanks
 
The answer to "can you use this disk" is itself a series of questions: Do you like your data and want it back? Do you know where your backup tapes are at all times? How much downtime and hassle do you want to deal with?

Seriously: this disk had 143 cases where it was unable to read, in 17.7 TB read (according to the smartctl data). The uncorrectable error rate specified by the manufacturer (called UBER or URE) is probably 10^-14 or 10^-15 per bit (the former is typical for consumer-grade disks, the latter for enterprise-grade). So let's calculate: 143 / 17.7e12 / 8 (the eight converts bytes to bits) is about 1×10^-12. That is about 100 or 1000 times worse than the manufacturer's specification for a good disk. Meaning your disk is pretty bad.
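
A rough sketch of that arithmetic, in Python (the 17.7 TB and 143 uncorrected errors come from the smartctl output above; the spec values are assumed typical UBER figures, not this drive's data sheet):
Code:
# Observed uncorrectable-read-error rate vs. typical spec values.
bytes_read = 17.772447e12      # "Gigabytes processed" for reads, ~17.7 TB
uncorrected_errors = 143       # "Total uncorrected errors" for reads

observed_rate = uncorrected_errors / (bytes_read * 8)    # errors per bit read
print(f"observed: {observed_rate:.1e} errors/bit")       # ~1.0e-12

for name, spec in [("consumer spec 1e-14", 1e-14), ("enterprise spec 1e-15", 1e-15)]:
    print(f"{name}: {observed_rate / spec:.0f}x worse")  # ~100x / ~1000x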

The best predictor of future errors (not a perfect predictor, but the best one we have) is errors in the past. Personally, I would not use this disk to store data on, except for stuff where loss of the data and the amount of work required to get the system back together are not a problem; the likelihood is high that it will get more errors, or fail soon.

Now, you might say: it is in a mirror pair, so if it fails, the other mirror has a good copy. Nice theory. If the disk fails, ZFS will have to resilver everything onto a spare, which is a lot of extra work for it. During the rebuild (which takes many hours), performance of the system will stink. This disk might also fail in a way that crashes the system, or makes the system incapable of operating with the disk attached (that's not all that unusual), so you may have to spend some time diagnosing and removing the disk, and deal with downtime. All this is just a hassle.

Now let's talk about the risk to the data. I don't know how big your disk is; let's say it's 1.25 TB (to make the math easier). That's 10×10^12 bits. Remember, I said above that the specified error rate *for good disks* is 10^-14 per bit. If you multiply those out, the probability that you find at least one error when having to resilver the mirror pair is 10%!! This is a very important observation: with the size of modern disk drives, the probability that one complete read of a drive (which is needed when resilvering a RAID group) will hit at least one error is rapidly approaching 1!!! This is why the (now retired) CTO of NetApp, who has forgotten more about storage than you and I together know, said a while ago: selling RAID that can only tolerate a single disk error is like professional malpractice.
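
A minimal sketch of that resilver-risk estimate, in Python (the 1.25 TB size is the illustrative value from the paragraph above, not the actual drive; the UBER is the assumed spec rate for a good disk):
Code:
# Probability of hitting at least one unrecoverable read error (URE)
# while reading one whole disk, e.g. during a mirror resilver.
bits_to_read = 1.25e12 * 8     # full read of an assumed 1.25 TB disk, in bits
uber = 1e-14                   # assumed spec error rate for a *good* disk

p_at_least_one = 1 - (1 - uber) ** bits_to_read
print(f"P(>=1 URE during resilver) ~ {p_at_least_one:.0%}")   # ~10%

# For small probabilities this is roughly bits_to_read * uber:
print(f"approximation: {bits_to_read * uber:.0%}")            # 10%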

With single-fault-tolerant RAID (like a simple mirror), I would try to use good-quality disks, and still have a backup strategy. Doing it with a known bad disk is ... dumb (unless it's just scratch space and you don't mind playing with your computer). To get to the point where disk failures are not a significant source of data loss, you need to use multi-fault-tolerant RAID schemes.
 
What ralphbsz said. A six-year-old drive, tons of errors; what's the MTBF for that drive? It's easy math: buy another drive. Some stuff just begs to get thrown in the trash. After a date with a sledgehammer, of course.
 
All hard disks have a 'spare' bit of space. Bad blocks happen, but most of the time the drive's firmware will automatically remap the bad block to that spare area. The fact that you're actually seeing bad blocks means this spare space is used up, which simply means the drive cannot be trusted any more and the number of bad blocks will increase. Time for a new disk.
 
To provide a counter-example: here I have a SAS Toshiba enterprise disk with 13 unrecoverable errors out of 544 TB of access. The specified BER of this drive is apparently 10 per 10^17 bits (https://store.cmsdistribution.com/hard-drives-internal/toshiba-600gb-10000rpm-2.5-sas/).

So here:
Code:
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        7        13         0          0     545001.488          13
write:         0        0         0         0          0     327798.157           0
verify:        0        0         0         0          0       4698.167           0

So performing the same math: 13 / 544×10^12 / 8 ≈ 3×10^-15 errors per bit, which is roughly 30 times the manufacturer's specified rate. Still not great; I should think about replacing this drive soon, but for now it's OK.
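
The same comparison as a small sketch, in Python (544 TB read and 13 uncorrected errors are from the error counter log above; the spec of 10 errors per 10^17 bits is the figure quoted from the vendor page):
Code:
# Expected errors at the quoted spec rate vs. what this drive actually produced.
bits_read = 545001.488e9 * 8       # "Gigabytes processed" for reads, in bits
observed_errors = 13
spec_rate = 10 / 1e17              # quoted spec: 10 errors per 10^17 bits

expected_at_spec = bits_read * spec_rate
print(f"expected at spec: {expected_at_spec:.2f} errors")    # ~0.44
print(f"observed: {observed_errors} "
      f"(~{observed_errors / expected_at_spec:.0f}x the spec rate)")   # ~30x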

Going back a bit, ZFS had reported some issues with the drive:
Code:
$ sudo zpool status
  pool: zmain
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 80K in 9h12m with 0 errors on Sat Mar 16 11:42:52 2019
config: 

        NAME                              STATE     READ WRITE CKSUM
        zmain                             ONLINE       0     0     0
          mirror-0                        ONLINE       0     0     0
            scsi-350000393b81066d8-part3  ONLINE       0     0     0
            scsi-350000393b8106828-part3  ONLINE      27     0     7

After I determined the failure rate of the drive and compared it to the BER expectations, I decided to clear the errors for now and wait to see whether the next ZFS scrub fails again.

I also initiated the long SMART test, and it does fail :(
Code:
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       7   39430            209811 [0x3 0x13 0x0]
# 2  Background short  Failed in segment -->       7   39430            209811 [0x3 0x13 0x0]
 
While the BER accounts for some fraction of defective data, a greater source of data corruption is the magnetic recording medium coating the platters.
Uncorrectable errors showing up means the spare space is used up, i.e. there's a problem with the magnetic recording on the platter itself. Replace the disk.

In the 60 days following the first uncorrectable error on a drive (S.M.A.R.T. attribute 0xC6 or 198) detected as a result of an offline scan, the drive was, on average, 39 times more likely to fail than a similar drive for which no such error occurred.
 
If you are using SCSI disks (for example SAS), then SMART is already much more useful than on (S)ATA, because in SCSI the disk can tell you its PFA = Predictive Failure Analysis status: SCSI disks can report that they want to be taken out of service, because they know that their health is impaired. This is way better than the (S)ATA version of SMART, where disks simply report error counts, and the sysadmin has to reach conclusions themselves (which is somewhere between hard and impossible, depending on how much the admin knows about the internals of the drives they're using).
 
So. If SMART on SAS is the bomb, and it reports:
Code:
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

but it also reports 13 total unrecovered errors,

AND it also reports failed SMART tests:
Code:
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       7   39430            209811 [0x3 0x13 0x0]
# 2  Background short  Failed in segment -->       7   39430            209811 [0x3 0x13 0x0]

Don't most users find that unclear/conflicting? Should the operator replace the disk "immediately", "soon", or just "wait for more errors"?
Isn't SMART status in general misleading?
 
SMART doesn't say "this disk will die within exactly this time period". Predicting the future isn't that easy. In a nutshell, the (S)ATA version of SMART simply reports counters of errors, while the SCSI/SAS version also reports a flag indicating that the disk itself expects to have a higher probability of failing soon (that's what the PFA flag means).

When should the user replace the disk? Excellent question. There is no simple answer, because it depends on factors outside the disk itself, namely on how the disk is being used: how redundant is the data on it, where are the other copies of the data, and how long would it take to reconstruct?

In the world of production storage systems, nearly all data is stored redundantly, to guard against failure of any individual disk drive, and sometimes against failure of other components: with the big cloud providers like Amazon/Google/Microsoft, customers can request that multiple copies of their data be stored on different continents, so it is protected even against large-scale network outages. The number of copies is an adjustable parameter, ranging from roughly 1.05 (spread the data over about 20 disks and write one extra parity-like encoding of the data on an extra disk) to roughly 10 (write multiple copies of the data to many disks, then copy that to sets of disks in different locations). All this is summarized under the term "RAID", although the encodings used today have little to do with the traditional RAID schemes from 30 years ago.

Once you have RAIDed your data, you can estimate the probability that the data will be "damaged" (inaccessible temporarily or permanently). The four important ingredients in that calculation are: the probability that one copy of the data is damaged or that a whole disk fails; the probability that another error or failure occurs during the re-copying of the data; how long the re-copying takes (because a longer window increases the chance that another disk fails in that time); and how the data is encoded, meaning how much data needs to be read to do the re-copying. These calculations are very difficult, and for some of the more interesting encodings virtually impossible to do exactly, so the results have to be estimated.
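
To make those four ingredients concrete, here is a toy sketch in Python for the simplest case, a two-way mirror; every number in it is an assumed, illustrative value, not a measurement:
Code:
# Toy data-loss model for a 2-way mirror: one disk fails, and during the
# rebuild the surviving disk either fails too or hits an unrecoverable read error.
capacity_bits = 4e12 * 8        # assumed 4 TB disks
uber = 1e-14                    # assumed spec uncorrectable error rate per bit
afr = 0.02                      # assumed 2% annual failure rate per disk
rebuild_hours = 12.0            # assumed time to resilver onto a replacement

p_first_failure = 1 - (1 - afr) ** 2                     # either disk fails this year
p_second_failure = afr * (rebuild_hours / (365 * 24))    # survivor dies during rebuild
p_ure_during_rebuild = 1 - (1 - uber) ** capacity_bits   # full read hits a URE

p_rebuild_trouble = 1 - (1 - p_second_failure) * (1 - p_ure_during_rebuild)
p_data_damage_per_year = p_first_failure * p_rebuild_trouble
print(f"P(data damaged or lost this year) ~ {p_data_damage_per_year:.1%}")   # ~1%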

Then on the other hand, one can predict the failure probability of disk drives from the SMART data. There is at least one published paper that directly links measured failure rates to SMART giving alarms (it was in FAST, in the early 2000s, by an author from Google), and there is a whole series of papers by a professor from Toronto about disk failure rates, with some discussion of how well SMART predictions correlate with actual errors and failures.

So now you have two ingredients: the second lets you predict how likely a future failure (or error) of the disk is, and the first lets you estimate the impact of such a failure on the survival or accessibility of your data. Taking the two together, you can build a mathematical model that lets you make the tradeoff: is it (economically and from a risk-analysis point of view) better to keep this sick disk running for now and rely on the redundant disks in case it fails, or is it time to take it out of service? This is an economic analysis, and the value of the data (how bad the financial damage would be if it weren't there) and the cost of maintaining and repairing the computer need to be factored into it.
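
And a toy version of that economic tradeoff, in Python; every number here is an assumed placeholder to be replaced by your own estimates:
Code:
# Expected-cost comparison: replace the suspect disk now vs. keep it and hope.
p_fail_soon = 0.30          # assumed probability the sick disk fails this year
p_loss_given_fail = 0.05    # assumed chance the rebuild also goes wrong (see above)
replacement_cost = 150.0    # price of a new disk
downtime_cost = 200.0       # assumed value of your time for diagnosing/resilvering
data_value = 5000.0         # assumed financial damage if the data is really gone

cost_replace_now = replacement_cost
cost_keep = p_fail_soon * (replacement_cost + downtime_cost
                           + p_loss_given_fail * data_value)
print(f"replace now: ~{cost_replace_now:.0f}")         # 150
print(f"keep and hope: ~{cost_keep:.0f} expected")     # 180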

For individual users, this risk analysis is practically impossible to make; for the people who implement large storage systems (companies like DDN, EMC, Hitachi, HP, IBM, ..., and cloud providers like the aforementioned Amazon/Google/Microsoft), that analysis is difficult but doable. My personal conclusion is: Store all important data redundantly (at least 2 copies, better 3-4), and replace a disk as soon as the error counters increase significantly, or the SCSI PFA flag is set. YMMV.

No, SMART is not misleading. It may seem misleading if you expect a simple answer to a complex question, but the problem there is unrealistic expectations.
 