ZFS Help with failing HDDs

I have an HP ProLiant ML110 G6 with a 120GB SSD for my OS and two 2TB WD enterprise HDDs in a zpool mirror. Well, these messages came up on my screen tonight:
Code:
smartd[11102]: Device: /dev/ada2, 10 currently unreadable (pending) sectors
smartd[11102]: Device: /dev/ada2, 44 Offline uncorrectable sectors

These messages appeared along with preceding CAM errors for a backup HDD I had plugged in via USB. After some digging, I saw that /dev/ada2 was about to kick the bucket, so I plugged in a new, identical HDD and ran:

Code:
zpool replace mypool ada2 ada3

This resilvered onto the new disk and dropped ada2 from my pool. I then ran a scrub of the pool and got the same smartd error as above on my brand-new ada3 drive, this time with 3 unreadable (pending) sectors and no mention of offline uncorrectable sectors.

Maybe I'm fooling myself, but I have a hard time believing that one drive died after 4 months and another died in a matter of minutes. This is the new readout of my pool:

Code:
  pool: mypool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
   attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
   using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Sep  8 02:28:27 2017
        43.8G scanned out of 191G at 78.5M/s, 0h32m to go
        0 repaired, 22.89% done
config:

   NAME        STATE     READ WRITE CKSUM
   mypool      ONLINE       0     0     0
     mirror-0  ONLINE       0     0     0
       ada3    ONLINE       0     0     1
       ada1    ONLINE       0     0     0

errors: No known data errors

I have been thinking that maybe this is because I put in 2 more 4GB sticks of ECC memory, since none of this was an issue until I did that. I have since taken out those 2 new sticks to see what'll happen. Any thoughts?
 
If memory were an issue I would have expected the errors to be more random. You would also be notified of it: ECC will "fix" memory errors, but it does alert you when it does. The memory would have to be severely broken for ECC not to catch those errors. Not impossible, but quite unlikely.
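On most amd64 systems FreeBSD reports corrected memory errors through the machine-check (MCA) framework, so a rough way to see whether ECC has been complaining (just a sketch, assuming the default syslog configuration) is:
Code:
# Look for machine-check reports, which is where corrected ECC errors
# usually end up; the exact wording varies by CPU and BIOS.
dmesg | grep -i mca
grep -i mca /var/log/messages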

Maybe I'm fooling myself, but I have a hard time believing that one drive died after 4 months and another died in a matter of minutes.
I've seen brand new disks with errors. You'd expect some quality assurance would have caught that before they were sold. It doesn't happen often but it's certainly a possibility. Especially if that "new" disk happens to be a refurbished disk.
 
You seem to be getting low-level errors, which are reported by CAM and by SMART. To diagnose that, look at the low-level evidence: the logs (with the CAM errors and SMART messages), not the ZFS status. As SirDice said, the memory (which may be a problem too) would not show up as those errors.

I would start by looking at the complete output from smartctl -a on both disks. How old are they, how many internal errors have they had, and so on.
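Something along these lines (just a sketch; the device names match the pool shown above) pulls out the attributes that usually tell the story:
Code:
# Full SMART report for each disk in the mirror:
smartctl -a /dev/ada1
smartctl -a /dev/ada3
# Or just the counters that matter most on a dying disk:
smartctl -A /dev/ada3 | egrep 'Power_On_Hours|Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable'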
 
Okay, I've ruled out the memory based on what you said, and also by running the server without my 2 new sticks of RAM for the last 12 hours; the same errors still arise. So, definitely dead drives. I got my hands on 2 4TB HGST drives to swap out my failing drives. So my plan now is to:
  1. Plug in my 2 4TB drives
  2. I'm guessing I'll issue the replace command again, but point the resilver at the first 4TB drive
  3. Knock out my other functioning 2TB drive
  4. Create the mirror again using both my 4TB drives
  5. Scrub and...
Last night, when I finished replacing ada2 with ada3, I shut down my server, unplugged ada2, and booted; the pool came up degraded and zpool status showed my new drive's ID along with the line (was ada3). I expect that to happen again; maybe I was just following the docs wrong.
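For reference, here is the rough sequence I have in mind (just a sketch: the adaX/adaY names are placeholders until I see what the kernel calls the new disks, and I'm assuming zpool replace will take the bigger drives as-is):
Code:
# Placeholder device names; substitute whatever the new 4TB disks show up as.
zpool replace mypool ada3 adaX     # resilver onto the first 4TB disk
zpool status mypool                # wait for the resilver to finish
zpool replace mypool ada1 adaY     # then swap out the remaining 2TB disk
zpool scrub mypool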
 
Okay, I am officially all good. My old HDDs had 512-byte sectors and my new drives have 4KiB sectors, so I had to create a new zpool, snapshot the old one, send/receive the data to the new zpool, export the old zpool, reboot the server (and add my 2nd HDD for resilvering), export my new pool and import it under the old pool's name, attach my second new drive to create the new mirror, let it resilver, and then scrub it.
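Roughly, the sequence looked like this (a sketch from memory; the device names are placeholders, and the min_auto_ashift sysctl is how I understand FreeBSD is told to build the new pool 4KiB-aligned, so double-check before copying):
Code:
sysctl vfs.zfs.min_auto_ashift=12            # new vdevs get 4KiB alignment
zpool create newpool ada4                    # first 4TB disk, temporary pool name
zfs snapshot -r mypool@migrate
zfs send -R mypool@migrate | zfs receive -F newpool
zpool export mypool                          # retire the old pool
zpool export newpool
zpool import newpool mypool                  # re-import under the old name
zpool attach mypool ada4 ada5                # second 4TB disk joins the mirror and resilvers
zpool scrub mypool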

So sad I had 2 dead drives, but I'm glad to have other IT friends so I can nab drives from them and pay them back later :D
 
... Especially if that "new" disk happens to be a refurbished disk.
A year ago, I bought a Hitachi 3TB disk for home. At that point it was already an older model; I bought it because it exactly matched the other disk in the RAID-1 pair. Found an internet vendor with really low prices. Disk shows up at home, and has a partition table! Weird. Quick dd if=/dev/adaXX ... | hexdump -C | more shows that it is full of data. I even found text files on it. A quick peek with smartctl showed that the disk had been powered up for about 20,000 hours already.
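(If you want to run the same quick check on a disk that has just arrived, something like this does it; adaXX is whatever device name the disk gets:)
Code:
# A blank drive should be all zeroes; old data means a used disk.
dd if=/dev/adaXX bs=1m count=8 | hexdump -C | more
# How long has it really been powered up?
smartctl -A /dev/adaXX | grep -i power_on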

The good news: The dealer gave me back my money. The bad news: They actually made me ship it back (although they gave me a shipping label so I didn't have to pay postage), but I had to spend 5 minutes packing it back up, and another 10 minutes driving to the post office. The good news: I found a 4TB disk for just a few dollars more within an hour of searching, and when it showed up, it had nothing on it, and just a few hours of power-up time.
 
Maybe I'm fooling myself, but I have a hard time believing that one drive died after 4 months and another died in a matter of minutes.
Some drive sellers (both new and used) use utterly inadequate packaging when shipping drives. One well-known seller of new drives used to (maybe they still do, I don't buy from them any more) just break up bulk 20-packs and put an individual drive in its antistatic bag in a carton with some "air pillow" cushions, which pop the first time the package gets jostled. Every bump after that has the potential to damage the drive before you even get it out of its factory antistatic bag.

There have also been bad batches of drives from various manufacturers - firmware bugs, media problems, or just poor quality control.

Some people suggest buying based on the Backblaze drive reliability reports. The problem with that is that by the time they have enough data on one model of drive, it is well on its way to being obsolete. And their installation and usage are completely different from how most users will use the drives.
 
Some people suggest buying based on the Backblaze drive reliability reports. The problem with that is that by the time they have enough data on one model of drive, it is well on its way to being obsolete. And their installation and usage are completely different from how most users will use the drives.

The 2 4TB drives that I got as replacements were picked based on that: very low failure rate with the largest sample pool (some sort of HGST, I forget which off the top of my head).

I took a shot at the cheap hard drives, and they worked for a bit. But next time I'll just pay the extra cash for something that I know works. I walked in not expecting much, but I at least hoped for a year so I could later upgrade to 4TB disks after saving up some cash. Well, you live and learn, right? :)

I theoretically have a 1-year warranty on these drives; not sure I'll get lucky and get $$ back. Maybe on one.
 
You want reliable drives? Terry pointed out the problems with the Backblaze list (which I also look at heavily): it is backwards-looking, and by the time drives reach the top of the list they are 3-year-old models, and therefore obsolete. So here is what you can do: look for trends in the list. For example, do the disks that last typically come from certain vendors, and/or from a particular product line?

The other way to buy reliable drives: Buy only drives where the vendor gives a 5-year warranty.
 
Getting money back on one :) the one that I plugged in a month later and that was immediately found to be dead.

Thank you for the insight, ralphbsz. I typically go for WD Reds as my default, but if I can afford it (which 99% of the time I can't), I'll go for the WD Gold datacenter disks.
 