HDD dead

Hi there,
About three days after installing 10.0-RC5 on my server, the 2.5" WD drive failed. With almost no I/O, the drive died in three days. And I had planned to use it for backups!
So my question is: is ZFS known to be very hard on HDDs, since the filesystem itself is very efficient, or is it just a coincidence?

BTW, I just have to drop the HDD in a trash can since the computer is no longer under warranty and the HDD itself is 'out of region'. I guess WD won't replace an HDD here that was bought in China by the PC manufacturer.
So I ordered a WD Red to replace it.
 
First: Definitely get a spare drive (as you are doing), and start using it. Everything else you do only serves your curiosity (doing an autopsy of what went wrong), ethics (making WD pay for the replacement), or fear about the future (doing better backups from now on).

Here is a question: How did the drive fail? Did it start reporting an excessive number of IO errors? Did internal automatic error recovery slow the drive down so much that it became unusable? Did SMART report too many errors and impending drive failure? Is the drive no longer spinning up, and just making clicking noises? In that case, have you verified (with a voltmeter) that the supply power makes it all the way to the electronics (use the test points on the PC board)? Or did the electronics fail, and (as seen from the host) the drive simply vanished? By the way, all this is just to satisfy curiosity; it probably won't make the drive come back.
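If the drive is still visible to the host at all, most of those questions can be answered with smartmontools; a rough sketch, assuming the sysutils/smartmontools package is installed and the failing disk shows up as /dev/ada1 (the device name is just an example):

  # overall health verdict, error log, and the SMART attribute table
  smartctl -a /dev/ada1

  # kick off the drive's built-in long self-test (runs in the background)
  smartctl -t long /dev/ada1

  # check the self-test result once it has had time to finish
  smartctl -l selftest /dev/ada1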

mururoa said:
Is ZFS known to be very hard on HDDs, since the filesystem itself is very efficient, or is it just a coincidence?

To begin with, the failure rates of drives (even of WD drives, even the infant mortality of WD drives) are small: a few percent, whether as infant mortality or per year of operation. Furthermore, while workload patterns have an effect on drive lifetime, that effect is small; environmental effects (in particular vibration and temperature) are bigger, and even they don't turn a basically reliable drive into a massively unreliable one (it is impossible to deliberately break a drive by sending it I/Os). So what you experienced is just a coincidence.

The reported numbers for large WD drives (3.5" multi-TB drives) have little or nothing to do with the reliability of 2.5" laptop-class drives, which are engineered quite differently. And please observe that even in the oft-quoted recent Backblaze study, their WD drives do much better than their Seagates, over the long run (longer than a few months). I think you just got unlucky.

 
No, ZFS does not tax the drives any more than any other file system. In fact, I would argue the opposite: ZFS uses aggressive caching (the ARC) that reduces read counts over time compared to other file systems.
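If you want to see how much of the read traffic the ARC is absorbing, FreeBSD exposes its counters through sysctl; a quick sketch (the kstat names below are what recent releases use and may differ slightly on yours):

  # ARC size and hit/miss counters
  sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses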
 
I was going to say that earlier, but then thought about zpool scrub. Worst case (with full drives), that is probably not much worse than a SMART long test, but many people don't run those very often. zpool scrub is supposed to be run monthly or quarterly; I can't recall exactly which.
 
I run zpool scrub every night. This is a (very small and power-efficient) server system; a battery-powered laptop would be different. According to ZFS lore, scrub has very little performance impact, because it is good about getting out of the way when real workload shows up. On my system, a full scrub takes less than two hours, and my server is fundamentally idle in the middle of the night. I run backup every hour, and it takes about 90 seconds (if no files have changed). I've been thinking of running scrub more frequently, maybe 2 or 3 times a day.
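If anyone wants to automate something similar, a minimal sketch, assuming root's crontab and a pool named tank (recent FreeBSD also ships a periodic(8) script for this, enabled via daily_scrub_zfs_enable in periodic.conf, if you prefer that route):

  # root's crontab: scrub the pool every night at 02:00
  0 2 * * * /sbin/zpool scrub tank

  # afterwards, see how long the scrub took and whether it found anything
  zpool status tank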

Here's why: On modern disks, the probability of hitting a media error (an unreadable sector) while reading the whole drive is no longer negligible. If you take a reasonable estimate of the bit error rate (say 10^-15 per bit) times 4 TBytes times 8 bits per byte, the probability is currently about 3.2%. And real-world bit error rates seem to be about 10x larger than what the drive vendors specify, which would put it at roughly 30%.

Imagine that you have a mirror pair (RAID-1, two disks mirroring exactly the same content). Now disk A fails, you put a spare disk in, and the content of A is recreated (resilvered) from disk B. If during that operation you hit a single sector read error on disk B, you have just lost data (admittedly only a tiny amount, but bad enough). The way to prevent this is to scrub the disks regularly: find the sectors with uncorrectable read errors *before* one of the two disks fails, because as long as the second disk is operational, a single read error does not cause data loss. The more often you look for them, the lower the probability that you encounter one for the first time when you no longer have the second disk.
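To make the 3.2% and ~30% figures above explicit, here is the back-of-the-envelope arithmetic as a sketch; the 10^-15 vendor spec and the 10x-worse real-world rate are the assumptions, and the linear n*p approximation slightly overstates the larger case:

  # P(at least one unreadable bit) = 1 - (1-p)^n, roughly n*p for small n*p
  # n = 4 TBytes * 8 bits/byte = 3.2e13 bits
  awk 'BEGIN {
      n = 4e12 * 8;
      printf "vendor spec (p=1e-15): %.1f%%\n", n * 1e-15 * 100;  # ~3.2%
      printf "10x worse   (p=1e-14): %.1f%%\n", n * 1e-14 * 100;  # ~32%
  }'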
 
But zpool scrub causes disk activity that other file systems do not have except when doing a repair (fsck). The concern is whether all that extra head movement, done routinely, causes a significant amount of wear.

I'd guess it does not, at least not in the first month of use. Whether it does over the full warranty life of the drive is a different question.
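If you want numbers instead of a guess, one option is to record the drive's own wear-related SMART counters now and compare them after a month of nightly scrubs; a sketch (attribute names vary by vendor, and /dev/ada1 is again just an example):

  # power-on hours, head load/unload cycles, and reallocated sectors are the usual wear indicators
  smartctl -A /dev/ada1 | grep -E 'Power_On_Hours|Load_Cycle_Count|Reallocated_Sector'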
 