vanessa said:
I also thought this way until I read [link=http://blog.backblaze.com/2013/12/04/enterprise-drive-reliability/]this blog post here[/link]. These are real life stats!
Well, they are stats. Are they "real life", or some toy imitation? A few caveats:
- Backblaze's large system has 25K drives. That's roughly the number of drives installed in a single compute cluster at large installations (weather forecasting agencies, supercomputer centers, corporate data centers, and agencies that officially don't exist). There are thousands of such large compute clusters. Backblaze is nice enough to publish their statistics, and they deserve kudos for that. But in terms of overall drive count, they are a tiny backwater.
- The really large data centers in the world are run by the likes of Google and Facebook. Google has actually published research papers on disk drive reliability, with a lot of detail (look at the proceedings of the FAST and MSS conferences). All Backblaze gives us is the overall failure rate, without any breakdown by who/what/when/how/...
- Backblaze's statements about enterprise-grade drives are particularly laughable. They are based on HUNDREDS of drives. That's what ships in a single EMC/Hitachi/IBM/HP/Dell/... storage server, and large customers have dozens or hundreds of these servers. Statistics based on such a small number of drives tell you very little; a rough back-of-the-envelope calculation after this list shows why.
- They compare apples and oranges. Why does that matter? Failure rates of drives depend crucially on vibration, temperature, workload, and power supply stability (pretty much in that order). We know that Backblaze's own storage enclosure (the 45-disk SATA enclosure they have engineered) is pretty good about vibration control, but far from perfect. I personally don't know how well they control temperature, or how good their power supplies are (other people may know). It's silly to compare drives in one environment (the Backblaze enclosure) with drives in a totally different environment (Dell or EMC enclosures), and then conclude that the differences are down to the drives themselves.
- When is a disk "dead"? That depends crucially on definition. We would all agree that a disk is dead if you can't read from it at all (either it refuses to spin, or you cannot electronically communicate with it). But how about a disk that is readable, but has lots of read errors? How about one that has exhausted its spare space for revectoring writes, and is readable but de facto not writeable? How about one that is readable, but is getting so many internally corrected errors that its performance suffers greatly? How about a disk that has reported impending failure via SMART or a similar mechanism, but is still functioning (at least for the moment)? Before you compare failure rates of disks, you have to specify what you mean by "failure". Depending on the definition, the answers can easily vary by a factor of 2 to 5.
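To put numbers on the sample-size point above: here is a back-of-the-envelope sketch in Python. The fleet size and failure rate are made up purely for illustration, but they show how noisy a failure rate measured on a few hundred drives is.
[code]
# Rough illustration: how noisy is an observed annual failure rate
# when it is estimated from only a few hundred drives?
# The drive count and failure rate below are made up for illustration.
import math

drives = 300          # hypothetical fleet size
true_afr = 0.04       # hypothetical "true" annual failure rate (4%)

expected_failures = drives * true_afr

# Normal approximation to the binomial: ~95% confidence interval
# on the failure rate you would actually observe in one year.
stddev = math.sqrt(true_afr * (1 - true_afr) / drives)
low, high = true_afr - 1.96 * stddev, true_afr + 1.96 * stddev

print(f"expected failures per year: {expected_failures:.0f}")
print(f"~95% CI on the observed rate: {low:.1%} .. {high:.1%}")
# With 300 drives this prints roughly 1.8% .. 6.2% -- wide enough that
# two identical drive models can look very different in a single year.
[/code]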
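And on the question of when a disk counts as "dead": if you want something stricter than "it won't spin up", SMART exposes several of the indicators mentioned above. A minimal sketch, assuming smartmontools is installed and using /dev/ada0 as an example device name (the attribute names are the common smartctl ones, but the set reported varies by drive model):
[code]
# Minimal sketch: pull a few SMART attributes that correspond to the
# "is this disk failing?" definitions discussed above.
# Assumes smartmontools is installed; the device name is an example.
import subprocess

DEVICE = "/dev/ada0"   # adjust for your system

# Attributes that commonly signal a drive on its way out.
WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
         "Offline_Uncorrectable", "Reported_Uncorrect")

out = subprocess.run(["smartctl", "-A", DEVICE],
                     capture_output=True, text=True, check=False).stdout

for line in out.splitlines():
    fields = line.split()
    # Attribute table rows start with the numeric attribute ID;
    # the raw value is the 10th column in the usual smartctl layout.
    if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCH:
        print(f"{fields[1]}: raw value {fields[9]}")

# Overall pass/fail verdict (the weakest possible definition of "dead"):
health = subprocess.run(["smartctl", "-H", DEVICE],
                        capture_output=True, text=True, check=False).stdout
print(health.strip().splitlines()[-1])
[/code]
A drive that still passes the overall health check but shows a growing raw count on Current_Pending_Sector is exactly the ambiguous case described above.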
I don't disagree with Backblaze's statistics, and their conclusion seems plausible to me and probably contains a large grain of truth. But it is not the be-all and end-all of disk reliability analysis.
Anyway, it would be nice to have a tutorial or a HOWTO focusing on FreeBSD and HDDs with all the best practices and tools available - starting from raw disk analysis and performance tuning.
Here are a few hints. Start by buying a disk. While there are lots of horror stories on the web (including in this thread), note that disks from all major vendors are used by all major system builders. For example, several posts above crucified WD disks. If they were right, why would the likes of IBM/EMC/Dell/... sell WD disks in their storage servers? If WD disks really were as bad as people make them out to be, why hasn't WD been killed by warranty return costs? My assertion is that while there are differences in quality and reliability between drive vendors, they are minor, and hard to detect without looking at large aggregate numbers of disks.
Remember that there are significant hardware differences between high-performance enterprise disks (10K RPM and up), near-line enterprise disks (slow-spinning), and consumer SATA disks. The cost difference between these drives is to some extent caused by the different engineering that goes into them, and the different bills of materials. There is a very good (but dated) paper by Riedel and Anderson on that topic. Anyone who thinks that consumer drives and enterprise drives will have the same performance and reliability is fooling themselves. Having said that, I only use consumer-grade drives at home, but then I don't store mission-critical data (baby pictures and scanned electrical bills don't qualify), and I'm incredibly good about backups.
Having bought a drive, the best thing you can do to make it last is to treat it well. Make sure it is NOT exposed to vibration. Put the fans in your computer case on vibration mounts, and mount the disk on vibration mounts too. If you have more than one disk in close proximity, make sure seeks on one drive don't shake its neighbor. Some cases (the more expensive ones) are designed with rubber grommets that the disks are mounted in. Make sure the sheet metal of the case doesn't turn into a musical instrument. Strips of foam rubber or cork, glued in the right places, can do wonders for suppressing vibration.
Next, please cool your drives. Disk drives like moderate temperatures; current wisdom seems to be that around 30 to 35 °C is good. If your computer runs really hot, add a case fan (I have one enclosure where hundreds of disks run in the 50s °C, and they're dying like flies). On the other hand, many drives don't like very cold temperatures either; there are horror stories of farms with tens of thousands of drives barely functioning because they were cooled to about 10 °C (Celsius, not Fahrenheit) and the drives had to recalibrate all the time. You can use SMART to read the drive temperature.
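If you want to keep an eye on temperatures, here is a minimal sketch along the same lines as above (again assuming smartmontools and an example device name; whether the drive reports Temperature_Celsius, Airflow_Temperature_Cel, or both depends on the model):
[code]
# Minimal sketch: read the drive temperature from SMART.
# Assumes smartmontools is installed; the device name is an example
# and the attribute names reported vary by drive model.
import subprocess

DEVICE = "/dev/ada0"   # adjust for your system

out = subprocess.run(["smartctl", "-A", DEVICE],
                     capture_output=True, text=True, check=False).stdout

for line in out.splitlines():
    fields = line.split()
    # The raw value (10th column) starts with the temperature in Celsius.
    if len(fields) >= 10 and fields[0].isdigit() and \
            fields[1] in ("Temperature_Celsius", "Airflow_Temperature_Cel"):
        print(f"{fields[1]}: {fields[9]} C")
[/code]
Running that from cron and logging the result is an easy way to spot a case fan that has quietly died.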
If you have a workload that is extremely seek-intensive (many random accesses), consider moving part of it to an SSD, or fighting the seeks by giving the system more cache RAM (but beware: SSDs are not a cure-all, and have their own reliability issues). While seeks in and of themselves are not harmful, every write after a long seek increases the risk of off-track writes and actuator miscalibration. With disks, the name of the game is avoiding risk.
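A rough way to tell whether a workload is seek-bound is the average transfer size per operation: lots of operations moving only a few KB each usually means lots of seeks. A sketch, assuming FreeBSD's iostat with the usual extended-statistics columns (device, r/s, w/s, kr/s, kw/s, ...) and ada0 as an example device; flags and column layout may need adjusting on your system:
[code]
# Rough sketch: estimate whether a disk workload is seek-heavy by
# looking at the average transfer size per operation from iostat.
# Assumes FreeBSD iostat with extended-stats columns:
#   device  r/s  w/s  kr/s  kw/s  ...
# Flags and columns may differ on your system; adjust as needed.
import subprocess

DEVICE = "ada0"        # adjust for your system

# Two reports, 5 seconds apart; the second reflects current activity.
out = subprocess.run(["iostat", "-x", "-w", "5", "-c", "2"],
                     capture_output=True, text=True, check=False).stdout

samples = [l for l in out.splitlines()
           if l.split() and l.split()[0] == DEVICE]
if not samples:
    raise SystemExit(f"no statistics found for {DEVICE}")

fields = samples[-1].split()
r_s, w_s, kr_s, kw_s = (float(x) for x in fields[1:5])

ops = r_s + w_s
kb = kr_s + kw_s
if ops > 0:
    print(f"{ops:.0f} ops/s, average {kb / ops:.1f} KB per operation")
    # Many operations at only a few KB each usually means a
    # seek-dominated workload: a candidate for an SSD or more cache RAM.
else:
    print("disk is idle")
[/code]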