Other: The Backblaze case for buying cheap disks

In the latest Backblaze Drive Stats (Q3 2022), the costs and benefits of disk failure rates are examined in relation to purchase cost.

It's interesting to see that sometimes the cost of ownership equation benefits from buying less reliable drives.

[Of course, you need a large operation to experience statistically average outcomes and actually see these benefits.]
 
I love the fact that Backblaze publishes these statistics. Of enterprise drives (which these are), a very large fraction (probably way over 90%) is used by a small handful of companies: a combination of the traditional storage array makers (EMC/Dell, Hitachi, IBM, Sun/Oracle, ...) and hyperscalers (FAANG and friends). Those all keep accurate statistics on failure rates, but never publish them. Backblaze are the only ones that make such a data set available (with all its warts). This is a public service.
 
Of enterprise drives (which these are),
Not all of them. There are some Seagate "DM" drives among them, which should be the usual Barracuda stuff.

The calculation is indeed nice; it shows how to think like a businessman. And while the effective savings will only appear at high enough volumes, there are also some lessons one can take from it. With insurance, the rule was always that it only makes sense for big and financially dangerous risks - small risks that naturally happen now and again and could be managed on one's own do not make sense to insure, because the only effect in the long run is that you pay a premium to keep the insurance company in business.

Here with hard disks, the rule for the small-scale user appears to be: you should have proper failure protection, because if you do not, you are always at risk no matter which drive you buy. But if you have a solid failure protection scheme, then you can run any drives you like - and the cheapest option would probably be used drives. I might assume that a drive that has run for five years is less likely to fail in the next five than any new drive is in its first five years.
Side note: I bought a new drive last month, and when I got it, I found "DOM:28MAR2017" on the label.
 
There is a giant part that's missing in Backblaze's calculation, which is the fact that in enterprise applications, data will be stored redundantly. And the amount of redundancy can be adjusted. The classic example of this is RAID: You take your data, split it across 5 disks, and write one parity to the 6th disk. Given fixed capacity, the disk space overhead is 20%, but you are protected against failure of any one disk. It turns out with modern disks (which are extremely large), that's no longer sufficient, as there is a non-zero probability of getting a second fault while the first disk is being rebuilt onto a spare. So what you do is you go to 5 + 2 disks, which is pretty reliable, but has a 40% overhead. For home users and amateurs, that's pretty much the end of the road: it's impractical to have many more disks. Clearly, RAID also has some performance overhead, but with a variety of modern techniques, we've learned how to abate that.
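To make the overhead arithmetic concrete, here is a minimal sketch (plain Python; the layouts are just data + parity disk counts, with overhead expressed as parity capacity relative to usable capacity, matching the 20% and 40% figures above):

```python
# Space overhead of a data+parity group, expressed relative to usable
# capacity: 5+1 -> 20%, 5+2 -> 40%, 100+12 -> 12%.
def overhead(data_disks: int, parity_disks: int) -> float:
    return parity_disks / data_disks

for d, p in [(5, 1), (5, 2), (100, 12)]:
    print(f"{d}+{p}: {overhead(d, p):.0%} overhead")
```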

But for an enterprise application, there are ways to do this much more efficiently. It turns out that if you use much larger groups of disks, you can get pretty good reliability at much lower overhead. For example, if you split your data across 100 disks, you will definitely need more than 1 redundancy disk, but you probably won't need 20 of them (for a 20% overhead, similar to the 5+1 the amateur had), and definitely not 40. For fun, let's say you get your desired redundancy with a dozen redundant disks, so you are using 100+12 with a 12% overhead. That works because the probability of more than 12 disk failures (out of 112 total) is about as small as the probability of more than 2 disk failures (out of 7). Great.
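To compare the layouts more quantitatively, a rough independent-failure model works like this: assume each disk fails with probability p over the window you care about, and data is lost only when more disks fail than there are parity disks. This is only a sketch under that (strong) independence assumption, and the 5% per-disk probability is an illustrative number, not a Backblaze figure:

```python
from math import comb

def p_data_loss(data: int, parity: int, p: float) -> float:
    """Probability that more than `parity` of the data+parity disks fail,
    assuming independent failures with per-disk probability p."""
    n = data + parity
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(parity + 1, n + 1))

p = 0.05  # illustrative per-disk failure probability over the window of interest
for d, m in [(5, 1), (5, 2), (100, 12)]:
    print(f"{d}+{m}: overhead {m/d:.0%}, P(data loss) ~ {p_data_loss(d, m, p):.2e}")
```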

The question that Backblaze did not answer (and which is heinously complicated): what is the tradeoff between reliability and cost once you are using RAID? Say, for example, a company that uses lots of disks (millions of them, all arranged in those crazy 100+12 redundancy groups I used as an example above) gets a deal: the next million disks will be 10% cheaper, but they will also be 5% less reliable. Or perhaps they will be 15% less reliable. The 10% cheaper means that for the same cost, they can arrange the RAID to be 100+23, with much more redundancy for data protection. But will that be more or less reliable than the previous disks (more expensive, but each individually better)? Tough question. Lots of graduate students have written their PhD theses on related topics, and lots of engineers and researchers study this topic every day (well, not today, it's the weekend). For the average home user, it's not relevant, since they (a) won't be using 100+ disks, and (b) don't have the means to calculate how reliable they want their system to be.
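That tradeoff can at least be framed with the same toy model as above: fix the budget, let the 10% cheaper disks buy a 100+23 group instead of 100+12, apply the 5% or 15% reliability penalty, and compare. The numbers are made up, and the real answer depends on rebuild behaviour and correlated failures, but it shows the shape of the calculation:

```python
from math import comb

def p_data_loss(data: int, parity: int, p: float) -> float:
    # Same independent-failure model as the sketch above.
    n = data + parity
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(parity + 1, n + 1))

p_base = 0.05  # illustrative per-disk failure probability for the pricier disks
baseline = p_data_loss(100, 12, p_base)
for penalty in (1.05, 1.15):  # 5% and 15% worse failure rate
    cheap = p_data_loss(100, 23, p_base * penalty)
    verdict = "better" if cheap < baseline else "worse"
    print(f"cheap disks at {penalty:.2f}x the failure rate: "
          f"100+23 is {verdict} ({cheap:.2e} vs {baseline:.2e})")
```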

What is feasible for the home user is to get very, very cheap disks (for example 5-year-old used ones, somewhat similar to the "DOM 2017" drive you found), and then compensate for their lousy life expectancy by being extremely redundant (for example, run 4-way mirroring, which is easily possible using ZFS). My intuition is that this is nearly always a bad idea: while theoretically it could work, what people ignore are the side effects of failures (lots of hassle, and the chance of operator errors and bugs), and the risk of correlated failures.
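As a toy illustration of why a deep mirror of tired disks can look acceptable on paper: an n-way mirror only loses data if every copy fails, so under an independence assumption even a high per-disk failure probability gets beaten down quickly. The per-disk numbers below are invented, and the whole point of the caveat above is that correlated failures and operator error are not in this model:

```python
# Toy model: an n-way mirror loses data only if all copies fail.
# Independence is assumed, which is exactly what the caveat above warns about.
p_old = 0.15   # invented per-year failure probability for a worn used disk
p_new = 0.03   # invented per-year failure probability for a new disk

print("2-way mirror of new disks:", p_new ** 2)   # ~9.0e-4
print("4-way mirror of old disks:", p_old ** 4)   # ~5.1e-4
```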
 
the risk of correlated failures.
In both corporate and private circumstances I have often seen spindles fail in clusters. I think it's the biggest outage risk for my ZFS server (I actively manage all the risks, but recovering 12TB from backup would require a monumental outage).

I think that using 5 year old enterprise disks in a deep mirror at home is technically fine -- just acquire and insert one brand new spindle into each mirror. However, the power bill should be considered...
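On the power bill, the arithmetic is quick to sketch; both the drive wattage and the electricity price below are assumptions, so substitute your own:

```python
# Rough annual cost of keeping one extra spindle spinning.
watts_per_drive = 8.0    # assumed average draw for a 3.5" drive
price_per_kwh = 0.30     # assumed electricity tariff

hours_per_year = 24 * 365
cost = watts_per_drive / 1000 * hours_per_year * price_per_kwh
print(f"~{cost:.2f} per drive per year at these numbers")  # ~21.02
```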

I sort-of practise what I preach with very old WD Reds mirrored to newish Seagate EXOS enterprise drives. The performance is badly mismatched, but what matters is getting the best redundancy (until the Reds are rotated out, and then there'll be a significant age difference between identical disks in each mirror).
 
In both corporate and private circumstances I have often seen spindles fail in clusters. I think it's the biggest outage risk for my ZFS server ...
Sadly true. I've been there too: I used to have two Seagate Barracuda disks as a mirror pair in my server, and they died within a year of each other. For me it didn't create a problem, since I re-created the mirror using Hitachi nearline disks (ending up with a mismatched pair). The funny thing is that a colleague at work had a 5-disk RAID array with the same model Barracuda disks, and he also lost several of them. One failed even early enough that the 3-year warranty worked (he got a free new disk, which then promptly died a few years later).

This really points out some of the issues that small RAID systems (home users, small businesses) have. When a disk fails (or a disk error occurs that makes part of a disk unreadable), the key issue is how fast you repair it, because the risk of data loss comes from a second disk failing while the first disk is down (or, more accurately, from more read errors while there is no redundancy due to earlier errors). For a small system, that usually means: diagnose the problem (which in and of itself can take days), order a spare disk, wait for it to be shipped, install it, start the rebuild. A process that typically takes a week round-trip, once everything is added up. Now compare that to a large enterprise- or cloud-scale system: the disk error is diagnosed within minutes, there are spare disks already online (more accurately, there is a little bit of spare space reserved on other disks), and rebuild begins immediately on the most critical parts of the data (the ones that have no redundancy left). In a well-built system that uses many disks, this can get done end-to-end in 15 minutes or half an hour. Which means the system is left in an unprotected state for about two orders of magnitude less time than a home system, and that means about two orders of magnitude better reliability.
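To put a rough number on that: the exposure to a second failure scales approximately linearly with how long the group runs without redundancy, so a one-week manual replacement versus a half-hour automated rebuild is a factor of a few hundred in the window alone. A back-of-envelope sketch, with the annual failure rate being an assumed figure:

```python
# Approximate probability that one of the surviving disks fails while
# the group has no redundancy: survivors * AFR * (window / one year).
afr = 0.02            # assumed annual failure rate per disk
survivors = 5         # e.g. the rest of a small 6-disk group

def p_second_failure(window_hours: float) -> float:
    return survivors * afr * window_hours / (24 * 365)

week = 7 * 24
print("one-week manual repair  :", p_second_failure(week))  # ~1.9e-3
print("30-minute online rebuild:", p_second_failure(0.5))   # ~5.7e-6
print("ratio                   :", p_second_failure(week) / p_second_failure(0.5))  # 336
```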

The second point is correlation between failures. For a small system with about a half dozen disks (say 5 data disks and 2 redundancy disks, a sensible layout), the probability of a single disk failure is small (many percent per year), a double disk failure is very small (a fraction of a percent per year), and data loss due to exceeding the redundancy (a triple fault) is so small as to be de facto incalculable. Now try to re-calculate these probabilities with correlations between failures: with only 1 or 2 failures to work with, you will get massive uncertainty in the numbers, so you can't really plan for correlations other than "make it more redundant than you think".
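One way to see why correlation wrecks the neat numbers is a small Monte Carlo: add a shared "bad batch" factor that occasionally raises every disk's failure probability at once, and compare the triple-fault rate of a 5+2 group against the purely independent case. Every parameter here is invented for illustration:

```python
import random

def failures_in_year(n_disks: int, p_base: float,
                     p_bad_batch: float, multiplier: float) -> int:
    """Simulate one year; with probability p_bad_batch the whole group
    shares an elevated failure probability (a crude correlated-failure model)."""
    p = p_base * multiplier if random.random() < p_bad_batch else p_base
    return sum(random.random() < p for _ in range(n_disks))

def p_triple_fault(p_bad_batch: float, trials: int = 200_000) -> float:
    # Data loss for a 5+2 group means more than 2 of its 7 disks fail.
    return sum(failures_in_year(7, 0.03, p_bad_batch, 10.0) > 2
               for _ in range(trials)) / trials

random.seed(1)
print("independent failures     :", p_triple_fault(0.0))   # on the order of 1e-3
print("10% chance of a bad batch:", p_triple_fault(0.1))   # roughly 40x larger
```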

With a large system (for example 500 disks, with the RAID layout being the 100+12 I talked about above), we have sufficient statistics to adjust the redundancy to correlated failures. And a company that uses lots of disks (millions of each model) has a large enough sample to actually measure failure rates and correlations between failures, and can adjust the redundancy to match.
 
What also happens when rebuilding is you add a bunch of load on the remaining disks, which are probably as elderly as the one that died. You're increasing the chances of a failure at the worst possible time. Not that I have a solution for this that won't cost more than I'm willing to spend.
 
The original data (Backblaze) plus these discussions often show there's a big difference between home/consumer and enterprise (I know, stating the obvious).
For home/consumer use I think a lot boils down to overall cost (cost of devices, cost of power, etc.). I happen to like mirrors. Yes, if you create a mirror from 2 devices from the same lot at the same time, they probably have similar failure characteristics, but wait about a year and then make it a three-way mirror. That should provide enough delta between devices to offset correlated failures.

That's just my opinion, not sure if it really makes a difference or not.
 
And: Keep good backups.

Old joke: There are two kinds of system administrators. 1: Those who religiously do backups. 2: Those who haven't lost data yet.
 
What also happens when rebuilding is you add a bunch of load on the remaining disks, which are probably as elderly as the one that died. You're increasing the chances of a failure at the worst possible time. Not that I have a solution for this that won't cost more than I'm willing to spend.
Proposal: per-usecase zpools.

It gets much easier to make an L2ARC actually useful, because data from one application will not be overwritten by another. It gets easier to lay out and position the pool on appropriate disks with appropriate redundancy (SSD or mechanical, or part one and part the other). And it allows you to use disks of any or mixed sizes. It may even allow you to utilize hardware that would otherwise be obsolete (and to power it down when not in use).

A logical volume manager like ZFS does a good job of putting all the filesystems into one transparent container. But then we don't need to go from the old extreme (each filesystem individually handled) to the other extreme (the whole environment in one big container that you can neither disassemble nor repair).
 
I actively manage all the risks
Here's how I manage the risks:
  1. Manage the environment. Get a well engineered case with good airflow over all disks, and monitor the temperatures.
  2. Avoid clustered failures. Don't mirror or stripe identical disks all of the same age. This may be difficult, but if you carry spares, you can run a disk for a year and swap it for one from the spares pool, creating an age differential in the active set. At the extreme, just purchase similar-spec drives from different vendors.
  3. Use labels. Attach both physical and electronic labels that identify the exact physical location of every disk. This takes discipline and forethought, but pays big dividends when a disk fails. My labels incorporate the disk serial number, the name of the cage, and the position in the cage. If a disk is really dead, you will sometimes have to identify all the others to figure out which one is missing.
  4. Use hot swap bays. Get as much hot swap as you can afford, but have at least two easily accessible hot swap slots in your server (I have a 3 x 3.5" spindle cage that fits into a position normally occupied by two 5.25" CD drives).
  5. Use a troubleshooting guide. When you get a disk problem, follow a good troubleshooting guide, like the one published by TrueNAS. My last two "disk problems" were rectified by re-seating the SATA cable.
  6. Re-silver before swap-out. Don't pull a dead disk before you re-silver the new one. This is the first reason why you need hot-swap bays. If you follow this rule, you won't kill the server if you mis-identify the "dead" disk and pull the wrong drive. You can also re-silver as soon as you become aware of a problem, and shut down to move disks at your leisure.
  7. Keep spare disks. Have spare disks available at short notice. It takes a week for deliveries to reach my farm from the big cities. I currently have four 4TB drives in a box (not overkill in my case, because I have 3 x 3TB WD Reds that are nearly 10 years old, and I worry that they will fail in a cluster).
  8. Take regular backups. Rotate them off-site. I bit the bullet and purchased two 12TB disks just for backups. Yes, they were not cheap, but the peace of mind is palpable, not to mention the ease of doing a 100% backup. This is the second reason hot-swap bays are handy (though USB3 might also be a sensible option).
  9. Test recovery. At least once a year, make sure you can recover. Both of my 12TB backup disks have to come back on-site and into the hot-swap bays for the annual test.
  10. Power quality. I lose power regularly, sometimes half a dozen times a day. I really have to have a UPS to protect the equipment.
Several list members made comments that helped me to my current disposition with the ZFS server. The ones I remember are SirDice (hot swap bays), diizzy (Fractal Design cases), and VladiBG (large backup disks).
 