Statistical expectation of surviving drive loss vs zpool config

I have been wondering how many drives you can lose in a zpool and still have your data survive, given various vdev configurations. After many hours of searching I haven't found a satisfying statistical analysis from that perspective; probably I just don't know where to look. So I made some basic outcome calculations, starting from the simplest cases and assuming each drive loss is random over the zpool:

a) 4 drives in a RAID 10 (2 mirror vdevs)
probability of surviving the loss of the first drive: 1
probability of surviving the loss of the second drive: 2/3
Expected number of drive losses survived: 1 + 2/3 = 1.67 drives
Obviously a raidz2 will survive the loss of any 2 drives, so a raidz2 of 4 drives is a bit safer than a RAID 10.
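
For a case this small you can brute-force the expectation by enumerating every possible failure order; a quick Python sketch (using the drive labels 1a/1b/2a/2b for the two mirrors):

```python
from itertools import permutations

# two 2-way mirror vdevs: each drive tagged with its mirror number
mirror = {"1a": 1, "1b": 1, "2a": 2, "2b": 2}

total_survived = 0
orders = list(permutations(mirror))       # all 24 failure orders
for order in orders:
    lost, survived = set(), 0
    for drive in order:
        lost.add(drive)
        # the pool dies once some mirror has lost both of its drives
        if any(sum(mirror[d] == v for d in lost) == 2 for v in (1, 2)):
            break
        survived += 1
    total_survived += survived
print(total_survived / len(orders))       # 5/3 = 1.666...
```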

b) 2 raidz2 of 6 drives (12 drives total)
probability of surviving the loss of the first drive: 1
probability of surviving the loss of 1 more drive: 1
probability of surviving the loss of 1 more drive: 6/8
(50% chance of a drive being lost to either vdev, 3 tries (2^3 = 8), you lose the zpool only if 3 drives all fail in the same vdev so there are only 2 ways of losing it)
probability of surviving the loss of 1 more drive: 6/8 * 6/16
(there are only 6 ways of surviving the loss of that last drive, out of 16 ways the failures can happen)
Expected number of drive losses survived: 3.03 drives

c) RAID 10 (6 mirror vdevs, 12 drives total)
probability of surviving the loss of the first drive: 1
probability of surviving the loss of 1 more drive: 10/11
probability of surviving the loss of 1 more drive: 10/11 * 8/10
probability of surviving the loss of 1 more drive: 10/11 * 8/10 * 6/9
probability of surviving the loss of 1 more drive: 10/11 * 8/10 * 6/9 * 4/8
probability of surviving the loss of 1 more drive: 10/11 * 8/10 * 6/9 * 4/8 * 2/7
Expected number of drive losses survived: 3.43 drives
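
As a sanity check on that sum, the fractions can be accumulated exactly with Python's Fraction type; a short sketch:

```python
from fractions import Fraction as F

# conditional survival probabilities for the 2nd through 6th loss
steps = [F(10, 11), F(8, 10), F(6, 9), F(4, 8), F(2, 7)]
expected = F(1)        # the first loss is always survived
p = F(1)
for s in steps:
    p *= s             # probability of surviving this many losses
    expected += p
print(expected, float(expected))   # 793/231, about 3.43
```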

d) 3 raidz2 of 4 drives each (12 drives total)
probability of surviving the loss of the first drive: 1
probability of surviving the loss of 1 more drive: 1
probability of surviving the loss of 1 more drive: 24/27
probability of surviving the loss of 1 more drive: 24/27 * 54/81
probability of surviving the loss of 1 more drive: 24/27 * 54/81 * 90/243
probability of surviving the loss of 1 more drive: 24/27 * 54/81 * 90/243 * 120/729
Expected number of drive losses survived: 3.74 drives


e) 2 raidz3 of 6 drives each (12 drives total)
probability of surviving the loss of the first drive: 1
probability of surviving the loss of 1 more drive: 1
probability of surviving the loss of 1 more drive: 1
probability of surviving the loss of 1 more drive: 14/16
(50% chance of a drive being lost to either vdev, 4 tries (2^4 = 16), you lose the zpool only if 4 drives all fail in the same vdev so there are only 2 ways of losing it)
probability of surviving the loss of 1 more drive: 14/16 * 22/32
probability of surviving the loss of 1 more drive: 14/16 * 22/32 * 20/64
Expected number of drive losses survived: 4.66 drives

I'm a bit surprised that with 12 drives, 6 mirror vdevs (RAID 10) should be a bit safer than a zpool consisting of 2 raidz2 vdevs. I wonder if my calculations are sound...
 
Ah, silly me -- no, they are not sound, because there isn't an even 50-50 chance of a loss landing in either vdev (or 33% with 3 vdevs): the odds change as losses happen. Has anyone seen a correct statistical analysis along those lines?
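
In the meantime, here is one way to do the counting without the even-split mistake: the first k failures in a random order are just a uniformly random k-subset of the drives, so the expected number of losses survived is the sum over k of the fraction of k-subsets that leave every vdev within its parity. A brute-force Python sketch (the (width, parity) encoding of vdevs is mine):

```python
from itertools import combinations
from math import comb

def expected_losses_survived(vdevs):
    """vdevs: list of (width, parity) pairs; a vdev dies when it loses
    more than `parity` drives, and any dead vdev kills the pool."""
    tags = [v for v, (width, _) in enumerate(vdevs) for _ in range(width)]
    n = len(tags)
    expectation = 0.0
    for k in range(1, n + 1):
        safe = 0
        for lost in combinations(range(n), k):
            counts = [0] * len(vdevs)
            for i in lost:
                counts[tags[i]] += 1
            if all(c <= vdevs[v][1] for v, c in enumerate(counts)):
                safe += 1
        expectation += safe / comb(n, k)   # P(pool alive after k losses)
    return expectation

print(expected_losses_survived([(2, 1)] * 6))   # 6 mirrors (RAID 10)
print(expected_losses_survived([(6, 2)] * 2))   # 2 x raidz2 of 6
print(expected_losses_survived([(4, 2)] * 3))   # 3 x raidz2 of 4
print(expected_losses_survived([(6, 3)] * 2))   # 2 x raidz3 of 6
```

If I've coded this right, it gives about 3.43 for the 6-mirror RAID 10, 3.27 for 2 raidz2 of 6, 4.52 for 3 raidz2 of 4, and 5.13 for 2 raidz3 of 6 -- so the mirrors really would edge out the 2 x raidz2 layout by this metric.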
 
pmeunier said:
After many hours of searching I haven't found a satisfying statistical analysis from that perspective; probably I just don't know where to look.

If you want to do this right, it is EXTREMELY complicated. Your analysis (simply counting whole drive failures) barely scratches the surface. The real issues are the following. First, in a real production system, there will be spare drives. They may be hot spares (an empty drive, already plugged in; as soon as the first drive fails, the RAID system starts copying data to the spare), cold spares (not powered up, perhaps sitting in a spare parts cabinet, or even still in Amazon's warehouse), or even distributed spares (you have N drives but only use N-1 disks worth of capacity, and the RAID system knows to spread the spare space evenly over all drives and rebuild onto it). Different RAID implementations offer different versions of this, with varying levels of automation.

Once you start copying your data from the (degraded) RAID system onto spare space, what really matters is the probability that the second disk (or 3rd or 4th disk) fails *** while you are copying the data over ***. This MTTR (mean time to repair) is one of the most important determining factors in calculating the MTDL (mean time to data loss). This is why RAID implementations play very interesting games to optimize MTTR.
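
To make that concrete: the classic back-of-the-envelope for an N-drive single-parity array is MTTDL ~ MTBF^2 / (N * (N-1) * MTTR), because a second failure has to land inside the repair window. A sketch with illustrative numbers (not from any particular datasheet):

```python
# back-of-the-envelope MTTDL for an N-drive single-parity array:
# a first failure occurs, then a second failure must land inside
# the repair window for data to be lost
mtbf_hours = 1_000_000    # illustrative per-drive MTBF
mttr_hours = 24           # illustrative rebuild time
n_drives = 6

mttdl = mtbf_hours ** 2 / (n_drives * (n_drives - 1) * mttr_hours)
print(f"MTTDL ~ {mttdl / 8766:,.0f} years")   # ~8766 hours per year

# note: halving MTTR doubles MTTDL, which is exactly why RAID
# implementations play those games to optimize rebuild time
```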

Second, with modern drives, the probability of a read failure when reading a whole drive is significant. For example, many drives are specified to have a data loss rate of 1 part in 10^14 bits. A 4 TB drive has 3.2 x 10^13 bits, so the probability of getting a single read error when reading all 4 TB is about one third. So when you lose your first disk, there is a ~30% chance that the attempt to copy the surviving data onto the spare disk will hit at least one read error. Unfortunately, one broken bit can spoil a whole file system. One of my colleagues calls this the "sewage in wine principle": one thimble full of wine in a barrel full of sewage makes sewage. One thimble full of sewage in a barrel full of wine also makes sewage. One dead bit in a 4 TB file system makes data loss.
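
The arithmetic behind that estimate, sketched out (10^-14 per bit is the commonly quoted consumer-drive spec):

```python
import math

ber = 1e-14          # specified unrecoverable read error rate, per bit
bits = 4e12 * 8      # 4 TB = 3.2e13 bits

# naive expected number of bad bits in one full read of the drive
print(bits * ber)                     # 0.32

# probability of at least one read error, treating bits as independent
print(1 - math.exp(-bits * ber))      # ~0.27, i.e. roughly 30%
```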

Now, your simple combinatorial analysis is still valuable for comparing different RAID implementations, *** under the assumption that everything else is the same *** (remember what the first three letters of assumption spell). Unfortunately, these simple combinatorial calculations are also terribly difficult to follow, and one can easily confuse oneself (and even worse, believe the results). When doing this professionally, the standard that should be followed is this: one person does an analytical calculation of the data loss rate, using combinatorial analysis the way you describe it. A second person implements a Markov model. A third person implements an event-based simulator. If all three results agree, there is some likelihood that they vaguely describe reality.
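
To give a flavor of the Markov-model leg of that exercise, here is the absorbing chain for a single mirror pair, solved for mean time to data loss (a sketch; the failure and repair rates are illustrative):

```python
import numpy as np

lam = 1e-5    # per-drive failure rate = 1/MTBF (illustrative)
mu = 1 / 24   # repair rate = 1/MTTR (illustrative)

# transient states: 0 = both drives up, 1 = one drive up (rebuilding);
# data loss is the absorbing state. Q is the transient rate matrix.
Q = np.array([[-2 * lam, 2 * lam],
              [mu, -(mu + lam)]])

# mean time to absorption from each transient state: t = (-Q)^-1 * 1
t = np.linalg.solve(-Q, np.ones(2))
print(t[0])                            # MTTDL starting from "both up"
print((3 * lam + mu) / (2 * lam**2))   # closed form, same number
```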

There is an enormous amount of literature on this topic; unfortunately, it tends to be rather advanced, and discusses corner cases that users of simple RAID arrays would find esoteric. If you want, I can type in the names of a few papers that you can use for amusement. The "original" reference is the appendix in Garth Gibson's original RAID paper (look for Gibson, Katz and Patterson as authors). While Garth's calculation is technically not "wrong", that calculation is used today to scare grad students, and to show them that simple sloppy thinking leads to ridiculous results.
 
General comment: In the following, we are ONLY interested in the rate of ANY loss. We're using the "sewage in wine principle": if one bit of your file system goes down, you are dead. In practice, this may be wrong. For example, taking your 12 disks and segregating them into 6 RAID1 mirror pairs, each with a different file system and different users, has a great advantage: on a double fault, only one of the file systems is damaged; the other 5 are fine. For example, imagine you are setting up a home server, and each member of the family gets their own file system with their own mirror pair of disks. If you get lucky, the double fault will wipe out your mother-in-law's file system. Many married people would consider that to be a feature, not a bug.

pmeunier said:
a) 4 drives in a RAID 10 (2 mirror vdevs)
probability of surviving the loss of the first drive: 1
Without loss of generality, let's number the drives 1a, 1b, 2a, and 2b. And let's assume the first drive to be lost is 1a.

probability of surviving the loss of the second drive: 2/3
Correct. There are three drives left: if you lose 1b, you're dead; if you lose either 2a or 2b, you're good.

Expected number of drive losses survived: 1 + 2/3 = 1.67 drives
I've never seen people express it this way. They either express it as a probability (66% after the loss of two drives), or as an MTDL. Actually, I usually find it easier to calculate the probability of data loss rather than the probability of survival; the numbers tend to be smaller, and expressions like n/N * (n-1)/(N-1) tend to be easier to handle than (1-n/N) * (1-(n-1)/(N-1)).

Obviously a raidz2 will survive the loss of any 2 drives, so a raidz2 of 4 drives is a bit safer than a RAID 10.
Absolutely. And it even has the same capacity. So in theory, it looks like a 4-drive RAIDZ2 is much better than a 4-drive RAID10. *** BUT ***: You pay for that reliability in performance. Both for reads (half the data in the RAIDZ is parity instead of real data, so you can only use half the drives for reads, costing you parallelism), and for writes (small write penalty, 'nuff said).

b) 2 raidz2 of 6 drives (12 drives total)
probability of surviving the loss of the first drive: 1
probability of surviving the loss of 1 more drive: 1
True. But if you think this way, you just confused the heck out of yourself. Because there are two radically different cases you need to distinguish: either the second drive lost was another drive in the first vdev (that RAIDZ2 now has a double fault), or it was a drive in the second vdev (you have two RAIDZ2s, each with a single fault). The probabilities of these two cases differ: the first has probability 5/11, the second 6/11.

probability of surviving the loss of 1 more drive: 6/8
(50% chance of a drive being lost to either vdev, 3 tries (2^3 = 8), you lose the zpool only if 3 drives all fail in the same vdev so there are only 2 ways of losing it)
No. Condition on where the second failure landed: with probability 5/11 it was in the same vdev (a third loss in that vdev, 4 of the 10 remaining drives, kills you, so you survive with probability 6/10), and with probability 6/11 it was in the other vdev (a third loss can never kill you). That gives 5/11 * 6/10 + 6/11 * 1 = 9/11, not 6/8. But it's late in the evening, so double-check my arithmetic.
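
A combinatorial cross-check: the pool survives three losses exactly when the three failed drives are not all in one vdev, and the first three failures in a random order are just a uniform random 3-subset of the 12 drives:

```python
from math import comb

# 12 drives, two raidz2 vdevs of 6; the pool dies on the third loss
# only if all 3 lost drives are in the same vdev
p_dead = 2 * comb(6, 3) / comb(12, 3)   # 40/220 = 2/11
print(1 - p_dead)                        # 9/11 = 0.8181...
```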

c) RAID 10 (6 mirror vdevs, 12 drives total)
probability of surviving the loss of the first drive: 1
probability of surviving the loss of 1 more drive: 10/11
probability of surviving the loss of 1 more drive: 10/11 * 8/10
probability of surviving the loss of 1 more drive: 10/11 * 8/10 * 6/9
probability of surviving the loss of 1 more drive: 10/11 * 8/10 * 6/9 * 4/8
probability of surviving the loss of 1 more drive: 10/11 * 8/10 * 6/9 * 4/8 * 2/7
Expected number of drive losses survived: 3.43 drives
I think you're on the right track. I would write out the drive numbers and see which drives you can renumber without loss of generality, but your fractions look right. I didn't check how you got to 3.43.

The rest of your calculations are too hard for tonight.

I'm a bit surprised that with 12 drives, 6 mirror vdevs (RAID 10) should be a bit safer than a zpool consisting of 2 raidz2 vdevs. I wonder if my calculations are sound...
I'm not surprised. Fundamentally, by putting the drives into pairs, you are doing a very crude declustering of your RAID, and there are lots of papers in the literature that show that declustering is good.
 