ZFS Performance: RAIDz1 vs mirroring

JMOR · Dec 14, 2020

In my home PC, one of my two HDD that I have in (btrfs) RAID 0 failed. So, I am shopping for the replacement. I am planning to install FreeBSD (zfs) and set up RAIDZ (buying 3 SSDs).

Now, I am reading everywhere that mirroring 2 disks ~~(1 vdev per disk)~~ is faster than RAIDz1 with 3 disks. And it doesn't make any sense to me.

Yes, RAIDz1 has to compute the parity info for the third disk. Although being realistic, CPU times (for computing parity) are orders of magnitude faster than I/O times (for accessing disks). So, that advantage should be close to negligible.

But the much more impactful consideration is that, when writing to disk, RAIDz1 splits the load between two disks, while mirroring does not. In other words, if I write a 1 MiB file, RAIDz1 should write 512 KiB to one disk and the other 512 KiB to a second disk. And this should be done in (or close to) parallel, which should translate into (or close to) half the time it takes to write the whole 1 MiB to one disk. I mean, this is the whole schtick of RAID 0. Isn't RAID 5 (RAIDz1) equal to RAID 0 plus a third disk for parity?

I can see that, when READING from disk, 2 mirrored disks have the same performance as RAIDz1, because one could read one half from each disk in parallel (like RAIDz1). It is not efficient to read the whole thing from just one of the disks. But when WRITING, you have to write the whole thing to each of the mirror disks: the whole 1 MiB of the previous example on each disk. When the mirror disks are still writing .. 640 KiB .. 768 KiB .., the RAIDz1 should be more than done writing its 512 KiB.

Your help will be appreciated, it will influence what I end up buying. Thanks in advance.

Edit: It is not 1 vdev per disk. It is 1 vdev with the 2 disks in it, otherwise nothing is being mirrored.

Argentum · Dec 15, 2020

JMOR said:
In my home PC, one of my two HDD that I have in (btrfs) RAID 0 failed. So, I am shopping for the replacement. I am planning to install FreeBSD (zfs) and set up RAIDZ (buying 3 SSDs).

Now, I am reading everywhere that mirroring 2 disks (1 vdev per disk) is faster than RAIDz1 with 3 disks. And it doesn't make any sense to me.

Yes, RAIDz1 has to compute the parity info for the third disk. Although being realistic, CPU times (for computing parity) are orders of magnitude faster than I/O times (for accessing disks). So, that advantage should be close to negligible.
[...]

Your help will be appreciated, it will influence what I end up buying. Thanks in advance.

Note that this is my personal preference, but I am satisfied with my solution so far:

1. Considering that rotating disk space is cheap these days, I can easily buy a sufficiently large disk and mirror it;
2. A ZFS mirror itself is faster than RAIDZ. It is also more convenient to use. It is easy to replicate the system by moving one disk out of the mirror to a new platform;
3. Small and fast SSDs are also very cheap these days. Adding L2ARC and SLOG to the ZFS pool improves the speed significantly, giving a feeling that the whole pool is SSD;
4. Also, on desktop computers, SATA ports are a limited resource. So, with my current motherboard, I have populated two ports with rotating HDDs and two ports with small (128G) SSDs with freebsd-zfs partitions for L2ARC and a SLOG mirror over these two SSDs.

That is how I like it to be. I can physically remove one of the SSDs or one of the HDDs and the system remains bootable and operational. The speed is similar to a full SSD configuration at nearly the rotating disk price. Also note that with the new OpenZFS, the L2ARC is persistent across reboots. This is a nice feature.

ralphbsz · Dec 15, 2020

For the home user, performance probably makes no difference. But data reliability makes a huge difference. Furthermore, other file system design decisions (such as caching, log-structured writes, ...) make a bigger difference than RAID. Not to mention device hardware speed (7200 versus 5400 RPM and SSD). But if you want, let's talk about RAID speed.

In the above example, you are comparing a 2-disk mirror (a.k.a. RAID-1) with a 3-disk RAID-Z1 (a.k.a. RAID-5), which is not fair: different number of disks. Since mirroring with 3 disks is hard to explain (it works), I'll compare 4 disks to 4 disks (striped mirror versus 3+1 parity). Note that the capacity of the systems is different: The mirroring system has a capacity of 2 disks, the RAID-Z a capacity of 3 disks. And I'll ignore the degraded case (one disk down), since it happens very rarely.

Reading: Is independent of IO size. On the mirror, you can read from 4 disks in parallel. On the RAID-Z, you can read from 3 disks, since the 4th disk contains the parity. This argument is independent of IO size. So mirroring wins by a factor of 4/3 = N / (N-1). This argument works as long as there are sufficient reads to keep all disks busy; if your workload is not intense enough, you will get the performance of a single drive in either case.

Writing: Let's first look at large writes, where a whole block is overwritten (or appended to). Now we have to consider the IO size, since the on-disk data structure is arranged in blocks. With mirroring, writing two blocks' worth of data simultaneously keeps all 4 disks busy. With RAID-Z, you can write three blocks of data: the three blocks themselves go onto disks 1, 2, and 3; the parity goes onto disk 4. So here mirroring loses by a factor of 2/3 = N/2 / (N-1). In theory, for systems other than ZFS, you should also consider small writes (where the parity needs to be recalculated), but ZFS never does those. For those, RAID-5 needs to read two disks (old data and old parity), then write two disks (new data and new parity), requiring 4 IOs to write one IO's worth of data, while mirroring needs 2 writes. But those are not relevant here.

So, at constant cost of the system, mirroring is faster by a factor of N / (N-1) on reads, but slower by a factor of N/2 / (N-1) for writes. Now, are reads or writes more important to you? That is workload dependent. Most Unix server workloads tend to be write heavy, with writes being typically 2/3 to 3/4 of all IOs. So if the workload is intense enough (no burstiness, no gaps in the workload), RAID-Z would win. But in many small systems, the cost of writes is very low anyway, since most writes are cached, and the cache destaged during gaps in the workload. So maybe you should look at the read cost, where mirroring wins. On the other hand, maybe performance is not your biggest concern, but capacity is, and there RAID-Z wins at constant cost.
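The two factors quoted above can be reproduced with a minimal sketch of the same simplified model. The function name `mirror_vs_raidz` and the "one unit of throughput per busy disk" accounting are assumptions made for illustration; the model ignores caching, degraded operation, and anything ZFS-specific, exactly as the post does.

```python
from fractions import Fraction


def mirror_vs_raidz(n_disks: int) -> dict:
    """Throughput of a striped mirror relative to single-parity RAID-Z.

    Simplified model from the discussion: every disk contributes one unit
    of throughput, the workload is intense enough to keep all disks busy,
    and both layouts use the same number of disks.
    """
    if n_disks < 2 or n_disks % 2 != 0:
        raise ValueError("the model assumes an even number of disks (mirror pairs)")

    mirror_read = Fraction(n_disks)       # every mirror disk can serve reads
    raidz_read = Fraction(n_disks - 1)    # one disk's worth of each stripe is parity
    mirror_write = Fraction(n_disks, 2)   # every block is written twice
    raidz_write = Fraction(n_disks - 1)   # N-1 data blocks plus one parity block

    return {
        "read_ratio": mirror_read / raidz_read,     # N / (N-1)
        "write_ratio": mirror_write / raidz_write,  # (N/2) / (N-1)
    }


ratios = mirror_vs_raidz(4)
print("reads,  mirror relative to RAID-Z:", ratios["read_ratio"])
print("writes, mirror relative to RAID-Z:", ratios["write_ratio"])
```

For the 4-disk case this gives 4/3 for reads and 2/3 for writes, matching the factors stated in the post.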

So the answer is complicated.

garry · Dec 15, 2020

JMOR said:
I am planning to install FreeBSD (zfs) and set up RAIDZ (buying 3 SSDs).

Now, I am reading everywhere that mirroring 2 disks (1 vdev per disk) is faster than RAIDz1 with 3 disks. And it doesn't make any sense to me.

I sympathize with you. It doesn't make any sense to me either. I went through the same realization that the common advice given by some experts seems dogmatic and possibly wrong (for our SOHO setups).

I learned a lot from A Team Systems benchmarks comparing zfs mirrors and ufs gmirrors. UFS on gmirror is worth considering. I used it for a while and it performed very well, quite a bit faster than using those same disks in ZFS.

The whole argument against zfs raidz hangs on a calculation of the probability of a second disk failure during the re-silvering operation after one drive has failed and been replaced. It is a small but not negligible risk (the risk of a drive failure of one of the two surviving drives during recovery of raidz1 is at least twice as great as the risk of failure of the one surviving drive during recovery of a zfs raid1 mirror). If you have a backup and are not running a time-critical recovery (like bringing an internet web server back on-line) then the probability of crashing and burning during the recovery is close enough to zero that you can ignore it. You have a backup and enough time to do a complete rebuild in the very unlikely case that your degraded raidz1 pool dies while trying to re-add the third drive. The most effective solution to this problem may be to use enterprise hard drives or high-quality (Pro) ssd drives. I buy second-hand enterprise 4 TB drives in preference to new consumer drives.

So raidz1 does have the advantages of requiring fewer drives, fewer sata ports, and fewer drive bays. It has (approx) twice the write speed of a mirror. I'm lucky to have (just enough) sata ports and drive bays to use zfs raid10 in my main workstation. I do really like the "stripe of mirrors", getting double the capacity of a single drive, double the write speed, and very fast (4x?) reads. My next build with a compact case that only accommodates three 3.5" hard drives will use raidz1, giving me double the capacity of a single drive, double the write speed and double the read speed. Of course I have backups (to two different backup servers).

Many discussions of your question neglect to consider that a humble system, even a rather high-end non-server box, may be limited to only 3 or 4 drives for a zpool -- many boards like mine have only 6 sata ports. Oh, I always keep the boot drive separate from any data pool. This gives the freedom to export the pool, rebuild the pool, do whatever to the pool without losing the operating system. If what you are trying to do is get a fast operating system by putting it on a zpool along with all your data I can't recommend that, but maybe that's just my prejudice, or experience with being glad when I had the freedom to work on the data pool (for example, replacing the boot ssd with another ssd with linux or with a different release of FreeBSD and running that system on the same data pool).

ralphbsz · Dec 16, 2020

First, I agree with many of your arguments. In particular that the advice that is designed for enterprise-, supercomputer- or hyperscaler-size systems typically does not apply to small systems, in particular individual user systems, where workloads tend to be highly variable. I used to build systems with 200-300 disks per server, and today I work on systems that have way more than that. A lot of my intuition doesn't work for systems with 2 or 3 disks.

garry said:
The whole argument against zfs raidz hangs on a calculation of the probability of a second disk failure during the re-silvering operation after one drive has failed and been replaced. It is a small but not negligible risk (the risk of a drive failure of one of the two surviving drives during recovery of raidz1 is at least twice as great as the risk of failure of the one surviving drive during recovery of a zfs raid1 mirror). If you have a backup and are not running a time-critical recovery (like bringing an internet web server back on-line) then the probability of crashing and burning during the recovery is close enough to zero that you can ignore it.

Unfortunately, the probability is no longer that small. Let's do a numeric example: A RAID-Z1 pool with 4 disks, each 20TB (I know, those are not really released on the market yet, 18TB are, but the round number makes the math easier). If one disk fails, you have to read all three other disks to recreate the failed disk. That's 60 TB, or 480 x 10^12 bits, or (rounded) 0.5 * 10^15 bits. The published uncorrectable read error rate of disk drives is 1 per 10^15 bits read (I just used the Seagate 18TB IronWolf Pro, other vendors and models are typically the same). That means that, in round numbers, the probability of an uncorrectable read error is roughly one half! So with an array this size, every second time you get a disk failure, you will be unable to complete the rebuild! Now we need to be clear: In most cases, this doesn't mean that your whole pool with 60TB of useful capacity is dead, but that one file somewhere had one sector (probably 4 KiB, or whatever ZFS's checksum and block granularity is) that is now unreadable. Still, using the "sewage and wine theorem" (if you put a thimble of sewage into a barrel full of wine, you have sewage; if you put a thimble of wine into a barrel of sewage, you also have sewage), that means that at this scale, you can no longer trust RAID to guard against hardware failure. It was already about 15 years ago that the then-CTO of NetApp (one of the most respected vendors of storage devices) said that selling single-fault tolerant disk arrays amounts to professional malpractice. 15 years ago he might have been joking, today it's dead serious.

And by the way, the situation for mirroring is not THAT much better; do the same math with a 2-disk mirror pair of 20TB drives, and the data loss probability is only 3x smaller, about 1/6th instead of one half.
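The rebuild arithmetic for both layouts can be written out as a short sketch. The "one half" and "1/6th" figures in the posts correspond to the expected number of unreadable sectors (bits read times the quoted error rate). Converting that into a probability of at least one error requires a further assumption that is not in the posts: here errors are treated as independent (a Poisson model), which gives somewhat lower values. The helper names and the constant `TB` are made up for the example.

```python
import math

TB = 10**12  # decimal terabyte, as used by drive vendors


def expected_read_errors(bytes_read: float, errors_per_bit: float = 1e-15) -> float:
    """Expected number of uncorrectable read errors while reading bytes_read."""
    return bytes_read * 8 * errors_per_bit


def prob_at_least_one_error(bytes_read: float, errors_per_bit: float = 1e-15) -> float:
    """Probability of one or more errors, ASSUMING independent (Poisson) errors."""
    return 1.0 - math.exp(-expected_read_errors(bytes_read, errors_per_bit))


# 4 x 20 TB RAID-Z1: a rebuild reads the three surviving disks.
raidz_bytes = 3 * 20 * TB
# 2 x 20 TB mirror: a rebuild reads the one surviving disk.
mirror_bytes = 1 * 20 * TB

for name, nbytes in (("RAID-Z1", raidz_bytes), ("mirror", mirror_bytes)):
    print(f"{name}: expected errors {expected_read_errors(nbytes):.2f}, "
          f"P(at least one) {prob_at_least_one_error(nbytes):.2f}")
```

Under these assumptions the expected error count is 0.48 for the RAID-Z1 rebuild and 0.16 for the mirror, a factor of 3 apart as stated; the Poisson probabilities come out near 0.38 and 0.15.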

So what to do? Either use two-fault tolerant RAID (triple mirrors, RAID-Z2), but that gets expensive and inefficient really quickly. Or understand that RAID isn't the solution to data loss any more; it is instead just the first step, and really good backups have to be part of the solution. And you need really good backups anyway. That's because RAID doesn't protect you against "rm foo", oops. Or against an incompetent sys admin creating a new Reiser file system on one of the disks that form your RAID array (the joke is because Hans Reiser famously murdered his wife).

Personally, I also buy 4TB drives (new old stock), and I use two of them as a mirror pair at home. But I also keep very good backups. And the boot disk is on a tiny ancient SSD.

richardtoohey2 · Dec 16, 2020

Still working through the wine analogy (if it's cheap wine you might not notice the difference. How big is the barrel?)

Thanks for the detailed information. I've inherited some older RAID-1 and newer RAID-5 (with small-ish SSDs) systems and am learning about them and ZFS etc., so this is all valuable reading.

I've set up some test systems and am really learning about the performance impact (initially by measuring an import of 20 GB of data into MySQL 5.7) of the RAID options (hardware, hardware with/without cache, ZFS), drive options (consumer, enterprise, SMR, spinning, SSD, NVMe), etc., etc. Lots to learn and research, and no "best" option, because it will depend on the workload and ongoing requirements.

ralphbsz · Dec 16, 2020

Performance benchmarking storage and RAID systems is hard. Usually, the bottleneck isn't the storage hardware. For example, you say you import 20GByte of data into MySQL. Let's be pessimistic and assume that it blows up by a factor of 3 (RAID and database overhead), so you're trying to write 60GByte of data. Say for fun that you have 3 disks, which each can do 200 MByte/s sequential outermost. You should be done in 100 seconds. Even assuming the disks aren't 100% efficient, any time >200 seconds cannot be explained by the storage system. Clearly, that's not what you measure. So where is your real bottleneck? Tough question.
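The back-of-the-envelope bound above is easy to recompute for other setups. This is only a sketch of that arithmetic; the function name and parameter names are invented for the example, and it assumes purely sequential writes spread evenly over all disks.

```python
def min_transfer_seconds(payload_bytes: float, amplification: float,
                         n_disks: int, disk_bytes_per_s: float) -> float:
    """Lower bound on the time the disks need to absorb a bulk write.

    amplification is the assumed blow-up factor (RAID plus database
    overhead); the disks are assumed to stream sequentially in parallel.
    """
    return payload_bytes * amplification / (n_disks * disk_bytes_per_s)


# Numbers from the post: 20 GByte import, 3x blow-up, 3 disks at 200 MByte/s.
seconds = min_transfer_seconds(20e9, 3, 3, 200e6)
print(f"storage-bound time: {seconds:.0f} s")
```

If a measured import takes far longer than this bound (or than a generous multiple of it), the time is being spent somewhere other than raw disk bandwidth.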

garry · Dec 16, 2020

ralphbsz said:
First, I agree with many of your arguments. In particular that the advice that is designed for enterprise-, supercomputer- or hyperscaler-size systems typically does not apply to small systems, in particular individual user systems, where workloads tend to be highly variable. I used to build systems with 200-300 disks per server, and today I work on systems that have way more than that. A lot of my intuition doesn't work for systems with 2 or 3 disks.

Unfortunately, the probability is no longer that small. Let's do a numeric example...

Oh, you're reasoning in terms of bit failure. Damn. That is interesting. I was only thinking of the (low) probability of a catastrophic loss of the whole (degraded) pool. If I had to worry about loss of a bit, as you sometimes do, I'd be pushed toward the conclusion that all very valuable data must be stored on archival quality acid-free paper.

The calculations in your posts are helpful. Thanks.

ralphbsz · Dec 16, 2020

garry said:
If I had to worry about loss of a bit, ...

It depends on the use case. For example, is this system being used in a commercial setting, where loss of a single file (due to a single bit error) means that the customer will be very mad? Or is this a home system, where any one file can be quickly restored from backup, and the only user knows what they were working on recently, so if the file was modified since the last backup, they can recreate it?

I'd be pushed toward the conclusion that all very valuable data must be stored on archival quality acid-free paper.

People who worry about archival long-term storage actually think about these things. Clearly magnetic media (disks, tapes) are not likely to last for more than 50 years, since they rely on oils (lubrication) and plastics (tape) that degrade. So for long-term storage, one idea is punched cards (with very small punch holes) made out of nickel foil. Another idea is to use glass, and etch or darken the glass; as cathedral windows show, glass lasts a long time.

richardtoohey2 · Dec 16, 2020

ralphbsz said:
Performance benchmarking storage and RAID systems is hard. Usually, the bottleneck isn't the storage hardware. For example, you say you import 20GByte of data into MySQL. Let's be pessimistic and assume that it blows up by a factor of 3 (RAID and database overhead), so you're trying to write 60GByte of data. Say for fun that you have 3 disks, which each can do 200 MByte/s sequential outermost. You should be done in 100 seconds. Even assuming the disks aren't 100% efficient, any time >200 seconds cannot be explained by the storage system. Clearly, that's not what you measure. So where is your real bottleneck? Tough question.

Good points. I'm not really doing "proper" bench-marking, more tinkering with some machines to see what seems to perform well with this one test case. Nothing I've done is remotely close to 100 seconds, though! But when in production, it's not just about "perform well", it's also going to be "but what if one drive fails at 10 a.m. on a working day"?

ralphbsz · Dec 16, 2020

richardtoohey2 said:
But when in production, it's not just about "perform well", it's also going to be "but what if one drive fails at 10 a.m. on a working day"?

Good point. In the old days, when disk subsystems were small (say 10 disks), the probability of any disk failing was very low. Say disks last for 1.2 million hours (that's a typical MTBF quoted by manufacturers), and it takes 72 hours to fully replace a disk (24 hours to get the replacement shipped by FedEx, then 48 hours to resilver it). In that case, on average, you will be having a disk failure every 14 years. If performance sucks during 3 days every 14 years, that's considered acceptable by most. Now do the same with today's production systems, for example one that has 10K disks (which would not be considered huge for the hyper-scalers or big corporate data centers). At that point, about half the time at least one disk is down and being resilvered! On a system like that, it is vitally important that performance during resilvering is acceptable.
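Both fleet sizes can be checked with a small sketch. It assumes a constant failure rate of 1/MTBF per disk and independent failures, and it estimates the chance that at least one disk is in repair from the average number of disks in repair using a Poisson approximation; that approximation and all function names are assumptions made here, not part of the post.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days


def years_between_failures(n_disks: int, mtbf_hours: float) -> float:
    """Average time between disk failures anywhere in the system."""
    return mtbf_hours / n_disks / HOURS_PER_YEAR


def expected_disks_in_repair(n_disks: int, mtbf_hours: float,
                             repair_hours: float) -> float:
    """Average number of disks that are down or resilvering at any moment."""
    return n_disks * repair_hours / mtbf_hours


def prob_some_disk_in_repair(n_disks: int, mtbf_hours: float,
                             repair_hours: float) -> float:
    """Fraction of time with at least one disk in repair (Poisson approximation)."""
    return 1.0 - math.exp(-expected_disks_in_repair(n_disks, mtbf_hours, repair_hours))


MTBF = 1.2e6   # hours, as quoted in the post
REPAIR = 72    # hours: 24 h shipping + 48 h resilver

print(f"10 disks: one failure every {years_between_failures(10, MTBF):.1f} years")
print(f"10K disks: some disk in repair "
      f"{prob_some_disk_in_repair(10_000, MTBF, REPAIR):.0%} of the time")
```

With these inputs the small system sees a failure roughly every 14 years, while the 10K-disk system has on average 0.6 disks in repair at any moment, which corresponds to "about half the time".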

Argentum · Dec 17, 2020

ralphbsz said:
Good point. In the old days, when disk subsystems were small (say 10 disks), the probability of any disk failing was very low. Say disks last for 1.2 million hours (that's a typical MTBF quoted by manufacturers), ...

That is approximately 137 years. In real life we can see that most older hard drives are dead or faulty. These are usually much younger than 100 years!

olli@ · Dec 17, 2020

JMOR said:
Isn't RAID 5 (RAIDz1) equal to RAID 0 plus a third disk for parity?

No, that would be RAID-3 (without striping) or RAID-4 (with striping). These aren’t used anymore nowadays, because RAID-5 is better in practice (or RAID-1, depending on circumstances – ralphbsz has explained this very well).

With RAID-5, the parity data is evenly distributed across all disks. There is no dedicated parity disk; all disks are equal.

For even higher robustness, you can use RAID-6 which contains twice the amount of parity data, so you can lose two disks at once without losing any data. For example, when you have 5 disks, a RAID-5 would have 20 % parity data and 80 % effective capacity, and a RAID-6 would have 40 % parity data and 60 % effective capacity. Write performance is lower, of course, because twice as much parity data has to be written.
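The percentages in the 5-disk example follow directly from the number of parity disks' worth of space per stripe. A minimal sketch, with hypothetical helper names and assuming equally sized disks:

```python
from fractions import Fraction


def parity_fraction(n_disks: int, parity_disks: int) -> Fraction:
    """Share of raw capacity spent on parity (1 for RAID-5, 2 for RAID-6)."""
    if not 0 < parity_disks < n_disks:
        raise ValueError("need at least one data disk and one parity disk")
    return Fraction(parity_disks, n_disks)


def usable_fraction(n_disks: int, parity_disks: int) -> Fraction:
    """Share of raw capacity left for data."""
    return 1 - parity_fraction(n_disks, parity_disks)


for label, parity in (("RAID-5", 1), ("RAID-6", 2)):
    print(f"{label} on 5 disks: "
          f"{float(parity_fraction(5, parity)):.0%} parity, "
          f"{float(usable_fraction(5, parity)):.0%} usable")
```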

A good alternative is a RAID-5 plus hotspare. You get all the advantages of RAID-5 (better performance than RAID-6), and you don’t have to wait for a replacement disk when one disk fails. You can even arrange for “proactive resilver”, i.e. monitor the SMART data of the disks, and when one disk appears to be reaching its end of life in the near future, start using the hotspare even before that disk actually fails (I’m not sure if ZFS has existing support for this, or if you have to write some scripts yourself).

By the way, when creating a mirror of SSD drives, it is good practice to buy two different drives from different vendors. They should have same or similar capacity and performance characteristics, of course. There are two reasons for that: First, the drives that comprise a mirror will have identical write load, which means that wear of the flash cells is identical, which increases the probability that they will both fail at the same time (or in close succession). Different vendors use different wear-leveling algorithms, and probably also different brands of flash memory and flash controllers, reducing that risk. And second, consider the case that there is a firmware bug that causes problems at some point during the lifetime of a drive – things like that happened in the past, so it’s not just hypothetical. If you buy two identical drives, then they will have the same firmware, and again there is a higher risk that they will both cause trouble at the same time.

ralphbsz · Dec 17, 2020

Argentum said:
That is approximately 137 years. In real life we can see that most older hard drives are dead or faulty. These are usually much younger than 100 years!

An MTBF of 137 years (=1.2 million hours) doesn't mean that every disk functions for exactly 137 years, and then fails. To begin with, given the limited economically useful life of disks, nobody would be able to measure this: After about a decade, operating a disk becomes useless, as it is way too expensive (too much power and space usage for its capacity and performance). The MTBF means that during the sensible useful time of new disk drives (typically 5 to 10 years), the rate of failures is such that on average, drives would live 137 years. In reality, it is an AFR (annual failure rate) specification. The other thing one has to be aware of is that the MTBF or AFR is an average over a time-dependent failure rate. Everyone knows the bathtub curve, right?

By the way, I personally don't quite believe the MTBF specification given by vendors. I've done informal estimates of this for large disk systems, and personally I would say that disks seem to last 700-800 K hours (that observation was using disks that were shipped in about 2010, and operated for 6 to 7 years). That factor of roughly 2/3 doesn't change the above conclusions very much.
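To make the MTBF-as-AFR reading concrete, here is a minimal conversion sketch. It assumes a constant failure rate (the flat part of the bathtub curve) and continuous 24x7 operation; the function name is invented for the example, and the 750 K hour input is simply the midpoint of the informal 700-800 K estimate above.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days of continuous operation


def annual_failure_rate(mtbf_hours: float) -> float:
    """Probability that a disk fails within one year, given a constant
    failure rate of 1/MTBF (exponential lifetime model)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


for mtbf in (1.2e6, 750e3):
    print(f"MTBF {mtbf:,.0f} h  ->  AFR {annual_failure_rate(mtbf):.2%}")
```

Under these assumptions a 1.2 million hour MTBF corresponds to an AFR of roughly 0.7% per year, and 750 K hours to a bit over 1% per year.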

chrbr · Dec 17, 2020

Argentum said:
That is approximately 137 years. In real life we can see that most older hard drives are dead or faulty. These are usually much younger than 100 years!

This is true. Regarding reliability: failures happen at the beginning of a product's lifetime, and they increase again after some time, when parts such as the mechanics are worn out. The failure versus time curve looks like a bathtub. The MTBF is related to the flat part of the curve. Early failures are assumed to be sorted out already, and the wear-out period with excessive failures is out of scope, too. It is all about the statistics of the normal lifetime. A good approximation is the e-function. Taking this into account, after a time equal to the MTBF the probability that a device is still alive or has failed is 37%. I do not remember which case is the correct one. Nobody would accept a failure of 37%. In some industries a lifetime is specified with a maximum failure rate. One can use the MTBF to estimate whether a product will meet the spec under certain conditions or not. One of the key factors is the temperature. Another one is the reliability model. The manufacturers run stress tests of their products under excessive conditions to find a good match with the statistical model, and publish the MTBF together with reference conditions such as a temperature.
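The 37% figure can be pinned down with the exponential ("e-function") model mentioned above. In that model the survival probability after time t is exp(-t / MTBF), so at t equal to the MTBF it is exp(-1), about 37%; that is the share still alive, and about 63% have failed. The sketch below assumes a constant failure rate over the whole period, which real drives do not have once wear-out begins; the function name is made up for the example.

```python
import math


def survival_probability(t_hours: float, mtbf_hours: float) -> float:
    """Probability that a device is still working after t_hours,
    assuming a constant failure rate of 1/MTBF (exponential model)."""
    return math.exp(-t_hours / mtbf_hours)


MTBF = 1.2e6  # hours, the figure used earlier in the thread

alive = survival_probability(MTBF, MTBF)
print(f"alive after one MTBF:  {alive:.1%}")
print(f"failed after one MTBF: {1 - alive:.1%}")

# Over a realistic service life the picture is very different:
five_years = 5 * 8766
print(f"alive after 5 years:   {survival_probability(five_years, MTBF):.1%}")
```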