Solved Striped mirror vs raidz2 for reliability

Greetings all,

I am rebuilding my music server and, as such, performance is not an issue; consequently, I have been reading about the reliability of the two different configurations, as I am dreading the task of re-ripping my collection.

Considering the striped mirror, I understand that the two drives in a mirror hold the same data; how the data is distributed across the stripe, however, is a function of the ZFS algorithm. Now, if a drive in the first mirror fails and, during the re-silver process, the other drive in the same mirror fails, the entire pool fails.

This is not true for raidz2, wherein any two drives may fail.

The above does not appear to be the whole story though, because it is claimed that the re-silvering process is less stressful on the remaining drive in the mirror than on the remaining drives in the raidz2. Additionally, it appears that the issue is further complicated by the size of the drives.

However, I have not found any analysis besides the claim supra. Is there any paper on a small array (4 x 2GB disks)?

Kindest regards,

M
 
Paper? Don't know of any. I just answered a similar question in https://forums.freebsd.org/threads/new-server-hardware-epyc-9004.93914/post-659859

But in addition to the capacity and durability question I addressed in that post, you may also want to look at performance. And for that, you may need to separately consider write performance, read performance, small versus large files (since RAID-Zx has different behavior for small updates compared to mirroring), and resilvering performance. That is a long and complex question.
 
Hi ralphbsz,

thank you for the link. I am not sure that the other considerations you mentioned have much import. I will be the only user of the server, and if my understanding is correct, streaming music does not require many resources.

Kindest regards,

M
 
The above does not appear to be the whole story though, because it is claimed that the re-silvering process is less stressful on the remaining drive in the mirror than on the remaining drives in the raidz2. Additionally, it appears that the issue is further complicated by the size of the drives.

However, I have not found any analysis besides the claim supra. Is there any paper on a small array (4 x 2GB disks)?

From theoretical considerations I am not so sure about this claim.

It is probably true for traditional RAID systems that have no knowledge of which blocks are actually taken. But ZFS? It resilvers only occupied blocks, even in mirrors.
 
But ZFS? It resilvers only occupied blocks, even in mirrors.
That is my understanding also, which makes discussions about "rsync vs zpool send|receive" interesting.

For mirror vs RAIDZx, I think you really need to look at performance.
Simplistically:
Mirrored writes finish at the slowest device, reads finish at the fastest. Yes, there are a lot of nuances, but this is a good approximation.

With RAIDZx, I think both reads and writes finish at the slowest device.

Not sure exactly how that translates to resilver performance, but I'm sure it does somehow.
 
Hi cracauer@,

thank you for replying. As I wrote, that claim was gleaned from searching on the Internet, with results from different fora. Hence my question whether there is a more formal analysis of the re-silvering.

Hi mer,

thank you for replying. Do you mean performance as to read/write on a functioning pool, or performance as to pool re-silvering? If the latter, that is an interesting point, but again, I failed to find any formal analysis.

Kindest regards,

M
 
thank you for replying. Do you mean performance as to read/write on a functioning pool, or performance as to pool re-silvering? If the latter, that is an interesting point, but again, I failed to find any formal analysis.
"Yes" I think :) My statements are basically "me taking everything I've read, thinking about it, toss in some personal experience", giving you my opinions/understanding. I'm not aware of any papers/formal analysis on the different bits.
Functioning pool or resilver performance I think are basically the same.

Mirrors:
By definition, a mirror pushes the same request (read or write) to 2 devices at the same time. The OS delivers 1 request, the lower layers split it into 2, then wait. A read is considered complete as soon as either device satisfies it. A write is not complete until both devices are done. So writes tend to track the slowest device, reads tend to track the fastest. Resilvering is "read from one, write to the other", so I think it still tracks the slowest device: reading from the slowest and writing to the fastest gates on the slowest to complete the read; reading from the fastest and writing to the slowest gates on the slowest to complete the write before doing the next thing.

RAIDZx:
Roughly, the block of data is striped across all the devices, maybe plus parity. The read or write is not considered complete until the kernel has the whole block (or all devices have ack'd writing their pieces), so again I think it is driven by the slowest device. The same applies to resilver. Why? Let's say you replace the fastest device: putting the missing data on that device requires reading a block from all the other devices and using parity to derive the missing data, so it is driven by the slowest device.
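To make the "simplistically" part concrete, here is a toy latency model (my own sketch, not ZFS internals) of the completion rules above, with made-up per-device latencies in milliseconds:

```python
# Toy model: per-request completion time for a mirror vs. a raidz stripe,
# given a list of per-device latencies in milliseconds.

def mirror_read(latencies):
    # a mirrored read is satisfied by whichever device answers first
    return min(latencies)

def mirror_write(latencies):
    # a mirrored write must land on every device before it completes
    return max(latencies)

def raidz_io(latencies):
    # a raidz record is striped across all devices, so both reads and
    # writes wait for the slowest member
    return max(latencies)

disks = [5.0, 6.0, 12.0]  # hypothetical vdev with one slow disk

print(mirror_read(disks[:2]))   # 5.0  -> fastest mirror side
print(mirror_write(disks[:2]))  # 6.0  -> slowest mirror side
print(raidz_io(disks))          # 12.0 -> slowest stripe member
```

The asymmetry is the whole point: only mirrored reads get to take the minimum; everything else gates on the maximum.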
 
Hi mer,

thank you for the clarification.

Functioning pool or resilver performance I think are basically the same.
In my way of thinking, it is different. Let us take the music server, as this is what I am currently contemplating. When the pool functions, if I want to listen, the data is read from the pool and delivered to the network interface; hence, no writing step is necessary. If a new piece of music is delivered to the pool, only writing is involved.

During re-silvering, as you explained, both reading and writing are involved.

Kindest regards,

M
 
Functioning pool or resilver performance I think are basically the same.

From the documentation: (emphasis mine)

Pool geometry

If small random IOPS are of primary importance, mirrored vdevs will outperform raidz vdevs. Read IOPS on mirrors will scale with the number of drives in each mirror while raidz vdevs will each be limited to the IOPS of the slowest drive.

If sequential writes are of primary importance, raidz will outperform mirrored vdevs. Sequential write throughput increases linearly with the number of data disks in raidz while writes are limited to the slowest drive in mirrored vdevs. Sequential read performance should be roughly the same on each.
 
Hi mer,

thank you for the clarification.


In my way of thinking, it is different. Let us take the music server, as this is what I am currently contemplating. When the pool functions, if I want to listen, the data is read from the pool and delivered to the network interface; hence, no writing step is necessary. If a new piece of music is delivered to the pool, only writing is involved.

During re-silvering, as you explained, both reading and writing are involved.

Kindest regards,

M
Agreed with this use case. Reads can be affected by the ARC (roughly, a read cache). If you have multiple clients streaming the same song, it's likely a lot of it will wind up in the ARC, and satisfying reads from RAM is quicker than satisfying them from disk.

Eric A. Borisch thanks. Good info.
 
Hi mer,

thank you for the confirmation.

Hi Eric A. Borisch

Thank you for the information. Though, I am not quite certain how to apply it to understanding the re-silvering, which is my main concern.

Kindest regards,

M.
 
I am not sure that the other considerations you mentioned have much import. I will be the only user of the server, and if my understanding is correct, streaming music does not require many resources.
There is no need for a striped mirror. Use a normal mirror for music streaming. I guess you will use SSDs.
 
Do consider reliability issues other than the disk drives themselves, i.e. redundancy in controllers, plus cables and connectors for both power and data.

When you use mirrors, you can split the controllers and cables between the two sides of the mirror, meaning that a well-designed mirror will survive the failure of a controller or cable. This is much harder to do with RAID-Z.
 
With 4 disks, raidz2 and 2x2 mirrors will have the same capacity and the same number of writes during normal operations, and since performance is not an issue, your considerations should be 1) theoretical reliability, 2) resilver speed, and 3) any disk failures during resilver. For 1), raidz2 is much better. For 2), mirrors should win. For 3), even if another disk fails during recovery, with raidz2 you can still recover; with mirrors, that depends on which disk fails (if it is the mirror of the already-failed disk, you lose the pool).
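A quick enumeration of the C(4,2) = 6 possible two-disk failures makes point 1) concrete (a toy sketch; I am assuming the 2x2 layout mirrors disks 0+1 and 2+3):

```python
from itertools import combinations

DISKS = range(4)
MIRRORS = [{0, 1}, {2, 3}]  # assumed 2x2 striped-mirror layout

def mirror_survives(failed):
    # the pool dies if any single mirror loses both of its disks
    return all(not m <= failed for m in MIRRORS)

def raidz2_survives(failed):
    # raidz2 tolerates any two failures within the vdev
    return len(failed) <= 2

pairs = [set(p) for p in combinations(DISKS, 2)]
print(sum(raidz2_survives(p) for p in pairs), "of", len(pairs))  # 6 of 6
print(sum(mirror_survives(p) for p in pairs), "of", len(pairs))  # 4 of 6
```

So raidz2 survives all six two-disk failures, while the striped mirror survives only the four pairs that do not take out both sides of one mirror.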
 
Hi 6502,

we are moving away from the topic, but I have 4 available SATA ports on the server and about six 2GB drives; hence the proposed structure, as I will have two spares.

Hi gpw928,

as I understand the hardware, all the SATA ports are on the same controller. Furthermore, even with multiple controllers, if one fails, the entire striped mirror will become unusable.

Or, am I missing a point?

Hi bakul,

please do not take this the wrong way, but regarding your point 2), it is the same claim that I discussed in my first post, and arguably several posters disagree. Regarding 3), again, that is what I stated in my first post.

Am I missing a point?

Kindest regards,

M
 
There’s no right answer here for which is “better” for you. (Wrt. resilvering)

Some points:

There is no doubt that a raidz2 is more robust to failures, as you’ve pointed out, with the ability to survive any two drive failures. Any amount of “extra stress” from the more involved resilver of a raidz2 is certainly more than outweighed by this fact. If all you care about is durability, go for raidz2. You should be periodically scrubbing (I run 1/month), which will be very similar in stress level to a resilver, anyway.

Modern zpools have a “device rebuild” feature where a mirror replacement first gets copied fully sequentially (as fast as possible) and then kicks off a scrub (although with two-way mirrors, if the scrub detects a failure in the just copied data — which should match the other drive — you’re out of luck).

So you’ll get back up to your original redundancy quicker with a mirror, but you are more exposed during that time. Tradeoffs.

Note that for small files, you will have some additional overhead (lost capacity) for the raidz2 compared to the mirror due to the layout requirements ZFS imposes (allocations must be a multiple of 3 sectors for raidz2). If you’re mainly storing large files with the default 128k recordsize, this will be ~1.5% of extra overhead. (A 1M recordsize brings that close to 0, for the 1M records.)
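As a rough illustration of that rounding rule (my own sketch; it assumes 4 KiB sectors, i.e. ashift=12, and a 2+2 raidz2, so 2 parity sectors per 2 data sectors — the real ZFS accounting has more wrinkles, e.g. compression):

```python
# Sketch of raidz2 allocation padding: allocations (data + parity) are
# rounded up to a multiple of parity+1 = 3 sectors.
SECTOR = 4096   # assumed ashift=12
PARITY = 2
DATA_DISKS = 2  # 2+2 raidz2

def raidz2_alloc_sectors(record_bytes):
    data = -(-record_bytes // SECTOR)       # ceil: data sectors needed
    stripes = -(-data // DATA_DISKS)        # ceil: stripes across the vdev
    total = data + stripes * PARITY         # data + parity sectors
    pad = (-total) % (PARITY + 1)           # round up to a multiple of 3
    return total + pad, pad

for rs in (16 * 1024, 128 * 1024, 1024 * 1024):
    total, pad = raidz2_alloc_sectors(rs)
    print(rs // 1024, "KiB record ->", total, "sectors,", pad, "pad")
```

The padding is at most 2 sectors per record, so bigger records amortize it better, which is the intuition behind preferring a larger recordsize for large files.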

Since small random reads are better served by the mirrors, the filesystem may feel “snappier” for some operations (like finds over a large directory tree.)

But RAID IS NOT BACKUP, and if you have this data backed up somewhere else, the reduced durability of the mirrors shouldn't be a problem.

My vote is mirrors (assuming you have it backed up) and raidz2 if you don’t for the extra durability.
 
As others have said, resilvering depends on the number of data blocks you have written, not the disk size. Now imagine that you have N data blocks. Let us introduce the term "recovery unit". For a mirror it is two blocks: if you lose one block, you can recover it from the other, and the unit stores exactly 1 block of data. For raidz2, it is M+2 blocks: you can lose *any* one or two of the blocks and recreate them from the remaining M. In effect it stores M blocks of data.

For 2x2 mirrors you will have N recovery units. If you lose one disk, (assuming equal distribution) you will have N/2 blocks on each disk so you read N/2 blocks and you write N/2 (to the new disk) during resilvering. If two disks are lost, one from each mirror, you read N blocks and write N blocks.

For 2+2 raidz2, you will have N/2 recovery units. You will have to read 2 blocks from each of N/2 recovery units to recreate the data on one disk. So in effect you are reading N blocks and writing N/2 blocks. If two disks are lost, you still read N blocks but now write N blocks.
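Restating those counts as a quick sketch (my own restatement of the arithmetic above; N = 1000 blocks, the same 2x2 mirror and 2+2 raidz2 layouts, and for the two-disk mirror case, one disk lost from each mirror):

```python
# Block reads/writes during resilver, per bakul's accounting.

def mirror_resilver(n_blocks, disks_lost):
    # each lost disk holds n/2 blocks (assuming even distribution);
    # each block is re-read from its twin and written to the new disk
    reads = writes = disks_lost * n_blocks // 2
    return reads, writes

def raidz2_resilver(n_blocks, disks_lost):
    # 2+2 raidz2: n/2 recovery units; rebuilding each unit requires
    # reading 2 surviving blocks; each lost disk needs n/2 blocks rewritten
    units = n_blocks // 2
    reads = units * 2
    writes = disks_lost * units
    return reads, writes

n = 1000
print(mirror_resilver(n, 1))  # (500, 500)
print(raidz2_resilver(n, 1))  # (1000, 500)
print(raidz2_resilver(n, 2))  # (1000, 1000)
```

The read totals are where the "raidz2 resilver is more stressful" claim comes from: for a single lost disk, the raidz2 reads twice as many blocks as the mirror, spread over the three survivors.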
 
Hi Eric A. Borisch

thank you for the detailed answer.

Of course I scrub, in fact on a monthly schedule as you do, and indeed I have a backup. Once one either loses (valuable) data or comes close, the backup is the first consideration.

Nevertheless, given my natural and acquired paranoia, I finally decided on raidz2 for the reasons eloquently stated in your first paragraph.

Hi bakul,

I think that I understand the second paragraph, however not necessarily the third one. In my limited understanding, the data and parity are distributed over all four disks. Thus, if one disk fails, restoring the disk requires a read from only two disks, no?

Or, are you saying the same and I am just not grasping your nomenclature?

Kindest regards,

M
 
Almost. In 2+2 raidz2, you read from two disks whether one or two disks fail! I should qualify this further: the data & parity blocks are spread over all four disks, so if one disk fails, you can similarly spread the reads over the 3 surviving disks (e.g. a1,a2 from disks 1 & 2, b1,b2 from disks 2 & 3, etc.). So theoretically you have 3 times the disk bandwidth available (as opposed to 1 disk in the mirror case).
 
It is probably true for traditional RAID systems that have no knowledge of which blocks are actually taken. But ZFS? It resilvers only occupied blocks, even in mirrors.
EXACTLY!

This is the biggest advantage of modern RAID systems, which unify the upper layer of the storage stack (file system or whatever it is called this week) and the lower RAID layer (or whatever the redundancy layer is named). There are a few big observations one can make, and applying them all makes RAID work several times better:
  1. The RAID layer does not have to resilver data that is currently unused (not allocated, nothing valuable stored in it).
  2. When resilvering, give priority to those blocks that have the worst damage right now. For example, in a 2-fault tolerant encoding (like RAID-Z2), first resilver those blocks that currently have no redundancy at all, and only start working on blocks that still have some redundancy after all the worse problems have been solved.
  3. How urgent is resilvering? That depends on how bad the damage is right now. For example, in RAID-Z2, if only one disk has failed, then every bit of data and metadata on disk still has some redundancy, and it is OK to do resilvering slowly, with minimal impact on customer (user, foreground) workload. On the other hand, if two disks are down, resilvering is extremely urgent, and customer workload has to take a back seat for a little while.
  4. Corollary to item 3, when the width of the RAID group and the encoding width can be different: Only a very small fraction of the blocks will have multiple faults. For example, if you have a 10-disk wide encoding with two redundancies (RAID-Z2 over 10-disk groups) using 100 disks attached to the server, and one disk fails, only 10% of all data will have one fault, the other 90% is not even affected. If a second randomly chosen disk fails, only 1% of all data will have two faults (and be on the edge of data loss), while about 19% of all data now has a single fault.
  5. Metadata is more important than data. Meta-metadata is more important than metadata. Meta-meta-metadata is ... you get it. A good system design might for example be to store data in a 2-fault tolerant encoding (like RAID-Z2), metadata (the inodes, directories, allocation tables) 3-fault tolerant, and meta-metadata (like superblocks and partition tables) 4-fault tolerant (5 copies). Like that even in the case of some data loss, most of the data is still preserved, because we can still reason about the state of the system. Some large commercial systems use exceedingly wide encodings, which give incredibly good durability with reasonable space overhead.
I think ZFS implements items 1 and 2, but not item 3 (which would be easy to add). Items 4 and 5 would probably require a major redesign, and it's not clear that for the target market of ZFS (smallish servers with up to a dozen disks per pool) they make sense; this is the stuff that larger systems can do better.
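The percentages in item 4 can be checked by enumerating all two-disk failures (a toy calculation on my part: 100 disks in ten 10-wide raidz2 groups, each group holding 10% of the data; the averages land near the quoted 1% and 19%):

```python
from itertools import combinations

GROUPS = 10   # raidz2 vdevs
WIDTH = 10    # disks per vdev; each vdev holds 10% of the data

def group_of(disk):
    return disk // WIDTH

double = single = 0
pairs = list(combinations(range(GROUPS * WIDTH), 2))
for a, b in pairs:
    if group_of(a) == group_of(b):
        double += 1   # one vdev has two faults -> its 10% of data is double-faulted
    else:
        single += 2   # two vdevs each have one fault -> 2 x 10% single-faulted

# average fraction of data at each fault level over all two-disk failures
print("two faults:", 0.10 * double / len(pairs))   # ~0.009 (about 1%)
print("one fault :", 0.10 * single / len(pairs))   # ~0.18  (about 18-19%)
```

Only 450 of the 4950 possible pairs hit the same group, which is why the double-fault exposure stays so small.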
 
Hi bakul,

thank you for the clarification; I think that we are on the same page. Just one question, for my education:

So if one disk fails, you can similarly spread the reads over 3 disks (e.g. a1,a2 from disk 1& 2, b1,b2 from disk 2&3, etc.).
Do you know if this is actually implemented, or will the re-silvering algorithm try to use only two of the three disks?

Kindest regards,

M
 
I have not found any analysis beside the claim supra. Is there any paper on a small array (4 x 2GB disks)?
Not a (full) answer you're looking for, but has some relevance in relation to your topic (with empirical data):
ZFS: Resilver Performance of Various RAID Schemas - 2016 by Louwrentius.
Notable quote:
Observations
I think the following observations can be made:
  1. Mirrors resilver the fastest even if the number of drives involved is increased.
  2. RAID-Z resilver performance is on-par with using mirrors when using 5 disks or less.
  3. RAID-Zx resilver performance deteriorates as the number of drives in a VDEV increases.
I find it interesting that with smaller number of drives in a RAID-Z VDEV, rebuild performance is roughly on par with a mirror setup. If long rebuild times would scare you away from using RAID-Z, maybe it should not. There may be other reasons why you might shy away from RAID-Z, but this doesn't seem one of them.
 
  1. The RAID layer does not have to resilver data that is currently unused (not allocated, nothing valuable stored in it).
  2. When resilvering, give priority to those blocks that have the worst damage right now. For example, in a 2-fault tolerant encoding (like RAID-Z2), first resilver those blocks that currently have no redundancy at all, and only start working on blocks that still have some redundancy after all the worse problems have been solved.
    [...]
I think ZFS implements items 1 and 2, [...]
#2 seems interesting. Any further documentation about that?
I haven't come across any hints or discussions about that; for example I didn't notice it being mentioned in:
  1. Scrub/Resilver Performance by Saso Kiselkov in OpenZFS Developer Summit 2016
  2. New prefetcher for sequential scrub by Tom Caputi in OpenZFS Developer Summit 2017
  3. Resilvering multiple disks at once in a ZFS pool adds no real extra overhead - 2017 by Chris Siebenmann
  4. Sequential scrubs and resilvers are coming for (open-source) ZFS - 2017 by Chris Siebenmann
Some notable quotes from #2 and #3 respectively:
Scrub and Resilver Background
• Scrubs and resilver use exactly the same code
As far as we can tell from both the Illumos ZFS source code and our experience, the answer is that replacing multiple disks at once in a single pool is basically free (apart from the extra IO implied by writing to multiple new disks at once). In particular, a ZFS resilver on mirrored vdevs appears to always read all of the metadata and data in the pool, regardless of how many replacement drives there are and where they are. This means that replacing (or resilvering) multiple drives at once doesn't add any extra read IO; you do the same amount of reads whether you're replacing one drive or ten.
 
Hi Erichans,

Not a (full) answer you're looking for, but has some relevance in relation to your topic (empirical data):
ZFS: Resilver Performance of Various RAID Schemas - 2016 by Louwrentius.
Thank you very much for posting this; much better than the (mostly) unsubstantiated claims. The RAID-Z performance is surprisingly good, but given my above-mentioned paranoia, I am rather pleased with my decision to use RAID-Z2, which for four disks fares rather well.

Kindest regards,

M
 