Solved When does it make more sense to use multiple vdevs in a ZFS pool?

scilek · Jan 4, 2022

I know for one thing that if the primary concern were protection against disk failures, raidz3 would be the way to go. In such a scenario, If I had 7 disks, I would create a pool like this:

Code:

  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz3-0  ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0

But what if I had 14 disks? Would I be better off if I created a pool like this:

Code:

  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz3-0  ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
         raidz3-1  ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
            ada9    ONLINE       0     0     0
            ada10    ONLINE       0     0     0
            ada11    ONLINE       0     0     0
            ada12    ONLINE       0     0     0
            ada13    ONLINE       0     0     0

or like this:

Code:

  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz3-0  ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
            ada9    ONLINE       0     0     0
            ada10    ONLINE       0     0     0
            ada11    ONLINE       0     0     0
            ada12    ONLINE       0     0     0
            ada13    ONLINE       0     0     0

mer · Jan 4, 2022

14 disks, first option would be better, except for the fact that things are striped across the two of them.
If the primary concern is data protection, mirror the 2 raidz3 vdevs. RAID gives you protection from disk failure, the mirror gives you protection from vdev failure.

Another option would be "create a raidz3 on top of mirrors". mirror ada0/ada1, ada2/ada3...ada12/ada13, then create a raidz3 on top of those mirrors.

scilek · Jan 4, 2022

mer said:
14 disks, first option would be better, except for the fact that things are striped across the two of them.

So if one of the vdevs were to be lost, the whole pool would be lost, right?

mer · Jan 4, 2022

scilek said:
So if one of the vdevs were to be lost, the whole pool would be lost, right?

Correct. Same problem if you stripe 2 physical disks. But you'd need to decide/figure out if losing a raidz3 vdev is a realistic possibility.

sko · Jan 4, 2022

mer said:
If the primary concern is data protection, mirror the 2 raidz3 vdevs. RAID gives you protection from disk failure, the mirror gives you protection from vdev failure.

You should completely forget about classical RAID arrangements when it comes to ZFS. raidz already gives you a margin of 3 disks that could fail. so the whole vdev would _only_ fail if more than 3 disks of that raidz3 are failing. This is not the same as with classic hardware raid where a botched disk header on one drive can ruin your whole RAID5 array.
When using raidz you should also always remember that resilvering a raidz device can take _EXTREMELY_ long (days/weeks!), especially if the pool is under constant load. So for a huge raidz-vdev with 10TB++ drives there is a very high probability of multiple drives failing before the first one finished resilvering.

if you want somewhat usable iops don't use raidz, or use _a lot of_ raidz vdevs. the more zfs can spread the load across multiple vdevs, the better it can perform.
In terms of a single raidz vdev the overall performance of the pool is roughly the same as the slowest provider (disk) of that single vdev. Yes, IO is spread across all of the disks, but for a read or write operation that goes across all drives, zfs has to wait until each drive has finished; so the limiting factor is (in the worst case) always the slowest drive.

Using multiple mirrored (2-, 3- or N-way) vdevs gives you the combined iops performance of all vdevs and especially for reads the fastest drive of each mirror dictates the vdevs performance, not the slowest one.

Depending on the use case of this pool, I'd go either with 6x2mirror vdevs + 2hot-spares if the pool is used e.g. for VMs or other I/O-intensive tasks. Mirrors can resilver very fast, so using 2-way mirrors is usually sufficient, but for high-risk scenarios use 3-way mirrors.
For a pure data grave (file-/backupserver) where iops and overall throughput don't matter that much, use raidz vdevs with 5 or 7 devices and keep at least 2 hot-spares in the pool. I remember a recommendation of minimum N hot spares for N raidz vdevs (e.g. 5 spares for 5 raidz vdevs). I don't know how accurate this recommendation is nowadays though...

Erichans · Jan 4, 2022

This may be helpfull:

Choosing the right ZFS pool layout (August 30, 2021); by Klara Systems
Six Metrics for Measuring ZFS Pool Performance: Part 1 - Part 2 - pdf (2018-2020); by iX Systems

But, as mentioned, your use case may favor certain pool layouts, be it RAIDZ3 on the one end of the spectrum to lots of mirrors on the other end.
___
(P.S. If you need to enhance the speed (=IOPS) "beyond available hardware disk IO" of, especially, spinning platters(*), ZFS also offers the possibility of adding a SLOG or Separate intent LOG SSD; see for example 1 & 2.).
For accelerated heavily used read-only blocks, ZFS offers the option of a separate (fast) L2ARC device(**)
___
Edit: be sure to stay away from SMR drives, for example: WD Red SMR vs CMR Tested Avoid Red SMR

Edit: (*) spinning platters added. (**) L2ARC added: more info at sko's refs to ZFS ?

sko · Jan 4, 2022

In addition to Erichans links I can highly recommend both ZFS books by Michael W. Lucas and Allan Jude:
FreeBSD Mastery: ZFS
FreeBSD Mastery: Advanced ZFS

They basically cover everything you might ever need to know about ZFS as a sysadmin, and - as with all books by MWL - are really fun to read.

Andriy · Jan 5, 2022

mer said:
If the primary concern is data protection, mirror the 2 raidz3 vdevs.

I think that this is a quite bad advice.
How would you mirror raidz-s anyway?

gpw928 · Jan 5, 2022

I wanted to look at some comparisons so the performance penalty for RAIDZ can be better understood.

If your are looking for capacity and reliability, at low cost, RAIDZ a good choice.

It's also quite reasonably good in some performance aspects.

Below are the bonnie++ benchmark results for ZFS VDEVs:

a 2 x SSD mirror VDEV (250 GB Crucial/Micron -- CT250MX500SSD1)
a 1 x spindle disk VDEV (250 GB WD Velociraptor -- WDC WD2500HHTZ-04N21V0)
a 2 x spindle mirror VDEV (250 GB WD Velociraptors -- WDC WD2500HHTZ-04N21V0)
a 7-spindle RAIDZ2 VDEV (some 3TB WD30EFRX, and some Seagate ST3000NM0005)

The tests were the same in all cases (just the test directory and machine tag differed). The machine was otherwise idle, and the file systems had plenty of vacant capacity.

Code:

# bonnie++ -d<target_dir> -s 524288 -n1024 -m <test_name> -r 16384 -f


Version  1.98       ------Sequential Output------ --Sequential Input- --Random-
sherman             -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
2SSDMirror     512G            2.4g  91  1.6g  88            3.1g  99 +++++ +++
1Raptor        512G            2.3g  89  968m  53            2.7g  85 523.7  15
2RaptorMirror  512G            2.2g  84  884m  49            2.7g  86 950.3  28
7SpindleRaidZ2 512G            1.1g  42  647m  36            2.3g  75 755.6  21

Version  1.98       ------Sequential Create------ --------Random Create--------
sherman             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
2SSDMirror     1024 98814  96 166773 90   3409  4 100935 98  90319 99  1824   3
1Raptor        1024 99979  92 198779 99   3701  4 100828 92 134804 99  2555   4
2RaptorMirror  1024 98749  91 214291 99   3783  4  99158 91 171343 99  2567   4
7SpindleRaidZ2 1024 101957 95 211903 99   2213  2 103257 96 177184 99  2005   3

The random seeks for the SSDs were off the chart (completed in less than 500 ms). The latency for this SSD mirror test was 1126us, as opposed to 147ms for the Veloviraptor mirror.

For the RAIDZ2 VDEV, writing has a cost due to the parity overhead. Block Sequential Output really suffers, as does re-writing. On the upside, reading is acceptably fast, and random I/Os are better than a single disk. File operations (open, stat, delete) are reasonably fast. but there is a price to pay with deletions (writing parity required). I'm surprised that file creations ran as fast as they did, as they also require writing.

I'm surprised that the SSD mirror was somewhat slow at a lot of the file operations. Not sure why. This was run on the zroot, with a lot of free space.

Unfortunately I don't have the resources to provide benchmarks for striped mirrors, or mirror'd stripes, where we know the performance is to be had.

But at least we have a practial demonstration of some of the trade-offs for the reliability you get with RAIDZ2.

I can't remember how long it took to re-silver the last broken disk. Sufficient to say it was "over night" and the VDEV runs like a dog while it's happening. I would not use RAIDZ for serious disk bound applications.

I used to work with hybrid IBM RAID enclosures, where heavily used disk blocks were seamlessly re-mapped in the virtual address layer from spinning disks to SSDs. That was a nice feature.

sko · Jan 6, 2022

Andriy said:
I think that this is a quite bad advice.
How would you mirror raidz-s anyway?

You can't. That's why I said you should forget about classical RAID configurations when you use ZFS.

scilek · Jan 30, 2022

I know it has been some time since I started this thread, and now I must rekindle the flame.

I think you all know Linus, the famous Youtuber. After watching his latest video, I decided that striping might not be a good idea.

For those of you who are interested:

View: https://www.youtube.com/watch?v=Npu7jkJk5nM

gpw928 · Jan 30, 2022

They had petabyte ZFS storage arrays, but were not:

reading root's email; nor
running smartd; nor
scrubbing; nor
manually checking zpool status; nor
backing up their data.

If you think it's expensive to hire a professional to do the job, wait until you hire an amateur.
-- Red Adair

scilek · Jan 31, 2022

gpw928 said:
They had petabyte ZFS storage arrays, but were not:

reading root's email; nor

running smartd; nor

scrubbing; nor

manually checking zpool status; nor

backing up their data.

If you think it's expensive to hire a professional to do the job, wait until you hire an amateur.
-- Red Adair

Yes, indeed. But that also goes to show that ZFS is not %100 fail-proof and that one failed vdev will bring the entire pool down and that raidz3 is always better than raidz2. Maybe we should even have a raidz4, who knows?

chungy · Jan 31, 2022

scilek said:
goes to show that ZFS is not %100 fail-proof

Let it never be said that it is 100% fail-proof. If you have multiple hardware failures and absent administrators, you are going to have data loss sooner or later.

Even with competent administrators staying on top of things, one bad zfs destroy would ensure data loss without any hardware failure. This is another reason backups are important.

scilek · Jan 31, 2022

chungy said:
This is another reason backups are important.

I'll take heed.

gpw928 · Jan 31, 2022

scilek said:
ZFS is not %100 fail-proof

Correct.

scilek said:
one failed vdev will bring the entire pool down

Only if you design it that way.

scilek said:
raidz3 is always better than raidz2. Maybe we should even have a raidz4, who knows?

I just moved my tank from raidz1 to raidz2, so I completely understand the sentiment.

However, it comes down to a design issue. The competing issues are cost, capacity, performance, and time to recover from a failure.

Of course, if you don't have a viable backup, then no amount of raid is going to save you from serious finger trouble, errant software, or a sufficiently bad hardware problem (they admitted to frequent power failures, and intimated no UPS, so power spikes were probably in play.).

Erichans · Jan 31, 2022

gpw928 said:
[...]

scilek said:

one failed vdev will bring the entire pool down

Click to expand...

Only if you design it that way.

From a pool perspective, a VDEV is a single unit of storage. A storage pool does not survive when one (or more) VDEV in its pool is lost. Perhaps I'm overlooking your intention gpw928 but, how can you design against that?

scilek · Jan 31, 2022

Erichans said:
From a pool perspective a VDEV is a single unit of storage. A storage pool does not survive when one (or more) VDEV in its pool is lost. Perhaps I'm overlooking your intention gpw928 but, how can you design against that?

Right, you just can't. Having multiple pools with a single vdev instead of one pool with multiple vdevs sounds a bit safer to me. "Do not put all your eggs in one basket." is still good advice, despite fact that it sounds a bit of a cliché and putting all your eggs in one basket is very convenient.

gpw928 · Jan 31, 2022

Erichans said:
A storage pool does not survive when one (or more) VDEV in its pool is lost.

That's true. My assertion was sloppy. What I was trying to convey was that VDEVs can be designed to be extremely resilient -- so that they are extremely unlikely to fail and bring down the pool.

scilek · Jan 31, 2022

Is it reasonable to hope that someday we might have radiz4 or even raidz5 for giant vdevs?

Erichans · Jan 31, 2022

gpw928 said:
[...] What I was trying to convey was that VDEVs can be designed to be extremely resilient -- so that they are extremely unlikely to fail and bring down the pool.

Agreed. Your list is something that should be in the focal point of one's attention after considerations about hardware architecture, hardware components, pool layout design, ZFS installation & configuration and, last but not least, management considerations (see video mentioned earlier). For good measure: also before one begins.

sko · Jan 31, 2022

I only skipped over parts of that video posted above, but it seems as with many of this "influencers" videos/tips he once again used dangerously superficial knowledge and now blames the used systems for that. If you ignore all good practices about basic system administration and then also treat zfs like some old fashioned RAID, you are bound to fall hard on your face and you deserve it. If on top of that you don't have backups and rely on a single system for all your data, you won't get any sympathy from me and you shouldn't give "tech tips" to anyone...

Apart from the obvious basics (regular scrubs; hot spares and monitoring of disk health):
If you want a high resiliency and fast resilvering don't use raidz; use N-way mirrors and a decent number of hot spares.
raidz takes _VERY_ long to resilver while inducing a high load on _ALL_ providers in the vdev, increasing the possibility of more failures and also heavily decreasing operational performance of the pool.
A mirror can resilver much faster and even a dying disk (e.g. increasing number of bad blocks) can still provide decently to the rebuild and take some load off the other providers. Plus: with mirrors your pool is much more flexible as mirror vdevs can be removed, which makes it possible e.g. to migrate a pool to fewer but bigger disks.
raidz in comparison is much more inflexible and impact of different vdev performance is much higher, therefore a raiz-pool should always ONLY be extended with IDENTICALLY configured vdevs (e.g. same raidz-level and number/size of disks)! A disk upgrade in a pool consisting of raidz-vdevs therefore requires you to replace _ALL_ providers. Read performance of a raidz vdev is dictated by the slowest provider in that vdev as opposed for a mirror vdev where it's always the combined read performance of all providers. So if you have a dying disk that refuses to go dark, a raidz pool often comes to a crawl and needs manual intervention (good luck with that on a busy system!) where a mirror pool usually still performs halfway decent with the exception of write performance.

Using a single vdev for a pool is even worse and severely impacts the performance of the pool, especially when using a single raidz vdev (remember: raidz read performance = slowest provider!). There are always numerous complaints about bad ZFS performance in Free/TrueNAS-related forums (where superficial knowledge (or none at all) about ZFS seems to be the norm), which almost always can be traced back to using the wrong pool configuration for the job (i.e. raidz to cheap out on disks needed)

On smaller pools (<25 disks) you are almost always better off using mirrored vdevs instead of raidz; especially on the long run when it comes to upgrades and renovations.

ralphbsz · Jan 31, 2022

scilek said:
It reasonable to hope that someday we might have radiz4 or even raidz5 for giant vdevs?

That's a difficult and debatable question.

The reliability of modern disk drives is somewhere around a MTBF of a million hours. Vendors tend to claim 1.5 to 2 million, real-world measurements may be a little more pessimistic, but basically agree. So an individual disk drive is likely to break once every ~100 years. The uncorrectable error rate is estimated to be 1 in 10^17 bits (again, vendor specification are a bit optimistic, actual measurements a bit pessimistic). So on modern ~dozen TB disks, read errors are not unheard of, but also not common. You can do a non-mathematical analysis this way: The most likely way to lose data in a redundant array of disks is that one disk completely fails (if you have N disks in the array, that happens about every 100/N years), and while trying to recover from that by reading the redundant (parity) information, that parity copy gets a read error (happens roughly once per dozens of TB). But with a second redundancy copy, you can then survive that double fault, and a triple fault is astronomically unlikely (roughly 10^-17).

For accurate predictions, you need to do the math, which is quite difficult; it requires solving Markov models of disk and sector failures. Their result tends to agree with the above logic: Two-fault tolerant RAID is sufficient, even for very wide RAID groups (dozens of disks, up to ~100 disks), assuming that faults of all disks are uncorrelated.

RAID groups that are much wider than ~100 disks are today still unrealistic; the way the IOs are done, it leads to very inefficient distributed computation and requires lots of IO. Consider the following: If you have a fault that put the RAID code into rebuild mode, then for every read (meaning every step of resilvering or rebuilding) you'll have to read *ALL* other disks. Suddenly, your IO workload increases by a factor of 100. Now consider a large system, for example one that has 100K disks, organized as a thousand RAID groups each of 100 disks. Systems of this size are not uncommon today. In that system, pretty much all the time some part is in rebuild mode, meaning throughput drops by a factor of 100. Most customers will not tolerate a system that runs 100x slower much of the time.

So when using traditional parity-based RAID codes (such as the ones ZFS RAIDZ-n uses), group width larger than dozens are implausible. The research literature and patents describe potential solutions, but I don't know of any open source file and storage systems that use those.

Jose · Jan 31, 2022

scilek said:
I think you all know Linus, the famous Youtuber. After watching his latest video, I decided that striping might not be a good idea.

I've never understood the appeal. Superficial knowledge with with a heavy dose of cutesy facial expressions. I figure it's for children.

scilek · Jan 31, 2022

Jose said:
I've never understood the appeal. Superficial knowledge with with a heavy dose of cutesy facial expressions. I figure it's for children.

I believe it will be for the benefit of all of us to keep our discussions purely technical. You can never guess what will happen if emotions start to get involved.

"Professionalism is the opposite of sentimentalism." I say.