ZFS Best practice for 24TB drive RAIDZ3 pool

I have a new storage server for backups that has 24 x 24TB drives. I plan to set up a RAIDZ3 pool. To maximize capacity, I would like to set up the zpool with a single vdev. With regard to resilver times, would best practice be to set up the zpool with two 12-drive vdevs instead?
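
For concreteness, the two layouts I'm weighing would be created roughly like this (the pool name "backup" and the device names da0-da23 are just placeholders):

  # Option A: one 24-wide RAIDZ3 vdev
  zpool create backup raidz3 \
      da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11 \
      da12 da13 da14 da15 da16 da17 da18 da19 da20 da21 da22 da23

  # Option B: two 12-wide RAIDZ3 vdevs in one pool
  zpool create backup \
      raidz3 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11 \
      raidz3 da12 da13 da14 da15 da16 da17 da18 da19 da20 da21 da22 da23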

Thanks,
 
Best practice for me (based on my experience) would be:
  • vdev configuration reflects the value of the data (e.g. mirrored vdevs for high-value data). Note that datasets also allow multiple copies, via the copies property.
  • Never spread a vdev across multiple enclosures and/or controllers. Keep it simple.
  • One hot spare for each vdev (e.g. 4 vdevs means 4 hot spares; exception: mirrored vdevs). See the sketch just below this list.
  • Keep the resilvering time below 48 hours (I prefer less than 24 hours) for a single vdev.
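
A minimal sketch of the hot-spare point, assuming the two-vdev layout from the first post and extra drives beyond the 24 data drives (da24/da25 are hypothetical names; automatic spare activation relies on zfsd(8) on FreeBSD):

  # one hot spare per RAIDZ3 vdev, added to the existing pool
  zpool add backup spare da24 da25
  # if a spare is not pulled in automatically, attach it by hand:
  zpool replace backup da5 da24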
 
Please take into consideration that I'm not a storage specialist.

To maximize capacity, I would like to set up the zpool with a single vdev.
That is a lot of storage; this strikes me as somewhat overly wide, even for a backup server. Please take into account that with all drives of the same age, from the same make, type and batch, the chance of multiple failures at once, especially while a disk-intensive resilver is in progress, is not completely unthinkable. For these sizes I hope you are considering SAS drives (not SATA), as you should easily be able to leverage the higher throughput; IIRC the price difference is hardly significant and they may have better warranties, but please verify. In the end it is your evaluation and decision.

Resilvering times are dependent on a lot of factors. For example: normal pool IO load during resilvering; I imagine this will not be an important factor for a backup server.

ZFS resilvering takes into account only blocks that represent used data, i.e. used disk space (not total disk space). This differs from most, if not all, traditional (hardware) RAID systems. So expect that to be a large determining factor.
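
A quick way to see how much data a resilver would actually have to reconstruct is to compare allocated space against pool size (the pool name is a placeholder; the properties shown are standard zpool properties):

  zpool list -o name,size,allocated,capacity backup
  # resilver work scales with the allocated share on the failed drive,
  # not with the drive's raw size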
 
Very wide RAID groups have plusses and minuses. As cracauer already said, the upper limit of the time for one resilvering is set by the time required to read/write one whole drive's worth of data (by "resilver" I mean that a new blank drive has been added to the group to replace the failed drive, and the redundancy is restored by making new copies onto the blank drive). With a declustered RAID (as is used by ZFS), you get more parallelism in a wider RAID group (as more disks can participate at once), but the probability of a failure also increases.

There are (at least) three different metrics for this: (a) The actual average time required for one resilvering (here I mean the wall-clock period, how long it takes). (b) The fraction of the time that the array is in degraded mode during a resilvering operation (here I mean how often it is busy). (c) The probability that during the degraded operation another fault happens, which brings the array close to data loss, or data loss occurs.
How any of these metrics scales with the array size depends crucially on how ZFS implements things. Erichans already touched on that with his observation that ZFS doesn't even waste time on repairing things that aren't needed. There are many other large factors, for example whether resilver operations are done in large sequential blocks or with small random IO, whether newly restored blocks can be written to spare space on any disk (avoiding the target disk becoming the bottleneck), and whether the IO scheduler is aware of the relative priorities of user workload versus resilvering versus scrubbing. It's a very complicated engineering tradeoff.

Personally, I would go for the simplest possible thing, and not worry too much about resilvering times. Data loss is more likely to be caused by human error, so making things more complex is usually a bad tradeoff.
 
24 x 24TB: that's a lot of "Finding Nemo" or whatever young kids are watching now.

Would the access pattern affect any decisions? Is the end result read mostly, write mostly or roughly balanced between read and write?
I have zero experience with datasets that big, but in general the access pattern can matter.
 
Others have touched on it, but it's worth remembering that your production workload can be profoundly impacted by a resilvering operation. If production matters, you need to understand just how bad things can get, and for how long.

Also, consider the single points of failure, because a bad one can kneecap your system.

RAID-Z is a lot more vulnerable to controller and cable failures than (striped) mirrors, because you can put each side of a mirror on a different controller,
potentially providing redundancy for disk power supplies, controllers, data cables, power cables, and spindles.
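
As a rough sketch of that idea (hypothetical cabling: da0-da11 on one HBA, da12-da23 on the other), a striped-mirror pool with each half of every mirror on a different controller could be built like this:

  zpool create backup \
      mirror da0 da12  mirror da1 da13  mirror da2  da14  mirror da3  da15 \
      mirror da4 da16  mirror da5 da17  mirror da6  da18  mirror da7  da19 \
      mirror da8 da20  mirror da9 da21  mirror da10 da22  mirror da11 da23

That layout can survive the loss of an entire HBA, at the cost of half the raw capacity.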

With 24 spindles in use, I would source them from multiple batches, and hold several spares in-house (probably some hot).

I have seen RAID sets as large as yours fail. The most common cause was operational error, usually inexperienced staff pulling the wrong disk. Those RAID sets generally had backups, which is why it's worth asking if you plan to back up the backups.
 
I just found this by Googling, but Klara Systems is a reliable source:

For any given number of disks, narrower stripes and higher vdev count will outperform wider stripes and lower vdev count for the vast majority of workloads. Yes, this is even noticeable on all-SSD pools!
 
"Narrower stripes ... will outperform". Yes, but is this a good tradeoff?

The cost of narrower stripes is greatly reduced capacity. Example: the OP has 24 disks. If they format those as RAID-Z3 (can tolerate three faults), they will have 21 disks' worth of capacity. If they instead split it into 4 vdevs and stripe those, each vdev will have 6 disks, meaning 3 disks' worth of capacity, for a total of 12 disks' worth of capacity. And 12 < 21, by a BIG margin. They could instead make two vdevs of 12 disks each, format those using RAID-Z2 (can tolerate 2 faults and has 10 disks' worth of capacity), and then mirror the two vdevs (now it can tolerate 3 faults), and the net result would be 10 disks' worth of capacity. Sure, the last option would have better performance ... but at a cost of a factor of 2 in capacity.
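
Putting rough numbers on that with the OP's 24 TB drives (raw capacity in TB, ignoring ZFS metadata and slop space):

  echo $(( (24 - 3) * 24 ))      # one 24-wide RAIDZ3            -> 504 TB
  echo $(( 4 * (6 - 3) * 24 ))   # four 6-wide RAIDZ3 vdevs      -> 288 TB
  echo $(( (12 - 2) * 24 ))      # mirrored pair of 12-wide Z2   -> 240 TB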

Today, many storage workloads are not actually performance limited, but capacity (or cost) limited.

EDIT: See Bob's post below, I counted failures wrong.
 
They could instead make two vdevs of 12 disks each, format those using RAID-Z2 (can tolerate 2 faults and has 10 disks' worth of capacity), and then mirror the two vdevs (now it can tolerate 3 faults),
It would tolerate between 5 and 14 failures, depending on where they fell. This would be substantially more reliable than a (hypothetical) raid-z5, because 6 failures would need to be split 3+3 to cause data loss.
 
I have seen RAID sets as large as yours fail. The most common cause was operational error, usually inexperienced staff pulling the wrong disk. Those RAID sets generally had backups, which is why it's worth asking if you plan to back up the backups.
With all those possible human errors mentioned, and especially if there's not another layer of backups, IMO I don't see anything less than RAIDZ3, or an equivalent that can tolerate 3 random drive failures, as a solid solution*. With less overall redundancy, the scenario looks like this: 1 drive fails and (the human factor) 1 wrong drive gets accidentally pulled, either from a live system, or offline and detected too late. Then you'd most likely have to resilver without any redundancy left; somebody is bound to get nervous (I would).

P.S. The whole assessment changes completely when resilvering has to be done under load, that is, when anything other than resilvering is acting on the pool; I've seen it mentioned with smaller drives out in the field: weeks of resilvering ...
Also: don't underestimate the load you get from regular scrubs (note: the maximum load is limited and can be changed); apart from the massive data movement to the new drive 'under construction' during a resilver, a scrub has to verify the same checksums of all ZFS blocks in use and make all the same seeks over all disks.

___
* see also: The need for triple-parity RAID by Adam Leventhal, 21 December 2009; and the linked Triple-Parity RAID and Beyond
 
With all those possible human errors mentioned, and especially if there's not another layer of backups, IMO I don't see anything less than RAIDZ3, or an equivalent that can tolerate 3 random drive failures, as a solid solution*.
Some of the big storage companies are starting to use extremely wide RAID groups; not the 24 the OP is contemplating, but 100 to 300 disks wide. Those are typically configured to have roughly a dozen (give or take a factor of 2) redundancy disks. So in ZFS terms you can think of that as a vdev built from 200 physical disks, and that uses RAID-Z12 encoding. Obviously, the real implementation of these things is much more complex than ZFS.
 
Never go over 12-wide because large vdevs have very long resilvering time
That data is very interesting. But some information is missing, and it is not clear how it extrapolates to other systems.

To begin with, many resilver times are reasonable. Most modern drives can read or write sustained at 100-200 Mbyte/s when doing large sequential reads and writes. The drives used in that test are 1 TB drives and old, so they're probably at the lower end of that spectrum. The bottleneck in a simple (single failed drive) resilvering should be writing to the new (target) drive, which can be done continuously and mostly sequentially. Therefore they should take about 5,000 to 10,000 seconds to do a complete read or write, which is about 80 to 170 minutes. Given that the file system is only 25% or 50% full, it should take a quarter or half of that time, which brings the range to 20-42 minutes for 25% full and 42-83 minutes for 50% full. And most of the times reported fit nicely in that range, at the slower end.
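
For anyone who wants to redo that arithmetic, the bound is just (drive capacity x fill fraction) / sustained write rate; a throwaway shell loop with the post's numbers (1 TB drives, 100-200 MB/s):

  capacity_mb=1000000                       # 1 TB drive, in MB
  for rate in 100 200; do                   # sustained MB/s
      for fill in 25 50 100; do             # percent of pool in use
          secs=$(( capacity_mb * fill / 100 / rate ))
          echo "fill=${fill}%  rate=${rate} MB/s  ->  $(( secs / 60 )) min"
      done
  done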

But for the high-redundancy systems (RAID-Z2 and -Z3) with large vdevs, the resilver times get much worse, reaching as much as 252 minutes for something that should take at most 83. Where is the bottleneck? Is ZFS not using enough threads (or generating enough IO) to keep the source disks busy? Is the CPU out of horsepower to do checksum and "parity" (encoding) calculations? Is it a design flaw in ZFS that prevents it from exploiting the parallelism of the source drives? How would this scale to modern disks (similar sequential bandwidth and IOPS, but much larger capacity) and modern CPUs (way more cores for parallelism, and more integer speed)?
 
Where are you going to back up these 24 x 24TB drives?
Ignore this, I didn't read that this is going to be the backup server :)

According to my link above, DRAID "...was designed for systems with 60 or more drives..."
 