Best ZFS configuration for large server

Greetings. I looked around on the internet and this forum; I found similar use cases, but not quite the same as mine (ref: a similar question about a desktop system).

I have a server with [up to] 12 large LFF HDDs. I also have a few (3) 2.5" SSDs in/for it (one currently in an adapter in a 3.5" bay, the other two in 2.5" NVMe SSD slots).
I'd started with a single 800GB SSD for the OS, then large HDDs for a ZFS raidz, modeling what I have on an older system. But I'm certain there are better ways. The other two SSD slots can be whatever; that's a lot of what I'm trying to work out.

I asked folks about this a while ago, while still spec'ing the server (which I've been accruing over time). For my use case (mostly reads, mostly light load), I understood that a number of mirrors in a pool would perform better than a raidz, and given how much disks have grown since I built my last system, I can maintain the same (or larger) pool size. Is that still the recommendation for fast random access?

Then, there are SSDs. There's cache, there's L2ARC, I only partially understand all of that. Also the question, should I put root on a ZFS SSD mirror, or just on a single SSD? I don't worry as much about SSDs failing as HDDs, but that may be foolish of me. :-)

If you have any questions, let me know, and I'm happy to hear pros/cons for my options. Thanks!
 
Let's look at performance and durability separately. And I will assume that you have infinite money to buy extra disks.

For durability, you want redundancy. For large and slow disks, you want double-fault tolerance, since the probability of hitting a second fault while the first one is being rebuilt is pretty large these days. On the other hand, for small and fast disks (SSDs), and for not-so-important data, single-fault tolerance might be OK. In either case, a crucial factor is availability of spares: if a disk fails and you don't notice for weeks, or it takes you weeks to obtain a spare to restore the redundancy, your real durability is much lower, so you want the rebuild to start as soon as possible after a failure. The OS is actually the least important part, since you can "trivially" recreate that by reinstalling (trivially here means: will probably take hours if you are really well organized, days otherwise). So here would be my suggestion for the SSDs: use two SSDs in a mirror for the OS itself. Then use the 3rd one for something like L2ARC or unimportant data (temp caches or such). Should one of the primary SSDs fail, blow the 3rd SSD away and use it as an instantaneous spare for the OS.
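
To make the "instantaneous spare" step concrete, here is a minimal sketch, assuming a root pool called zroot, a data pool called tank, and FreeBSD device names ada1 (the failed OS SSD) and ada2 (the scratch/L2ARC SSD); all names are placeholders, and for a boot pool you would also have to redo the partitioning and boot code, which is omitted here:

  # free the 3rd SSD from its current duty (e.g. cache/L2ARC in the data pool) and wipe it
  zpool remove tank ada2
  zpool labelclear -f /dev/ada2

  # swap it in for the failed half of the OS mirror, then watch the resilver
  zpool replace zroot ada1 ada2
  zpool status zroot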

So far, so good. The part of the OS disk that is very hard to recreate after a failure is your customized setup. So I would use some space on the data disks for regular backups (hourly?) of /etc, /usr/local/etc, and similar directories that contain things that are not "vital", but that would be very painful to lose.
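
A minimal sketch of that kind of config backup, assuming the data pool is mounted at /tank and using a plain tar job from cron (names and paths are only examples):

  # /etc/crontab entry: hourly tarball of the easily-forgotten configuration
  0  *  *  *  *  root  tar -czf /tank/backup/config-$(date +\%Y\%m\%d\%H\%M).tar.gz /etc /usr/local/etc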

Now, for the real data, it depends crucially on the workload. For sequential reads and writes, hard disks are still unbeatable in terms of MB/s per dollar. So if your workload is something like video streaming or very large files being written, put all your data on hard disks. On the other hand, SSDs are much better for small random IOs. But they are expensive. If you can segregate your data and find a small part of it that gets most of the random reads and writes, move that part to SSD (probably co-located with the OS disks). The danger of having lots of small writes: you will wear the SSDs out. But honestly, most amateur workloads are not capable of wearing out SSDs if you buy quality (cloud servers and HPC are a different story).

That leaves us with probably putting most of the data on hard disks. Here you need to find the balance of capacity and performance, in terms of IO/s (IOs per second, mostly counting random IOs, mostly random reads). Each disk gives you about 10-20 TB of capacity and about 100 IO/s. It also gives you 150-200 MB/s, but most people can't use that streaming bandwidth anyway, so let's focus on IO/s and capacity. How much of each do you need? Buy that many disks, plus a correction factor for those nasty little redundancy overheads.
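
As a back-of-the-envelope example using those numbers: 12 disks of 12 TB each give you roughly 144 TB raw and about 1200 random IO/s in aggregate; how much of that raw capacity you actually get to use depends on the redundancy scheme discussed below, anywhere from 10 disks' worth down to 4.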

If you have mostly reads, then RAID-Zx is fast. Since you want double-fault tolerance, I would go for RAID-Z2. The question of whether you want two RAID-Z2 groups of 6 or one group of 12 needs to be answered by benchmarks. The former would give you 8 disks' worth of capacity, the latter 10 disks' worth (assuming you populate all 12 bays). ZFS has a reputation for performance degrading with larger RAID groups, but I don't know whether 12 is already on the "larger" side. If your workload has parallelism (most do), you will keep all your disks busy. So here is the first place where your cost/capacity/performance tradeoff becomes really hard, and probably requires benchmarking.
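
If you want to benchmark both layouts, the two candidate pools would be created roughly like this (the pool name tank and the da0-da11 device names are placeholders):

  # two RAID-Z2 groups of 6 -> 8 disks' worth of capacity
  zpool create tank raidz2 da0 da1 da2 da3 da4 da5 raidz2 da6 da7 da8 da9 da10 da11

  # or: one RAID-Z2 group of 12 -> 10 disks' worth of capacity
  zpool create tank raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11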

If you have lots of writes, then it gets ugly. Large writes are fast in RAID-Z2, so if they dominate your workload, RAID-Z2's capacity advantage dominates the decision. Any parity-based RAID (like RAID-Zx) has a small-write penalty, so mirroring is theoretically faster if you have lots of small writes. For ZFS, that difference is smaller than with traditional RAID, since it handles most small writes by appending to a log, turning them into larger (not large!) writes. To keep your double-fault tolerance with mirroring, though, you'd have to go to 3-way mirrors, so with 12 disks you only get 4 disks' worth of capacity. But boy will it be fast. So how many writes do you have, what fraction of them are small, how much capacity do you need, and how much money do you have? That is a super tough tradeoff.
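
For comparison, that 3-way-mirror layout would be four mirror vdevs striped together, something like this (again, all names are placeholders):

  # four 3-way mirrors -> double-fault tolerant, but only 4 disks' worth of capacity
  zpool create tank mirror da0 da1 da2 mirror da3 da4 da5 mirror da6 da7 da8 mirror da9 da10 da11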

But then, you say "mostly light load". In that case I would start with 3 data disks and go for mirroring. If that totally satisfies your needs, the whole complicated question above doesn't need to be answered.
 
… [up to] 12 LFF large HDDs. … (3) of 2.5" SSDs … mostly reads, mostly light load, …

Whatever your chosen arrangement for the slow devices (HDDs): adding a fast device to the pool as L2ARC will probably benefit the read use case.

How much memory can you give to the server?
 
With regard to workload, it will be mostly large sequential files. Video streaming and backups will, I think, be the majority of the action; both involve large files and principally sequential access.

Reading the rest of ralphbsz's post, these are the same questions I have. I will say that of the data on the large filesystem(s), some is not important (videos) but some may be very important (backups). So it's tough. I've been thinking that moderate resiliency, single redundancy, will be enough. Write performance is important, but I think read is more important. The "streaming" (is it streaming if it's NFS access? close enough...) needs to be fast, i.e. responsive, for a good experience. Writing of backups, however, also needs to be _reasonably_ fast. It can be moderate, but a 2-hour backup vs a 90-minute backup is a notable difference.

I will need to think about my storage to see if there are things other than the primary large data files that I am storing. The idea of co-locating smaller things with random io onto the faster OS disks is interesting.

My original plan had been 3 raidz1s, 4 disks each. Then, as noted, I got the impression/suggestion that a set of (up to 6) two-way mirrors would be better. But it sounds like that might not be the case. For streaming bandwidth, it will all be local gigabit ethernet: 10GbE in the core systems/networks, 1GbE to the endpoints. So the disks will always be the slow part, I think.

And the other key point that Cath O'Deray brings up is memory. I'm 90% sure this system has 256GB in it, so plenty. (It started with 128GB, but I think I up'd it to 256. More is supported, but time and money, etc.). And if 256GB is not "gobs of memory" like I think it is, well, then I'm just old. :-)

So tens of TB (50-80) of spinning disks and 2-3 SSDs (<1TB, up to maybe 2.8TB? Bigger is an option, but I'm not sure it would help).

Sounds like the OS on mirrored SSD is a/the recommendation. I was wondering if the mirroring was even necessary, but again most of my planning experience is based on magnetic media 20-30 years ago, where failure is certain. ;-) Then I need to learn how best to use/configure caching. And a log device? Is offload of ZIL onto dedicated devices still a recommended thing? Will an L2ARC help me a lot, or is 200GB of RAM enough that L1ARC is fine?

Thanks all.
 
ralphbsz always has some good insights. I have no experience with the scale you are talking about, but from everything I've seen, "what is your workload" (mostly reads, mostly writes, a mix) makes a huge difference. The following are simply my opinions based on reading lots of things and general investigation.
Obviously the OS on separate devices is good; a mirror can help in terms of reliability if the system has problems and you need to recover quickly.
ZFS likes RAM, especially if the workload is primarily reads (ARC etc.).
The ZIL is related more to writes; lots of people think a separate device is good, especially for reliability.
I think L2ARC is useful if your read load changes but falls into a pattern. Reads wind up in the ARC, then fall out of the ARC into the L2ARC, so the workload could wind up in a relatively steady state with data in ARC and L2ARC. zfs-stats can help figure some of that out.
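
A sketch of how you might look at that on FreeBSD, assuming the sysutils/zfs-stats package is installed (flags from memory, so double-check against the man page):

  pkg install zfs-stats

  # ARC size/summary and hit/miss efficiency; -L shows L2ARC statistics once a cache device exists
  zfs-stats -A -E
  zfs-stats -L

  # the raw counters behind it
  sysctl kstat.zfs.misc.arcstats | grep -E 'hits|misses'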
 
should I put root on a ZFS SSD mirror, or just on a single SSD?
Sounds like the OS on mirrored SSD is a/the recommendation. I was wondering if the mirroring was even necessary, [...]
In a big storage system, an OS separate from all data makes a lot of sense. If you choose ZFS for it (the obvious choice), you should not choose a non-redundant pool. Let me emphasize that ZFS takes exceptionally good care of your data: it will not serve you data that is incorrect, and you will be notified when incorrect data is read; similarly for data written to disk. However, ZFS can only correct those errors when there is redundancy. When there's no redundancy left in the pool, any extra error beyond that can mean you lose the pool: all data has to be restored from backup! (There's no fsck(8) equivalent for ZFS.) As suggested, (a minimum of) RAID-Z2 is a valid starting point.

Whether an L2ARC will be useful depends on the load on the ARC (I think you are referring to the ARC when you mention L1ARC; note that the article referenced below explains that an L2ARC is not like the ARC at all). You have a lot of RAM (= space for the ARC); however, I can't predict your reads and how "cacheable" they will turn out to be. Because you have such a relatively "slow" pool of spinning rust, for any caching/buffering that is needed beyond what your ARC can muster, you have a real potential of gaining overall performance by adding, for instance, an L2ARC; again, only when your ARC isn't sufficient. Have a look at OpenZFS: All about the cache vdev or L2ARC by Jim Salter.
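
If you do want to experiment, attaching an L2ARC is cheap and completely reversible; something like the following, with tank and nvd1 as placeholder names:

  # add one SSD/NVMe device as a cache (L2ARC) vdev; cache devices need no redundancy,
  # losing one only costs you the cached copies
  zpool add tank cache nvd1

  # and it can be removed again at any time
  zpool remove tank nvd1
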
And a log device? Is offload of ZIL onto dedicated devices still a recommended thing?
Have a look at What Makes a Good Time to Use OpenZFS Slog and When Should You Avoid It by Dru Lavigne. As mer mentions, a SLOG* is related to writes, specifically synchronous writes. I'm not aware that the addition of a SLOG as such increases reliability though.

Be aware that when you add a VDEV (such as a SLOG) that holds write-sensitive data, that VDEV and its data are an integral part of the pool to which they belong. If such a SLOG fails at the wrong moment (together with a crash or power loss), the synchronous writes it held are lost, and on very old pool versions a failed log device could even prevent the pool from importing; therefore it needs to have redundancy, and a mirror is often deployed.
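
Should you end up adding a SLOG, the mirrored variant looks something like this (tank, nvd1 and nvd2 are placeholders; small devices with power-loss protection are the usual choice):

  # add a mirrored log (SLOG) vdev for synchronous writes
  zpool add tank log mirror nvd1 nvd2
  zpool status tank

  # it can be removed later if it turns out not to help
  # (use the vdev name shown by zpool status, e.g. mirror-2)
  zpool remove tank mirror-2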

___
* edit: as is the ZIL.
 
I'm not aware that the addition of a SLOG as such increases reliability though.
Perhaps "recoverability" instead of reliability? As in we have stuff in SLOG/ZIL, system crash, reboot, "oh look I have stuff there let me replay it"?
 
One item not addressed is data corruption or loss.
You can have the most solid and redundant system with multiple parity, ad nauseam, but it can still be wrecked by malware or ransomware.

Another cause of data loss is the unnoticed deletion of a directory.
I lost nearly all of my 2019 photographic work because the parent directory got deleted.
How is still a mystery to me.

I was able to recover 90% from an "I got Lucky" offline backup to a USB disk.
This was pure luck on my part.

If you have decades of terabytes of data, separate offline backup becomes a huge and unwieldy proposition.
 
Perhaps "recoverability" instead of reliability?
Evaluation of storage systems happens along multiple metrics. One metric is: "how often is the storage system capable of performing reads and writes?" That is usually called "availability". Another metric is "what fraction of the data is lost?" That is usually called "durability". But as stated, those two metrics are too black and white; in the real world, the system doesn't just have to be able to perform reads and writes, it also has to do so at a sensible or reasonable speed, and reads of previously written stuff (both file content and file metadata) have to succeed. One way to express that is called "performability", and a typical performability SLA might be "90% of the time the system can do 1GB/s reads and 100MB/s writes, 9% of the time it can do at least 1/10th of that, and it will be unavailable or slower than that no more than 1% of the time". Similarly, durability has to be turned into a quantifiable SLO, for example "the average loss rate per byte of user data in files is 10^-11 per year, the rate of file metadata loss that makes access to a file impossible is 10^-8 per year, and the rate of corruption of bytes in files is less than 10^-13 per year".

In real world systems that use rebuild (resilvering, recovery, ...) the durability and availability metrics get mixed together. For example: data is readable with no more than 10s latency per megabyte read with 6 nines of availability. In case of disk failure within the fault tolerance of the encoding, it may take up to 5 minutes to perform a read, if the read requires data to be rebuilt from other disks first. And in case of disk failure that exceeds the online fault tolerance, it may take up to 3 days to read data from backup tapes that have been stored in a deep mine far away.

So what people call "reliability" is a complex mixture of objectives and metrics.

One item not addressed is data corruption or loss.

Old joke: The best way to administer a computer is to hire a man and a dog. The man is there to feed the dog; the dog is there to bite the man if he tries to touch the computer.

Seriously: The vast majority of all data loss is not caused by hardware (or systems or network or ... failure), but by human error. The single largest cause of data loss is the rm command (and other user commands that work the same way), followed by operator error in managing the storage systems, followed by software bugs in implementing the storage system. Disk failures are actually a minor problem, although we focus a lot of attention on them.

From this, we conclude that the single most important thing in a storage system is good backups. And when I say "good", I mean backups that are still useful if someone does "rm *" in the original (live) copy. This immediately implies that the backup system must be independent in failure modes from the live copy. So for example, if backup detects that a file has been deleted in the live copy, it must not delete the backup, but instead perhaps mark it as "archive only do not restore unless specially requested". And at least one backup should not share the same disks as the live copy, in case someone does "fdisk /dev/adaX" by mistake. And so on.
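
On ZFS specifically, snapshots plus replication to a pool that shares no hardware with the live copy give you much of that independence, since a snapshot taken before an accidental "rm *" still holds the files. A minimal sketch, with tank, backuppool and the snapshot names as placeholders:

  # snapshot the live data, then replicate to an independent pool (other disks, other box)
  zfs snapshot -r tank/data@monday
  zfs send -R tank/data@monday | zfs receive -uF backuppool/data

  # later, incremental follow-ups only ship the changes
  zfs snapshot -r tank/data@tuesday
  zfs send -R -i tank/data@monday tank/data@tuesday | zfs receive -uF backuppool/data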

If you have decades of terabytes of data, separate offline backup becomes a huge and unwieldy proposition.
Actually, with today's relatively low cost of storage (a 10TB disk is a few hundred $) and ubiquitous availability of good networking and cloud services, it is "mostly" a software problem. A terribly difficult one to be honest.
 
Thanks. That gives me a much better understanding of ARC and cache (aka L2ARC, but not an ARC. :-/ ). And, reading that klara article about slog, I don't know that I need it. I mean, it might help, but I won't [frequently] have a large write load, so I think the performance cost of the ZIL on the spinning media is acceptable. Hmm. Unless multiple backups are happening and someone is trying to read a datafile (media) at the same time. Okay, so maybe it would be of some value for those times, just not so much others.

It doesn't sound like my use case cries out for either a SLOG or an L2 cache, and whether they'd help in some cases isn't clear. It may just be that with enough RAM, an L2ARC won't be needed, unless I have data that is accessed repeatedly, which may happen but won't be the average case for a storage array. Bleah. This has gotten much more into "I still don't know" territory. I feel like I know enough to argue both for and against either log or cache vdevs for my pool. :-(

"If it were easy, everyone would be doing it."
 
Thanks. That gives me a much better understanding of ARC and cache (aka L2ARC, but not an ARC. :-/ ). And, reading that klara article about slog, I don't know that I need it. I mean, it might help, but I won't [frequently] have a large write load, so I think the performance cost of the ZIL on the spinning media is acceptable. Hmm. Unless multiple backups are happening and someone is trying to read a datafile (media) at the same time. Okay, so maybe it would be of some value for those times, just not so much others.
Agree. The L2ARC is really worth trying, though. Personally, I have one desktop machine with a mirror of spinning disks. Adding a relatively small (256GB) SSD as an L2ARC drive made a big difference. On a server it may not be so obvious, because there may be just a few memory-resident applications constantly running.
 
I'm not aware that the addition of a SLOG as such increases reliability though.
Perhaps "recoverability" instead of reliability? As in we have stuff in SLOG/ZIL, system crash, reboot, "oh look I have stuff there let me replay it"?
The second sentence I can agree with wholeheartedly.

As to the first sentence, without a reference or a more detailed explanation, it's difficult to see what this might be about. As it is, I see no relation between adding a SLOG and reliability or recoverability as such.

Let me finish with a quote from ZIL Performance: How I Doubled Sync Write Speed, by Prakash Surya (my italicized addition):
SLOG stands for Separate LOG device. The ZIL and SLOG are different in that the ZIL is a mechanism for issuing [synchronous] writes to disk and the SLOG may be the disk that the writes are issued to.
 
Thanks Erichans. Like almost everything in performance tuning, it depends on the specifics. Adding a SLOG under a mostly-read condition does very little if anything, but adding an L2ARC may do a lot.
It takes a lot of effort to actually test correctly, plus a lot more to actually tweak correctly.
Avoiding the urge to "change 12 things at the same time" is likely the most difficult part.

As for the "recoverability instead of reliability" statement, that was me thinking out loud about your original sentence, basically agreeing that a SLOG may not increase reliability (which I define roughly as uptime and not losing data).
Lots of general knowledge floating around these forums, that may or may not apply to "one's specifics".
 
Some good advice above. A few additional thoughts...

I keep multiple pieces of spare media (SSD and HDD) on site for my ZFS server. Any media failure can always be addressed immediately.

Some hot-swap capacity for spinning disks is highly desirable. I have a 3-bay hot-swap cage for 3.5" SATA disks. It's normally empty. It occupies the space normally required to accommodate two 5.25" optical drives.

I use hot swap bays routinely to hold my 12TB backup disks, which are rotated off-site.

When a disk needs replacing, I just slip the replacement into a vacant hot-swap bay and resilver without disturbing the cables or anything inside the case (which could easily trigger a fault in more disks). When the resilver is complete, I remove the dead spindle at my leisure and relocate the replacement to its permanent bay. An additional advantage of this approach is that if you accidentally pull the wrong disk during replacement (after resilvering), your pool is still recoverable.
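
In ZFS terms that procedure is just a replace pointed at the temporary bay, roughly like this (pool and device names are placeholders):

  # the failing disk is da5, the replacement sits in the hot-swap bay as da12
  zpool replace tank da5 da12

  # watch the resilver; only pull the dead spindle once it has finished
  zpool status -v tank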

I use enterprise-class SSDs which have power-loss protection. I use a UPS. I use a quality case which has fans blowing directly (at an identical angle) onto each spinning disk. The hot-swap cage has its own fan.
 