ZFS: Accelerating a RAID-Z with one SSD

I have a five-drive RAID-Z that works great but is a little slow on heavily random-read workloads. The server has space for one 2.5" SSD and I'd like to get one to improve performance without hurting the integrity of my data. I am considering a few options:

- Use SSD for L2ARC: Good idea, this is very safe
- Also use part of SSD as ZFS log device: Slightly unsafe but may increase performance more
- Also use part of SSD to store metadata: Very unsafe, if SSD dies, all data is toast
- Move OS to SSD, use hard drives for user data only: Doesn't hurt data integrity but also doesn't help performance
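In zpool terms, I believe options 1-3 would look roughly like this (the pool name "tank" and the GPT labels are made up, just to be concrete):

    zpool add tank cache gpt/ssd-l2arc     # option 1: L2ARC
    zpool add tank log gpt/ssd-slog        # option 2: separate log (SLOG)
    zpool add tank special gpt/ssd-meta    # option 3: metadata on a special vdev (newer OpenZFS only)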

Does this all sound correct to you? Are there any other configurations I should consider?
 
How much RAM does the system currently have?
Simple rule of thumb for ZFS: If you want more cache performance, add more RAM!

The L2ARC only holds what fell off the ARC (=RAM), so there should be a reasonably sized ARC to begin with. Also, the L2ARC needs RAM (~25MB RAM for 1GB of L2ARC) to hold the metadata. So if the system is already rather low on RAM, adding an L2ARC can even hurt performance instead of improving it.
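Before adding anything, it's worth looking at how the ARC is doing today. On FreeBSD the raw counters are plain sysctls (a quick sketch; exact OID names can differ between versions):

    sysctl vfs.zfs.arc_max                  # configured ARC ceiling
    sysctl kstat.zfs.misc.arcstats.size     # current ARC size in bytes
    sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses

If the hit rate is already decent with the ARC you have, an L2ARC won't buy you much.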

Do you have workloads that include lots of (random) synced writes, like databases or VMs? If not, adding an SLOG won't help you, so use the SSD exclusively as L2ARC or (depending on its size) for OS + L2ARC. This is what I recently did on some smaller systems, where more than 100GB or even 50GB of L2ARC (on a slow SATA SSD) just makes no sense, especially given its additional memory footprint.
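The OS + L2ARC split is just partitioning. Something along these lines would do it (a sketch only; "ada1", the sizes and the labels are assumptions to adapt):

    gpart create -s gpt ada1
    gpart add -t freebsd-ufs -s 40G -l ssd-os ada1      # OS
    gpart add -t freebsd-zfs -s 64G -l ssd-l2arc ada1   # modest L2ARC
    zpool add tank cache gpt/ssd-l2arc

Leaving the rest of the SSD unpartitioned also gives the controller some spare area for wear leveling.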

Make sure to use SSDs with high enough endurance ratings! Consumer SSDs wear out really fast on a busy ZFS server, and cheaper ones especially tend to corrupt/lose data and/or simply die without any warning, unlike spinning rust. Don't rely on SMART data either - I've had 2 cheap SSDs die way above 50% "SSD_life_left"...
 
The server is pretty low on RAM, it only has 16 GB. It's maxed out so I can't add more.

I didn't know the L2ARC used so much RAM. I could support up to 500 GB but it would be at the expense of the smaller, faster, RAM cache. Combined with the endurance issue I'm not sure it's worth it. I think I will just stick with what I have now until it's time to get an entirely new server with more RAM. When I do that I'll consider adding multiple SSDs so I can have metadata on mirrored SSDs.
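Just to put numbers on it, if the ~25MB-per-GB figure above holds for my record sizes:

    500 GB x 25 MB/GB ≈ 12.5 GB of RAM for L2ARC headers alone
    100 GB x 25 MB/GB ≈  2.5 GB

That would eat most of my 16 GB, so the in-RAM ARC itself would shrink badly.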

Thanks for the advice!
 
Some thoughts:

Reliability of L2ARC is not a threat to your data; checksums are still used, and if the cache returns bad data, the request is sent to the pool.

If your working set size (the size of data being actively random read across) fits in the SSD but not your RAM, L2ARC can be a big win. Get an SSD targeting a random read workload.

Use zfs-stats to understand whether it is metadata or user data that your primary cache is "missing". You can set the L2ARC to be metadata-only if that makes sense for your workload. You can also disable use of the ARC or L2ARC dataset by dataset to control what data gets to fill them up. (Note: the L2ARC is only ever filled from blocks being evicted from the ARC, so if you set primarycache=none, the secondarycache setting is irrelevant.)
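As a sketch (the dataset names are made up; the properties themselves are standard ZFS, and I think I have the zfs-stats flags right):

    zfs set secondarycache=metadata tank/media    # L2ARC holds only metadata for this dataset
    zfs set primarycache=metadata tank/scratch    # ARC holds only metadata here
    zfs set primarycache=none tank/dump           # nothing cached; secondarycache is then moot
    zfs-stats -A    # ARC summary (sysutils/zfs-stats)
    zfs-stats -L    # L2ARC summary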

If you have a UPS, consider setting vfs.zfs.vdev.bio_flush_disable=1. This lets the HDD write caches do their thing. Yes, if your power supply pops at the wrong time, you could have an issue. You have backups, right?
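If you do go that route, it's an ordinary sysctl (whether this tunable exists, and under which name, depends on your FreeBSD/ZFS version):

    sysctl vfs.zfs.vdev.bio_flush_disable=1
    echo 'vfs.zfs.vdev.bio_flush_disable=1' >> /etc/sysctl.conf   # persist across reboots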

But, above all, if you need more IOPS from your pool, either go all SSD, or use a stripe of mirrors rather than raidz in the future.
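For a future build the difference is only in how you lay out the vdevs (hypothetical disk names):

    zpool create tank raidz da0 da1 da2 da3 da4                     # one raidz vdev
    zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5  # three mirror vdevs; random IOPS scale with the number of vdevs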

Hope that helps!
 
You say your workload is random read. Is it really only random read? Or is it 80% random read with 20% write, perhaps small random write? Next question: How large are your IOs? For example, if you have a 20TB file system (quite realistic with 5 drives), you could have 100MB random reads, or you could have 4KB random reads, and that makes a giant difference. While 100MB reads are still "random" when compared to the overall size of the file system, they are individually large enough to get great sequential performance out of the drives, and prefetching will work well.
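If you don't already know your IO sizes, it's easy to watch them live (gstat ships with FreeBSD; the request-size histogram needs a reasonably recent OpenZFS):

    gstat -p            # per-disk ops/s and KB/s; KB/s divided by ops/s gives the average IO size
    zpool iostat -r 5   # request-size histograms, refreshed every 5 seconds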

Next question: Are the reads whole files (but you happen to have lots of smallish files), or random offsets within a large file? This matters for how intense the metadata traffic will be, and random offsets within a large file also tend to defeat prefetching.

Eric's question about the working set is very important, but I fear the answer may be: there is no working set in the usual sense, and the whole big file system gets accessed. If there is no re-use, then read caching (including ARC and L2ARC) won't help, and you just need to get your backend disks to run faster.

And an insulting question: Have you turned off atime updates?

With a parity-based RAID, random small writes are expensive. So much so that even mixing in 20% small writes can hurt overall performance. On the other hand, parity-based RAID tends to be pretty good with random reads (because you read the data directly from disk, without having to do parity stuff); and with the rotating parity on modern RAID implementations, a random read workload will keep all disks busy. That's why I asked about atime: those can turn a read-only workload into having a significant metadata update workload (in particular if you are reading many small files, meaning many atimes need to be updated).
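Turning atime off is a one-liner and is inherited by child datasets (pool name assumed):

    zfs set atime=off tank
    zfs get -r atime tank    # check what each dataset ended up with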

If you have small random writes, one great (but difficult) strategy is to change the application to get rid of them. Maybe implement your own log, where the writes are temporarily stored in a sequentially written file, and then have a separate process that asynchronously writes that log back to the correct location.

If your workload is metadata read intensive, it could be that metadata is the real bottleneck. In that case, moving all the metadata to SSD would help (but the reliability problem you point out is real).

Lastly, if you use the 2.5" slot to handle the OS, then maybe you don't need to actually use an SSD, but you can use a SFF (laptop-sized) spinning disk. Those are obviously much cheaper than SSDs and have better endurance.
 
Would someone please take every ZFS-relevant keyword and setting in this thread and turn it into a flowchart (leaving ZFS-on-root out of it), in order to 1... separate the ZFS-on-root users from regular ZFS users, and 2... present the decisions in a form that goes beyond the man pages so common on Linux and the dialog boxes/wikis so common on Linux and Windows? Something printable could quickstart a new crop of BSD users, as well as people like me who have deep trouble understanding ZFS without such a flowchart to hand-hold us through its use, and it could grow the number of BSD users so that the developers are not so numerically hamstrung upstream as they tirelessly work through showstoppers on our behalf... this is not meant as a criticism of anything in this thread, rather a framing of it, something to put on the wall as a singular example of how intelligently BSD users handle the relevant terminology throughout forum posts that put my own posts to shame.
Thanks!
 
A flowchart, that is, with multiple 'you are here' boxes on one edge and 'you want to attain this' boxes on the other, and box-to-box flows in between, so the decisions don't have to be looked up but are made for you, sparing not only time and expense but also precluding errors.
.....................
No hurry by the way.
.....................
Saving this thread in my /somewhere/ZFS data set... that is, a set of files, not so much a ZFS dataset or ZFS-backed filesystem.
................
 