ZFS Selecting compression options for stable/14

The "common answer" is: always set lz4 unless you know the data is not compressible (e.g. most video streams, already-compressed archives, etc.), because ZFS will attempt it and bail early if it detects non-compressible data, so the loss is small but the potential gain (in storage capacity) is real and material.
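For reference, that advice is a per-dataset one-liner; a minimal sketch (the pool/dataset names are placeholders, and the property only affects newly written blocks):

```shell
# Enable lz4 on a dataset; existing blocks keep their old (non)compression
zfs set compression=lz4 tank/data

# After writing some data, check what it actually achieved
zfs get compression,compressratio tank/data
```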

But over on the various mailing lists for ZFS 2.2 there are some rather nebulous notes about zstd-9 being able to behave the same way: check first whether a block compresses with lz4, then zstd-1, and only if both succeed go ahead and use zstd-9 on that block; otherwise don't. Zstd-9 is quite resource-intensive to compress, but the zstd compression level doesn't impact reads to any material degree, so if it can bail fast and early this might be a significant win for compressible data sets over the default and worth the CPU expenditure, particularly if I'm writing to a multi-vdev set of mirrors that is all spinning rust, where physical transfer rate is likely the limiting factor.

Is this in the current code, or is the answer still "use lz4 except in special circumstances"? It's not at all clear to me whether that's the case, but stable/14 did import the 2.2 code from upstream.

If you know, thanks in advance!
 
Generally speaking, compression increases effective disk I/O throughput. lz4, on any decent modern CPU, has hardly any significant compute impact, whether the data is compressible or not. zstd-X differs in that compression and decompression behave differently, as you mentioned; it matters (a lot) more whether the data changes often or hardly at all (read-only vs. write-intensive). Have a look at Zstandard Compression in OpenZFS - 2020 by Allan Jude.

Besides compression, there are other ZFS properties that are (very) important for disk performance, such as ashift and recordsize / volblocksize. Have a look at OpenZFS: Understanding Transparent Compression - June 2, 2020, Klara Systems.
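For completeness: ashift is fixed per vdev at creation time, while recordsize can be changed per dataset at any time (again, pool/dataset names here are placeholders):

```shell
# ashift is a pool/vdev-level property, set when the vdev is created
zpool get ashift tank

# recordsize is per-dataset and only applies to blocks written afterwards
zfs set recordsize=1m tank/media
zfs get recordsize,compression tank/media
```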
 
Yeah, Erichans; I've learned all that over the years, and incidentally a lot of the "standard" advice is wrong. For example, the "standard advice" for a database such as Postgres is to use an 8 KB recordsize because that matches what the database uses internally.

The logic is sound, but the facts are that on modern hardware, particularly on either SSD or NVMe storage, it's almost always faster to use native record sizes and compression -- and not a little faster either! Several times faster, and the CPU overhead for doing so doesn't go up much.

Having looked into zstd when it first appeared, my conclusion was that it was only going to be usable in my workloads in very specific circumstances (which rarely arise in my case). However, if the early abort allegedly in 2.2 is in there, that changes the picture quite a bit, which is why I was asking if it is.
 
Is this in the current code, or is the answer still "use lz4 except in special circumstances"? It's not at all clear to me whether that's the case, but stable/14 did import the 2.2 code from upstream.
However, if the early abort allegedly in 2.2 is in there, that changes the picture quite a bit, which is why I was asking if it is.
I don't know if there's a change coming for the default compression scheme for ZFS, but based on (master) zfsprops.7 - compression it's still lz4 (or lzjb).

Merge in OpenZFS: zstd early abort #13244 - May 24, 2022; this was released in OpenZFS zfs-2.2.0 - Oct 13, 2023 and included in 14.0-R - ZFS notes and beyond; so, it's a fact (no more "allegedly").

The most recent zstd compression links that I have are:
  1. Refining OpenZFS Compression by Rich Ercolani at OpenZFS Developer Summit 2022 (see: "Refining OpenZFS compression")
  2. Some notes on ZFS's zstd compression kstats (on Linux) 2024, by Chris Siebenmann
 
If there is code to try multiple compression algorithms before giving up I can't find it.

My understanding also is that zstd has a fairly high startup cost before you get actual compressed chunks that you can measure.
 
A couple of quick checks; these are large (~200 GB) image backups running from a Win11 machine to an SMB-mounted volume. The NIC on the PC is 2.5 Gbps and the imaging software cannot saturate it (but can get reasonably close, ~1.8 Gbps -- it can saturate a gigabit link), and the uplink to the target is 10 GbE. The target volume is a 3x2-mirror ZFS filesystem which reaches ~50% I/O utilization on a per-spindle basis during the run (in other words, it's not saturated either).

Image size (uncompressed, as shown with "ls -lH") is 197 GB.

With compression set to lz4, disk consumption in blocks (the difference between free space before and after copying) is 141 GB. CPU during the write is immaterial (it's there, obviously, but most of it, as shown by systat -vm, is occurring in smbd to service the network). Compression saves ~29% of the disk space, and incidentally lz4 beats the built-in "medium" compression of the backup software (Macrium); I do not know what algorithm they're using, however.

With compression set to zstd-9, disk consumption in blocks is 125.6 GB. But CPU during the write is now very substantial, with the load average running ~2x higher and a whole lot of it in the kernel (presumably in the compression algorithm). In this case compression saves ~36%, which is a fairly substantial increase -- at the cost of material CPU during writing.
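Those percentages check out from the raw sizes (simple arithmetic, nothing ZFS-specific):

```shell
awk 'BEGIN {
  printf "lz4 saving:    %.1f%%\n", 100 * (1 - 141 / 197)     # -> 28.4%
  printf "zstd-9 saving: %.1f%%\n", 100 * (1 - 125.6 / 197)   # -> 36.2%
}'
```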

In addition, a read of that file back down to the PC runs at wire speed and, equally interesting, the CPU hit from doing so beyond the smbd service, while visible (e.g. some kernel use), is extremely small -- maybe 15-20% beyond the smbd requirement. As such, decompression has no notable CPU cost with zstd-9.

This is the reason I believed it might be in the recent OpenZFS 2.2 import from upstream: https://www.truenas.com/community/t...-compression-that-saves-cpu-and-space.113247/ -- the commit referenced there implies it went into the codebase in October of last year. And then there is this in the statistics after these runs, which implies it is indeed there and operating:

root@NewFS:# sysctl -a | grep zfs | grep std

kstat.zfs.misc.zstd.size: 1443024
kstat.zfs.misc.zstd.buffers: 9
kstat.zfs.misc.zstd.passignored_size: 86
kstat.zfs.misc.zstd.passignored: 86
kstat.zfs.misc.zstd.zstdpass_rejected: 1283245
kstat.zfs.misc.zstd.zstdpass_allowed: 120540
kstat.zfs.misc.zstd.lz4pass_rejected: 1403785
kstat.zfs.misc.zstd.lz4pass_allowed: 1677820

kstat.zfs.misc.zstd.decompress_failed: 0
kstat.zfs.misc.zstd.compress_failed: 0
kstat.zfs.misc.zstd.decompress_header_invalid: 0
kstat.zfs.misc.zstd.decompress_level_invalid: 0
kstat.zfs.misc.zstd.compress_level_invalid: 0
kstat.zfs.misc.zstd.decompress_alloc_fail: 0
kstat.zfs.misc.zstd.compress_alloc_fail: 0
kstat.zfs.misc.zstd.alloc_fallback: 0
kstat.zfs.misc.zstd.alloc_fail: 0
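If I'm reading those counters right (an assumption on my part), the two-stage check goes: try lz4 first; blocks lz4 rejects get a second opinion from zstd-1; only blocks both stages reject are stored uncompressed. That fits the numbers, since zstdpass_allowed + zstdpass_rejected equals lz4pass_rejected exactly. Working out the shares from the values above:

```shell
# Share of blocks that went on to full zstd-9 vs. stored uncompressed,
# using the kstat values above (interpretation assumed, per the lead-in)
awk 'BEGIN {
  lz4_ok = 1677820; lz4_no = 1403785   # lz4 pre-pass allowed / rejected
  z1_ok  = 120540;  z1_no  = 1283245   # zstd-1 second opinion on lz4 rejects
  total = lz4_ok + lz4_no
  printf "went to zstd-9:      %.1f%%\n", 100 * (lz4_ok + z1_ok) / total  # -> 58.4%
  printf "stored uncompressed: %.1f%%\n", 100 * z1_no / total             # -> 41.6%
}'
```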

A reasonably large system image such as this has both essentially incompressible pieces and quite compressible ones, and in addition these copies are done on a schedule, not "all the time, every day." What the data appears to show is that while it does hit the target server's CPU quite considerably during writes, the saving in disk space is material enough (about another 13% differentially compared with lz4, or ~7% raw) to be worth it for this specific filesystem.
 
My understanding is that it's in, but I don't know of any way to enable/disable it. I figure that not being able to disable the lower-compressor early-abort checks means data that compresses poorly at the lower settings but better at higher ones will have some blocks skipped that could have been compressed. Similarly, it doesn't adjust the compressor per block to whichever would have saved more disk space or decompressed faster; blocks that pass the check simply get the original high compression setting.
Messing with compression=zstd-18 has had less of a performance impact than I would have expected, but I haven't been able to tell when it's an early abort from level 18 versus a lower-compressor early abort; it certainly happens.
Adjusting recordsize can certainly impact compression further. Higher than the default is usually better, but comparing test runs with different settings is required to find the best recordsize and compression setting if you're trying to maximize it.
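One quick way to run that comparison is a throwaway dataset per combination; a rough sketch (the pool name "tank", the dataset name, and the sample-data path are all placeholders -- point it at a representative copy of your real data):

```shell
# Compare achieved compressratio across recordsize/compression combos
for rs in 128k 512k 1m; do
  for comp in lz4 zstd-3 zstd-9; do
    zfs create -o recordsize=$rs -o compression=$comp tank/bench
    cp -R /path/to/sample/data /tank/bench/
    sync
    echo -n "$rs $comp: "
    zfs get -H -o value compressratio tank/bench
    zfs destroy -r tank/bench
  done
done
```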
Also handy: if you gain compression from a higher zstd level, you also gain ARC compression efficiency, and while decompression performance isn't identical at all levels, it stays fairly similar. If you copy from one compressed dataset to another with the same compression, you won't have to decompress and recompress, so it copies around with low overhead as if it weren't compressed, but gets there faster if there was 'any' space saved by the first compression.
If you are concerned about space, you will still usually get better results using a compressor separate from ZFS (zstd, xz, zpaq, etc.). Separate compressors can group more than single-record-sized chunks of content and look across that larger quantity to optimize compression predictions. They can also use other advanced techniques, like a custom compression dictionary. Comparing the same compressor built into ZFS versus external to it, ZFS likely has an older version, which can lag in achieved compression ratio and compress/decompress speed. On the other hand, zstd does not have multithreaded decompression at this time, so standalone decompression performance may suffer compared to the ZFS counterpart, because ZFS can decompress multiple records at the same time, which negates the need for separate multithreaded decompression.
The 8k/16k recommendations for databases exist so you can have small records both to read and to update. Let's assume you have a 1 GB database and a 128k recordsize. If you want to read an 8k record, you have to read the whole 128k record to checksum-verify it, and if it's compressed you can expect to decompress 128k to read that 8k. If you want to write it, then, adding to the read steps above, you update the 8k within the 128k, redo the checksum/compression/etc. for the 128k, and then write that new 128k block to disk. Trying more extreme record sizes like 1M or the (less tested) 16M can really amplify this, so you can see how such 8k activity is being impacted or not.
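To put numbers on that read-modify-write cost: for an 8 KB logical update, the data touched scales linearly with recordsize (this assumes the whole record is read, re-checksummed/recompressed, and rewritten, as described above):

```shell
# Amplification factor for an 8 KB update at various recordsizes (KB)
awk 'BEGIN {
  n = split("8 16 128 1024 16384", rs, " ")
  for (i = 1; i <= n; i++)
    printf "recordsize %5d KB -> %4dx amplification\n", rs[i], rs[i] / 8
}'
# -> 1x, 2x, 16x, 128x, 2048x respectively
```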
 