ZFS: Thoughts on ZFS & compression

I have set compression of /var/log to gzip, because text files compress well.
I have set compression of /usr/ports/distfiles to off, because distfiles are already compressed.
Maybe you have other tips? Interesting tunings?
Do you set compression on /var/db/mysql ?
 
Be aware that compression will not save you any space below the size of your pool's ashift. If your ashift is 12 (as it should be for most devices these days), then a very compressible 3kB log file takes up a 4kB block, whether compressed or uncompressed.
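If you are not sure what your pool's ashift is, zdb can show it (the pool name is a placeholder):
Code:
# Show the ashift of each vdev from the cached pool configuration:
zdb -C tank | grep ashift
# ashift=12  ->  minimum allocation unit of 2^12 = 4096 bytes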

If you do compress /var/log, you can turn off the additional xz compression for newsyslog rotation (if you use that).
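In practice that means making sure the compression flag (Z for gzip, J for bzip2, X for xz) is not set in the flags column of /etc/newsyslog.conf. An illustrative entry (not from the stock file):
Code:
# logfilename          [owner:group]  mode count size when  flags
/var/log/example.log                   644  7     100  *     C
# A "Z", "J" or "X" in the flags column would compress the rotated logs again;
# on an already-compressed dataset, plain "C" (create if missing) is enough.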

Also, consider the effects of recordsize and compression, even for uncompressible data. Having compression completely off will certainly waste some space for large recordsizes, because files usually don't fit exactly into an integer number of records. For a recordsize of 1M, a file of 1100kB will take up 2MB on disk (because that's 2 records). However, if that dataset is compressed, you get the used space back down (close) to 1100kB.
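As a quick back-of-the-envelope version of that example:
Code:
# 1100 kB file, recordsize=1M (1024 kB), compression=off:
echo $(( (1100 + 1023) / 1024 ))   # -> 2 records, i.e. 2 MB allocated
# With compression enabled (even zle), the nearly empty tail of the
# second record compresses away and usage drops back close to 1100 kB.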

If you're worried about performance, you can set compression to lz4 on such datasets; it is very fast, even on uncompressible data. To be absolutely certain that no CPU cycles are wasted, while still not wasting space on half-empty records, set compression to zle ("Zero Length Encoding"), which only compresses runs of zeroes. That should take care of half-filled records, though I haven't tested that myself.

In general, a large recordsize is good for compression (because data is compressed per record), but bad for metadata overhead. Ultimately, it pays off to experiment, because it's hard to know how some actual data will compress. I recently archived 90GB of MP3 files into a pool and experimented with four compression algorithms and recordsizes from 16K to 1M. I ended up using lz4 compression and a 64K recordsize, which gave the best overall space efficiency on disk.
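If you want to run that kind of comparison yourself, a rough sketch (pool name and source path are placeholders):
Code:
zfs create tank/test
zfs create -o recordsize=64K -o compression=lz4  tank/test/lz4-64K
zfs create -o recordsize=1M  -o compression=zstd tank/test/zstd-1M
cp -a /archive/mp3/. /tank/test/lz4-64K/
cp -a /archive/mp3/. /tank/test/zstd-1M/
zfs list -r -o name,recordsize,compression,used,logicalused,compressratio tank/test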
 
In general, a large recordsize is good for compression (because data is compressed per record), but bad for metadata overhead.
Sorry, that came out wrong. I meant to say: A large recordsize generally gives you better compression and less metadata overhead, but space is wasted on partly-filled records if compression is off :)

Conversely, a smaller recordsize will reduce compression efficiency and give you more metadata overhead, but not waste much space on partly-filled records, even with compression off.

It's hard to know where the sweet spot is for the recordsize of partly-compressible data in medium-sized files (anything between 1M and 10M), which is why experimentation is useful.

Beware that all this doesn't take writing performance into account (my most important pools are long-term storage with very little writing). recordsize has a big impact on writing performance for data that changes often (like databases), but I don't know much about that.
 
Of course compression on (lz4) for MySQL.
In fact, turning it off does not make sense even for already-compressed data.
 
There is also the size of the "buffers" used by the operating system, and by the application.
PostgreSQL uses smaller buffers than MySQL.
 
An interesting question is: what do you do with a directory full of MP3, FLAC, or AVI files?
 
In general, I almost always have lz4 enabled; there are very few disadvantages. At a minimum, though, I'd use zle instead of off; it at least makes sure files will be sparse if they can be.

If your system has the zstd feature available, I'd highly recommend using that instead of gzip. It's basically better in every way (better ratio and less CPU time). In some circumstances, it can replace lz4, especially if you have zstd-fast-1 or so (default is zstd-3).

My opinion of the compression algorithms available on ZFS:
off = guarantees files take up space and don't automatically become sparse on runs of zeros, just like UFS would do.
zle = blocks of zeros don't get stored to disk, otherwise equivalent to off - I use this on my swap zvol.
lzjb = the original fast ZFS compression algorithm. Mostly useless now except on old pool versions.
lz4 = current default for compression=on with an extremely good ratio and CPU time tradeoff. Fully replaces lzjb and mostly a good idea to enable on a root dataset (if your pool is sufficiently general-purpose anyway)
gzip = old algorithm offering a decent ratio but severely expensive in CPU time. Mostly only good for old pool versions.
zstd = extremely flexible algorithm nearly rivaling lz4 in speed (the fast levels can actually achieve similar benchmarks). The ratio is much better than any of the prior algorithms; even the default zstd-3 is very fast and achieves a better compression ratio than gzip-9 can.

From this list, my main considerations in any circumstance are zle, lz4, and zstd. It comes down a lot to the write performance characteristics desired. I do even have a zstd-19 file system; very rarely do I write to it, so it achieves an extreme level of compression while not being demanding on read.
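All of these are plain per-dataset properties; the dataset names below are only examples, and bear in mind that only data written after the change is compressed with the new algorithm:
Code:
zfs set compression=lz4      tank/root       # general-purpose default
zfs set compression=zstd     tank/data       # default level is zstd-3
zfs set compression=zstd-19  tank/archive    # very slow writes, best ratio
zfs set compression=zle      tank/swap       # only squash runs of zeroes
zfs get -r compression,compressratio tank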

A large recordsize generally gives you better compression and less metadata overhead, but space is wasted on partly-filled records if compression is off
recordsize advises ZFS of the largest record size to use for file blocks, but small enough files can and will use the smallest unit available on the hardware (typically 512 or 4096 bytes).
 
gzip = old algorithm offering a decent ratio but severely expensive in CPU time. Mostly only good for old pool versions.
zstd = extremely flexible algorithm nearly rivaling lz4 in speed (the fast levels can actually achieve similar benchmarks). The ratio is much better than any of the prior algorithms; even the default zstd-3 is very fast and achieves a better compression ratio than gzip-9 can.
I disagree on that one. I recently migrated a couple of terabytes of stuff into a "write once, keep forever" kind of storage pool and experimented a lot to find the best space efficiency (with write performance being almost no concern).

I found that gzip (default level 6) does considerably better than zstd (default level 3 or even maximum level 19) on two kinds of data:
  • software binaries (mostly ISOs, as well as some tarballs and uncompressed executables)
  • ebooks (PDF and ePub)
zstd was, however, better or equal on most of the other data (which was mostly near-incompressible music and video).
recordsize advises ZFS of the largest record size to use for file blocks, but small enough files can and will use the smallest unit available on the hardware (typically 512 or 4096 bytes).
I'm surprised, because I saw very large space usage overhead (around 10–20%) when storing about 90GB of MP3 files (mostly 3MB to 7MB each) in uncompressed records of 1M. This went down considerably when I lowered the recordsize. I assumed this was because of partly-filled records. I'll have to do some more experiments :)
An interesting question is: what do you do with a directory full of MP3, FLAC, or AVI files?
For my MP3s of the typical mid-2000s to mid-2010s kind: recordsize 64K, compression zstd. For large avis: recordsize 1M, compression zstd. FLACs should be somewhere in between.
 
I'm surprised, because I saw very large space usage overhead (around 10–20%) when storing about 90GB of MP3 files (mostly 3MB to 7MB each) in uncompressed records of 1M. This went down considerably when I lowered the recordsize. I assumed this was because of partly-filled records.
Here we go, this is all the same 478 MP3 files of sizes between 2 and 7MiB each, for a total of 1940MiB:
Code:
NAME                 RECSIZE  REFER  LREFER  COMPRESS
safe/test/lz4-1M          1M  1.86G   2.09G       lz4
safe/test/lz4-512K      512K  1.86G   1.98G       lz4
safe/test/lz4-64K        64K  1.85G   1.88G       lz4
safe/test/lz4-4K          4K  1.89G   1.88G       lz4
safe/test/none-1M         1M  2.09G   2.09G       off
safe/test/none-512K     512K  1.98G   1.98G       off
safe/test/none-64K       64K  1.88G   1.88G       off
safe/test/none-4K         4K  1.90G   1.88G       off
safe/test/zle-1M          1M  1.87G   2.09G       zle
Look at LREFER and REFER for the 1M-recordsize datasets: It looks very much like partial-record overhead that gets compressed away (or doesn't).

But then again, all documentation says that chungy is right, and there is no such thing as partial-record overhead.

In which case I don't understand what's happening here. What on earth is driving the LREFER up for the 1M-recordsize datasets? Why does compression counter the effect, even if it's just zle?
 
Be aware that compression will not save you any space below the size of your pool's ashift. If your ashift is 12 (as it should be for most devices these days), then a very compressible 3kB log file takes up a 4kB block, whether compressed or uncompressed.

A very, very compressible 3KB log file, which ZFS can compress into around 100 bytes or less, will be put inside a block pointer structure if the embedded data option is enabled. In this case, no 4kB block will be allocated at all.
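On OpenZFS this is the embedded_data pool feature; you can check whether it is active like so (the pool name is a placeholder):
Code:
zpool get feature@embedded_data tank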
 
Long story short:

To compress, or not to compress, that is the question

Yes, enable LZ4 compression.

LZ4 is perfect for admins who would like to use compression as a default option for all pools. Enabling LZ4 on all volumes means compression will be used where appropriate, while making sure the algorithm will not waste CPU time on incompressible data and reads will not incur a latency penalty due to transparent decompression. Just incredible.

Personally, I have never found a concrete case in which to disable lz4, even in the presence of already-compressed data and very fast media (3GB+/s).
 
Personally, I have never found a concrete case in which to disable lz4, even in the presence of already-compressed data and very fast media (3GB+/s).
I've found ZFS with lz4 (or any algorithm, really) will often recompress already compressed data with a terrible ratio. Yes, it's still technically "smaller", but negligibly so, and then this has to be decompressed twice--once for lz4 (or whatever you're using) and again for the compression mechanism on the file itself.

Don't believe me? I wrote a script to compare the "on-disk" size to the "apparent" size, and if a file was compressed by ZFS using lz4 (i.e. on-disk < apparent), it added those sizes to the per-extension totals. In truth, the actual compression ratios are probably worse than this, as it excludes the possibility that a compressed file could be larger than an uncompressed one, which is probably the case (as I believe ZFS only considers the initial few KBs to see if a file is compressible). As such, this data is likely biased in favor of the compression algorithm, but I had no way of knowing whether files where the on-disk size was larger than the apparent size were skipped or just poorly compressed.

I then ran this on my SSD that's full of games, which gave me a wide variety of file types and extensions. Here are some of the more interesting results, where the percent listed is how large the compressed size is compared to uncompressed:

Code:
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zssd   928G   567G   361G        -         -     1%    61%  1.00x    ONLINE  -

Some of the worst offenders using common extensions:
99.907302%  flac
99.880143%  webm
99.871827%  PAK
99.833232%  swf
99.736952%  mp4
99.559496%  wmv
99.528653%  gz
99.401093%  bnk
99.174981%  bk2
99.057178%  bik
99.047371%  zip
98.662708%  m4v
98.654002%  arc
98.597121%  pdf
98.443760%  otf
98.440942%  gif
97.040944%  db9
96.965660%  dat
96.543321%  ivf
96.161654%  MSI
96.012911%  cab
96.010030%  msi
95.976791%  wav

It's not all bad though; some extensions have big gains:
20.719518%  JSON
18.516408%  bmp
16.133807%  log
15.202436%  xml
14.062500%  sqlite-shm
4.813791%  sqlite-wal
0.554591%  sqlite
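For a rough check without a script, FreeBSD's du can compare the two numbers per directory (the path is just an example):
Code:
du -sh  /zssd/games     # allocated (on-disk) size
du -Ash /zssd/games     # apparent size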
 
The real question is: is an lz4-compressed pool faster or slower?
Bigger or smaller?
In my experience it is just as fast, and smaller.
Sometimes much smaller.
 
The real question is: is an lz4-compressed pool faster or slower?
On most commodity desktop/laptop hardware, I'd wager that it's likely to be nearly always faster.

The tradeoffs are between how fast your disk can read data and how fast the decompressor can decompress it. lz4 is an algorithm explicitly optimized to have minimal CPU time, much faster than your disk reads are.

It's possible to have enough and fast enough disks to outperform lz4, and you'd start to consider off or zle here.

I've found ZFS with lz4 (or any algorithm, really) will often recompress already compressed data with a terrible ratio.
ZFS won't store compressed blocks if the compressor isn't shrinking it smaller than ⅞ of its original size. It's likely that all of your examples following are the metadata headers being compressed while the bulk of the compressed codec data is being stored as-is. You could poke around even more with zdb(8) to find out exactly which blocks are compressed and not.
 
ZFS won't store compressed blocks if the compressor isn't shrinking it smaller than ⅞ of its original size. It's likely that all of your examples following are the metadata headers being compressed while the bulk of the compressed codec data is being stored as-is. You could poke around even more with zdb(8) to find out exactly which blocks are compressed and not.
Also note that a dataset's recordsize plays a part in this. Consider a fictional example of an MP3 file with a size of 5MB, of which 50kB are compressible metadata (with a ratio of 2.5x), and the other 4.95MB are incompressible. Let's say for simplicity's sake that the compressible metadata sits at the start of the file.

In the case of the default zfs recordsize=128K, the following happens:
  1. The first record contains 50kB of metadata and 78kB of incompressible music. With the metadata compressed to 20kB, the total compression ratio for this record is 98/128 = 76.56%. This is better than 7/8 = 87.5%, so the record is stored compressed.
  2. All the other records are incompressible.
  3. In total, the file is compressed from 5MB down to 4.97MB, with a compression ratio of 1.006.
In the case of the recordsize=1M, however:
  1. The first record contains 50kB of metadata and 974kB of incompressible music. With the metadata compressed to 20kB, the total compression ratio for this record is 994/1024 = 97%. This is worse than 7/8 = 87.5%, so the record is stored uncompressed.
  2. All the other records are incompressible.
  3. In total, all of the file is stored uncompressed.
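The same arithmetic as a quick shell check:
Code:
echo $(( (20 + 78)  * 100 / 128  ))    # recordsize=128K: 76% < 87.5% -> stored compressed
echo $(( (20 + 974) * 100 / 1024 ))    # recordsize=1M:   97% > 87.5% -> stored uncompressed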
Beware that my example of 50kB of 2.5x-compressible metadata was fictional; I don't know how much metadata there usually really is, or how compressible it is.

I did, however, experiment on 90GB worth of real-world MP3 music, and found that recordsize=64K gave the best storage efficiency (though it's all a question of 1 or 2GB at best). I think what happened was: Any larger recordsize would (sometimes) fail to compress metadata, as laid out above. Any smaller recordsize generates so much metadata overhead that it eats the compression advantage.
 
I've found ZFS with lz4 (or any algorithm, really) will often recompress already compressed data with a terrible ratio. Yes, it's still technically "smaller", but negligibly so, and then this has to be decompressed twice--once for lz4 (or whatever you're using) and again for the compression mechanism on the file itself.
No, it does not really work like that.

(as I believe ZFS only considers the initial few KBs to see if a file is compressible)
Compression is not file-based; it is record-based. ZFS will attempt to compress every record. If it is possible to win at least one ashift worth of space by compression, the record is then saved compressed. If not, the record is saved uncompressed, so it doesn't have to be decompressed on every read.

When you average many files, you get an apparent 99% compression rate, but it is in fact composed of many uncompressed records and a few significantly compressed records.
 
Compression is not file-based; it is record-based. ZFS will attempt to compress every record. If it is possible to win at least one ashift worth of space by compression, the record is then saved compressed. If not, the record is saved uncompressed, so it doesn't have to be decompressed on every read.
Do you mean to say "If it is possible to win at least one ashift worth of space by compression and the compressed size is smaller than 7/8 of the original"? Otherwise, I'm confused ;)

The 7/8 rule is mentioned in zfsprops(8), though I have been told by developers on IRC that it might change in the future.

(Also, why does that manpage confusingly seem to speak of records and (logical) blocks interchangeably?)
 
Compression is not file-based; it is record-based. ZFS will attempt to compress every record. If it is possible to win at least one ashift worth of space by compression, the record is then saved compressed. If not, the record is saved uncompressed, so it doesn't have to be decompressed on every read.

When you average many files, you get an apparent 99% compression rate, but it is in fact composed of many uncompressed records and a few significantly compressed records.
While it's good that it's record based (for files that might have a mix of compressed & uncompressed data), I still don't think it's a good idea to use any compression on a ZFS data set that will only store already compressed files (e.g. already compressed audio/video). It will mostly just slow down writes, as it has to test if each block is smaller when compressed, for negligible gain.

That said, it's easy to create more data sets, so there's no reason to not take advantage of this. I have separate data sets (or in some cases, entire pools) just for storing audio, video, and images, and then anywhere else that might have compressible data I take advantage of lz4 or zstd.
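A sketch of what such a layout can look like (dataset names and compression choices are only examples):
Code:
zfs create -p -o compression=off  tank/media/video
zfs create    -o compression=off  tank/media/audio
zfs create    -o compression=zstd tank/documents
zfs create    -o compression=lz4  tank/home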
 
While it's good that it's record based (for files that might have a mix of compressed & uncompressed data), I still don't think it's a good idea to use any compression on a ZFS data set that will only store already compressed files (e.g. already compressed audio/video). It will mostly just slow down writes, as it has to test if each block is smaller when compressed, for negligible gain.
You'd have to have some mighty fast drives, and some mighty slow CPU, to see an actual slowdown of writes with lz4. When my ZFS pool is being written to at the ~150MB/s capacity of the HDDs, the CPU (some 5-year-old Intel quadcore) is just chilling at around 5% load, so the bottleneck is definitely the spinning rust, and not the compression.
 
The limiting factor in my current setup is the 10G interfaces, but I'm using them with relatively "low end" hardware, i.e. Atoms. Between the CPU used just for network & disk I/O, and encryption/decryption, the CPU load can actually get quite high. I'm not sure if compression would push it over the edge and make the CPU the limiting component, but I see no reason to find out as everything stored on those pools is already compressed by another content-specific codec.
 
On most commodity desktop/laptop hardware, I'd wager that it's likely to be nearly always faster.

The tradeoffs are between how fast your disk can read data and how fast the decompressor can decompress it. lz4 is an algorithm explicitly optimized to have minimal CPU time, much faster than your disk reads are.

It's possible to have enough and fast enough disks to outperform lz4, and you'd start to consider off or zle here.
On most server-grade hardware, it's way faster, because you will use multicore CPUs.
Even for 3GB/s+ storage I find nothing against lz4, even for non-compressible data.
It really is "fire and forget".

I believe that filesystem compression should not be confused with compression done by application programs.
Since the system is practically always just as fast as without compression, and the space used is smaller, sometimes much smaller (as with VMs), the short version is (at least for me): lz4 on.
 