ZFS: Does ZFS compression work on single files or globally?

Does the compression option on ZFS file systems compress each file individually, or does it work globally, compressing all files with the same encoding? Suppose I have files A, B, and C, all with similar contents: would ZFS take less space to store them with compression enabled?
 
No, compression is done independently for each file.

There is dedup, but unless you have identical files, it is likely of little use. (N.B.: don't switch it on unless you have read up on the cost/benefit of using it.)
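If you do want to gauge whether dedup would pay off before turning it on, zdb can simulate building the dedup table for an existing pool. A rough sketch, assuming a pool called zroot (it does not modify anything, but it can take a while and use a fair amount of RAM on large pools):
Code:
# Simulate deduplication and print a DDT histogram plus an estimated
# dedup ratio at the end; nothing on the pool is changed.
zdb -S zroot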
 
No, compression is done independently for each file.

There is dedup, but unless you have identical files, it is likely of little use. (N.B.: don't switch it on unless you have read up on the cost/benefit of using it.)
Thank you, I have studied the FreeBSD manual page for zfs and the handbook chapter 19 on ZFS but it did not explicitly explain the compression algorithm used. I do not have identical files but a huge number (hundreds of gigabytes) of extremely similar log files. I was not considering dedup due to the memory consumption issues you mentioned, and because one of the reasons I am untarring all these files is actually to delete the duplicates. I have two 3TB discs in a mirror configuration with 4GB of memory.
 
Thank you, I have studied the FreeBSD manual page for zfs and the handbook chapter 19 on ZFS but it did not explicitly explain the compression algorithm used

There are several compression algorithms available, listed in the zfs(8) man page. By default lz4 is used these days, which is usually the best choice.

compression=on | off | lzjb | gzip | gzip-N | zle | lz4
Controls the compression algorithm used for this dataset. Setting compression to on indicates that the current default compression algorithm should be used. The default balances compression and decompression speed, with compression ratio and is expected to work well on a wide variety of workloads. Unlike all other settings for this property, on does not select a fixed compression type. As new compression algorithms are added to ZFS and enabled on a pool, the default compression algorithm may change. The current default compression algorithm is either lzjb or, if the lz4_compress feature is enabled, lz4. The lzjb compression algorithm is optimized for performance while providing decent data compression. Setting compression to on uses the lzjb compression algorithm. The gzip compression algorithm uses the same compression as the gzip(1) command. You can specify the gzip level by using the value gzip-N where N is an integer from 1 (fastest) to 9 (best compression ratio). Currently, gzip is equivalent to gzip-6 (which is also the default for gzip(1)). The zle compression algorithm compresses runs of zeros.
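For example, to select lz4 explicitly on a dataset and confirm the setting (zroot/mydata is only a placeholder name), something like this should do:
Code:
# Enable lz4 on the dataset, then verify which algorithm is active.
zfs set compression=lz4 zroot/mydata
zfs get compression zroot/mydata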

Pretty much everything in ZFS works at the "record" level. All data is split into records that are up to 128 KB in size (this can be set with the recordsize dataset property). It's these records that get compressed, de-duplicated, mirrored across disks, etc.
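As an illustration (again with a placeholder dataset name), you can inspect or change the record size; a new recordsize only affects blocks written after the change:
Code:
# Show the current maximum record size (128K by default).
zfs get recordsize zroot/mydata
# Use smaller records, e.g. for database-style workloads.
zfs set recordsize=16K zroot/mydata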

The metadata for each record stores the type of compression used. As such, if you change the compression algorithm, all existing data will still be stored using the original algorithm, and ZFS uses the per-record compression flag to know which decompression algorithm to apply.
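A small sketch of what that means in practice (dataset name and path are made up): after switching algorithms, only newly written blocks pick up the new setting, so a file has to be rewritten to be recompressed.
Code:
# Switch to gzip-9; blocks already on disk keep their old compression.
zfs set compression=gzip-9 zroot/mydata
# Rewriting a file (copy and rename) recompresses it with the new setting.
cp /mydata/big.log /mydata/big.log.tmp && mv /mydata/big.log.tmp /mydata/big.log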

Compression is configured per dataset; zfs get compressratio pool/dataset will show you how much compression you are getting.
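For instance (placeholder dataset name again; the logicalused property needs a reasonably recent ZFS):
Code:
# Algorithm in use, achieved ratio, and space used after (used) vs.
# before (logicalused) compression.
zfs get compression,compressratio,used,logicalused zroot/mydata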
 
There are several compression algorithms available, listed in the zfs(8) man page. By default lz4 is used these days, which is usually the best choice.
Thank you. I have read the documentation for ZFS and am aware that there are several compression algorithms, such as gzip. My question is not about the compression algorithm itself, but about whether compression is applied per file or across files.
Pretty much everything in ZFS works at the "record" level. All data is split into records that are up to 128 KB in size (this can be set with the recordsize dataset property). It's these records that get compressed, de-duplicated, mirrored across disks, etc.
Thank you. Since this answer seems to contradict the response by Eric A. Borisch above, could someone provide a reference so I have some way of knowing which is correct?
 
They don't need to be similar; text files tend to compress really well in any case.
I've untarred about a third of the log files so far, and have consumed about 350 GB of disc space.
Code:
zroot/nyan  used                  352G
The compressed tar files take up about 40 GB of disc space, so the increase is about tenfold. The compression on the disc looks like this:
Code:
zroot/nyan  compression           lz4                    inherited from zroot
The log files are bzip2 or gzip compressed. (I had to switch to gzip from bzip2 because it was taking too long to compress them with bzip2.)
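If you want to see what compression achieves for an individual file rather than the whole dataset, comparing apparent size with on-disk size is a quick check (the path here is just an example):
Code:
# Apparent (uncompressed) size vs. space actually used on disk.
du -Ah /nyan/some.log
du -h /nyan/some.log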
 
Thank you. Since this answer seems to contradict the response by Eric A. Borisch above, could someone provide a reference so I have some way of knowing which is correct?

It does not. A record will never span more than one file; atom : molecule = record : file. I omitted the implementation detail to focus on your question: the redundancy and similarity between files that might be leveraged in, for example, a .txz of a bunch of similar files is not leveraged between records, and therefore not between files, in ZFS compression.
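A quick way to see that difference outside ZFS, with made-up file names and xz standing in for any stream compressor: compressing similar files independently cannot share redundancy between them, while compressing them as one stream can.
Code:
# Each file compressed on its own (analogous to ZFS compressing each record independently).
xz -9 -k day1.log
xz -9 -k day2.log
ls -l day1.log.xz day2.log.xz
# Both files compressed as a single stream (a .txz), where content shared between them can be exploited.
tar -cJf both.txz day1.log day2.log
ls -l both.txz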
 