ZFS Compression + Deduplication

Technical-ish question...

On my file server, I'm using both compression and deduplication (yes, I have enough RAM for dedupe). For a particular dataset, I was trying to determine whether gzip-9 would yield a better compression ratio than lz4 while still meeting my performance requirements. I tried to test this with (basically) the following...

Code:
zfs create pool/new
zfs set compression=gzip-9 pool/old
zfs snapshot pool/old@snap
zfs send -R pool/old@snap | zfs recv -F pool/new

I've successfully used this pattern of copying/rewriting data before to switch settings that are only applied on write. This time, however, I got an unexpected result. The compression ratios on 'old' and 'new' were exactly the same, but 'zpool list' showed a significant increase in dedup ratio. Also, the free space on the pool barely changed. So what I think happened is that the entire new file system deduped against the old before the new compression was applied.

Does this sound plausible? Are blocks deduplicated before compression is applied? If so, is it possible to do what I'm looking to achieve without having to first copy all of the data off the system and back?
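
For reference, this is roughly how I was comparing the two (same pool/dataset names as above):

Code:
zfs get compressratio pool/old pool/new
zpool list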

Thanks.
 
As far as I'm aware, compression is done first, as dedupe uses the ZFS checksums to decide which blocks are duplicates. The checksums are generated after compression because ZFS uses them to validate what's on disk, so the checksum must cover the actual on-disk (i.e. compressed) data.
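
If you want to convince yourself of the ordering, a quick experiment (untested, made-up dataset names, assuming default mountpoints and an otherwise quiet pool) is to write the same file into two dedup-enabled datasets with different compression and watch the pool's dedup ratio:

Code:
# zfs create -o compression=lz4 -o dedup=on pool/lz4test
# zfs create -o compression=gzip-9 -o dedup=on pool/gziptest
# cp /some/bigfile /pool/lz4test/
# cp /some/bigfile /pool/gziptest/
# zpool get dedupratio pool

If the ratio stays at 1.00x, the blocks were checksummed after compression, so the same data compressed two different ways can't dedup against itself.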

Are you currently using lz4 but trying to test gzip-9? That's what it seems like, but your example commands above are wrong. (Obviously the commands you actually ran might have been different.)

Assuming the pool is already lz4, what your commands above are doing is setting the old dataset to gzip-9. This means that all existing data on that dataset will stay compressed with lz4, but any new data will be written with gzip-9.

You then send the data from this dataset. During the send, ZFS decompresses the data. It's then received onto pool/new. You never changed the compression setting on pool/new, so I suspect it still has your original lz4 setting applied. As such, the data gets re-compressed with lz4 and ends up as an exact copy of the original blocks.
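
One way to check is 'zfs get', which also shows where each setting comes from (the SOURCE column will say local, inherited, or received):

Code:
# zfs get compression pool/old pool/new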

Edit: Just to add, usually you would let ZFS create the destination dataset when you do a first send. I'm not 100% sure what would happen if you create the destination, set properties like compression, then use -F to effectively force it to be replaced.

Also, you're using -R, which according to the man page "preserves all properties". I don't use -R as I find it takes over too much and sends stuff I don't want. I don't know if the compression setting is one of those preserved properties.
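
Depending on your ZFS version, 'zfs recv' may also accept -o/-x to override or exclude received properties, which would sidestep the question entirely; check your man page before relying on this. Something like (dataset name made up):

Code:
# zfs send pool/old@snap | zfs recv -o compression=gzip-9 pool/new2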

What I would do is something like the following -

Code:
# zfs create -o compress=gzip-9 pool/test
# zfs snapshot pool/old@test
# zfs send pool/old@test | zfs recv pool/test/new

This creates a new dataset with gzip compression enabled. All child datasets inherit settings, so any datasets created under it will have gzip compression.

I then take a snapshot of the old dataset, which currently has lz4 compression.

The last command sends the snapshot to a new dataset under pool/test. Because this is under the test dataset, it will inherit gzip compression.
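
Once the send completes, it's worth confirming the new dataset actually ended up with gzip and checking what ratio it achieved:

Code:
# zfs get compression,compressratio pool/test/new
# zpool get dedupratio pool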
 
Interesting. I don't know why I thought that send -R and recv -F were required here... I'm trying this again using your suggestion.
 