ZFS Performance FreeBSD vs Linux

I did a small, unscientific performance measurement concerning ZFS. It's not really worthwhile as a full post, but I just want to share my findings.

I bought a new machine and migrated my personal data to it, so I could do a fresh installation of FreeBSD on the old machine, which had been running Linux. I then thought about not completely getting rid of Linux, since KVM offers some nifty tricks I need for my work. I set up the old machine to dual-boot FreeBSD and Ubuntu 18.04 (kernel 5.3.0) via rEFInd from my old SSD. For storage, I use 3x 8TB Seagate Archive HDDs. I wrote a little Python script which reads all files, calculates md5sum + sha256sum and stores the filename + hashsums + size in a sqlite3 database residing in memory (/dev/shm on Linux, md(4) on FreeBSD). In total, 672345 files (I guess 98% photos) use 5.21TB of space.
The underlying disks were encrypted with the default configuration of geli on FreeBSD and cryptsetup on Ubuntu, and set up as RAIDZ1 (no cache, no log).
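In essence, the script does the equivalent of the following sketch (not my exact code; the mountpoint /tank/photos and the ":memory:" connection are placeholders, the real sqlite3 file sat on /dev/shm resp. an md(4) mount):

import hashlib
import os
import sqlite3

ROOT = "/tank/photos"        # placeholder for the pool's mountpoint
CHUNK = 1024 * 1024          # read files in 1 MiB chunks

# ":memory:" stands in for a database file on /dev/shm (Linux) or an md(4) mount (FreeBSD)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (path TEXT PRIMARY KEY, md5 TEXT, sha256 TEXT, size INTEGER)")

for dirpath, _dirs, names in os.walk(ROOT):
    for name in names:
        path = os.path.join(dirpath, name)
        md5, sha = hashlib.md5(), hashlib.sha256()
        size = 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(CHUNK), b""):
                size += len(chunk)
                md5.update(chunk)
                sha.update(chunk)
        db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                   (path, md5.hexdigest(), sha.hexdigest(), size))
db.commit()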

So how long does the python script run to accomplish the task?
*) FreeBSD 12.1-RELEASE, Python 3.7.6: 15 hours, 48 minutes, 46 seconds
*) Ubuntu 18.04 with ZFS (0.7.5) and Python (3.6.9) from the main repo: 20 hours, 43 minutes; I could not believe the difference, thus:
*) Ubuntu 18.04 with ZFS (0.7.5) from main repo, and Python 3.7.6: 14 hours, 45 minutes, 22 seconds
*) Ubuntu 18.04 with the most recent ZFS (0.8.3, after a "zpool upgrade") and Python 3.7.6: 14 hours, 45 minutes, 17 seconds

So in my small and unimportant personal scenario of mostly having my data at rest ;-), and depending on which runs you compare, Linux was roughly 7% faster (the first Linux test was probably so much slower because Python gained some nice performance improvements in 3.7 compared to 3.6).
However, my subjective impression - and this is what made me curious - is that FreeBSD feels much snappier with ZFS compared to Linux. My workflow with personal data is, however, quite limited: editing pictures and showing them to friends.

I used FreeBSD as a desktop in the 4.X and 5.X days, then switched to Linux, and now I am back. Professionally, I have used FreeBSD since 4.X, but only on servers.
 
What part of the benchmark is file reading, what part is calculating checksums? Let me make an educated guess: the disks themselves can run at 100-200 MByte/second in hardware (throughput for large IOs). The "large IO" assumption is pretty reasonable, given that your mean file size is about 8 MB. Let's assume you are using RAIDZ1 encoding, in which case at any given moment only two disks are reading. And let's then assume that ZFS random seeks and other file system overhead eats half of the disk performance. That gives us 100-200 MByte/s of file system throughput. At that speed, it should take 5-10 seconds per GB, or 5000-10000 seconds per TB, or 26050-52100 seconds for your data set. That happens to be 7.2 to 14.4 hours. So of the elapsed time of 14.8 hours, roughly between half and all should be disk/filesystem, and the rest probably checksumming. Which means that your benchmark might be disk limited, or it might be exactly 50:50 balanced between disk and CPU.
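For anyone who wants to redo that arithmetic, here is the same estimate in a few lines of Python (every figure below is an assumption from the paragraph above, not a measurement):

# back-of-envelope check of the estimate above
data_tb = 5.21               # size of the data set in TB
per_disk_mb_s = (100, 200)   # assumed per-disk hardware throughput range, MB/s
data_disks = 2               # 3-disk RAIDZ1: two disks carry data per stripe
overhead = 0.5               # assume seeks and filesystem overhead eat half

for mb_s in per_disk_mb_s:
    fs_mb_s = mb_s * data_disks * overhead        # effective filesystem throughput
    hours = data_tb * 1_000_000 / fs_mb_s / 3600  # 1 TB ~ 1e6 MB
    print(f"{fs_mb_s:.0f} MB/s -> {hours:.1f} hours")

# prints roughly 14.5 hours at 100 MB/s and 7.2 hours at 200 MB/s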

That brings up the question: What program did you use to do the md5sum and sha256sum? Did you do that in Python? Can you try replacing that with a well-optimized version? Or changing your script to do checksum processing and reading in parallel (begin reading the next file while the first one is still being checksummed)? With multiple cores, you could use that to untangle what your bottleneck is.
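To sketch what I mean by overlapping the two (purely illustrative; the mountpoint and worker count are made up): hashlib releases the GIL while hashing large buffers, so a few threads are enough to keep both the disks and several cores busy.

import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

ROOT = "/tank/photos"     # made-up mountpoint
CHUNK = 1024 * 1024

def hash_file(path):
    md5, sha = hashlib.md5(), hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            size += len(chunk)
            md5.update(chunk)
            sha.update(chunk)
    return path, md5.hexdigest(), sha.hexdigest(), size

paths = (os.path.join(d, n) for d, _, names in os.walk(ROOT) for n in names)
with ThreadPoolExecutor(max_workers=4) as pool:   # a handful of threads is plenty
    for path, md5sum, sha256sum, size in pool.map(hash_file, paths):
        pass  # store the row in the sqlite database as in the sequential version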
 
It's just a very basic sequential script reading each file and using hashlib ... hashlib is compiled code and works almost as fast as the system tools md5 or sha256.
I did not really look for a bottleneck, since I did not expect any particular outcome; I was just curious.
I have already planned a threaded and an async variant of that script and will report here (FreeBSD only, though, since I don't want to reboot and reformat the drives for Linux again).
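In the meantime, a quick single-threaded micro-benchmark like this rough sketch (the numbers obviously depend on the CPU) shows how many MB/s hashlib alone can sustain, which can be set against the disk estimate above:

import hashlib
import time

buf = b"\0" * (64 * 1024 * 1024)      # 64 MiB buffer of zeros
for name in ("md5", "sha256"):
    h = hashlib.new(name)
    start = time.monotonic()
    for _ in range(16):               # hash 1 GiB in total
        h.update(buf)
    elapsed = time.monotonic() - start
    print(f"{name}: {16 * 64 / elapsed:.0f} MB/s single-threaded")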
 
I started some serious performance tests a few days ago. I am testing FreeBSD/geli + ZFS, FreeBSD/geli + OpenZFS, FreeBSD/encrypted-openzfs, Linux/LUKS openzfs-0.8.4, Linux/encrypted-openzfs-0.8.4 on 1-disk-ssd, 1-disk-hdd, 2-disk-hdd. I think I will start blogging again and will post a link to the results, but the benchmarks are not finished - still quite some time to go.
 
Since I have noticed quite some difference, I am not sure if I did everything right. I am wondering why Linux/LUKS is so much faster than FreeBSD/geli. aesni.ko is loaded, then I do:

geli init -P -K /root/geli.key -e aes-xts -l 128 -s 4096 ada0p5
geli attach -p -k /root/geli.key ada0p5
zpool create -f test /dev/ada0p5.eli
zfs set compression=lz4 test
zfs set atime=off test

Since the default ashift value on Linux is 9, I also kept that on FreeBSD, knowing that it is not optimal. Linux/LUKS is roughly 2.5 times faster than FreeBSD/geli with openzfs-kmod. I will also run the benchmark again with ashift=12 and will tinker around with the geli sector size. I don't want to manipulate the benchmark; however, I want to know the fastest option for each configuration, so I am asking for ideas to improve the performance: maybe some configuration with gnop, or is gbde faster? I doubt that changing gsched would improve anything, but I am open to suggestions. I am using fio as benchmark tool, which has served me well over the years.
 
ZFS is very complicated, and it does its IO in a very strange way, being mostly a log-structured file system: writes tend to be sequential, reads can be dangerously random. It could be that this happens to play really nicely with LUKS and not so nicely with Geli, because of things like prefetching, buffer sizes, how they handle partially filled buffers, and so on.

Suggestion: Do a baseline performance test of just LUKS versus Geli without a file system (just the block device), or with a trivial file system (like FAT, a.k.a. MS-DOS). Try both sequential and random, reads and writes. If you already see giant differences there, then the result probably has nothing to do with ZFS and comes from underlying differences. If the two still look comparable, then do a test of ZFS *without encryption* on the two systems. This is a lot of extra work, but it will help you untangle where the difference comes from.
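fio works fine directly on the device nodes for that baseline; if you want a quick sanity check for the sequential-read case without writing a job file, even a trivial timing loop like this sketch will do (the device path is just an example, e.g. /dev/ada0p5.eli on FreeBSD or the /dev/mapper node on Linux, and repeated runs may be served from the page cache):

import os
import sys
import time

DEVICE = sys.argv[1] if len(sys.argv) > 1 else "/dev/ada0p5.eli"  # example path
CHUNK = 1024 * 1024            # 1 MiB reads
TOTAL = 4 * 1024 ** 3          # stop after 4 GiB

fd = os.open(DEVICE, os.O_RDONLY)
start = time.monotonic()
done = 0
while done < TOTAL:
    buf = os.read(fd, CHUNK)
    if not buf:                # reached the end of the device
        break
    done += len(buf)
os.close(fd)
elapsed = time.monotonic() - start
print(f"{done / 1024 ** 2 / elapsed:.1f} MB/s sequential read from {DEVICE}")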
 

Two things:
  • LUKS defaults to 512-byte sectors; you don't list your LUKS commands here, but there's a possibility you have a better ashift/crypto-sector-size match with LUKS and ashift=9. If you're using (as above) 4k sectors for GELI, you really need ashift=12, otherwise you're forcing significant extra work on every small read/write. (Running ashift=12 on top of 512-byte crypto sectors isn't as big of a deal, but it's not optimal.)
  • Those might be SMR drives. If they are, beware that ZFS has been known to not play nicely with some drive-managed SMR drives.
 
Thank you both for your input! I will do some benchmarks on the raw block devices provided by geli and LUKS; that's a good idea. During the night the ashift=12 tests started and I have inspected some of the results; however, the figures are a little bit worse than with ashift=9. This is all on an old notebook with an SSD. The tests on the HDDs (conventional magnetic recording; 500GB enterprise class from an old SUN server) of course take much longer, and in addition I am doing more tests (I just have one SSD to test with, but two HDDs, so I am testing stripe, mirror and raidz1). Unencrypted tests are also included (they are easy since there are no combinations of encryption algorithm/key size) ... however, I am just running each test once, since it already takes quite a long time and, without a specific workload, I don't need a scientific-grade test.
 
Could it be that the encryption is not done the same way on Linux and FreeBSD? Maybe Linux uses your CPU's hardware encryption facilities and FreeBSD does not? Trying to insert an I/O scheduler seems like a good idea to me. I had much smoother system behaviour with it on UFS; with ZFS I never tried it. You can grab my service script to integrate gsched(8) into [/usr/local]/etc/rc.d/, driven by entries in rc.conf(5)[.local]. The usual disclaimer applies: it worked for me on FreeBSD 10/11, YMMV. Should you use it and find any improvements or corrections for 12.1, please post them in that thread.
 
Thanks! I think I will try different schedulers ... the benchmarks are consuming lots of time anyway, so a little bit more work is OK ;-) and I am curious, of course. Thanks for the script! I am quite sure FreeBSD uses the crypto facility, because in my first benchmarks aesni was not loaded into the kernel and the figures were quite bad.
 
I guess the RR scheduler will be the first choice, and I don't see a way to insert it into the vdev. Adjust the geom_sched_rr_* knobs for the throughput of your vdev or disk(s). AGAIN: I never tried this with ZFS! Back up your data first...

!EDIT!: I think the service script supports

service geom_sched rcvar

This might help.
 
Thanks for the tip ;-) Don't worry, my disks with data on them are unplugged, and the notebook is a dedicated test notebook without any data on it.
 