Presentation on ZFS by Kirk McKusick

In about 2 weeks, Kirk McKusick (one of the fathers of BSD) will give a presentation on ZFS, as part of the FAST conference (a scientific/research conference on file and storage systems), run by Usenix. It is in Silicon Valley, at the Hyatt hotel at the Santa Clara convention center. It is not free; registration is required. I don't know whether the slides will be available afterwards, nor whether video will be distributed.
 
I didn't actually get to see the whole presentation (darn paying job), but I managed to squeeze into the room for some parts. Clearly, it repeats much of what is written in his book "The Design and Implementation of the FreeBSD Operating System" (2nd edition). In particular, I was there for a short section where Kirk explained the pros and cons of ZFS versus FFS (remember, he is the father of FFS). Some fun statements, like about snapshots in FFS, and when he would use which file system. It's absolutely delightful to hear these things from the original source. (Even if I don't agree with everything; for example, he strongly recommends using ZFS only on systems with a 64-bit CPU and at least 16 GB of RAM, which is good advice for production systems, but for low-performance personal use I think he is exaggerating.)

Best part: He autographed my copy of the book.
 
Best part: He autographed my copy of the book.

I went to EuroBSDCon in Paris in 2017 and attended Kirk's FreeBSD course where I got him to sign my copy of the book. Then during the conference I bumped into George Neville-Neil and got him to sign it as well. Now all I need is Robert Watson's signature. Then it's going straight on eBay ;)
 
...a short section where Kirk explained the pros and cons of ZFS versus FFS (remember, he is the father of FFS). Some fun statements, like about snapshots in FFS, and when he would use which file system.
I'm curious what he has to say on this. My rule of thumb, based on my experience, is that it depends on the ratio of writes to reads, and past a certain ratio you're better off using UFS over ZFS. Keep in mind, though, that the servers I administer are dominated by file I/O, and that the databases I deal with are small and low-activity, so I'm not sure whether this holds true for servers in the opposite circumstances.

However, seeing that Netflix uses UFS and not ZFS on their cache servers has made me wonder. I'm not working with anything at their scale, but I'm sure there are reasons they do so, and those servers must have far more reads than writes.
 
Summary of what I remember of Kirk's pros and cons: He says ZFS needs a 64-bit CPU and 16 GB of RAM. He recommends FFS for smaller systems, in particular laptops. For production use at reasonable performance (disk-limited), he's right. For small systems with low performance demands and a small working set, I think that requirement is excessive: ZFS will be safe but slow on small systems. Would I use it on a laptop with a GUI? Probably not, because the working set of modern desktop usage is broad, the storage devices in laptops tend to be fast (NVMe or SSD), and humans have little tolerance for low performance. But on a small server, I think it will work fine.

I'm going to ignore the obvious administrative advantages of ZFS, like multiple disks, multiple file systems in a single pool, infinite snapshots, and overall integration. There is also no need to talk about the obvious safety aspects of the no-overwrite design and checksums, or about dedup and compression.

He points out the obvious: ZFS has, on average, very good write performance, even for small files and metadata updates (for example, directory operations). He recommends it for web servers, which often write many small files. The flip side is not-so-good read performance, in particular for large files, which are not laid out sequentially on disk. So not for HPC. And if you write a file slowly (for example, log files that are appended to a little at a time), or if you modify it in place, in particular in the presence of snapshots, the file will get very fragmented. So don't use it for (unusual) workloads that do lots of in-place modification of files. What about databases on ZFS? I was only in his talk for a short time, and I don't know.
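To make the two workloads concrete, here's a minimal sketch (file names, counts, and sizes are invented for illustration) contrasting the pattern ZFS likes with the one it doesn't:

```python
import os

# Pattern ZFS handles well: many small files, each written once and closed.
# Copy-on-write lets ZFS batch these into large sequential writes at the
# next transaction-group commit.
def write_many_small_files(directory, count=10_000, size=4096):
    os.makedirs(directory, exist_ok=True)
    for i in range(count):
        with open(os.path.join(directory, f"obj-{i:05d}"), "wb") as f:
            f.write(os.urandom(size))

# Pattern ZFS handles badly: one file appended to slowly, synced between
# appends. Each small append tends to land in a different transaction
# group, so the file's blocks end up scattered all over the pool.
def append_slowly(path, appends=10_000, chunk=4096):
    with open(path, "ab") as f:
        for _ in range(appends):
            f.write(os.urandom(chunk))
            f.flush()
            os.fsync(f.fileno())  # forces each small chunk out on its own
```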

He pointed out that write performance and fragmentation depend heavily on how full the file system is. That's obvious for all file systems, and even more pronounced for log-structured ones like ZFS, but I was surprised by how low his recommended limit is: he said to keep it under 75% full for write performance, in particular in the presence of fragmentation.
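If you want to watch for that threshold mechanically, here's a minimal sketch (the /tank mount point and the 75% ceiling are just illustrative; shutil.disk_usage is generic and not ZFS-aware, so pool-level zpool list numbers may differ a bit):

```python
import shutil

def usage_fraction(mountpoint="/tank"):  # mount point is an assumption
    usage = shutil.disk_usage(mountpoint)
    return usage.used / usage.total

if __name__ == "__main__":
    frac = usage_fraction()
    if frac > 0.75:  # Kirk's suggested ceiling for good write performance
        print(f"WARNING: file system is {frac:.0%} full; expect degraded writes")
    else:
        print(f"file system is {frac:.0%} full")
```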

Next: RAID. ZFS's RAID implementation is hands down the best you can get on a small system, because it avoids the "RAID hole", where parity-based RAID updates are either very slow or very unsafe. In my opinion, the only other RAID implementation that's safe to use is hardware RAID with dedicated NVRAM on the RAID controller. So I agree completely with him. He then made the same argument I always make: rebuild (after a disk failure) is faster on ZFS than on traditional RAID, because ZFS doesn't have to rebuild areas that are not allocated to files. From this viewpoint, ZFS is better than any RAID system that's not integrated with the file system. The part that surprised me is that rebuild performance drops as the file system gets full, and for very full disks it can get very slow, worse than I expected. That's the big reason he recommends keeping your ZFS file system not too full; the good range is 50-75%.
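As a back-of-envelope illustration of why rebuild time scales with allocated space rather than raw disk size (the disk size and throughput figures are invented for the example, and the sketch assumes sequential throughput; a full, fragmented pool resilvers slower still because the I/O becomes random):

```python
def resilver_hours(disk_tb, fraction_full, mb_per_s=150):
    """Rough lower bound: ZFS only copies allocated data, so resilver
    time scales with how full the pool is, not with raw disk size."""
    bytes_to_copy = disk_tb * 1e12 * fraction_full
    return bytes_to_copy / (mb_per_s * 1e6) / 3600

# A traditional RAID rebuild copies the whole 10 TB disk (~18.5 h at
# 150 MB/s); a half-full ZFS pool resilvers roughly half of that.
print(f"{resilver_hours(10, 0.5):.1f} h vs {resilver_hours(10, 1.0):.1f} h")
```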

Interesting observation from his notes: He writes that he always gets complaints that FFS (not ZFS) needs 5% free space. He has no sympathy for the people who complain about that.
 
He points out the obvious: ZFS has, on average, very good write performance, even for small files and metadata updates (for example, directory operations). He recommends it for web servers, which often write many small files. The flip side is not-so-good read performance, in particular for large files, which are not laid out sequentially on disk.
That's interesting, as my experience has been somewhat the opposite.

For instance, one of the servers I administer records HD video from about two dozen security cameras, 24/7. I originally ran Linux on it with ext4, not out of any love for Linux (I would much rather have run FreeBSD), but simply because the recording software, ZoneMinder, didn't run on FreeBSD. Some time later, after many headaches updating ZoneMinder, I switched to other recording software that thankfully runs under FreeBSD. I knew ZFS was probably a bad idea for this workload, but I wanted to try it anyway. Sure enough, after a few weeks, performance became abysmal, especially when the software automatically deletes old footage to make room for new footage.

Note that this server never shared physical disks between the OS and the footage, and it has a very high write:read ratio, well above 1:1. In fact, looking at it today, it's over 1100:1 in bytes/s and around 2650:1 in ops/s. Yes, much of the recorded footage is never read back, as it's typically watched live. Additionally, few if any ZFS features were used on the footage disks: no compression, no snapshots, no atime; nothing beyond fletcher4 and zpool mirroring.

Since I had kind of expected this problem, I just switched to UFS2 with gmirror, and all the problems disappeared as if by magic. Old footage is deleted in a fraction of a second instead of half a minute, and when we do need to pull up recorded footage to view it, it happens much more quickly. So as much as I like ZFS, I know it's not the tool for every situation, which is why I'm glad FreeBSD continues to support UFS.
 
Your problem MIGHT have been caused by the application writing slowly, which is one of the vulnerable spots of a log-structured (or append-only) file system such as ZFS that does not use extent allocation. Maybe your video files are kept open, the program writes only another few kB to them at a time, and the automatic fsync (checkpoint generation) frequently lands between writes. That will cause extremely high fragmentation, where a file consists entirely of tiny 4 KiB blocks scattered all over the disk. In such a scenario, the metadata required to describe the file becomes huge, and metadata operations (such as delete) take a long time.
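To put rough numbers on that metadata blow-up, here's a small sketch (the 10 GiB file size is invented for illustration; the block sizes follow the 4 KiB worst case above and ZFS's common 128 KiB default recordsize, and 128 bytes is the size of ZFS's on-disk block pointer, blkptr_t; indirect-block levels are ignored):

```python
def block_pointers(file_bytes, block_bytes):
    """Number of leaf block pointers needed to map a file stored
    entirely in blocks of the given size."""
    return -(-file_bytes // block_bytes)  # ceiling division

file_size = 10 * 1024**3  # a 10 GiB video file, for illustration
BP_SIZE = 128             # bytes per ZFS block pointer (blkptr_t)

for blk in (4 * 1024, 128 * 1024):
    n = block_pointers(file_size, blk)
    print(f"{blk // 1024:>3} KiB blocks: {n:>8} pointers, "
          f"~{n * BP_SIZE / 1024**2:.1f} MiB of leaf metadata")
```

At 4 KiB blocks that's roughly 320 MiB of block pointers for one file, versus about 10 MiB at 128 KiB blocks, and every one of those pointers has to be walked and freed when the file is deleted, which fits the slow deletes described above.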

Note that this workload is very different from one that creates many small files.

I agree that it is good that we have a choice between ZFS and UFS. It would be better to have one file system that can handle both types of workloads very efficiently, but that's very hard to accomplish without a huge amount of engineering, probably more than one can expect to be invested in a FOSS file system.
 