Interesting musings on a potentially bad bug in ZFS, some type system thoughts - and Rust

cracauer@

Developer
Bug:

Summary of discussion on bug:

This is Rob Norris, who just worked around the latest GPL-only interface assault from the Linux kernel on ZFS. He's with Klara.

Yes, he mentions Rust :)

End notes:
"So that’s a lot of words about why it’s hard to do a lot of things, but I’ve also said a lot of times that it’s important to me that I do something. So what am I looking at?

In the immediate, static analysis. As I said above, it’s currently a bit of a mixed bag, but that doesn’t mean we can’t do anything at all. It turns out we already do have a little bit of CI-driven linting based on Cppcheck but it covers very little of the codebase currently because most of the kernel-specific code is difficult for other tools to work with, so it wasn’t fully integrated yet (to the extent that I didn’t even know about it). I’ve had a look at it and it doesn’t look very complicated to get running regularly against more of code, so that’s the place to start.

After that, I think it’s just the continuing refactoring and cleanup that I’ve been doing for a couple of years already. I have some huge plans and dreams for OpenZFS, but most of them are stymied by the code not being particularly modular, so difficult to use in other contexts. Some kinds of testing, like unit testing, are difficult for us for the same reason, so continuing that work will make that easier too.
"
 
This is the kind of thing that would be found in a GOOD code review. By that I don't mean the common code review required in the CI flow of typical modern software processes, which is 99% concerned with "short functions, small commits, coding style, naming conventions" and is mostly a tool to create sociological uniformity. I mean a code review where the reviewer walks the logic of every function de novo (a legal term meaning: from scratch, without using assumptions). One technique for finding this bug is to mentally apply RAII and data flow analysis to it. Why is psize defined but not initialized? Check whether it is used before being set: good, it is not. Can we rewrite the function so psize is only defined when we have the value? It turns out we can, except the assert may have to be moved a few lines down. Then apply the same logic from the back of the function: are there any dead stores? The obvious one is discussed in detail. But interestingly, the cols variable also has a dead store: it is set when defined (good), but that value is then ignored and it is set again. By simply reordering the variable definitions, one can get rid of that one too.
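For concreteness, here is a minimal, invented C sketch of that exercise (not the actual OpenZFS function; the names psize and cols merely echo the discussion, and the signature, arithmetic and assert are made up for illustration). The "before" version has the uninitialized definition and the dead store; the "after" version defines each variable only once its value is known.

/*
 * Invented sketch of the review exercise; not the actual OpenZFS code.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Before: psize is defined but not initialized, and cols carries a dead
 * store (it is set when defined, but that value is never read).
 */
static uint64_t
asize_to_psize_before(uint64_t asize, uint64_t ashift, uint64_t nchildren)
{
	uint64_t psize;                     /* defined, not initialized */
	uint64_t cols = nchildren;          /* dead store: never read */

	assert(nchildren > 1);

	cols = nchildren - 1;               /* cols is set again here */
	psize = (asize >> ashift) * cols;   /* first real assignment to psize */
	return (psize);
}

/*
 * After: each variable is defined at the point its value is known, so
 * there is nothing left for a reader (or an analyzer) to track backwards.
 */
static uint64_t
asize_to_psize_after(uint64_t asize, uint64_t ashift, uint64_t nchildren)
{
	assert(nchildren > 1);

	uint64_t cols = nchildren - 1;      /* single definition, single store */
	uint64_t psize = (asize >> ashift) * cols;
	return (psize);
}

int
main(void)
{
	printf("%llu %llu\n",
	    (unsigned long long)asize_to_psize_before(1ULL << 20, 12, 4),
	    (unsigned long long)asize_to_psize_after(1ULL << 20, 12, 4));
	return (0);
}

The "after" shape is also the one a reviewer, a compiler warning such as GCC's -Wmaybe-uninitialized, or a tool like Cppcheck has the easiest time verifying.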

Another good way, as Rob points out, is to use static analysis tools. I'm particularly fond of Coverity (it came out of a Stanford research group and is now a commercial product). I have two reasons for liking it: (a) I know the founder. (b) We used it on a large file system project I worked on, and it did great, but was super annoying: it found thousands of false positives (which ended up wasting roughly a man-year of both senior and junior engineers' time to deal with), it found hundreds of real bugs (which all got fixed), and of those, dozens had data-loss potential. It also cost a 5-digit amount, but given that it objectively prevented a dozen real, dangerous bugs from getting out into the field, it was very much worth it. Rob complains about the false positives, and that is just an indication of a code base that has lots of cobwebs. The correct answer would be: pay for a commercial Coverity license, and require that every commit pass Coverity with NO errors. That would mean first spending a year clearing out all the false positives. Rob explains why the risk/reward situation of an open source project makes that impossible in the short run. He's not wrong ... and that is a fundamental issue with the funding flow and the cooperative (don't step on anyone's toes) culture of open source projects. A powerful BDFL could fix this, but I don't think ZFS has that person.
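To make the false-positive/real-bug distinction concrete, here is an invented C sketch (not code from OpenZFS or any real project) of the two shapes of finding a tool like Coverity typically produces: a genuine uninitialized read on one branch, and safe code that looks suspicious because the invariant making it safe is not visible to the analyzer.

/*
 * Invented examples of the two kinds of static analysis findings.
 */
#include <stdio.h>

/* Real bug: err is only assigned on one branch, then read unconditionally. */
static int
real_finding(int have_input, int input)
{
	int err;                       /* uninitialized */

	if (have_input)
		err = (input < 0) ? -1 : 0;
	return (err);                  /* possible use of uninitialized err */
}

/*
 * Helper that always sets *out on success.  In real code this often lives
 * in another file, where the analyzer cannot see (or prove) that guarantee.
 */
static int
get_value(int *out)
{
	*out = 42;
	return (0);
}

/*
 * Typical false positive: v is always initialized before use here, but a
 * tool that cannot prove get_value's contract may still warn about it.
 */
static int
false_positive(void)
{
	int v;

	if (get_value(&v) != 0)
		return (-1);
	return (v);
}

int
main(void)
{
	printf("%d %d\n", real_finding(1, 5), false_positive());
	return (0);
}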

Finally, here's my personal obnoxious comment: I don't believe in unit tests as a cure for problems like this. I always find that unit tests come in two flavors. One is a very simplistic test for functions that don't require a lot of underlying infrastructure; those typically don't test that the function does anything useful, or that it does what the caller really needs and expects; instead they test that the function does what the code says (duh). The other is for functions that require a lot of infrastructure (like this one, which needs vd, vdrz and txg, which are all complex); there the "unit test" mocks up a few simple scenarios, and passing those is taken to mean the function is good. Unit tests give a false sense of security, because (a) the requirements for a function are usually not spelled out clearly (lack of requirements analysis and design documents), and (b) unit tests can't explore a wide enough space of state combinations. For something like a file system (or anything else with stored state), I believe in intense system tests: run the software under heavy and bizarre workloads, on the largest test systems you can lay your hands on, inject real-world problems (network glitches, disk failures, bizarre operator commands, ...), and make sure the data is readable and that what is read back matches what was written. In the big computer companies this is done, and a software project typically has at least as many testers as software engineers (2:1 is my favorite tester-to-programmer ratio), and some of the largest systems are used for testing. In open source, the money is typically not there for this, and users end up being the guinea pigs.
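As a toy illustration of that write-then-verify approach, here is a sketch in C. The path, block size and block count are invented, and a real system test would add concurrency, fault injection, crash recovery and far larger data sets; the point is only the shape of the check: every block's content is derived from its index, so the verifier can recompute what should be on disk without storing a copy.

/*
 * Toy write-then-verify sketch; not a real file system stress tool.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE  4096
#define BLOCK_COUNT 1024
#define TEST_PATH   "/tmp/writeverify.dat"   /* made-up scratch file */

/* Deterministic per-block pattern: the block index seeds a tiny PRNG. */
static void
fill_block(uint8_t *buf, uint64_t blkno)
{
	uint64_t x = blkno * 6364136223846793005ULL + 1442695040888963407ULL;

	for (size_t i = 0; i < BLOCK_SIZE; i++) {
		x ^= x << 13; x ^= x >> 7; x ^= x << 17;   /* xorshift step */
		buf[i] = (uint8_t)x;
	}
}

int
main(void)
{
	uint8_t expect[BLOCK_SIZE], got[BLOCK_SIZE];
	FILE *f;

	/* Write phase: every block gets a pattern derived from its index. */
	if ((f = fopen(TEST_PATH, "wb")) == NULL)
		return (1);
	for (uint64_t b = 0; b < BLOCK_COUNT; b++) {
		fill_block(expect, b);
		if (fwrite(expect, 1, BLOCK_SIZE, f) != BLOCK_SIZE)
			return (1);
	}
	fclose(f);

	/* Verify phase: recompute the pattern and compare what came back. */
	if ((f = fopen(TEST_PATH, "rb")) == NULL)
		return (1);
	for (uint64_t b = 0; b < BLOCK_COUNT; b++) {
		fill_block(expect, b);
		if (fread(got, 1, BLOCK_SIZE, f) != BLOCK_SIZE ||
		    memcmp(expect, got, BLOCK_SIZE) != 0) {
			fprintf(stderr, "mismatch at block %llu\n",
			    (unsigned long long)b);
			return (1);
		}
	}
	fclose(f);
	printf("all %d blocks verified\n", BLOCK_COUNT);
	return (0);
}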
 
"Unit tests give a false sense of security, because (a) the requirements for a function are usually not spelled out clearly, and (b) unit tests can't explore a wide enough space of state combinations."
Not if the unit tests are done properly. They should be defined, and can to some extent be coded, during the sprint. After they have been implemented, they should be continually updated to catch any bugs that slip through later.
 
At a previous job we had 8-12 "verification" engineers for the one engineer who developed our first network ASIC.

I think we need more than unit tests. We need "sub-assembly" tests. Each unit (submodule) may pass all its tests but may still misbehave in combination with other such units if the assumptions in different modules don't match. But we don't yet know how to do this systematically.
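A toy illustration of that kind of mismatch (entirely invented, not taken from any real code base): each "module" below satisfies its own check in isolation, but the assembled call path silently mixes bytes and 512-byte sectors, which no per-module test can notice.

/*
 * Two invented "modules" whose individual checks pass but whose combined
 * assumptions do not match.
 */
#include <assert.h>
#include <stdio.h>

/* Module A: reports device capacity in BYTES. */
static unsigned long long
device_capacity(void)
{
	return (8ULL * 1024 * 1024 * 1024);     /* 8 GiB */
}

/* Module B: expects a capacity in 512-byte SECTORS. */
static unsigned long long
usable_sectors(unsigned long long capacity_in_sectors)
{
	return (capacity_in_sectors - 2048);    /* reserve 1 MiB worth of sectors */
}

int
main(void)
{
	/* Each module's own "unit test" passes ... */
	assert(device_capacity() == 8ULL * 1024 * 1024 * 1024);
	assert(usable_sectors(4096) == 2048);

	/* ... but the assembled call path silently mixes units. */
	unsigned long long wrong = usable_sectors(device_capacity());
	printf("usable \"sectors\": %llu (bytes passed where sectors were expected)\n",
	    wrong);
	return (0);
}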

I think as systems get more complex we need to figure out how to write resilient software, not brittle software that falls apart if there is one bug. But so far we don't know how to do this in a systematic manner (or even otherwise). I don't recall if I posted a link to a talk by Jerry Sussman about this, but here it is.

[edit:] One other thought. I think all software should be rewritten every few years! This is what the author of the K array programming language did. The idea is that you iterate and simplify until all excess fat is removed, but this is impossible to do when we keep adding new features.
 