The OS installation part of the root partition is indeed usually not performance critical. The exception is that on single-user machines and small servers, users will sometimes keep their user data on the root file system and partition too, which makes separate management of performance (and other service-level criteria) for that data impossible.
ZIL writes metadata, which is much worse than sequential writes, and the sync option forces it to always interrupt sequential writes?
Much more than just metadata is written to the ZIL. Most importantly, it records whole transactions, which usually contain data as well as metadata (for example, updates to a file's attributes, the allocation structures, and the directory content if it has to match). It turns out log-style writes are actually pretty efficient even on spinning HDDs, because the head typically doesn't have to move.
Add to that the fact that deep down, ZFS is a log-structured file system, with log cleaners, and it works significantly differently from direct-allocating (block- or extent-based) file systems. In particular, log-structured and CoW file systems have different performance characteristics from traditional "overwrite in place" systems.
The sync option (which can be requested as a system call, as an option on an open file, or as a mount point / dataset option) simply forces writes to actually be somewhere on disk. That has costs, and users have to think through the cost/benefit analysis of whether they want to use it. When I say "users", I mean at the system call level; user-space applications (such as rsync or UI-based file managers) may hide what is really going on, often rather too well.
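To make the first two of those levels concrete, here is a minimal Python sketch of sync semantics at the system call layer: an explicit `fsync()` after a write, and the `O_SYNC` flag that makes every `write()` synchronous. (The mount/dataset-level option works the same way underneath, just applied to everything.)

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "demo.bin")

# 1. Explicit system call: write, then force the data to stable storage.
with open(path, "wb") as f:
    f.write(b"important data")
    f.flush()              # move user-space buffers into the kernel
    os.fsync(f.fileno())   # block until the kernel has committed to disk

# 2. Per-file option: O_SYNC makes every write() synchronous on its own.
fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_SYNC)
os.write(fd, b" -- appended synchronously")
os.close(fd)

with open(path, "rb") as f:
    content = f.read()
```

The cost is easy to see empirically: each synchronous write stalls until the device acknowledges it, which on an HDD means waiting for the platter, and which is exactly the latency a dedicated log device is meant to absorb.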
ZFS even has an option to have a dedicated ZIL drive (sounds like an enterprise-level configuration) ...
Having a dedicated log device (a separate ZIL, known as a SLOG in ZFS) or cache device (L2ARC) on separate hardware is a way to optimize performance. When the main disk storage is on HDDs, this can become very important for good performance. If the main storage is on media such as SSDs that are already fast for small and random reads and writes, the need for a separate log or cache device is diminished, although not always eliminated.
SSDs also have this crazy thing where they have fast burst writes (something like they do a "superficial" voltage write of 3 bits as 1 at first on some special subset of space [which is fast] and then they do like a "deep" voltage write of 1 bit for each 1 bit [which takes longer and is a major slow down]). Or something like that. Correct me if I'm wrong please.
What you describe is essentially the SLC write cache many drives use: incoming data is first written quickly at one bit per cell into a reserved area, and later folded into denser multi-bit cells in the background. That happens at a much lower level, when individual "bit" cells are written (today each cell tends to hold 2 to 4 bits).

On top of that, SSDs internally run a pretty complex "file system" of their own, known as the FTL (flash translation layer). The reason: on the NAND flash hardware, reads are possible and efficient anywhere and at any size, but writes are only possible into blocks that have been recently erased. Yet the layer above the SSD (the host's LVM and file system layer) thinks it can write individual sectors anywhere. So the SSD interposes its own virtualization: it internally redirects writes to erased blocks (sometimes more than one, to get hardware parallelism), and keeps track of each sector's location and liveness. That allows it to perform internal log cleaning and erase enough blocks to keep writes going reasonably quickly.

What makes this complex is that flash storage has a limited lifespan (counted in erase cycles), so the FTL has to perform "wear leveling" to make sure no part of the SSD is overused. And when placing newly written blocks, it has to make sure that the eventual log cleaning remains efficient, without requiring lots of internal copying. In summary: write performance of SSDs is very complex.
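The core FTL idea (out-of-place writes plus a logical-to-physical map) fits in a few lines. This is a toy sketch only; block geometry and the class name are invented for illustration, and real FTLs add garbage collection, wear leveling, parallelism, and power-loss safety:

```python
# Toy FTL: logical sectors are never overwritten in place. Each write
# goes to the next free page of an erased block, and a mapping table
# tracks where the live copy of every logical sector currently lives.

PAGES_PER_BLOCK = 4   # illustrative geometry, far smaller than real NAND

class ToyFTL:
    def __init__(self, num_blocks):
        self.flash = [[None] * PAGES_PER_BLOCK for _ in range(num_blocks)]
        self.free_blocks = list(range(num_blocks))
        self.map = {}                              # logical sector -> (block, page)
        self.cur = self.free_blocks.pop(0)         # block currently being filled
        self.next_page = 0

    def write(self, lba, data):
        if self.next_page == PAGES_PER_BLOCK:      # current block is full
            self.cur = self.free_blocks.pop(0)     # (real FTLs run GC here)
            self.next_page = 0
        self.flash[self.cur][self.next_page] = (lba, data)
        self.map[lba] = (self.cur, self.next_page) # the old copy is now dead
        self.next_page += 1

    def read(self, lba):
        block, page = self.map[lba]
        return self.flash[block][page][1]

ftl = ToyFTL(num_blocks=8)
ftl.write(0, "v1")
ftl.write(0, "v2")   # a rewrite goes to a new page; only the map changes
```

Note how the second write to sector 0 leaves the "v1" page physically in place but dead; reclaiming such dead pages efficiently is exactly the log-cleaning problem described above.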
To be fair, I have never lost data on BTRFS, except "negligible" data due to missing syncs.
Data loss on storage systems is also a charmingly interesting topic. By far the single most important source of data loss is user error; an accidental "rm *" is the textbook example. The most important thing to do to protect data is to guard against that, using things like backups, snapshots, and write protection or other permission mechanisms.
The second biggest source of data loss is software bugs, in anything from user-level software to the internal firmware of storage devices (whether hard disks, SSDs, disk arrays, HBAs, and so on), with the OS-level storage stack (file system, RAID, LVM) having a fond place in my heart. Even there, many errors are actually caused by humans. Some of my favorite examples include certain Windows versions helpfully adding partition tables to any disk they found on the SAN that looked unformatted (overwriting every disk on a cluster file system!), and clueless sysadmins formatting some disk they thought was connected to their host with a Reiser file system (leading to bad jokes about "murders your wife and your supercomputer"). For a while, ReiserFS was a darling of the high-performance community, for reasons ...
Data loss from actual hardware failure (disk errors, complete disk failure, SSD wear-out) can be completely mitigated today, if the user/administrator cares. There is a reason cloud providers can advertise a data loss rate of 10^-11 in good conscience: their systems meet those objectives. The biggest remaining cause of "hardware" data loss is either human error or catastrophic failure. Two anecdotes (both sadly true) about that:

A former employer carefully measured when field service caused data loss, and the single biggest cause was a service technician who, while replacing a failed disk, by mistake pulled one of the surviving GOOD disks out of the system and threw it in the trash. This was the biggest driver towards making disk arrays double fault tolerant (the second fault is caused by the service technician trying to fix the first fault).

On redundancy, there is the (extremely sad) story of a large IBM mainframe customer who had their main data center in New York's World Trade Center. They had a complete backup data center with copies of all data; alas, it was in the other tower of the World Trade Center. That's why "off-site" backup needs to consider the blast radius of environmental events, such as fire or flood.
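A back-of-envelope calculation shows why that double fault tolerance matters so much. All the numbers below (failure rate, rebuild time, group size) are illustrative assumptions on my part, not measurements from the anecdote:

```python
# Rough model: after one disk fails, data survives only if the
# remaining redundancy holds until the rebuild onto a spare finishes.

AFR = 0.02          # annual failure rate per disk (assumed 2%)
REBUILD_H = 24      # hours to rebuild onto a spare (assumed)
DISKS = 10          # disks in one redundancy group (assumed)

# Probability that one given disk fails during one rebuild window.
p_fail_during_rebuild = AFR * REBUILD_H / (365 * 24)

# Single parity: data is lost if ANY of the 9 survivors fails
# (or is pulled by the technician!) during the rebuild.
p_loss_single = (DISKS - 1) * p_fail_during_rebuild

# Double parity: TWO of the survivors must fail in the same window.
p_loss_double = ((DISKS - 1) * (DISKS - 2) / 2) * p_fail_during_rebuild ** 2

print(f"single parity: {p_loss_single:.2e}, double parity: {p_loss_double:.2e}")
```

Under these assumptions the double-parity group is several orders of magnitude less likely to lose data per incident, which is exactly the margin that absorbs a technician pulling the wrong drive.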