Linus Torvalds Begins Expressing Regrets Merging Bcachefs

I don't know much about Rust yet, but I am wondering if this dependency problem is similar to the "DLL hell" problem Windows used to have. Their Common Language Runtime solved the problem for them (if I understand correctly) by allowing multiple versions of a dependency to be installed at the same time, and having the program that depends on it specify exactly which one it needs. I'll have to do more research before I can form an opinion about whether Rust could do something like that.
 
lgrant The classic approach with shared libs (.so) already allows multiple versions. A library typically has a version number with multiple components, like libfoo.so.1.2.3. It also has a SONAME property embedded in the binary, containing only part of the version number, most commonly just the first component, so here libfoo.so.1. The library is then installed with two symlinks:
Code:
libfoo.so.1.2.3
libfoo.so.1 => libfoo.so.1.2.3
libfoo.so => libfoo.so.1
When you upgrade to a newer major version, but need to keep the old one, you'd see something like this:
Code:
libfoo.so.1.2.3
libfoo.so.2.0.0
libfoo.so.1 => libfoo.so.1.2.3
libfoo.so.2 => libfoo.so.2.0.0
libfoo.so => libfoo.so.2
The source of a program needing libfoo will typically just link against libfoo.so and therefore get whatever that symlink points to. This adds a dependency in the binary on the SONAME, e.g. libfoo.so.2. A program compiled while libfoo.so pointed to the older major version will keep requesting libfoo.so.1 from the runtime linker.
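
If you want to see this mechanism at work, both sides are visible with standard tools. A rough sketch, assuming a hypothetical libfoo and a consumer binary someprog:
Code:
# build the library with an explicit SONAME (hypothetical example)
cc -shared -fPIC -Wl,-soname,libfoo.so.1 -o libfoo.so.1.2.3 foo.c

# the SONAME recorded inside the library itself
readelf -d libfoo.so.1.2.3 | grep SONAME

# the dependency recorded in a consumer that was linked with -lfoo
readelf -d someprog | grep NEEDED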

This approach works perfectly well as long as:
  • For every major version, you make sure to have the latest version of the library installed
  • The library author carefully versions the library: Additions are always ok, but whenever there is a breaking change, the part of the version that's in SONAME must be bumped
It is also very friendly for package management: in theory, you can package every library separately and upgrade a library package (for the same SONAME) independently of any consumers without ever causing breakage.

In practice, it happens from time to time that a library update breaks stuff without SONAME getting bumped. If this happens, it's an individual versioning error on the side of the library.

Now we enter the "wonderful" world of language-specific package managers (like nuget, cargo, npm, ...). They try to solve the problem by having every program specify the full exact version of its dependencies. Typically, the libraries are just linked statically into the final binary, or they're bundled on installation. If I'm not mistaken, Rust offers a way of dynamic linking, which only works if all components were built with the same Rust version, but that doesn't help much; you'd end up with tons of incarnations of the same library installed.

To "simplify" things, there's often the possibility to have a "vendor" subtree in your source repository, containing full copies of all your dependencies. This indeed simplifies the packaging work, but doesn't solve the other issues: you clutter the installation with a huge number of libraries (in the case of dynamic linking) or you install many unnecessarily fat binaries (in the case of static linking), and you completely lose the ability to upgrade a library independently of its consumers.

The latter is very relevant when a library has a security vulnerability. With the classic shared libs, you'd upgrade, let's say, the package libfoo2, replacing libfoo.so.2.0.0 with libfoo.so.2.0.1 which fixes the vuln. Every consumer requesting libfoo.so.2 will automatically use the fixed version. But if the exact library version is baked in everywhere, you'll have the maintenance nightmare of analyzing the dependency trees of all binaries and upgrading all of those affected.
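
For a concrete (hypothetical) example of what that exact pinning looks like with cargo: a Cargo.toml dependency can request an exact version, and `cargo vendor` is what copies full dependency sources into the repository. This is just a sketch; the crate name is made up:
Code:
[dependencies]
# "=1.2.3" pins the exact version; plain "1.2.3" would still allow
# semver-compatible upgrades
foo = "=1.2.3"
Running `cargo vendor` then drops complete copies of foo and all its transitive dependencies into a vendor/ directory, which is exactly the bundling described above.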
 
I have lost count of the number of times I have grown ext4 and xfs file systems on-line with lvextend. I'll admit that shrinking file systems on-line is dangerous, and I'd always prefer to do it off-line. But shrinking is extremely rare (read: never happens).
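
For reference, a typical online grow looks something like this (a sketch; volume group, LV and mount point names are hypothetical):
Code:
# grow the LV and let lvextend resize the file system in one go
lvextend -r -L +20G /dev/vg0/data

# or do it in two steps
lvextend -L +20G /dev/vg0/data
resize2fs /dev/vg0/data      # ext4 (takes the device)
xfs_growfs /srv/data         # xfs (takes the mount point)
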
Hope you chose the number of inodes correctly!
 
Hope you chose the number of inodes correctly!
ext[234] file systems have all sorts of hard coded limits, such as the number of inodes per file system mentioned above.
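
For instance (hypothetical paths and devices), you can check inode headroom on a live file system and size the inode table at creation time:
Code:
# how many inodes are used/free on an existing file system
df -i /var/spool/news

# create an ext4 file system with a denser inode table
# (-i = bytes of space per inode; the default is usually 16384)
mkfs.ext4 -i 4096 /dev/vg0/news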

Another common limit is the number of sub-directories per directory (which, I think, was doubled with ext4).

However, BSD ufs file systems generally have comparable limitations.

Running out of inodes on a V7 file system was not too difficult (e.g. on a net news server).

However, the limits for both ext[234] and ufs are pretty generous for most situations.

My practical experience of ufs and ext[234] is that it's unusual for hard coded limits to be an operational issue.
 
Remember when sizeof(int) was 32 bits? That drove a lot of sizing in the metadata structures of filesystems like UFS. Then sizeof(int) wound up at 64 bits, so there was a transition to using 64 bits in the structures. I don't recall when, but UFS went from 32-bit to 64-bit inode numbers some time ago.
 
mer, FreeBSD uses, like most "unixy" systems, the LP64 data model. Windows even uses LLP64. int has 4 bytes in both. ILP64 (where int has 8 bytes) is a pretty rare choice.
Code:
$ uname -spr
FreeBSD 14.1-RELEASE-p3 amd64
$ lldb
(lldb) p sizeof(int)
(unsigned long) 4
(lldb) p sizeof(long)
(unsigned long) 8
(lldb) p sizeof(void*)
(unsigned long) 8
 
Yep, I was thinking more of the inode structure used by UFS. As devices got bigger, there was a change in the size, maybe from 32 bits to 64 bits.
 
Depending on how the underlying storage system is implemented, the workload presented by ZFS's CoW style writing can either lead to terrible performance, or to great performance. The thing is that the average user doesn't know ahead of time which it is going to be.

So this means you better use UFS for a virtual server?
But if you want full disk encryption and do not want to set it up manually, the installer only sets it up for you on the ZFS path.
The bsdinstall should get an encryption option for UFS, too.

2.6.4. Guided Partitioning Using Root-on-ZFS
 
So this means you better use UFS for a virtual server?
I don't know. Benchmark it yourself, so much depends on the workload.

But if you want full disk encryption and do not want to set it up manually, the installer only sets it up for you on the ZFS path.
The bsdinstall should get an encryption option for UFS, too.
I'm not disagreeing that this is a valid request, but I don't know how much effort is available to improve the installer, and whether this request is important enough compared to other tasks.
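
In the meantime, the manual route is roughly the following GELI-on-UFS workflow (only a sketch; device name, key length and sector size are hypothetical, and the boot/loader wiring is not shown):
Code:
geli init -e AES-XTS -l 256 -s 4096 /dev/ada1p2
geli attach /dev/ada1p2
newfs -U /dev/ada1p2.eli
mount /dev/ada1p2.eli /mnt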
 
I don't know. Benchmark it yourself, so much depends on the workload.
Yes, way to go. My ZFS server is just a storage server.

Some VMs get storage provisioned as zvols through iSCSI to a virtualisation server.
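
Creating such a zvol is a one-liner; the resulting block device is what then gets exported as an iSCSI LUN (pool/dataset names and size are hypothetical):
Code:
# sparse 100G volume; the block device appears as /dev/zvol/storage/vm/disk0
zfs create -s -V 100G storage/vm/disk0
On FreeBSD that /dev/zvol/... path is what you point ctld (via /etc/ctl.conf) at as the LUN backing store.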

There's enough layers of complexity there to do anyone's head in.

So, in the end, a benchmark is the way to go, assuming you care about performance -- sometimes I just want convenience.

[However, the important VMs are all provisioned from non-CoW file systems on an SSD mirror local to the virtualisation server.]
 
Why do people not trust BTRFS? I tried it a few times but couldn't be bothered to make use of it, but now, after using ZFS and needing Linux on the side, BTRFS sounds like a really nice option with RAID.
Because BTRFS is fundamentally broken by design.

Chris Mason is a good engineer, but he completely missed the point when he was trying to compete with ZFS.
And so is the bcachefs developer. People in Linux just don't understand ZFS enough.

Just to set things straight:
ZFS was not designed to be the file system of your operating system. Using it like that was a SUN marketing move to push Project Indiana.
ZFS was designed to give SUN a SAN (pun not intended LOL) offering comparable to IBM, NetApp, EMC (now Dell) etc.

It was part of a larger goal, which included COMSTAR (Solaris' FC/iSCSI/Infiniband management framework) and FMA (Fault Management Architecture), called Project Fishworks, which is now known as the Oracle ZFS Storage Appliance, also included in Oracle SPARC SuperCluster as part of their hyperconverged solution.

Also, at that time Solaris lacked advanced local storage capabilities, like an LVM, which were already present in HP-UX and AIX. People who ran Solaris most of the time adopted VERITAS for that purpose.
So SUN had no solution either for providing LUNs to a data center or for aggregating those LUNs once they were imported into a Solaris server.

That's why they designed ZFS by combining Logical Volume and RAID management with file system capabilities.
Linux developers (and Mason in particular) didn't understand that, and called it a rampant layer violation.
For them, ZFS was just a file system with snapshot capabilities that simply did too much. And they are not the only ones who think that way.
But snapshots are just a normal feature of a *storage* solution. Every storage system can take snapshots, because snapshots are what gets replicated to a DR site. And that's why ZFS has send/recv.
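
To illustrate, the whole DR replication story boils down to something like this (dataset and host names hypothetical):
Code:
zfs snapshot storage/data@replica-2024-09-20
zfs send storage/data@replica-2024-09-20 | ssh dr-host zfs recv backup/data
Subsequent runs can use an incremental send (zfs send -i) so only the blocks changed since the previous snapshot cross the wire.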

Ironically, Mason was working for Oracle at that time, and he "surprisingly" left the company after the Oracle-SUN merger. He's now working at Facebook, where they use BTRFS for hosting containers, invalidating the main point of using it instead of ZFS (SAN-based storage).

When he designed BTRFS he made some irreparable mistakes (some technical, some reputational) that caused its doom:
  1. He didn't provide a stable on-disk format from the beginning, which mattered especially because Linux was a fast-moving target. Therefore, an upgrade of the Linux kernel could automatically upgrade the BTRFS pool metadata, making the pool unusable with previous kernel versions. There was no "btrfs upgrade" step like in the zfs/zpool commands. While this is not a problem anymore, since the on-disk behavior of BTRFS no longer changes, it caused a *LOT* of distrust.
  2. It was quite unstable for a long time and IT STILL IS in some mission-critical configurations. BTRFS suffered from a lot of corruption bugs, and it is still known to be unsafe in RAID5/6 configurations. And RAID5/6 are MANDATORY when we talk about storage. No one would be so out of their mind as to build a RAID0/1/10 in an enterprise environment. I wouldn't even do that in my NAS. Both striping and mirroring are only useful if you don't have critical data.
  3. The subvolume concept was buggy for a LONG time. While in ZFS every dataset has always been a semi-independent file system, where just a handful of properties (quota, for example) are forcefully inherited by child datasets from their parents, in BTRFS every subvolume (even unrelated ones) HAD to share the same properties. To give an example, if you set the compress flag on one subvolume, it was automatically set even when you mounted another subvolume. Now it seems they have fixed it, but that was a *ridiculous* bug that caused distrust.
  4. Volume management nonsense. This also applies to bcachefs. Why in the hell do I have to mount a STORAGE POOL by naming the RAW DEVICES? Not even LVM/MDRAID do this, since they are able to create an aggregate device. Why can't we mount the pool by referring to something like /dev/btrfs/pool, instead of /dev/sdaX like a traditional file system? What is the logic behind this? A storage pool exists to MASK the management of the raw devices! And please, no, don't tell me to use udev-specific notations like UUID=xxxx, because they are awful and they don't fix the problem (check your mount list if you don't believe me!).
  5. Unsuitable for a storage solution since it doesn't support BLOCK VOLUMES. Seriously, guys! With BTRFS there is nothing equivalent to ZVOLs. You cannot create a block device, and therefore there is no way to create LUNs that you can export via FC/iSCSI/Infiniband. So you cannot use BTRFS as a SAN-based storage solution. People will tell me that they are using it in their NAS. But, guys, NAS are FILE storage appliances, not BLOCK. Since all BTRFS can do is create subvolumes, which appear as directories, they can only be exported with file storage protocols like SMB/NFS. But in an enterprise environment, those are just a small part. What ZFS can really do is provide LUNs to servers, which can aggregate them with LVM or even ZFS itself.
  6. Writable snapshots. LOL. I can't even think about that without laughing. A snapshot is a READ-ONLY point of restore of a volume. It is designed that way because you have to take a consistent photograph of a specific state, and that specific state is what you have to expect when you switch from the production site to the disaster recovery site. They are read-only everywhere, not just in ZFS. Have you ever seen a writable snapshot in VMware when you take one of your VM? NO. Because when you want a copy to perform modifications on, you have to CLONE it. But not BTRFS, because it was designed to make WRITABLE snapshots by DEFAULT. If you want a REAL snapshot you have to pass the "-r" flag to the btrfs subvolume snapshot command (see the sketch after this list). This is wrong on so many levels. BTRFS broke the entire concept of a snapshot and its role.
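
To make the snapshot point concrete (paths and dataset names hypothetical): in BTRFS the writable copy is the default and read-only is opt-in, while in ZFS a snapshot is always read-only and a writable copy is an explicit clone:
Code:
# BTRFS: writable by default, -r for a read-only snapshot
btrfs subvolume snapshot    /mnt/data /mnt/data-rw-copy
btrfs subvolume snapshot -r /mnt/data /mnt/data-ro-snap

# ZFS: snapshots are read-only; a writable copy is a clone of a snapshot
zfs snapshot storage/data@now
zfs clone storage/data@now storage/data-copy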

These are just the first things that came to my mind right now. I'm quite sure there is something else I forgot to mention.
But I think they are enough to explain why BTRFS failed and why EVERY half-assed attempt to mimic ZFS will fail too.
 
Writable snapshots. LOL. I can't even think about that without laughing. A snapshot is a READ-ONLY point of restore of a volume.
A read-only snapshot is useful and valuable. It says: This was the exact state of this file system (or set of file systems or block device or ... whatever) at this time. You can forever continue reading it, and you can trust that it has not been and will not be modified. Even if the original file system is written to in the meantime. Their biggest application is making consistent backups, quickly and without disrupting the foreground workload. That's why they are called snapshot: it's like walking by with a camera, and quickly taking a photo of reality, without having to do much staging (the short people in front, the tall ones in the back, everyone smile). Taking a full backup by making a complete copy of all data is very slow, and disrupts the foreground workload; on a typical modern disk, it takes roughly a day to make a complete copy of a disk drive. Because snapshots are implemented in the metadata, they can be nearly instantaneous; the price for that is that all operations on the file system are a tiny bit slower after the snapshot is taken. The disk space they use is the amount of modifications done to the original file system since the snapshot was taken; for most file systems (where things change slowly), that is very little.

And a writeable snapshot is also useful and valuable, but in different ways. It says: This was the exact state of the file system (or whatever) at this time. If the original file system is written to after the snapshot was taken, it will not change your copy. You can read it and modify it. If you modify it, your modifications will not modify the original file system either, so there is perfect isolation. You can make dozens or hundreds of writeable snapshots, even at the same time, and they are all isolated from each other, but they are all fully functional file systems that are writeable. They share the performance characteristics of readonly snapshots, see above.

As you said, writable snapshots are often called clones, or other names.

Why is this useful? Because the copies share the underlying storage. You can make a hundred writeable snapshots, and use no extra disk space. Once users start writing to the snapshots, the amount of space used is only their local changes. About 25 years ago, there was little use for writable snapshots. But today they are very useful, for deploying lots of VMs: setup an OS or server installation once, make lots of "clones" of the root disk, and use those clones to run those VMs. Since the bulk of the file system remains unmodified, this is an easy and efficient way to manage storage for server farms.
 
A read-only snapshot is useful and valuable. It says: This was the exact state of this file system (or set of file systems or block device or ... whatever) at this time. You can forever continue reading it, and you can trust that it has not been and will not be modified. Even if the original file system is written to in the meantime. Their biggest application is making consistent backups, quickly and without disrupting the foreground workload. That's why they are called snapshot: it's like walking by with a camera, and quickly taking a photo of reality, without having to do much staging (the short people in front, the tall ones in the back, everyone smile). Taking a full backup by making a complete copy of all data is very slow, and disrupts the foreground workload; on a typical modern disk, it takes roughly a day to make a complete copy of a disk drive. Because snapshots are implemented in the metadata, they can be nearly instantaneous; the price for that is that all operations on the file system are a tiny bit slower after the snapshot is taken. The disk space they use is the amount of modifications done to the original file system since the snapshot was taken; for most file systems (where things change slowly), that is very little.

And a writeable snapshot is also useful and valuable, but in different ways. It says: This was the exact state of the file system (or whatever) at this time. If the original file system is written to after the snapshot was taken, it will not change your copy. You can read it and modify it. If you modify it, your modifications will not modify the original file system either, so there is perfect isolation. You can make dozens or hundreds of writeable snapshots, even at the same time, and they are all isolated from each other, but they are all fully functional file systems that are writeable. They share the performance characteristics of readonly snapshots, see above.

As you said, writable snapshots are often called clones, or other names.

Why is this useful? Because the copies share the underlying storage. You can make a hundred writeable snapshots, and use no extra disk space. Once users start writing to the snapshots, the amount of space used is only their local changes. About 25 years ago, there was little use for writable snapshots. But today they are very useful, for deploying lots of VMs: setup an OS or server installation once, make lots of "clones" of the root disk, and use those clones to run those VMs. Since the bulk of the file system remains unmodified, this is an easy and efficient way to manage storage for server farms.
Actually, even in ZFS, clones initially share the same allocation as the parent snapshot. It's not something BTRFS invented.
Bash:
root@homeserver:/etc # zfs list storage/jails/templates/14.1-RELEASE
NAME                                   USED  AVAIL  REFER  MOUNTPOINT
storage/jails/templates/14.1-RELEASE   417M  37.2T   417M  none
root@homeserver:/etc # zfs list -t snapshot storage/jails/templates/14.1-RELEASE
NAME                                                   USED  AVAIL  REFER  MOUNTPOINT
storage/jails/templates/14.1-RELEASE@20240920081312Z   160K      -   417M  -
root@homeserver:/etc # zfs clone storage/jails/templates/14.1-RELEASE@20240920081312Z storage/test
root@homeserver:/etc # zfs list storage/test
NAME           USED  AVAIL  REFER  MOUNTPOINT
storage/test     0B  37.2T   417M  /storage/test

Don't get me wrong: it's not like BTRFS is technically doing something it shouldn't do. The problem here is that it calls things by the wrong names, breaking consistent behavior. There is no such thing as a writable snapshot.
If we talk about a volume snapshot, we're talking about a restore point. If we talk about a clone, we're talking about a copy of a volume.
 