UFS geli: Protection against bitrot?

One major benefit of ZFS over UFS is bitrot protection through built-in checksumming.

It sounds like geli can offer this. Wouldn't it offer decent bitrot protection?

The downside, if I understand correctly, is that you know there's been bitrot, but no chance for recovery. Not even with RAID 1, right?

Unless maybe you had GELI on individual drives and did RAID 1 over those? Then maybe it could read from the alternative drive if there was a failure?

This is mostly hypothetical here. I know most people should probably just use ZFS, but being able to have bitrot protection on UFS is also nice.
 
It sounds like geli can offer this.
Every layer of the stack can theoretically do checksumming, error detection, and error correction. The lowest layer (the drives, both HDD and SSD) definitely does that. You typically don't actually see it, unless you use tools like SMART to check on them (or your disk dies). The transport layer (SATA or SCSI/SAS) definitely does it, but there you hardly ever see it: errors on the wire are immediately fixed by the receiver noticing and requesting a retransmit. Inside the computer, either the HBA or the memory are typically the weakest links; but memory is supposed to have ECC, some file systems have checksum protection of data structures in memory (been there, done that, got the T-shirt), and HBAs are supposed to be bug free (ha ha, funny). From that point upwards, it is all software, and every software layer could do it. But software is also theoretically intended to be bug free.

The SCSI committee, which focuses on commercial environments that care much more about data durability and availability, thought about this and built the T10DIF standard, which allows transporting checksums all the way from the user layer to the platter, and checking them at every layer of the stack. Nice theory. It's even implemented sometimes, though rarely implemented correctly, at least from the drive up to the file system layer (I've never seen it go up to user space or applications, but maybe for lack of looking).

Why do we care about T10DIF? For some illogical reason, people are always too worried about their hardware (disks, SSDs, memory) screwing them over. But that hardware has been getting very reliable, nearly laughably reliable when combined with standard recovery mechanisms (various flavors of RAID and backup). On disk drives, only one worrisome error mechanism remains: off-track writes, where the disk accepts a write for a block but on some subsequent read returns the old (supposedly overwritten) content of the block. This happens on spinning rust (due to servo problems) and on SSDs (due to firmware bugs in the FTL). The drive vendors are doing all they can to prevent it, but to really address it they need help from the layer above, and T10DIF is exactly that. Bummer it's so hard to use.
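
To make that concrete, here is a minimal sketch of the per-sector protection information T10DIF carries, assuming the common Type 1 format (a CRC-16 guard tag over the 512-byte payload, a 2-byte application tag, and a 4-byte reference tag derived from the LBA). The field layout and CRC polynomial are the standard ones, but treat this as an illustration rather than a drop-in implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* 8 bytes of protection information appended to each 512-byte sector
 * (T10 DIF Type 1). All fields are big-endian on the wire.            */
struct t10_pi_tuple {
	uint16_t guard_tag;	/* CRC-16 of the sector payload               */
	uint16_t app_tag;	/* opaque to the drive, free for upper layers */
	uint32_t ref_tag;	/* low 32 bits of the target LBA              */
};

/* Bitwise CRC-16 with the T10 DIF polynomial 0x8BB7 (init 0, no reflection). */
static uint16_t
crc_t10dif(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;

	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)buf[i] << 8;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
					     : (uint16_t)(crc << 1);
	}
	return crc;
}
```

Every layer that handles the sector can recompute the guard tag over the payload and check the reference tag against the LBA it believes it is addressing, which is how problems get caught before they reach the application.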

The real problem with this is the following: if you implement checksums in a layer (like GELI) and you detect a checksum error, what do you do? The only safe answer is "fail stop", for two reasons. First, you cannot return the block or file with the error to the layer above and just log an error: returning wrong data is always unsafe. But traditional operating systems (the design of Unix is about 50 or 60 years old) have no means for a lower layer to tell the upper layer "I could read this thing, but something smells fishy". They can only say "OK" or "EIO", and those semantics are not rich enough for the next layer to do something useful.

Second, as I said above, all the layers below should have done good error detection and correction already. If an undetected checksum error managed to get through all that safety, it probably means a systemic problem exists: the error was probably not an alpha particle (see ECC above), it was probably something worse. The best thing to do would be to crash the computer, and by extension the whole worldwide internet. That's just not practical, so it is usually easier to just not look and hope that the layer above takes care of it (this is the ostrich school of management: head in the sand).
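
To illustrate how limited that "OK" / "EIO" vocabulary is, here is a purely hypothetical comparison; these enums are my own illustration, not an existing FreeBSD or GEOM interface:

```c
#include <errno.h>

/* What a consumer of a block layer effectively gets back today:
 * either the data is declared good, or the whole request failed.       */
enum legacy_io_status {
	IO_OK    = 0,
	IO_ERROR = EIO,
};

/* What a checksumming layer would need to express, and what no standard
 * layer above it knows how to act on (hypothetical, for illustration).  */
enum richer_io_status {
	IO_GOOD = 0,	/* data returned, checksum verified                 */
	IO_SUSPECT,	/* data returned, but the checksum did not match it */
	IO_UNREADABLE,	/* no data could be read at all                     */
};
```

Since nothing above a checksumming layer knows what to do with a "suspect" status, the layer is stuck choosing between EIO and fail-stop.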

In the real world, the largest causes of data loss are NOT bit rot, nor disk failures. They are (and the order is pretty accurate): (a) user errors (the infamous "rm -Rf *", and more elaborate versions), (b) software bugs (well-known examples include ReiserFS and BtrFS: the first one murders your files, not your wife; the second one is a machine for destroying files), and (c) site disasters (fire, flood, hurricane, explosion).

So all in all: One software layer, in particular one that is not knowledgeable about redundancy at the neighboring layers, implementing checksums by itself is usually not a good investment of time and money.

The downside, if I understand correctly, is that you know there's been bitrot, but no chance for recovery. Not even with RAID 1, right?
How would GELI even know that RAID 1 exists? It is a block encryption mechanism, for one block device at a time.

Unless maybe you had GELI on individual drives and did RAID 1 over those?
You just turned GELI into a RAID system. That is about 10x or 100x more complex than what it already does. And making it work just for RAID 1 (mirroring) isn't going to solve many people's problems, since mirroring is an expensive use of disk space and the bulk of the world's data is stored in some parity- or erasure-coding-based RAID (RAID 5 or higher), which is even more complex.
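
To see why correction belongs in a layer that actually knows about the redundancy, here is the core of what even the simplest parity RAID (RAID 5) does to rebuild a block it could not read; this is a bare sketch with hypothetical names, ignoring striping, the write hole, and everything else that makes real RAID code large:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

/*
 * RAID 5 keeps one parity block per stripe: P = D0 ^ D1 ^ ... ^ Dn-1.
 * Any single unreadable member (data or parity) is the XOR of the others.
 * blocks[] holds all stripe members; the entry at index `missing` is the
 * one that failed to read and is skipped, and the result lands in out[].
 */
static void
raid5_rebuild(uint8_t *out, uint8_t *const blocks[], size_t nblocks,
    size_t missing)
{
	size_t b, i;

	for (i = 0; i < BLOCK_SIZE; i++)
		out[i] = 0;

	for (b = 0; b < nblocks; b++) {
		if (b == missing)
			continue;
		for (i = 0; i < BLOCK_SIZE; i++)
			out[i] ^= blocks[b][i];
	}
}
```

A checksum-only layer like GELI has none of this context: it can know that a block failed verification, but it has no second copy or parity from which to rebuild it.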

This is mostly hypothetical here. I know most people should probably just use ZFS, ...
Given that ZFS exists for FreeBSD, I think putting checksums into GELI alone would be doubly foolish: Not only is it a bad investment (see above), but a good solution already exists. Here is something I would like to see instead: Teach ZFS how to work with T10DIF, and teach it to export/import checksums to user programs, in particular the large middleware layers (such as databases and object stores) that already know how to operate on blocks, and often internally have checksums of their own. This is a gigantic amount of work, and it's not clear to me it's feasible in the open source world (because it requires close cooperation across layers of the stack that have no common management or financial interest in OSS).
 
Thanks for your reply!

I'd like to address one main part: bit rot. I think it's way more common than people give it credit for.

I worked at a company that had numerous servers set up with software RAID 10 on Linux.

There was a /proc entry that detailed how many blocks differed between the two halves of the RAID 1. Each server, having been up for several years, seemed to show roughly 1,000 blocks of difference. No obvious ill effects, but how would you know?

Maybe this was a transfer issue, but I am guessing it had more to do with the data on the platters being incorrect.

You bring up a lot of fair points and have a lot of good information there. I just don't feel like checksumming at the software level is a crazy thing to do. Of course, ZFS does it for you and ZFS is great for most.

This is mostly just to satisfy my curiosity.
 
It would be relatively easy to make a new anti-bitrot GEOM layer, presumably based on ECC as it is used for RAM.
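
For reference, RAM-style ECC usually means a Hamming-type SECDED code (single-error correct, double-error detect). A toy Hamming(7,4) sketch of just the correction part, my own illustration rather than anything GEOM-specific:

```c
#include <stdint.h>
#include <stdio.h>

/* Encode 4 data bits (low nibble) into a 7-bit Hamming code word.
 * Bit layout (bit index = position - 1): pos 1..7 = p1 p2 d1 p3 d2 d3 d4 */
static uint8_t
hamming74_encode(uint8_t nibble)
{
	uint8_t d1 = (nibble >> 0) & 1, d2 = (nibble >> 1) & 1;
	uint8_t d3 = (nibble >> 2) & 1, d4 = (nibble >> 3) & 1;
	uint8_t p1 = d1 ^ d2 ^ d4;	/* covers positions 1,3,5,7 */
	uint8_t p2 = d1 ^ d3 ^ d4;	/* covers positions 2,3,6,7 */
	uint8_t p3 = d2 ^ d3 ^ d4;	/* covers positions 4,5,6,7 */

	return (uint8_t)(p1 | p2 << 1 | d1 << 2 | p3 << 3 |
	    d2 << 4 | d3 << 5 | d4 << 6);
}

/* Decode: corrects any single flipped bit, returns the 4 data bits. */
static uint8_t
hamming74_decode(uint8_t cw)
{
	uint8_t b[8];

	for (int i = 1; i <= 7; i++)
		b[i] = (cw >> (i - 1)) & 1;

	uint8_t s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
	uint8_t s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
	uint8_t s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
	int syndrome = s1 | s2 << 1 | s3 << 2;	/* 0 = clean, else = bad position */

	if (syndrome != 0)
		b[syndrome] ^= 1;		/* flip the bad bit back */
	return (uint8_t)(b[3] | b[5] << 1 | b[6] << 2 | b[7] << 3);
}

int
main(void)
{
	uint8_t cw = hamming74_encode(0xB);	/* encode nibble 1011 */

	cw ^= 1 << 4;				/* simulate a single bit flip */
	printf("recovered nibble: 0x%X\n", hamming74_decode(cw)); /* 0xB again */
	return 0;
}
```

Scaling the idea from a few bits to whole 4K sectors is simple arithmetic; deciding where the extra bits live on disk is the hard part, as the next reply points out.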
 
The part that gets hard: where to store the checksums? Option 1: right next to the data blocks. This has the advantage that it doesn't need extra seeks, and data + checksum can be read and written with a single IO. Big disadvantage: it screws up the block size. On SCSI disks you can do some of this using larger physical sectors (520 or 528 bytes with 512 bytes of payload), but it's not clear that is enough checksum to allow reliable error detection, much less correction for typical syndromes (on disks, the typical problem is not a single bit flip). For "error correction" you really need full redundancy, since you usually need to reconstruct a whole sector. On SATA disks, that trick doesn't work at all. And while SCSI disks are not really more expensive than SATA any longer, you can't rely on them being available if you are just a software provider or vendor.

Option 2: put the checksums into something like a database, or a special set of sectors. Now normal IO requires lots of extra seeks and is no longer atomic. If one keeps the checksums primarily in memory, atomicity becomes a giant headache, and dumping them to disk at the right time requires lots of work, which can be performance-killing, unreliable, or very complex to implement.
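
A minimal sketch of why Option 2 hurts, assuming a hypothetical layout with all checksums in a reserved region at the front of the device (the names, the 8-bytes-per-block ratio, and the region sizes are illustrative assumptions):

```c
#include <stdint.h>

#define DATA_BLOCK_SIZE	4096u
#define CSUM_SIZE	8u		/* bytes of checksum per data block */
#define CSUM_REGION_OFF	0ull		/* checksum table at the front...   */
#define DATA_REGION_OFF	(1ull << 30)	/* ...data after a 1 GiB table      */

/* Where the data block lives. */
static uint64_t
data_offset(uint64_t blkno)
{
	return DATA_REGION_OFF + blkno * DATA_BLOCK_SIZE;
}

/* Where its checksum lives: in a completely different part of the disk,
 * so every verified read or updated write becomes two IOs that cannot be
 * issued as one atomic operation.                                        */
static uint64_t
csum_offset(uint64_t blkno)
{
	return CSUM_REGION_OFF + blkno * CSUM_SIZE;
}
```

Crash between the data write and the checksum write and the two disagree, which is exactly the atomicity headache described above.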

If you can control all of the storage stack (disk IO, RAID and file system), you can work around some of these issues. But that requires whole-system design, in the style of ZFS or better. One example, just to give the flavor: on modern disks the block size is 4K, but 4K is a heck of a lot of space for just storing checksums, way too much for the checksums of a single 4K block. So you rewrite the whole stack to make a 32K block the standard unit of allocation and IO. That can be done if you control all of the stack. But 4K of checksums per 32K block is still a little too much; 2K would be ideal. So you always lay out the data as a 32K data block + a 4K block with checksums for both neighboring blocks + another 32K data block. This way reads can always be done as 36K IOs, with little overhead. The problem is writes: sometimes you don't know the checksum for the other data block in the pair, so you have to prefetch and do a read-modify-write cycle. With good disk scheduling that's not too painful, if the platter rotation can be hidden behind other slow operations. And then you cache the checksums in memory (more aggressively than data blocks, since they're more valuable).
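
Here is the arithmetic of that 32K + 4K + 32K layout as code (offset math only, with names of my own choosing): each 68K group holds two 32K data blocks with one shared 4K checksum block between them, so any data block plus its checksums can be fetched with a single contiguous 36K read.

```c
#include <stdint.h>

#define KIB		1024ull
#define DATA_BLK	(32 * KIB)	/* unit of allocation and IO      */
#define CSUM_BLK	(4 * KIB)	/* checksums for both neighbors   */
#define GROUP		(2 * DATA_BLK + CSUM_BLK)	/* 68 KiB         */

struct extent { uint64_t off, len; };

/* Contiguous 36K region covering logical data block n and the checksum
 * block that describes it.                                              */
static struct extent
read_extent(uint64_t n)
{
	uint64_t base = (n / 2) * GROUP;
	struct extent e;

	if (n % 2 == 0)			/* first block:  [data][csum]      */
		e.off = base;
	else				/* second block: [csum][data]      */
		e.off = base + DATA_BLK;
	e.len = DATA_BLK + CSUM_BLK;	/* always a single 36K IO          */
	return e;
}
```

Writes are the ugly part: updating either data block also means updating the shared checksum block, hence the read-modify-write cycle mentioned above.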

This is just one example of the tricks one can use to build such a system ... but it requires quite a bit of work, and it will only work really well if one can make lots of compromises elsewhere.
 
Every 512 or 4096 bytes of data already has ECC on the hard disk. For SSDs, the ECC covers an entire cell. Do some research into how the ECC works and how the data is structured on the disk. The only problem is if the ECC data gets corrupted/damaged: then the entire sector can't be verified and needs to be discarded. On hardware RAID with redundancy (RAID 1, 5, 6, etc.) it gets reallocated and recovered from the data stored on the other disks (the entire data block is rebuilt from data and parity).
 
Every 512 or 4096 bytes of data already has ECC on the hard disk. For SSDs, the ECC covers an entire cell. Do some research into how the ECC works and how the data is structured on the disk. The only problem is if the ECC data gets corrupted/damaged: then the entire sector can't be verified and needs to be discarded. On hardware RAID with redundancy (RAID 1, 5, 6, etc.) it gets reallocated and recovered from the data stored on the other disks (the entire data block is rebuilt from data and parity).

I think the concern here is that disks sometimes come up with changed blocks regardless.
 
No, you can't read corrupted data from the disk. It will be reported as a bad block if the ECC check fails, and no faulty data will be transferred out of the disk controller. There are multiple end-to-end ECC verifications: inside the disk controller, from the disk controller to the chipset (protecting against SATA cable errors, bad connections, noise in the cable, etc.), then from the chipset to the CPU, and so on.

Check the data integrity section: https://www.intel.com/content/dam/s...r-products/Enterprise_vs_Desktop_HDDs_2.0.pdf

If you write corrupted data to the storage, for example because you have non-ECC memory, then yes, the data you store on the disk will be junk, but that is not a storage problem.
 
No, you can't read corrupted data from the disk. It will be reported as a bad block if the ECC check fails, and no faulty data will be transferred out of the disk controller. There are multiple end-to-end ECC verifications: inside the disk controller, from the disk controller to the chipset (protecting against SATA cable errors, bad connections, noise in the cable, etc.), then from the chipset to the CPU, and so on.

But you do see mismatches in practice. See, for example, the Linux md RAID mismatches on resync mentioned above.

The truth is that I don't trust the disk vendors to get it right.
 
No, you can't read corrupted data from the disk.
Yes you can. But admittedly, it is very rare.

It will be reported as a bad block if the ECC check fails, and no faulty data will be transferred out of the disk controller.
Modern disk drives quote an unrecoverable bit error rate of about 10^-14 (known as the BER of the disk). That means if you read 10^14 bits, you can expect about one case where the disk drive says "sorry, I cannot read that block, because I had a read error". In most cases the read error will actually be an ECC error, so what the disk drive is really saying is: "I tried to read this block from disk, but the ECC didn't match, so I kept retrying as many times as is sensible, and none of the retries worked, so I'm giving up".
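
As a back-of-the-envelope check of what a 10^-14 BER means in practice (the drive size and the simple linear expectation are my own assumptions for illustration):

```c
#include <stdio.h>

int
main(void)
{
	/* Vendor-quoted unrecoverable bit error rate, per bit read. */
	double ber = 1e-14;

	/* One full scan of a 12 TB drive. */
	double bytes_read = 12e12;
	double bits_read = bytes_read * 8.0;

	/* Expected number of "sorry, can't read that block" events. */
	printf("expected unrecoverable read errors: %.2f\n", bits_read * ber);
	/* prints roughly 0.96: about one unreadable sector per full scan */
	return 0;
}
```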

However, any ECC is statistical in nature. Once in a while, corrupted data will actually match the ECC. And once in a while, the block being read has a valid ECC but isn't the block that you wanted to read. In the old days (until a few years ago), the disk drive vendors actually specified this rate as the "uncorrected bit error rate" or UBER, and the typical value was 10^-17. I think consumer-level disk drive specifications no longer quote this rate, as it is (a) very hard for the user to measure, and (b) very hard for the disk vendor to estimate. Disk customers with an enormous number of disks (typically several million) can actually measure this number.

There are multiple end-to-end ECC verifications: inside the disk controller, from the disk controller to the chipset (protecting against SATA cable errors, bad connections, noise in the cable, etc.), then from the chipset to the CPU, and so on.
And all of those are statistical in nature. For example, a normal Ethernet frame is protected by a 32-bit CRC. If in a large computer system there are 2^32 packets per second that each contain at least one bit transmission error, then roughly once per second a corrupted packet will be wrongly accepted by that checksum. And very large computer clusters are running at those data rates.

In reality, statistics isn't even the main source of such errors; implementation bugs are. All these error rates are measurable in large computer installations, and they are not zero. As cracauer said: I don't trust the vendors to get it right. I trust them to try really hard, and to get it very close to right, but alas, they are human.
 