ZFS (any FS): Aggregated storage & cache flush with/without UPS & battery-backed devices

Dear storage gurus,

I need some expertise to verify the following thoughts. Thanks in advance for your feedback.

A laptop has a built-in UPS, the battery. Thus it should be safe to disable the explicit flushing of the drive's (or controller's) cache, i.e. for ZFS on a laptop (in sysctl.conf(5)):
Code:
vfs.zfs.cache_flush_disable=1
vfs.zfs.vdev.bio_flush_disable=1
Not that I could notice the difference in daily usage, but I believe it will gain some performance in heavy-write scenarios (and it's nerdy). Some laptops even have two batteries, where the internal one cannot easily be removed and can only be disabled in the BIOS (e.g. Lenovo ThinkPad). So this tweak should be safe on such machines even when the first battery is removed, as long as the second, internal battery is enabled.
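For a quick test, the current values can be checked and flipped at runtime with sysctl(8) before committing anything to a config file. Just a sketch; the OID names differ between ZFS versions, so adjust them to whatever sysctl -a reports (and if they turn out to be read-only at runtime, they belong in /boot/loader.conf instead):
Code:
# check the current settings first
sysctl vfs.zfs.cache_flush_disable
sysctl vfs.zfs.vdev.bio_flush_disable
# try them at runtime before making them persistent
sysctl vfs.zfs.cache_flush_disable=1
sysctl vfs.zfs.vdev.bio_flush_disable=1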

Now I'm asking myself if such a tweak could cause damage to aggregated storage, i.e. RAID1-x, when the machine crashes for a reason other than a power outage. E.g. could it damage the internal data structures of a RAID5, so that after a crash the RAID is in an irrecoverable, irregular state? I.e.: all devices are online, but the correct data cannot be recovered?

Background: while enterprise-grade storage boxes (should) have either a battery-backed storage controller or a battery dedicated to the onboard storage controller chips, most entry-level NAS boxes (or microservers) do not. I found only the Kobol Helios64 NAS to have such a vital feature (and one very outdated Buffalo model). Likewise, I read that SAS disks have a capacitor-backed cache, while SATA disks do not. Is that correct?

Is it thus fair to say: to be reliable, a NAS box shall have ECC RAM and either a built-in UPS, a battery-backed storage controller, or SAS disks? Then it is safe to disable the flushing of the controller's or disk's cache, because even without an external UPS the RAID will be in a recoverable state after a crash (provided the minimum number of disks is okay).
 
ECC is always beneficial to data integrity.

A correctly functioning UPS and a proper shutdown are pretty much also a must. Having said that, I've experienced many heavy-workload server crashes, and none of them were actually fatal (although there were some corrupted or missing files).

As far as hardware write-back cache goes, we use two NVMe PCI-Express cards as zpool cache devices. There are also HDDs with a built-in SSD cache. A battery-backed HBA is also great, but when the battery wears out years later, it's hard to get a replacement.
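For illustration, attaching NVMe devices as cache looks roughly like this; the pool name "tank" and the device names nvd0/nvd1 are placeholders, not from the setup described above:
Code:
# attach two NVMe devices as L2ARC (read) cache to the pool
zpool add tank cache nvd0 nvd1
# verify the layout
zpool status tank
Keep in mind that a cache vdev is a read cache (L2ARC); synchronous write latency is what a dedicated log (SLOG) vdev would address.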

The most frequent reason of data loss is... human error. Unfortunately.
 
What you are proposing is that the whole RAM of the machine can be used as a write cache, meaning it contains dirty data that has yet to be written to disk. And you are proposing to do this on a consumer-grade piece of hardware (a laptop), used and administered by a consumer.

A laptop has a built-in UPS, the battery. Thus it should be safe to disable the explicit flushing of the drive's (or controller's) cache.
If the only danger to RAM were that power fails, that would be correct. But there are many more dangers to RAM, like the machine crashing or being rebooted. Imagine you have 16 GB of dirty data in cache (many modern laptops have this much memory, and the best use of RAM is as a cache): to reboot, you would have to write all of that to disk. Even under optimistic assumptions (100 MB/s to disk), that will take about 160 seconds, or nearly 3 minutes. Do you think users will want to wait 3 minutes for shutdown to finish? No, they'll just press the power button.
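A back-of-the-envelope check of that figure (a sketch with bc(1), using the 16 GB and 100 MB/s assumptions from above):
Code:
# dirty data in MB divided by sustained write speed in MB/s
echo "16 * 1024 / 100" | bc    # => 163, i.e. roughly 160 seconds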

And: After a reboot, memory will be cleared. There have been machines that survive a CPU crash without clearing memory (as long as power doesn't fail), but those are special-purpose, industrial-grade hardware.

And: Batteries and UPSes only last so long. My old IBM ThinkPad could only stay in suspend mode for a matter of hours; then it would write everything to disk and power itself off.

Not that I could notice the difference in daily usage, but I believe it will gain some performance in heavy-write scenarios ...
Why does write-behind cache exist? To improve performance. How does this work? Through several mechanisms. First, it allows the file system to straighten out partially written files (often written in parallel by many programs running at the same time) and write the aggregated result sequentially. In a nutshell, this works by increasing the IO size for single sequential IOs. But there is limited benefit in doing so; past a few MB per IO, disks and SSDs don't get much faster. And you need a buffer of that size for each stream coming from an application, and there may be dozens of those, but typically not thousands. The other use of a large write buffer is to avoid writing things to disk that get modified in memory multiple times, or files that get deleted before ever being written. Again, the incremental benefit of that for very large caches gets smaller and smaller: something that was written to cache yesterday is unlikely to be modified again today.

So what I'm saying is: measure this, and I bet that the incremental benefit from extremely large caches will be small for typical consumer workloads.
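One way to measure it: run a synchronous-write benchmark once with the flush settings at their defaults and once with them disabled, then compare. A sketch assuming fio from ports/packages; the directory and sizes are arbitrary:
Code:
# sequential writes with a periodic fsync, so the flush cost is visible
fio --name=flushtest --directory=/tank/scratch --rw=write --bs=128k \
    --size=4g --ioengine=psync --fsync=32 --group_reporting
# compare the reported write bandwidth between the two runs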

E.g. could it damage the internal data structures of a RAID5, so that after a crash the RAID is in an irrecoverable, irregular state? I.e.: all devices are online, but the correct data cannot be recovered?
And not just RAID structures: file systems have loads of internal metadata and complex data structures. What if the file system is being used by a database? What if the user relies on files that were updated in a certain order actually reaching the disk in that order (like make(1) does)?

Background: while enterprise-grade storage boxes (should) have either a battery-backed storage controller or a battery dedicated to the onboard storage controller chips,
Or more. They may have built-in UPSes, diesel generators, and so on.

most entry-level NAS boxes (or microservers) do not.
Don't be so sure. For example, a lot of entry-level RAID cards from 20 years ago (made by the likes of Mylex or LSI) had small battery packs, about 1 cm x 3 cm x 3 cm.

Likewise, I read that SAS disks have a capacitor-backed cache, while SATA disks do not.
No. Today SAS and SATA disks are nearly identical, differing only in the interface. They will only acknowledge writes that they are sure they can get to stable storage (disk or flash). And a capacitor-backed cache is not persistent anyway, as the capacitors only work for a few days.
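What you can check is whether the drive's own volatile write cache is enabled, e.g. with camcontrol(8) on FreeBSD; the device names ada0 and da0 are just examples:
Code:
# ATA/SATA disk: the "write cache" line shows supported/enabled state
camcontrol identify ada0 | grep -i cache
# SCSI/SAS disk: the caching mode page (page 8) contains the WCE bit
camcontrol modepage da0 -m 8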

Is it thus fair to say: to be reliable, a NAS box shall have ECC RAM and either a built-in UPS, a battery-backed storage controller...
Those are all good things which enhance safety. But they are neither necessary nor sufficient.

Then it is safe to disable the flushing of the controller's or disk's cache, because even without an external UPS the RAID will be in a recoverable state after a crash (provided the minimum number of disks is okay).
No, because so many other things can happen to RAM, like reboots, corruption, the battery running out, or mechanical damage.

Now, I have to agree: there are machines that keep live data in RAM only, for performance. They are typically highly specialized industrial systems. A colleague worked on such a system for a while; it had two servers (each with two power supplies), connected via a fast private network interface (so exactly the same data was in both RAMs at all times), and two UPSes, each feeding one power supply in each server. So the system could lose either server, either UPS, and any 3 (of 4) power supplies and survive. In theory. The theory broke when a local field service guy took offense at all the power wires in the rack crossing over, and put both power supplies of server A onto UPS #1 and both power supplies of server B onto UPS #2, with the wires looking much neater. The system was killed when power failed while one of the servers was down for updates. After that debacle (which cost our company many M$; data loss incidents are not cheap), the system was redesigned.
 
In theory. The theory broke when a local field service guy took offense at all the power wires in the rack crossing over, and put both power supplies of server A onto UPS #1 and both power supplies of server B onto UPS #2, with the wires looking much neater. The system was killed when power failed while one of the servers was down for updates. After that debacle (which cost our company many M$; data loss incidents are not cheap), the system was redesigned.
I often asked myself: why don't computers have a way to permanently connect power, like so many (usually high-powered) appliances do?

And since that was not possible, why wasn't the cabling clearly marked, so that everybody knew how things had to be connected?

That looks like a communication/documentation problem resulting in a disaster waiting to happen, just like the Chernobyl disaster due to the graphite-tipped rods.
 