ZFS: How to Reliably Break Your ZFS Pool Using Real-World Events

Obviously this is sarcasm, but I've been able to reproduce (on more than 3 occasions) breaking my ZFS pool (unintentionally) using the initial conditions and trigger below. Thankfully, I no longer trust ZFS enough to put critical data in the situation described below. The lesson to take away here is to never use ZFS without taking backups, setting "copies" to a value greater than one, or using at least a mirror. If you can't use those options, then stick to UFS. UFS is much better at handling the failure scenario described below.
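For reference, the mitigations mentioned above look like this in practice (pool and device names are illustrative; a mirror obviously needs a second disk):

Code:
# zpool create MyPool mirror da1 da2
# zfs set copies=2 MyPool
# zpool status MyPool

The mirror survives the loss of a whole disk; copies=2 stores every data block twice on the same vdev, which helps against localized corruption but not against losing the device itself.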

How to Reliably Break your ZFS Pool
(Will not work on UFS)

Step 1: Get a Copy of VMware Workstation
I have VMware Workstation running on Windows. I suppose this could also be done with VirtualBox on Linux/FreeBSD, but I haven't tried (this is unintentional, after all).

Step 2: Install FreeBSD

It can be whatever you want. Just get a fresh install of FreeBSD as a guest. In my case, I am running a graphical desktop, but it will work on a headless server too.

Step 3: Find some Storage
You can use your root volume, but for simplicity we will dedicate a new virtual disk to the pool.

Step 4: Create a ZFS Pool
If you are using VMware, your second virtual disk will show up as da1. Note that pools are created with zpool, not zfs: run zpool create MyPool da1
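A quick sanity check after creating the pool (da1 assumes VMware's second virtual disk; adjust to whatever dmesg actually reports on your system):

Code:
# zpool create MyPool da1
# zpool status MyPool
# zpool list MyPool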

Step 5: Generate IO
Real-world use case: let's install net-p2p/litecoin. Set the location of the blockchain and wallet to your ZFS pool. Don't forget to export your private key.
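A sketch of what that might look like; the -datadir flag and dumpprivkey command follow the usual bitcoind conventions, and the mountpoint /MyPool is an assumption, so check the port's documentation:

Code:
# pkg install litecoin
# litecoind -datadir=/MyPool/litecoin -daemon
# litecoin-cli -datadir=/MyPool/litecoin dumpprivkey <address>

Syncing the blockchain generates plenty of sustained write I/O, which is what we want for the next step.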

Step 6: Pull the Plug
Simulating a power failure or a host OS kernel panic.

Step 7: Boot up your Guest & Cry
Observe the failed pool and be happy that you took a backup copy of your private key.

Code:
# zpool import MyPool
cannot import 'MyPool': I/O error
    Destroy and re-create the pool from
    a backup source.
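Before destroying the pool, a rewind import is worth attempting. The -F option asks ZFS to discard the last few transaction groups and roll back to an earlier consistent state; combined with -n it is a dry run that only reports whether the rewind would succeed, and readonly=on lets you copy data off without risking further damage:

Code:
# zpool import -F -n MyPool
# zpool import -o readonly=on -F MyPool

This is not guaranteed to work when the host has been reordering or dropping flushed writes, but it recovers the pool often enough to be the first thing to try.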
 
Your host is lying to ZFS about having all its blocks committed. Pulling the plug exposes that lie. You need to set the host to do synchronous writes.
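One way to do that with VMware Workstation is via settings in the VM's .vmx file. These knobs are community lore rather than formally documented, so treat them as a starting point and verify against your VMware version:

Code:
diskLib.dataCacheMaxSize = "0"
diskLib.dataCacheMaxReadAheadSize = "0"
diskLib.dataCacheMinReadAheadSize = "0"
diskLib.maxUnsyncedWrites = "0"

The intent is to stop VMware from buffering guest writes in host RAM, so a cache-flush command from the guest actually reaches the physical disk.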

This is most certainly made worse by the fact that not only is Windows doing what it calls a "delayed write", but I'm sure VMware is caching some writes as well.

I have no ill will towards ZFS; I will trust it with my most important data so long as it's on a RAID-Z pool on non-cheap hardware with a battery backup.
 
This problem you describe has nothing to do with ZFS. It can happen anywhere you have not taken the care to ensure a reliable data path to the physical media.

Where you have any class of hypervisor, it needs to be doing synchronous writes to guarantee integrity (not lying to the clients about I/O completion). Synchronous writes are pretty much always a complete performance killer -- so it's common to use multiple data paths, asynchronous I/O, and a UPS to mitigate the risk. Of course, UPSs can fail...
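On the guest side, ZFS can be told to force every write through the intent log with the sync property. This only helps if the host actually honors the resulting cache flushes, and it costs performance, but it closes the guest's half of the gap:

Code:
# zfs set sync=always MyPool
# zfs get sync MyPool

With sync=always, even asynchronous application writes are committed to stable storage before ZFS acknowledges them.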

Your hardware data paths also need to be reliable -- and the obvious problem area here is consumer class SSDs, which don't have power loss protection. So, even if you are doing synchronous writes, such SSDs can, on loss of power, lose data which has been acknowledged to the OS as safely written.

You need to design an end-to-end reliable data path for storing your wallet. ZFS can certainly do that, quite well.
 
This problem you describe has nothing to do with ZFS. It can happen anywhere you have not taken the care to ensure a reliable data path to the physical media.

Where you have any class of hypervisor, it needs to be doing synchronous writes to guarantee integrity (not lying to the clients about I/O completion). Synchronous writes are pretty much always a complete performance killer -- so it's common to use multiple data paths, asynchronous I/O, and a UPS to mitigate the risk. Of course, UPSs can fail...

Your hardware data paths also need to be reliable -- and the obvious problem area here is consumer class SSDs, which don't have power loss protection. So, even if you are doing synchronous writes, such SSDs can, on loss of power, lose data which has been acknowledged to the OS as safely written.

Not saying you're wrong, but what baffles me with these initial conditions and trigger is that I can't get UFS to crash in the same way ZFS is crashing.

You need to design an end-to-end reliable data path for storing your wallet. ZFS can certainly do that, quite well.

Yes, I do have something much better designed for storing crypto wallets where it matters: XigmaNAS (FreeBSD based) with ZFS on enterprise-class hardware. It's been running great for years; I am replacing failed drives as needed and taking backups "just in case" (to another XigmaNAS). I also have yet another, even beefier XigmaNAS (I have several) that I trust all my important memories, photographs, videos, and entertainment media to. I would trust ZFS with my data a thousand times more than EXT4 or BTRFS.

This little setup I describe in the original post is a typical single-drive, consumer-level workstation scenario that I'm sure most people would replicate without thinking about it.
 
Not saying you're wrong, but what baffles me with these initial conditions and trigger is that I can't get UFS to crash in the same way ZFS is crashing.
Here’s a different spin on the same data: you can’t reliably get UFS to warn the user when things are unexpected in the way that ZFS is unwilling to just say “everything is fine”.

Files that are actively being modified when the power is pulled are necessarily in an indeterminate state. Even with this, a ZFS pool will remain in recoverable (via rollback during import) shape, as long as the hardware is not actively lying to it about whether or not a chunk of data has been successfully flushed to non-volatile storage. (“Good” here may not be the last attempted write from user space before the power went poof, but will at least be the last successfully sync()-ed write or newer.)

Any file system that says otherwise is playing fast and loose with your data. ZFS is hyper focused on reliability, with checksums “all the way down”, and duplicates of metadata even on non-redundant pools. The fact that it is unwilling to act like everything is OK here is a good thing if you’re interested in your data being what you think it is.
 