ZFS: Copying potentially broken data from a non-ECC pool

It so happens that I'm using FreeBSD with ZFS on a desktop machine with non-ECC memory. I'm sure this is a very common situation, because many people run FreeBSD (and ZFS) on laptops, which usually have neither ECC memory nor any redundancy at the disk level.

Therefore, I create incremental snapshots using zfs-autobackup and send them to a machine with ECC memory and proper redundancy at the disk level.
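For reference, roughly what that workflow looks like; the dataset, pool, host, and backup-group names below (tank/home, offsite, backuphost, backuppool/desktop) are only placeholders, so treat this as a sketch rather than my exact commands:

    # mark the datasets that zfs-autobackup should pick up (selection is property-based)
    zfs set autobackup:offsite=true tank/home

    # snapshot the marked datasets and push them incrementally over SSH to the ECC machine
    zfs-autobackup --verbose --ssh-target backuphost offsite backuppool/desktop

    # roughly equivalent to doing it by hand with plain ZFS commands:
    zfs snapshot tank/home@offsite-20240102
    zfs send -i tank/home@offsite-20240101 tank/home@offsite-20240102 | \
        ssh backuphost zfs receive backuppool/desktop/home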

As is widely known, using ZFS without ECC can have more severe consequences than using simpler file systems (UFS/EXT/XFS) without ECC.

But what if we copy snapshots with corrupted data from the non-ECC machine to the ECC machine? Can data corruption from the first machine propagate to data that was originally created on the healthy machine with ECC memory?

What do you recommend in this case? Ditch ZFS for UFS on a non-ECC machine? Buy a desktop workstation with ECC?

If the discussion confirms that data corruption can propagate to the healthy part of ZFS on the target machine to which the broken data is copied, then I can only do backups from the ECC machine to the non-ECC machine, but not vice versa.
 
My opinions only.

Quoting the OP: "using ZFS without ECC can have more severe consequences than using simpler file systems (UFS/EXT/XFS) without ECC."

The link behind "widely known" rests on this assumption (paraphrased): servers have lots of RAM, and most of that RAM is used as cache/buffers.

Why would non-ECC be any better for UFS than for ZFS? The end result of the scenarios in that link is that incorrect data gets written to the storage device.
ZFS metadata would be based on this incorrect data, UFS metadata would point at this incorrect data, and both filesystems would consider the data valid, no? I think ZFS carries more metadata than UFS (checksums all over the place), so perhaps non-ECC has a higher potential for corrupting metadata.
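For a rough sense of how pervasive those checksums are (dataset and pool names are placeholders), every dataset carries a checksum property and every vdev tracks checksum failures:

    zfs get checksum tank/home   # "on" means the default fletcher4; each block pointer stores a checksum of the block it points to
    zpool status tank            # the CKSUM column counts blocks that failed checksum verification on read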

Given that, a snapshot of data that is incorrect on the storage device (with valid metadata based on that incorrect data) would probably be viewed as valid on the ECC machine. Keep in mind where the correction actually happens with ECC memory: it typically happens on a read operation.

On a write, the user writes to a file, ZFS writes the data to cache and creates/adds to a TXG, and the TXG reads the data from the cache to store it on the device. If the bit flip happened between the initial write to cache and the TXG read, then ECC would correct a single-bit flip. On a read, the data is pulled from the device into the cache (ARC) and the application reads it from the cache; ECC would correct at that point.
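To be concrete about why it would be viewed as valid: a scrub on the receiving machine only re-verifies blocks against the checksums that ZFS computed when those blocks were written, so data that was already corrupted in RAM before it was checksummed will still pass. A sketch, with backuppool as a placeholder pool name:

    zpool scrub backuppool
    zpool status -v backuppool   # reports CKSUM errors only for blocks that no longer match their stored checksum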

What would I do? On my home systems, I don't bother with ECC. If I had servers with critical data that many users depended on, and it was my job, I sure would use ECC.

As for your last paragraph, I have no opinion.
 