ZFS metadata corruption

One of the RAM modules in my server was badly seated and caused corruption on the ZFS filesystem. I have restored all the corrupted files and scrubbed the pool, but it still reports errors in the metadata.

How do you guys recommend I deal with this? I have data backups and very old snapshots.


Code:
  pool: zroot
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 25.5K in 2h17m with 8 errors on Mon Mar 11 21:07:25 2013
config:

	NAME           STATE     READ WRITE CKSUM
	zroot          ONLINE       0     0     2
	  mirror-0     ONLINE       0     0     8
	    gpt/disk0  ONLINE       0     0     8
	    gpt/disk1  ONLINE       0     0     8

errors: Permanent errors have been detected in the following files:

        zroot:<0x0>
        zroot/usr:<0x0>
        zroot/var/db:<0x0>
        zroot/jails:<0x0>
        zroot/jails:<0x20eb20>
 
Thanks. Do these errors indicate that the corruption is related to snapshots? If so, is there any way of identifying the affected snapshot(s)?

The ZFS docs state:
"If an object in the metaobject set (MOS) is corrupted, then a special tag of <metadata>, followed by the object number, is displayed."
I'm afraid this doesn't mean much to me and I've not found much information about this issue.
 
In my experience the filename affected by corruption disappears from the zpool status -v output once you delete the file. However, if one or more snapshots still reference the damaged data, the name disappears and an object identifier is shown in its place. That is why I recommended destroying snapshots if you want those errors to go away.
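A minimal sketch of that cleanup, using the datasets named in your zpool status output; the snapshot names here are only placeholders:

Code:
  # List every snapshot under the datasets that show errors
  zfs list -t snapshot -r zroot/jails zroot/usr zroot/var/db

  # Destroy the ones you can live without (placeholder names)
  zfs destroy zroot/jails@some-old-snapshot
  zfs destroy zroot/usr@some-old-snapshot

Note that zpool status -v keeps reporting entries from the previous scrub's error log, so they usually only disappear after another scrub (sometimes two) once nothing references the damaged blocks any more.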

ZFS metadata is not easily corrupted, because all metadata is written at least twice. If your pool consists of at least two disks, those copies end up on different disks. Any redundancy you have comes on top of that, so each copy also gets mirror or RAID-Z1/2/3 protection where applicable. But if all copies of a piece of metadata are corrupted, the damage can be severe: if crucial metadata is corrupted or stale, the pool will be FAULTED with a message saying 'corrupted data', and that error is usually fatal.
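As a side illustration of those extra copies (zroot/important is a hypothetical dataset name, and this is only how I understand the behaviour): the copies property controls redundant copies of user data, and ZFS stores metadata with one more copy than that automatically, with pool-wide metadata getting three.

Code:
  # Show how many copies of user data are kept (default is 1);
  # metadata is stored with one additional copy on top of this.
  zfs get copies zroot

  # Raising it only affects blocks written after the change:
  zfs set copies=2 zroot/important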

In your case you had RAM corruption. ZFS has no real protection against bad RAM, but it can at least detect the corruption it produces and, in most cases, correct it if you scrub after swapping in good modules. In that sense ZFS does offer some protection. Clients running Windows or Linux generally have none, so bad RAM there results in silent corruption.
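A sketch of the recovery sequence after the faulty module has been replaced, using the pool name from your output:

Code:
  zpool scrub zroot        # repair anything that still has a good copy
  zpool status -v zroot    # see whether permanent errors remain
  zpool clear zroot        # reset the READ/WRITE/CKSUM counters
  zpool scrub zroot        # a follow-up scrub lets stale entries drop
                           # out of the persistent error log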
 
Thanks for the comprehensive explanation.

I have destroyed all snapshots, re-scrubbed and cleared the pool, but I'm still getting exactly the same errors, so it doesn't appear to be snapshot related.
 