Mysterious issue involving GPT and ZFS

Dear Forum,

Before I submit a PR (Problem Report), I would like to receive feedback on a very weird issue which tickles my brain.

Okay, so let's start with the zpool status output:
Code:
  pool: star
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scan: scrub canceled on Thu Jan 13 22:49:14 2011
config:

        NAME           STATE     READ WRITE CKSUM
        star           ONLINE       0     0     0
          gpt/wdgreen  ONLINE       0     0     0

errors: No known data errors

Looks fine, right? I use a GPT label pointing to ada5p2; the disk is partitioned as follows:

Code:
# gpart show ada5
=>        34  1953522988  ada5  GPT  (932G)
          34         512     1  freebsd-boot  (256K)
         546        1502        - free -  (751K)
        2048  1953517568     2  freebsd-zfs  (932G)
  1953519616        3406        - free -  (1.7M)


So what happened?
I both created and used this single-disk pool under FreeBSD 8.2-RC1 amd64 with ZFS version 15. The pool/disk never had contact with any other FreeBSD or ZFS version prior to encountering this issue.

From the beginning: I created the GPT partitions, created the pool, wrote the disk full of data, and then rebooted with a clean shutdown -r now. As the system came up again, I noticed the first sign of trouble in this warning:

Code:
ada5 at ahcich5 bus 0 scbus5 target 0 lun 0
ada5: <WDC WD10EACS-00C7B0 01.01B01> ATA-8 SATA 2.x device
ada5: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada5: Command Queueing enabled
ada5: 953868MB (1953523055 512 byte sectors: 16H 63S/T 16383C)
GEOM: ada5: corrupt or invalid GPT detected.
GEOM: ada5: GPT rejected -- may not be recoverable.

So no more /dev/gpt/wdgreen label; no more /dev/ada5p1 and p2 GPT partitions; gone! Both the primary and backup are corrupted; how?!

As a result, ZFS no longer sees the pool; I presume it needs a device where the ZFS filesystem starts at LBA 0, which is exactly what the ada5p2 partition provided. GPT has both primary and backup metadata; how can both become corrupt? gpart recover did not write anything to the device and produced no output; the same goes for other commands on that drive. FYI: I had already copied the raw disk contents to a file on another filesystem before doing any tinkering, so I can reproduce this scenario at any time.
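
For reference, the backup and the recovery attempt looked roughly like this; the image path is just an example:

Code:
# raw copy of the whole disk to a file on another filesystem, before any tinkering
dd if=/dev/ada5 of=/backup/wdgreen-raw.img bs=1m
# the recovery attempt: wrote nothing and printed nothing in this case
gpart recover ada5
gpart show ada5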


Now it gets even weirder!
As I began analyzing this issue, things got even stranger. This is what I did:

1) I created a new GPT partition table with the same alignment, so ada5p2 would start at sector offset 2048 again; this should not have damaged any ZFS data on the drive.
2) Rebooted; now ZFS sees the ada5p2 partition, but zpool import shows my pool as corrupt!
3) Next I booted the system with an experimental FreeBSD 9.0-CURRENT (late December) + ZFS v28 patch. A zpool import worked fine and reported no corruption or any other issues, even after a partial (28%) scrub.
4) I exported the pool with zpool export star.
5) I rebooted back into the FreeBSD 8.2-RC1 + ZFS v15 environment.
6) The ZFS v15 system still reports the pool as corrupt; output:

Code:
  pool: star
    id: 6057642741777115521
 state: UNAVAIL
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

        star           UNAVAIL  insufficient replicas
          gpt/wdgreen  UNAVAIL  corrupted data
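
For completeness, the ZFS v28 side of steps 3) through 5) was just the stock commands, roughly:

Code:
zpool import star       # imported cleanly under ZFS v28
zpool scrub star        # ran to about 28% with no errors before I stopped it
zpool status -v star
zpool export star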

I think I have ruled out hardware errors like memory errors and general instability. Since the scrub shows no errors, HDD corruption also seems unlikely; SMART is also clean for that drive, with no UDMA_CRC or cabling errors reported. So with a lot of things ruled out, I'm beginning to think this may be a bug of some kind. There are two weird things that I can't explain:

1) Why did I lose my GPT partitioning after a simple clean reboot, without even a power cycle involved? I compared the old (corrupt) GPT data against the newly written GPT data with cmp and found only a few differing values; the rest of the first 1MiB, which includes the GPT table and the first GPT boot partition ada5p1, is all the same (a sketch of the comparison follows after this list).

2) Why does ZFS v15 show my pool as corrupt, while ZFS v28, which that disk never had any contact with before, shows the pool as normal and a scrub finds no problems?
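
The comparison in point 1 was done roughly like this (file names are just examples):

Code:
# save the first 1MiB (the GPT area) before and after re-creating the label,
# then list the differing bytes
dd if=/dev/ada5 of=/tmp/gpt-corrupt.bin bs=512 count=2048
# ... re-create the GPT label here ...
dd if=/dev/ada5 of=/tmp/gpt-new.bin bs=512 count=2048
cmp -l /tmp/gpt-corrupt.bin /tmp/gpt-new.bin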

I am not interested in data recovery, only in bug solving. I maintain the ZFSguru distribution, so losing a GPT label like this is unacceptable to my users. I need to research and report this.

Thanks for any feedback or insights you guys can offer, cheers!
 
I don't know how to reproduce the loss/corruption of the GPT label, but I do have a raw image of the single-drive pool in question. I snapshotted that image and made an mdconfig vnode pointing to it, and then observed the same behavior: zpool import reported the pool as corrupt.
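
Roughly, the reproduction setup looks like this; the paths and dataset name are just examples:

Code:
dd if=/dev/ada5 of=/data/wdgreen.img bs=1m     # raw image of the affected disk
zfs snapshot data@wdgreen-image                # so the image can be rolled back after each test
mdconfig -a -t vnode -f /data/wdgreen.img      # attaches the image as e.g. /dev/md0
zpool import                                   # the pool on the md device again shows up as corrupt under v15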

I just can't explain losing a GPT label in these circumstances, nor the corruption issue where ZFS v28 thinks everything is alright and a scrub reveals no errors at all. I don't know whether the two are related or separate issues. I would assume they are related, since whatever corrupted the GPT could have hit the ZFS data as well. But that does not explain why ZFS v28 has no problem with it and reveals no errors, while reverting to ZFS v15 again says it is corrupt. If there was damage that ZFS v28 could fix, wouldn't fixing it cause v15 to accept the pool?

Instead, I think this is a bug in the GEOM domain, where the GPT label somehow ended up pointing at the wrong disk or the wrong offset. That would explain why I lost the GPT label and got the corruption issue at the same time. I feel investigating the GPT label issue might be worthwhile, perhaps by reviewing any commits that could be suspected of having something to do with this.

It might be worth noting that this seems to have happened to at least one other person as well, as can be seen here:
http://hardforum.com/showpost.php?p=1036733764&postcount=78
 
Even if that were true, it still does not explain corruption of the primary GPT table at the beginning of the disk, where geom_label does not touch anything.

The setup in the link you gave uses poor alignment and is not related to my issue. He uses GEOM labels, which didn't work and produced error output on the console. I used only GPT labels, and as can be seen on the page you linked to, the GPT labels still work for him; only the GEOM label doesn't. GPT stores its data at both the beginning and the end of the provider, like a redundant MBR.

The strange thing is that in my case both primary and secondary had become corrupted.
 
Yep, you are right... you're already using the GPT labels...

It's really weird... I've been using GPT for some time now and never had any problem with it.
 
I first read the entire raw disk to a single file using dd, then restored the GPT partitioning by re-creating it. I noticed with cmp that this did change some values, but only a few bytes. This suggests the label was indeed corrupted. If wrong/bad metadata was written due to a bug, that would explain this issue quite well.

The other issue, the corrupt ZFS pool, may be unrelated but is weird nonetheless: a pool that shows as corrupt under v15 shows as healthy under v28, even though it was created with v15 and never had 'contact' with a v28 system.

Weird bugs can often be a combination of bugs. So far I can reproduce the ZFS corruption issue but I can't reproduce the disappearance/corruption of GPT labels.
 
sub_mesa, my line of thinking was that ZFS v28 may either:
  • have better code to deal with geometry changes; or
  • not yet be fully integrated into the GEOM framework, so it tried on its own and found where your ZFS data structures are.

Also, you did not indicate whether /dev/gpt/wdgreen is available on your 'new' system. If not, this might be confusing the old code.

In any case, I do not see anything wrong with the fact that newer software knows more and handles more border cases than older software (v15 vs v28). So I guess the more important issue is how corruption happened in the first place.

I have seen labels disappear in cases where I added labels after using the disks with ZFS. So a good measure would be to wipe out the beginning and the end of the disk (not just the partition data) before trying a new layout. With the current FreeBSD GEOM magic, the details are often hidden from the user.
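
Something along these lines would do it (destructive, and the arithmetic is only a sketch, so double-check it before running it against a real disk):

Code:
dd if=/dev/zero of=/dev/ada5 bs=1m count=1                              # wipe the first MiB
total=$(diskinfo /dev/ada5 | awk '{print $4}')                          # disk size in sectors
dd if=/dev/zero of=/dev/ada5 bs=512 seek=$((total - 2048)) count=2048   # and the last MiB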
 
Good feedback, danbi, thanks! Your idea about ZFS v28 having more tricks up its sleeve regarding offsets/geometry might be the reason why ZFS v28 can use the pool without problems.

After the GPT label problems, I wrote a new (identical) label; this process (see the command sketch after this list):
- writes zeroes to the first 1MiB of the disk
- creates a new GPT partition scheme
- creates a freebsd-boot boot partition
- creates a data partition spanning the rest of the drive minus 1MB
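
In gpart terms, that procedure is roughly the following (offsets and sizes match the layout shown in my first post):

Code:
dd if=/dev/zero of=/dev/ada5 bs=1m count=1
gpart create -s gpt ada5
gpart add -t freebsd-boot -s 512 ada5
gpart add -t freebsd-zfs -b 2048 -s 1953517568 -l wdgreen ada5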

So every time you do this, using the same name, the first 1MiB should be identical. However, comparing the corrupted GPT label against my newly written GPT label with cmp's hex output shows that a few bytes did change, so the GPT label was indeed corrupt, I guess.

So yes, after I wrote the new label I found that ZFS v15 saw the pool but said it was corrupted. Then I tried ZFS v28 and found both the GPT label and the pool to be intact. But I did not test whether FreeBSD 9 would have accepted my corrupt label; by that time I had already replaced the GPT label.

I realize this issue may be quite hard to reproduce and/or diagnose. It still may be worth investigating, since losing GPT labels like this is potentially quite serious, especially if the user does not know how to recover from it. I'm still waiting on feedback from other people who reported similar issues with losing GPT/GEOM labels. One user, linked above, reports losing 3 GEOM labels at once. If these issues are related, they really ought to be fixed.
 
ZFS uses two separate GPT labels, one at the start of the drive, one at the end of the drive (backup label). If those two are different, it complains. I'm guessing some versions of ZFS are better at figuring out which one is correct and "fixing" the other one to match. Probably by comparing the labels on other devices in the vdev?

However, if you partition the drive first (using GPT) and then use the partition for ZFS, ZFS will write its two GPT labels inside the partition (I believe). This is where it gets complicated, with all the nesting. :)
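
If you want to see what ZFS itself has written on the device, zdb can dump the labels it keeps at the front and back of the provider, e.g.:

Code:
zdb -l /dev/gpt/wdgreen     # or /dev/ada5p2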
 
phoenix said:
ZFS uses two separate GPT labels, one at the start of the drive, one at the end of the drive (backup label). If those two are different, it complains. I'm guessing some versions of ZFS are better at figuring out which one is correct and "fixing" the other one to match. Probably by comparing the labels on other devices in the vdev?
Well, that's just it: they were BOTH corrupted! I know how GPT complains if only the secondary is missing, which can occur if the device suddenly grew in size so that the last sector changed; then you will get such a message.

But in my case, both GPT labels were corrupted, with no recovery possible (see first post). That's also why I lost the /dev/gpt/<label> device entry. If there were a bug in writing GPT metadata (which is triggered aggressively, as I understand it), that could explain both primary and backup metadata being corrupt.
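
For anyone who wants to check this on their own disk: both GPT headers can be inspected directly. The LBA numbers here come from the dmesg output in my first post; a healthy header starts with the "EFI PART" signature:

Code:
dd if=/dev/ada5 bs=512 skip=1 count=1 | hexdump -C | head            # primary GPT header (LBA 1)
dd if=/dev/ada5 bs=512 skip=1953523054 count=1 | hexdump -C | head   # backup GPT header (last LBA)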
 
Hi,

I'm sorry for digging up this old thread. I've been struggling with the exact same issue lately.

I'm no FreeBSD or ZFS expert; I've been using FreeNAS 8 for a year now. My setup is the following:

FreeNAS virtualized on ESXi; FreeNAS is installed on an 8GB vdisk, and my storage pool is 3 x 1TB HDDs configured as Raw Device Mappings (RDM), so FreeNAS accesses them directly, bypassing VMware's VMkernel.

Explanation of the problem:
Lately I could not reach my CIFS share; the web GUI was unreachable too, and trying to reboot the system from the vSphere Client got me nowhere. So I did a reset in vSphere (the hard way, then) and the system went back to normal.

A couple of weeks after that, the same symptoms appeared. I tried the first two approaches with the same results, so I reset the VM like last time, and this time I lost my pool, according to the FreeNAS web GUI.

After some discussion on the FreeNAS forum I ended up running some commands, and here's what I got:

Code:
[root@freenas] ~# camcontrol devlist
<NECVMWar VMware IDE CDR10 1.00> at scbus1 target 0 lun 0 (cd0,pass0)
<VMware Virtual disk 1.0> at scbus2 target 0 lun 0 (da0,pass1)
<ATA SAMSUNG HD103SI 1AG0> at scbus2 target 1 lun 0 (da1,pass2)
<ATA SAMSUNG HD103SI 1AG0> at scbus2 target 2 lun 0 (da2,pass3)
<ATA SAMSUNG HD103SI 1AG0> at scbus2 target 3 lun 0 (da3,pass4)

Looks ok to me.

Code:
[root@freenas] ~# sysctl kern.disks
kern.disks: da3 da2 da1 da0 cd0

also ok

Code:
[root@freenas] ~# gpart show
=>      63  16777152  da0  MBR  (8.0G)
        63   1930257    1  freebsd  [active]  (943M)
   1930320        63       - free -  (32K)
   1930383   1930257    2  freebsd  (943M)
   3860640      3024    3  freebsd  (1.5M)
   3863664     41328    4  freebsd  (20M)
   3904992  12872223       - free -  (6.1G)

=>       0  1930257  da0s1  BSD  (943M)
         0       16         - free -  (8.0K)
        16  1930241      1  !0  (943M)

Unlike you, my data are of great value (mostly personal, pictures and stuff) and I have to say this thread gave me some hope.

One of the first pieces of advice I got on the other forum was to try another ZFS v28 implementation, as FreeNAS uses v15. But from your first post I understand that I have to recover or recreate my GPT first, right?

Before going any further, I'm planning to buy a new set of 3 HDDs to make a raw copy of everything. I have read quite a few things about ZFS structure and GPT, but it's still a bit blurry in my mind, especially when I read in this thread that you can have a GPT entry at the beginning of each HDD AND a GPT entry created by ZFS for itself. I don't really get that.

  1. Could you give me some advice?
  2. What should I check first to be sure of the issue?
  3. What should I do?
  4. What should I NOT do, to protect my data?
  5. What is the /dev/gpt/wdgreen label?
  6. Is it wise to use a tool for partition recovery? I mainly use TestDisk (http://www.cgsecurity.org/wiki/TestDisk), which is able to find and recover GPT partition tables, but from what I saw in your posts maybe it's more effective to recreate a new one?
  7. You were using a single HDD at the time and I use 3; does that make a difference in the recovery process?

So many questions, I'm sorry :)

I hope someone will answer me and save my life. Thanks for your help!
 