Dear Forum,
Before I submit a PR (Problem Report), I would like to receive feedback on a very weird issue which tickles my brain.
Okay so let's start with zpool status output:
Code:
pool: star
state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on older software versions.
scan: scrub canceled on Thu Jan 13 22:49:14 2011
config:
        NAME           STATE     READ WRITE CKSUM
        star           ONLINE       0     0     0
          gpt/wdgreen  ONLINE       0     0     0
errors: No known data errors
Looks fine, right? I use a GPT label pointing to ada5p2, formatted as:
Code:
# gpart show ada5
=>        34  1953522988  ada5  GPT  (932G)
          34         512     1  freebsd-boot  (256K)
         546        1502        - free -  (751K)
        2048  1953517568     2  freebsd-zfs  (932G)
  1953519616        3406        - free -  (1.7M)
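As a sanity check on the layout above, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the numbers are copied from the gpart output, 512-byte sectors are assumed, and the helper names (check_layout, is_aligned) are made up for this example.

```python
SECTOR = 512  # bytes per sector (assumption: 512-byte logical sectors)

# (start LBA, length in sectors, description), copied from `gpart show ada5`
layout = [
    (34, 512, "freebsd-boot"),
    (546, 1502, "free"),
    (2048, 1953517568, "freebsd-zfs"),
    (1953519616, 3406, "free"),
]
usable_start, usable_len = 34, 1953522988  # the "=>" line


def check_layout(rows, start, length):
    """Return True if the rows tile [start, start + length) with no gaps or overlaps."""
    pos = start
    for row_start, row_len, _desc in rows:
        if row_start != pos:
            return False
        pos += row_len
    return pos == start + length


def is_aligned(lba, alignment_bytes=1024 * 1024):
    """Return True if the byte offset of this LBA is a multiple of alignment_bytes."""
    return (lba * SECTOR) % alignment_bytes == 0


print(check_layout(layout, usable_start, usable_len))
print(is_aligned(2048))
```

The second check reflects that sector 2048 sits exactly 1MiB into the disk, which is the alignment the freebsd-zfs partition uses.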
So what happened?
I both created and used this single-disk pool under FreeBSD 8.2-RC1 amd64 with ZFS version 15. The pool/disk never had contact with any other FreeBSD or ZFS version prior to encountering this issue.
From the beginning: I created the GPT partitions, created the pool, filled the disk with data, then rebooted using a clean shutdown -r now command. As the system came up again, I noticed the first sign of trouble with this warning:
Code:
ada5 at ahcich5 bus 0 scbus5 target 0 lun 0
ada5: <WDC WD10EACS-00C7B0 01.01B01> ATA-8 SATA 2.x device
ada5: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada5: Command Queueing enabled
ada5: 953868MB (1953523055 512 byte sectors: 16H 63S/T 16383C)
GEOM: ada5: corrupt or invalid GPT detected.
GEOM: ada5: GPT rejected -- may not be recoverable.
So no more /dev/gpt/wdgreen label and no more /dev/ada5p1 and ada5p2 GPT partitions; gone! Both the primary and the backup GPT are reported as corrupt; how?!
As a result, ZFS doesn't see the pool anymore. I presume it needs a device on which the ZFS filesystem starts at LBA 0, so it would need that ada5p2 partition. GPT keeps both primary and backup metadata; how can they both turn corrupt? gpart recover did not write anything to the device and did not yield any output either; the same goes for other commands on that drive. FYI: I copied the raw disk contents to a file on another filesystem before I did any tinkering, so I can reproduce this scenario at any time.
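One way to narrow down why both headers are rejected is to check each header's signature and CRC32 directly in the saved raw image. The following is a minimal Python sketch, assuming 512-byte sectors and the standard 92-byte GPT header layout from the UEFI specification. The function names are hypothetical, and this is not the code GEOM itself runs; it also does not verify the CRC of the partition entry array.

```python
import struct
import zlib

SECTOR = 512  # assumption: 512-byte logical sectors
# Standard GPT header fields, little-endian, 92 bytes in total.
HEADER_FMT = "<8sIIIIQQQQ16sQIII"
HEADER_LEN = struct.calcsize(HEADER_FMT)


def check_gpt_header(sector_bytes):
    """Return a list of problems found in one GPT header sector (empty if none)."""
    if len(sector_bytes) < HEADER_LEN:
        return ["sector too short"]
    (sig, _rev, hdr_size, hdr_crc, _reserved, _my_lba, _alt_lba,
     first_usable, last_usable, _guid, _entries_lba, _n_entries,
     _entry_size, _entries_crc) = struct.unpack(
        HEADER_FMT, sector_bytes[:HEADER_LEN])
    if sig != b"EFI PART":
        return ["bad signature"]
    if not (HEADER_LEN <= hdr_size <= len(sector_bytes)):
        return ["implausible header size"]
    problems = []
    # The CRC covers hdr_size bytes with the CRC field (bytes 16..19) zeroed.
    zeroed = sector_bytes[:16] + b"\x00" * 4 + sector_bytes[20:hdr_size]
    if zlib.crc32(zeroed) & 0xFFFFFFFF != hdr_crc:
        problems.append("header CRC mismatch")
    if first_usable > last_usable:
        problems.append("usable LBA range inverted")
    return problems


def check_image(path):
    """Check the primary (LBA 1) and backup (last LBA) headers of a raw image file."""
    results = {}
    with open(path, "rb") as f:
        f.seek(0, 2)  # seek to the end to learn the image size
        total_sectors = f.tell() // SECTOR
        for name, lba in (("primary", 1), ("backup", total_sectors - 1)):
            f.seek(lba * SECTOR)
            results[name] = check_gpt_header(f.read(SECTOR))
    return results
```

Calling check_image on the saved copy of the disk would return one list of problems per header, which shows whether the signature or the checksum is what fails in each copy.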
Now it gets even weirder!
This is what I did while analyzing the issue:
1) I created a new GPT partition with the same alignment, so that ada5p2 would again start at sector 2048; this should not have damaged any ZFS data on the drive.
2) Rebooted; now ZFS sees the ada5p2 partition, but zpool import shows my pool as corrupt!
3) I then booted the system with an experimental FreeBSD 9.0-CURRENT (late December) + ZFS v28 patch. A zpool import worked fine and it reported no corruption or any other issues, even after a partial (28%) scrub.
4) I exported the pool with zpool export star
5) I rebooted again into the FreeBSD 8.2-RC1 + ZFS v15 environment
6) The ZFS v15 system still reports the pool as corrupt; output:
Code:
pool: star
id: 6057642741777115521
state: UNAVAIL
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
see: http://www.sun.com/msg/ZFS-8000-5E
config:
        star           UNAVAIL  insufficient replicas
          gpt/wdgreen  UNAVAIL  corrupted data
I think I have ruled out hardware problems such as memory errors and general instability. Since the scrub shows no errors, HDD corruption also seems unlikely; SMART is clean for that drive, with no UDMA_CRC or cabling errors reported. So with a lot of things ruled out, I'm beginning to think this may be a bug of some kind. There are two weird things that I can't explain:
1) Why do I lose my GPT partitions after a simple clean reboot, without even a power cycle involved? I compared the old (corrupt) GPT and the new GPT with cmp and found only a few differing values; the rest of the first 1MiB is identical, which includes the GPT table and the first GPT boot partition, ada5p1.
2) Why does ZFS v15 show my pool as corrupt, while ZFS v28, which that disk had never had any contact with before, shows the pool as normal and a scrub finds no problems?
I am not interested in data recovery, only in finding the bug. I maintain the ZFSguru distribution, so losing a GPT label like this is unacceptable to my users. I need to research and report this.
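For the cmp comparison mentioned in point 1, it can help to see which sectors the differing bytes fall in. Below is a rough Python sketch of that idea, assuming 512-byte sectors and two saved copies of the first 1MiB (before and after recreating the GPT). The function names and the file arguments are hypothetical.

```python
SECTOR = 512  # assumption: 512-byte logical sectors


def differing_sectors(old, new, limit=1024 * 1024):
    """Map each LBA that differs to the list of differing byte offsets inside it."""
    diffs = {}
    n = min(len(old), len(new), limit)
    for offset in range(n):
        if old[offset] != new[offset]:
            diffs.setdefault(offset // SECTOR, []).append(offset % SECTOR)
    return diffs


def compare_images(old_path, new_path, limit=1024 * 1024):
    """Compare the first `limit` bytes of two raw image files."""
    with open(old_path, "rb") as a, open(new_path, "rb") as b:
        return differing_sectors(a.read(limit), b.read(limit), limit)
```

With the standard layout of 128 entries of 128 bytes each, differences in LBA 1 would point at the primary GPT header and differences in LBA 2 through 33 at the partition entry array. The backup GPT at the end of the disk lies outside the first 1MiB and is not covered by this comparison.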
Thanks for any feedback or insights you guys can offer, cheers!