zpool degraded after disk replace

Hi!

I have a problem with my RAID system. One disk was replaced, but it seems that something was configured the wrong way back then.

Below is what zpool status says. How can I definitively remove ad14/old, replace it with ad14, and move the pool from DEGRADED to ONLINE?

Code:
zpool status tank

  pool: tank
 state: DEGRADED
 scrub: none requested
config:

	NAME            STATE     READ WRITE CKSUM
	tank            DEGRADED     0     0     0
	  raidz2        DEGRADED     0     0     0
	    ad4         ONLINE       0     0     0
	    ad6         ONLINE       0     0     0
	    ad8         ONLINE       0     0     0
	    ad10        ONLINE       0     0     0
	    ad12        ONLINE       0     0     0
	    replacing   DEGRADED     0     0     0
	      ad14/old  UNAVAIL      0 22.5K     0  cannot open
	      ad14      ONLINE       0     0     0

errors: No known data errors

Thanks for any help, Daniel
 
Thanks for the hint. Unfortunately it didn't work out in my case. The scrub didn't finish in three days, and I wasn't able to shut down the server normally.

I still get errors from the ad14 disk in the system logs:

Code:
GEOM: ad14: the primary GPT table is corrupt or invalid.
GEOM: ad14: using the secondary instead -- recovery strongly advised.

I think I might have to recreate the pool from a backup.

Thanks,
Daniel
 
Ouch, I guess I know what happened.

The disk you used as a replacement was in use before and was not cleared by dd-ing it with zeros? GEOM can still see the old GPT partition scheme, or at least its signature, because ZFS only overwrites the areas it actually needs. It could be that wiping the proper disk area and doing a new resilver solves the problem. It could also be that I am off track here.
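
If you want to confirm that, a read-only check along these lines should show whatever stale partition table GEOM is picking up (ad14 taken from the zpool status above):

Code:
# Read-only: prints any partition table GEOM still sees on the disk.
gpart show ad14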
 
Thanks! Yeah, the disk was causing problems and was replaced without any further cleaning.
What can I do to wipe the disk area?
 
To make sure: did you use a disk that was already in service to replace the failing one? Your answer leaves room for ambiguity, and that should be avoided when dealing with things you cannot easily roll back ;)

The scenario I pictured: one disk (ad14) goes bad, and you bring in a replacement. That replacement disk was in service before, so it holds a GPT. You ran zpool replace to swap the bad one for the new one. Is this right so far?

If yes, I would now ask one of the ZFS experts here how to un-replace that disk and then zero out the first few MB of it. Then retry replacing your bad disk with it.
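
My understanding is that cancelling a half-finished replace means detaching the stale half of the replacing vdev, so the un-replace step might look like the line below, but whether ZFS accepts it in this state is exactly what I would want confirmed first:

Code:
# Possibly the un-replace step: detach the stale half of the replacing vdev.
# Untested in this exact situation, so get it confirmed before running it.
zpool detach tank ad14/old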

If the resilver is complete and the bad disk has already been removed, ZFS should be able to repair the parts of the first few MB it cares about (given sufficient redundancy), leaving the GPT area zeroed.

Since I do not know enough about what has already been done to the pool, or what ZFS will make of my suggestions, I would ask e.g. phoenix for further comments on this.
 
I didn't do the replacement myself, so I can't say for sure whether, and for what, the new disk was used before the replacement. The replacement might have been triggered automatically, as the autoreplace option is enabled.

I can't do anything with the device now, as any attempt to offline either the new or the old device fails:
Code:
cannot offline ad14/old: no valid replicas
 
One thing you can try:
  • zpool export <poolname>
  • shut down the ZFS box
  • physically remove the drive
  • zero the drive (see below)
  • physically attach the drive
  • boot the ZFS box to single-user mode
  • /etc/rc.d/hostid start
  • zpool import <poolname> (should come up DEGRADED with ad14 marked as missing)
  • zpool replace <poolname> ad14

That should force it to resilver ad14 and bring everything back to a normal state.
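
If it helps, that sequence looks roughly like this on the console (pool name and device taken from the zpool status output above; the zeroing itself happens on another machine in between):

Code:
zpool export tank
shutdown -p now
# pull the drive, zero it elsewhere (see below), put it back,
# then boot to single-user mode and run:
/etc/rc.d/hostid start
zpool import tank
zpool status tank       # pool should show DEGRADED with ad14 missing
zpool replace tank ad14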

To zero the drive:
In a different system, connect the drive you removed from the pool, and clear it out like so:
# dd if=/dev/zero of=/dev/ad-whatever bs=16M
Be sure to double-check the device node so that you zero out the correct disk!! :)
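
One quick way to double-check before writing anything is to look at the drive's size first; that usually makes it obvious whether you grabbed the right device:

Code:
# Read-only: prints sector size, media size, and sector count for the disk,
# which helps confirm it is the one you meant to wipe.
diskinfo -v /dev/ad-whatever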
 
Blank the first and last part of the drive to prevent confusion with whatever else was on there before. Something like:

Code:
disk=md0
disksize=`diskinfo $disk | awk '{ print $4 }'`
dd if=/dev/zero of=/dev/$disk count=10000
dd if=/dev/zero of=/dev/$disk seek=`dc -e "$disksize 10000 - p"` count=10000


Assuming the disk has more than 10000 sectors :)
I have a USB SATA/IDE adapter. I usually use that to blank the start & end of the drive, using my laptop, before booting into FreeBSD. Or you could just use single user mode.
 