ZFS replace issue

Hi,

I have some issues with my current ZFS setup. One of the disks failed due to power supply issues, and the physical drive itself died as well. So the result was that I had to get a new, equivalent disk to replace it as soon as possible. The old disk was located at /dev/ada2p1 - the new one as well. When the disk arrived I ran the following commands:

Code:
gpart destroy -F /dev/ada2
gpart create -s GPT /dev/ada2
gpart add -l "2TB-2" -t freebsd-zfs /dev/ada2
zpool replace myPool /dev/ada2p1 /dev/ada2p1
and it started to resilver. Unfortunately something did not go as expected, and the ZFS pool is currently in the following state:

Code:
  pool: myPool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scan: resilvered 723G in 12h45m with 2 errors on Sun Aug  4 06:07:41 2013
config:

	NAME                        STATE     READ WRITE CKSUM
	zStar                       ONLINE       0     0     8
	  raidz1-0                  ONLINE       0     0    32
	    ada7p1                  ONLINE       0     0     0
	    ada8p1                  ONLINE       0     0     0
	    ada9p1                  ONLINE       0     0     0
	    ada10p1                 ONLINE       0     0     0
	    ada1p1                  ONLINE       0     0     0
	    ada3p1                  ONLINE       0     0     0
	    replacing-6             ONLINE       0 8,25K 2,58K
	      3002745444615580681   UNAVAIL      0     0     0  was /dev/ada2p1/old
	      13133932722334411954  UNAVAIL      0     0     0  was /dev/ada2p1
	      ada2p1                ONLINE       0     0     0
	    ada4p1                  ONLINE       0     0     0
	    ada6p1                  ONLINE       0     0     0
	logs
	  gpt/ZIL                   ONLINE       0     0     0
	cache
	  gpt/L2ARC                 ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/myPool/Backup/Workstation-02-084bd72a-9c94-61/20130624-2100.xva

The pool seems fine - yet I cannot get rid of this part:
Code:
	    replacing-6             ONLINE       0 8,25K 2,58K
	      3002745444615580681   UNAVAIL      0     0     0  was /dev/ada2p1/old
	      13133932722334411954  UNAVAIL      0     0     0  was /dev/ada2p1
	      ada2p1                ONLINE       0     0     0
...
Code:
zpool offline myPool 13133932722334411954
cannot offline 13133932722334411954: no valid replicas

zpool offline myPool ada2p1
cannot offline ada2p1: no valid replicas

Currently it looks like my pool has fallen back to raidz0?

How can I fix this? Any ideas? Recreating the pool would be a bit difficult since it would be around 6 TB to back up.
 
Some observations: your commands refer to a pool named "myPool", while the config section of the status output shows a pool named "zStar". You're defining GPT labels but not using them. And you're using a RaidZ1 with 9 drives in a single vdev. The "replacing-6" node lists three drives, with the last showing as "ONLINE". I haven't had this happen on any of my systems, so I'm not sure how that happened.
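
If you do want a label to actually end up in the pool, you would reference it in the replace itself rather than the raw partition. Only a rough sketch, reusing the label name from your commands above:

Code:
gpart add -l 2TB-2 -t freebsd-zfs ada2
zpool replace myPool /dev/ada2p1 gpt/2TB-2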

However, the most important bit is the actual error message, which states that there are permanent errors in one file. Try deleting this file and scrubbing the pool again.
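
Roughly, assuming that .xva backup file is expendable:

Code:
rm /mnt/myPool/Backup/Workstation-02-084bd72a-9c94-61/20130624-2100.xva
zpool scrub myPool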

I strongly urge you to recreate this pool using a RaidZ2 vdev, as RaidZ1 offers rather little redundancy when there are 9 drives in a single vdev.

PS: The vdev name is "raidz1-0", which means it's the first RaidZ1 vdev in this pool.
 
Hi Savagedlight,

You observed correctly - sorry, my bad - I didn't edit the names consistently all the way through. The reason I don't use labels anymore is that after some reboots, imports, and exports it turns out ZFS is using the raw device names again anyway. I experienced a similar issue with gmirror, gstripe, etc. According to my research it's a bug.

Anyway, I don't know how this could have happened either - I have never experienced this before. The fact is, I would like to offline drives by their GUID instead of by a device name like ada2p1.
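
For what it's worth, the GUIDs can also be read straight out of the on-disk labels; just a sketch, with <guid> as a placeholder for whatever zdb prints:

Code:
zdb -l /dev/ada2p1 | grep -w guid
zpool offline myPool <guid>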
 
I'm not sure how you've managed to get a pool that identifies itself as myPool and zStar in the same output. Has it been renamed or imported with the altroot property set or something?

Anyway, 'replacing' vdevs function similarly to a mirror. Have you tried 'detaching' the UNAVAIL disks?

Code:
# zpool detach myPool 13133932722334411954

Note: ZFS cannot "fall back to running raidz0". A raidz1 vdev will always function as a raidz1 vdev. (There also isn't really a "raidz0"; a non-redundant vdev is just referred to as a stripe.)
 
Hi,

Unfortunately it didn't help...
Code:
  pool: myPool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scan: resilvered 722G in 13h7m with 2 errors on Fri Aug  9 10:41:33 2013
config:

	NAME                                              STATE     READ WRITE CKSUM
	myPool                                             DEGRADED     0     0     8
	  raidz1-0                                        DEGRADED     0     0    32
	    gptid/ff7f4b1d-afbb-11e1-b174-f46d04afa0a7    ONLINE       0     0     0
	    gptid/ff9df20b-afbb-11e1-b174-f46d04afa0a7    ONLINE       0     0     0
	    gptid/ffbb34b4-afbb-11e1-b174-f46d04afa0a7    ONLINE       0     0     0
	    gptid/ffd8824d-afbb-11e1-b174-f46d04afa0a7    ONLINE       0     0     0
	    gptid/fff58269-afbb-11e1-b174-f46d04afa0a7    ONLINE       0     0     0
	    gptid/00140937-afbc-11e1-b174-f46d04afa0a7    ONLINE       0     0     0
	    replacing-6                                   DEGRADED     0     0     0
	      3002745444615580681                         UNAVAIL      0     0     0  was /dev/ada2p1/old
	      13133932722334411954                        UNAVAIL      0     0     0  was /dev/ada2p1
	      gptid/3251ba4d-fc50-11e2-b9fe-f46d04afa0a7  ONLINE       0     0     0
	    gptid/b2ef0cc2-4279-11e2-b210-f46d04afa0a7    ONLINE       0     0     0
	    gptid/0f608fd1-b0c5-11e1-b174-f46d04afa0a7    ONLINE       0     0     0
	logs
	  gptid/0ed0e768-364a-11e2-b210-f46d04afa0a7      ONLINE       0     0     0
	cache
	  gpt/L2ARC                                       ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/Hald-Bau/Backup/Workstation-02-084bd72a-9c94-61/20130624-2100.xva
root@Storage-01 [~]$ zpool offline myPool /dev/ada2p1/old
cannot offline /dev/ada2p1/old: no valid replicas
root@Storage-01 [~]$ zpool offline myPool /dev/ada2p1
cannot offline /dev/ada2p1: no valid replicas
root@Storage-01 [~]$ zpool offline myPool gptid/3251ba4d-fc50-11e2-b9fe-f46d04afa0a7
cannot offline gptid/3251ba4d-fc50-11e2-b9fe-f46d04afa0a7: no valid replicas

Neither did the other tip from @usdmatt:
Code:
zpool detach myPool 13133932722334411954
cannot detach 13133932722334411954: no valid replicas
It seems like I do not have redundancy anymore?!

Any more hints to fix this issue? ;)

Thanks
 
I still have the same issue... it's kind of bugging me, since a failure of another drive will bring the pool down ;) and currently I'm unable to get rid of this "replacing-6" part of my vdev.

Any ideas?
 
I think you can detach the original missing disk to cancel the replacement with # zpool detach myPool 3002745444615580681. I have done that before.

Once the original disk has been detached you will have no choice but to resilver the entire disk with zpool replace and there will be no redundancy until the operation completes. You wouldn't want to do zpool detach if you still had access to the data on the original disk, but it sounds like you don't.
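
Roughly, as a sketch only (the GUIDs being the ones from your zpool status output):

Code:
# detach the long-gone original disk to cancel the stuck replacement
zpool detach myPool 3002745444615580681
# the other stale entry should then be detachable as well
zpool detach myPool 13133932722334411954
# check whether ada2p1 has taken over as a regular raidz member;
# if not, re-run zpool replace and let the resilver finish
zpool status -v myPool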
 
Hmm, there's no real reason detaching the UNAVAIL disks should have given that error, unless it has something to do with the errors you had during the initial replace.

Firstly, I'm intrigued by the version of FreeBSD being used. 'No valid replicas' is a common ZFS issue on older releases (see one of my earlier posts on the same subject, such as http://forums.freebsd.org/showthread.php?p=226036#post226036). At first I thought your box looked relatively new, due to the way the vdevs are labelled and the syntax of the command prompt.

However, the 'scan: ' line being slightly out of alignment in zpool status is a minor bug I remember, and it was fixed a while ago - almost two years ago according to SVN (http://svnweb.freebsd.org/socsvn/mi...ool_main.c?r1=224390&r2=226863&pathrev=239962). Is it possible this is a fairly old build of FreeBSD, or one that has been incorrectly upgraded at some point? My 9.1 machine definitely has everything in line, having just done a resilver today:

Code:
root@core0:/home/matt # uname -a
FreeBSD core0.backup.x.y 9.1-RELEASE-p5
root@core0:/home/matt # zpool status
  pool: storage
 state: ONLINE
  scan: resilvered 197G in 6h40m with 0 errors on Mon Aug 12 15:37:37 2013
config:

        NAME           STATE     READ WRITE CKSUM
        storage        ONLINE       0     0     0
          raidz2-0     ONLINE       0     0     0
            gpt/disk0  ONLINE       0     0     0
            gpt/disk1  ONLINE       0     0     0
            gpt/disk2  ONLINE       0     0     0
            gpt/disk3  ONLINE       0     0     0
            gpt/disk4  ONLINE       0     0     0
            gpt/disk5  ONLINE       0     0     0

(ZFS also references illumos.org in all error messages rather than sun.com these days)

I would suggest booting a live CD of the most recent FreeBSD you can get (8.4, 9.1 or 9.2) and trying to fix it from there (or upgrading).

If you still can't detach those disks, it may be worth deleting the file it's complaining about (and unfortunately probably all snapshots that might reference it), then running a zpool clear and zpool scrub first to see if you can get the pool to a consistent point where there are no data errors. Then reattempt to remove the UNAVAIL disks with zpool detach.
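
In other words, roughly (a sketch only):

Code:
# after deleting the damaged .xva file and any snapshots that still reference it
zpool clear myPool
zpool scrub myPool
zpool status -v myPool    # wait for the scrub; ideally "errors: No known data errors"
zpool detach myPool 3002745444615580681
zpool detach myPool 13133932722334411954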
 
usdmatt said:
If you still can't detach those disks, it may be worth deleting the file it's complaining about (and unfortunately probably all snapshots that might reference it), then running a zpool clear and zpool scrub first to see if you can get the pool to a consistent point where there are no data errors. Then reattempt to remove the UNAVAIL disks with zpool detach.

Unfortunately it also didn't work out as hoped; after zpool clear myPool and zpool scrub myPool it still looks like this:

Code:
root@Storage-01 [~]$ clear ; zpool status -v

  pool: zStar
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scan: scrub repaired 0 in 10h40m with 2 errors on Sat Aug 17 22:41:30 2013
config:

	NAME                                              STATE     READ WRITE CKSUM
	zStar                                             DEGRADED     0     0     0
	  raidz1-0                                        DEGRADED     0     0     0
	    gptid/ff7f4b1d-afbb-11e1-b174-f46d04afa0a7    ONLINE       0     0     0
	    gptid/ff9df20b-afbb-11e1-b174-f46d04afa0a7    ONLINE       0     0     0
	    gptid/ffbb34b4-afbb-11e1-b174-f46d04afa0a7    ONLINE       0     0     0
	    gptid/ffd8824d-afbb-11e1-b174-f46d04afa0a7    ONLINE       0     0     0
	    gptid/fff58269-afbb-11e1-b174-f46d04afa0a7    ONLINE       0     0     0
	    gptid/00140937-afbc-11e1-b174-f46d04afa0a7    ONLINE       0     0     0
	    replacing-6                                   DEGRADED     0     0     0
	      3002745444615580681                         UNAVAIL      0     0     0  was /dev/ada2p1/old
	      13133932722334411954                        UNAVAIL      0     0     0  was /dev/ada2p1
	      gptid/3251ba4d-fc50-11e2-b9fe-f46d04afa0a7  ONLINE       0     0     0
	    gptid/b2ef0cc2-4279-11e2-b210-f46d04afa0a7    ONLINE       0     0     0
	    gptid/0f608fd1-b0c5-11e1-b174-f46d04afa0a7    ONLINE       0     0     0
	logs
	  gptid/0ed0e768-364a-11e2-b210-f46d04afa0a7      ONLINE       0     0     0
	cache
	  gpt/L2ARC                                       ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        zStar/Hald-Bau:<0x1d4>

root@Storage-01 [~]$  zpool detach myPool 3002745444615580681
cannot detach 3002745444615580681: no valid replicas

root@Storage-01 [~]$  zpool detach myPool 13133932722334411954
cannot detach 13133932722334411954: no valid replicas

root@Storage-01 [~]$  zpool detach myPool replacing-6
cannot detach replacing-6: no such device in pool

Code:
root@Storage-01 [~]$ uname -a
FreeBSD Storage-01.NetOcean.de 9.0-RELEASE FreeBSD 9.0-RELEASE #0: Tue Jan  3 07:46:30 UTC 2012     root@farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64

root@Storage-01 [~]$ zpool get all myPool
NAME   PROPERTY       VALUE       SOURCE
[...]
myPool  version        28          default
[...]
root@Storage-01 [~]$

Any more ideas on how to solve this issue? Thanks a lot for the hints so far!

Best regards.
 
I'd guess you'll have to import the pool on a system that is at least 9.1-RELEASE. I recall that the "double fault" problem wasn't fixed until 9.1-RELEASE.
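
Roughly, as a sketch (the pool is version 28, so a 9.1 or newer release should import it without needing an upgrade):

Code:
# on the current system
zpool export myPool
# boot a 9.1-RELEASE (or newer) live CD/memstick on the same box, then:
zpool import myPool
zpool detach myPool 3002745444615580681
zpool detach myPool 13133932722334411954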
 