ZFS RAID: Disk fails while replacing another disk

hessi · Mar 3, 2010

Good morning guys,

First of all, apologies if I've chosen the wrong forum for this question. I wasn't sure whether a software problem belongs in "System Hardware", but on the other hand it's clearly about internal storage...

I have a ZFS pool made up of two 3-disk raidz1 vdevs, which I recently moved from OS X to FreeBSD (8-STABLE).

One hard disk failed last weekend with lots of shouting, SMART messages and even a kernel panic.
I attached a new disk and started the replacement.
Unfortunately, about 20% into the replacement, a second disk in the same RAID showed signs of misbehaviour by giving me read errors. The resilvering did finish, though, and it left me with only three broken files according to zpool status:

Code:

[root@camelot /]# zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 10h42m with 136 errors on Tue Mar  2 07:55:05 2010
config:

        NAME           STATE     READ WRITE CKSUM
        tank           DEGRADED   137     0     0
          raidz1       ONLINE       0     0     0
            ad17p2     ONLINE       0     0     0
            ad18p2     ONLINE       0     0     0
            ad20p2     ONLINE       0     0     0
          raidz1       DEGRADED   326     0     0
            replacing  DEGRADED     0     0     0
              ad16p2   OFFLINE      2  169K     6
              ad4p2    ONLINE       0     0     0  839G resilvered
            ad14p2     ONLINE       0     0     0  5.33G resilvered
            ad15p2     ONLINE     418     0     0  5.33G resilvered

errors: Permanent errors have been detected in the following files:

        tank/DVD:<0x9cd>
        tank/DVD@20100222225100:/Memento.m4v
        tank/DVD@20100222225100:/Payback.m4v
        tank/DVD@20100222225100:/TheManWhoWasntThere.m4v

I have the feeling the problems on ad15p2 are related to a cable issue, since it doesn't have any SMART errors, is quite a new drive (3 months old) and was IMHO sufficiently "burned in" by repeatedly filling it to the brim and checking the contents (via ZFS). So I'd like to switch off the server, replace the cable and do a scrub afterwards to make sure it doesn't produce additional errors.

Unfortunately, although it says the resilvering completed, I can't detach ad16p2 (the first faulted disk) from the system:

Code:

[root@camelot /]# zpool detach tank ad16p2
cannot detach ad16p2: no valid replicas

To be honest, I don't know how to proceed. The system feels very unstable right now, with a replacement not yet finished and errors on two drives in one RAID-Z1.

I deleted the files affected, but have about 20 snapshots of this filesystem and think these files are in most of them since they're quite old.

So, what should I do now? Delete all snapshots? Move all other files from this filesystem to a new filesystem and destroy the whole filesystem? Try to export and import the pool? Is it even safe to reboot the machine right now?

Any help would be appreciated.

Thank you.

hessi

phoenix · Mar 3, 2010

Shut down the box. Boot it back up. Then run a manual scrub: # zpool scrub tank. That should complete the replacement, after which the old device should no longer appear. You may need to run "zpool clear" first to zero out the error counts.
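Once the box is back up, the sequence above could be sketched roughly like this. The pool name tank comes from the thread; the function name recover_pool and the ZPOOL override are assumptions added for illustration, so the steps can be previewed before anything touches the pool.

```shell
#!/bin/sh
# Rough sketch of the post-reboot steps described above.
# POOL defaults to the pool name from this thread.
# ZPOOL is a hypothetical override: set ZPOOL="echo zpool" to print
# the commands instead of running them.
POOL="${POOL:-tank}"
ZPOOL="${ZPOOL:-zpool}"

recover_pool() {
    # Zero out the error counters first.
    $ZPOOL clear "$POOL" || return 1
    # Start a manual scrub of the whole pool.
    $ZPOOL scrub "$POOL" || return 1
    # Show scrub progress and the current device list.
    $ZPOOL status -v "$POOL"
}
```

Running ZPOOL="echo zpool" first and calling recover_pool only prints the three commands; with the default it runs them for real. The scrub keeps running in the background, so "zpool status -v tank" has to be repeated later to see the final result.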

I've had this happen (resilver complete, old device still listed) a couple times on our storage boxes at work. Re-doing the scrub fixes things.
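To check after the scrub whether any device still shows errors, the READ/WRITE/CKSUM columns of the status output can be filtered. This is only a sketch: the function name zpool_error_devices is made up for illustration, and it assumes the column layout shown in the status output earlier in the thread.

```shell
#!/bin/sh
# Print every row of the "config:" table whose READ, WRITE or CKSUM
# counter is not 0. Reads "zpool status" output on stdin.
zpool_error_devices() {
    awk '
        # The header row marks the start of the device table.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # A blank line ends the table.
        in_config && NF == 0 { in_config = 0 }
        # Columns: NAME STATE READ WRITE CKSUM [notes...]
        in_config && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, $3, $4, $5
        }
    '
}
```

Usage would be something like "zpool status tank | zpool_error_devices". No output means every counter in the table is zero. Pool and vdev rows (tank, raidz1) are listed as well if they carry errors.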
