ZFS cannot replace failed drive

rbauer_snow

New Member

Reaction score: 1
Messages: 3

Hey all,

I have a zfs raidz1 running under 8.0-release. One of the drives started giving ECC errors, so I proceeded as follows:

zpool offline storage da5
<shutdown the server and replaced the failed drive>
zpool online storage da5

I next started a scrub, but from watching the drive lights, it didn't seem to be touching the replaced drive at all; it was only verifying what was on the other drives in the raidz1 volume. So I canceled the scrub. At this point, here is what it looks like:

Code:
anchor# zpool status
  pool: storage
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub stopped after 0h3m with 0 errors on Mon May 10 15:00:55 2010
config:

        NAME        STATE     READ WRITE CKSUM
        storage     DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     FAULTED      0 3.74K     0  corrupted data
            da5     ONLINE       0     0     0

errors: No known data errors
Replacing the drive doesn't work:

Code:
anchor# zpool replace storage da5
invalid vdev specification
use '-f' to override the following errors:
/dev/da5 is part of active pool 'storage'

anchor# zpool replace -f storage da5
invalid vdev specification
the following errors must be manually repaired:
/dev/da5 is part of active pool 'storage'
I have tried taking the drive back offline:

Code:
anchor# zpool offline storage da5
cannot offline da5: no valid replicas
What else can I try?
 

phoenix

Administrator
Staff member
Moderator

Reaction score: 1,262
Messages: 4,099

rbauer_snow said:
I have a zfs raidz1 running under 8.0-release. One of the drives started giving ECC errors, so I proceeded as follows:

zpool offline storage da5
<shutdown the server and replaced the failed drive>
zpool online storage da5
There's your error. The correct process for replacing a drive in ZFS is:
  1. zpool offline storage da5
  2. shutdown and replace drive
  3. zpool replace storage da5

The offline tells ZFS to ignore the da5 device. You don't online a new drive; that confuses ZFS, because it sees the expected device name but the ZFS label (GUID) on the drive is missing or wrong, so it thinks the drive is bad. Instead, you have to tell it to replace the drive.
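
The steps above can be sketched as a pair of small shell helpers, one for each side of the physical swap. This is a minimal sketch, not a tested procedure: the names `run`, `before_swap`, `after_swap`, and the `DRY_RUN` variable are hypothetical, and it assumes the new disk appears under the same device name as the old one. With `DRY_RUN=1` (the default here) the helpers only print the commands they would run.

```shell
#!/bin/sh
# Hypothetical helpers for a same-slot disk replacement.
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1, before shutting down: tell ZFS to stop using the failing disk.
before_swap() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: before_swap POOL DEVICE" >&2
        return 2
    fi
    run zpool offline "$1" "$2"
}

# Step 3, after the new disk is installed and visible to the OS:
# resilver onto it under the same device name.
after_swap() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: after_swap POOL DEVICE" >&2
        return 2
    fi
    run zpool replace "$1" "$2"
}
```

For the pool in this thread that would be `before_swap storage da5`, then the shutdown and physical swap (step 2, done by hand), then `after_swap storage da5`. Keeping the two halves separate avoids issuing the replace before the new disk is actually present.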

rbauer_snow said:
I next started a scrub, but from watching the drive lights, it didn't seem that was hitting the replaced drive at all, but just verifying what was on the other drives in the raidz1 volume.
Correct. It won't use the new drive, because you haven't told it to use the new drive (zpool replace). Instead, you've told it to online a non-existent drive.

Code:
anchor# zpool status
  pool: storage
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub stopped after 0h3m with 0 errors on Mon May 10 15:00:55 2010
config:

        NAME        STATE     READ WRITE CKSUM
        storage     DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     FAULTED      0 3.74K     0  corrupted data
            da5     ONLINE       0     0     0
I'm guessing the vdev is made up of 5 devices. You'll need to "zpool offline" da5, and then "zpool replace" it.

If that doesn't work, remove the drive, and reboot without the drive attached. Then you should be able to "zpool offline" da5, reboot, and "zpool replace" da5.
 
rbauer_snow

New Member

Reaction score: 1
Messages: 3

First, thanks for your suggestions.

I have some good news to report. Not sure if this will help anyone, but here it is anyway.

Since I was not able to offline the drive, I tried your suggestion of pulling the new drive, and rebooting without it, and then booting with the drive reinstalled. During this, it occurred to me that my raid controller requires the new drive to be initialized, even though I'm not using any raid levels on the controller. I also noticed that in my earlier zpool status output, da5 was listed twice, but not da6. Apparently my controller moved da6 down to da5 when I pulled the original (configured) drive. Since I had not configured the new drive, it was never made visible to the OS.
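
The doubled da5 entry in the earlier zpool status output was the visible hint that the device names had shifted. A minimal sketch for spotting that pattern is below; the function name `find_duplicate_devs` is hypothetical, and it assumes the config table layout shown in the status output earlier in this thread (a `NAME STATE ...` header line, ending at the `errors:` line).

```shell
#!/bin/sh
# Hypothetical helper: read `zpool status` output on stdin and print any
# device name that appears more than once in the NAME column.
# Grouping rows such as raidz1 or mirror are skipped, since those names
# can legitimately repeat when a pool has several vdevs.
find_duplicate_devs() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }  # table header
        in_cfg && /^errors:/          { in_cfg = 0 }        # end of table
        in_cfg && NF >= 2 && $1 !~ /^(raidz|mirror)/ { seen[$1]++ }
        END { for (name in seen) if (seen[name] > 1) print name }
    ' | sort
}
```

Usage would be something like `zpool status storage | find_duplicate_devs`. Any name it prints is worth checking against what the controller and the OS actually report before running offline or replace.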

Once I did this, and the new drive became visible to the OS, things got worse. The zpool was listed as FAULTED. At this point, I figured it would be best to just destroy and re-create the zpool. So I issued the destroy command and rebooted for good measure. I don't remember if the destroy command succeeded, or gave an error.

After rebooting, I tried to re-create the zpool. However, zpool stated that da1 might be part of another zpool, and advised to use the -f option if I really wanted to ignore this possibility. On a whim, I tried zpool import. To my utter surprise, it worked! It began resilvering the good drives. I issued a zpool replace for the new drive, and it started rebuilding that as well.

Kind of a strange little odyssey, but everything is back to normal, with no data errors and all drives in the zpool appear to be functioning.
 

phoenix

Administrator
Staff member
Moderator

Reaction score: 1,262
Messages: 4,099

That is definitely bizarre. Not sure what happened there. :) But, at least things are working now.
 

hirokik

New Member


Messages: 2

I don't know why, but sometimes the number at the end of a device name shifts, for example from ad5 to ad6.
I had the same problem and my pool became DEGRADED, but I recovered it with the commands below:

Code:
zpool export poolname
zpool import poolname
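
After an export and import like this, it is worth confirming that the pool really came back clean. A minimal sketch, assuming `zpool status -x` prints either "all pools are healthy" or "pool 'NAME' is healthy" when nothing is wrong (the exact wording may differ between ZFS versions, so treat the pattern as an assumption); the function name `pools_healthy` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read the output of `zpool status -x` on stdin and
# return success only if it is one of the "healthy" one-line messages.
# The message patterns are assumptions about the zpool version in use.
pools_healthy() {
    grep -Eq "^(all pools are healthy|pool '.*' is healthy)$"
}
```

For example: `zpool status -x poolname | pools_healthy && echo "pool looks fine"`. Anything else, such as a DEGRADED status block, makes the check fail, and the full `zpool status` output should be read by hand.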
 

carlton_draught

Well-Known Member

Reaction score: 32
Messages: 288

I just ran into a problem that seems to be new: it appears you can't test a drive replacement using the same drive, even with the -f option. You need to use a different drive. That's OK, since once the first replacement has finished resilvering you can swap the original back in.
 

Leander

Well-Known Member

Reaction score: 3
Messages: 264

If the new disk will have the same device name as the failed one ("Bad Diskname == New Diskname"), take the dead disk offline first:
Code:
zpool offline tank /dev/da5
Then, once the new disk is physically connected, simply bring it online again:
Code:
zpool online tank /dev/da5
Resilvering will start automatically; no replace is needed if the device name stays the same. If needed, don't forget to partition the new disk with gpart before adding it to the pool.

P.S. Don't play around with -f (force option) on a hot pool! You may end up in tears.

Greetings L
 