ZFS drive failed while replacing another drive in raidz1; pool I/O is suspended and zpool clear hangs

Status
Not open for further replies.
I have a 4-drive raidz1 setup and wanted to replace my 3TB drives with 6TB drives; let's call them 3{abcd} and 6{abcd}. I shut down the machine, physically removed the first drive, 3a, replaced it with 6a, booted the machine, and ran zpool replace data-pool 3a 6a. Days later the resilver was done and all was good.

Problems started when I was replacing 3b with 6b. During that resilver I lost power, and when the machine came back up, 3d started throwing errors all over the place. I panicked a bit, read a bunch (maybe not the right things), and restarted the machine. After the restart the resilver did manage to finish; at one point zpool status showed both the replacement drive 6b and the failing drive 3d as resilvering. When it finished there were errors. A lot of errors. And the replacing status never cleared.

I had hoped I could cancel the replace and put 3b back into the raidz, then replace the failing 3d, but here I am, stuck. I've read that you can cancel a replace by running zpool detach on the new drive, but I get:

Code:
# zpool detach data-pool ata-WDC_WD60EFRX-68MYMN1_WD-WXL1H643R6JD
cannot detach ata-WDC_WD60EFRX-68MYMN1_WD-WXL1H643R6JD: pool I/O is currently suspended

Currently my zpool status gives the following:

Code:
# zpool status
  pool: data-pool
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: resilvered 5.78G in 92h24m with 13919120 errors on Sat Jan 10 01:01:44 2015
config:

	NAME                                            STATE     READ WRITE CKSUM
	data-pool                                       DEGRADED     0     0     1
	  raidz1-0                                      DEGRADED     0     0     6
	    ata-WDC_WD60EFRX-68MYMN1_WD-WXL1H644C1W2    ONLINE       0     0     0
	    replacing-1                                 DEGRADED     0     0     0
	      ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0770602  ONLINE       0     0     0
	      ata-WDC_WD60EFRX-68MYMN1_WD-WXL1H643R6JD  UNAVAIL      0     0     0
	    ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0891608    ONLINE       0     0     0
	    ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0891942    ONLINE       0     0     1

errors: 13919121 data errors, use '-v' for a list

zpool clear appears to hang: I've left it running for over 24 hours and don't see any activity on my drive lights, which were going like crazy during the replace. top didn't show any activity for ZFS-related processes that I could tell. I'm sure there are better ways to check whether it's actually doing something, but I don't know them.
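In case it helps anyone answer, here are the read-only checks I can think of running (not certain these are the right ones, or that they all exist on my ZoL 0.6.3; the pool name data-pool is from my setup, and the script bails out harmlessly if the zfs tools aren't installed):

```shell
# Read-only checks -- nothing here writes to the pool.
# Guarded so the script is harmless on a machine without ZFS installed.
if command -v zpool >/dev/null 2>&1; then
    zpool iostat data-pool 5 3 || true       # 3 samples of pool I/O; all zeros = no progress
    zpool events 2>/dev/null | tail -n 20    # recent ZFS events (errors, state changes)
    tail -n 5 /proc/spl/kstat/zfs/data-pool/txgs 2>/dev/null  # txg activity (ZoL-specific path)
    zfs_check="ran"
else
    zfs_check="zfs-not-installed"
fi
echo "$zfs_check"
```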

smartctl shows all the drives as PASSED.

What should I do now?

Is it possible for me to cancel the replace and bring 3b back online into the raidz?
Should I be waiting longer for the zpool clear command to run? (Resilvering 6a took around 8 days, I think.)
Should I offline or somehow remove 3d? I've tried a few reboots, and sometimes having 3d connected hangs the boot; other times it doesn't.
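If offlining turns out to be the right move, this is the sketch I'd try. The device name is only my guess at which disk is 3d (the one showing the checksum error in zpool status), so please correct me if that's wrong:

```shell
# Hedged sketch: take the suspected failing 3TB disk offline so ZFS stops
# trying to use it. Verify the device name against zpool status first!
POOL=data-pool
DISK=ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0891942   # assumption: this is 3d
if command -v zpool >/dev/null 2>&1; then
    if zpool offline "$POOL" "$DISK"; then
        offline_result="offlined"
    else
        offline_result="offline-failed"   # likely while pool I/O is suspended
    fi
else
    offline_result="zfs-not-installed"
fi
echo "$offline_result"
```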

I'm running CentOS 6.6 (Final).
Code:
# uname -r
2.6.32-504.3.3.el6.x86_64

I'm running zfs version 0.6.3-1.2.el6, but I'm not sure what version my pool itself is actually at. Not sure how to check, or whether it matters.
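For what it's worth, I think the pool's on-disk version can be read without touching the pool, something like this (not certain these are the right invocations, and again guarded in case the tools aren't there):

```shell
# Read-only: report the pool's on-disk format version.
if command -v zpool >/dev/null 2>&1; then
    zpool get version data-pool 2>/dev/null || true  # the pool's "version" property
    zpool upgrade   # with no arguments, just lists pools that could be upgraded
    version_check="ran"
else
    version_check="zfs-not-installed"
fi
echo "$version_check"
```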

Thank you for taking the time to read all that. Any help would be appreciated. If this is not the right place for these kinds of questions, can you please point me to the right place?
 
I'm running CentOS 6.6 (Final).
Code:
# uname -r
2.6.32-504.3.3.el6.x86_64

This is a FreeBSD forum, my friend.

If this is not the right place for these kinds of questions, can you please point me to the right place?
Sorry, can't really help with that. I don't know of any Linux forums that deal with ZFS.

Thread closed.
 