ZFS: replacing failed

goddard94 · Jan 8, 2018

Hi all,

So, I wanted to replace a failed disk with a good one.
The new disk has never been used. I partitioned it with gpart exactly like the other disks:

Code:

uname -ar
FreeBSD myserver.mynet.net 8.2-RELEASE FreeBSD 8.2-RELEASE

gpart create -s gpt ad10
gpart add -b 128 -s 4194304 -t freebsd-swap -l swap-disk6 ad10
gpart add -t freebsd-zfs -l disk6 ad10

1) power down the server
2) replace the disk
3) power up the server
4) zpool replace stockage gpt/disk6
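The partitioning and replace commands above could be collected into a small script, roughly as follows. This is a minimal sketch, not the poster's actual procedure: the `DISK`, `LABEL` and `POOL` variables and the `run` and `prepare_and_replace` functions are hypothetical, the defaults simply mirror the values from this post, the swap size comment assumes 512-byte sectors, and by default the commands are only printed. The physical steps (power down, swap the disk, power up) are not scripted.

```sh
#!/bin/sh
# Sketch: partition a new disk like its siblings, then start the replace.
# DISK, LABEL and POOL are hypothetical knobs; defaults mirror this thread.
DISK=${DISK:-ad10}
LABEL=${LABEL:-disk6}
POOL=${POOL:-stockage}

# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 0 ]; then
        "$@"
    else
        echo "$@"
    fi
}

prepare_and_replace() {
    # New GPT partition table on the blank disk.
    run gpart create -s gpt "$DISK"
    # 4194304 sectors x 512 bytes = 2 GiB of swap, starting at sector 128.
    run gpart add -b 128 -s 4194304 -t freebsd-swap -l "swap-$LABEL" "$DISK"
    # The rest of the disk becomes the ZFS partition, visible as gpt/$LABEL.
    run gpart add -t freebsd-zfs -l "$LABEL" "$DISK"
    # One-argument form: replace the device in place (same gpt label).
    run zpool replace "$POOL" "gpt/$LABEL"
}
```

Calling `prepare_and_replace` as is only prints the four commands for review; they are executed only if `DRY_RUN=0` is exported first.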

It created a device named "old", and I got this (I suppose disk1 has to be changed too...):

Code:

  NAME                 STATE     READ WRITE CKSUM
  stockage             DEGRADED     0     0    32
    raidz2             DEGRADED     0     0   126
      gpt/disk7        ONLINE       0     0     0
      gpt/disk1        ONLINE       0     0     0  127K resilvered
      replacing        DEGRADED     0     0     0
        gpt/disk6/old  UNAVAIL      0     0     0  cannot open
        gpt/disk6      ONLINE       0     0     0  1.22T resilvered
      gpt/disk0        ONLINE       0     0     0
      gpt/disk3        ONLINE       0     0     0
      gpt/disk2        ONLINE       0     0     0
      gpt/disk5        ONLINE       0     0     0
      gpt/disk4        ONLINE       0     0     0

In this post:
https://forums.freebsd.org/threads/18519/
the following method was recommended:

zpool export <poolname>
shutdown the ZFS box
physically remove the drive
zero the drive (see below)
physically attach the drive
boot the ZFS box to single-user mode
/etc/rc.d/hostid start
zpool import <poolname> (should come up DEGRADED with ad14 marked as missing)
zpool replace <poolname> ad14

But... hmm... it seems this has to be done in single-user mode. The resilver takes a day or even more, my users cannot wait that long, and I'm not sure I can do it over the weekend...
So I need some help, thanks.

regards

SirDice · Jan 8, 2018

goddard94 said:
FreeBSD myserver.mynet.net 8.2-RELEASE FreeBSD 8.2-RELEASE

FreeBSD 8.2 has been End-of-Life since July 2012(!) and is not supported any more.

Topics about unsupported FreeBSD versions
https://www.freebsd.org/security/unsupported.html

goddard94 · Jan 8, 2018

Yes, sure. So, is there a better place to ask such questions?

usdmatt · Jan 8, 2018

Try zpool detach stockage gpt/disk6/old

In all seriousness though, you're likely having problems due to using an incredibly old version of FreeBSD, with a version of ZFS that was fairly new and buggy at the time.

goddard94 · Jan 9, 2018

usdmatt said:
Try zpool detach stockage gpt/disk6/old

In all seriousness though, you're likely having problems due to using an incredibly old version of FreeBSD, with a version of ZFS that was fairly new and buggy at the time.

Yes, you're right, the release is too old for ZFS to work correctly. But nobody can rebuild every old-style IT setup from scratch, right? That's what makes our job amazing and interesting. So I'll try this command in a couple of hours on my BSD VM, and on the server later. I'll let you know today or tomorrow.

regards.

PS: again, yeah... all data will be migrated to a newer (latest) release as soon as I receive the new server, but I just cannot keep my data on a degraded ZFS pool.

goddard94 · Jan 9, 2018

usdmatt said:
Try zpool detach stockage gpt/disk6/old

This command cannot be applied; it only works on mirrors, not on raidz2.

usdmatt · Jan 9, 2018

Did you get that error when trying to run it? When replacing a disk, the "replacing" entry effectively works like a mirror vdev. Back in the early, buggy days of ZFS-on-FreeBSD this would often get stuck after a rebuild and you'd have to manually detach the old disk. I remember testing this and actually writing notes for doing replacements that included the detach command. This is mainly why I said that you're probably having these problems due to using an old version as I've not had a replace get stuck like this for years.

goddard94 · Jan 10, 2018

usdmatt said:
Did you get that error when trying to run it?

Yes, I tried it on my VM with a raidz2, and this URL explains it:

https://docs.oracle.com/cd/E19253-01/819-5461/gcfhe/index.html

attach/detach works with mirrors, not raidz, as shown in that example. On my test server, I got this:

Code:

[root@bsd ~]# zpool status -v
  pool: tank
 state: ONLINE
  scan: resilvered 294K in 0h0m with 0 errors on Wed Jan 10 09:11:56 2018
config:

   NAME             STATE     READ WRITE CKSUM
   tank             ONLINE       0     0     0
     raidz2-0       ONLINE       0     0     0
       label/disk1  ONLINE       0     0     0
       label/disk2  ONLINE       0     0     0
       label/disk3  ONLINE       0     0     0

errors: No known data errors
[root@bsd ~]# zpool detach tank label/disk3
cannot detach label/disk3: only applicable to mirror and replacing vdevs

usdmatt · Jan 10, 2018

only applicable to mirror and replacing vdevs

See the bit where it says "replacing vdevs"?
Yes, you can't use detach on a RAIDZ2 vdev; however, when replacing a disk you effectively get a sub-vdev to handle the replacement. This sub-vdev works like a mirror.

Your example above shows you trying to detach a disk from a raidz, not detach a disk from a replacing vdev inside a raidz.

Code:

  NAME                 STATE     READ WRITE CKSUM
  stockage             DEGRADED     0     0    32
    raidz2             DEGRADED     0     0   126           <-- this is a raidz2 vdev
      gpt/disk7        ONLINE       0     0     0
      gpt/disk1        ONLINE       0     0     0  127K resilvered
      replacing        DEGRADED     0     0     0           <-- this functions like a mirror
        gpt/disk6/old  UNAVAIL      0     0     0  cannot open
        gpt/disk6      ONLINE       0     0     0  1.22T resilvered
      gpt/disk0        ONLINE       0     0     0
      gpt/disk3        ONLINE       0     0     0
      gpt/disk2        ONLINE       0     0     0
      gpt/disk5        ONLINE       0     0     0
      gpt/disk4        ONLINE       0     0     0

It's possible that disk6 has finished rebuilding (unless the status output you haven't posted says it's still going). As I said above, I know from personal experience that when ZFS was new in FreeBSD, a replacement would get stuck and the old drive wouldn't disappear. You'd just end up with the "replacing" vdev stuck there permanently. The way to fix this was to manually detach the old disk from the replacing vdev - which yes, does act like a mirror, even though it's under a raidz2 vdev.

This may not work for you, especially if it thinks that disk6 hasn't finished rebuilding or there's been some other error, but I've been in the situation where a replacement gets stuck and had to use detach - inside a raidz vdev.
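Finding the device to detach can be done by hand from the status output, but a small filter makes it harder to detach the wrong disk. A minimal sketch, assuming nesting in `zpool status` is expressed by leading whitespace as in the listings above: the `find_stuck_replacing` helper is hypothetical, and it only prints the `zpool detach` command for review; it never runs it.

```sh
#!/bin/sh
# Hypothetical helper: read `zpool status` text on stdin and print a
# "zpool detach" command for every UNAVAIL child of a replacing vdev.
# Assumes indentation (spaces/tabs) reflects the vdev tree nesting.
find_stuck_replacing() {
    pool="$1"
    awk -v pool="$pool" '
        # Nesting depth = width of the leading whitespace on the line.
        { match($0, /^[ \t]*/); depth = RLENGTH }
        # Entering a "replacing" (or "replacing-N") vdev: remember its depth.
        $1 ~ /^replacing/ { in_rep = 1; rep_depth = depth; next }
        # A line at the same or a shallower depth ends the replacing vdev.
        in_rep && depth <= rep_depth { in_rep = 0 }
        # An UNAVAIL child inside the replacing vdev is the stuck old disk.
        in_rep && $2 == "UNAVAIL" { print "zpool detach " pool " " $1 }
    '
}
```

For example, `zpool status stockage | find_stuck_replacing stockage` should print `zpool detach stockage gpt/disk6/old` for the pool shown above, and nothing once the replacing vdev is gone.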

goddard94 · Jan 10, 2018

usdmatt said:
See the bit where it says "replacing vdevs"?
Yes you can't use detach on a RAIDZ2 vdev, however when replacing a disk you effectively get a sub-vdev to handle the replacement. This sub-vdev works like a mirror.

OK, got it. That's much clearer.
So, is there any risk in detaching the "old" disk live? I mean, I may survive until the new server arrives, if we are only talking about a mirrored disk that does not really exist.

usdmatt · Jan 10, 2018

I can't give any guarantees at all, especially on an old ZFS version, but as the disk is marked UNAVAIL anyway, attempting to detach it should not cause a problem.

goddard94 · Jan 11, 2018

usdmatt said:
The way to fix this was to manually detach the old disk from the replacing vdev - which yes, does act like a mirror, even though it's under a raidz2 vdev.

"zpool detach stockage gpt/disk6/old" did it.

Code:

[root@filerblr ~]# zpool status -v
  pool: stockage
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 23h39m with 3 errors on Fri Dec 29 15:02:02 2017
config:

  NAME               STATE     READ WRITE CKSUM
  stockage           ONLINE       0     0    39
    raidz2           ONLINE       0     0   154
      gpt/disk7      ONLINE       0     0     0
      gpt/disk1      ONLINE       0     0     0  127K resilvered
      gpt/disk6      ONLINE       0     0     0  1.22T resilvered
      gpt/disk0      ONLINE       0     0     0
      gpt/disk3      ONLINE       0     0     0
      gpt/disk2      ONLINE       0     0     0
      gpt/disk5      ONLINE       0     0     0
      gpt/disk4      ONLINE       0     0     0

So the pool is OK; the "old" disk is gone.
Thanks.

usdmatt · Jan 11, 2018

It may be worth doing the following now to clear the checksum errors, then find any corrupt data:

Code:

zpool clear stockage
zpool scrub stockage

If it reports corrupt files ideally those should be removed (which may involve deleting snapshots if the corrupted files are part of them), then clear & scrub again until the pool is completely healthy.
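The "clear & scrub again until the pool is completely healthy" loop needs a stopping test. A minimal sketch, assuming the `zpool status -v` layout shown in this thread: the hypothetical `pool_is_clean` helper reads the status text on stdin and succeeds only if every device line is ONLINE with zero READ/WRITE/CKSUM counters and the pool reports "No known data errors". Any other counter value, including abbreviated ones such as `1.2K`, is treated as dirty.

```sh
#!/bin/sh
# Hypothetical helper: exit 0 only if the `zpool status -v` text on stdin
# shows every device ONLINE with zero error counters and no data errors.
pool_is_clean() {
    awk '
        # Device/vdev lines look like: NAME STATE READ WRITE CKSUM [notes...]
        NF >= 5 && $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ {
            if ($2 != "ONLINE" || $3 != "0" || $4 != "0" || $5 != "0")
                dirty = 1
        }
        # Summary line printed when no files are known to be corrupt.
        /errors: No known data errors/ { no_errors = 1 }
        # Exit status 0 (success) only for a completely healthy pool.
        END { exit !(no_errors && !dirty) }
    '
}
```

For instance, once a scrub has finished, `zpool status -v stockage | pool_is_clean` fails for the output posted above (CKSUM 39 and 154), and should only succeed after the corrupt files are dealt with and the counters have been cleared.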

Edit: Also, move it to a current FreeBSD version. ZFS works much better these days...

goddard94 · Jan 12, 2018

usdmatt said:
It may be worth doing the following now to clear the checksum errors, then find any corrupt data:

A scrub takes a long time (24h), and I'm afraid to run one since other disks may fail during the operation. Some old files are corrupted, but that's not very important. By the way, I ran a scrub on the pool a couple of weeks ago, and disk6 failed after that... And yes, all data will be moved to a new server later.
Thanks for the help.
