ZFS Update: HDD went offline while resilvering is in progress || Is this a valid & up2date guide for replacing existing disks with larger disks?

Hi @all,

I want to replace my old small disks with larger ones in a VDEV (RAID-Z1). Is this guide valid and the easiest way of doing this? https://madaboutbrighton.net/articles/2016/increase-zfs-pool-by-adding-larger-disks

Are there any features implemented since 2016 that would make the process even easier?

I also have a spare SATA port available, so I would do this by replacing and resilvering every single device in the vdev until it fully consists of the larger disks, and then expanding the pool afterwards, by re-importing it if autoexpand does not work.
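Roughly, the per-disk loop I have in mind would look like this (a sketch only; `slot1`/`slot1b` and `ada9` are placeholder names, and my actual devices sit under GELI, which I've left out here for brevity):

```shell
# Enable autoexpand up front so the pool can grow once the last
# disk has been swapped (otherwise 'zpool online -e' per device later).
zpool set autoexpand=on tank0

# For each disk in the vdev: attach the new, larger disk on the spare
# SATA port, label it, then replace the old member with the new one.
glabel label slot1b /dev/ada9            # ada9 = new disk, adjust as needed
zpool replace tank0 label/slot1 label/slot1b

# Wait for the resilver to finish before touching the next disk.
zpool status -v tank0
```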

/e: typos...
 
I don't believe so. This is just going to be a painstaking process and you want to be very careful to make sure you know which drives are which. If you didn't do it previously, it's a good idea to label them as they go in.

Now, if you had enough ports to mirror the operation, it would be extremely easy: you'd set up the spares as their own VDEV and attach it to the existing one as the second half of a mirror, which would be much simpler. But that would require a lot of ports that you don't have.
 
Or simply: build a new machine with a larger ZFS pool, then move all data over from the old one. An easier process; the downside is the added cost and hardware.
 
My pool consists of two vdevs, so that's not an option for me.
Unfortunately, there are often restrictions that make things less convenient. As long as you are careful about determining which disk is which and only replace one at a time, you should be fine. Just make sure that you've got backups, and it's a good idea to label both the device and the physical drive if you haven't already done that.

Apart from this taking a number of days, it shouldn't cause any other problems.
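To match a ZFS label to the physical drive on FreeBSD, something along these lines should do (device names here are just examples; compare the reported serial number against the sticker on the drive before pulling anything):

```shell
# Show which adaX device backs each GEOM label used in the pool.
glabel status

# List attached ATA devices with their model strings.
camcontrol devlist

# Read the serial number of one specific disk, e.g. ada7.
camcontrol identify ada7 | grep -i serial
```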
 
Okay, looks like Murphy's law hit me on this one.

I attached the first disk to my NAS and started replacing device slot1 with slot1b. 4 hours later:


Code:
Aug 20 21:18:26 NAS kernel: ada7 at ahcich6 bus 0 scbus6 target 0 lun 0
Aug 20 21:18:26 NAS kernel: ada7: <ST4000VN008-2DR166 SC60> s/n ZDH1RAYB detached
Aug 20 21:18:26 NAS kernel: GEOM_ELI: g_eli_read_done() failed (error=6) label/slot2.eli[READ(offset=634798956544, length=45056)]
Aug 20 21:18:26 NAS kernel: GEOM_ELI: Device label/slot2.eli destroyed.
Aug 20 21:18:26 NAS kernel: GEOM_ELI: Detached label/slot2.eli on last close.
Aug 20 21:18:26 NAS kernel: (ada7:ahcich6:0:0:0): Periph destroyed
Aug 20 21:18:26 NAS ZFS: vdev state changed, pool_guid=13798682662516583972 vdev_guid=10024804857788748450
Aug 20 21:18:26 NAS ZFS: vdev is removed, pool_guid=13798682662516583972 vdev_guid=10024804857788748450
Aug 20 21:18:33 NAS kernel: ada7 at ahcich6 bus 0 scbus6 target 0 lun 0
Aug 20 21:18:33 NAS kernel: ada7: <ST4000VN008-2DR166 SC60> ACS-3 ATA SATA 3.x device
Aug 20 21:18:33 NAS kernel: ada7: Serial Number XXXXXX
Aug 20 21:18:33 NAS kernel: ada7: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Aug 20 21:18:33 NAS kernel: ada7: Command Queueing enabled
Aug 20 21:18:33 NAS kernel: ada7: 3815447MB (7814037168 512 byte sectors)

This is the current output of 'zpool status':
Code:
  pool: tank0
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Aug 20 17:12:51 2020
        6.75T scanned at 408M/s, 5.50T issued at 332M/s, 44.1T total
        607G resilvered, 12.46% done, 1 days 09:52:31 to go
config:

        NAME                      STATE     READ WRITE CKSUM
        tank0                     DEGRADED     0     0     0
          raidz1-0                DEGRADED     0     0     0
            replacing-0           ONLINE       0     0     0
              label/slot1.eli     ONLINE       0     0     0
              label/slot1b.eli    ONLINE       0     0     0
            10024804857788748450  REMOVED      0     0     0  was /dev/label/slot2.eli
            label/slot3.eli       ONLINE       0     0     0
            label/slot4.eli       ONLINE       0     0     0
          raidz1-1                ONLINE       0     0     0
            label/slot5.eli       ONLINE       0     0     0
            label/slot6.eli       ONLINE       0     0     0
            label/slot7.eli       ONLINE       0     0     0
            label/slot8.eli       ONLINE       0     0     0

errors: No known data errors


So this looks like the unpleasant situation in which slot1 is being replaced with slot1b while another HDD (slot2) has dropped out of the pool? What should I do now: wait for the resilver to complete, or reboot and try to attach slot2 back to the pool?


/e: I did not perform any write operations on the pool while the replacement took place. Can I reattach slot2 while the pool is online? Or will it be resilvered anyway, since the disruption took place? My next step would have been to replace slot2 with a bigger disk anyway.
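For what it's worth, if the disk itself is healthy again, reattaching it while the pool stays online would roughly be (a sketch, assuming slot2 is GELI-encrypted as before and the kernel re-detected it as shown in the log):

```shell
# Re-attach the GELI layer on the re-detected disk; this prompts for
# the passphrase / uses the configured keyfile.
geli attach /dev/label/slot2

# Tell ZFS to bring the vdev member back online. ZFS should resilver
# only the transactions the disk missed while it was detached.
zpool online tank0 label/slot2.eli
zpool status tank0
```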
 
Well, it's not directly helpful for my question, but yes, I made backups. Due to the amount of data, though, there is some "non-critical" data that would be lost. So it is desirable for me to avoid breaking my ZFS pool here.

I can still access everything within tank0, so I believe ZFS treats slot1 + slot1b as one fully functional device?
 
I would be very cautious about how to proceed at this point, since the 'zpool status' output is confusing. As I understand it, you have a raidz1 VDEV in which one disk was already being resilvered when another one dropped, which to me suggests a double fault = pool fault. Whether ZFS just hasn't "figured" that out yet, or will be fine with it as soon as the resilver is complete, I can't say. In my experience, rebooting the machine now would be a serious mistake, because ZFS "forgets" what was going on before, making things even more confusing. Please let the resilver complete before doing anything else!
 
Thanks! I'm hoping for a seamless switch between slot1 and slot1b without any interruption (otherwise I'm screwed). Given the estimate, it looks like I'll be sitting in the hot seat for ~2 more days. Inconvenient.
 
Everything went faster than expected. I must note for the future: ZFS is plainly awesome! It replaced the disk without degrading the pool further. So it's PRETTY resilient, with its ability to replace disks while withstanding an additional disk failure.
 