ZFS disk replacement: seeking advice

Hi all,

I'm seeking advice on how to safely replace a disk that's currently part of a ZFS array. In particular, I don't recall exactly how I installed things in the first place (it was a while ago), so help reconstructing the necessary gpart and zfs steps would be greatly appreciated.

As I understand it, the fix will involve:
  • Backups (done).
  • Removing the bad disk from the ZFS array.
  • Identifying the physical disk to replace.
  • Replacing the bad old disk with a new disk.
  • Partitioning the new disk to match the partition scheme of the old disk.
  • Adding the new, partitioned disk to the ZFS pool.
  • Rebuilding the array.
Assistance with specifying those steps would be terrific.
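
In rough outline, the plan above might be sketched as follows. This is only a sketch; the device names and the gpt/system0new label are placeholders, and the partitioning step is covered in detail below.

```shell
# Rough sketch of the plan above; device names and labels are
# placeholders -- adjust them to match your pool.
zpool offline system gpt/system0   # take the failing member offline
# ...power down, swap the physical disk, boot...
# (partition the new disk to match the old layout, then:)
zpool replace system gpt/system0 gpt/system0new
zpool status system                # watch the resilver progress
```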

I'm running 10.3:
Code:
$ uname -a
FreeBSD thinkbsd.######.home 10.3-RELEASE FreeBSD 10.3-RELEASE #0 r297264: Fri Mar 25 02:10:02 UTC 2016     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64

Regarding the disk, ada0 is the target for replacement because SMART info looks less than good:
Code:
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68EUZN0
Serial Number:    WD-WCC4M7DJNX7X
LU WWN Device Id: 5 0014ee 20c26406b
Firmware Version: 82.00A82
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Feb  8 20:59:30 2018 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       484
  3 Spin_Up_Time            0x0027   178   172   021    Pre-fail  Always       -       4066
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       32
  5 Reallocated_Sector_Ct   0x0033   185   185   140    Pre-fail  Always       -       451
  7 Seek_Error_Rate         0x002e   200   195   000    Old_age   Always       -       322
  9 Power_On_Hours          0x0032   076   076   000    Old_age   Always       -       17532
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       15
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       707
194 Temperature_Celsius     0x0022   120   111   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   156   156   000    Old_age   Always       -       44
197 Current_Pending_Sector  0x0032   001   001   000    Old_age   Always       -       65471
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
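
For completeness, the dump above is smartctl output; assuming sysutils/smartmontools is installed, the health summary and the attributes that matter most here (reallocated and pending sectors) can be re-checked with:

```shell
# Quick re-check of the suspect disk (run as root).
smartctl -H /dev/ada0   # overall SMART health assessment
smartctl -A /dev/ada0 | egrep 'Realloc|Pending|Uncorr'
```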

My ZFS setup is as follows:
Code:
$ zpool status
  pool: system
 state: ONLINE
  scan: scrub canceled on Mon Feb  5 01:00:55 2018
config:

        NAME             STATE     READ WRITE CKSUM
        system           ONLINE       0     0     0
          mirror-0       ONLINE       0     0     0
            gpt/system0  ONLINE       0     0     0
            gpt/system1  ONLINE       0     0     0
          mirror-1       ONLINE       0     0     0
            gpt/system2  ONLINE       0     0     0
            gpt/system3  ONLINE       0     0     0

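One way to handle the "identifying the physical disk" step: map the GPT label back to a device node, then match the serial number reported by the drive to the sticker on the physical disk. A sketch:

```shell
# Map GPT labels to device nodes, then match the serial number
# reported by smartctl to the sticker on the physical drive.
glabel status | grep gpt/system0      # e.g. gpt/system0 -> ada0p3
smartctl -i /dev/ada0 | grep Serial   # serial to look for on the disk
```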
Each of the four disks (ada[0-3]) is partitioned as follows:
Code:
$ sudo gpart show ada0
Password:
=>        34  3907029101  ada0  GPT  (1.8T)
          34           6        - free -  (3.0K)
          40        1024     1  freebsd-boot  (512K)
        1064     4194304     2  freebsd-swap  (2.0G)
     4195368  3902833760     3  freebsd-zfs  (1.8T)
  3907029128           7        - free -  (3.5K)

More specifically:
Code:
$ sudo diskinfo -v /dev/gpt/boot0
/dev/gpt/boot0
        512             # sectorsize
        524288          # mediasize in bytes (512K)
        1024            # mediasize in sectors
        4096            # stripesize
        0               # stripeoffset
        1               # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        WD-WCC4M7DJNX7Xs0       # Disk ident.

$ sudo diskinfo -v /dev/gpt/swap0
/dev/gpt/swap0
        512             # sectorsize
        2147483648      # mediasize in bytes (2.0G)
        4194304         # mediasize in sectors
        4096            # stripesize
        0               # stripeoffset
        4161            # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        WD-WCC4M7DJNX7Xs0       # Disk ident.

$ sudo diskinfo -v /dev/gpt/system0
/dev/gpt/system0
        512             # sectorsize
        1998250885120   # mediasize in bytes (1.8T)
        3902833760      # mediasize in sectors
        4096            # stripesize
        0               # stripeoffset
        3871858         # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        WD-WCC4M7DJNX7Xs0       # Disk ident.

Please let me know if I can provide any additional information, and thanks in advance!
 
You use 4k sectors, that's good. First check that sysctl vfs.zfs.min_auto_ashift is 12.
If it's 9, set it to 12 and add vfs.zfs.min_auto_ashift=12 to /etc/sysctl.conf so you can forget about
it when you replace disks in the future.
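
Concretely, that check-and-persist looks like:

```shell
# Check the current minimum ashift (12 means 4k-aligned vdevs).
sysctl vfs.zfs.min_auto_ashift
# If it reports 9, raise it for the running system...
sysctl vfs.zfs.min_auto_ashift=12
# ...and persist it across reboots.
echo 'vfs.zfs.min_auto_ashift=12' >> /etc/sysctl.conf
```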

And for the new disk, you create a matching layout like this:
Code:
# gpart create -s gpt adaN
# gpart add -a 4k -t freebsd-boot -l boot -s 512k adaN
# gpart add -a 4k -t freebsd-swap -l swap -s 2g adaN
# gpart add -a 4k -t freebsd-zfs -l system0_march18 adaN
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i1 adaN
(I label new disks with the date of replacement.)
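
Before handing the new partition to ZFS, it's worth confirming the layout matches a healthy sibling, for example:

```shell
# Compare the new disk's layout against a known-good one.
gpart show -l adaN    # -l prints GPT labels instead of partition types
gpart show ada1       # a healthy sibling, for comparison
diskinfo -v /dev/gpt/system0_march18   # the label from the example above
```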
 
Why take the old disk out before you put the new disk in? I would add the new disk first, let ZFS copy onto it, and then remove the old disk from the mirror. That way, you have one more disk during the copy. If there are lots of errors on the other mirror disk (unlikely but possible), that might save you.
 
Why take the old disk out before you put the new disk in?
True. It's a better strategy to keep the old disk connected while it still works so the pool does not have to go into DEGRADED state at all.

Code:
zpool replace system gpt/systemOLD gpt/systemNEW

From zpool(8): Replaces old_device with new_device. This is equivalent to attaching new_device, waiting for it to resilver, and then detaching old_device.
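
Spelled out as the explicit attach/resilver/detach sequence (the labels here are placeholders):

```shell
# Attach the new partition alongside the old one in the mirror...
zpool attach system gpt/systemOLD gpt/systemNEW
# ...wait until zpool status reports the resilver has finished...
zpool status system
# ...then drop the old device from the mirror.
zpool detach system gpt/systemOLD
```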
 
Hi all,

Success — thanks to all!
Code:
$ zpool status system
  pool: system
 state: ONLINE
  scan: scrub repaired 0 in 1h21m with 0 errors on Sun Mar 11 12:02:32 2018
config:

        NAME                     STATE     READ WRITE CKSUM
        system                   ONLINE       0     0     0
          mirror-0               ONLINE       0     0     0
            gpt/system1          ONLINE       0     0     0
            gpt/system0_2018_03  ONLINE       0     0     0
          mirror-1               ONLINE       0     0     0
            gpt/system2          ONLINE       0     0     0
            gpt/system3          ONLINE       0     0     0

errors: No known data errors

Also, Ralph and Swegen, thanks for the tip about attach-resilver-remove — I didn't take advantage of that approach this time, but I will as I continue rotating new disks into the system.
 