A disk in a pool that I can't remove, replace, or online.

I have a backup machine that, well, takes backups. One of the SATA PCI cards acted up and a few disks dropped out. I fixed that issue, and now I have a pool with a disk that I can't remove, replace, or online.

This being a backup, I could blow away the pool and start again, but I am keen to learn how to remedy such a situation.

Any tips greatly appreciated.

Code:
backup01# zpool status                                          
  pool: backup
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        backup                    DEGRADED     0     0     0
          raidz1-0                DEGRADED     0     0     0
            ada1                  ONLINE       0     0     0
            ada4                  ONLINE       0     0     0
            ada5                  ONLINE       0     0     0
            ada3                  ONLINE       0     0     0
            11843460613588413425  OFFLINE      0     0     0  was /dev/ada2s1

errors: No known data errors


Code:
backup01# zpool replace -f backup 11843460613588413425 /dev/ada2
invalid vdev specification
the following errors must be manually repaired:
/dev/ada2 is part of active pool 'backup'

backup01# zpool replace -f backup  /dev/ada2s1  /dev/ada2       
invalid vdev specification
the following errors must be manually repaired:
/dev/ada2 is part of active pool 'backup'

backup01# zpool remove   backup 11843460613588413425          
cannot remove 11843460613588413425: only inactive hot spares, cache, top-level, or log devices can be removed

backup01# zpool remove   backup /dev/ada2s1         
cannot remove /dev/ada2s1: only inactive hot spares, cache, top-level, or log devices can be removed

backup01# zpool remove   backup /dev/ada2  
cannot remove /dev/ada2: no such device in pool


Code:
backup01# ls -l  /dev/ada?
crw-r-----  1 root  operator    0,  77 Jun  7 03:00 /dev/ada0
crw-r-----  1 root  operator    0,  79 Jun  7 03:00 /dev/ada1
crw-r-----  1 root  operator    0,  91 Jun  7 03:00 /dev/ada2
crw-r-----  1 root  operator    0,  95 Jun  7 03:00 /dev/ada3
crw-r-----  1 root  operator    0,  98 Jun  7 03:00 /dev/ada4
crw-r-----  1 root  operator    0, 100 Jun  7 03:00 /dev/ada5



Code:
backup01# zdb
backup:
    version: 28
    name: 'backup'
    state: 0
    txg: 657704
    pool_guid: 9132880163016784154
    hostid: 2564703826
    hostname: 'backup01.rizal'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 9132880163016784154
        children[0]:
            type: 'raidz'
            id: 0
            guid: 8030133583652634975
            nparity: 1
            metaslab_array: 30
            metaslab_shift: 32
            ashift: 9
            asize: 1600340623360
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 14524876765037359285
                path: '/dev/ada1'
                phys_path: '/dev/ada1'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 11732660662153425513
                path: '/dev/ada4'
                phys_path: '/dev/ada4'
                whole_disk: 1
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 141930984528754587
                path: '/dev/ada5'
                phys_path: '/dev/ada5'
                whole_disk: 1
                create_txg: 4
            children[3]:
                type: 'disk'
                id: 3
                guid: 5943112966149375914
                path: '/dev/ada3'
                phys_path: '/dev/ada3'
                whole_disk: 1
                create_txg: 4
            children[4]:
                type: 'disk'
                id: 4
                guid: 11843460613588413425
                path: '/dev/ada2s1'
                phys_path: '/dev/ada2s1'
                whole_disk: 1
                not_present: 1
                DTL: 34
                create_txg: 4
                offline: 1
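
For what it's worth, I gather the "is part of active pool" refusal means ZFS still finds the old pool's labels on ada2 itself; dumping the labels directly should show whether that is the case (just my guess at the cause):

Code:
# dump any ZFS vdev labels still present on the raw disk
zdb -l /dev/ada2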
 
The following could work.

zpool offline <pool name> <disk>
zpool replace <pool name> <failed_device> <spare_device>
zpool detach <pool name> <disk>

Using labels for disks could make this a lot easier in the long term.
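
Filled in for this pool, it would look something like the sketch below (ada6 is just a guess at where a freshly attached spare disk would show up):

Code:
# the dead member is already OFFLINE, so the offline step can be skipped;
# otherwise it would be:
zpool offline backup 11843460613588413425

# start the replace: old member by its GUID, new disk by device name
# (ada6 is only an example device name)
zpool replace backup 11843460613588413425 /dev/ada6

# detach is only needed to back out of a replace that is still resilvering,
# or to drop one side of a mirror/replacing vdev by hand
zpool detach backup 11843460613588413425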
 
pillai_hfx said:
The following could work.

zpool offline <pool name> <disk>
zpool replace <pool name> <failed_device> <spare_device>
zpool detach <pool name> <disk>

Using labels for disks could make this a lot easier in the long term.

Thanks. Yes, I like the idea of labels. On my Linux machines I use the /dev/disk/by-id path, which I assume is the equivalent of using labels here on FreeBSD.

Is the fact that I didn't use labels the reason I am in this situation?

I haven't got a spare disk attached, but I probably can power off and attach one. I am sure I have a spare 320 GB lying around here somewhere.
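
If I go the label route with that spare, I suppose the steps would be something like this (assuming the new disk also shows up as ada2 once it is attached, and using the drive's serial number as the label name):

Code:
# put a GEOM label on the fresh disk; it then appears as /dev/label/<name>
glabel label 9QF9ABST ada2

# confirm the label exists
glabel status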
 
pillai_hfx said:
The following could work.

zpool offline <pool name> <disk>
zpool replace <pool name> <failed_device> <spare_device>
zpool detach <pool name> <disk>

Using labels for disks could make this a lot easier in the long term.

OK, they worked (after physically replacing the disk). I will check whether there is actually an issue with the disk I replaced; I assume there must have been. If not, it means I need to keep a spare disk handy, the same size as or bigger than the disks in this pool, in case this happens again. Interesting.


Code:
backup01# zpool status                                        
  pool: backup
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        backup                    DEGRADED     0     0     0
          raidz1-0                DEGRADED     0     0     0
            ada1                  ONLINE       0     0     0
            ada4                  ONLINE       0     0     0
            ada5                  ONLINE       0     0     0
            ada3                  ONLINE       0     0     0
            11843460613588413425  OFFLINE      0     0     0  was /dev/ada2

errors: No known data errors

backup01# zpool replace -f backup  11843460613588413425  label/9QF9ABST

backup01# zpool status                                                 
  pool: backup
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jun  9 23:46:33 2013
        91.5M scanned out of 645G at 13.1M/s, 14h1m to go
        17.4M resilvered, 0.01% done
config:

        NAME                        STATE     READ WRITE CKSUM
        backup                      DEGRADED     0     0     0
          raidz1-0                  DEGRADED     0     0     0
            ada1                    ONLINE       0     0     0
            ada4                    ONLINE       0     0     0
            ada5                    ONLINE       0     0     0
            ada3                    ONLINE       0     0     0
            replacing-4             OFFLINE      0     0     0
              11843460613588413425  OFFLINE      0     0     0  was /dev/ada2
              label/9QF9ABST        ONLINE       0     0     0  (resilvering)

errors: No known data errors
 
In FreeBSD you could use either glabel or GPT labels. Labeling makes disks easier to identify down the road and guards against device names getting rearranged after a reboot.
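
Roughly, the two approaches look like this (disk ada6 and the label name are only examples, and you would pick one approach, not both):

Code:
# option 1: glabel on the whole disk; shows up as /dev/label/disk06
glabel label disk06 ada6

# option 2: GPT partition with a label; shows up as /dev/gpt/disk06
gpart create -s gpt ada6
gpart add -t freebsd-zfs -l disk06 ada6

# either name can then be given to zpool instead of the raw adaX device
zpool replace backup 11843460613588413425 gpt/disk06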

You can check the logs to see why the failed disk was presumed faulty, and maybe use smartmontools to see whether the disk is actually healthy.
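
For example (assuming the suspect disk is still ada2 and the smartmontools port is installed):

Code:
# kernel messages from when the disk dropped out
grep ada2 /var/log/messages

# SMART health summary and error log for the disk
smartctl -a /dev/ada2

# kick off a long self-test, then re-check with smartctl -a later
smartctl -t long /dev/ada2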

One thing to keep in mind about spare disks in ZFS on FreeBSD: these are cold spares, not hot spares like the ones you would see with a hardware RAID controller or Linux md RAID.
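
A spare can still be registered in the pool, it just has to be brought in by hand when something fails; roughly (ada6 again being only an example device):

Code:
# register a disk as a spare in the pool
zpool add backup spare ada6

# when a member fails, the spare is NOT pulled in automatically;
# activate it manually against the failed member (device name or GUID)
zpool replace backup <failed_device> ada6

# after the failed disk has been repaired or swapped out,
# detaching the spare returns it to the spare list
zpool detach backup ada6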
 
Thanks for the response.

I did some reading and re-did the whole zpool using GPT labels: # gpart add -t freebsd-zfs -l some_label.
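
Roughly, the re-creation looked like this (the pool was expendable, the gpart steps are repeated for each of the five disks, and the label names are only examples):

Code:
# the old pool was expendable, so destroy it first
zpool destroy backup

# for each of the five disks: wipe, create a GPT scheme, add a labelled partition
gpart destroy -F ada1
gpart create -s gpt ada1
gpart add -t freebsd-zfs -l disk01 ada1

# finally, build the new pool from the labelled partitions
zpool create backup raidz1 gpt/disk01 gpt/disk02 gpt/disk03 gpt/disk04 gpt/disk05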

Thanks for the info!
 