ZFS raid groups with swapped devices (devices appear twice)

I had a hard time describing this problem succinctly to generate a title. Thanks also for your patience, as I am not a professional sysadmin but a scientist.

The bottom line is that my 4U JBOD power-cycled (or the controller dropped it) for some reason. When the disks came back online, there were two problems:

1. Devices were re-numbered (/dev/daNN)
2. Two devices never came back online (not recognized by CAM subsystem).

I think either of these problems alone would be no big deal for the system, but as a result of these two problems interacting, ZFS now shows duplicated device names in different RAID groups (see below).

With respect to #1, this is a known issue. In Linux, I avoid this by referring to devices by SCSI/WWN (/dev/disk/by-id/...) when creating the zpool. I'm not sure what I should have done differently here other than manually labeling all 24 disks prior to creating the pool, but it seems ridiculous to have a human do something a computer should be able to handle without difficulty. In any event, this is not the primary problem.
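
For example, something along these lines is what I mean (the WWNs are placeholders and this shows far fewer disks than the real pool):
Code:
```
# Linux: build the raidz2 vdev from stable by-id paths instead of sdX names
# (WWNs below are placeholders, not my actual drives)
zpool create bigpool raidz2 \
  /dev/disk/by-id/wwn-0x5000cca200000001 \
  /dev/disk/by-id/wwn-0x5000cca200000002 \
  /dev/disk/by-id/wwn-0x5000cca200000003 \
  /dev/disk/by-id/wwn-0x5000cca200000004
```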

With respect to #2, the devices were recognized by the MPR driver, but not by the CAM subsystem. `camcontrol rescan` brought them back online and assigned them the last two IDs, `da58` and `da59`. (Thanks to this forum post [0].)
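
For completeness, the rescan amounted to this (the da58/da59 names are of course specific to my system):
Code:
```
# Ask CAM to re-probe all buses so the two missing disks get device nodes again
camcontrol rescan all
# Verify they are visible; on my system they came back as da58 and da59
camcontrol devlist
```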

Now note carefully below: `da50` and `da57` are listed in both the `raidz2-1` and `raidz2-2` groups (whereas `da58` and `da59` should be listed instead).

Code:
```
# zpool status bigpool
  pool: bigpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 12:53:37 with 0 errors on Sat Mar 16 04:36:15 2024
config:


    NAME        STATE     READ WRITE CKSUM
    bigpool     DEGRADED     0     0     0
      raidz2-0  ONLINE       0     0     0
        da53    ONLINE       0     0     0
        da54    ONLINE       0     0     0
        da36    ONLINE       0     0     0
        da56    ONLINE       0     0     0
        da51    ONLINE       0     0     0
        da55    ONLINE       0     0     0
        da38    ONLINE       0     0     0
        da39    ONLINE       0     0     0
      raidz2-1  DEGRADED     0     0     0
        da52    ONLINE       0     0     0
        da57    ONLINE       0     0     0
        da37    ONLINE       0     0     0
        da40    ONLINE       0     0     0
        da50    FAULTED      0     0     0  corrupted data
        da47    ONLINE       0     0     0
        da42    ONLINE       0     0     0
        da41    ONLINE       0     0     0
      raidz2-2  DEGRADED     0     0     0
        da44    ONLINE       0     0     0
        da48    ONLINE       0     0     0
        da45    ONLINE       0     0     0
        da49    ONLINE       0     0     0
        da43    ONLINE       0     0     0
        da50    ONLINE       0     0     0
        da57    FAULTED      0     0     0  corrupted data
        da46    ONLINE       0     0     0


errors: No known data errors
```

Now, I'm terrified to issue a `zpool replace` command for two reasons:
1. Naively replacing, say, `da50` might replace the working device in `raidz2-2` rather than the faulted device in `raidz2-1`, leaving me with zero redundancy while resilvering. zpool-replace may be smart enough to prevent this unless I am dumb enough to include `-f`.
2. I don't actually know which of the two now-available devices (`da58` and `da59`) belongs with which raid group.

Problem #1 could be solved -- I think -- by using the GUID (obtained from `zdb`) instead of the daNN device name in the `zpool replace` command.

**How do I solve problem #2?**
For example, if the GUID associated with the FAULTED drive in `raidz2-1` is 10104343158814001513, and I issue `zpool replace bigpool 10104343158814001513 <daNN>`, it will certainly work no matter which of `da58` or `da59` I pick, but one will resilver nearly instantly and the other will take quite a long time.
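
Concretely, I imagine something like the following sketch; the GUID is the example one above, and da58 is just a guess at the right partner, which is exactly the part I don't know:
Code:
```
# List vdev GUIDs in place of device names (an alternative to digging through zdb output)
zpool status -g bigpool
# Replace by GUID so the healthy da50 in raidz2-2 can't be touched by mistake;
# whether da58 or da59 is the right new device is exactly what I'm unsure about
zpool replace bigpool 10104343158814001513 /dev/da58
```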
(Bonus question: Did I screw up by not GPT-labeling my disks before creating the pool? Chapter 22 of the handbook [1] never recommends this. Is there some other way to add devices by a stable identifier to prevent this problem in the future?)
[0] https://muc.lists.freebsd.scsi.nark...ll-disks-come-back-after-power-cycling-a-jbod
[1] https://docs.freebsd.org/en/books/handbook/zfs/
 
One of the big features of ZFS is that the metadata on the drive (and not which /dev/daNN it was assigned) determines what ZFS thinks it is, so the renumbering shouldn't be any real concern. The bigger issue is that a couple of devices didn't come back online on their own for whatever reason, but it appears you have them back alive (assigned /dev/daNN nodes).

You can also use zpool list -vg bigpool to show GUIDs, which is typically a nicer interface to parse than zdb.
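
For example:
Code:
```
# List the pool with vdev GUIDs instead of device names
zpool list -vg bigpool
# zpool status accepts the same flag if you prefer that layout
zpool status -g bigpool
```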

I would first try zpool clear bigpool 10104343158814001513 followed by zpool online bigpool 10104343158814001513. Assuming that works, do the same for the other drive. The first says "I know you had problems with this device, but be willing to try using it again", and the second says "Remember that device? You should be able to access it again."

You shouldn't need to replace devices, you really just want them brought back online and have any missing transactions played out, which is what the above should do.
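
Concretely, the sequence I have in mind (using the GUID you quoted; substitute the second drive's GUID on the second pass):
Code:
```
# Forget the recorded errors on the faulted vdev
zpool clear bigpool 10104343158814001513
# Bring it back into service; ZFS will resilver any missing transactions
zpool online bigpool 10104343158814001513
# Watch progress
zpool status bigpool
```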
 
FANTASTIC THANK YOU

In fact, zpool clear bigpool <guid> was enough to kick off the resilver, without needing to zpool online.

2.87M worth of diffs resilvered in 2 seconds with zero errors. Thank you again.

(I also infer from your answer that it is reasonable to not bother with gpt or geom labels in large zpools)
 
(I also infer from your answer that it is reasonable to not bother with gpt or geom labels in large zpools)
Glad I could help.

They (labels) are not needed for the functionality/stability of the pool, but they can be nice if you want device names to relate to some attribute (of your choice) that sticks to the drives.

I personally don’t bother with them for ZFS pools.
 
They (labels) are not needed for the functionality/stability of the pool, ...
But they are useful for debugging if something goes wrong. Debugging these things with just giant hex numbers (like WWNs or the ZFS-internal drive IDs) is much harder. It is particularly hard if the ID is not physically visible on the drive without removing it (the WWN is typically printed in a very small font or as a barcode on the paper label).

Personally, I use gpart labels on all disks, even if a system has only one disk right now.
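
Roughly like this, as a sketch; the label text, partition index, and device name are placeholders (I like to encode the bay and the serial number):
Code:
```
# Put a human-readable GPT label on partition 1 of da50 and refer to the stable
# /dev/gpt/ path from then on (names here are only examples)
gpart modify -i 1 -l bay07-SN12345678 da50
ls /dev/gpt/
```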
 
(I also infer from your answer that it is reasonable to not bother with gpt or geom labels in large zpools)
My opinion only:
"It depends on how you label". labels like "disk1, disk2, ...diskN" are basically the same as /dev/daN but if you always refer by the label any renumbering under the hood is irrelevant.
I like to label with something visible: if the serial number of the drive is visible on the drive, I use that. Makes it easier to replace a specific device. If you have a lot, labels that indicate a specific drive in a specific enclosure can be easier.

I agree with ralphbsz; I try to use labels on everything.
 
For the first problem, there is /dev/diskid.
I have a pool configured with diskid instead of names like ada or da.
Code:
```
        NAME                            STATE     READ WRITE CKSUM
        samd                            ONLINE       0     0     0
          raidz2-0                      ONLINE       0     0     0
            diskid/DISK-46GHK3K2FSDAp2  ONLINE       0     0     0
            diskid/DISK-46H3K5FLFSDAp2  ONLINE       0     0     0
            diskid/DISK-25S4K243FSDAp2  ONLINE       0     0     0
            diskid/DISK-3534KHPTFp2     ONLINE       0     0     0
```
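
If I remember correctly, an existing pool can be switched to these names by re-importing it with the search directory restricted (pool name here is just from my example above):
Code:
```
# Export, then import looking only under /dev/diskid so the vdevs are recorded
# by their diskid paths rather than daN/adaN names
zpool export samd
zpool import -d /dev/diskid samd
```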
 
And if you set kern.geom.label.disk_ident.enable to 0 (be careful: if you are already referring to disks via /dev/diskid, those nodes will disappear), /dev/gptid and /dev/gpt should appear. If I remember correctly, /dev/gptid refers to disks by UUID, though I have not tested it myself.
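
For reference, these are the loader tunables involved; the values shown are only an example, so check what your system currently has with sysctl kern.geom.label:
Code:
```
# /boot/loader.conf -- example settings only
kern.geom.label.disk_ident.enable="0"   # hide /dev/diskid/*
kern.geom.label.gptid.enable="1"        # expose /dev/gptid/* (GPT partition UUIDs)
kern.geom.label.gpt.enable="1"          # expose /dev/gpt/* (human-assigned GPT labels)
```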
 
If you're lucky enough to have the appropriate hardware, sesutil(8) locate can be very nice for finding the right drive to pull, too.
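
For example (da58 is just a stand-in for whatever drive you want to find):
Code:
```
# Blink the locate LED on the enclosure slot holding da58, then turn it off again
sesutil locate da58 on
sesutil locate da58 off
# sesutil map shows which enclosure slot each device sits in
sesutil map
```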
And if you are not so lucky, and you have a big enclosure with lots of disks, you should put some serious thought into how you will identify disks. Labeling them with gpart is a good starting point. One of the problems with all this is: if a disk is so broken that it doesn't communicate, how do you identify it? Usually, you have to resort to either sesutil (or equivalent commands), or you have to identify the missing disk by exclusion. The latter technique fails if two disks are unreachable. In large disk systems, identifying the physical location of disks is a huge and complex task.

As usual, an anecdote: At one former employer, we had a group that measured the causes for data loss, and the winner was: One disk fails on a customer's system, the system keeps working just fine because the storage is configured to be redundant (RAID), field service is sent to replace the failed disk, field service pulls the wrong disk and replaces it with a blank spare disk, system is now missing two disks (one failed, one pulled by mistake), hilarity ensues.
 