ZFS ZFS-8000-4J on pool - don't really know how I ended up there

Hi,

I have a pool (raid-z2) that has a disk that dropped out.
The disk is a bit difficult to locate physically, so there were instances where the wrong disk was pulled out.
Just re-adding the disk didn't work; I had to wipe a couple of MB from the start and the end of the disk.
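For reference, this is roughly what I did (the device node is assumed to be da7 here; zpool labelclear would be the cleaner way to achieve the same thing):

Code:
# clear the ZFS labels the proper way:
zpool labelclear -f da7
# or wipe the first and last few MB by hand:
dd if=/dev/zero of=/dev/da7 bs=1m count=4
# mediasize in bytes is the third field of diskinfo's output
dd if=/dev/zero of=/dev/da7 bs=1m \
    oseek=$(( $(diskinfo da7 | awk '{print $3}') / 1048576 - 4 ))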

Now, somehow I seem to have messed it up.

Code:
(server2 </root>) 1 # zpool status datapool
  pool: datapool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: resilvered 589G in 10 days 17:49:11 with 0 errors on Sat Jul 27 08:01:29 2024
config:

    NAME                      STATE     READ WRITE CKSUM
    datapool                  DEGRADED     0     0     0
      raidz2-0                DEGRADED     0     0     0
        da3                   ONLINE       0     0     0
        da2                   ONLINE       0     0     0
        da1                   ONLINE       0     0     0
        da0                   ONLINE       0     0     0
        da4                   ONLINE       0     0     0
        da5                   ONLINE       0     0     0
        da6                   REMOVED      0     0     0
        da6                   ONLINE       0     0     0
      raidz2-1                ONLINE       0     0     0
        da11                  ONLINE       0     0     0
        da10                  ONLINE       0     0     0
        da9                   ONLINE       0     0     0
        da8                   ONLINE       0     0     0
        da12                  ONLINE       0     0     0
        da13                  ONLINE       0     0     0
        da14                  ONLINE       0     0     0
        da15                  ONLINE       0     0     0
      raidz2-2                DEGRADED     0     0     0
        da16                  ONLINE       0     0     0
        da17                  ONLINE       0     0     0
        da20                  ONLINE       0     0     0
        da19                  ONLINE       0     0     0
        da18                  ONLINE       0     0     0
        da22                  ONLINE       0     0     0
        da21                  ONLINE       0     0     0
        14130565798947560696  FAULTED      0     0     0  was /dev/da23
      raidz2-3                ONLINE       0     0     0
        da23                  ONLINE       0     0     0
        da24                  ONLINE       0     0     0
        da25                  ONLINE       0     0     0
        da26                  ONLINE       0     0     0
        da27                  ONLINE       0     0     0
        da28                  ONLINE       0     0     0
        da29                  ONLINE       0     0     0
        da30                  ONLINE       0     0     0
      raidz2-4                ONLINE       0     0     0
        da31                  ONLINE       0     0     0
        da32                  ONLINE       0     0     0
        da33                  ONLINE       0     0     0
        da34                  ONLINE       0     0     0
        da35                  ONLINE       0     0     0
        da36                  ONLINE       0     0     0
        da37                  ONLINE       0     0     0
        da38                  ONLINE       0     0     0
      raidz2-5                ONLINE       0     0     0
        da39                  ONLINE       0     0     0
        da40                  ONLINE       0     0     0
        da41                  ONLINE       0     0     0
        da42                  ONLINE       0     0     0
        da43                  ONLINE       0     0     0
        da44                  ONLINE       0     0     0
        da45                  ONLINE       0     0     0
        da46                  ONLINE       0     0     0

Is there a way to salvage this?
I believe the "actual" disk is da7....

Code:
(server2 </root>) 0 # camcontrol inquiry da7
pass8: <HP EG001200JWFVA HPD3> Fixed Direct Access SPC-4 SCSI device
pass8: Serial Number 48D0A1GFFQXE1815
pass8: 1200.000MB/s transfers, Command Queueing Enabled
 
There's no obvious reason it shouldn't be salvageable given the displayed status. The most awkward issue is that a disk named da6 is part of the pool, while the disk that was originally da6 (before the hardware shuffling) is no longer in the pool, so ZFS is showing its last known device node for it, which was also da6.

I would suggest running zpool status -g to show the ZFS GUID for each disk. This will give you the GUID of the disk that shows as REMOVED. From what you've said, the relevant disk has been cleared and is now da7 (which doesn't appear anywhere else in the pool).
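A sketch of the relevant part of that output (the GUIDs below are made up; yours will differ):

Code:
(server2 </root>) 0 # zpool status -g datapool
    ...
    NAME                      STATE     READ WRITE CKSUM
    datapool                  DEGRADED     0     0     0
      4823749283748392847     DEGRADED     0     0     0   <- raidz2-0
        ...
        9182736450192837465   REMOVED      0     0     0   <- the pulled da6
        ...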

As such, you should be able to do a zpool replace datapool {id_of_disk_showing_REMOVED} da7. Do not use -f to force anything.
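For example (the GUID here is a placeholder; substitute the one you got from zpool status -g):

Code:
zpool replace datapool 9182736450192837465 da7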

You'll need to let the pool resilver, and then attend to the other disk, the one in raidz2-2 showing as FAULTED.
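That one can be handled the same way once its replacement is physically in place. A sketch, assuming the new disk turns up as da47 (the GUID is the one already shown in your status output):

Code:
# 14130565798947560696 is the FAULTED entry that was /dev/da23
zpool replace datapool 14130565798947560696 da47
# keep an eye on the resilver:
zpool status datapool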
 
Ah, OK. Thanks a lot!
It's resilvering now, which typically takes 10 days(!) - which in itself is a bit worrying.
These are 1.2 TB SAS disks. I can't imagine how long a resilver of an 8 TB disk would take.
 