ZFS Pool Recovery

Hi All,

kind of a weird problem with my pool

The server itself has 2 LSI 8i cards, with each port broken out into 4 SATA drive connectors.
(8 drives on one card, 4 on the other .. and the other 4 ports serve a separate pool)

the 12 drives are arranged in a single raidz3 vdev

normally I reboot, enter the key passphrase, GELI calculates the keys for drives 0-11, and everything is fine.

over the past couple of days, one of the mini-SAS cables died (the one with vdev members 4/5/6/7)
this caused the same drive (da7) to keep getting ejected from the pool.

after fighting with it and trying to resilver, I realized it was probably the cable .. so I pulled the 4th cable that was attached to the other pool and used it to replace the cable with 4/5/6/7 on it .. the rebuild worked fine with no more errors..

but...

I can't fix the pool..

now when I reboot and enter the GELI Decryption key...
Code:
Calculating GELI Decryption Key for disk0p4: xxx iterations
Calculating GELI Decryption Key for disk1p3: xxx iterations
Calculating GELI Decryption Key for disk2p3: xxx iterations
Calculating GELI Decryption Key for disk3p3: xxx iterations
Calculating GELI Decryption Key for disk4p3: xxx iterations
Calculating GELI Decryption Key for disk5p3: xxx iterations
Calculating GELI Decryption Key for disk6p3: xxx iterations
Calculating GELI Decryption Key for disk7p3: xxx iterations
Calculating GELI Decryption Key for disk8p3: xxx iterations
Calculating GELI Decryption Key for disk9p3: xxx iterations
Calculating GELI Decryption Key for disk11p3: xxx iterations
notice it skips disk10 (da10)

during the boot process, I get prompted for the passphrase for da6p3

when I enter the passphrase, the system boots after failing to mount the second pool..

then I get this.

Code:
$ zpool status
  pool: abyss
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: resilvered 364G in 00:27:27 with 0 errors on Sun Apr  3 18:28:07 2022
config:

        NAME                     STATE     READ WRITE CKSUM
        abyss                    DEGRADED     0     0     0
          raidz3-0               DEGRADED     0     0     0
            da10p3.eli           ONLINE       0     0     0
            1501231572552313979  FAULTED      0     0     0  was /dev/da7p3.eli
            da3p3.eli            ONLINE       0     0     0
            da0p3.eli            ONLINE       0     0     0
            da6p3.eli            ONLINE       0     0     0
            da4p3.eli            ONLINE       0     0     0
            da5p3.eli            ONLINE       0     0     0
            da7p3.eli            ONLINE       0     0     0
            da9p3.eli            ONLINE       0     0     0
            da8p3.eli            ONLINE       0     0     0
            da1p3.eli            ONLINE       0     0     0
            da2p3.eli            ONLINE       0     0     0
        logs
          mirror-1               ONLINE       0     0     0
            ada1                 ONLINE       0     0     0
            ada2                 ONLINE       0     0     0

notice da6 is fine, but there is no da11 .. and zpool status seems to think the faulted device (shown only by its GUID) was da7p3, even though da7 is already in the pool ..


any idea where to go from here?! any help would be awesome, thanks!


Code:
zfs-2.0.0-FreeBSD_gf11b09dec
zfs-kmod-2.0.0-FreeBSD_gf11b09dec
freebsd-version 13.0-RELEASE-p7
 
Because in the slot where da11p3 is supposed to be, zpool says the device was formerly da7p3 (which is already in the pool), it points to cables possibly being mixed up.
(Tip: label drives and cables. A simple magic marker may already do the job.)

Besides that, the fact that the drive is shown only by its ID is not a good sign. This drive needs to be removed and tested separately. It does not necessarily mean there is serious hardware damage, but in any case the partition table has been wiped.
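One way to test such a drive on its own is a SMART self-test, a sketch assuming sysutils/smartmontools is installed and with da10 as a placeholder for the drive under test:

```shell
# Overall health verdict, attributes, and the device error log
smartctl -a /dev/da10

# Start a long (full-surface) self-test; it runs in the drive's background
smartctl -t long /dev/da10

# Later, check the self-test results
smartctl -l selftest /dev/da10
```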

(Tip: Always keep the partition tables of all of your drives saved on a separate system/drive. gpart backup produces a small file in no time, and with gpart restore the partition scheme can be restored, or copied to a new drive, just as quickly.)
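A minimal sketch of that workflow (da0 and the backup path are placeholders):

```shell
# Save the partition table of da0 as a small text file
gpart backup da0 > /root/gpart.da0.bak

# Later: recreate the same scheme on a (replacement) drive,
# -F destroys any partition scheme already on it
gpart restore -F da0 < /root/gpart.da0.bak
```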
 
thanks! to make a long story short, I did add stickies with the serial numbers and HBA card to each drive .. basically pulled the entire chain apart and powered it down for 30 mins (originally I didn't realize that the power actually went off). after rebuilding the chain it detected everything properly and started a resilver.

the only weird part is it still asks for the passphrase at boot, and for one drive during userland setup.. so there's something weird still going on there.. it's like GELI is not able to decrypt it.

once the resilver was complete I started a scrub. 11% into it, it has 6 checksum errors.. but the pool is all online.. guess I'll have to wait a day or two to see if it actually finishes the scrub.

I'll look into dumping the labels for the pool though, thanks

cheers
 
the only weird part is it still asks for the passphrase at boot, and for one drive during userland setup.. so there's something weird still going on there.. it's like GELI is not able to decrypt it.
Did you replace the disk without putting geli(8) on it first? Or accidentally added the disk instead of the .eli device?
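For reference, the usual sequence when physically replacing a disk in a GELI-backed pool looks roughly like this. A sketch only, not necessarily this poster's exact setup; device names, partition layout, and geli options are placeholders to adapt:

```shell
# 1. Recreate the partition layout on the new disk (e.g. from a gpart backup)
gpart restore -F da10 < /root/gpart.da10.bak

# 2. Put GELI on the ZFS partition first, matching the other members'
#    passphrase/key setup (add -b if it must attach during boot)
geli init -s 4096 da10p3
geli attach da10p3

# 3. Replace using the .eli device, not the raw disk
zpool replace abyss 15763533421214287197 da10p3.eli
```

If the raw disk is handed to zpool replace instead, the pool still resilvers, but that member ends up unencrypted and the device naming gets confusing.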
 
Did you replace the disk without putting geli(8) on it first? Or accidentally added the disk instead of the .eli device?

I didn't actually replace any disks.. however after reseating the entire chain and taking the good old label maker and adding the serial numbers to the backs of them .. I tossed/replaced the cable... the resilver finished properly and then I did a scrub to rebuild the checksums .. it finished all of that with 0 errors and is back online now.

I'm guessing there's a way to change the GELI passphrase "properly" .. and that update process should apply it to all of the disks in the pool?

but I really need to understand the risk and have a plan before I willy-nilly try something like that ... it's a primary storage pool and rebuilding it because of an accident would be terrible at best.

thanks again!
 
… a way to change the GELI passphrase.. "properly"

<https://old.reddit.com/r/freebsd/comments/sduhkv/-/huk0791/?context=1>

(From the manual page, it wasn't clear to me.)

and that update process should apply it to all of the disks in the pool? …

I don't know (don't have any comparable test facility).

With the setup below, geli setkey /dev/ada0p3 succeeded (I entered a new passphrase then restarted the OS prior to adding this comment):

Code:
root@mowa219-gjp4-8570p-freebsd:~ # geli status
      Name  Status  Components
ada0p3.eli  ACTIVE  ada0p3
ada0p2.eli  ACTIVE  ada0p2
root@mowa219-gjp4-8570p-freebsd:~ # lsblk ada0
DEVICE         MAJ:MIN SIZE TYPE                                          LABEL MOUNT
ada0             0:119 932G GPT                                               - -
  ada0p1         0:121 260M efi                                    gpt/efiboot0 -
  <FREE>         -:-   1.0M -                                                 - -
  ada0p2         0:123  16G freebsd-swap                              gpt/swap0 SWAP
  ada0p2.eli     2:45   16G freebsd-swap                                      - SWAP
  ada0p3         0:125 915G freebsd-zfs                                gpt/zfs0 <ZFS>
  ada0p3.eli     0:131 915G zfs                                               - -
  <FREE>         -:-   708K -                                                 - -
root@mowa219-gjp4-8570p-freebsd:~ # geli setkey /dev/ada0p2
geli: Cannot read metadata from /dev/ada0p2: Invalid argument.
root@mowa219-gjp4-8570p-freebsd:~ # geli setkey /dev/ada0p3
Enter new passphrase:

root@mowa219-gjp4-8570p-freebsd:~ #
 
Still having a hard time with this pool... if anyone can help.

Code:
root@abyss:/ # zpool status
  pool: abyss
 state: ONLINE
  scan: resilvered 21.2M in 00:00:02 with 0 errors on Thu Jun  9 00:31:33 2022
config:

        NAME            STATE     READ WRITE CKSUM
        abyss           ONLINE       0     0     0
          raidz3-0      ONLINE       0     0     0
            da3p3.eli   ONLINE       0     0     0
            da8p3.eli   ONLINE       0     0     0
            da5p3.eli   ONLINE       0     0     0
            da4p3.eli   ONLINE       0     0     0
            da10p3.eli  ONLINE       0     0     0
            da11p3.eli  ONLINE       0     0     0
            da9p3.eli   ONLINE       0     0     0
            da2p3.eli   ONLINE       0     0     0
            da1p3.eli   ONLINE       0     0     0
            da0p3.eli   ONLINE       0     0     0
            da7p3.eli   ONLINE       0     0     0
            da6p3.eli   ONLINE       0     0     0
        logs
          mirror-1      ONLINE       0     0     0
            ada1        ONLINE       0     0     0
            ada2        ONLINE       0     0     0

everything is awesome..

then out of the blue I get this..

Code:
root@abyss:/dragon/vm/titan # zpool status
  pool: abyss
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 21.2M in 00:00:02 with 0 errors on Thu Jun  9 00:31:33 2022
config:

        NAME            STATE     READ WRITE CKSUM
        abyss           DEGRADED     0     0     0
          raidz3-0      DEGRADED     0     0     0
            da3p3.eli   ONLINE       0     0     0
            da8p3.eli   ONLINE       0     0     0
            da5p3.eli   ONLINE       0     0     0
            da4p3.eli   ONLINE       0     0     0
            da10p3.eli  FAULTED    170   154     0  too many errors
            da11p3.eli  ONLINE       0     0     0
            da9p3.eli   ONLINE       0     0     0
            da2p3.eli   FAULTED    122   131     0  too many errors
            da1p3.eli   ONLINE       0     0     0
            da0p3.eli   ONLINE       0     0     0
            da7p3.eli   ONLINE       0     0     0
            da6p3.eli   ONLINE       0     0     0
        logs
          mirror-1      ONLINE       0     0     0
            ada1        ONLINE       0     0     0
            ada2        ONLINE       0     0     0

If I reboot it goes back to no errors..

I changed all of the cables.

I just replaced da10

Code:
$ zpool status
  pool: abyss
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-2Q
  scan: resilvered 228K in 00:00:02 with 0 errors on Thu Jun  9 06:37:06 2022
config:

        NAME                      STATE     READ WRITE CKSUM
        abyss                     DEGRADED     0     0     0
          raidz3-0                DEGRADED     0     0     0
            da3p3.eli             ONLINE       0     0     0
            da8p3.eli             ONLINE       0     0     0
            da5p3.eli             ONLINE       0     0     0
            da4p3.eli             ONLINE       0     0     0
            15763533421214287197  UNAVAIL      0     0     0  was /dev/da10p3.eli
            da11p3.eli            ONLINE       0     0     0
            da9p3.eli             ONLINE       0     0     0
            da2p3.eli             ONLINE       0     0     0
            da1p3.eli             ONLINE       0     0     0
            da0p3.eli             ONLINE       0     0     0
            da7p3.eli             ONLINE       0     0     0
            da6p3.eli             ONLINE       0     0     0
        logs
          mirror-1                ONLINE       0     0     0
            ada1                  ONLINE       0     0     0
            ada2                  ONLINE       0     0     0

any time I do a scrub or resilver, da2 and some other random drive spits up errors..

I issued a replace command..

zpool replace abyss 15763533421214287197 da10

and currently I get this

Code:
root@abyss:/dev # zpool status
  pool: abyss
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jun  9 13:49:38 2022
        8.58T scanned at 9.94G/s, 1.41T issued at 1.64G/s, 39.9T total
        7.26M resilvered, 3.54% done, 06:41:24 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        abyss                       DEGRADED     0     0     0
          raidz3-0                  DEGRADED     0     0     0
            da3p3.eli               ONLINE       0     0     0
            da8p3.eli               ONLINE       0     0     0
            da5p3.eli               ONLINE       0     0     0
            da4p3.eli               ONLINE       0     0     0
            replacing-4             UNAVAIL      0 1.08K   394  insufficient replicas
              15763533421212871947  UNAVAIL      0     0     0  was /dev/da10p3.eli
              da10                  FAULTED      3 1.28K     0  too many errors  (resilvering)
            da11p3.eli              FAULTED  1.15K 1.38K     0  too many errors
            da9p3.eli               ONLINE       0     0     0
            da2p3.eli               FAULTED  1.27K 1.16K     0  too many errors
            da1p3.eli               ONLINE       0     0     0
            da0p3.eli               ONLINE       0     0     0
            da7p3.eli               ONLINE       0     0     0
            da6p3.eli               ONLINE       0     0     0
        logs
          mirror-1                  ONLINE       0     0     0
            ada1                    ONLINE       0     0     0
            ada2                    ONLINE       0     0     0

kernel messages are all over the place, like..

GEOM_ELI: g_eli_read_done() failed (error=6) da11p3.eli[READ(offset=79992347475324, length=4096)]

it's currently resilvering..

any ideas?

thanks
 
for completeness, the full output / hookup is as such..

the system has 2 LSI 9103 cards, each card has 2 mini-SAS connectors. each of those uses a 4-port splitter.

hba0 has the first 8 drives of abyss
hba1 has the last 4 drives of abyss and 4 SSDs with dragon

ghost is a single drive plugged into the mb SATA port (it's a dirty pool used for /ftp/incoming)
titan is also a single drive plugged into the mb SATA port (boot drive)
the mirrors (ada1/2) are also plugged into the SATA ports of the mb

for reference, no pool other than abyss has ever had an error..
abyss has been running for 4+ years without a single error

I also replaced all 4 SAS breakout cables.

Code:
root@abyss:/dev # zpool status
  pool: abyss
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jun  9 13:49:38 2022
        18.2T scanned at 1.25G/s, 13.1T issued at 924M/s, 39.9T total
        7.26M resilvered, 32.80% done, 08:27:24 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        abyss                       DEGRADED     0     0     0
          raidz3-0                  DEGRADED   289     2     0
            da3p3.eli               ONLINE     300    26     0
            da8p3.eli               ONLINE       0     0     0
            da5p3.eli               ONLINE       0     0     0
            da4p3.eli               ONLINE       0     0     0
            replacing-4             UNAVAIL      0 1.08K   394  insufficient replicas
              15763533421212871947  UNAVAIL      0     0     0  was /dev/da10p3.eli
              da10                  FAULTED      3 1.28K     0  too many errors  (resilvering)
            da11p3.eli              FAULTED  1.15K 1.38K     0  too many errors
            da9p3.eli               ONLINE       0     0     0
            da2p3.eli               FAULTED  1.27K 1.16K     0  too many errors
            da1p3.eli               ONLINE       0     0     0
            da0p3.eli               ONLINE       0     0     0
            da7p3.eli               ONLINE       0     0     0
            da6p3.eli               ONLINE       0     0     0
        logs
          mirror-1                  ONLINE       0     0     0
            ada1                    ONLINE       0     0     0
            ada2                    ONLINE       0     0     0

errors: 292 data errors, use '-v' for a list

  pool: dragon
 state: ONLINE
  scan: scrub repaired 0B in 00:06:08 with 0 errors on Sat Jan 29 12:05:24 2022
config:

        NAME        STATE     READ WRITE CKSUM
        dragon      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da12    ONLINE       0     0     0
            da15    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            da14    ONLINE       0     0     0
            da13    ONLINE       0     0     0

errors: No known data errors

  pool: ghost
 state: ONLINE
  scan: scrub repaired 0B in 01:39:36 with 0 errors on Mon Apr  4 09:33:24 2022
config:

        NAME        STATE     READ WRITE CKSUM
        ghost       ONLINE       0     0     0
          ada3      ONLINE       0     0     0

errors: No known data errors

  pool: titan
 state: ONLINE
  scan: scrub repaired 0B in 00:01:49 with 0 errors on Mon Apr  4 07:55:44 2022
config:

        NAME          STATE     READ WRITE CKSUM
        titan         ONLINE       0     0     0
          ada0p4.eli  ONLINE       0     0     0

thanks!

the vm is running on dragon
there are a few jails that do run from abyss

Code:
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
abyss   87.2T  39.9T  47.3T        -         -     0%    45%  1.00x  DEGRADED  -
dragon   928G   237G   691G        -         -    41%    25%  1.00x    ONLINE  -
ghost   5.44T  1.28T  4.15T        -         -     9%    23%  1.00x    ONLINE  -
titan   87.5G  53.8G  33.7G        -         -    38%    61%  1.00x    ONLINE  -

the pool is currently online, but it keeps getting errors during the resilver process..

the errors show up on different drives each time and it's just a mess.
 
(Tip: Always have the partition tables of all of your drives saved on an extra system/drive. gpart backup produces a small file within no time. And with gpart restore the partition scheme is restored or copied to a new drive such as quickly.)
This gets done by default with the periodic(8) housekeeping run out of /etc/crontab.
The control file is /etc/periodic/daily/221.backup-gpart. Jails are excluded.
The backups are kept in /var/backups -- which, as you suggest, should get backed up.
 
:-/ ...there still is the possibility that the hardware may be faulty, such as one controller on one of your cards being defective, e.g. from an ESD impact....
 
Yes, I reseated it and changed cables .. although it does still get errors during the resilver, the behaviour has changed to just removing the device within the first 2-3 mins of starting the resilver.

that being said, the resilver did complete. I cleared the errors on the pool, however it still comes back like so

Code:
root@abyss:~ # zpool status
  pool: abyss
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jun 10 05:45:42 2022
        32.7T scanned at 1.89G/s, 25.7T issued at 1.48G/s, 39.8T total
        2.14T resilvered, 64.51% done, 02:42:47 to go

        NAME                        STATE     READ WRITE CKSUM
        abyss                       DEGRADED     0     0     0
          raidz3-0                  DEGRADED     0     0     0
            da3p3.eli               REMOVED      0     0     0
            da10p3.eli              ONLINE       0     0     0
            da5p3.eli               ONLINE       0     0     0
            da4p3.eli               ONLINE       0     0     0
            replacing-4             DEGRADED     0     0     0
              15763533421212871947  FAULTED      0     0     0  was /dev/da10p3.eli
              da11                  ONLINE       0     0     0  (resilvering)
            da8p3.eli               ONLINE       0     0     0
            da9p3.eli               ONLINE       0     0     0
            da2p3.eli               REMOVED      0     0     0  (resilvering)
            da1p3.eli               ONLINE       0     0     0
            da0p3.eli               ONLINE       0     0     0
            da7p3.eli               ONLINE       0     0     0
            da6p3.eli               ONLINE       0     0     0
        logs
          mirror-1                  ONLINE       0     0     0
            ada1                    ONLINE       0     0     0
            ada2                    ONLINE       0     0     0

I do not see why it now magically thinks it's replacing with da11 instead of da10 .. nor why the replace command has not completed..

I'm not sure what needs to be done at this point.
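When a replacing-N vdev lingers after its resilver has finished, one thing that sometimes resolves it is detaching the dead member by its GUID, which promotes the surviving member and makes the replacing vdev disappear. A sketch using the GUID from the status output above:

```shell
# After the resilver completes, drop the unavailable half of replacing-4
zpool detach abyss 15763533421212871947
```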
 
Yes, I reseated it and changed cables .. although it does still get errors during the resilver, the behaviour has changed to just removing the device within the first 2-3 mins of starting the resilver.

that being said, the resilver did complete. I cleared the errors on the pool, however it still comes back like so

Code:
        NAME                        STATE     READ WRITE CKSUM
        abyss                       DEGRADED     0     0     0
          raidz3-0                  DEGRADED     0     0     0
            da3p3.eli               REMOVED      0     0     0
            da10p3.eli              ONLINE       0     0     0
            da5p3.eli               ONLINE       0     0     0
            da4p3.eli               ONLINE       0     0     0
            replacing-4             DEGRADED     0     0     0
              15763533421212871947  FAULTED      0     0     0  was /dev/da10p3.eli
              da11                  ONLINE       0     0     0  (resilvering)
            da8p3.eli               ONLINE       0     0     0
            da9p3.eli               ONLINE       0     0     0
            da2p3.eli               REMOVED      0     0     0  (resilvering)
            da1p3.eli               ONLINE       0     0     0
            da0p3.eli               ONLINE       0     0     0
            da7p3.eli               ONLINE       0     0     0
            da6p3.eli               ONLINE       0     0     0
        logs
          mirror-1                  ONLINE       0     0     0
            ada1                    ONLINE       0     0     0
            ada2                    ONLINE       0     0     0

I do not see why it now magically thinks it's replacing with da11 instead of da10 .. nor why the replace command has not completed..

I'm not sure what needs to be done at this point.

You've cut off the progress portion of the status, but I assume it says it is progressing?

You say you reseated cables — was this while the system was on? I could certainly see devices getting renumbered depending on the order of operations...
 
last night...

errors were being generated throughout the resilvering process.. when I started it, it would progress, then within about an hr there were lots of read/write/checksum errors..
so the resilver would just fail, hang, or need to be restarted.

after powering down the machine I reseated both HBA cards, because all the errors appeared to be on 4 drives connected to the same SAS card.. I changed the data cables on those 4 drives and reseated the power connectors.

after that was done I powered it back on; during the boot process it spit up a bunch of errors on da2 and ejected the drive from the pool, with a (resilvering) message

then I started the resilver process.. it completed without any more errors.
however, the "replacing" message was still there..


this morning I ran zpool clear abyss

the "replacing" did not go away, and upon scanning the pool again I got the same 1.3k errors; it spit out the two disks (da2 and da3) and is currently resilvering again.

again, the process seems to be progressing.. (it takes about 8 hrs to resilver the pool) it's currently at 53.65%


I'm not sure what to do now, or how to get out of this "endless" loop ..

thanks!
 
the resilver finished...

Code:
root@abyss:~ # zpool status
  pool: abyss
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 3.33T in 07:45:44 with 0 errors on Fri Jun 10 13:31:26 2022
config:

        NAME            STATE     READ WRITE CKSUM
        abyss           DEGRADED     0     0     0
          raidz3-0      DEGRADED     0     0     0
            da3p3.eli   REMOVED      0     0     0
            da10p3.eli  ONLINE       0     0     0
            da5p3.eli   ONLINE       0     0     0
            da4p3.eli   ONLINE       0     0     0
            da11        ONLINE       0     0     0
            da8p3.eli   ONLINE       0     0     0
            da9p3.eli   ONLINE       0     0     0
            da2p3.eli   REMOVED      0     0     0
            da1p3.eli   ONLINE       0     0     0
            da0p3.eli   ONLINE       0     0     0
            da7p3.eli   ONLINE       0     0     0
            da6p3.eli   ONLINE       0     0     0
        logs
          mirror-1      ONLINE       0     0     0
            ada1        ONLINE       0     0     0
            ada2        ONLINE       0     0     0

as you can see it finished...

then I rebooted.. and .. presto ...

Code:
$ zpool status
  pool: abyss
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jun 10 06:54:36 2022
        7.22T scanned at 300M/s, 6.37T issued at 265M/s, 39.8T total
        0B resilvered, 16.00% done, 1 days 12:49:12 to go
config:

        NAME            STATE     READ WRITE CKSUM
        abyss           ONLINE       0     0     0
          raidz3-0      ONLINE       0     0     0
            da3p3.eli   ONLINE       0     0     0
            da10p3.eli  ONLINE       0     0     0
            da5p3.eli   ONLINE       0     0     0
            da4p3.eli   ONLINE       0     0     0
            da11        ONLINE       0     0     0
            da8p3.eli   ONLINE       0     0     0
            da9p3.eli   ONLINE       0     0     0
            da2p3.eli   ONLINE       0     0     2
            da1p3.eli   ONLINE       0     0     0
            da0p3.eli   ONLINE       0     0     0
            da7p3.eli   ONLINE       0     0     0
            da6p3.eli   ONLINE       0     0     0
        logs
          mirror-1      ONLINE       0     0     0
            ada1        ONLINE       0     0     0
            ada2        ONLINE       0     0     0

then 10-15 mins later its back to ..

Code:
$ zpool status
  pool: abyss
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jun 10 06:54:36 2022
        28.6T scanned at 1.12G/s, 27.7T issued at 1.09G/s, 39.8T total
        361M resilvered, 69.65% done, 03:09:38 to go
config:

        NAME            STATE     READ WRITE CKSUM
        abyss           DEGRADED     0     0     0
          raidz3-0      DEGRADED 4.83K     4     0
            da3p3.eli   FAULTED    332   395     0  too many errors  (resilvering)
            da10p3.eli  ONLINE       0     0     0
            da5p3.eli   ONLINE       0     0     0
            da4p3.eli   ONLINE       0     0     0
            da11        REMOVED      0     0     0
            da8p3.eli   ONLINE   9.67K    60     0
            da9p3.eli   ONLINE       0     0     0
            da2p3.eli   FAULTED  1.18K 1.14K     2  too many errors
            da1p3.eli   ONLINE       0     0     0
            da0p3.eli   ONLINE       0     0     0
            da7p3.eli   ONLINE       0     0     0
            da6p3.eli   ONLINE       0     0     0
        logs
          mirror-1      ONLINE       0     0     0
            ada1        ONLINE       0     0     0
            ada2        ONLINE       0     0     0

and back to resilvering..


I did order a pair of new controllers and a new set of mini-SAS connectors .... I'm guessing at this point it's hardware .. assuming the "online" status means the data itself was repaired..



does switching the host adapter card affect the encrypted pool at all? I'm guessing it should be fine as they are all in HBA mode? unlike a hardware RAID controller..
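(For reference: GELI stores its metadata in the last sector of the provider itself, i.e. on the partition, so swapping a plain HBA should not disturb it as long as the disks and partitions show up unchanged. One way to sanity-check that the on-disk metadata is still readable, with da0p3 as a placeholder:)

```shell
# Print the GELI metadata stored on the provider; an intact, readable dump
# (expected version, sector size, and flags) means the metadata survived
geli dump /dev/da0p3
```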

thanks
 
I replaced 2 disks, resilvered and brought the pool back online ..

It has been running fine at a normal workload for a few days...

everything seems ok ... EXCEPT...

before a scrub reaches 10% .. seemingly random drives throw "TOO MANY ERRORS" and are faulted out of the pool...

yet, stopping the scrub triggers a resilver .. and that completes with no errors and the pool comes fully back online.


what is going on?
is the pool healthy, or not?
why would only a scrub fail, yet normal operation of the pool is fine?

what kind of weird junk is going on with 2.0?

I started a scrub after the pool had been online and 100% fine for 2 days

Code:
root@abyss:~ # zpool status
  pool: abyss
 state: ONLINE
  scan: scrub in progress since Mon Jun 13 13:37:34 2022
        7.17T scanned at 47.3G/s, 31.1G issued at 205M/s, 39.9T total
        0B repaired, 0.08% done, 2 days 08:31:22 to go
config:

        NAME            STATE     READ WRITE CKSUM
        abyss           ONLINE       0     0     0
          raidz3-0      ONLINE       0     0     0
            da3p3.eli   ONLINE       0     0     0
            da8p3.eli   ONLINE       0     0     0
            da5p3.eli   ONLINE       0     0     0
            da4p3.eli   ONLINE       0     0     0
            da10        ONLINE       0     0     0
            da11p3.eli  ONLINE       0     0     0
            da9p3.eli   ONLINE       0     0     0
            da2         ONLINE       0     0     0
            da1p3.eli   ONLINE       0     0     0
            da0p3.eli   ONLINE       0     0     0
            da7p3.eli   ONLINE       0     0     0
            da6p3.eli   ONLINE       0     0     0
        logs
          mirror-1      ONLINE       0     0     0
            ada5        ONLINE       0     0     0
            ada6        ONLINE       0     0     0

errors: No known data errors


less than 0.5% in, a drive is ejected
Code:
root@abyss:~ # zpool status
  pool: abyss
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub in progress since Mon Jun 13 13:37:34 2022
        7.17T scanned at 31.4G/s, 171G issued at 747M/s, 39.9T total
        0B repaired, 0.42% done, 15:29:47 to go
config:

        NAME            STATE     READ WRITE CKSUM
        abyss           DEGRADED     0     0     0
          raidz3-0      DEGRADED     0     0     0
            da3p3.eli   ONLINE       0     0     0
            da8p3.eli   ONLINE       0     0     0
            da5p3.eli   ONLINE       0     0     0
            da4p3.eli   ONLINE       0     0     0
            da10        ONLINE       0     0     0
            da11p3.eli  ONLINE       0     0     0
            da9p3.eli   FAULTED    425   463     0  too many errors
            da2         ONLINE       0     0     0
            da1p3.eli   ONLINE       0     0     0
            da0p3.eli   ONLINE       0     0     0
            da7p3.eli   ONLINE       0     0     0
            da6p3.eli   ONLINE       0     0     0
        logs
          mirror-1      ONLINE       0     0     0
            ada5        ONLINE       0     0     0
            ada6        ONLINE       0     0     0

a 2nd drive is ejected:
Code:
root@abyss:~ # zpool status
  pool: abyss
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub in progress since Mon Jun 13 13:37:34 2022
        7.43T scanned at 28.5G/s, 201G issued at 772M/s, 39.9T total
        0B repaired, 0.49% done, 14:58:19 to go
config:

        NAME            STATE     READ WRITE CKSUM
        abyss           DEGRADED     0     0     0
          raidz3-0      DEGRADED     0     0     0
            da3p3.eli   ONLINE       0     0     0
            da8p3.eli   FAULTED    424   480     0  too many errors
            da5p3.eli   ONLINE       0     0     0
            da4p3.eli   ONLINE       0     0     0
            da10        ONLINE       0     0     0
            da11p3.eli  ONLINE       0     0     0
            da9p3.eli   FAULTED    425   463     0  too many errors
            da2         ONLINE       0     0     0
            da1p3.eli   ONLINE       0     0     0
            da0p3.eli   ONLINE       0     0     0
            da7p3.eli   ONLINE       0     0     0
            da6p3.eli   ONLINE       0     0     0
        logs
          mirror-1      ONLINE       0     0     0
            ada5        ONLINE       0     0     0
            ada6        ONLINE       0     0     0

at 5% in, another drive is ejected:

Code:
root@abyss:~ # zpool status
  pool: abyss
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub in progress since Mon Jun 13 13:37:34 2022
        9.09T scanned at 7.27G/s, 1.87T issued at 1.50G/s, 39.9T total
        0B repaired, 4.69% done, 07:13:15 to go
config:

        NAME            STATE     READ WRITE CKSUM
        abyss           DEGRADED     0     0     0
          raidz3-0      DEGRADED     0     0     0
            da3p3.eli   ONLINE       0     0     0
            da8p3.eli   FAULTED    424   480     0  too many errors
            da5p3.eli   ONLINE       0     0     0
            da4p3.eli   ONLINE       0     0     0
            da10        FAULTED    583   680     0  too many errors
            da11p3.eli  ONLINE       0     0     0
            da9p3.eli   FAULTED    425   463     0  too many errors
            da2         ONLINE       0     0     0
            da1p3.eli   ONLINE       0     0     0
            da0p3.eli   ONLINE       0     0     0
            da7p3.eli   ONLINE       0     0     0
            da6p3.eli   ONLINE       0     0     0
        logs
          mirror-1      ONLINE       0     0     0
            ada5        ONLINE       0     0     0
            ada6        ONLINE       0     0     0

errors: No known data errors

Code:
zpool scrub -s abyss
zpool clear abyss

triggers a resilver

Code:
root@abyss:~ # zpool status
  pool: abyss
 state: ONLINE
  scan: resilvered 9.44G in 00:12:50 with 0 errors on Mon Jun 13 14:12:35 2022
config:

        NAME            STATE     READ WRITE CKSUM
        abyss           ONLINE       0     0     0
          raidz3-0      ONLINE       0     0     0
            da3p3.eli   ONLINE       0     0     0
            da8p3.eli   ONLINE       0     0     0
            da5p3.eli   ONLINE       0     0     0
            da4p3.eli   ONLINE       0     0     0
            da10        ONLINE       0     0     0
            da11p3.eli  ONLINE       0     0     0
            da9p3.eli   ONLINE       0     0     0
            da2         ONLINE       0     0     0
            da1p3.eli   ONLINE       0     0     0
            da0p3.eli   ONLINE       0     0     0
            da7p3.eli   ONLINE       0     0     0
            da6p3.eli   ONLINE       0     0     0
        logs
          mirror-1      ONLINE       0     0     0
            ada5        ONLINE       0     0     0
            ada6        ONLINE       0     0     0

errors: No known data errors

the resilver completes with no problems and no errors, and the pool is still online

/shrug?!
 
that makes more sense..

in this case, will a geli setkey da2 with the passphrase do it? or does the drive need to be taken out of the pool and rebuilt?
or what's the best way to correct this..

gpart show
Code:
=>         40  15628053088  da11  GPT  (7.3T)
           40         1024     1  freebsd-boot  (512K)
         1064          984        - free -  (492K)
         2048      8388608     2  freebsd-swap  (4.0G)
      8390656  15619661824     3  freebsd-zfs  (7.3T)
  15628052480          648        - free -  (324K)

as you mentioned, there is no gpart information for da2/da10; however, the other 10 disks are identical.

is there a way to safely copy the layout from one of the other drives?
does this drive have to be offline first?
once done, should this be fixed with just a geli setkey?

thanks
 
or does the drive need to be taken out of the pool and rebuilt?
It has to be taken out and rebuilt with the *.eli device. All your other drives appear to have partitions on them too (they're all on *p3); you should probably fix that as well. Judging by the gpart output, the other two partitions are freebsd-boot and freebsd-swap.
 
just for reference..
to correct da2:

Code:
zpool offline abyss da2
gpart backup da1 | gpart restore da2
geli init -l 256 -s 4096 da2          (enter the passphrase from the pool)
geli attach da2                       (enter the passphrase again)
geli status                           (the new device should be in the list)
zpool replace abyss da2 /dev/da2.eli

this will trigger a resilver and attach the device
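For what it's worth, the other members of the pool all sit on a p3 partition (daXp3.eli), so a variant of the above that targets the partition instead of the raw disk would match their layout. This is only a sketch, not a tested procedure; the DRY_RUN wrapper is an addition so that by default it just prints the commands instead of running them:

```shell
# Sketch (untested): rebuild da2 so it matches the other members, which are
# all on a p3 partition (daXp3.eli). DRY_RUN=1 (the default) only prints.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run zpool offline abyss da2
# copy the GPT from a healthy member; -F forces restore over stale metadata
run sh -c 'gpart backup da1 | gpart restore -F da2'
# GELI on the partition, not the raw disk; both commands prompt for the passphrase
run geli init -l 256 -s 4096 /dev/da2p3
run geli attach /dev/da2p3
run geli status
run zpool replace abyss da2 /dev/da2p3.eli
```

Set DRY_RUN=0 only after checking the printed commands against your own device names.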
 
Do you have spare HBA controller for a test?
yes; however, other than a loose connector, changing them did not make any difference.

gpart backup da1 | gpart restore da2

this apparently did not work as expected..

gpart show
does not show any sign of a partition table, or any other layout, on da2:
Code:
=>         40  15628053088  da1  GPT  (7.3T)
           40         1024    1  freebsd-boot  (512K)
         1064          984       - free -  (492K)
         2048      8388608    2  freebsd-swap  (4.0G)
      8390656  15619661824    3  freebsd-zfs  (7.3T)
  15628052480          648       - free -  (324K)

=>         40  15628053088  da3  GPT  (7.3T)
           40         1024    1  freebsd-boot  (512K)
         1064          984       - free -  (492K)
         2048      8388608    2  freebsd-swap  (4.0G)
      8390656  15619661824    3  freebsd-zfs  (7.3T)
  15628052480          648       - free -  (324K)


yet the resilver process shows a da2.eli

Code:
            replacing-7  DEGRADED     0     0     0
              da2        OFFLINE      0     0     0
              da2.eli    ONLINE       0     0     0  (resilvering)

so it seems to indicate it's using the entire disk and not da2p3.eli

gpart status

cut ..
Code:
 da1p1      OK  da1
 da1p2      OK  da1
 da1p3      OK  da1
 da3p1      OK  da3
 da3p2      OK  da3
 da3p3      OK  da3
cut ..

there do not appear to be any da2 entries ..

yet

sysctl kern.disks
Code:
da11 da10 da9 da8 da7 da6 da5 da4 da3 da2 da1 da0

returns all 12 disks.
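One way to cross-check those two views is a small loop that asks gpart about each disk the kernel reports. A sketch (check_gpt is a made-up helper; on the live box it would be fed from sysctl -n kern.disks):

```shell
# Sketch: report which disks carry a readable partition table. The disk list
# is passed as arguments so the check is easy to exercise; on the server it
# would come from `sysctl -n kern.disks`.
check_gpt() {
    for d in "$@"; do
        if gpart show "$d" >/dev/null 2>&1; then
            echo "$d: partition table present"
        else
            echo "$d: NO partition table"
        fi
    done
}

# on the server: check_gpt $(sysctl -n kern.disks)
```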
 
If my disks were playing up, the first two things I would do would be:
  1. Look in /var/log/messages for error messages. These may be overwhelming, and may be indecipherable. But they will identify which spindles have problems, and bringing the messages to this forum may help diagnose the situation.
  2. Run a SMART report on each spindle with smartctl -a /dev/daX. Depending on the vendor, these may not be easy to interpret, but you should be able to spot outliers, and publish them for comment.
I realise that you have replaced spindles, and this advice could have been more timely, but you still have to replace more spindles to get the encryption back in order. So recording and examining a smart report on each spindle is still a very sound approach.

Once you have done that and recorded an error-message summary plus SMART report against each disk serial number, it will be possible to see whether problems follow spindles or controller ports (provided you number your cables and ports reliably).
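A loop along these lines could collect those reports in one pass (a sketch assuming smartmontools is installed; the grepped attribute names are the common vendor ones and may differ on your drives):

```shell
# Sketch: pull the serial number plus a few key SMART attributes for each of
# the 12 da devices. SMART_CMD is swappable so the loop can be exercised
# without real hardware.
SMART_CMD=${SMART_CMD:-smartctl}
report() {
    for n in 0 1 2 3 4 5 6 7 8 9 10 11; do
        echo "== /dev/da$n =="
        "$SMART_CMD" -a "/dev/da$n" 2>/dev/null |
            grep -E 'Serial Number|Reallocated_Sector|Current_Pending|UDMA_CRC' \
            || echo "(no smartctl output)"
    done
}
report
```

Saving each full `smartctl -a` output to a per-serial file makes it easier to compare runs later.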
 
after powering down the machine, I reseated both HBA cards because all the errors appeared to be on 4 drives connected to the same SAS card.. I changed the data cables on those 4 drives and reseated the power connectors.
Is this commonality still there? Are/were these 4 drives on the same SAS card port, or on different ports of the same card?

You need to differentiate between several possible failure points:
1. Bad multiple drives
2. Bad power line common to several drives
3. Bad fanout cable (1x SAS -> 4x SATA), optionally with bad replacement.
4. Bad single controller port
5. Bad controller (or overheating controller)
6. Bad mainboard slot with the controller
7. Some kind of memory or CPU issue, possibly involving overheating, causing checksums to be calculated incorrectly when under heavy load (which is rare but possible).

With 4x disks malfunctioning on the same fanout cable, I would say one of 1 through 4 would be the likely cause, but I am not sure from your description if the failure is still confined to a single SAS port on a single controller.
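For point 1, a quick tally of which da devices show up in the error log can narrow this down. A sketch, with hypothetical sample lines standing in for /var/log/messages (FreeBSD CAM errors are prefixed with the device, e.g. "(da9:mps0:0:9:0):"):

```shell
# Sketch: count error lines per da device. The sample lines below are
# hypothetical; on the live system the input would be /var/log/messages.
sample='(da9:mps0:0:9:0): READ(10). CDB: 28 00 ...
(da9:mps0:0:9:0): CAM status: SCSI Status Error
(da8:mps0:0:8:0): WRITE(10). CDB: 2a 00 ...'

# tally device prefixes, highest count first
printf '%s\n' "$sample" | grep -oE '\(da[0-9]+:' | sort | uniq -c | sort -rn
# on the live system:
# grep -oE '\(da[0-9]+:' /var/log/messages | sort | uniq -c | sort -rn
```

If the noisy devices map to one fanout cable or one controller port, that points at causes 3 or 4 rather than the drives themselves.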
 