ZFS Changing a bad disk in a geom_eli encrypted ZFS pool? How-to?

Hi all,

I have a machine (FreeBSD 13.5-STABLE) that has been running for several years now with 4x NetApp 900 GB disks that are geom_eli encrypted.
...and now one of the disks is doing strange things. The disk ticks for a while and the machine more or less halts:
Code:
Jan  2 22:41:31 trollo kernel: (da1:mpt1:0:1:0): CAM status: SCSI Status Error
Jan  2 22:41:31 trollo kernel: (da1:mpt1:0:1:0): SCSI status: Check Condition
Jan  2 22:41:31 trollo kernel: (da1:mpt1:0:1:0): SCSI sense: ABORTED COMMAND asc:2f,10 (Reserved ASC/ASCQ pair)
Jan  2 22:41:31 trollo kernel: (da1:mpt1:0:1:0): Retrying command (per sense data)
Jan  2 22:43:31 trollo kernel: (da1:mpt1:0:1:0): WRITE(10). CDB: 2a 00 22 ee 8d c8 00 00 60 00
Jan  2 22:43:31 trollo kernel: (da1:mpt1:0:1:0): CAM status: SCSI Status Error
Jan  2 22:43:31 trollo kernel: (da1:mpt1:0:1:0): SCSI status: Check Condition
Jan  2 22:43:31 trollo kernel: (da1:mpt1:0:1:0): SCSI sense: ABORTED COMMAND asc:2f,10 (Reserved ASC/ASCQ pair)
Jan  2 22:43:31 trollo kernel: (da1:mpt1:0:1:0): Retrying command (per sense data)

..after a while it works again.
I had this once before, a year ago.. but after that it worked flawlessly for a year...

smartctl says: SMART Health Status: OK
..but there are many delayed errors:

Code:
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    43300        61         0          0    2311534.093          61
write:         0      182         0         0          0     302668.868           0
verify:        0        0         0         0          0      65984.039           0

Non-medium error count:    20455
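For reference, the error counter log above comes from something like this (device name assumed):
Code:
# smartctl -a /dev/da1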

This smells strange.. or rather, it stinks..

I have 5 NETAPP 900 GB disks lying around here as spares.
At the moment I'm reformatting one of those to a 512-byte sector size (they are 520 bytes as shipped).

Code:
<ST1000DM010-2EP102 CC43>          at scbus2 target 0 lun 0 (ada0,pass0)
<WDC WD10EZEX-00RKKA0 80.00A80>    at scbus3 target 0 lun 0 (ada1,pass1)
<NETAPP X423_TAL13900A10 NA01>     at scbus5 target 0 lun 0 (da0,pass2)
<NETAPP X423_TAL13900A10 NA01>     at scbus5 target 1 lun 0 (da1,pass3)
<NETAPP X423_TAL13900A10 NA01>     at scbus5 target 2 lun 0 (da2,pass4)
<NETAPP X423_TAL13900A10 NA01>     at scbus5 target 3 lun 0 (da3,pass5)
<NETAPP X423_HCOBE900A10 NA00>     at scbus5 target 7 lun 0 (pass6,da4)

da4 is the one currently formatting; it's an HGST, unlike the others (I forget the exact model).
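In case it helps: reformatting these 520-byte-sector SAS disks to 512 bytes can be done with sg_format from sysutils/sg3_utils. A sketch only, assuming pass6 is da4's pass device (per the devlist above); the format runs for hours and destroys everything on the disk:
Code:
# pkg install sg3_utils
# sg_format --format --size=512 /dev/pass6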


The ZFS pool currently looks like this:
Code:
  pool: zrpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 1.79M in 00:00:02 with 0 errors on Sat Jan  3 18:17:56 2026
config:

        NAME           STATE     READ WRITE CKSUM
        zrpool         ONLINE       0     0     0
          raidz1-0     ONLINE       0     0     0
            da1p3.eli  ONLINE     691 31.8K     1
            da0p3.eli  ONLINE       0     0     0
            da2p3.eli  ONLINE       0     0     0
            da3p3.eli  ONLINE       0     0     0

errors: No known data errors


I want to shut down the machine tomorrow and swap da1 for the freshly formatted drive,

..but what to do next.. exactly?
Yes, I have backups of my user data on tapes.. but I don't really want to do a fresh install.

Can someone please give me some hints for doing this w/o the glitches that I usually get
when I'm trying something like this on my own?
This is all more or less cryptic to me...

Many thanks in advance and a happy new year to all,
Holm
 
Forgot the GPT table:
Code:
Geom name: da1
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 1758174727
first: 40
entries: 128
scheme: GPT
Providers:
1. Name: da1p1
   Mediasize: 524288 (512K)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 20480
   Mode: r0w0e0
   efimedia: HD(1,GPT,fb770aa4-5100-11ec-93a4-00d861a1c3ea,0x28,0x400)
   rawuuid: fb770aa4-5100-11ec-93a4-00d861a1c3ea
   rawtype: 83bd6b9d-7f41-11dc-be0b-001560b84f0f
   label: gptboot0
   length: 524288
   offset: 20480
   type: freebsd-boot
   index: 1
   end: 1063
   start: 40
2. Name: da1p2
   Mediasize: 2147483648 (2.0G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 1048576
   Mode: r1w1e1
   efimedia: HD(2,GPT,fbe3eb3a-5100-11ec-93a4-00d861a1c3ea,0x800,0x400000)
   rawuuid: fbe3eb3a-5100-11ec-93a4-00d861a1c3ea
   rawtype: 516e7cb5-6ecf-11d6-8ff8-00022d09712b
   label: swap0
   length: 2147483648
   offset: 1048576
   type: freebsd-swap
   index: 2
   end: 4196351
   start: 2048
3. Name: da1p3
   Mediasize: 898036137984 (836G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 2148532224
   Mode: r1w1e1
   efimedia: HD(3,GPT,fc4a3d12-5100-11ec-93a4-00d861a1c3ea,0x400800,0x688b9000)
   rawuuid: fc4a3d12-5100-11ec-93a4-00d861a1c3ea
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: zfs0
   length: 898036137984
   offset: 2148532224
   type: freebsd-zfs
   index: 3
   end: 1758173183
   start: 4196352
Consumers:
1. Name: da1
   Mediasize: 900185481216 (838G)
   Sectorsize: 512
   Mode: r2w2e4

Regards,
Holm
 
Based on the zpool status command, it looks like you created the zpool on top of the eli devices. Is that correct?
Do you recall how you initially created those? Basically, did you create each eli device individually and then the zpool on top of them all, or did you make a geom of all the devices, encrypt that, and then pull it into the zpool?
zpool history may give additional information.

A complete guess on my part: if you have an extra device and the system can physically handle it, I think that if you duplicated the partition table and then set up geom_eli on a partition, you may be able to "zpool replace" the failing device with the new one.

Keep in mind, this is all speculation on my part, I'm making a lot of assumptions, but maybe my questions trigger "the right data" being exposed so others can help.
 
yeah, mer has it right. you can use gpart backup to save the partition table into a file, and then gpart restore it onto the new device. as far as geli goes, idk anything about that, but once you set up geli on the new device, it'll give you (say) da4p3.eli and you'll want to zpool replace pool da1p3.eli da4p3.eli. if the replacement ends up with the same disk name you can just say zpool replace pool da1p3.eli
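A minimal sketch of that partition-table copy, assuming da0 is a healthy pool member and da4 is the new disk (-F overwrites any existing scheme on the target):
Code:
# gpart backup da0 > /root/da0.gpt
# gpart restore -F da4 < /root/da0.gpt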
 
mer:
As far as I remember, I followed a description on a website for how to do this. And yes, I partitioned every single drive the same way: 512K boot, 2G swap and 838G data. Then I configured eli in some way and put the ZFS pool on top of that.
That's why the data partitions are da?p3 or da?p3.eli.

I'll try to find the website again (I need some sort of google for my bookmarks) to find out what exactly I've done.

atax1a:
There is no room for another drive in the case; currently it is lying on the floor, 81.99% done formatting.

I'll try this tomorrow; it's 10:50 PM now.. far too late for complicated things..

the zpool history:
Code:
History for 'zrpool':
2021-11-29.11:43:01 zpool create -o altroot=/mnt -O compress=lz4 -O atime=off -m none -f zrpool raidz1 da0p3.eli da1p3.eli da2p3.eli da3p3.eli
2021-11-29.11:43:01 zfs create -o mountpoint=none zrpool/ROOT
2021-11-29.11:43:01 zfs create -o mountpoint=/ zrpool/ROOT/default
2021-11-29.11:43:02 zfs create -o mountpoint=/tmp -o exec=on -o setuid=off zrpool/tmp
2021-11-29.11:43:03 zfs create -o mountpoint=/usr -o canmount=off zrpool/usr
2021-11-29.11:43:03 zfs create zrpool/usr/home
2021-11-29.11:43:03 zfs create -o setuid=off zrpool/usr/ports
2021-11-29.11:43:03 zfs create zrpool/usr/src
2021-11-29.11:43:04 zfs create -o mountpoint=/var -o canmount=off zrpool/var
2021-11-29.11:43:04 zfs create -o exec=off -o setuid=off zrpool/var/audit
2021-11-29.11:43:05 zfs create -o exec=off -o setuid=off zrpool/var/crash
2021-11-29.11:43:05 zfs create -o exec=off -o setuid=off zrpool/var/log
2021-11-29.11:43:05 zfs create -o atime=on zrpool/var/mail
2021-11-29.11:43:06 zfs create -o setuid=off zrpool/var/tmp
2021-11-29.11:43:06 zfs set mountpoint=/zrpool zrpool
2021-11-29.11:43:06 zpool set bootfs=zrpool/ROOT/default zrpool
2021-11-29.11:43:07 zpool set cachefile=/mnt/boot/zfs/zpool.cache zrpool
2021-11-29.11:43:12 zfs set canmount=noauto zrpool/ROOT/default
2021-11-29.11:37:14 zfs set compression=on zrpool/usr/ports
2021-11-29.11:41:23 zfs set compression=lz4 zrpool/usr/ports
2021-12-02.22:11:23 zfs snapshot -r zrpool@now
2021-12-02.23:02:58 zfs destroy zrpool@now
2021-12-02.23:03:23 zfs destroy zrpool/ROOT@now
2021-12-02.23:04:36 zfs destroy zrpool/ROOT/default@now
2021-12-02.23:04:44 zfs destroy zrpool/tmp@now
2021-12-02.23:04:51 zfs destroy zrpool/usr@now
2021-12-02.23:05:10 zfs destroy zrpool/usr/home@now
2021-12-02.23:05:16 zfs destroy zrpool/usr/ports@now
2021-12-02.23:05:22 zfs destroy zrpool/usr/src@now
2021-12-02.23:05:29 zfs destroy zrpool/var@now
2021-12-02.23:05:39 zfs destroy zrpool/var/audit@now
2021-12-02.23:05:44 zfs destroy zrpool/var/crash@now
2021-12-02.23:05:51 zfs destroy zrpool/var/log@now
2021-12-02.23:05:58 zfs destroy zrpool/var/mail@now
2021-12-02.23:06:09 zfs destroy zrpool/var/tmp@now
2021-12-02.23:08:26 zfs snapshot -r zrpool/ROOT/default@first
2021-12-20.14:31:22 zpool scrub zrpool
2022-04-01.19:37:25 zpool scrub zrpool
2025-02-07.20:20:58 zpool clear zrpool
2025-02-14.15:32:34 zpool resilver zrpool
2025-03-06.09:12:03 zpool scrub zrpool
2025-03-06.11:23:18 zpool scrub zrpool
2025-04-24.22:42:45 zpool clear zrpool

Regards,
Holm
 
Creating the partitions is easy; geom_eli is more complicated.
It seems that I initialized all of the eli partitions at once. The geli(8) manual says:
"init Initialize providers which need to be encrypted. If
multiple providers are listed as arguments, they will all
be initialized with the same passphrase and/or User Key.
A unique salt will be randomly generated for each provider
to ensure the Master Key for each is unique. Here you can
set up the cryptographic algorithm to use, Data Key
length, etc. The last sector of the providers is used to
store metadata. The init subcommand also automatically
writes metadata backups to /var/backups/<prov>.eli file.
The metadata can be recovered with the restore subcommand
described below."
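A side note on those metadata backups: in principle one can be written onto an identically sized replacement partition with the restore subcommand. A sketch only, assuming the backup file exists; this restores only the encryption metadata, not the data:
Code:
# geli restore /var/backups/da1p3.eli da1p3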

..so all providers used the same key without a key file, prompting at boot for the passphrase.
That came back to bite me later when I wanted to add two more disks using the same key. I had to
add kern.geom.eli.passphrase="passphrase" to /boot/loader.conf to get the additional 2 disks
(which use the same passphrase as all the others) mounted at all, so the passphrase is readable
text in the file system after boot - not that nice.

Since I'm now doing pretty much the same thing, I suspect the new disk may not get decrypted at boot?

What exactly happens if I simply dd the old disk onto the new one and swap them later?

Regards,
Holm
 
I'll try to find the website again (I need some sort of google for my bookmarks) to find out what exactly I've done.
You don't need that website; all the information needed to create the geli provider on the spare disk is on the system.

Do not blindly disconnect the defective device from the pool; follow these steps to replace the disk:
  1. zpool offline zrpool da1p3.eli
  2. power down system, replace disk (make sure you are grounded to avoid electrostatic discharge), power up system
  3. partition the new disk
  4. execute geli list
  5. initialize the new provider with geli(8), using the information geli list reports for the other disks. The highlighted information in the following example will help to set the configuration options; those options may differ on your system:
Rich (BB code):
Geom name: nda0p3.eli
State: ACTIVE
EncryptionAlgorithm: AES-XTS
KeyLength: 256  ...............................[1]
Crypto: accelerated software
Version: 7
UsedKey: 0
Flags: BOOT, GELIBOOT, AUTORESIZE .............[2]
KeysAllocated: 25
KeysTotal: 25
Providers:
1. Name: nda0p3.eli
   Mediasize: 107374178304 (100G)
   Sectorsize: 4096 ...........................[3]
   Mode: r1w1e1
Consumers:
1. Name: nda0p3
   Mediasize: 107374182400 (100G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 2358247424
   Mode: r1w1e1
Use the same passphrase that was used to initialize the other disks. Assuming da1p3 is the new provider:
Rich (BB code):
geli init -g -b [2]   -l 256  [1]  -s 4096  [3]   da1p3

Attach provider:
Code:
# geli attach da1p3

Replace provider in pool:
Code:
# zpool replace zrpool da1p3.eli
The pool will start to resilver the new device.
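Resilver progress can be watched with the status command already used in this thread:
Code:
# zpool status zrpool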

Check with swapinfo(8) whether all swap devices are attached, in particular the one on the new disk.
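A quick sketch, assuming the swap devices are listed in /etc/fstab:
Code:
# swapinfo -h
# swapon -a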

Finally, copy the bootcode to the "freebsd-boot" partition of the new disk:
Code:
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da1

I wanted to add two more disks using the same key. I had to
add kern.geom.eli.passphrase="passphrase" to /boot/loader.conf to get the additional 2 disks
(which use the same passphrase as all the others) mounted at all, so the passphrase is readable
text in the file system after boot
Check the "Flags" of those providers; they need the "GELIBOOT" (-g) option set. If it is missing, enable booting from the encrypted root filesystem on those disks: geli configure -g daXp3

Call geli list again, check all the root filesystem providers for their "GELIBOOT" flags, then remove the clear text passphrase from /boot/loader.conf.
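A minimal sketch of that check, using provider names from this thread:
Code:
# geli list da0p3.eli | grep Flags
# geli configure -g da0p3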
 
Yes, many thanks from me too.. I've followed this nice description up to just before 'zpool replace zrpool da1p3.eli', because of the problem that the machine only boots if I connect the old disk too (currently da4); otherwise the loader couldn't find /boot/zfsloader (ZFS i/o errors and such). I have to investigate what's going on here..
Possibly the cause is that da1p1 has no label:

Code:
1. Name: da1p1
Mediasize: 524288 (512K)
Sectorsize: 512
Stripesize: 0
Stripeoffset: 20480
Mode: r0w0e0
efimedia: HD(1,GPT,1882b0cb-e8f2-11f0-9ae7-00d861a1c3ea,0x28,0x400)
rawuuid: 1882b0cb-e8f2-11f0-9ae7-00d861a1c3ea
rawtype: 83bd6b9d-7f41-11dc-be0b-001560b84f0f
label: (null)
length: 524288
offset: 20480
type: freebsd-boot
index: 1
end: 1063
start: 40

..but the old disk has:

Code:
1. Name: da4p1
Mediasize: 524288 (512K)
Sectorsize: 512
Stripesize: 0
Stripeoffset: 20480
Mode: r0w0e0
efimedia: HD(1,GPT,fd3b074d-5100-11ec-93a4-00d861a1c3ea,0x28,0x400)
rawuuid: fd3b074d-5100-11ec-93a4-00d861a1c3ea
rawtype: 83bd6b9d-7f41-11dc-be0b-001560b84f0f
label: gptboot1
length: 524288
offset: 20480
type: freebsd-boot
index: 1
end: 1063
start: 40
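If the missing label turns out to matter, it can be added after the fact with gpart modify; a sketch, assuming index 1 and a label following the old disks' naming:
Code:
# gpart modify -i 1 -l gptboot1 da1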

..but at the moment I'm fighting with MATE, Marco and some settings.. (at the very least I want my German keyboard layout back)
Marco is already running again (and no longer dumping core because of some missing mouse settings..)

I'll be back later..

Regards,
Holm
 
Ok.. so far so good.
Code:
 pool: zrpool
 state: ONLINE
  scan: resilvered 216G in 02:04:24 with 0 errors on Sun Jan  4 19:59:53 2026
config:

        NAME           STATE     READ WRITE CKSUM
        zrpool         ONLINE       0     0     0
          raidz1-0     ONLINE       0     0     0
            da1p3.eli  ONLINE       0     0     0
            da4p3.eli  ONLINE       0     0     0
            da2p3.eli  ONLINE       0     0     0
            da3p3.eli  ONLINE       0     0     0

errors: No known data errors

The new disk is online..
Now I'll try to reboot w/o the old disk.

I don't know what happened, maybe it was a port update that had some glitches.. but I'm fighting several problems now. First, the Marco window manager wasn't built correctly or something ... MATE started with twm as a fallback.
I've repaired that, but the Clearlooks theme can't be loaded; I'll fix it later. My keyboard is working with the German layout again; previously I could select the layout in the top panel (DE/US), now I have to press Alt+Ctrl and Caps Lock.. works for me.
The sound isn't working anymore.. and I don't know why. pcm4 should be the analog rear (green) output to the PC speakers...
I only had hw.snd.default_unit=4 in sysctl.conf.... this worked for 4 years.. until now.
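To check which pcm device is which, the sound status can be dumped; a quick sanity check (the unit value is the one from sysctl.conf above):
Code:
# cat /dev/sndstat
# sysctl hw.snd.default_unit=4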


Code:
hdaa1: Dumping AFG pins:
hdaa1: nid 0x as seq device conn jack loc color misc
hdaa1: 17 4037d540 4 0 CD None Analog 0x00 Res.D 5 DISA
hdaa1: Caps: OUT
hdaa1: 18 411111f0 15 0 Speaker None 1/8 Rear Black 1 DISA
hdaa1: Caps: IN
hdaa1: 20 01014010 1 0 Line-out Jack 1/8 Rear Green 0
hdaa1: Caps: IN OUT HP EAPD Sense: 0x80000000 (connected)
hdaa1: 21 01011012 1 2 Line-out Jack 1/8 Rear Black 0
hdaa1: Caps: IN OUT Sense: 0x00000000 (disconnected)
hdaa1: 22 01016011 1 1 Line-out Jack 1/8 Rear Orange 0
hdaa1: Caps: IN OUT Sense: 0x00000000 (disconnected)
hdaa1: 23 01012014 1 4 Line-out Jack 1/8 Rear Grey 0
hdaa1: Caps: IN OUT Sense: 0x00000000 (disconnected)
hdaa1: 24 01a19030 3 0 Mic Jack 1/8 Rear Pink 0
hdaa1: Caps: IN OUT VREF Sense: 0x80000000 (connected)
hdaa1: 25 02a19040 4 0 Mic Jack 1/8 Front Pink 0
hdaa1: Caps: IN OUT HP VREF Sense: 0x00000000 (disconnected)


..now I have some noise on the speakers.. sounds as if the lines are open circuit inside.
Now I've put this into /boot/device.hints.. let's see if it helps:
Code:
hint.hdaa.1.nid20.config="as=1 seq=0 device=speakers"
hint.hdaa.1.nid18.config="as=1 seq=15 device=Headphones"

I'm rebooting now.

Regards,
Holm
 
Ok.. here I am again..

The good news is that the system booted w/o the old disk. The bad news is that the disks got mixed up somehow
after the replace:
Code:
  pool: zrpool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-2Q
  scan: resilvered 216G in 02:04:24 with 0 errors on Sun Jan  4 19:59:53 2026
config:

        NAME           STATE     READ WRITE CKSUM
        zrpool         DEGRADED     0     0     0
          raidz1-0     DEGRADED     0     0     0
            da1p3.eli  ONLINE       0     0     0
            da4p3.eli  UNAVAIL      0     0     0  cannot open
            da2p3.eli  ONLINE       0     0     0
            da3p3.eli  ONLINE       0     0     0

errors: No known data errors

..yep, da0p3 is missing.

Should I try to do a zpool replace zrpool da4p3.eli da0p3.eli?

The passphrase thing for ada0p1 and ada1p1 hasn't worked.. probably because of a typo.. one moment, I'll try something..

Regards,
Holm
 
..good, that worked.
Code:
 pool: zrdata
 state: ONLINE
  scan: scrub repaired 0B in 01:38:47 with 0 errors on Thu Apr 10 21:17:05 2025
config:

    NAME          STATE     READ WRITE CKSUM
    zrdata        ONLINE       0     0     0
      ada0p1.eli  ONLINE       0     0     0
      ada1p1.eli  ONLINE       0     0     0

errors: No known data errors
The zrdata pool is online and the geli configure -g adaXp1 worked.

...how was it possible that da0p3.eli was dropped from the zrpool? The zpool replace displayed that it was replacing da1p3.eli (old) with da1p3.eli, so that looked fine.
The old da1 was connected as da4 at that time.. and is disconnected now. What happened to da0, and why?

Sound still isn't working. I'll start a new thread about this in the proper category...

Regards,
Holm
 