Solved zpool HEALTH fault

byrnejb · May 17, 2023

I am preparing to replace a failing disk in a RAIDZ2 array. I received a heads up from our weekly HDD status report:

Code:

vhost03.hamilton.harte-lyne.ca - ZFS pool - HEALTH fault

NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bootpool  1.98G   234M  1.76G        -         -    17%    11%  1.00x    ONLINE  -
zroot     10.6T  2.81T  7.81T        -         -    54%    26%  1.00x    ONLINE  -

  pool: bootpool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:24 with 0 errors on Thu May  4 13:47:20 2023
config:

        NAME        STATE     READ WRITE CKSUM
        bootpool    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada2p2  ONLINE       0     0     0
            ada3p2  ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 4.20M in 15:07:09 with 0 errors on Fri May  5 04:54:05 2023
config:

        NAME            STATE     READ WRITE CKSUM
        zroot           ONLINE       0     0     0
          raidz2-0      ONLINE       0     0     0
            ada1p4.eli  ONLINE       0     0     0
            ada0p4.eli  ONLINE       0     0     0
            ada2p4.eli  ONLINE       0     0     0
            ada3p4.eli  ONLINE   1.06K     0     0

errors: No known data errors

I checked the SMART reports and discovered this:

Code:

[root@vhost03 ~ (master)]# smartctl -a /dev/ada3
smartctl 7.3 2022-02-28 r5338 [FreeBSD 13.1-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68AX9N0
Serial Number:    WD-WCC1T0371529
LU WWN Device Id: 5 0014ee 208200201
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue May 16 19:15:46 2023 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121)    The previous self-test completed having
                    the read element of the test failed.
Total time to complete Offline
data collection:         (39600) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 398) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x70bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1318
  3 Spin_Up_Time            0x0027   179   179   021    Pre-fail  Always       -       6008
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       61
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   049   049   000    Old_age   Always       -       37507
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       59
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       50
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       10
194 Temperature_Celsius     0x0022   124   110   000    Old_age   Always       -       26
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       35
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     37466         843538015
# 2  Extended offline    Completed: read failure       90%     37299         843538848
# 3  Extended offline    Completed: read failure       90%     37275         843538014
# 4  Extended offline    Completed: read failure       90%     37131         843538015
# 5  Extended offline    Completed: read failure       90%     36963         843538014
# 6  Extended offline    Completed: read failure       90%     36795         843538015
# 7  Extended offline    Completed: read failure       90%     36532         843538014
# 8  Extended offline    Completed: read failure       90%     36364         843538015
# 9  Extended offline    Completed: read failure       90%     36196         843538015
#10  Extended offline    Completed: read failure       90%     36028         843538848
#11  Extended offline    Completed: read failure       90%     35860         843538014
#12  Extended offline    Completed: read failure       90%     35692         843538015
#13  Extended offline    Completed: read failure       90%     35620         843538848
#14  Extended offline    Completed: read failure       90%     35524         843538848
#15  Extended offline    Completed: read failure       90%     35356         843538848
#16  Extended offline    Completed: read failure       90%     35189         843538015
#17  Extended offline    Completed: read failure       90%     35021         843538015
#18  Extended offline    Completed: read failure       90%     34901         843538848
#19  Extended offline    Completed: read failure       90%     34853         843538848
#20  Extended offline    Completed: read failure       90%     34685         843538015
#21  Extended offline    Completed: read failure       90%     34517         843538848

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

All the other drives show no errors. Based on this I inferred that the problem was with /dev/ada3. I therefore planned to replace this drive in the zpool this past weekend. However, I came down with a very severe viral infection and have just now returned to work. And I discover this in my mailbox:

Code:

vhost03.hamilton.harte-lyne.ca - ZFS pool - HEALTH fault

NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bootpool  1.98G   234M  1.76G        -         -    17%    11%  1.00x  DEGRADED  -
zroot     10.6T  2.82T  7.81T        -         -    54%    26%  1.00x  DEGRADED  -

  pool: bootpool
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0B in 00:00:24 with 0 errors on Thu May  4 13:47:20 2023
config:

        NAME        STATE     READ WRITE CKSUM
        bootpool    DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            ada1p2  ONLINE       0     0     0
            ada0p2  REMOVED      0     0     0
            ada2p2  ONLINE       0     0     0
            ada3p2  ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 4.61M in 14:49:46 with 0 errors on Wed May 10 02:17:01 2023
config:

        NAME            STATE     READ WRITE CKSUM
        zroot           DEGRADED     0     0     0
          raidz2-0      DEGRADED     0     0     0
            ada1p4.eli  ONLINE       0     0     0
            ada0p4.eli  REMOVED      0     0     0
            ada2p4.eli  ONLINE       0     0     0
            ada3p4.eli  ONLINE   1.16K     0     0

errors: No known data errors

The pool is degraded but the device removed is not /dev/ada3 but /dev/ada0. However, smartctl -a /dev/ada0 says this:

Code:

[root@vhost03 ~ (master)]# smartctl -a /dev/ada0
smartctl 7.3 2022-02-28 r5338 [FreeBSD 13.1-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68AX9N0
Serial Number:    WD-WMC1T3451852
LU WWN Device Id: 5 0014ee 6adfa9e37
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue May 16 19:11:57 2023 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (40320) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 404) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x70bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   190   184   021    Pre-fail  Always       -       5491
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       61
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   049   049   000    Old_age   Always       -       37507
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       58
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       50
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       10
194 Temperature_Celsius     0x0022   122   111   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     37474         -
# 2  Extended offline    Completed without error       00%     37306         -
# 3  Extended offline    Completed without error       00%     37282         -
# 4  Extended offline    Completed without error       00%     37138         -
# 5  Extended offline    Completed without error       00%     36970         -
# 6  Extended offline    Completed without error       00%     36802         -
# 7  Extended offline    Completed without error       00%     36539         -
# 8  Extended offline    Completed without error       00%     36372         -
# 9  Extended offline    Completed without error       00%     36203         -
#10  Extended offline    Completed without error       00%     36035         -
#11  Extended offline    Completed without error       00%     35867         -
#12  Extended offline    Completed without error       00%     35699         -
#13  Extended offline    Completed without error       00%     35627         -
#14  Extended offline    Completed without error       00%     35532         -
#15  Extended offline    Completed without error       00%     35364         -
#16  Extended offline    Completed without error       00%     35196         -
#17  Extended offline    Completed without error       00%     35028         -
#18  Extended offline    Completed without error       00%     34908         -
#19  Extended offline    Completed without error       00%     34860         -
#20  Extended offline    Completed without error       00%     34692         -
#21  Extended offline    Completed without error       00%     34525         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I am very confused about this. The drive that I believed to be the issue due to the repeated read errors is still online. A drive that has no reported errors has been removed from the pool while the system was running. I have checked the unit in its enclosure and the drive caddy is in place. I do not know how it was removed but it is in place at the moment. I am rerunning a scrub to see if this status persists.

The original scrub messages do not indicate which drive is producing the errors. If I had been able I would have replaced ada3, which might not have prevented the pool degradation depending upon what actually is causing it.

How am I supposed to discover which drive is giving the error that the scrub reports before it fails?
Why does the scrub report not provide that data?
What is the scrub discovering that the smart tests are not?

Eric A. Borisch · May 17, 2023

I’m not sure what to make of ada0; it looks almost like it was accidentally pulled and then reseated.

As for ada3, both scrub and smart are telling you it has issues:

Scrub with the non-zero value here:

ada3p4.eli ONLINE 1.06K 0 0

And SMART with both:

197 Current_Pending_Sector 0x0032 200

And:

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 37466 843538015

And perhaps others; SMART error reporting is famously vendor dependent / bordering on useless (depending on the vendor). (I find the temperature to be the most useful, myself, although any number of pending or “offline uncorrectable” sectors is also typically a strong indicator of an issue.)
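As a rough sketch of that check, the filter below reads smartctl attribute output on standard input and prints attributes 197 and 198 only when their raw value is non-zero. The function name smart_pending is made up for illustration, and the column positions are assumed from the attribute table posted earlier in this thread; other drives or smartctl versions may lay the table out differently.

```shell
# Hypothetical helper (not part of smartmontools): reads the SMART attribute
# table on stdin and prints "NAME RAW_VALUE" for attributes 197 and 198 when
# the raw value (assumed to be column 10) is greater than zero.
smart_pending() {
    awk '($1 == "197" || $1 == "198") && $10 + 0 > 0 { print $2, $10 }'
}

# Possible use on the host (needs root and a real disk):
#   smartctl -A /dev/ada3 | smart_pending
```

With the ada3 table above this would print the Current_Pending_Sector line and nothing for ada0.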

Also note that SMART can only (well, primarily) report about on-device issues. A device being physically removed or powered down isn’t necessarily an error, as far as the device is concerned. The host OS may have a very different opinion about that.

You’ll likely also find messages about both drives in /var/log/messages. Try grep ada /var/log/messages.
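To pick out the non-zero counters from a long status report, something like the following may help. It is only a sketch: the function name zpool_error_rows is hypothetical, and it assumes the NAME / STATE / READ / WRITE / CKSUM column order shown in the zpool status output above.

```shell
# Hypothetical helper: reads `zpool status` output on stdin and prints
# "NAME READ WRITE CKSUM" for every config row whose counters are not all 0.
# Rows are recognised by a vdev state keyword in the second column.
zpool_error_rows() {
    awk 'NF >= 5 && $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|REMOVED|UNAVAIL)$/ && ($3 != "0" || $4 != "0" || $5 != "0") { print $1, $3, $4, $5 }'
}

# Possible use on the host:
#   zpool status | zpool_error_rows
```

Note that this only catches devices ZFS has counted errors against; a drive that was pulled and reseated, like ada0 here, shows up by its state rather than by its counters.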

Eric A. Borisch · May 17, 2023

As for what to do, you can try bringing the ada0 vdevs back online with zpool-online(8). Make sure that works and let the resilver finish (has to write out the transactions that drive has missed) before moving on.

You can then offline the ada3-backed vdevs, and watch for which device no longer has blinky lights, or if you have the hardware that supports it, sesutil locate … is pretty slick. (But I’ll be honest, I have hardware that supports it, and I’ll still usually just make sure there is some forced IO and then see which drive is idle; works well as long as you don’t have spares in the mix that are idle but good.) If you really can’t tell, and don’t have faith in any physical mapping, note the serial number from smartctl, shut down, and look at the drives to pull the bad one.

Replace the physical drive, and then you’ll want zpool-replace(8).
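The order of operations above could be written down as a dry run before touching anything. The sketch below only prints commands and executes none of them; the function name print_recovery_plan is hypothetical, and the pool and vdev names are the ones from this thread. It deliberately leaves out the partitioning and GELI setup of the new disk, since those depend on how the pool was originally built.

```shell
# Dry-run sketch: prints the sequence described above, runs nothing.
# Arguments: pool name, vdev to bring back online, vdev to retire.
print_recovery_plan() {
    pool=$1
    back=$2
    bad=$3
    echo "zpool online $pool $back"
    echo "zpool status $pool    # repeat until the resilver has finished"
    echo "zpool offline $pool $bad"
    echo "# swap the physical disk, then recreate its partitions and GELI provider"
    echo "zpool replace $pool $bad"
}

# Example: print_recovery_plan zroot ada0p4.eli ada3p4.eli
```

Printing the plan first makes it easier to review each step against the man pages before running it by hand.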

gpw928 · May 17, 2023

Looking at the situation makes me very nervous. You have two spindles with problems. Hindsight says that the Z2 decision was wise...

Since ada0 is offline, you must deal with it first. Check /var/log/messages to see if ada0 went offline suddenly. That should be fairly obvious, if it happened. Unless you know why it went offline, and are sure that it was because of an accident, I would replace it and re-silver the vdevs (ada0p2 and ada0p4) one at a time. Then test the disk away from production before deciding whether to keep it or dump it.

Once ada0 is fixed, and re-silvering complete, you need to deal with ada3 in much the same way. As Eric A. Borisch observed above, "197 Current_Pending_Sector" non-zero indicates a serious problem, so there is no question that ada3 (Serial Number: WD-WCC1T0371529) needs to be replaced. The TrueNAS Hard Drive Troubleshooting Guide is worth reading and bookmarking.

No matter how you figure out which drive to pull, you should always do it in a way that does not threaten your redundancy. So, that means always start by mapping the device number to the serial number (use smartctl, or camcontrol identify). What you do next depends on your hardware and planning. You may not be able to make a dead drive busy. One well trodden path is to shut down the system, pull the drives one at a time, and (double) check the serial numbers.
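A minimal sketch of the device-to-serial mapping, assuming the "Serial Number:" line format that smartctl printed earlier in this thread. The helper name serial_of is made up for illustration.

```shell
# Hypothetical helper: reads `smartctl -i` output on stdin and prints the
# value of the "Serial Number:" line.
serial_of() {
    awk -F': *' '/^Serial Number:/ { print $2 }'
}

# Possible use on the host, one line per disk:
#   for d in ada0 ada1 ada2 ada3; do
#       printf '%s ' "$d"; smartctl -i "/dev/$d" | serial_of
#   done
```

Writing the resulting list down before shutting the machine off gives something to check against the drive labels.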

byrnejb · May 17, 2023

gpw928 said:
Looking at the situation makes me very nervous. You have two spindles with problems. Hindsight says that the Z2 decision was wise...

Since ada0 is offline, you must deal with it first. Check /var/log/messages to see if ada0 went offline suddenly. That should be fairly obvious, if it happened. Unless you know why it went offline, and are sure that it was because of an accident, I would replace it and re-silver the vdevs (ada0p2 and ada0p4) one at a time. Then test the disk away from production before deciding whether to keep it or dump it.

Code:

[root@vhost03 ~ (master)]#  grep ada0 /var/log/messages
May 11 10:23:58 vhost03 kernel: ada0 at ahcich0 bus 0 scbus1 target 0 lun 0
May 11 10:23:58 vhost03 kernel: ada0: <WDC WD30EFRX-68AX9N0 80.00A80> s/n WD-WMC1T3451852 detached
May 11 10:23:59 vhost03 kernel: GEOM_ELI: Device ada0p3.eli destroyed.
May 11 10:23:59 vhost03 kernel: GEOM_ELI: Detached ada0p3.eli on last close.
May 11 10:24:01 vhost03 kernel: GEOM_ELI: Device ada0p4.eli destroyed.
May 11 10:24:01 vhost03 kernel: GEOM_ELI: Detached ada0p4.eli on last close.
May 11 10:24:01 vhost03 kernel: (ada0:ahcich0:0:0:0): Periph destroyed
May 11 10:24:09 vhost03 kernel: ada0 at ahcich0 bus 0 scbus1 target 0 lun 0
May 11 10:24:09 vhost03 kernel: ada0: <WDC WD30EFRX-68AX9N0 80.00A80> ACS-2 ATA SATA 3.x device
May 11 10:24:09 vhost03 kernel: ada0: Serial Number WD-WMC1T3451852
May 11 10:24:09 vhost03 kernel: ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
May 11 10:24:09 vhost03 kernel: ada0: Command Queueing enabled
May 11 10:24:09 vhost03 kernel: ada0: 2861588MB (5860533168 512 byte sectors)
May 11 10:24:09 vhost03 kernel: ada0: quirks=0x1<4K>

This looks like I must have popped the caddy on ada0 on Thursday, which is when I became ill. It was reseated immediately (the log shows the drive reattaching about eleven seconds after it detached). However, a scrub was probably running at the time and I likely disrupted it. In any case, this is what happened when I tried to put ada0p4.eli online:

Code:

[root@vhost03 ~ (master)]# zpool online zroot ada0p4.eli
warning: device 'ada0p4.eli' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
[root@vhost03 ~ (master)]# zpool status
  pool: bootpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 496K in 00:00:01 with 0 errors on Tue May 16 20:13:57 2023
config:

    NAME        STATE     READ WRITE CKSUM
    bootpool    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        ada1p2  ONLINE       0     0     0
        ada0p2  ONLINE       0     0   124
        ada2p2  ONLINE       0     0     0
        ada3p2  ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 4.05M in 15:47:21 with 0 errors on Wed May 17 12:01:18 2023
config:

    NAME            STATE     READ WRITE CKSUM
    zroot           DEGRADED     0     0     0
      raidz2-0      DEGRADED     0     0     0
        ada1p4.eli  ONLINE       0     0     0
        ada0p4.eli  REMOVED      0     0     0
        ada2p4.eli  ONLINE       0     0     0
        ada3p4.eli  ONLINE   2.18K     0     0

errors: No known data errors

byrnejb · May 17, 2023

It occurs to me that an issue with putting ada0p4.eli back online might be the fact that zroot is encrypted. Is that a consideration?

Eric A. Borisch · May 17, 2023

byrnejb said:
It occurs to me that an issue with putting ada0p4.eli back online might be the fact that zroot is encrypted. Is that a consideration?

Yes, you will need to run service geli start to bring your .eli device back. (You can see in /etc/rc.d/geli that it checks whether each configured .eli device is already present before trying to create it.)

byrnejb · May 17, 2023

I am feeling my way through this. I still cannot put ada0 online.

Code:

[root@vhost03 ~ (master)]#  service geli start
[root@vhost03 ~ (master)]# zpool online zroot ada0p4.eli
warning: device 'ada0p4.eli' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present

Code:

camcontrol devlist
<ATA WDC WD30EFRX-68A 0A80>        at scbus0 target 0 lun 0 (da1,pass6)
<ATA WDC WD1002FAEX-0 1D05>        at scbus0 target 1 lun 0 (da0,pass0)
<WDC WD30EFRX-68AX9N0 80.00A80>    at scbus1 target 0 lun 0 (ada0,pass1)
<WDC WD30EFRX-68AX9N0 80.00A80>    at scbus2 target 0 lun 0 (pass2,ada1)
<WDC WD30EFRX-68AX9N0 80.00A80>    at scbus3 target 0 lun 0 (pass3,ada2)
<WDC WD30EFRX-68AX9N0 80.00A80>    at scbus4 target 0 lun 0 (pass4,ada3)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus7 target 0 lun 0 (ses0,pass5)

byrnejb · May 18, 2023

Code:

[root@vhost03 ~ (master)]# zpool clear zroot ada0p4.eli
[root@vhost03 ~ (master)]# zpool online zroot ada0p4.eli
warning: device 'ada0p4.eli' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present

How do I clear the faulted state for ada0p4.eli? Is this a viable approach: zpool replace zroot /dev/ada0p4.eli /dev/ada0p4.eli? Where does the encryption fit in?

Eric A. Borisch · May 18, 2023

Make sure /dev/ada0p4.eli exists. (If it doesn't that needs to be addressed first.)
Assuming it does exist, are there any errors in /var/log/messages when you try to online it? (If there are, those need to be addressed / diagnosed first.)
If there are no errors reported, then the device is in a state where ZFS is not willing to have it just brought back online; zpool replace zroot /dev/ada0p4.eli (ref. zpool-replace(8)) should tell ZFS to treat it as a new device at the same location; this will force all writes to be replayed onto ada0p4.eli to bring it up to "silvered." As ada0p4.eli was previously part of a pool (but is apparently not in a healthy enough state to just zpool online), you may need to add the -f flag to zpool replace; see the manpage for the description.
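The first of those checks can be scripted so that a missing provider is noticed before any zpool command is issued. This is only a sketch; provider_present is a hypothetical name and the test is nothing more than "does the device node exist".

```shell
# Hypothetical pre-check: succeeds only if the given device node exists.
# A missing .eli node means the GELI provider is not attached yet.
provider_present() {
    if [ -e "$1" ]; then
        echo "$1 present"
    else
        echo "$1 missing: attach the GELI provider first" >&2
        return 1
    fi
}

# Possible use on the host:
#   provider_present /dev/ada0p4.eli
```

A non-zero exit status here is the cue to sort out GELI before going any further with ZFS.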

Eric A. Borisch · May 18, 2023

byrnejb said:
Is this a viable approach: zpool replace zroot /dev/ada0p4.eli /dev/ada0p4.eli? Where does the encryption fit in?

Yes, that's the same as zpool replace zroot /dev/ada0p4.eli, as "If new-device is not specified, it defaults to device."

The encryption layer (GELI) takes writes (in the clear) to /dev/ada0p4.eli, and writes them (encrypted) onto /dev/ada0p4.

byrnejb · May 18, 2023

I am having trouble mapping the various ways things seem to be referenced. I see this:

Code:

[root@vhost03 ~ (master)]# zpool status zroot
  pool: zroot
 state: DEGRADED
status: One or more devices has been removed by the administrator.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Online the device using zpool online' or replace the device with
    'zpool replace'.
  scan: scrub repaired 4.05M in 15:47:21 with 0 errors on Wed May 17 12:01:18 2023
config:

    NAME            STATE     READ WRITE CKSUM
    zroot           DEGRADED     0     0     0
      raidz2-0      DEGRADED     0     0     0
        ada1p4.eli  ONLINE       0     0     0
        ada0p4.eli  REMOVED      0     0     0
        ada2p4.eli  ONLINE       0     0     0
        ada3p4.eli  ONLINE       0     0     0

And this:

Code:

[root@vhost03 ~ (master)]# ll /dev/ada0*
crw-r-----  1 root  operator  0x74 May 11 10:24 /dev/ada0
crw-r-----  1 root  operator  0x76 May 11 10:24 /dev/ada0p1
crw-r-----  1 root  operator  0x7a May 11 10:24 /dev/ada0p2
crw-r-----  1 root  operator  0x7d May 11 10:24 /dev/ada0p3
crw-r-----  1 root  operator  0x7f May 11 10:24 /dev/ada0p4

I replaced the removed drive but it took a couple of attempts to find the correct (I believe) command:

Code:

[root@vhost03 ~ (master)]# zpool replace zroot /dev/ada0p4.eli
cannot resolve path '/dev/ada0p4.eli'

[root@vhost03 ~ (master)]# zpool replace zroot /dev/ada0p4
cannot replace /dev/ada0p4 with /dev/ada0p4: no such device in pool

[root@vhost03 ~ (master)]# zpool replace zroot /dev/ada0p4.eli /dev/ada0p4

[root@vhost03 ~ (master)]# zpool status zroot
  pool: zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu May 18 15:18:56 2023
    527G scanned at 1.00G/s, 42.7G issued at 83.3M/s, 2.82T total
    2.07G resilvered, 1.48% done, 09:42:02 to go
config:

    NAME              STATE     READ WRITE CKSUM
    zroot             DEGRADED     0     0     0
      raidz2-0        DEGRADED     0     0     0
        ada1p4.eli    ONLINE       0     0     0
        replacing-1   DEGRADED     0     0     0
          ada0p4.eli  REMOVED      0     0     0
          ada0p4      ONLINE       0     0     0  (resilvering)
        ada2p4.eli    ONLINE       0     0     0
        ada3p4.eli    ONLINE       0     0     0

errors: No known data errors

I will see what I end up with in about eight or nine hours.
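For anyone who wants to poll the resilver rather than reread the whole report, the percentage can be pulled out of the scan section. A small sketch, assuming the "N% done" wording shown in the status output above; the function name resilver_percent is made up.

```shell
# Hypothetical helper: reads `zpool status` output on stdin and prints the
# number in front of "% done", if such a line is present.
resilver_percent() {
    sed -n 's/.* \([0-9.]*\)% done.*/\1/p'
}

# Possible use on the host:
#   zpool status zroot | resilver_percent
```

It prints nothing once the resilver has finished, because the "% done" line is no longer present.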

Eric A. Borisch · May 18, 2023

byrnejb said:
[root@vhost03 ~ (master)]# zpool replace zroot /dev/ada0p4.eli
cannot resolve path '/dev/ada0p4.eli'

[root@vhost03 ~ (master)]# zpool replace zroot /dev/ada0p4
cannot replace /dev/ada0p4 with /dev/ada0p4: no such device in pool

[root@vhost03 ~ (master)]# zpool replace zroot /dev/ada0p4.eli /dev/ada0p4

If this is at all critical data, please don't just issue commands until something works.

By choosing to replace /dev/ada0p4.eli with /dev/ada0p4 directly you've removed the encryption on that vdev in your pool. You went through the trouble of setting up encryption at some point, so I assume you would like it to be encrypted.
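
One way to catch this after any replace is to scan the config table of `zpool status` for leaf devices that lack the `.eli` suffix. The following is only a rough sketch: the function name `unencrypted_members` is made up for illustration, and it assumes that every member of the pool is meant to sit on a GELI provider and that the devices are named `adaNpM`, as they are in this thread.

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` text on stdin and prints the leaf
# devices whose names do not end in .eli.
# Assumption: leaf devices are named adaNpM or adaNpM.eli, as in this thread.
unencrypted_members() {
    awk '
        $1 == "NAME" { in_config = 1; next }   # header row of the config table
        /^errors:/   { in_config = 0 }         # first line after the table
        in_config && $1 ~ /^ada[0-9]+p[0-9]+/ && $1 !~ /\.eli$/ { print $1 }
    '
}

# On a live system (not run here):
#   zpool status zroot | unencrypted_members
```

An empty result would mean every listed member is a `.eli` device.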

I would wait for the resilver to complete before doing any further troubleshooting.
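
Whether a resilver is still running can be read off the `scan:` line of `zpool status`. A minimal sketch of that check follows; `resilver_running` is a hypothetical name, and the match string is taken from the status output quoted earlier in this thread, so other ZFS versions may word it differently. On OpenZFS 2.0 or later, `zpool wait -t resilver zroot` should do the waiting without any polling.

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` text on stdin.
# Exit status 0 while a resilver is in progress, non-zero otherwise.
resilver_running() {
    grep -q 'scan: resilver in progress'
}

# Polling loop for a live system (not run here; pool name from the thread):
#   while zpool status zroot | resilver_running; do sleep 600; done
#   zpool status zroot
```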

~~At a minimum, figure out why you don't have the geli(8) devices (/dev/*.eli) showing up at all anymore (but apparently still open for the rest of your pool?)~~ Edit: I missed that your ls command was for /dev/ada0*, not /dev/ada*; your following post shows the other *.eli devices are still present.

I'm not sure how your geli devices are configured (passphrase only at boot, or a stored key, or a key and a passphrase...) -- but I would review how you created them and make sure that any required secrets are still available, and then I think I would reboot for things to start from a "clean slate" -- hopefully the other three .eli devices are re-opened at boot. (If this is mission-critical data, I would consider backing up the data to another location before reboot while you are certain you have access to it.) In general it would be surprising for this (GELI key information) to get lost, but it would also be surprising to replace your .eli device with the bare device, and yet here we are.

Assuming a reboot brings back the other three *.eli devices, I would be tempted to offline /dev/ada0p4, go through the steps to re-initialize it for GELI, and issue zpool replace zroot /dev/ada0p4 /dev/ada0p4.eli to get back to the fully encrypted (on disk) state.

byrnejb · May 19, 2023

Eric A. Borisch said:
By choosing to replace /dev/ada0p4.eli with /dev/ada0p4 directly you've removed the encryption on that vdev in your pool. You went through the trouble of setting up encryption at some point, so I assume you would like it to be encrypted.

Yes, it is intended that ada0 be encrypted.

Code:

[root@vhost03 ~ (master)]# ll /dev/*eli
crw-r-----  1 root  operator  0xa8 Apr 20 11:01 /dev/ada1p3.eli
crw-r-----  1 root  operator  0xdd Apr 20 11:01 /dev/ada1p4.eli
crw-r-----  1 root  operator  0xd8 Apr 20 11:01 /dev/ada2p3.eli
crw-r-----  1 root  operator  0xe5 Apr 20 11:01 /dev/ada2p4.eli
crw-r-----  1 root  operator  0xe1 Apr 20 11:01 /dev/ada3p3.eli
crw-r-----  1 root  operator  0xed Apr 20 11:01 /dev/ada3p4.eli

The resilver has completed. This is what I have:

Code:

. . .
scan: resilvered 554G in 17:38:27 with 0 errors on Fri May 19 08:57:23 2023
config:

    NAME            STATE     READ WRITE CKSUM
    zroot           ONLINE       0     0     0
      raidz2-0      ONLINE       0     0     0
        ada1p4.eli  ONLINE       0     0     0
        ada0p4      ONLINE       0     0     0
        ada2p4.eli  ONLINE       0     0     0
        ada3p4.eli  ONLINE     936     0   189
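
The status above also shows non-zero READ and CKSUM counters on ada3p4.eli. As a small sketch, rows like that can be pulled out of the config table automatically; `devices_with_errors` is a hypothetical name and the column layout is assumed to match the output shown here.

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` text on stdin and prints every
# config row whose READ, WRITE, or CKSUM column is anything other than 0.
devices_with_errors() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        /^errors:/                    { in_config = 0 }
        in_config && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}

# On a live system (not run here):
#   zpool status zroot | devices_with_errors
```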

After I remove ada0p4, how do I set up ada0p4.eli so that it takes the place of ada0p4 in the pool? When I first tried the replace I got this result:

Code:

[root@vhost03 ~ (master)]# zpool replace zroot /dev/ada0p4.eli
cannot resolve path '/dev/ada0p4.eli'

byrnejb · May 19, 2023

This is what I have in /boot/loader.conf:

Code:

[root@vhost03 ~ (master)]# more /boot/loader.conf
geli_ada0p4_keyfile0_load="YES"
geli_ada0p4_keyfile0_type="ada0p4:geli_keyfile0"
geli_ada0p4_keyfile0_name="/boot/encryption.key"
geli_ada1p4_keyfile0_load="YES"
geli_ada1p4_keyfile0_type="ada1p4:geli_keyfile0"
geli_ada1p4_keyfile0_name="/boot/encryption.key"
geli_ada2p4_keyfile0_load="YES"
geli_ada2p4_keyfile0_type="ada2p4:geli_keyfile0"
geli_ada2p4_keyfile0_name="/boot/encryption.key"
geli_ada3p4_keyfile0_load="YES"
geli_ada3p4_keyfile0_type="ada3p4:geli_keyfile0"
geli_ada3p4_keyfile0_name="/boot/encryption.key"
aesni_load="YES"
geom_eli_load="YES"
geom_eli_passphrase_prompt="YES"
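
The listing above already carries keyfile entries for ada0p4, so the loader side appears to be in place as long as the new partition keeps that name. A small sketch for confirming which keyfile a given partition is configured with; `loader_keyfile_for` is a made-up name, and it only understands the `geli_<partition>_keyfile0_name="..."` form used in this file.

```shell
#!/bin/sh
# Hypothetical helper: reads loader.conf text on stdin and prints the keyfile
# path configured for the given partition (e.g. ada0p4), or nothing if the
# partition has no geli_<partition>_keyfile0_name entry.
loader_keyfile_for() {
    part=$1
    sed -n "s/^geli_${part}_keyfile0_name=\"\(.*\)\"\$/\1/p"
}

# On a live system (not run here):
#   loader_keyfile_for ada0p4 < /boot/loader.conf
```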

And to prep ada0p4 I understand that this is necessary:

Code:

geli init -l 256 /dev/ada0p4

The steps then are these?

zpool offline zroot ada0p4
geli init -l 256 /dev/ada0p4 ### providing encryption key phrase
zpool replace zroot /dev/ada0p4 /dev/ada0p4.eli

Eric A. Borisch · May 19, 2023

byrnejb said:
zpool offline zroot ada0p4

geli init -l 256 /dev/ada0p4 ### providing encryption key phrase

zpool replace zroot /dev/ada0p4 /dev/ada0p4.eli

Yes, that looks about right, but you'll want to use your keyfile (-K path/to/file) during the init, and then provide (or not; -P) a passphrase, depending on what you want. (If the other devices don't have an additional passphrase, I would keep them consistent.) You will also likely need a step 2.a (I can't recall if init does an attach): geli attach <options> /dev/ada0p4, so that /dev/ada0p4.eli is created.

You can also use geli list to see how your other devices are encrypted; I would verify that they match between steps 2 and 3.

Refer to the handbook for a good description of the GELI process.
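
Put together, the sequence discussed above might look like the following dry run, which only prints the commands so they can be reviewed before anything touches the pool. This is a sketch under assumptions, not a tested procedure: `plan_reencrypt` and the `NO_PASSPHRASE` switch are hypothetical, the key length and keyfile path are taken from earlier posts in this thread, and any further `geli init` options (sector size, boot flags) are left out and should be copied from what `geli list` reports for the existing providers.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the commands instead of running them.
# Arguments: pool name, partition name (without /dev/), keyfile path.
# Set NO_PASSPHRASE=1 only if the existing providers use a keyfile alone.
plan_reencrypt() {
    pool=$1
    part=$2
    keyfile=$3
    init_opts="-l 256 -K $keyfile"
    attach_opts="-k $keyfile"
    if [ "${NO_PASSPHRASE:-0}" = "1" ]; then
        init_opts="$init_opts -P"      # geli init: keyfile only, no passphrase
        attach_opts="$attach_opts -p"  # geli attach: do not ask for a passphrase
    fi
    echo "zpool offline $pool $part"
    echo "geli init $init_opts /dev/$part"
    echo "geli attach $attach_opts /dev/$part"
    echo "zpool replace $pool $part $part.eli"
}

plan_reencrypt zroot ada0p4 /boot/encryption.key
```

With the defaults this prints four lines: the offline, the init with `-K`, the attach with `-k`, and the replace that names the pool. Running the printed commands by hand, one at a time, leaves room to check `geli list` between the attach and the replace.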