ZFS HDD Mirror read error for one of two HDDs

Four months ago I bought 2x 12TB WD Red Plus HDDs.
A week ago I got an unrecoverable GELI read error on one of the HDDs, so I scrubbed the pool.
That error went away, but now I get the same error on the same HDD.
This is a snippet of the error message:
Code:
...
GEOM_ELI: g_eli_read_done() failed (error=5) ada0.eli[READ(offset=16384, length=114688)]
(ada0:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 20 02 00 40 00 00 00 00 00 00
(ada0:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
(ada0:ahcich1:0:0:0): Error 5, Unretryable error
GEOM_ELI: g_eli_read_done() failed (error=5) ada0.eli[READ(offset=278528, length=114688)]
(ada0:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 20 00 00 40 00 00 00 00 00 00
(ada0:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
(ada0:ahcich1:0:0:0): Error 5, Unretryable error
GEOM_ELI: g_eli_read_done() failed (error=5) ada0.eli[READ(offset=16384, length=114688)]
(ada0:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 20 02 00 40 00 00 00 00 00 00
(ada0:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
(ada0:ahcich1:0:0:0): Error 5, Unretryable error
GEOM_ELI: g_eli_read_done() failed (error=5) ada0.eli[READ(offset=278528, length=114688)]
(ada0:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 20 fa ff 40 74 05 00 00 00 00
(ada0:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
(ada0:ahcich1:0:0:0): Error 5, Unretryable error
GEOM_ELI: g_eli_read_done() failed (error=5) ada0.eli[READ(offset=12000137854976, length=114688)]
(ada0:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 20 fc ff 40 74 05 00 00 00 00
(ada0:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
(ada0:ahcich1:0:0:0): Error 5, Unretryable error
...
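
As a cross-check on the log above, the ACB bytes of each failed READ_FPDMA_QUEUED can be decoded into an LBA and a sector count. This is only a sketch: it assumes the 12 bytes are printed in the order command, features, LBA low/mid/high, device, LBA low/mid/high (exp), features (exp), sector count, sector count (exp), that NCQ commands carry the transfer length in the features bytes, and that logical sectors are 512 bytes. `decode_ncq_acb` is an illustrative helper, not part of any tool.

```python
def decode_ncq_acb(acb: str, sector_size: int = 512) -> dict:
    """Decode a 12-byte ATA command block as printed in the kernel log.

    Assumed byte order (an assumption, not taken from the thread):
    command, features, lba_low, lba_mid, lba_high, device, lba_low_exp,
    lba_mid_exp, lba_high_exp, features_exp, sector_count, sector_count_exp.
    """
    b = [int(x, 16) for x in acb.split()]
    if len(b) != 12:
        raise ValueError("expected 12 hex bytes")
    # 48-bit LBA: low 24 bits in bytes 2..4, high 24 bits in bytes 6..8.
    lba = b[2] | b[3] << 8 | b[4] << 16 | b[6] << 24 | b[7] << 32 | b[8] << 40
    # For NCQ (FPDMA) commands the sector count is carried in the features
    # bytes; a value of 0 stands for 65536 sectors.
    sectors = (b[1] | b[9] << 8) or 65536
    return {
        "command": b[0],
        "lba": lba,
        "sectors": sectors,
        "byte_offset": lba * sector_size,
        "byte_length": sectors * sector_size,
    }


if __name__ == "__main__":
    for acb in (
        "60 e0 20 02 00 40 00 00 00 00 00 00",
        "60 e0 20 00 00 40 00 00 00 00 00 00",
        "60 e0 20 fa ff 40 74 05 00 00 00 00",
    ):
        d = decode_ncq_acb(acb)
        print(acb, "->", d["byte_offset"], d["byte_length"])
```

Under those assumptions the three command blocks decode to byte offsets 278528, 16384 and 12000137854976, each with a length of 114688, which are the same offsets and length that the GEOM_ELI lines report. That suggests the failed reads are the same requests seen at two layers, with the GELI provider sitting directly on the raw disk.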

This is my zpool status after booting up:
Code:
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: scrub repaired 0B in 01:56:53 with 0 errors on Sun Oct 19 05:34:49 2025
config:

    NAME          STATE     READ WRITE CKSUM
    libraryz      DEGRADED     0     0     0
      mirror-0    DEGRADED     0     0     0
        ada1.eli  ONLINE       0     0     0
        ada0.eli  FAULTED      0     0     0  too many errors

Is there any way I could repair it through ZFS?
If not, luckily I still have warranty, so I could probably have it replaced.

If it is not repairable and needs a replacement, should I still wipe it, or can I leave it as is since the data is encrypted?
Wiping can take hours, and if the data on the faulted HDD is encrypted, I am unsure whether wiping is still needed.

I suspect that far too many sectors on the HDD have gone bad.

EDIT:
It got resilvered again, and the pool is back in a functional state.
Two resilverings within one week are rather suspicious.
 
The drive is failing and needs to be replaced. ZFS might online the disk after a restart and will attempt to fix data on the disk that isn't correct but it will likely keep happening.

As the whole disk is encrypted with GELI, I don't see a particular need to wipe the disk.
 
Agreeing with the above assessments: zpool offline the disk and, if you're extra-paranoid, use geli kill ada0 to nuke the key material on the disk.
 
We don't have enough information to understand your risk management posture, and you need to tell us before it's possible to suggest a sensible approach. e.g.
  • If the data are valuable and can't be replaced, you should have backups, urgently.
  • If high availability is required, you should have/get a spare 12TB disk in store, ready to deploy immediately.
As others have suggested, sysutils/smartmontools provide essential diagnostics.

Read the TrueNAS Hard Drive Troubleshooting Guide. But beware Section 4(d) which says "Swap the DATA cables between the suspect drive and a nearby drive". This may cause a second, potentially fatal failure on the "nearby drive", if the cable is at fault.
 
Here is a writeup that I made last year for another project.
It depends on installing sysutils/smartmontools.
=======================================================
The important quality data from a SATA HDD.

root@FreeBSDnode:~ # smartctl -HA /dev/ada0 ( eg, the suspected device )

smartctl 7.3 2022-02-28 r5338 [FreeBSD 13.1-STABLE amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME           FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate      0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time             0x0027   100   100   001    Pre-fail  Always       -       9624
  4 Start_Stop_Count         0x0032   100   100   000    Old_age   Always       -       244
  5 Reallocated_Sector_Count 0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate          0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance    0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours           0x0032   098   098   000    Old_age   Always       -       1085
 10 Spin_Retry_Count         0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       244
191 G-Sense_Error_Rate       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count  0x0032   100   100   000    Old_age   Always       -       8
193 Load_Cycle_Count         0x0032   100   100   000    Old_age   Always       -       675
194 Temperature_Celsius      0x0022   100   100   000    Old_age   Always       -       40
196 Reallocated_Event_Count  0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector   0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable    0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count     0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift               0x0002   100   100   000    Old_age   Always       -       169345034
222 Loaded_Hours             0x0032   100   098   000    Old_age   Always       -       372
223 Load_Retry_Count         0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction            0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time             0x0026   100   100   000    Old_age   Always       -       534
240 Head_Flying_Hours        0x0001   100   100   001    Pre-fail  Offline      -       0

Reallocated Sector Count and Reallocated Event Count mean that the disk has used, and is using, some of the spare sectors it keeps in reserve.
At some point no more spare sectors are available and no further reallocation can take place.

UDMA CRC errors are transmission errors between the motherboard and the HDD, which points to a possible SATA cable fault.

When these error counts reach several hundred, the drive is unsafe to use.
Offline uncorrectable sectors are permanent flaws on the disk surface, basically requiring a disk surface reformat.
Running an HDD with uncorrectable sectors is unsafe.

Regards
 
I will run smartmontools and post the log.
The cables should be fine: I accidentally swapped them at one point, and the same drive remained the faulted one.
 
Interesting.
I built smartmontools from source.
And tried some commands like:
Input:
Code:
# smartctl -a /dev/ada0
# smartctl -HA /dev/ada0
# smartctl -d auto -a /dev/ada0

smartctl always tells me: "/dev/ada0: No such file or directory"

I tried to scan for the devices with # smartctl --scan
The HDDs were not displayed.
Could it have something to do with the fact that I created the ZFS pool on raw disks?

As a next step I tried # camcontrol devlist -v
The relevant entry is: "scbus1 on ahcich1 bus 0:
<WDC WD120EFBX-68B0EN0 85.00A85> at scbus1 target 0 lun 0 (ada0)"

It seems the device is recognized, but I wonder why smartctl doesn't see it then.
 
Managed to solve the problem of smartctl not showing my HDDs.

Issuing smartctl -a /dev/ada0 shows me:
Code:
smartctl 7.5 2025-04-30 r5714 [FreeBSD 14.3-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red Plus
Device Model:     WDC WD120EFBX-68B0EN0
Serial Number:    D7JWSNLN
LU WWN Device Id: 5 000cca 2dfe8cddc
Firmware Version: 85.00A85
User Capacity:    12,000,138,625,024 bytes [12.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.5/5706
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Oct 24 14:54:03 2025 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (   87) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (1326) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   127   127   054    Old_age   Offline      -       112
  3 Spin_Up_Time            0x0007   171   171   024    Pre-fail  Always       -       388 (Average 380)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       1022
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   132   132   020    Old_age   Offline      -       17
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       2232
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       387
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       1906
193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       1906
194 Temperature_Celsius     0x0002   209   209   000    Old_age   Always       -       31 (Min/Max 15/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       8

SMART Error Log Version: 1
ATA Error Count: 8 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 8 occurred at disk power-on lifetime: 2218 hours (92 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 a0 00 fb ff 40 08      00:01:49.117  READ FPDMA QUEUED
  60 00 a8 00 fd ff 40 08      00:01:49.116  READ FPDMA QUEUED
  60 00 98 00 03 00 40 08      00:01:49.115  READ FPDMA QUEUED
  60 b0 90 50 01 00 40 08      00:01:49.115  READ FPDMA QUEUED
  60 08 88 48 01 00 40 08      00:01:49.115  READ FPDMA QUEUED

Error 7 occurred at disk power-on lifetime: 2176 hours (90 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 40  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 01 00 00 00 e0 00      00:00:36.923  READ DMA EXT
  25 00 01 00 00 00 e0 00      00:00:36.909  READ DMA EXT
  ef 02 00 00 00 00 a0 00      00:00:35.977  SET FEATURES [Enable write cache]
  60 01 70 00 00 00 40 00      00:00:35.884  READ FPDMA QUEUED
  ec 00 00 00 00 00 a0 00      00:00:35.883  IDENTIFY DEVICE

Error 6 occurred at disk power-on lifetime: 2176 hours (90 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 01 70 00 00 00 40 00      00:00:35.907  READ FPDMA QUEUED
  ec 00 00 00 00 00 a0 00      00:00:35.883  IDENTIFY DEVICE
  b0 d8 00 00 4f c2 a0 00      00:00:35.021  SMART ENABLE OPERATIONS
  ef 02 00 00 00 00 a0 00      00:00:34.937  SET FEATURES [Enable write cache]
  ec 00 00 00 00 00 a0 00      00:00:34.624  IDENTIFY DEVICE

Error 5 occurred at disk power-on lifetime: 2092 hours (87 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 50 80 a5 e6 40 08      00:04:01.638  READ FPDMA QUEUED
  60 00 58 80 ad e6 40 08      00:04:01.636  READ FPDMA QUEUED
  60 00 48 80 9d e6 40 08      00:04:01.627  READ FPDMA QUEUED
  60 00 40 80 95 e6 40 08      00:04:01.623  READ FPDMA QUEUED
  60 00 38 80 8d e6 40 08      00:04:01.619  READ FPDMA QUEUED

Error 4 occurred at disk power-on lifetime: 2091 hours (87 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 48 e0 70 27 00 40 08      00:02:51.101  WRITE FPDMA QUEUED
  61 48 d8 70 27 00 40 08      00:02:51.101  WRITE FPDMA QUEUED
  61 30 d0 90 21 00 40 08      00:02:51.100  WRITE FPDMA QUEUED
  61 28 c8 98 21 00 40 08      00:02:51.100  WRITE FPDMA QUEUED
  61 98 c0 a0 29 00 40 08      00:02:51.099  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
I tested the other HDD too, and it showed healthy values.

So I should probably replace the HDD and its SATA cable eventually, then?
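
To pick the interesting rows out of an attribute table like the one above, a small filter can help. The sketch below is only illustrative: it assumes the whitespace-separated column layout shown in the pasted output, works on pasted text rather than calling smartctl, and both the function name and the set of watched attribute IDs are my own choice.

```python
# Attribute IDs worth watching: reallocated sectors/events, pending and
# offline-uncorrectable sectors, and UDMA CRC (link) errors.
WATCHED_IDS = {5, 196, 197, 198, 199}


def nonzero_watched_attributes(smartctl_text: str) -> dict:
    """Return {attribute_name: raw_value} for watched attributes with a non-zero raw value."""
    found = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # An attribute row has at least 10 columns, a numeric ID first
        # and a hexadecimal flag in the third column.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not fields[2].startswith("0x"):
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


# Three rows copied from the output above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       8
"""

if __name__ == "__main__":
    print(nonzero_watched_attributes(SAMPLE))
```

For the rows above, only UDMA_CRC_Error_Count is reported, with a raw value of 8; the reallocated and pending sector counts are all zero.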
 
My version of FreeBSD is 14.3-RELEASE.
Now we know. Thanks.

You may be interested in PR 279978.
Cherry-picking a patch or a referenced commit from it should allow you to see proper error diagnostics instead of "Auto-Sense Retrieval Failed".
That may shed more light on the actual problem.

It's beyond me why the fix was never MFC-ed to stable/14.
 
Try replacing the cable with a new, short SATA cable.
Well, after going through the diagnostics and inspecting everything carefully, I will just replace the two SATA cables for the HDDs tomorrow and see whether errors still occur before replacing anything else.
Thank you for the recommendation :)
 
The earlier smartctl failure happened because I run a custom kernel and had left out some SCSI options that were apparently needed.
So the drivers were not loaded properly.
After comparing my kernel config against the GENERIC one, I fixed that problem.
 
There are no logged bad sectors or pending sectors, so I think your disk platters are good. The 8 errors that you see can be caused by a bad SATA connection, power interruptions (disk reset) or a bad connection between the controller and the disk heads. If you continue to see a rising UDMA_CRC count after replacing the SATA cable, you can connect the disk to a different SATA power cable and check again. The final thing you can do is grab a Torx screwdriver, remove the disk's controller board and clean the contact pads with an eraser and then with alcohol to remove any oxidation.
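
The error log entries posted earlier all show ER = 0x84, which smartctl expands to "ICRC, ABRT". A small sketch of that decoding follows. The bit names are the conventional ones from the classic ATA register definitions as far as I remember them (several are marked obsolete in newer revisions), so treat the table as an assumption rather than a reference; `decode_error_register` is just an illustrative name.

```python
# ATA error register bits, most significant first. The names are an
# assumption based on the classic ATA register definitions.
ERROR_BITS = (
    (0x80, "ICRC"),  # interface CRC error during a UDMA transfer
    (0x40, "UNC"),   # uncorrectable data error
    (0x20, "MC"),    # media changed (obsolete)
    (0x10, "IDNF"),  # requested sector ID not found
    (0x08, "MCR"),   # media change request (obsolete)
    (0x04, "ABRT"),  # command aborted
    (0x02, "NM"),    # no media (obsolete)
    (0x01, "AMNF"),  # address mark not found (obsolete)
)


def decode_error_register(value: int) -> list:
    """Return the names of the bits set in an ATA error register value."""
    if not 0 <= value <= 0xFF:
        raise ValueError("the error register is a single byte")
    return [name for mask, name in ERROR_BITS if value & mask]


if __name__ == "__main__":
    print(decode_error_register(0x84))
```

For 0x84 this gives ICRC and ABRT, matching what smartctl printed. ICRC flags a CRC failure on the interface rather than a media error, which fits the picture of a cable, connector or power problem and not bad platters.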
 
I pretty much agree with what VladiBG said, so here is a short list of possible causes:

  1. Your driver mods might have been the issue. If this does not happen with a GENERIC kernel, you know what the problem was.
  2. Bad SATA Cable
  3. Bad Controller circuit on mainboard, try another SATA port
  4. Bad controller circuit on drive (not very likely, but you never know)
  5. Iffy power supply to drive
I have personal experience with number 5, with a failing power supply that "bled" AC harmonics into the DC power supply.
 
Regarding 1: camcontrol devlist and smartctl clearly did not show my HDDs correctly before; now they do, after I added some options to my custom kernel.

I strongly suspect it is either option 2 or 3.
Option 2 is more likely, though, as the SATA cables are very old by now.

As for option 5, I have an EVGA Gold 2000W supply.
It could also be a cause, but I think it is unlikely.

Thank you for the list :-)
I will go through them one by one and see which one caused the error.
After all, the other HDD in the mirror is completely healthy, with 0 errors.
 
Fun fact: for me it was an EVGA Platinum 1000W power supply that failed. Granted, it was a bit over ten years old, but it came with a 10-year warranty, so it was still unexpected. But agreed, it is not very likely, so it is at the bottom of the list. With an iffy power supply, devices may or may not work and can fail in strange ways, even if they are identical devices.

Before I diagnosed the power supply to be the issue, it killed a mainboard (warranty) and a graphics card (warranty).
 
Don't make all the changes at once. Do one step at a time and test for a week or so; otherwise you won't know which change fixed the issue.
Yes :)
I will observe the issue first.
If it occurs again, I will try replacing the SATA cables.
If that doesn't help, I will try different ports, etc.


Power supplies should be replaced after 5-7 years, I think.
Mine is 3 years old now and always runs in eco mode.
 
I would still expect a power supply with a ten-year warranty to last a good ten years, especially if it is reasonably over-provisioned and a brand-name device. I would also expect such a device to fail "cleanly" instead of going haywire and killing your computer's components. I was just lucky that the warranties on the mainboard and graphics card were still valid; the latter failed a mere two weeks before its warranty expired.
 