ZFS devices keep getting removed from my pool

My ZFS pool is configured as raidz1, and two of the drives keep getting removed, forcing me to reboot the system to get things working again.

This is what I get when I run zpool status:

Code:
  pool: library
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: scrub in progress since Mon Sep  9 19:13:50 2019
        58.0G scanned out of 62.0T at 72.9M/s, 247h23m to go
        0 repaired, 0.09% done
config:

        NAME                     STATE     READ WRITE CKSUM
        library                  UNAVAIL    107    12     0
          raidz1-0               UNAVAIL    218     8     0
            da0.eli              ONLINE       0     0     0
            da1.eli              ONLINE       0     0     0
            da2.eli              ONLINE       0     0     0
            da3.eli              ONLINE       0     0     0
            ada0.eli             ONLINE       0     0     0
            ada1.eli             ONLINE       0     0     0
            ada2.eli             ONLINE       0     0     0
            ada3.eli             ONLINE       0     0     0
            ada6.eli             ONLINE       0     0     0
            591043243035698817   REMOVED      0     0     0  was /dev/ada8.eli
            4784318005487700532  REMOVED      0     0     0  was /dev/ada9.eli
            ada7.eli             ONLINE       0     0     0

errors: 118 data errors, use '-v' for a list

The pool is encrypted as well. I have identified the two drives that get removed constantly, but I can't seem to find a way to bring them online again. I can reboot the system and then decrypt and attach the entire pool without a problem, but I would like to be able to get the pool back without having to reboot every time. Any ideas?

I'm also curious whether anyone knows what the issue might be, as this seems to happen a lot. The fact that it is the same two drives being removed every time is a bit odd to me. Are there any logs I could provide?
 
This is what dmesg printed for the affected devices:
Code:
ada8 at ahcich6 bus 0 scbus7 target 0 lun 0
ada8: <WDC WD100EFAX-68LHPN0 83.H0A83> s/n 2YJ5UXPD detached
GEOM_ELI: g_eli_read_done() failed (error=6) ada8.eli[READ(offset=1160306712576, length=4096)]
GEOM_ELI: g_eli_read_done() failed (error=6) ada8.eli[READ(offset=4775005310976, length=4096)]
GEOM_ELI: g_eli_read_done() failed (error=6) ada8.eli[READ(offset=270336, length=8192)]
GEOM_ELI: g_eli_read_done() failed (error=6) ada8.eli[READ(offset=10000830570496, length=8192)]
GEOM_ELI: g_eli_read_done() failed (error=6) ada8.eli[READ(offset=10000830832640, length=8192)]
GEOM_ELI: g_eli_read_done() failed (error=6) ada8.eli[READ(offset=270336, length=8192)]
GEOM_ELI: g_eli_read_done() failed (error=6) ada8.eli[READ(offset=10000830570496, length=8192)]
GEOM_ELI: g_eli_read_done() failed (error=6) ada8.eli[READ(offset=10000830832640, length=8192)]
GEOM_ELI: Device ada8.eli destroyed.
GEOM_ELI: Detached ada8.eli on last close.
(ada8:ahcich6:0:0:0): Periph destroyed
ada8 at ahcich6 bus 0 scbus7 target 0 lun 0
ada8: <WDC WD100EFAX-68LHPN0 83.H0A83> ACS-2 ATA SATA 3.x device
ada8: Serial Number 2YJ5UXPD
ada8: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada8: Command Queueing enabled
ada8: 9537536MB (19532873728 512 byte sectors)
(ada9:ahcich7:0:0:0): READ_FPDMA_QUEUED. ACB: 60 c0 78 96 95 40 00 00 00 00 00 00
(ada9:ahcich7:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada9:ahcich7:0:0:0): Retrying command
(ada9:ahcich7:0:0:0): READ_FPDMA_QUEUED. ACB: 60 60 38 97 95 40 00 00 00 00 00 00
(ada9:ahcich7:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada9:ahcich7:0:0:0): Retrying command
ada9 at ahcich7 bus 0 scbus8 target 0 lun 0
ada9: <WDC WD100EFAX-68LHPN0 83.H0A83> s/n 2YJJHG1D detached
GEOM_ELI: g_eli_read_done() failed (error=6) ada9.eli[READ(offset=5019332608, length=98304)]
GEOM_ELI: g_eli_read_done() failed (error=6) ada9.eli[READ(offset=5019430912, length=49152)]
GEOM_ELI: g_eli_read_done() failed (error=6) ada9.eli[READ(offset=5019480064, length=122880)]
GEOM_ELI: g_eli_read_done() failed (error=6) ada9.eli[READ(offset=5019602944, length=122880)]
GEOM_ELI: g_eli_read_done() failed (error=6) ada9.eli[READ(offset=270336, length=8192)]
GEOM_ELI: g_eli_read_done() failed (error=6) ada9.eli[READ(offset=10000830570496, length=8192)]
GEOM_ELI: g_eli_read_done() failed (error=6) ada9.eli[READ(offset=10000830832640, length=8192)]
GEOM_ELI: Device ada9.eli destroyed.
GEOM_ELI: Detached ada9.eli on last close.
(ada9:ahcich7:0:0:0): Periph destroyed
ada9 at ahcich7 bus 0 scbus8 target 0 lun 0
ada9: <WDC WD100EFAX-68LHPN0 83.H0A83> ACS-2 ATA SATA 3.x device
ada9: Serial Number 2YJJHG1D
ada9: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada9: Command Queueing enabled
ada9: 9537536MB (19532873728 512 byte sectors)
 
Well, something is wrong with the connection to your drives, which is why they get parity errors.
The parity errors are the root cause; the drives disappearing is just a consequence.

It should be possible to reintegrate them while the system is online: check with camcontrol devlist that they are present, then do a geli attach, and then tell ZFS that they are online, as sketched below.
But that's not a real solution: the cause of the parity errors should be found and fixed. It can be something like cabling, the power supply, or a fault in the drives themselves.
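Something like this; only a sketch, assuming a passphrase-protected geli setup and the pool/device names from the status output above (repeat for ada9):
Code:
# check that the kernel sees the drive again
camcontrol devlist
# re-create the .eli provider (add -k <keyfile> if you use a key file)
geli attach /dev/ada8
# then tell ZFS to bring the vdev back
zpool online library ada8.eli
# or, if the whole pool is marked faulted, clear the errors
zpool clear library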

In any case, a twelve-disk raidz1 is a little bit, eh, ambitious. At the very least it needs very well-tested hardware.
 
Thank you for your answers.

Yes, I regret configuring it as raidz1. I guess there is nothing I can do to convert it to raidz2 or raidz3 without redoing everything, right?

How do I tell ZFS that the drives are online once I attached them with geli?

I can't seem to find any smartctl utility installed, and running pkg install smartctl yields nothing.
 
I guess there is nothing I can do to convert it to raidz2 or raidz3 without redoing everything, right?
Correct. You'll have to re-do the whole pool.
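If you ever do redo it, one possible route is to build the raidz2 pool on spare disks and replicate the data with zfs send/recv. This is only a sketch; it assumes you have enough spare disks to run both pools side by side, and library2 plus the device names are placeholders:
Code:
# new raidz2 pool on spare disks (placeholder device names)
zpool create library2 raidz2 /dev/da4.eli /dev/da5.eli /dev/da6.eli /dev/da7.eli
# recursive snapshot of the old pool, then replicate everything
zfs snapshot -r library@migrate
zfs send -R library@migrate | zfs recv -F library2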

How do I tell ZFS that the drives are online once I attached them with geli?
zpool status should tell you.


I can't seem to find any smartctl utility installed, and running pkg install smartctl yields nothing.
Oops, sorry, I assumed this was common knowledge. It's part of sysutils/smartmontools.
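So, roughly:
Code:
pkg install smartmontools
# full SMART report for one of the affected drives
smartctl -a /dev/ada8
# start an extended (long) self-test; it runs in the background on the drive itself
smartctl -t long /dev/ada8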
 
zpool status should tell you.

I didn't really believe it, but you're right.
I tried it out: I unplugged two of the three drives of a raidz1, reaching a state where the whole pool is faulted. And it is indeed enough to do exactly as told: bring the drives back online (including the geli attach) and run zpool clear. (I thought I had used zpool online on occasion.)
 
I have run smartctl -t long on both devices, and the result was that the test 'completed without error'. I'm not very familiar with this utility, but is there another test I could run that might uncover the underlying issue I'm looking for?
 
I've seen faulty SATA cables more than once, and I've had a power supply that wasn't up to the task. You may want to get a multimeter and check for any significant voltage drop on the furthest connectors.
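You could also look at the drives' own link error counters; on most SATA drives SMART attribute 199 (UDMA_CRC_Error_Count) increments on cable/port CRC errors rather than media errors, so a rising value there points at the link, not the platters. For example:
Code:
smartctl -A /dev/ada8 | grep -Ei 'crc|199'
smartctl -A /dev/ada9 | grep -Ei 'crc|199'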
 
I've noticed that the issue occurs when I access certain files rather than others. The pool can chug along nicely for a month, but when I access one particular file it breaks down. After setting things up again, accessing that same file triggers the fault yet again. Maybe bad sectors are the root cause of all of this? In any case, I will try changing the SATA cables and test again.

It feels good, though, to be able to reproduce the error by accessing that file, so I don't have to wait around not knowing whether it's going to happen again.
 
It feels good, though, to be able to reproduce the error by accessing that file, so I don't have to wait around not knowing whether it's going to happen again.
There's nothing worse than trying to analyze something that only happens intermittently. I'd much rather have something blow up in my face and complain very loudly with a whole bunch of errors; that gives me something to work with :)
 