Hi all!
I have a 6x2TB raidz3 pool (the devices are Seagate STSHX-M201TCBM). The devices are encrypted with geli and are connected through a TP-Link UH720 USB 3.0 hub to an Intel NUC running FreeBSD (11.x-RELEASE for about two years, 12.0-RELEASE for a few days now). What bothers me is that I have not been able to get a scrub to finish for some months now. My pool currently has about 4TB allocated. At some point during a scrub (I reach at most about 25% to 30%), one or two of the six devices randomly drop off the USB bus (it is a different device every time, so it is not a problem with a single device) and then immediately reconnect. ZFS then marks the device as REMOVED. When I manually online the device or reopen the pool, a resilver of course takes place for a few seconds and, unfortunately, resets the scrub progress back to the start.
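For reference, bringing a dropped device back looks roughly like this (the pool name pool1 is a placeholder; the provider name is taken from my logs):

```shell
# Re-attach the geli provider (keyfile/passphrase setup omitted here)
geli attach /dev/gpt/pl1d2_NM12JVT3

# Bring the vdev back online; a short resilver follows
zpool online pool1 gpt/pl1d2_NM12JVT3.eli

# ...or reopen all vdevs of the pool in one go
zpool reopen pool1
```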
Here is what /var/log/messages shows me when one of the devices drops out:
Code:
Dec 30 14:20:49 kernel: ugen0.16: <Seagate M3> at usbus0 (disconnected)
Dec 30 14:20:49 kernel: umass4: at uhub7, port 2, addr 15 (disconnected)
Dec 30 14:20:49 kernel: da4 at umass-sim4 bus 4 scbus5 target 0 lun 0
Dec 30 14:20:49 kernel: da4: <Seagate M3 0707> s/n NM12JVT3 detached
Dec 30 14:20:49 kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/pl1d2_NM12JVT3.eli[READ(offset=339987775488, length=1007616)]
Dec 30 14:20:49 kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/pl1d2_NM12JVT3.eli[READ(offset=339988787200, length=1028096)]
Dec 30 14:20:49 kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/pl1d2_NM12JVT3.eli[READ(offset=270336, length=8192)]
Dec 30 14:20:49 kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/pl1d2_NM12JVT3.eli[READ(offset=1999306498048, length=8192)]
Dec 30 14:20:49 kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/pl1d2_NM12JVT3.eli[READ(offset=1999306760192, length=8192)]
Dec 30 14:20:49 kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/pl1d2_NM12JVT3.eli[READ(offset=339989815296, length=1015808)]
Dec 30 14:20:50 ZFS[8396]: vdev state changed, pool_guid=$2824558670284347164 vdev_guid=$8875389928371428699
Dec 30 14:20:50 ZFS[8397]: vdev is removed, pool_guid=$2824558670284347164 vdev_guid=$8875389928371428699
Dec 30 14:20:50 kernel: GEOM_ELI: Device gpt/pl1d2_NM12JVT3.eli destroyed.
Dec 30 14:20:50 kernel: GEOM_ELI: Detached gpt/pl1d2_NM12JVT3.eli on last close.
Dec 30 14:20:50 kernel: (da4:umass-sim4:4:0:0): Periph destroyed
Dec 30 14:20:50 kernel: umass4: detached
Dec 30 14:20:53 ZFS[8398]: vdev state changed, pool_guid=$2824558670284347164 vdev_guid=$8875389928371428699
Dec 30 14:20:53 kernel: ugen0.16: <Seagate M3> at usbus0
Dec 30 14:20:53 kernel: umass4 on uhub7
Dec 30 14:20:53 kernel: umass4: <Seagate M3, class 0/0, rev 3.00/7.07, addr 20> on usbus0
Dec 30 14:20:53 kernel: umass4: SCSI over Bulk-Only; quirks = 0x8100
Dec 30 14:20:53 kernel: umass4:5:4: Attached to scbus5
Dec 30 14:20:53 kernel: da4 at umass-sim4 bus 4 scbus5 target 0 lun 0
Dec 30 14:20:53 kernel: da4: <Seagate M3 0707> Fixed Direct Access SPC-4 SCSI device
Dec 30 14:20:53 kernel: da4: Serial Number NM12JVT3
Dec 30 14:20:53 kernel: da4: 400.000MB/s transfers
Dec 30 14:20:53 kernel: da4: 1907729MB (3907029167 512 byte sectors)
Dec 30 14:20:53 kernel: da4: quirks=0x2<NO_6_BYTE>
Dec 30 14:20:53 kernel: GEOM_ELI: Device gpt/pl1d2_NM12JVT3.eli created.
Dec 30 14:20:53 kernel: GEOM_ELI: Encryption: AES-XTS 128
Dec 30 14:20:53 kernel: GEOM_ELI: Crypto: hardware
Some smartctl output for the same device:
Code:
=== START OF INFORMATION SECTION ===
Device Model: ST2000LM007-1R8174
Serial Number: WDZ19F80
LU WWN Device Id: 5 000c50 09e05fef7
Firmware Version: SBK2
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Dec 27 21:00:02 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
...
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 079 064 006 Pre-fail Always - 72357960
3 Spin_Up_Time 0x0003 097 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 428
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 074 060 045 Pre-fail Always - 24087191
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 1873 (76 138 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 423
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 1
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 062 049 040 Old_age Always - 38 (Min/Max 18/44)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 11
193 Load_Cycle_Count 0x0032 082 082 000 Old_age Always - 36771
194 Temperature_Celsius 0x0022 038 051 000 Old_age Always - 38 (0 15 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 1286 (37 165 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 2456324202
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 4697227613
254 Free_Fall_Sensor 0x0032 100 100 000 Old_age Always - 0
What I have tried so far: For some time I thought it might be a temperature issue, but with the new pause feature (zpool scrub -p) I could rule that out. In a script loop I scrubbed for a while, paused the scrub for a while to let the devices cool down, and then resumed it. I monitored the device temperatures throughout, and they basically stay below 40°C the whole time.
To reduce the load on the bus and on the devices I also tried to set
Code:
vfs.zfs.top_maxinflight=1
but the issue remains.
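For reference, the cool-down loop from the temperature test looked roughly like this (the pool name pool1, the device list, and the sleep intervals are placeholders):

```shell
#!/bin/sh
# Alternate between scrubbing and pausing so the drives can cool down,
# logging the drive temperatures while the scrub is paused.
POOL=pool1                              # placeholder pool name
while true; do
        zpool scrub "$POOL"             # start (or resume a paused) scrub
        sleep 1800                      # let it scrub for 30 minutes
        zpool scrub -p "$POOL"          # pause the scrub
        for dev in /dev/da0 /dev/da1 /dev/da2 /dev/da3 /dev/da4 /dev/da5; do
                smartctl -A "$dev" | grep -i Temperature_Celsius
        done
        sleep 900                       # cool down for 15 minutes
done
```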
I also tried to attach the devices via ggatec(8)/ggated(8) over the loopback interface, because I hoped that would add some buffering, but with no positive result either.
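The ggate detour was set up roughly like this: each device is exported to localhost, and geli/ZFS then use the resulting ggate device instead of the raw one (paths are illustrative):

```shell
# /etc/gg.exports on the same machine, one line per exported device:
#   127.0.0.1 RW /dev/gpt/pl1d2_NM12JVT3

ggated                                          # start the ggate daemon

# Create a local ggate device backed by the exported provider;
# this creates e.g. /dev/ggate0, which geli then attaches to
ggatec create -o rw 127.0.0.1 /dev/gpt/pl1d2_NM12JVT3
```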
What I would like is either to make the USB connection more stable, or to layer a GEOM class in between that does not immediately report a missing device to the layers above it. Any idea or link on how to tackle or further analyze this issue, or a pointer in the direction of a solution, is highly appreciated.