Solved Scrub cannot finish due to quick USB disconnect/reconnect

Hi all!

I have a 6x2TB raidz3 pool (the devices are Seagate STSHX-M201TCBM). The devices are encrypted with geli and connected through a TP-Link UH720 USB 3.0 hub to an Intel NUC running FreeBSD (11.x-RELEASE for about two years, 12.0-RELEASE since a few days ago). What bothers me is that I have not been able to get a scrub to finish for some months now. The pool currently has about 4 TB allocated. At some point during a scrub (I reach about 25% to 30% at most), one or two of the six devices randomly drop off the USB bus and then immediately reconnect; it is a different device every time, so this is not a problem with a single drive. ZFS then marks the device as REMOVED. When I manually online the device or reopen the pool, a resilver of course runs for a few seconds and unfortunately resets the scrub progress back to the start.
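For reference, bringing a dropped device back looks roughly like this (the pool name is a placeholder; the vdev name is taken from the log below):
Code:
# re-online a single REMOVED vdev ...
zpool online <pool> gpt/pl1d2_NM12JVT3.eli
# ... or reopen all vdevs of the pool at once
zpool reopen <pool>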

Here is what /var/log/messages shows when one of the devices drops out:
Code:
Dec 30 14:20:49 kernel: ugen0.16: <Seagate M3> at usbus0 (disconnected)
Dec 30 14:20:49 kernel: umass4: at uhub7, port 2, addr 15 (disconnected)
Dec 30 14:20:49 kernel: da4 at umass-sim4 bus 4 scbus5 target 0 lun 0
Dec 30 14:20:49 kernel: da4: <Seagate M3 0707>  s/n NM12JVT3 detached
Dec 30 14:20:49 kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/pl1d2_NM12JVT3.eli[READ(offset=339987775488, length=1007616)]
Dec 30 14:20:49 kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/pl1d2_NM12JVT3.eli[READ(offset=339988787200, length=1028096)]
Dec 30 14:20:49 kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/pl1d2_NM12JVT3.eli[READ(offset=270336, length=8192)]
Dec 30 14:20:49 kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/pl1d2_NM12JVT3.eli[READ(offset=1999306498048, length=8192)]
Dec 30 14:20:49 kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/pl1d2_NM12JVT3.eli[READ(offset=1999306760192, length=8192)]
Dec 30 14:20:49 kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/pl1d2_NM12JVT3.eli[READ(offset=339989815296, length=1015808)]
Dec 30 14:20:50 ZFS[8396]: vdev state changed, pool_guid=$2824558670284347164 vdev_guid=$8875389928371428699
Dec 30 14:20:50 ZFS[8397]: vdev is removed, pool_guid=$2824558670284347164 vdev_guid=$8875389928371428699
Dec 30 14:20:50 kernel: GEOM_ELI: Device gpt/pl1d2_NM12JVT3.eli destroyed.
Dec 30 14:20:50 kernel: GEOM_ELI: Detached gpt/pl1d2_NM12JVT3.eli on last close.
Dec 30 14:20:50 kernel: (da4:umass-sim4:4:0:0): Periph destroyed
Dec 30 14:20:50 kernel: umass4: detached
Dec 30 14:20:53 ZFS[8398]: vdev state changed, pool_guid=$2824558670284347164 vdev_guid=$8875389928371428699
Dec 30 14:20:53 kernel: ugen0.16: <Seagate M3> at usbus0
Dec 30 14:20:53 kernel: umass4 on uhub7
Dec 30 14:20:53 kernel: umass4: <Seagate M3, class 0/0, rev 3.00/7.07, addr 20> on usbus0
Dec 30 14:20:53 kernel: umass4:  SCSI over Bulk-Only; quirks = 0x8100
Dec 30 14:20:53 kernel: umass4:5:4: Attached to scbus5
Dec 30 14:20:53 kernel: da4 at umass-sim4 bus 4 scbus5 target 0 lun 0
Dec 30 14:20:53 kernel: da4: <Seagate M3 0707> Fixed Direct Access SPC-4 SCSI device
Dec 30 14:20:53 kernel: da4: Serial Number NM12JVT3
Dec 30 14:20:53 kernel: da4: 400.000MB/s transfers
Dec 30 14:20:53 kernel: da4: 1907729MB (3907029167 512 byte sectors)
Dec 30 14:20:53 kernel: da4: quirks=0x2<NO_6_BYTE>
Dec 30 14:20:53 kernel: GEOM_ELI: Device gpt/pl1d2_NM12JVT3.eli created.
Dec 30 14:20:53 kernel: GEOM_ELI: Encryption: AES-XTS 128
Dec 30 14:20:53 kernel: GEOM_ELI:     Crypto: hardware

Some smartctl output for the same device:
Code:
=== START OF INFORMATION SECTION ===
Device Model:     ST2000LM007-1R8174
Serial Number:    WDZ19F80
LU WWN Device Id: 5 000c50 09e05fef7
Firmware Version: SBK2
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Dec 27 21:00:02 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

...

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   079   064   006    Pre-fail  Always       -       72357960
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       428
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   074   060   045    Pre-fail  Always       -       24087191
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1873 (76 138 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       423
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   062   049   040    Old_age   Always       -       38 (Min/Max 18/44)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       11
193 Load_Cycle_Count        0x0032   082   082   000    Old_age   Always       -       36771
194 Temperature_Celsius     0x0022   038   051   000    Old_age   Always       -       38 (0 15 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       1286 (37 165 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2456324202
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4697227613
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0


What I have done so far: For some time I thought it might be a temperature issue, but with the new zpool scrub -p feature I could rule that out. With a script loop I scrubbed for a while, paused the scrub to let the devices cool down, and then continued. I monitored the device temperatures and they basically stayed below 40°C the whole time.
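A rough sketch of that loop (the pool name and the intervals are placeholders, not my exact values):
Code:
#!/bin/sh
# scrub in chunks: run for a while, pause so the disks can cool down, resume
POOL=mypool
zpool scrub "$POOL"
while zpool status "$POOL" | grep -q "scrub in progress"; do
        sleep 1800                                  # let the scrub run for 30 minutes
        zpool scrub -p "$POOL" 2>/dev/null || break # pause; fails if the scrub already finished
        sleep 900                                   # cool-down period
        zpool scrub "$POOL"                         # resume
done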

I tried to set
Code:
vfs.zfs.top_maxinflight=1
to reduce the load on the bus and the devices, but the issue persists.
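I set it at runtime, roughly like this:
Code:
# runtime setting; it can also go into /etc/sysctl.conf to persist across reboots
sysctl vfs.zfs.top_maxinflight=1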

I also tried attaching the devices through ggated(8) over the loopback interface, because I hoped it would add some buffering, but with no positive result either.
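Roughly what I did per device (from memory, so the paths and the exact layering may differ; shown for one of the six providers):
Code:
# export the provider over the loopback interface ...
echo "127.0.0.1 RW /dev/gpt/pl1d2_NM12JVT3.eli" >> /etc/gg.exports
ggated
# ... and attach it locally, which creates /dev/ggate0 for the pool to use
ggatec create -o rw 127.0.0.1 /dev/gpt/pl1d2_NM12JVT3.eli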

What I would like is either to make the USB connection more stable, or to layer a GEOM class in between that does not immediately propagate a missing device to the layers above it. Any idea or link on how to tackle or further analyze the issue, or a pointer in some direction towards a solution, is highly appreciated.
 
Bad cabling or insufficient power, perhaps? Can you try moving one or two of the hard drives to the motherboard USB ports? Is it always the same hard drive that drops out, or does it change? If it is the same one, try a different cable.
 
I have the exact same disk at home:
Code:
# smartctl -a -d sat /dev/da0
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-RELEASE-p7 i386] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST2000LM007-1R8174
Serial Number:    ZDZ0BW5E
LU WWN Device Id: 5 000c50 0a44a8bcf
Firmware Version: SBK2
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Jan  2 19:18:40 2019 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
No problems with it. It is only used every two hours for a few minutes (it is my external backup disk), and in between it goes to sleep. It sleeps so deeply that ZFS disconnects it, but every time it is needed it immediately reconnects without any problem. The one weird thing I notice is that it gets rather warm internally when in use: every few days it gets scrubbed, and the temperature goes significantly over 40 degrees (which is a bit warm for my taste).

Here's my question for you: What type of USB port do you have it connected to? Mine is a USB 3.0 port right on the motherboard. I suspect that you have a USB or power problem.
 
Hi! Thanks for your replies so far.

Before both of you replied, I had actually started another scrub on the morning of January 1st. Yesterday I wasn't at home and had the machine turned off, and today I continued the scrub. I also enabled some of the USB debug sysctls right from the start of the scrub, in the hope of catching some meaningful output when a device drops out. The debug logging made the transfer rates drop from about 40 MB/s to about 15 MB/s. I have disabled the xhci debug flag again, since with it enabled the transfer rates drop to 5 MB/s. And guess what, the scrub has climbed above 40% by now. I have not gotten that far in a single try within the last months. It currently reports about 24 hours to go, which effectively doubles in my case, since I have to shut the machine down during the night.
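For reference, the knobs I mean are along these lines (the exact set and the values are from memory; higher values mean more verbose output):
Code:
sysctl hw.usb.debug=1
sysctl hw.usb.umass.debug=1
sysctl hw.usb.xhci.debug=1   # this one is what made the transfer rate collapse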

I don't expect cabling to be the issue here; if it were, I would expect problems right from the start, and I should not have been able to get to 25% of a scrub in the past, should I?

I had not thought about a power issue yet. The spec of the USB hub's power supply says it can deliver 12 V / 3.5 A, which I considered sufficient for six 2.5" drives. And shouldn't power consumption be more of an issue when spinning up the disks? I would expect spikes in power consumption just then. But maybe the voltage is not that stable over time, even at lower consumption, once the disks are already spinning. I will put a measurement device in between the next time I power down the hub and see what it actually draws during spin-up and during normal operation.

One more observation: I additionally have two single-disk backup pools for my main pool. One is a 2 TB disk (an older model than the ones in the main pool) that I store offsite and that receives only the most important ZFS file systems from time to time; the other is a 4 TB disk I store at home. I have been able to scrub both of them without an error so far, also during the last months, and also while plugged into the same hub (one of the two at a time) with the other six disks up and spinning, although I was not doing any significant I/O on the six main disks concurrently at the time.

To your questions:

1. It's always a different device that fails, so no single disk seems to be the problem.

2. I have connected the devices to the USB3 hub and the hub to the USB3 port of the NUC.

3. Regarding temperature: when I scrubbed the pool on 11.x-RELEASE there was no way to pause the scrub, so the temperature sometimes got as high as 48°C for some devices, which is much too high for my taste as well. With the switch to 12.0-RELEASE two things changed: a) I can pause and resume the scrub as described in my first post (only then did the temperature stay below 40°C the whole time), and b) 12.0 now has sequential scrubs and resilvers. When I let the scrub run without pausing and without USB debugging enabled, the temperature now rises "only" up to 44°C, even if it runs for some hours. I think this is the effect of sorting the I/O requests for the leaf vdevs, so the disk heads do not need to seek back and forth all the time.

When the scrub fails again, I will try Bobi B.'s suggestion to connect two of the six devices directly to the NUC's other two USB ports and see whether that makes a difference to stability and also to power consumption.

One more question: when looking through the USB tunables I found the following timing-related ones, but could not find a detailed explanation for them, neither in the man pages nor when grepping through the source code and reading ./sys/dev/usb/usb_debug.c and usb_debug.h:

Code:
hw.usb.timings.extra_power_up_time
hw.usb.timings.resume_recovery
hw.usb.timings.resume_wait
hw.usb.timings.resume_delay
hw.usb.timings.set_address_settle
hw.usb.timings.port_resume_delay
hw.usb.timings.port_powerup_delay
hw.usb.timings.port_reset_recovery
hw.usb.timings.port_root_reset_delay
hw.usb.timings.port_reset_delay

What do these actually do, especially the ones named resume and recovery?
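The most I have found so far are the one-line descriptions that sysctl -d prints, which do not really explain the semantics:
Code:
# prints the short kernel description for every OID under hw.usb.timings
sysctl -d hw.usb.timings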
 
Just wanted to report back on the issue after several days. The scrub mentioned above, with several USB debugging sysctls enabled, actually finished successfully. More importantly, my pool now also scrubs fine with default settings when I connect two of the six disks directly to the NUC's USB ports, as suggested above. Thanks so much, solved.
 