Solved HDD detached randomly, no errors in the logs

Hello,

A hard drive detaches and re-attaches at random. I see no performance issues while it is attached. Log below.
This is one of the two disks in a ZFS mirror; the other disk has no issues. There were no power losses or other power problems.

Code:
Jul 23 11:57:50 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
Jul 23 11:57:50 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> ACS-3 ATA SATA 3.x device
Jul 23 11:57:50 BSD01 kernel: ada1: Serial Number SERIAL
Jul 23 11:57:50 BSD01 kernel: ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Jul 23 11:57:50 BSD01 kernel: ada1: Command Queueing enabled
Jul 23 11:57:50 BSD01 kernel: ada1: 3815447MB (7814037168 512 byte sectors)
Jul 23 12:06:44 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
Jul 23 12:06:44 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> ACS-3 ATA SATA 3.x device
Jul 23 12:06:44 BSD01 kernel: ada1: Serial Number SERIAL
Jul 23 12:06:44 BSD01 kernel: ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Jul 23 12:06:44 BSD01 kernel: ada1: Command Queueing enabled
Jul 23 12:06:44 BSD01 kernel: ada1: 3815447MB (7814037168 512 byte sectors)
Aug  7 13:06:42 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
Aug  7 13:06:42 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> s/n SERIAL detached
Aug  7 13:06:42 BSD01 kernel: (ada1:ahcich2:0:0:0): Periph destroyed
Aug  7 13:06:49 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
Aug  7 13:06:49 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> ACS-3 ATA SATA 3.x device
Aug  7 13:06:49 BSD01 kernel: ada1: Serial Number SERIAL
Aug  7 13:06:49 BSD01 kernel: ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Aug  7 13:06:49 BSD01 kernel: ada1: Command Queueing enabled
Aug  7 13:06:49 BSD01 kernel: ada1: 3815447MB (7814037168 512 byte sectors)
Aug  8 03:07:32 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
Aug  8 03:07:32 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> s/n SERIAL detached
Aug  8 03:07:32 BSD01 kernel: (ada1:ahcich2:0:0:0): Periph destroyed
Aug  8 03:07:38 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
Aug  8 03:07:38 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> ACS-3 ATA SATA 3.x device
Aug  8 03:07:38 BSD01 kernel: ada1: Serial Number SERIAL
Aug  8 03:07:38 BSD01 kernel: ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Aug  8 03:07:38 BSD01 kernel: ada1: Command Queueing enabled
Aug  8 03:07:38 BSD01 kernel: ada1: 3815447MB (7814037168 512 byte sectors)
Aug  8 16:17:49 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
Aug  8 16:17:49 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> s/n SERIAL detached
Aug  8 16:17:49 BSD01 kernel: (ada1:ahcich2:0:0:0): Periph destroyed
Aug  8 16:17:56 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
Aug  8 16:17:56 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> ACS-3 ATA SATA 3.x device
Aug  8 16:17:56 BSD01 kernel: ada1: Serial Number SERIAL
Aug  8 16:17:56 BSD01 kernel: ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Aug  8 16:17:56 BSD01 kernel: ada1: Command Queueing enabled
Aug  8 16:17:56 BSD01 kernel: ada1: 3815447MB (7814037168 512 byte sectors)
 
Elazar, this happened on a live system, not during boot. I have no boot issues.
Code:
# uptime
11:06AM  up 17 days, 19:01, 2 users, load averages: 0.34, 0.28, 0.26

# uname -a
FreeBSD BSD01 12.1-RELEASE-p7 FreeBSD 12.1-RELEASE-p7 GENERIC amd64
 
If you move it to another sata port, does it still occur?
Does smartmontools/smartctl print anything out of the ordinary?
 
Ah yes. I was seeing something similar with an ST3000DM008.

In my case, it would prevent a proper boot by going offline when the pools are first scanned and then coming back 10 seconds later. So that's the same behaviour as we see above, but here it happened only during boot. This is the only point when all disks (about ten, of different brands) seek at the same time, and certainly the point with the highest power consumption.

In my case the problem disappeared after unplugging and replugging the power connector on the disk (which had been properly seated before).

Conclusions:
  • This type of power connector is unsuited for its purpose, otherwise this could not happen. I think they are even worse than the 4-pin plugs, which were utter crap too.
  • That Seagate disk of mine seems to draw a serious pulse load and seems to be quite sensitive to voltage fluctuations.
Recommendation to OP:
  • Check server activity for a specific load pattern common to the dropout times, focusing on disk seek activity and overall power consumption, to see if there is a similar relation.
  • Check/reseat/refurbish the power wiring. If there are any intermediate plugs, get rid of them and replace them with screw joints.
 
Rule of thumb: random/sporadic errors -> hardware issues (including loose connections).
How old are the disks? Is it always the same disk that fails? If this system is commercial-grade hardware, the power supply unit should be able to handle the load. BUT power supplies (their electronic parts) age, too. I'd suggest replacing the disk if it's always the same one that fails.
 
How old are the disks?
2-3 weeks old

Is it always the same disk that fails?
Yes, the ADA1, even when disks switched places.

If this system is commercial-grade hardware, the power supply unit should be able to handle the load.
500W power supply with an Intel i5 CPU; the highest load so far was around 120W. Note that this is a generic desktop Asus motherboard.

I'd suggest replacing the disk if it's always the same one that fails.
It is definitely a FreeBSD/ZFS issue, and not the hardware.
The disk can handle full load for a couple of hours; I tested it. I have a strong feeling that the disk is going to sleep or something when the system has almost zero load.
 
It is definitely a FreeBSD/ZFS issue, and not the hardware.
The disk can handle full load for a couple of hours; I tested it. I have a strong feeling that the disk is going to sleep or something when the system has almost zero load.
I insist: sporadic or random failures are most likely hardware issues.
A new device can be faulty too; it's just less likely than for an old one. If the disks are the same model, the other disk would be "going to sleep or something", too. Thus: replace the disk. This is not about some feeling you have, but about combining the facts. Yes, a feeling based on experience can help identify the root cause of a failure. But in this particular case, why would an error in the OS or its ZFS implementation only affect the same disk out of two, even when you switch their order in the box?
 
mjollnir said:
Is it always the same disk that fails?

Yes, the ADA1, even when disks switched places.
By ADA1, do you mean the disk that was formerly ada1, or whichever disk is connected to the SATA port that ada1 was on before and therefore becomes ada1? Disk device names, in this case, follow the SATA port the disk is connected to. If disk A is ada0 and disk B is ada1, then when they switch places disk A becomes ada1 and disk B becomes ada0.
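
One way to take the ambiguity out of this is to track the drive by its serial number instead of its adaN name. A minimal sketch, assuming `camcontrol identify` prints a `serial number` line for the drive; `disk_serial` is a made-up helper name, not an existing tool:

```shell
#!/bin/sh
# Hypothetical helper: read `camcontrol identify` output on stdin
# and print the value of the "serial number" line.
disk_serial() {
    awk '/^serial number/ { print $3; exit }'
}

# Usage on the FreeBSD host (needs root), before and after swapping ports:
#   for d in ada0 ada1; do
#       printf '%s %s\n' "$d" "$(camcontrol identify "$d" | disk_serial)"
#   done
```

If the serial that shows up in the "detached" messages stays the same after the swap, the problem follows the physical drive; if it changes, it follows the port.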
 
By ADA1, do you mean the disk that was formerly ada1
Yes, after the first error I swapped the disks. I did it by unplugging the SATA cables from the motherboard. This is the basis of my "strong feeling" that this is not a cable/power issue. Maybe the motherboard?
Next time I am there I will switch them again and test once more with a different SATA port.


It might interest you: Seagate has its own diagnostic tools
Yes, I know about this tool. I noted above that I will have no physical access to this machine until next Friday (next Monday if everything goes well).
Besides, the SMART test is also a Seagate test, which runs independently inside the disk.

Seagate Long Test (SMART) results:
Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       442         -  <<< yesterday's test, no errors/logs were detected
# 2  Extended offline    Interrupted (host reset)      00%       435         -  <<< previous test (2 days ago), which was aborted by a random self-detach of the disk, with no logs
The machine runs 24/7; the disks are 18 days old (442 hours).


To give you a sense of how often this "detaching" happens:
Code:
cat /var/log/messages | grep ada | grep detached
Aug  7 13:06:42 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> s/n SERIAL detached
Aug  8 03:07:32 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> s/n SERIAL detached
Aug  8 16:17:49 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> s/n SERIAL detached
Aug 10 18:38:01 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> s/n SERIAL detached
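
To line these events up against load, it can help to print each detach together with how long the disk stayed away. A rough sketch, assuming syslog lines shaped like the ones in this thread (a "detached" line, then a "Serial Number" line once the disk is probed again); `detach_report` is a made-up name, and outages that cross midnight are not handled:

```shell
#!/bin/sh
# Hypothetical helper: read syslog-style kernel messages on stdin and print
# one line per detach event with the number of seconds until the re-probe.
detach_report() {
    awk '
    # Convert HH:MM:SS to seconds since midnight.
    function secs(t,   p) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }

    # "... kernel: ada1: <MODEL FW> s/n SERIAL detached"
    $NF == "detached" {
        dev = $6; sub(/:$/, "", dev)
        down[dev] = secs($3)
        when[dev] = $1 " " $2 " " $3
        next
    }

    # "... kernel: ada1: Serial Number SERIAL" marks the re-attach.
    / Serial Number / {
        dev = $6; sub(/:$/, "", dev)
        if (dev in down) {
            printf "%s %s back after %ds\n", when[dev], dev, secs($3) - down[dev]
            delete down[dev]
        }
    }'
}

# Usage:
#   detach_report < /var/log/messages
```

For the Aug 7 event in the first log this prints `Aug 7 13:06:42 ada1 back after 7s`. The timestamps can then be compared with cron jobs, scrubs or backups running at those times.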
 
I have a strong feeling that the disk is going to sleep or something when the system has almost zero load.
That sounds vaguely familiar ... but 3 out of 4 of your messages in the last post seemed to be in the daytime so depends on your load.

You're not imagining it; there is this sort of thing. These are OLD links (and usually about external drives), but they show it's not outside the realms of possibility:


Also wasn't there something recently about some drive where if the connector wasn't "just so" then it could fall asleep/go into power-saving mode ... sorry that's VERY vague and can't remember where I was reading that ... I'll see if I can find it!
 
That sounds vaguely familiar ... but 3 out of 4 of your messages in the last post seemed to be in the daytime so depends on your load.

You're not imagining it; there is this sort of thing. These are OLD links (and usually about external drives), but they show it's not outside the realms of possibility:


Also wasn't there something recently about some drive where if the connector wasn't "just so" then it could fall asleep/go into power-saving mode ... sorry that's VERY vague and can't remember where I was reading that ... I'll see if I can find it!
That may be a thing, but my current hard drives don't seem to support Advanced Power Management (which is responsible for sleeping):
Code:
camcontrol apm /dev/ada1
camcontrol: ATA SETFEATURES DISABLE APM failed
root@BSD1:/home/nerozero # camcontrol identify /dev/ada1
pass1: <ST4000VX007-2DT166 CV11> ACS-3 ATA SATA 3.x device
pass1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)

protocol              ACS-3 ATA SATA 3.x
device model          ST4000VX007-2DT166
firmware revision     CV11
serial number         SERIAL
WWN                   5000c500a8089c46
additional product id
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 4096, offset 0
LBA supported         268435455 sectors
LBA48 supported       7814037168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             5980
Zoned-Device Commands no

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes    yes
write cache                    yes    yes
flush cache                    yes    yes
Native Command Queuing (NCQ)   yes        32 tags
NCQ Priority Information       no
NCQ Non-Data Command           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    yes
NCQ Autosense                  yes
SMART                          yes    yes
security                       yes    no
power management               yes    yes
microcode download             yes    yes
advanced power management      no    no  <<<<<<<<<<<<<<<<<<< No APM support, no sleep schedule
automatic acoustic management  no    no
media status notification      no    no
power-up in Standby            yes    no
write-read-verify              yes    no    0/0x0
unload                         yes    yes
general purpose logging        yes    yes
free-fall                      no    no
sense data reporting           yes    no
extended power conditions      yes    yes
device statistics notification no    no
Data Set Management (DSM/TRIM) no
Trusted Computing              no
encrypts all user data         no
Sanitize                       yes        overwrite,
Sanitize - commands allowed    yes
Sanitize - antifreeze lock     yes
Host Protected Area (HPA)      yes      no      7814037168/7814037167
HPA - Security                 yes      no
Accessible Max Address Config  no
 
So I have disabled EPC (Extended Power Conditions) now; let's see the results:

Code:
root@BSD1:/home/nerozero # camcontrol epc /dev/ada1 -c status
APM: NOT Supported, NOT Enabled
EPC: Supported, Enabled # <<<<<< was enabled
Low Power Standby NOT Supported
Set EPC Power Source NOT Supported
Current power state: PM0:Active or PM1:Idle(0xff)
root@BSD1:/home/nerozero # camcontrol epc /dev/ada1 -c disable # <<<<<< disable
root@BSD1:/home/nerozero # camcontrol epc /dev/ada1 -c status
APM: NOT Supported, NOT Enabled
EPC: Supported, NOT Enabled # <<<<<< EPC Disabled
Low Power Standby NOT Supported
Set EPC Power Source NOT Supported
Current power state: PM0:Active or PM1:Idle(0xff)
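
Nothing here shows whether the EPC setting survives a power cycle or a re-attach, so it may be worth re-checking it after the next event or reboot. A small sketch that reads the status output above; `epc_state` is a made-up helper name and it assumes the `EPC:` line keeps the format shown:

```shell
#!/bin/sh
# Hypothetical helper: read `camcontrol epc <dev> -c status` output on stdin
# and print "enabled", "disabled" or "unknown" based on the "EPC:" line.
epc_state() {
    awk -F': ' '
    /^EPC:/ {
        if ($2 ~ /NOT Enabled/)   print "disabled"
        else if ($2 ~ /Enabled/)  print "enabled"
        else                      print "unknown"
        found = 1
        exit
    }
    END { if (!found) print "unknown" }'
}

# Usage (needs root):
#   camcontrol epc /dev/ada1 -c status | epc_state
```

If it reports "enabled" again after a detach, the disable command would have to be re-run, for example from a startup script.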
 