Solved HDD detached randomly, no errors in the logs

nerozero · Aug 10, 2020

Hello,

A hard drive detaches and reattaches at random. I have not detected any performance issues while it is attached. Log below.
This is one of the disks in a ZFS mirror; the first disk has no issues. There were no power losses or other power issues.

Code:

Jul 23 11:57:50 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
Jul 23 11:57:50 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> ACS-3 ATA SATA 3.x device
Jul 23 11:57:50 BSD01 kernel: ada1: Serial Number SERIAL
Jul 23 11:57:50 BSD01 kernel: ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Jul 23 11:57:50 BSD01 kernel: ada1: Command Queueing enabled
Jul 23 11:57:50 BSD01 kernel: ada1: 3815447MB (7814037168 512 byte sectors)
Jul 23 12:06:44 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
Jul 23 12:06:44 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> ACS-3 ATA SATA 3.x device
Jul 23 12:06:44 BSD01 kernel: ada1: Serial Number SERIAL
Jul 23 12:06:44 BSD01 kernel: ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Jul 23 12:06:44 BSD01 kernel: ada1: Command Queueing enabled
Jul 23 12:06:44 BSD01 kernel: ada1: 3815447MB (7814037168 512 byte sectors)
Aug  7 13:06:42 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
Aug  7 13:06:42 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> s/n SERIAL detached
Aug  7 13:06:42 BSD01 kernel: (ada1:ahcich2:0:0:0): Periph destroyed
Aug  7 13:06:49 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
Aug  7 13:06:49 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> ACS-3 ATA SATA 3.x device
Aug  7 13:06:49 BSD01 kernel: ada1: Serial Number SERIAL
Aug  7 13:06:49 BSD01 kernel: ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Aug  7 13:06:49 BSD01 kernel: ada1: Command Queueing enabled
Aug  7 13:06:49 BSD01 kernel: ada1: 3815447MB (7814037168 512 byte sectors)
Aug  8 03:07:32 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
Aug  8 03:07:32 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> s/n SERIAL detached
Aug  8 03:07:32 BSD01 kernel: (ada1:ahcich2:0:0:0): Periph destroyed
Aug  8 03:07:38 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
Aug  8 03:07:38 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> ACS-3 ATA SATA 3.x device
Aug  8 03:07:38 BSD01 kernel: ada1: Serial Number SERIAL
Aug  8 03:07:38 BSD01 kernel: ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Aug  8 03:07:38 BSD01 kernel: ada1: Command Queueing enabled
Aug  8 03:07:38 BSD01 kernel: ada1: 3815447MB (7814037168 512 byte sectors)
Aug  8 16:17:49 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
Aug  8 16:17:49 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> s/n SERIAL detached
Aug  8 16:17:49 BSD01 kernel: (ada1:ahcich2:0:0:0): Periph destroyed
Aug  8 16:17:56 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
Aug  8 16:17:56 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> ACS-3 ATA SATA 3.x device
Aug  8 16:17:56 BSD01 kernel: ada1: Serial Number SERIAL
Aug  8 16:17:56 BSD01 kernel: ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Aug  8 16:17:56 BSD01 kernel: ada1: Command Queueing enabled
Aug  8 16:17:56 BSD01 kernel: ada1: 3815447MB (7814037168 512 byte sectors)
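
A log like the one above can be condensed into one line per dropout. The following is only a sketch, assuming the syslog layout shown here (device name in the sixth field, detach lines ending in "detached", attach lines ending in "device"); the function name `detach_gaps` is made up for illustration.

```shell
# Pair each "detached" line with the next attach line for the same
# device and print how long the device was gone (hh:mm:ss arithmetic
# only, with a simple wrap at midnight).
detach_gaps() {
    awk '
        function secs(hms,    p) { split(hms, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        / detached$/ { down[$6] = secs($3); stamp[$6] = $1 " " $2 " " $3; next }
        / device$/ && ($6 in down) {
            gap = secs($3) - down[$6]
            if (gap < 0) gap += 86400
            print $6, "detached", stamp[$6] ", back after", gap "s"
            delete down[$6]
        }
    ' "$@"
}

# Usage on the affected host: detach_gaps /var/log/messages
```

With the log in this post, each dropout would show up as a gap of a few seconds.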

George · Aug 10, 2020

Maybe you get more info in a verbose boot.
Also, which version of FreeBSD?

nerozero · Aug 10, 2020

Elazar, this happened on a live system, not during boot... I have no boot issues.

Code:

# uptime
11:06AM  up 17 days, 19:01, 2 users, load averages: 0.34, 0.28, 0.26

# uname -a
FreeBSD BSD01 12.1-RELEASE-p7 FreeBSD 12.1-RELEASE-p7 GENERIC amd64

diizzy · Aug 10, 2020

If you move it to another SATA port, does it still occur?
Does smartmontools/smartctl print anything out of the ordinary?
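
For anyone following along, the smartctl checks could look roughly like this. It is a sketch that assumes smartmontools is installed and that the disk is /dev/ada1; the `smart_report` name and the `SMARTCTL` override are hypothetical conveniences, not part of smartmontools.

```shell
# Collect the usual SMART views for one disk.
# SMARTCTL lets the binary be swapped out (e.g. for a dry run).
smart_report() {
    dev="${1:-/dev/ada1}"
    smartctl="${SMARTCTL:-smartctl}"
    "$smartctl" -H "$dev"            # overall health self-assessment
    "$smartctl" -A "$dev"            # vendor attributes, e.g. UDMA_CRC_Error_Count
    "$smartctl" -l error "$dev"      # the error log kept by the drive
    "$smartctl" -l selftest "$dev"   # results of earlier self-tests
}

# Usage (as root): smart_report /dev/ada1
# A long self-test is started separately: smartctl -t long /dev/ada1
```

A rising CRC error count would point towards cabling, while reallocated or pending sectors would point towards the disk itself.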

nerozero · Aug 10, 2020

The server will not be physically accessible until next Friday...

I only have terminal access.

acheron · Aug 10, 2020

Defective cable?

PMc · Aug 10, 2020

Ah yes. I was seeing similar with an ST3000DM008.

In my case, it would prevent a proper boot by going offline when the pools are first scanned and then coming back 10 seconds later. So that's the same behaviour as we see above, but here it happened only during boot. This is the only point when all disks (about ten, of different brands) seek at the same time, and certainly the point with the highest power consumption.

In my case the problem disappeared after unplugging and replugging the power connector on the disk (which had been properly seated before).

Conclusions:

This type of power connector is unsuited to its purpose, otherwise this could not happen. I think they are even worse than the 4-pin plugs, which were utter crap too.
That Seagate disk of mine seems to draw a serious pulse load and seems to be quite sensitive to voltage fluctuations.

Recommendation to OP:

Check server activity for some specific load pattern common to the dropout times, with the focus on disk seek activity and overall power consumption, to see if there is a similar relation.
Check/reseat/refurbish the power wiring. If there are any intermediate plugs, get rid of them and replace them with screw joints.
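
One rough way to start on the first recommendation is to bucket the dropouts by hour of day and compare the result with cron jobs, scrubs, backups and similar scheduled load. This is a sketch only, assuming the standard syslog layout with the time in the third field; `detach_hours` is a made-up name.

```shell
# Count "detached" events per hour of day (output: "HH count").
detach_hours() {
    awk '
        / detached$/ { split($3, t, ":"); n[t[1]]++ }
        END { for (h in n) print h, n[h] }
    ' "$@" | sort
}

# Usage on the affected host: detach_hours /var/log/messages
```

A cluster around one hour would suggest a scheduled job; an even spread would make a load-related cause less likely.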

nerozero · Aug 10, 2020

acheron said:
Defective cable?

Definitely not.

Mjölnir · Aug 10, 2020

Rule of thumb: random/sporadic errors -> hardware issues (including slack joints).
How old are the disks? Is it always the same disk that fails? If this system is commercial-grade hardware, the power supply unit should be able to handle the load. BUT these (their electronic parts) age, too. I'd suggest replacing the disk if it's always the same one that fails.

nerozero · Aug 10, 2020

mjollnir said:
How old are the disks?

2-3 weeks old

mjollnir said:
Is it always the same disk that fails?

Yes, the ADA1, even when disks switched places.

mjollnir said:
If this system is commercial-grade hardware, the power supply unit should be able to handle the load.

A 500W power supply with an Intel i5 CPU; the highest load so far was around 120W. Note that this is a generic desktop Asus motherboard.

mjollnir said:
I'd suggest replacing the disk if it's always the same one that fails.

It is definitely a FreeBSD/ZFS issue, and not the hardware.
The disk can handle full load for a couple of hours, tested. I have a strong feeling that the disk is going to sleep or something when the system has almost 0 load.

Mjölnir · Aug 10, 2020

nerozero said:
It is definitely a FreeBSD/ZFS issue, and not the hardware.
The disk can handle full load for a couple of hours, tested. I have a strong feeling that the disk is going to sleep or something when the system has almost 0 load.

I insist: sporadic or random failures are most likely hardware issues.
A new device can be faulty; it's just less likely than for an old one. If the disks are of the same model, the other disk would be "going to sleep or something", too. Thus: replace the disk. This is not about some feeling you have, but about combining the facts. Yes, a feeling based on experience can help identify the root cause of a failure. But in this particular case, why would an error in the OS or its ZFS implementation only affect the same disk out of two, even when you switch their order in the box?

nerozero · Aug 10, 2020

mjollnir said:
I insist: sporadic or random failures are most likely hardware issues.

Agreed. A couple of hours ago I started a full hard drive self-test (smartctl -t long ....); it will take a while. I will report back here.

T-Daemon · Aug 10, 2020

nerozero said:
mjollnir said:
Is it always the same disk that fails?

Yes, the ADA1, even when disks switched places.

By ADA1, do you mean the disk that used to be ada1, or whichever disk is connected to the SATA port that ADA1 was on before and so becomes ada1? Disk device names, in this case, follow the SATA port the disk is connected to. If disk A is ada0 and disk B is ada1, then after they switch places disk A becomes ada1 and disk B becomes ada0.
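
Because adaN names follow the port, the serial number is the reliable way to tell which physical disk is dropping out. A small sketch, assuming the `camcontrol identify` layout where the serial appears on a line starting with "serial number"; `disk_serial` is a made-up helper name.

```shell
# Read `camcontrol identify <dev>` output on stdin and print the serial.
disk_serial() {
    awk '/^serial number/ { print $3 }'
}

# Usage on the FreeBSD host (not run here):
#   for d in ada0 ada1; do
#       printf '%s: ' "$d"; camcontrol identify "$d" | disk_serial
#   done
```

Noting the serials before and after swapping cables shows whether the fault follows the disk or stays with the port.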

T-Daemon · Aug 10, 2020

nerozero said:
... a couple of hours ago I started a full hard drive self-test (smartctl -t long ....)

It might interest you that Seagate has its own diagnostic tools to check the health of a hard disk:

Code:

Downloads

SeaTools Bootable
The quick USB diagnostic tool that checks the health of your drive.

nerozero · Aug 11, 2020

T-Daemon said:
By ADA1, do you mean the disk that used to be ada1

Yes, after the first error I switched the disks' places. I did it by unplugging the SATA cables from the motherboard. This is the basis of my "strong feeling" that this should not be a cable/power issue; maybe the motherboard?
Next time I am there I will switch them again and test once more with a different SATA port.

T-Daemon said:
It might interest you that Seagate has its own diagnostic tools

Yes, I know about this tool. As I noted above, I will have no physical access to this machine until next Friday (next Monday if everything goes well).
Besides, the SMART test is also a Seagate test, one that runs independently inside the disk.

Seagate Long Test (SMART) results:

Code:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       442         -  <<< yesterday's test, no errors/logs were detected
# 2  Extended offline    Interrupted (host reset)      00%       435         -  <<< previous test (2 days ago), which was aborted by the disk's random self-detach, with no logs

The machine runs 24/7; the disks are 18 days old (442 hours).

To give you a sense of how often this "detaching" happens:

Code:

grep 'ada.*detached' /var/log/messages
Aug  7 13:06:42 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> s/n SERIAL detached
Aug  8 03:07:32 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> s/n SERIAL detached
Aug  8 16:17:49 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> s/n SERIAL detached
Aug 10 18:38:01 BSD01 kernel: ada1: <ST4000VX007-2DT166 CV11> s/n SERIAL detached

richardtoohey2 · Aug 11, 2020

nerozero said:
I have a strong feeling that the disk is going to sleep or something when the system has almost 0 load.

That sounds vaguely familiar ... but 3 out of 4 of your messages in the last post seemed to be in the daytime so depends on your load.

You're not imagining it, e.g. there is this sort of thing. These are OLD links (and usually about external drives), but they show it's not outside the realms of possibility:

My drive is sleeping too much in Windows | Support Seagate US

Troubleshooting when external drives sleep too often or irregularly, or do not wake up properly in Windows.

www.seagate.com

Seagate HDD Won't stay awake?

Have a 3 TB Seagate HD that used to be an external but decided i'd rather just have it in the case, but now, I can't stop the thing from going to sleep and it won't wake up when I want to access data, I have to reboot... Quite annoying... I've tried the seagate software but when i go to set the...

forums.tomshardware.com

Also wasn't there something recently about some drive where if the connector wasn't "just so" then it could fall asleep/go into power-saving mode ... sorry that's VERY vague and can't remember where I was reading that ... I'll see if I can find it!

richardtoohey2 · Aug 11, 2020

Not what I was looking for, but looks like you had a similar issue last year? https://forums.freebsd.org/threads/...y-keeps-turning-on-and-off.70452/#post-424678

nerozero · Aug 11, 2020

richardtoohey2 said:
Not what I was looking for, but looks like you had a similar issue last year? https://forums.freebsd.org/threads/...y-keeps-turning-on-and-off.70452/#post-424678

No, that was a different machine and a Seagate issue, fixed by a hard drive firmware update, at least for me. But the symptoms are much the same, only here there are no error messages at all.

nerozero · Aug 11, 2020

richardtoohey2 said:
That sounds vaguely familiar ... but 3 out of 4 of your messages in the last post seemed to be in the daytime so depends on your load.

Also wasn't there something recently about some drive where if the connector wasn't "just so" then it could fall asleep/go into power-saving mode ... sorry that's VERY vague and can't remember where I was reading that ... I'll see if I can find it!

That could be it, but these drives (my current ones) don't seem to support Advanced Power Management (which is responsible for sleeping):

Code:

camcontrol apm /dev/ada1
camcontrol: ATA SETFEATURES DISABLE APM failed
root@BSD1:/home/nerozero # camcontrol identify /dev/ada1
pass1: <ST4000VX007-2DT166 CV11> ACS-3 ATA SATA 3.x device
pass1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)

protocol              ACS-3 ATA SATA 3.x
device model          ST4000VX007-2DT166
firmware revision     CV11
serial number         SERIAL
WWN                   5000c500a8089c46
additional product id
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 4096, offset 0
LBA supported         268435455 sectors
LBA48 supported       7814037168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             5980
Zoned-Device Commands no

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes    yes
write cache                    yes    yes
flush cache                    yes    yes
Native Command Queuing (NCQ)   yes        32 tags
NCQ Priority Information       no
NCQ Non-Data Command           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    yes
NCQ Autosense                  yes
SMART                          yes    yes
security                       yes    no
power management               yes    yes
microcode download             yes    yes
advanced power management      no    no  <<<<<<<<<<<<<<<<<<< No APM support, no sleep schedule
automatic acoustic management  no    no
media status notification      no    no
power-up in Standby            yes    no
write-read-verify              yes    no    0/0x0
unload                         yes    yes
general purpose logging        yes    yes
free-fall                      no    no
sense data reporting           yes    no
extended power conditions      yes    yes
device statistics notification no    no
Data Set Management (DSM/TRIM) no
Trusted Computing              no
encrypts all user data         no
Sanitize                       yes        overwrite,
Sanitize - commands allowed    yes
Sanitize - antifreeze lock     yes
Host Protected Area (HPA)      yes      no      7814037168/7814037167
HPA - Security                 yes      no
Accessible Max Address Config  no
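
To pull just the power-related rows out of a listing this long, a filter along these lines may help. It is a sketch that assumes the feature names appear at the start of the line exactly as printed above; `power_features` is a made-up name.

```shell
# Keep only the power-related feature rows from
# `camcontrol identify <dev>` output read on stdin.
power_features() {
    grep -E '^(power management|advanced power management|extended power conditions|power-up in Standby) '
}

# Usage on the FreeBSD host (not run here):
#   camcontrol identify ada1 | power_features
```

Running it against both mirror members makes it easy to compare their power settings side by side.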

richardtoohey2 · Aug 11, 2020

Out of my league now ... but what is extended power conditions about - https://www.seagate.com/files/docs/pdf/en-GB/whitepaper/tp608-powerchoice-tech-provides-gb.pdf - that refers to standby. But I'm hurling red herrings in your path so from me!

nerozero · Aug 11, 2020

richardtoohey2 said:
Out of my league now ... but what is extended power conditions about - https://www.seagate.com/files/docs/pdf/en-GB/whitepaper/tp608-powerchoice-tech-provides-gb.pdf - that refers to standby. But I'm hurling red herrings in your path so from me!

Now that is interesting, thank you!

nerozero · Aug 11, 2020

So I have disabled EPC (Extended Power Conditions) now; let's see the results:

Code:

root@BSD1:/home/nerozero # camcontrol epc /dev/ada1 -c status
APM: NOT Supported, NOT Enabled
EPC: Supported, Enabled # <<<<<< was enabled
Low Power Standby NOT Supported
Set EPC Power Source NOT Supported
Current power state: PM0:Active or PM1:Idle(0xff)
root@BSD1:/home/nerozero # camcontrol epc /dev/ada1 -c disable # <<<<<< disable
root@BSD1:/home/nerozero # camcontrol epc /dev/ada1 -c status
APM: NOT Supported, NOT Enabled
EPC: Supported, NOT Enabled # <<<<<< EPC Disabled
Low Power Standby NOT Supported
Set EPC Power Source NOT Supported
Current power state: PM0:Active or PM1:Idle(0xff)
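
Since both mirror disks are the same model, it may be worth checking the EPC state on each of them, and again after a reboot or power cycle, because this thread does not establish whether the setting persists. A sketch, assuming the status output format shown above; `epc_enabled` is a made-up helper name.

```shell
# Read `camcontrol epc <dev> -c status` output on stdin.
# Exit status 0 if EPC is reported as enabled, non-zero otherwise.
epc_enabled() {
    grep -q '^EPC: Supported, Enabled'
}

# Usage on the FreeBSD host (not run here):
#   for d in ada0 ada1; do
#       camcontrol epc "$d" -c status | epc_enabled && echo "$d: EPC is enabled"
#   done
```

The pattern is anchored so that "EPC: Supported, NOT Enabled" does not count as enabled.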

Mjölnir · Aug 11, 2020

Both disks are the same brand & model?

nerozero · Aug 11, 2020

mjollnir said:
Both disks are the same brand & model?

Exactly the same, bought both for a new internal cloud.

richardtoohey2, Thank you!
8 hours - so far so good, but fingers still crossed.... Let's wait till tomorrow...

nerozero · Aug 12, 2020

Almost 24 hours - no detach. Will wait one more day and then change status to "solution".