Seagate Hard disk randomly keeps turning on and off

nerozero · Apr 19, 2019

Hello,

I don't know if this is the right place to discuss this...

I have 2 computers, built on two different Asus motherboards, that show the same behavior with new (green) Seagate Barracuda hard drives (Model Number: ST1000DM010 / Family: BARRACUDA35): the hard drives randomly disconnect and reconnect.
This started after replacing old faulty hard drives with new ones. The replacements happened on the two machines about 2 months apart, with the hard drives bought from different distributors. SMART reports don't show any issues.
The old Seagate drives (not green ones) that are installed in the machines at the same time have never had that issue.
I caught this event once, and it sounds like a normal hard drive [spin down]->[park]->[spin up] cycle, several times in a row, with no other weird noises. The BIOS on both machines has been updated to the latest version available; it had no effect.

Both hard drives were tested under Linux (latest sysrescuecd live) with whdd, read/write of huge files, idle, ... for 2 days (one drive is still running now), with 0 issues.

Here is what I got in dmesg for one of the disks:

Code:

...
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <ST1000DM010-2EP102 CC43> s/n Z9AAWW59 detached
g_access(944): provider ada0 has error 6 set
g_access(944): provider ada0 has error 6 set
g_access(944): provider ada0 has error 6 set
g_access(944): provider ada0 has error 6 set
(ada0:ahcich0:0:0:0): Periph destroyed
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <ST1000DM010-2EP102 CC43> ATA8-ACS SATA 3.x device
ada0: Serial Number Z9AAWW59
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 953869MB (1953525168 512 byte sectors)
ada0: quirks=0x1<4K>
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <ST1000DM010-2EP102 CC43> s/n Z9AAWW59 detached
g_access(944): provider ada0 has error 6 set
g_access(944): provider ada0 has error 6 set
g_access(944): provider ada0 has error 6 set
g_access(944): provider ada0 has error 6 set
...
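
To get a rough picture of how often a drive drops off the bus, the "detached" lines in dmesg can be tallied per device. The following is only a sketch in Python: the function name `count_detach_events` and the regular expression are assumptions based solely on the line format shown in the log above, and the sample is a shortened copy of that log.

```python
import re
from collections import Counter

# Matches lines such as:
#   ada0: <ST1000DM010-2EP102 CC43> s/n Z9AAWW59 detached
DETACH_RE = re.compile(r"^(?P<dev>[a-z]+\d+): <[^>]*> s/n \S+ detached$")


def count_detach_events(dmesg_text):
    """Return a Counter mapping device name -> number of 'detached' lines."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = DETACH_RE.match(line.strip())
        if match:
            counts[match.group("dev")] += 1
    return counts


# Shortened copy of the log from the post above.
SAMPLE = """\
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <ST1000DM010-2EP102 CC43> s/n Z9AAWW59 detached
g_access(944): provider ada0 has error 6 set
(ada0:ahcich0:0:0:0): Periph destroyed
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <ST1000DM010-2EP102 CC43> ATA8-ACS SATA 3.x device
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <ST1000DM010-2EP102 CC43> s/n Z9AAWW59 detached
"""

for dev, n in count_detach_events(SAMPLE).items():
    print(f"{dev}: {n} detach event(s)")
```

Feeding it the full dmesg output (or /var/log/messages, which carries timestamps) would show whether the detaches cluster around idle periods.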

I don't know what to think, if you have some ideas please share.

OS: FreeBSD 11.2-RELEASE, amd64

Thanks in advance

roddierod · Apr 19, 2019

I've had this issue in the past with green drives and FreeBSD. The only way I could ever really solve the issue was to find out how long it took for the drive to "sleep" and create a cron job that would touch a temp file on the drive to keep the drive "awake". So if the drive sleeps after 5 minutes of no activity, I'd set the script to run every 4.5 minutes. Not ideal, but the only way I found that truly worked, until I got rid of the green drives.

There has to be something better, but nothing ever worked for me. So this is a stopgap I came up with.
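
A minimal sketch of such a keep-alive, written in Python rather than as a plain touch so that the write is explicitly flushed to the device. The function name `touch_keepalive` and the file path are hypothetical, and whether a small write actually resets a given drive's idle timer is an assumption that has to be checked on the hardware.

```python
import os
import time


def touch_keepalive(path):
    """Write the current time to `path` and force the data out to the drive."""
    with open(path, "w") as fh:
        fh.write(f"{time.time()}\n")
        fh.flush()
        os.fsync(fh.fileno())  # do not let the write sit in the OS cache
    return path


# Example call (hypothetical mount point on the drive to keep awake):
# touch_keepalive("/mnt/green/.keepalive")
```

Note that cron schedules have one-minute granularity, so a 4.5-minute interval cannot be expressed directly; running the script every 4 minutes (`*/4` in the minute field) is the nearest simple choice.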

T-Daemon · Apr 19, 2019

You could try booting a 12.0-RELEASE installation disk, to see if the problem is also present there.

nerozero · Apr 19, 2019

Thank you for the reply.
I'm planning to run the 12.0-RELEASE tests next week...

ralphbsz · Apr 19, 2019

Is this actually a problem? If the drive spins down when idle, and spins up again, and the SATA driver stack doesn't get any errors in the process, why not let them do it?

Sure, the messages in the log look scary: You are getting error ENXIO (number 6). But that's logical and easy to explain: The lower level part of the driver has noticed that the drive has gone away, and when the higher level part tries to access it, the lower level part responds by saying "this disk does not actually exist right now". But they only look scary, and I don't see how they cause any problems.
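
The mapping from the number in the log to the symbolic name can be checked with a few lines of Python. This is just a lookup sketch; the message text printed by `os.strerror` differs between operating systems, so only the name is meaningful here.

```python
import errno
import os

code = 6  # the value reported by g_access() in the dmesg output
name = errno.errorcode.get(code, "unknown")

# Print the number, its symbolic name, and the local OS's description.
print(code, name, os.strerror(code))
```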

nerozero · Apr 19, 2019

ralphbsz, it is a major issue: the zpool becomes degraded... The hard drive is disconnected.

ralphbsz · Apr 19, 2019

Ah, that's bad. It means that ZFS will start doing things that are unnecessary, and that carries risk. I don't know a way to tell ZFS that a disk is "optional" and to not worry if it goes offline for a while. So you have to configure the drive to not go to sleep. This is easy to do if you can connect the drive to a Windows machine, then you can run the Seagate tools to adjust parameters (do a firmware update at the same time, disks have firmware, and newer firmware tends to have fewer bugs). But you clearly have a FreeBSD machine, not a Windows machine. The next option is to use the bootable version of the Seagate tool: Write it to a USB stick, boot the computer, and you get a standalone mini-OS that can be used to upgrade and configure your Seagate disks. If you don't like that, I know that there is a version of the Seagate tools for Linux, but (a) I think the Linux version only works for SCSI (SAS) disks, and (b) you are using FreeBSD, not Linux. Finally, you can configure the sleep/power management using camcontrol directly from FreeBSD, but it is not easy.

Look at the options, and pick the one that best fits your needs.

VladiBG · Apr 20, 2019

Use disks that are built for RAID arrays. The green series is for home desktops only; those drives don't have ERC support.

What is Error Recovery Control? | Seagate UK

Brief description of Error Recovery Control, which enables the host to set a soft time limit for specific commands (reads and writes).

www.seagate.com

k.jacker · Apr 20, 2019

Like VladiBG, I would also advise against green drives in an array. Their goal is to save as much energy as possible and be as quiet as possible.

In addition to camcontrol, suggested by ralphbsz, another tool to temporarily disable apm/standby (until you get yourself proper drives) is sysutils/smartmontools. Compared to camcontrol, smartmontools (smartctl) would be the less dangerous tool, but you have to install it first.

A simple # smartctl --get=apm /dev/<devicename> should show you the state of apm. If implemented, you could for example do # smartctl --set=apm,off /dev/<devicename> (or --set=standby,off).
Not all drives implement that in the same fashion; read smartctl(8) from line 537. Another workaround could be to poll the drives in some way. I do that every 25 minutes on USB drives by reading their temperature using smartctl (they would otherwise enter idle mode after 30 minutes and, at some point later, spin down as well).

ralphbsz · Apr 20, 2019

I didn't know that smartctl can also adjust the sleep settings, that's useful to learn.

PMc · Apr 21, 2019

ralphbsz said:
Is this actually a problem? If the drive spins down when idle, and spins up again, and the SATA driver stack doesn't get any errors in the process, why not let them do it?

Sure, the messages in the log look scary: You are getting error ENXIO (number 6).

Yes, that is a problem and must not happen.
It is perfectly possible to have disks in ZFS go into standby. I spin down all my disks in the pools and do not see that error:

Code:

ada0: STBY
ada1: 36 C
ada2: STBY
ada3: 34 C
ada4: STBY
da0:  STBY
da1:  34 C
da2:  37 C

(data from smartctl)

I also recently added one Seagate to my menagerie, but it is an ST3000DM008-2DM166 (I don't know how "green" that is). I once had that one set with camcontrol apm /dev/ada2 -l 1 (so that it spins down immediately after each request), and had no problem with that.

The errno=6 is indeed bad - I see these about twice a year, only due to connection problems (in summer the disks may run at 60+ C, and SATA connectors are crap, they don't cope with such wide temperature swings), and when I try to do a system dump (but that's a flaw in the SATA controller, independent of which disk is used).

@TO: Your disks seem to go offline to the controller for whatever reason (in standby they shouldn't).
Here is the current config of my Seagate; you may compare:

Code:

# camcontrol identify /dev/ada2
[...]
Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
overlap                        no
Tagged Command Queuing (TCQ)   no       no
Native Command Queuing (NCQ)   yes              32 tags
NCQ Queue Management           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    no
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      yes      yes     120/0x78
automatic acoustic management  no       no
media status notification      no       no
power-up in Standby            yes      no
write-read-verify              yes      no      0/0x0
unload                         no       no
general purpose logging        yes      yes
free-fall                      no       no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA)      yes      no      5860533168/5860533168
HPA - Security                 no
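
For comparing two drives, the feature table can be turned into a dictionary and the APM level read out. The Python below is a sketch only: `parse_features` and `apm_allows_standby` are hypothetical helper names, the regular expression is fitted to the table layout shown above, and the sample rows are copied from it. The level ranges used (1 to 127 permit standby, 128 to 254 do not) follow the ATA definition of Advanced Power Management as I understand it.

```python
import re

# One table row: feature name, Support column, optional Enabled column,
# optional Value column (e.g. "120/0x78" or "32 tags").
ROW_RE = re.compile(
    r"^(?P<name>.+?)\s+(?P<support>yes|no)\b"
    r"(?:\s+(?P<enabled>yes|no)\b)?"
    r"(?:\s+(?P<value>\S.*))?$"
)


def parse_features(identify_text):
    """Map feature name -> (support, enabled, value); missing columns are None."""
    features = {}
    for line in identify_text.splitlines():
        m = ROW_RE.match(line.strip())
        if m:  # the header line has no yes/no column and is skipped
            features[m.group("name")] = (
                m.group("support"), m.group("enabled"), m.group("value"))
    return features


def apm_allows_standby(features):
    """True/False if APM is enabled and a level is shown, else None."""
    support, enabled, value = features.get(
        "advanced power management", ("no", None, None))
    if support != "yes" or enabled != "yes" or not value:
        return None
    level = int(value.split("/")[0])  # "120/0x78" -> 120
    return 1 <= level <= 127


# Rows copied from the camcontrol identify output above.
SAMPLE = """\
Feature                      Support  Enabled   Value           Vendor
Native Command Queuing (NCQ)   yes              32 tags
power management               yes      yes
advanced power management      yes      yes     120/0x78
power-up in Standby            yes      no
"""

feats = parse_features(SAMPLE)
print(feats["advanced power management"], apm_allows_standby(feats))
```

With the level of 120 shown above, the drive is permitted to spin down on its own, which matches the STBY states in the earlier listing.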

Addendum: did some further research (and crashed my server btw) and updated text accordingly.

There are actually three different powersave modes for ATA:
IDLE is practically a noop: it tells the drive that there are no pending requests (in case the drive is too stupid to figure that out).
STANDBY is spin-down. This is what we usually want. This will result in a ~5-20sec delay on the next request, and no errors at all, not with ZFS, not in mirrored or raid configs.
SLEEP is supposed to spin down and power off the electronics. The disk will not react to commands anymore.

When I tried SLEEP on my desktop (HGST, single disk), it came back on its own, but with these errors logged:

Apr 21 23:26:32 <kern.crit> disp kernel: ahcich1: Timeout on slot 22 port 0
Apr 21 23:26:32 <kern.crit> disp kernel: ahcich1: is 00000000 cs ffcfffff ss ffcfffff rs ffcfffff tfd d0 serr 00000000 cmd 0000d617
Apr 21 23:26:32 <kern.crit> disp kernel: (ada0:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 78 ab 9d 40 05 00 00 00 00 00
Apr 21 23:26:32 <kern.crit> disp kernel: (ada0:ahcich1:0:0:0): CAM status: Command timeout
Apr 21 23:26:32 <kern.crit> disp kernel: (ada0:ahcich1:0:0:0): Retrying command

This is not what we want.

When I tried the same on the server with the Seagate, the disk blocked. I was still able to run camcontrol reset, and that came back, but the SATA controller didn't understand the proceedings, blocked the disk, and itself, and probably the PCI bus, so at that point the whole system was frozen and required pushbutton service.
This is alright, as that SATA controller saw its last firmware update in 2004, and traditionally a unix system is not supposed to survive a lost device (Winux or Lindows may handle that differently).

Conclusion: how this is handled depends to some extent on the controller. But anyway, this SLEEP mode is not what we want. I have not seen disks that would go there on their own, so if the Seagate drives mentioned here do, there should be some way to remedy that.

What to do: the disk can be brought into these states intentionally, with camcontrol(8) idle, standby and sleep, respectively. I would try these out at a time when the disk is still spinning, to see how far the results correspond to the perceived behaviour, or whether something else is the matter.

VladiBG · Apr 22, 2019

PMc said:
so that it does spindown immediately after each request

ST3000DM008 is rated at 300,000 load/unload cycles. Maybe it is not a good idea to spin down and park the heads so often.
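
To put the 300,000 figure in perspective, here is a small back-of-the-envelope calculation in Python. The function name and the cycle intervals are assumptions chosen for illustration; how often a drive really parks depends on the workload, and the rating is a design figure rather than a point of guaranteed failure.

```python
def days_until_rated_cycles(rated_cycles, seconds_between_cycles):
    """Days of continuous operation until the rated load/unload count is reached."""
    cycles_per_day = 86400 / seconds_between_cycles
    return rated_cycles / cycles_per_day


# Assumed intervals between spin-down/park events, in minutes.
for minutes in (1, 5, 30):
    days = days_until_rated_cycles(300_000, minutes * 60)
    print(f"one cycle every {minutes:>2} min: {days:.0f} days ({days / 365:.1f} years)")
```

At one cycle every five minutes that is 288 cycles per day, so the rated count would be reached after roughly 1042 days, a bit under three years.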

PMc · Apr 22, 2019

Thanks for the info - that wasn't a good idea indeed, as this piece takes a loooong time to get up to speed again.
Interesting detail in the specs, btw: power consumption goes down to .75W in standby, so it does almost a full shutdown, and there is no further benefit from sleep mode.
The setting was originally for an HGST with Coolspin; those do some kind of dynamic speed adaptation and come back quicker (but then they need more power in standby).