[FreeBSD 8.1] Timeouts on AHCI. Broken HDD?

mky · Oct 24, 2010

Hi,
I have a problem with one of my HDDs, connected via SATA2 and using the AHCI driver. My disks are:

Code:

ada0 at ahcich0 bus 0 scbus1 target 0 lun 0
ada0: <TOSHIBA MK8034GSX AH303B> ATA-7 SATA 1.x device
ada0: 150.000MB/s transfers (SATA 1.x, UDMA5, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 76319MB (156301488 512 byte sectors: 16H 63S/T 16383C)
ada1 at ahcich1 bus 0 scbus2 target 0 lun 0
ada1: <SAMSUNG HD154UI 1AG01118> ATA-7 SATA 2.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 1430799MB (2930277168 512 byte sectors: 16H 63S/T 16383C)
ada2 at ahcich5 bus 0 scbus6 target 0 lun 0
ada2: <WD My Book 01.01A01> ATA-6 SATA 2.x device
ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 512bytes)
ada2: Command Queueing enabled
ada2: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)

My controller is:

Code:

ahci0: <Intel ICH10 AHCI SATA controller> port 0x9c00-0x9c07,0x9880-0x9883,0x9800-0x9807,0x9480-0x9483,0x9400-0x941f mem 0xfcffe800-0xfcffefff irq 19 at device 31.2 on pci0
ahci0: [ITHREAD]
ahci0: AHCI v1.20 with 6 3Gbps ports, Port Multiplier supported

From time to time I get the following messages from the kernel:

Code:

ahcich1: Timeout on slot 30
ahcich1: is 00000000 cs 40000000 ss 00000000 rs 40000000 tfd c0 serr 00000000
ahcich1: Timeout on slot 30
ahcich1: is 00000000 cs 0000001c ss c000001f rs c000001f tfd c0 serr 00000000
ahcich1: Timeout on slot 10
ahcich1: is 00000000 cs 00000c00 ss 00000000 rs 00000c00 tfd c0 serr 00000000
ahcich1: Timeout on slot 12
ahcich1: is 00000000 cs 00030000 ss 0003f000 rs 0003f000 tfd c0 serr 00000000
ahcich1: Timeout on slot 22
ahcich1: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd c0 serr 00000000
...

This happens only with the Samsung drive, never with any other. What do these messages mean? The only thing I can do is hard-reset my system. After the reset everything works fine.

Is my HDD broken? SMART status is fine. I use ZFS on this drive, and a scrub finds no errors either.
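
On the "What do these messages mean?" question: the second line of each report looks like a dump of the AHCI port registers taken when the timeout fired. The Python sketch below is a reading aid, not something posted in this thread. The field meanings are assumptions based on the AHCI register names (`is` interrupt status, `cs` command issue, `ss` SATA active, `rs` the slots the driver considers running, `tfd` task file data, `serr` SATA error), and the function names are hypothetical. It converts the slot bitmasks to slot numbers and names the ATA status bits in `tfd`.

```python
import re

# ATA status register bits (low byte of tfd).
ATA_STATUS_BITS = {
    0x80: "BSY",
    0x40: "DRDY",
    0x20: "DF",
    0x08: "DRQ",
    0x01: "ERR",
}

# Matches the register line printed after "Timeout on slot N".
REG_LINE = re.compile(
    r"(?P<chan>ahcich\d+): is (?P<is>[0-9a-f]+) cs (?P<cs>[0-9a-f]+) "
    r"ss (?P<ss>[0-9a-f]+) rs (?P<rs>[0-9a-f]+) "
    r"tfd (?P<tfd>[0-9a-f]+) serr (?P<serr>[0-9a-f]+)"
)


def slots(mask):
    """Return the slot numbers (0-31) whose bits are set in mask."""
    return [n for n in range(32) if (mask >> n) & 1]


def status_flags(tfd):
    """Return the names of the ATA status bits set in the low byte of tfd."""
    return [name for bit, name in ATA_STATUS_BITS.items() if (tfd & 0xFF) & bit]


def decode(line):
    """Decode one register line into a dict, or return None if it does not match."""
    m = REG_LINE.search(line)
    if m is None:
        return None
    return {
        "channel": m.group("chan"),
        "issued_slots": slots(int(m.group("cs"), 16)),      # assumed: cs = command issue
        "active_ncq_slots": slots(int(m.group("ss"), 16)),  # assumed: ss = SATA active
        "running_slots": slots(int(m.group("rs"), 16)),     # assumed: rs = driver's view
        "status": status_flags(int(m.group("tfd"), 16)),
        "serr": int(m.group("serr"), 16),
    }


if __name__ == "__main__":
    print(decode("ahcich1: is 00000000 cs 40000000 ss 00000000 "
                 "rs 40000000 tfd c0 serr 00000000"))
```

For the first report, `cs 40000000` decodes to slot 30 and `tfd c0` to BSY plus DRDY, with `serr` zero. Under the assumed meanings, the command in slot 30 was still outstanding, the drive was reporting busy, and no link-level error had been recorded.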

fronclynne · Oct 24, 2010

Is the drive perhaps suspending/sleeping/parking and then failing to resume?

I should add that the direct command to disable spin-down for ada(4) devices seems to be # /sbin/camcontrol cmd ada0 -a "EF 85 00 00 00 00 00 00 00 00 00 00"
camcontrol(8) says that it has idle and standby, but I've never been able to get them to do anything.

See also Thread 8841.

Terry_Kennedy · Oct 25, 2010

mky said:
From time to time I get the following messages from the kernel:

Code:

ahcich1: Timeout on slot 30
ahcich1: is 00000000 cs 40000000 ss 00000000 rs 40000000 tfd c0 serr 00000000

Is my HDD broken? SMART status is fine. I use ZFS on this drive, and a scrub finds no errors either.

If I'm remembering correctly, there were some reports here (and FreeBSD PRs) involving Samsung drives, but with a different error message (I/O errors at high LBAs). I believe they were closed when a new hypervisor version appeared to fix the problem (the system in question was running FreeBSD in a VM).

It is interesting that in both your case and that problem, only a Samsung drive (you both also have other brands of disk attached) shows the problem. This freebsd-stable post reports a very similar error (which the poster resolved by using a non-Samsung drive).

mav@ · Oct 25, 2010

Try replacing the SATA cable first.

Crivens · Oct 26, 2010

fronclynne said:
Is the drive perhaps suspending/sleeping/parking and then failing to resume?

Unlikely. I got the same with a resilver in progress, so the drive in question should not have been idle.

These are the 5400 rpm Samsungs, right? Maybe they really are overloaded and cannot write the entries fast enough, so the ahci driver considers them to be not responding.

Also, I did not find any entries in SMART for these timeouts; they come from the ahci driver.

I went back to not using AHCI for my ZFS pool because of these messages. They also tend to freeze I/O for some seconds or minutes.
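
Because these timeouts are logged by the ahci driver and not by SMART, the kernel log is the place to count them. Here is a minimal Python sketch, assuming the messages have been saved as text (for example from `dmesg` output) in the same `ahcichN: Timeout on slot M` form quoted in this thread. `count_timeouts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches only the "Timeout on slot" lines, not the register dump that follows.
TIMEOUT = re.compile(r"^(ahcich\d+): Timeout on slot (\d+)$")


def count_timeouts(log_text):
    """Count 'Timeout on slot' messages per AHCI channel in log_text."""
    counts = Counter()
    for line in log_text.splitlines():
        m = TIMEOUT.match(line.strip())
        if m:
            counts[m.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = """\
ahcich1: Timeout on slot 30
ahcich1: is 00000000 cs 40000000 ss 00000000 rs 40000000 tfd c0 serr 00000000
ahcich1: Timeout on slot 10
ahcich1: is 00000000 cs 00000c00 ss 00000000 rs 00000c00 tfd c0 serr 00000000
"""
    for channel, n in count_timeouts(sample).items():
        print(channel, n)
```

Matching the channel name against the probe lines (such as `ada1 at ahcich1 bus 0 scbus2 target 0 lun 0`) shows which drive each count belongs to.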

mky · Oct 29, 2010

First of all, thanks for all the answers.

The problem is exactly as described here:

Terry_Kennedy said:
It is interesting that in both your case and that problem, only a Samsung drive (you both also have other brands of disk attached) shows the problem. This freebsd-stable post reports a very similar error (which the poster resolved by using a non-Samsung drive).

I have exactly the same drive, a Samsung F2 1.5TB 5400rpm EcoGreen, and, as in that thread on the freebsd-stable list, the drive usually stopped responding under heavy load (in my case, when ZFS was syncing data to disk at 100 MB/s).

I disabled AHCI a few days ago (which also disables the faulty NCQ on this drive), and since then I have had no problems with it.

For me, the problem is solved. Thanks again.
