SSD ahcich timeouts

I started out with FreeBSD 8.2-RELEASE on a Supermicro X8DTi-LN4F motherboard (Intel 5520 chipset) for a backup server using ZFS on disks attached to an Intel SASUC8I in IT mode.

The FreeBSD OS is on a Crucial M4 SSD (zroot), with two Intel 80 GB SSDSA2CW080G3 SSDs for swap and ZFS L2ARC.

The SSDs are all connected to the on-board SATA ports on the mobo.

About three weeks ago I started having problems with kernel panics due to timeouts to these SATA devices.

Kernel messages were not conclusive as far as which drive was the culprit.

I tried to eliminate various potential sources of the issue by methodically replacing items such as SATA cables, checking drives, checking memory, swapping mobo with identical mobo, etc.

I finally installed FreeBSD 9.0-RELEASE, which worked problem-free for over a week until I hit an ahcich timeout this morning.

I have the following in /boot/loader.conf:

Code:

ahci_load="YES"

# See ahci(4)
hint.ahcich.0.sata_rev=1
hint.ahcich.1.sata_rev=1
hint.ahcich.2.sata_rev=1
hint.ahcich.3.sata_rev=1

hint.ahcich.0.pm_level=1
hint.ahcich.1.pm_level=1
hint.ahcich.2.pm_level=1
hint.ahcich.3.pm_level=1

I also set up a script to disable NCQ for my drives (ada0 through ada3).

Code:

#!/bin/sh

CAMCONTROL=/sbin/camcontrol

$CAMCONTROL tags ada0 -N 1 > /dev/null
$CAMCONTROL tags ada1 -N 1 > /dev/null
$CAMCONTROL tags ada2 -N 1 > /dev/null
$CAMCONTROL tags ada3 -N 1 > /dev/null

exit 0

I'm pretty sure that my issue is a bad SSD but I wanted feedback on what I'm seeing on my smartctl commands.

None of these drives show SMART events.

I ran the following commands for each drive (full output attached):

Code:

smartctl -a /dev/blah
smartctl -l devstat /dev/blah
smartctl -l sataphy /dev/blah
smartctl -l ssd /dev/blah
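For anyone repeating this, the four commands above can be run against every drive in one pass. This is only a minimal sketch, assuming the drives are ada0 through ada3 as in this thread. The `collect_smart` function name and the output directory are made up for illustration, and `SMARTCTL` can be overridden if smartctl is not in the PATH.

```sh
#!/bin/sh
# Sketch only: save the smartctl reports listed above, one file per drive.
# Assumptions (not from the original post): the function name and the
# output directory are illustrative.

SMARTCTL=${SMARTCTL:-smartctl}

collect_smart() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    for dev in "$@"; do
        # Run the same four reports as in the post and keep stderr too,
        # so unsupported log pages show up in the saved file.
        {
            "$SMARTCTL" -a "/dev/$dev"
            "$SMARTCTL" -l devstat "/dev/$dev"
            "$SMARTCTL" -l sataphy "/dev/$dev"
            "$SMARTCTL" -l ssd "/dev/$dev"
        } > "$outdir/$dev.txt" 2>&1
    done
}

# Example (run as root on the affected host):
# collect_smart /tmp/smart ada0 ada1 ada2 ada3
```

Keeping one file per device makes it easier to diff a suspect drive against a healthy one.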

The only thing odd that I'm seeing (from what I know, patterns, etc) is that two of the devices show a relatively high number of ASR events and hardware resets relative to the other devices.

Code:

ada0 - FreeBSD zroot disk
0 ASR events
0 hardware resets

ada1 - swap
0 ASR events
0 hardware resets

ada2 - FreeBSD zroot disk
43 ASR events
160 hardware resets

ada3 - L2ARC
25607 ASR events
180 hardware resets

In terms of proportion I would assume that ada2 is a little wonky and ada3 is a problem.
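The comparison can be made mechanical by sorting the drives on those two counters. A small sketch, using only the numbers quoted above; the `rank_by_asr` name and the three-column layout (device, ASR events, hardware resets) are assumptions for illustration.

```sh
#!/bin/sh
# Sketch: rank drives by the counters quoted in the post.
# Input columns: device, ASR events, hardware resets.

rank_by_asr() {
    # Sort by ASR events (column 2), then hardware resets (column 3),
    # both numerically and in descending order.
    sort -k2,2nr -k3,3nr
}

# Numbers copied from the post above.
counters='ada0 0 0
ada1 0 0
ada2 43 160
ada3 25607 180'

printf '%s\n' "$counters" | rank_by_asr
```

With these numbers ada3 sorts first and ada2 second, which matches the reading above.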

I haven't been able to find anything that elaborates on ASR events and hardware resets as reported by smartmontools, so I'm looking for feedback to tell me if I'm on the right track or not.

Any help, direction, etc would be much appreciated.

wblock@

Developer

Oct 5, 2012

#2

nateK said:

I have the following in /boot/loader.conf:

Code:

ahci_load="YES"

# See ahci(4)
hint.ahcich.0.sata_rev=1
hint.ahcich.1.sata_rev=1
hint.ahcich.2.sata_rev=1
hint.ahcich.3.sata_rev=1

hint.ahcich.0.pm_level=1
hint.ahcich.1.pm_level=1
hint.ahcich.2.pm_level=1
hint.ahcich.3.pm_level=1

I also setup NCQ to be disabled for my drives (ada0 through ada3).

Code:

#!/bin/sh

CAMCONTROL=/sbin/camcontrol

$CAMCONTROL tags ada0 -N 1 > /dev/null
$CAMCONTROL tags ada1 -N 1 > /dev/null
$CAMCONTROL tags ada2 -N 1 > /dev/null
$CAMCONTROL tags ada3 -N 1 > /dev/null

exit 0

With FreeBSD 9, none of those should be needed. The AHCI driver is part of GENERIC, and the SATA revision and NCQ should be negotiated automatically. These settings may not be related to the problem, but it is worth testing without them.
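Before testing without the overrides, it helps to see exactly which ones are still active. A minimal sketch, assuming the entries live in /boot/loader.conf as shown in the quoted post; `list_ahci_hints` is a hypothetical helper name, not a FreeBSD tool.

```sh
#!/bin/sh
# Sketch: show which AHCI overrides are still active in a loader.conf file,
# so they can be commented out for a test boot.
# The function name is illustrative only.

list_ahci_hints() {
    conf=${1:-/boot/loader.conf}
    # Print, with line numbers, uncommented hint.ahcich.* and ahci_load entries.
    grep -nE '^[[:space:]]*(hint\.ahcich\.|ahci_load=)' "$conf"
}

# Example:
# list_ahci_hints /boot/loader.conf
```

Commented-out lines are skipped, so an empty result means none of the overrides are in effect from that file.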

nateK

Oct 5, 2012

Thread Starter
#3

Would you say that it is correct to equate ASR events and hardware resets with a hardware issue in the disks?

wblock@

Developer

Oct 5, 2012

#4

Sorry, I don't know.

nateK

Oct 5, 2012

Thread Starter
#5

No problem, I appreciate the info about auto-negotiation.

I wanted to clarify my post as I discovered that I had transposed two of the device numbers.

The two devices that I have throwing ASR and hardware resets are Intel SSD. This would lead me to believe that one of the Intel SSD is wonky and the other is headed that way.

I have swapped the Intel SSDs out and will report back here, for future reference, on whether the ahcich timeout issue pops up again.

nateK

Oct 6, 2012

Thread Starter
#6

The configuration without the Intel SSDs (replaced with 2 x Crucial V4) lasted about 1.5 hours without the /boot/loader.conf items and with NCQ enabled.

Getting 'ahcichX ahci reset device not ready poll timeout' for at least ada0 and ada1 (FreeBSD).

Removed ada2 and ada3 (L2ARC and swap) and removed ada1 from zroot and made it a 16 GB swap device.

I did not add the /boot/loader.conf items or the camcontrol script that kills NCQ back into the mix, although other people have had success with them.

nateK

Oct 11, 2012

Thread Starter
#7

The system made it until sometime today, and then I got this from the kernel:

Code:

ahcich0: Timeout on slot 29 port 0
ahcich0: is 000000000 cs 00000000 ss e0000000 rs e0000000 tfd 40 serr 00000000 cmd 0004df17

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0 
ahcich0: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17
(ada0:ahcich0:0:0:0): lost device

ahcich0: AHCI reset: device not ready after 3100ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000003 ss 800000003 rs 80000003 tfd 80 serr 0000000 cmd 0004df17
(ada0:ahcich0:0:0:0): removing device entry

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 1 port 0
ahcich0: is 00000000 cs 00000002 ss 000000000 rs 0000002 tfd 80 serr 00000000 cmd 004c117

Recovering from this requires a hard reset of the system (power off, then power on); otherwise the AHCI/SATA controller does not detect the on-board SATA devices.
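To see at a glance which channels are affected in output like the excerpt above, the timeout messages can be counted per channel. This is a rough sketch that assumes the messages keep the "ahcichN: Timeout on slot" and "ahcichN: Poll timeout on slot" wording shown in the log; `count_ahci_timeouts` is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch: count timeout messages per AHCI channel in kernel output.
# Reads log text on stdin. Assumes the message wording seen in the
# excerpt above; other driver versions may word it differently.

count_ahci_timeouts() {
    awk '/^ahcich[0-9]+: (Poll )?[Tt]imeout on slot/ {
             sub(/:$/, "", $1)      # strip the trailing colon from "ahcich0:"
             n[$1]++
         }
         END { for (ch in n) print ch, n[ch] }' | sort
}

# Example with lines taken from the log above:
count_ahci_timeouts <<'EOF'
ahcich0: Timeout on slot 29 port 0
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
(ada0:ahcich0:0:0:0): lost device
ahcich0: Poll timeout on slot 1 port 0
EOF
```

On a live system the same function could be fed from `dmesg` or /var/log/messages. If several channels show up, a shared cause (controller, power, cabling) becomes more likely than a single bad drive.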

I did disable APM on the two SSD disks and will see what happens.

At this point I don't have many options, as other motherboards will have a similar chipset (Intel 5520). The two I can see:

Replace the SSDs with SATA hard disks - I haven't seen this issue with FreeBSD and plain SATA disks
Port the system to Solaris 11 and chalk it up to some weird FreeBSD issue

wblock@

Developer

Oct 11, 2012

#8

Can you test on an identical system? If that system doesn't have the error, swap in the suspect SSD. It could be a bug with the SATA controller code.

This would probably get more feedback on one of the mailing lists, maybe freebsd-questions.

nateK

Oct 12, 2012

Thread Starter
#9

I tried swapping out the motherboard for a new one of the same model and the issue followed.

Will post this over on freebsd-questions and update back here with what I find out.

nateK

Oct 22, 2012

Thread Starter
#10

Update

A week ago I replaced all of the SSDs with good old-fashioned SATA drives in an external Rosewill array.

The first test was to see if removing the /boot/loader.conf and /usr/local/etc/rc.d/camcontrol shims would have a negative impact on the system.

With SSD in place I was getting server blow-out every 2 hours unless these two items were in place.

No issues found and after a week without problems I have to conclude that some sort of firmware bug or FreeBSD issue with Crucial M4 SSD is present under 8.2 and 9.0.

I have been running some Crucial V4 SSDs in another system for a few weeks, testing another application, with no issues. The usage pattern of that system is different, though, so this could be I/O related, firmware related, or just plain old bad juju in the form of a wonky SSD.

baos

Jan 7, 2013

#11

I am having this exact problem. I am on my third hard drive and still getting ahcich timeouts. I have an Asus motherboard with Seagate hard drives. The last one to go bad without any SMART issues was a 320GB. I am now on a 500GB and thinking it's possible none of these drives are bad, hence the same errors. From the motherboard's website: http://www.asus.com/Motherboards/AMD_AM3/M4A87TD_EVO/#specifications

SB850 Chipset
6 xSATA 6.0 Gb/s ports Support RAID 0,1,5,10
JMicron® JMB361 PATA and SATA controller
1 xUltraDMA 133/100 for up to 2 PATA devices
1 xExternal SATA 3Gb/s port (SATA On-the-Go)

baos

Jan 12, 2013

#12

baos said:
I am having this exact problem. I am on my third hard drive and still getting ahcich timeouts. I have an Asus motherboard with Seagate hard drives. The last one to go bad without any SMART issues was a 320GB. I am now on a 500GB and thinking it's possible none of these drives are bad hence the same errors. From the motherboards website: http://www.asus.com/Motherboards/AMD_AM3/M4A87TD_EVO/#specifications

SB850 Chipset
6 xSATA 6.0 Gb/s ports Support RAID 0,1,5,10
JMicron® JMB361 PATA and SATA controller
1 xUltraDMA 133/100 for up to 2 PATA devices
1 xExternal SATA 3Gb/s port (SATA On-the-Go)

Turned out to be caused by a loose cable plugged in with no drives attached.

FreeMWP

Mar 19, 2013

#13

baos said:
I am having this exact problem. I am on my third hard drive and still getting ahcich timeouts. I have an Asus motherboard with Seagate hard drives. The last one to go bad without any SMART issues was a 320GB. I am now on a 500GB and thinking it's possible none of these drives are bad hence the same errors. From the motherboards website: http://www.asus.com/Motherboards/AMD_AM3/M4A87TD_EVO/#specifications

SB850 Chipset
6 xSATA 6.0 Gb/s ports Support RAID 0,1,5,10
JMicron® JMB361 PATA and SATA controller
1 xUltraDMA 133/100 for up to 2 PATA devices
1 xExternal SATA 3Gb/s port (SATA On-the-Go)

I have a similar problem with an ASUS M4A88TD-M EVO and a LiteOn Blu-ray drive under high load (more specifically, when using cdrom passthrough in VirtualBox). I have tried changing the cable and port, but the issue remains. When using cdrom passthrough in VirtualBox on an OpenIndiana/Illumos or Linux host, it just works without problems, which makes me think this is an ahci driver problem in FreeBSD, not a hardware problem.
