Solved 10.2 - ASM1062 Command Timeouts in /var/log/messages

Hi,

This evening I've just upgraded my FreeBSD 10.0-RELEASE machine to 10.2-RELEASE.

The boot drive for this machine is a spinning-rust HD in an eSATA enclosure connected to a PCIe eSATA card (ASMedia ASM1062-based).

This had been running perfectly in 10.0-RELEASE. However, since the upgrade, the machine regularly hangs for a few seconds and writes the following to /var/log/messages:

Code:
Sep 26 22:57:26 trillian kernel: ahcich0: Timeout on slot 5 port 0
Sep 26 22:57:26 trillian kernel: ahcich0: is 00000000 cs 00000020 ss 00000000 rs 00000020 tfd 451 serr 00000000 cmd 0004c517
Sep 26 22:57:59 trillian kernel: ahcich0: Timeout on slot 10 port 0
Sep 26 22:57:59 trillian kernel: ahcich0: is 00000000 cs fffffc01 ss fffffc01 rs fffffc01 tfd 451 serr 00000000 cmd 0004df17
Sep 26 22:57:59 trillian kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 02 e0 bb 00 40 00 00 00 00 00 00
Sep 26 22:57:59 trillian kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Sep 26 22:57:59 trillian kernel: (ada0:ahcich0:0:0:0): Retrying command

The command is not always the same (e.g. READ_FPDMA_QUEUED) and neither are the various hex addresses.

Unfortunately, I can't do a full smartctl -a /dev/<disk> on the disk as it errors trying to obtain the data (always has done on this disk). Here's the output anyway:

Code:
smartctl 6.4 2015-06-04 r4109 [FreeBSD 10.2-RELEASE amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===

Model Family:     Hitachi Deskstar 7K1000.B
Device Model:     Hitachi HDT721010SLA360
Serial Number:    STF601MH0MHHEB
Firmware Version: ST6OA31B
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-6 T13/1410D revision 1
Local Time is:    Sat Sep 26 23:10:44 2015 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Read SMART Data failed: request failed, error code 0xd0

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Read SMART Error Log failed: request failed, error code 0x04

I can't hear any oddness from the disk (nasty clicks, re-spinups, etc).

Has something in the ahci(4) driver changed or am I just unlucky in that the drive's started giving up the ghost at coincidentally the same time that I upgraded the OS?

Any help greatly appreciated.

Chris
 
Here's another one, apparently not related to reading or writing:

Code:
ahcich0: Timeout on slot 28 port 0
ahcich0: is 00000000 cs 10000000 ss 00000000 rs 10000000 tfd 451 serr 00000000 cmd 0004dc17
(ada0:ahcich0:0:0:0): SETFEATURES ENABLE WCACHE. ACB: ef 02 00 00 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
 
Right -- I've read pretty much every post out there on this. Looks like I'm going to have to crack the enclosure open and hook the HD up via a trusted SATA-USB bridge and see if that lets me reach its SMART status info. The other popular possibility seems to be an eSATA cable gone bad. Guess what I don't have a spare of? :)

So a bad controller card/driver combo seems to be the least likely option according to the web. But I can't shake the coincidence of this only showing up after upgrading to 10.2-RELEASE.

I've even gone back through the changelog on ahci.c but can't see anything recent that, to my non-low-level-programmer eye, looks like it could be the cause...
 
Great -- it's getting worse.

My system is an HP Microserver N40L.

It has a built-in eSATA port and the above-mentioned eSATA card. The add-in card has two eSATA ports. The add-in card appears to be ahcich0 whilst the internal eSATA is ahcich1.

Standard operation is:

  • Internal eSATA port goes to an external enclosure containing the boot disk
  • Add-in eSATA, port 0 has a 4-slot Startech enclosure containing 4xHDs.
As the boot disk was giving me the errors, today I added:
  • eSATA enclosure on Add-in eSATA, port 1

... with the the aim of cloning my boot disk to that drive.

Now, I'm seeing timeouts on both ahchi0 and ahchi1

*Groan*. No idea what the problem is here....
 
I've followed up on bugs:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=186346 and
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=201194

Just finished cloning my OS drive to another disk and am in the process of downgrading it to 10.0-RELEASE.

...so I suppose the silver lining here is I'm learning how easy it is to copy a FreeBSD installation to another disk with cpdup then downgrade by abusing freebsd-update(8).

Shame. I was really enjoying the NFS speed boost in 10.2-RELEASE, but never mind.
 
Last edited by a moderator:
Downgrade didn't work (turns out abusing freebsd-update(8) leads to a pkg(8) that throws bus errors.

Would've been a waste of time anyway since I stupidly upgraded my ZFS pools' features before noticing the AHCI errors - so 10.0 wouldn't be able to mount my pools anyway.

So I'm stuck with this until a fix is found. Also I just found that my pf(4) rules to route traffic from certain IPs on my LAN out to the internet via a OpenVPN-connected VPS have magically stopped working too. Superb.

I'll try a clean 10.2-RELEASE installation soon to see if there's something broken from the original upgrade. I can't believe how painful this has been!
 
Last edited by a moderator:
I had a similar problem when upgrading from 9.3 to 10.3. Message ahcich0: Timeout on slot X appeared each time I run smartctl (or at boot with smartd). After rebuilding smartmontools on 10.3 the message disappeared.
 
Top