Other Kernel: ahcich: Timeout in slot

patpro

Active Member

Reaction score: 15
Messages: 230

This is very odd, I see a lot of messages about this kind of issues. My server is affected too, but it's an Intel chipset. I feel like this problem appeared after I moved from FreeBSD 9 to FreeBSD 10 (always RELEASE).
Lucky me, I'm logging almost everything into SPLUNK, that allowed me to trace the first "ahcich*: Timeout on slot *" to 28 minutes after upgrading my FreeBSD 9.3-RELEASE to FreeBSD 10.1-RELEASE.

Code:
ahci0: <Intel Cougar Point AHCI SATA controller> port 0xf070-0xf077,0xf060-0xf063,0xf050-0xf057,0xf040-0xf043,0xf000-0xf01f mem 0xdfa22000-0xdfa227ff irq 19 at device 31.2 on pci0
ahci0: AHCI v1.30 with 6 3Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahcich4: <AHCI channel> at channel 4 on ahci0
ahcich5: <AHCI channel> at channel 5 on ahci0
ahciem0: <AHCI enclosure management bridge> on ahci0



At boottime, I've got:

Code:
ahcich5: Timeout on slot 0 port 0
ahcich5: is 00000000 cs 00000000 ss 00000000 rs 00000001 tfd 150 serr 00000000 cmd 0004c017
(aprobe5:ahcich5:0:0:0): SETFEATURES SET TRANSFER MODE. ACB: ef 03 00 00 00 40 00 00 00 00 46 00
(aprobe5:ahcich5:0:0:0): CAM status: Command timeout
(aprobe5:ahcich5:0:0:0): Retrying command
ahcich5: Timeout on slot 0 port 0
ahcich5: is 00000000 cs 00000000 ss 00000000 rs 00000001 tfd 150 serr 00000000 cmd 0004c017
(aprobe5:ahcich5:0:0:0): SETFEATURES SET TRANSFER MODE. ACB: ef 03 00 00 00 40 00 00 00 00 46 00
(aprobe5:ahcich5:0:0:0): CAM status: Command timeout
(aprobe5:ahcich5:0:0:0): Error 5, Retries exhausted
ahcich5: Timeout on slot 0 port 0


which corresponds I think to an SSD plugged here as a rescue disk, but not in use.

This morning, without notice, the server hanged:

Code:
ahcich1: Timeout on slot 25 port 0
ahcich1: is 00000000 cs 04000000 ss 06000000 rs 06000000 tfd 40 serr 00000000 cmd 0004d917
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 28 ee 3c 40 2e 00 00 01 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Retrying command
ahcich1: stopping AHCI engine failed
ahcich0: Timeout on slot 23 port 0
ahcich0: is 00000000 cs 01000000 ss 01800000 rs 01800000 tfd 40 serr 00000000 cmd 0004d717
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 28 ee 3c 40 2e 00 00 01 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
ahcich0: stopping AHCI engine failed
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: executing CLO failed
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: executing CLO failed
ahcich1: Timeout on slot 26 port 0


I've not tested to play with hint.ahci.X.msi yet, but that would be great if it solves the problem of my HDD (so I don't need to RMA them just to discover the culprit is a driver...)
 
OP
WCSN

WCSN

Member

Reaction score: 18
Messages: 66

I've not tested to play with hint.ahci.X.msi yet, but that would be great if it solves the problem of my HDD (so I don't need to RMA them just to discover the culprit is a driver...)
Hmm ... I guess I have in Intel motherboards old controllers or where otherwise organized msi or may it there at all.
After assigning *msi* = 0, the system is fully stable.

But apparently it's still achi driver error?
I look for svn1 and svn2 something is being done, but did not understand what it was.
We will wait.
 
OP
WCSN

WCSN

Member

Reaction score: 18
Messages: 66

The new version of FreeBSD 10.2 did not bring a solution.
All the same. One drive is working properly. Several drives in the RAID continue to give all the same mistake.
View source ahci shows that no particularly significant change in the driver has not been.

hint.ahci.msi set in default...

Новая версия FreeBSD 10.2 не принесла решения проблемы.
Всё тоже самое. Один диск работает нормально. Несколько дисков в режиме RAID продолжают всё ту же давать ошибку.
Просмотр исходников ahci показывает, что никаких особо значимых изменений в драйвере не проводилось.
 

diizzy

Well-Known Member

Reaction score: 86
Messages: 328

I'm going to say that this is probably broken hardware in some way, broken as in not working is it is supposed to do by design rather than faulty.
//Danne
 

patpro

Active Member

Reaction score: 15
Messages: 230

Not so sure about that. Unless 10.x-RELEASE comes with a particularly pedantic ahci implementation that would not allow such hypothetically "broken-by-design" hardware/firmware to work properly.
It worked perfectly in 9.x, now it's broken in 10.x, so my guess is, something changed that (sh|c)ould be reversed.
 

displaced

New Member

Reaction score: 2
Messages: 18

displaced

New Member

Reaction score: 2
Messages: 18

Applied and rebooted. No luck -- still seeing timeouts on ahcich0 and ahcich1.

I'm really worried about the health of my ZFS pools. I'll need to shut this machine down and wait until I can revert to 10.0-RELEASE, assuming my ZFS pools will still be readable, since I stupidly did a zpool upgrade before I noticed this issue.
 

raul321

New Member


Messages: 6

I have this too
Code:
ahcich0: Timeout on slot 6 port 0
ahcich0: is 00000000 cs 00000000 ss 00000040 rs 00000040 tfd 40 serr 00000000 cmd 0000e617
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 01 44 5e 52 40 99 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command

Code:
]# zpool status
  pool: data
state: ONLINE
status: One or more devices are configured to use a non-native block size.
  Expect reduced performance.
action: Replace affected devices with devices that support the
  configured block size, or migrate data to a properly configured
  pool.
  scan: none requested
config:

  NAME  STATE  READ WRITE CKSUM
  data  ONLINE  0  0  0
  raidz1-0  ONLINE  0  0  0
  ada0p3  ONLINE  0  0  0  block size: 512B configured, 4096B native
  ada1p3  ONLINE  0  0  0  block size: 512B configured, 4096B native
  ada2p3  ONLINE  0  0  0  block size: 512B configured, 4096B native
  ada3p3  ONLINE  0  0  0  block size: 512B configured, 4096B native
  gptid/97218bd3-5980-11e2-b974-902b34a1bd7e  ONLINE  0  0  0  block size: 512B configured, 4096B native
  gptid/97a7e40a-5980-11e2-b974-902b34a1bd7e  ONLINE  0  0  0  block size: 512B configured, 4096B native

errors: No known data errors

Code:
# freebsd-version -ku
10.2-RELEASE
10.2-RELEASE-p4
 
OP
WCSN

WCSN

Member

Reaction score: 18
Messages: 66

Code:
# freebsd-version
10.2-RELEASE-p8
# uname -a
FreeBSD wfid78-172 10.2-RELEASE-p4 FreeBSD 10.2-RELEASE-p4 #0: Tue Aug 18 15:15:36 UTC 2015     root@amd64-builder.pcbsd.org:/usr/obj/usr/src/sys/GENERIC  amd64


Strange... version is different :).

After upgrade my system have hint.ahci.0.msi set in default...

Error have!
Uptime after upgrade will over 25 days.
Last update kernel I install yesterday and after reboot I got our favourite error.

Author of AHCI driver [USER=227]mav@[/USER] is silent.
 
OP
WCSN

WCSN

Member

Reaction score: 18
Messages: 66

Code:
Nov  9 23:49:51 wcsn kernel: ahcich1: Timeout on slot 25 port 0
Nov  9 23:49:51 wcsn kernel: ahcich1: is 00000000 cs 02000000 ss 00000000 rs 02000000 tfd 50 serr 00000000 cmd 10009617
Nov  9 23:49:51 wcsn kernel: ahcich1: Error while READ LOG EXT
Nov  9 23:49:51 wcsn kernel: (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 28 03 00 40 00 00 00 01 00 00
Nov  9 23:49:51 wcsn kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Nov  9 23:49:51 wcsn kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Nov  9 23:49:51 wcsn kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Nov  9 23:49:51 wcsn kernel: (ada1:ahcich1:0:0:0): Retrying command
Nov  9 23:49:51 wcsn kernel: (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 28 03 00 40 00 00 00 01 00 00
Nov  9 23:49:51 wcsn kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Nov  9 23:49:51 wcsn kernel: (ada1:ahcich1:0:0:0): ATA status: 00 ()
Nov  9 23:49:51 wcsn kernel: (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Nov  9 23:49:51 wcsn kernel: (ada1:ahcich1:0:0:0): Retrying command


device.hints

Code:
hint.ahci.0.msi="0"
hint.ahci.1.msi="0"


very bad very bad!!! :(
 
OP
WCSN

WCSN

Member

Reaction score: 18
Messages: 66

Now... bug repeated :(

Code:
FreeBSD wcsn.wocson.ru 10.2-RELEASE-p11 FreeBSD 10.2-RELEASE-p11 #0: Thu Jan 14 15:48:17 UTC 2016     root@amd64-builder.pcbsd.org:/usr/obj/usr/src/sys/GENERIC  amd64


Now after update to -p11. First reboot and get error AHCI.
 

rowan194

Member

Reaction score: 5
Messages: 57

Hi all, I also had those pesky CRC and timeout errors (plus a couple of panics on ZFS import) start almost immediately after an upgrade from 8.4-STABLE to 10.3-RELEASE. I can confirm that at least one of my drives had 0 CRC errors reported by SMART prior to the upgrade, with a power on time of over 11 months under 8.4-STABLE; since upgrading to 10.3-RELEASE it has suddenly accumulated 480 CRC errors.

I've registered to post a potential solution, in /boot/loader.conf:

Code:
# disable MSI to try to fix repeated CAM CRC/timeout errors and possible panic on zfs import. these suddenly started on port 6/7 after changing from FreeBSD 8.4 to 10.3 (same hardware)
hint.ahci.0.msi=0
hint.ahci.1.msi=0
hint.ahci.2.msi=0
hint.ahci.3.msi=0
hint.ahci.4.msi=0
hint.ahci.5.msi=0
hint.ahci.6.msi=0
hint.ahci.7.msi=0

# force SATA2 on all drives (this was also present on 8.4 install)
hint.ahcich.0.sata_rev=2
hint.ahcich.1.sata_rev=2
hint.ahcich.2.sata_rev=2
hint.ahcich.3.sata_rev=2
hint.ahcich.4.sata_rev=2
hint.ahcich.5.sata_rev=2
hint.ahcich.6.sata_rev=2
hint.ahcich.7.sata_rev=2


I suspect it that perhaps forcing drives to SATA2 is the solution, as others have reported that disabling MSI seems to do nothing. I am not game to test this hypothesis further; it seems to be working now - not a single error since boot - so I'm leaving it exactly as it is!

Hope this helps others fix the problem. Please update this thread if it works for you.
 

CreateCat

New Member


Messages: 2

Hope this helps others fix the problem. Please update this thread if it works for you.

I can confirm that the sata_rev tweak this seems to have worked for me. I was able to complete a full zpool scrub without error, as well as several hundred GB of transfers. Normally it would crash about 20% through the scrub.

Like rowan194, I'm not too interested in playing around any more as long as this works. I will note that I used quotes around my numbers:
Code:
="2"

instead of
Code:
=2

(in case it matters)

To help Google direct people here in the future, I will also mention that this is on PC-BSD 10.1 on an HP Microserver N40L.
 
OP
WCSN

WCSN

Member

Reaction score: 18
Messages: 66

Code:
# disable MSI to try to fix repeated CAM CRC/timeout errors and possible panic on zfs import. these suddenly started on port 6/7 after changing from FreeBSD 8.4 to 10.3 (same hardware)
hint.ahci.0.msi=0
hint.ahci.1.msi=0
hint.ahci.2.msi=0
hint.ahci.3.msi=0
hint.ahci.4.msi=0
hint.ahci.5.msi=0
hint.ahci.6.msi=0
hint.ahci.7.msi=0

# force SATA2 on all drives (this was also present on 8.4 install)
hint.ahcich.0.sata_rev=2
hint.ahcich.1.sata_rev=2
hint.ahcich.2.sata_rev=2
hint.ahcich.3.sata_rev=2
hint.ahcich.4.sata_rev=2
hint.ahcich.5.sata_rev=2
hint.ahcich.6.sata_rev=2
hint.ahcich.7.sata_rev=2


This parameters now don't help.
The error occurs anyway.

Unfortunately @mav not paying attention.
I remain of the opinion that the error occurs on systems with lots of disk organization raid.
AHCI-driver in linux, this error is handled differently and there is no check occurs on LVM arrays.

% uname -a
FreeBSD wfid78-172 10.3-RELEASE-p4 FreeBSD 10.3-RELEASE-p4 #0: Thu Jun 2 00:55:04 UTC 2016 root@amd64-builder.pcbsd.org:/usr/obj/usr/src/sys/GENERIC amd64

Code:
tail ... messages
Jun  5 19:39:53 wfid78-172 kernel: ahcich0: is 00000000 cs 01000000 ss 00000000 rs 01000000 tfd c0 serr 00000000 cmd 0000f817
Jun  5 19:39:53 wfid78-172 kernel: (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jun  5 19:39:53 wfid78-172 kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Jun  5 19:39:53 wfid78-172 kernel: (ada0:ahcich0:0:0:0): Retrying command
Jun  5 20:42:41 wfid78-172 kernel: ahcich0: Timeout on slot 3 port 0
Jun  5 20:42:41 wfid78-172 kernel: ahcich0: is 00000000 cs 00000008 ss 00000000 rs 00000008 tfd c0 serr 00000000 cmd 0000e317
Jun  5 20:42:41 wfid78-172 kernel: (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jun  5 20:42:41 wfid78-172 kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Jun  5 20:42:41 wfid78-172 kernel: (ada0:ahcich0:0:0:0): Retrying command
Jun  5 21:16:59 wfid78-172 kernel: ahcich0: Timeout on slot 16 port 0

Jun  5 19:39:53 wfid78-172 kernel: ahcich0: is 00000000 cs 01000000 ss 00000000 rs 01000000 tfd c0 serr 00000000 cmd 0000f817
Jun  5 19:39:53 wfid78-172 kernel: (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jun  5 19:39:53 wfid78-172 kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Jun  5 19:39:53 wfid78-172 kernel: (ada0:ahcich0:0:0:0): Retrying command
Jun  5 20:42:41 wfid78-172 kernel: ahcich0: Timeout on slot 3 port 0
Jun  5 20:42:41 wfid78-172 kernel: ahcich0: is 00000000 cs 00000008 ss 00000000 rs 00000008 tfd c0 serr 00000000 cmd 0000e317
Jun  5 20:42:41 wfid78-172 kernel: (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jun  5 20:42:41 wfid78-172 kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Jun  5 20:42:41 wfid78-172 kernel: (ada0:ahcich0:0:0:0): Retrying command
Jun  5 21:16:59 wfid78-172 kernel: ahcich0: Timeout on slot 16 port 0
 

marian_cerny

New Member

Reaction score: 9
Messages: 14

I had a similar problem when upgrading from 9.3 to 10.3. Message ahcich0: Timeout on slot X appeared each time I run smartctl (or at boot with smartd). After rebuilding smartmontools on 10.3 the message disappeared.
 

Stanislav

New Member


Messages: 2

I have similiar problem...
Code:
ata3: error issuing ata_identify command

on
Code:
atapci0: <Promise PDC20571 SATA150 controller>

I have create pool tank using two disks, all worked properly, but after add "zfs_enable=yes" and reboot, the error was present...
 

Stanislav

New Member


Messages: 2

Note that this is a two year old thread referencing FreeBSD versions that have since gone EoL.
I have
Code:
root@pool2:/var/log # uname -a
FreeBSD pool2 11.1-RELEASE FreeBSD 11.1-RELEASE #0 r321309: Fri Jul 21 02:08:28 UTC 2017     root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
 

Ome Ko

New Member


Messages: 5

None of the above worked for me.
I'm running 2 servers with a Supermicro A1SAM-2550F Serverboard, Intel Atom C2550 2.4Ghz Quad Core with 2 HGST Ultrastar 7K2 harddisks in a ZFS mirror.

One has been running for almost 2 years as a backupserver with the occasional timeout. Errors increased dramatically when switched to production. Yesterday I setup an almost identical server (only difference is that disks on the new server are running on SATA 3 instead of 2).
The "old" one was FREEBSD 11.0 RELEASE until EOL then went to 11.2-RELEASE-p4 FreeBSD and is running SATA 2.
The new one is FREEBSD 12.0 RELEASE running SATA 3.

Both suffer from timeouts at boottime.

Code:
"Old":

+ada0 at ahcich4 bus 0 scbus4 target 0 lun 0
+ada0: <HGST HUS722T1TALA604 RAGNWA05> ACS-3 ATA SATA 3.x device
+ada0: Serial Number WMC6M0D6D790
+ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
+ada0: Command Queueing enabled
+ada0: 953869MB (1953525168 512 byte sectors)
+ada1 at ahcich5 bus 0 scbus5 target 0 lun 0
+ada1: <HGST HUS722T1TALA604 RAGNWA05> ACS-3 ATA SATA 3.x device
+ada1: Serial Number WMC6M0D5SHDJ
+ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
+ada1: Command Queueing enabled
+ada1: 953869MB (1953525168 512 byte sectors)

New:

ada0 at ahcich4 bus 0 scbus4 target 0 lun 0
ada0: <HGST HUS722T1TALA604 RAGNWA07> ACS-3 ATA SATA 3.x device
ada0: Serial Number WMC6N0K2T9MU
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 953869MB (1953525168 512 byte sectors)
ada1 at ahcich5 bus 0 scbus5 target 0 lun 0
ada1: <HGST HUS722T1TALA604 RAGNWA07> ACS-3 ATA SATA 3.x device
ada1: Serial Number WMC6N0K6HJJ6
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 953869MB (1953525168 512 byte sectors)


The error is the regular

Code:
ahcich5: Timeout on slot 3 port 0
ahcich5: is 00000000 cs 00000000 ss 00000008 rs 00000008 tfd 40 serr 00000000 cmd 0004c317
(ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 b0 0e 46 40 0a 00 00 01 00 00
(ada1:ahcich5:0:0:0): CAM status: Command timeout
(ada1:ahcich5:0:0:0): Retrying command, 3 more tries remain


occuring on both servers and on both enabled ahcichannels.

Smarctl shows no errors on any of the disks involved. I can't imagine this being a hardware failure of any kind. I set both slots on the new server to IDE in the bios to make sure.

If anyone has a clue how to proceed on this issue (other than keeping the IDE setting) I'd be happy to learn.
 

tingo

Daemon

Reaction score: 503
Messages: 2,313

Does your server use sata cables? (instead of a backplane) If so, try to change them. If that doesn't work , try changing drives.
Oh, and be sure that the firmware on your server is up to date (bios / uefi / etc )
 

ralphbsz

Son of Beastie

Reaction score: 1,660
Messages: 2,594

...
Smarctl shows no errors on any of the disks involved. I can't imagine this being a hardware failure of any kind.
Then you need a more fertile imagination!

Smartctl only checks the inside of the disk drive. Fundamentally, the computer uses the existing communication infrastructure to ask the disk drive how it is feeling. The disk drive will respond with information about its internal health, such as spinup/poweron counts, error counts, temperature, and so on. But it does nothing to check the communication infrastructure, it just needs the communication to basically work.

Your problem is not a disk error. Component ahcich is not a disk drive, but part of the communication infrastructure, namely part of the ahci(4) driver (try the command "man ahci" to read about it). If you are getting a timeout on ahci, then either the SATA part of your motherboard, or perhaps your external HBA (if you have one), or the SATA cable, or the SATA port on the disk drive are faulty. Of those, the most likely candidate is the SATA cable, just as tingo said. Because it is the most easily damaged, and because it can be replaced with minor effort and cost. A good starting point would be just to reseat the SATA cable. Also try reseating or changing the power cable to the disk drive, bad grounding or intermittent power can also cause communication problems. If all that fails, then more trial and error will be required, like switching the disks back and forth and see whether the problem follows the disk or the motherboard port.
I set both slots on the new server to IDE in the bios to make sure.
I have no idea what that does.
 

Ome Ko

New Member


Messages: 5

I have the latest BIOS version installed.
If I'm on IDE and I see no errors (which I have not for the past 2 days) how could the cables be harriewarrie?
It could be that these HDD's are not compatible in some odd way, but I don't think they shipped 2 faulty disks 2 years ago and another 2 faulty disks last week.
I took a peep at my rock solid production server that's been running since 2013 on AHCI using other disks, and it has an Intel Cougar Point AHCI SATA controller. New boxes run an Intel Avoton AHCI SATA controller (Avoton is in the kernel AHCI device code).
That may be an issue, or otherwise these new disks may not be compatible.

[update] Just found out there's some interesting data on https://www.supermicro.com/support/resources/OS/MicroBlade_Atom.cfm
Looks like there's been no official FreeBSD support for the Atom chipset since 10.0 but I cannot find any references to AHCI problems and ATOM chips in BSD since.
 
Top