Other Kernel: ahcich: Timeout in slot

In my case, the machine I upgraded the OS on ran rock solid for 4 years, so the upgrade is clearly the cause of the timeouts I'm now seeing.
 
I had the same issue for a while on FreeBSD 11.2 and 11.3 with Supermicro and DELL servers:

Code:
Oct 25 10:51:04 X2 kernel: ahcich1: Timeout on slot 21 port 0
Oct 25 10:51:04 X2 kernel: ahcich1: is 00000000 cs 00200000 ss 00000000 rs 00200000 tfd d0 serr 00000000 cmd 0004d517
Oct 25 10:51:04 X2 kernel: (aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Oct 25 10:51:04 X2 kernel: (aprobe0:ahcich1:0:0:0): CAM status: Command timeout
Oct 25 10:51:04 X2 kernel: (aprobe0:ahcich1:0:0:0): Retrying command
Oct 25 10:51:34 X2 kernel: ahcich1: Timeout on slot 23 port 0
Oct 25 10:51:34 X2 kernel: ahcich1: is 00000000 cs 00800000 ss 00000000 rs 00800000 tfd d0 serr 00000000 cmd 0004d717
Oct 25 10:51:34 X2 kernel: (aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Oct 25 10:51:34 X2 kernel: (aprobe0:ahcich1:0:0:0): CAM status: Command timeout
Oct 25 10:51:34 X2 kernel: (aprobe0:ahcich1:0:0:0): Retrying command
Oct 25 10:51:52 X2 kernel: icmp redirect from 194.187.108.13: 195.137.202.4 => 194.187.108.11
Oct 25 10:52:04 X2 kernel: ahcich1: Timeout on slot 25 port 0
Oct 25 10:52:04 X2 kernel: ahcich1: is 00000000 cs 02000000 ss 00000000 rs 02000000 tfd d0 serr 00000000 cmd 0004d917
Oct 25 10:52:04 X2 kernel: (aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Oct 25 10:52:04 X2 kernel: (aprobe0:ahcich1:0:0:0): CAM status: Command timeout
Oct 25 10:52:04 X2 kernel: (aprobe0:ahcich1:0:0:0): Retrying command
Oct 25 10:52:35 X2 kernel: ahcich1: Timeout on slot 27 port 0
Oct 25 10:52:35 X2 kernel: ahcich1: is 00000000 cs 08000000 ss 00000000 rs 08000000 tfd d0 serr 00000000 cmd 0004db17
Oct 25 10:52:35 X2 kernel: (aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Oct 25 10:52:35 X2 kernel: (aprobe0:ahcich1:0:0:0): CAM status: Command timeout
Oct 25 10:52:35 X2 kernel: (aprobe0:ahcich1:0:0:0): Retrying command
Oct 25 10:52:42 X2 kernel: sonewconn: pcb 0xfffff8003d2ef3a0: Listen queue overflow: 151 already in queue awaiting acceptance (5 occurrences)
Oct 25 10:53:05 X2 kernel: ahcich1: Timeout on slot 29 port 0
Oct 25 10:53:05 X2 kernel: ahcich1: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd d0 serr 00000000 cmd 0004dd17
Oct 25 10:53:05 X2 kernel: (aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Oct 25 10:53:05 X2 kernel: (aprobe0:ahcich1:0:0:0): CAM status: Command timeout
Oct 25 10:53:05 X2 kernel: (aprobe0:ahcich1:0:0:0): Retrying command
Oct 25 10:53:35 X2 kernel: ahcich1: Timeout on slot 31 port 0
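
A side note on reading these dumps: in every register line above, cs and rs have exactly one bit set, and its position matches the slot named in the preceding "Timeout on slot" line (0x00200000 is bit 21, 0x00800000 is bit 23, and so on). Bits 8-12 of cmd give the same number, which fits my reading of the current-command-slot field in the AHCI port command register. A minimal Python sketch to check this against a log line; the function names are made up for illustration, and the field interpretation is my assumption rather than something the driver output states:

```python
import re

# Hypothetical helpers for reading the register dump that follows a
# "Timeout on slot N" line. Field meanings are assumptions (see text above).

CS_FIELD = re.compile(r"\bcs ([0-9a-f]{8})\b")
CMD_FIELD = re.compile(r"\bcmd ([0-9a-f]{8})\b")


def slots_from_mask(mask):
    """Return the bit positions set in a 32-bit slot mask."""
    return [bit for bit in range(32) if mask & (1 << bit)]


def pending_slots(line):
    """Slots flagged in the cs field of a register dump line."""
    match = CS_FIELD.search(line)
    return slots_from_mask(int(match.group(1), 16)) if match else []


def current_slot(line):
    """Bits 8-12 of the cmd field, read as a slot number."""
    match = CMD_FIELD.search(line)
    return (int(match.group(1), 16) >> 8) & 0x1F if match else None


sample = ("ahcich1: is 00000000 cs 00200000 ss 00000000 rs 00200000 "
          "tfd d0 serr 00000000 cmd 0004d517")
print(pending_slots(sample))  # [21]
print(current_slot(sample))   # 21
```

So each timeout is a single outstanding command (the ATA_IDENTIFY probe) that never completes, not a pile of queued I/O.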

Fixed with
Code:
vfs.zfs.cache_flush_disable: 1

and a power cycle; stable for at least 3 days.
 
Does anybody have this problem after updating to FreeBSD 12.1?

Using vfs.zfs.cache_flush_disable: 1 solved the problem for me but caused some performance degradation.
 
I tried that (with 11.3) in /boot/loader.conf with no effect.
Code:
# pciconf -lvbe ahci0
ahci0@pci0:0:31:2:    class=0x010601 card=0x01211462 chip=0x1c028086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '6 Series/C200 Series Chipset Family 6 port Desktop SATA AHCI Controller'
    class      = mass storage
    subclass   = SATA
    bar   [10] = type I/O Port, range 32, base 0xf070, size 8, enabled
    bar   [14] = type I/O Port, range 32, base 0xf060, size 4, enabled
    bar   [18] = type I/O Port, range 32, base 0xf050, size 8, enabled
    bar   [1c] = type I/O Port, range 32, base 0xf040, size 4, enabled
    bar   [20] = type I/O Port, range 32, base 0xf020, size 32, enabled
    bar   [24] = type Memory, range 32, base 0xf7b02000, size 2048, enabled
so 5 AHCI channels.

Setting them all (I know, why use a shotgun when you've got a rifle?) to
Code:
hint.ahcich.0.sata_rev=2
hint.ahcich.1.sata_rev=2
hint.ahcich.3.sata_rev=2
hint.ahcich.4.sata_rev=2
hint.ahcich.5.sata_rev=2
did nothing.
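
For reference, as far as I recall from ahci(4), sata_rev values 1, 2 and 3 cap the link at 1.5, 3 and 6 Gbps respectively, so the hints above ask for SATA II speed. Note that the list skips ahcich.2. A small Python sketch that generates such hint lines for a chosen set of channels; the helper name and the speed table are mine, so check them against the man page before relying on them:

```python
# Assumed mapping of sata_rev values to link speeds (verify against ahci(4)).
SATA_REV_SPEED = {1: "1.5 Gbps", 2: "3 Gbps", 3: "6 Gbps"}


def sata_rev_hints(channels, rev):
    """Build hint.ahcich.N.sata_rev lines for the given channel numbers."""
    if rev not in SATA_REV_SPEED:
        raise ValueError(f"unsupported sata_rev value: {rev}")
    return [f"hint.ahcich.{ch}.sata_rev={rev}" for ch in channels]


# The same five channels as in the post above, limited to 3 Gbps.
for line in sata_rev_hints([0, 1, 3, 4, 5], 2):
    print(line)
```

Generating the lines from the channel list makes a skipped channel easier to spot than in a hand-typed block.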

In the following order, I did:

1. Replaced the cables.
2. Replaced the disks (resilvered the ZFS mirror).
3. Replaced the entire server (with new cables). So: new server, new cables, new hard disks.

Obviously there are some performance limits being hit (too many jails doing too much work?). But it's very odd.

My drive to the RZ (data center) is non-trivial. I'm going to rebuild these boxes with Linux.
 
Adding:
kern.cam.ada.write_cache=0

doing an ada1 detach from the zpool
rebooting
doing an attach to the zpool,

seems to limit how often the error occurs.
Since the error inevitably returns under load, I'm moving services off the machine (to another OS, as I indicated). It's a pity that this kind of error (even if it is caused by load) can freeze a server.

I mean, I have serial consoles (no ipmi crap) and I can't get into these machines once the errors pile up.
 
I have this problem when moving a big file from an NVMe drive to a SATA SSD. Using dd without the "bs" parameter instead of mv slowed the copy down enough to avoid it.
vfs.zfs.cache_flush_disable=1 did not help.
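
The dd trick works only as a side effect of its small default block size. If the goal is simply to keep the write rate under whatever the SATA SSD tolerates, an explicit rate limit is easier to tune. This is a different technique from the dd workaround, sketched in Python; the function name and the default chunk size are my own choices, and the right rate for a given drive is something you would have to find by trial:

```python
import time


def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=64 * 1024):
    """Copy src to dst, sleeping as needed to stay under max_bytes_per_sec.

    Returns the number of bytes copied.
    """
    if max_bytes_per_sec <= 0:
        raise ValueError("max_bytes_per_sec must be positive")
    copied = 0
    start = time.monotonic()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
            # Earliest moment at which this many bytes may have been written.
            earliest = start + copied / max_bytes_per_sec
            delay = earliest - time.monotonic()
            if delay > 0:
                time.sleep(delay)
    return copied


# Example call (hypothetical paths): limit the copy to about 50 MB/s.
# throttled_copy("/nvme/big.img", "/sata/big.img", 50 * 1000 * 1000)
```

Unlike mv, this leaves the source file in place, so delete it yourself once the copy has been verified.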

Two SSDs on one controller might be too much:
Code:
ahci0@pci0:0:23:0:    class=0x010601 card=0x224517aa chip=0x9d038086 rev=0x21 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Sunrise Point-LP SATA Controller [AHCI mode]'
    class      = mass storage
    subclass   = SATA
 