Other Kernel: ahcich: Timeout in slot

If I'm on IDE and I see no errors (which I have not for the past 2 days) how could the cables be harriewarrie?
Well, for one thing, AHCI and IDE operates at different speeds. So if you cables are slightly out of spec, they might work in IDE mode, but fail in AHCI mode.
But the second point is; it takes you a short time to try other cables (assuming you have them, and the machine is within reach) and even if it isn't the cables then you have checked out one point of possible things from the list.
 
One more datapoint for the ahci timeouts.

I recently upgraded from 10.3 to 11.2, and this problem did not start happening until I upgraded. If I hard rebooted (power cycle), the problem will go away for a random number of days. A soft reboot does not fix the problem.

Now I also have smartctl running out of icinga2 regularly (and want to continue because I like advance warning of disk failure). The interesting datapoint here is that smartctl was built for 10.3; I didn't upgrade my packages on the OS upgrade (for irrelevant reasons).

Quite by accident, I decided to upgrade my packages just last night. Since the machine has since survived a large set of writes without the controller timeouts, I did weakly conclude that it might be a mismatch between the 10.3 smartctl and 11.2. However, since then, I've experienced the timeouts again so ... back to the drawing board eh?

It would seem to me that behavior like this:

Code:
# grep Timeout /var/log/messages
...
Feb 26 17:30:23 zfsraid kernel: ahcich0: Timeout on slot 0 port 0
Feb 26 17:30:23 zfsraid kernel: ahcich0: Timeout on slot 24 port 0
Feb 26 17:30:23 zfsraid kernel: ahcich0: Timeout on slot 4 port 0
Feb 26 17:30:23 zfsraid kernel: ahcich0: Timeout on slot 12 port 0
Feb 26 17:30:23 zfsraid kernel: ahcich0: Timeout on slot 21 port 0
Feb 26 17:32:32 zfsraid kernel: ahcich0: Timeout on slot 0 port 0
Feb 26 17:33:44 zfsraid kernel: ahcich0: Timeout on slot 11 port 0
Feb 26 17:36:23 zfsraid kernel: ahcich0: Timeout on slot 1 port 0
Feb 26 17:44:09 zfsraid kernel: ahcich0: Timeout on slot 20 port 0
Feb 26 17:44:09 zfsraid kernel: ahcich0: Timeout on slot 21 port 0
Feb 26 17:44:09 zfsraid kernel: ahcich0: Timeout on slot 18 port 0
Feb 26 17:44:09 zfsraid kernel: ahcich0: Timeout on slot 20 port 0
Feb 26 17:44:53 zfsraid kernel: ahcich0: Timeout on slot 15 port 0
Feb 26 17:46:37 zfsraid kernel: ahcich0: Timeout on slot 22 port 0
Feb 26 17:46:37 zfsraid kernel: ahcich0: Timeout on slot 7 port 0
Feb 26 17:47:22 zfsraid kernel: ahcich0: Timeout on slot 0 port 0
Feb 26 17:52:14 zfsraid kernel: ahcich0: Timeout on slot 0 port 0
Feb 26 17:54:04 zfsraid kernel: ahcich0: Timeout on slot 20 port 0
Feb 26 17:54:04 zfsraid kernel: ahcich0: Timeout on slot 0 port 0
Feb 26 17:54:48 zfsraid kernel: ahcich0: Timeout on slot 15 port 0
Feb 26 18:02:40 zfsraid kernel: ahcich0: Timeout on slot 29 port 0
...

is indicative of something repeatedly connecting and disconnecting. I'm running 11.2-STABLE r342400. I hope this data can help someone squash this bug once and for all.
 
Aga... (Yes).
I have near tomorrow same problem toooooooooooooooooooo!
Now 11.2... but HDD will be near deadh... replaced on new HDD.
BUT! This problem not need normal work from ahcich0->27+? :(
 
I started experiencing the same problem with a QNAP TS-453B mini (Intel CPU) and 4x 4TB WD Red WD40EFRX-68N32N0.
The device worked perfectly (or at least I was not aware of any such issues) for a year on FreeBSD 11.1 with monthly scrubs.
Then I upgraded to 11.2-RELEASE and the timeouts started appearing.
After upgrade to 12.0-RELEASE and patching up to patch 3 the problem is still here.
HDDs seem just fine. A smartctl call shows no apparent errors. Zpool is also fine.

Symptom: When I do zpool scrub after a while (5%-15%) those timeouts occur and one drive gets detached, GELI device is destroyed, and I have to attach it again manually.
I can see in the log:
Code:
11:39:53  kernel: ahcich3: Timeout on slot 9 port 0
11:39:53  kernel: ahcich3: is 00000000 cs 0001f800 ss 0001ff00 rs 0001ff00 tfd 40 serr 00000000 cmd 0004cb17
11:39:53  kernel: (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 d0 9a 9e 40 20 00 00 01 00 00
11:39:53  kernel: (ada3:ahcich3:0:0:0): CAM status: Command timeout
11:39:53  kernel: (ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
11:41:32  kernel: ahcich3: Timeout on slot 8 port 0
11:41:32  kernel: ahcich3: is 00000000 cs 0000c000 ss 0000fff0 rs 0000fff0 tfd 40 serr 00000000 cmd 0004ce17
11:41:32  kernel: (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 80 e6 f4 40 21 00 00 01 00 00
11:41:32  kernel: (ada3:ahcich3:0:0:0): CAM status: Command timeout
11:41:32  kernel: (ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
11:46:53  kernel: ahcich2: Timeout on slot 5 port 0
11:46:53  kernel: ahcich2: is 00000000 cs 00070000 ss 0007ffe0 rs 0007ffe0 tfd 40 serr 00000000 cmd 0004d017
11:46:53  kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 a8 bd 0f 40 34 00 00 01 00 00
11:46:53  kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
11:46:53  kernel: (ada2:ahcich2:0:0:0): Retrying command, 3 more tries remain
11:49:21  kernel: ahcich2: Timeout on slot 31 port 0
11:49:21  kernel: ahcich2: is 00000000 cs 00020000 ss 8003ffff rs 8003ffff tfd 40 serr 00000000 cmd 0004d017
11:49:21  kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 28 7a 4a 40 35 00 00 01 00 00
11:49:21  kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
11:49:21  kernel: (ada2:ahcich2:0:0:0): Retrying command, 3 more tries remain
11:49:21  kernel: ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
11:49:21  kernel: ada2: < > detached

Interestingly, I seem to have an ASMedia ASM106x SATA controller for hdd2+3 along with another SATA controller for hdd0+1.
The problem happens only with hdd2 or hdd3, but never with hdd 0 or 1. So it has something to do with the ASMedia controller.
I can see in the dmesg log:
Code:
# dmesg | grep ahci
ahci0: <AHCI SATA controller> port 0xf090-0xf097,0xf080-0xf083,0xf060-0xf07f mem 0x91414000-0x91415fff,0x9141c000-0x9141c0ff,0x9141b000-0x9141b7ff irq 19 at device 18.0 on pci0
ahci0: AHCI v1.31 with 2 6Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahci1: <ASMedia ASM106x AHCI SATA controller> mem 0x91380000-0x91381fff irq 22 at device 0.0 on pci1
ahci1: AHCI v1.31 with 12 6Gbps ports, Port Multiplier not supported
ahci1: quirks=0xc00000<NOCCS,NOAUX>
ahcich2: <AHCI channel> at channel 0 on ahci1
ahcich3: <AHCI channel> at channel 1 on ahci1
It is worth to point out that the ahci0 controller has Port Multiplier supported but ahci1 has Port Multiplier not supported. I have no idea if this is relevant to the problem but it is one difference I can see.

I tried the workaround proposed above in this thread:
Code:
nano /boot/device.hints
# ------------------8<-------------------------
# Workaround for AHCI CAM timeouts, source: https://forums.freebsd.org/threads/kernel-ahcich-timeout-in-slot.51868/
# I believe I need only the second line but doing both anyway.
hint.ahci.0.msi=0
hint.ahci.1.msi=0
# ------------------8<-------------------------
And this seems to improve the situation (somewhat). Some timeout errors still do occur, however the drives do not get detached and the scrub continues so far (at 30% right now). I'll post an update when it's finished. UPDATE: The error occurred again at ~40% done.

There is a second workaround, which I have not yet tested, but people report it helped them. Forcing the drives to SATA II, which should not be a problem because 300 MB/s should be more than what my current drives can handle.
UPDATE: This workaround worked for me. I was able to do a complete scrub and no timeouts occurred. I did not combine it with the first one (otherwise the whole machine freezed and/or rebooted).
Code:
# force SATA2 on all drives
# Again, I think I need to set this only for the second SATA controller, i.e. only for channels 2, 3, ...
hint.ahcich.0.sata_rev=2
hint.ahcich.1.sata_rev=2
hint.ahcich.2.sata_rev=2
hint.ahcich.3.sata_rev=2
# ...

To summarize my experience with the problem:
  • It seems to have appeared when I upgraded from 11.1 to 11.2. Upgrading to 12.0-RELEASE-p3 did not help.
  • It seems to have something to do with the SATA/AHCI controller driver and how IRQ's are handled.
  • It happens only with my second AHCI controller but not the first one.
  • It happens when I do scrub, otherwise I can use the system for a long time without issues.
  • I have quite a lot of snapshots (~8000 and growing) so initially I thought it was probably a RAM/swap/ZFS ARC issue but the RAM is 16GB and even with swap the swap never gets used, so I think this could be ruled out.
  • Applying workaround #2 (forcing the problematic SATA controller's channels to SATA II) seems to help in my case and I was able to complete the scrub.
Here is a collection of related links I found during my research. Most of them seem to describe the same or at least a similar problem. Some are for even older versions, like 10.2.
 
I was able to work around the problem and I'd like to post a short summary in a separate post.
Forcing the problematic SATA controller and its disks to SATA II mode helped me and I completed a scrub without any timeouts.

I put the following in my /boot/device.hints:
Code:
# force SATA2 on all drives on second SATA controller
hint.ahcich.2.sata_rev=2
hint.ahcich.3.sata_rev=2
hint.ahcich.4.sata_rev=2
hint.ahcich.5.sata_rev=2
hint.ahcich.6.sata_rev=2
hint.ahcich.7.sata_rev=2
hint.ahcich.8.sata_rev=2
hint.ahcich.9.sata_rev=2
hint.ahcich.10.sata_rev=2
hint.ahcich.11.sata_rev=2
hint.ahcich.12.sata_rev=2
hint.ahcich.13.sata_rev=2
Please note that I have the problem only with one of my two SATA controllers (two out of four HDDs are hit by the error):
Code:
ahci0: <AHCI SATA controller> port 0xf090-0xf097,0xf080-0xf083,0xf060-0xf07f mem 0x91414000-0x91415fff,0x9141c000-0x9141c0ff,0x9141b000-0x9141b7ff irq 19 at device 18.0 on pci0
ahci0: AHCI v1.31 with 2 6Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahci1: <ASMedia ASM106x AHCI SATA controller> mem 0x91380000-0x91381fff irq 22 at device 0.0 on pci1
ahci1: AHCI v1.31 with 12 6Gbps ports, Port Multiplier not supported
ahci1: quirks=0xc00000<NOCCS,NOAUX>
ahcich2: <AHCI channel> at channel 0 on ahci1
ahcich3: <AHCI channel> at channel 1 on ahci1
ahcich6: <AHCI channel> at channel 4 on ahci1
ahcich7: <AHCI channel> at channel 5 on ahci1
ahcich8: <AHCI channel> at channel 6 on ahci1
ahcich9: <AHCI channel> at channel 7 on ahci1
ahcich10: <AHCI channel> at channel 8 on ahci1
ahcich11: <AHCI channel> at channel 9 on ahci1
ahcich12: <AHCI channel> at channel 10 on ahci1
ahcich13: <AHCI channel> at channel 11 on ahci1

Therefore I left ahci0 operate in SATA III mode and forced only ahci1 to operate as SATA II.
I did this for all the 12 ports.

I tried a couple of other combinations, for example combining hint.ahci.1.msi=0 with hint.ahcich.2.sata_rev=2 but the result was bad. The whole machine froze and/or restarted automatically.
Only when I applied just the SATA II hack it worked properly.

This is not a clean solution. The ahci code in the kernel needs to be fixed. The bug has been reported, see the links in my last post.
 
I noticed the two controllers are on different PCI busses. Have you swapped things around to see if you can isolate this to a bus.
ASMedia isn't exactly a controller I would use. Does this card happen to use a PCIe 1x connector?
 
I noticed the two controllers are on different PCI busses. Have you swapped things around to see if you can isolate this to a bus.
It is a prepackaged NAS box. It should work out of the box actually, I installed only the HDDs and two RAM modules. The rest is as designed by QNAP and I have not unscrewed the device so far.

ASMedia isn't exactly a controller I would use. Does this card happen to use a PCIe 1x connector?
I am not aware of this. I tried finding more detailed documentation on what's inside but QNAP does not publish so much technical detail about their design, or at least I was not able to find it.
What makes the ASMedia controller undesirable in your opinion?
And how could I check if the used slot is PCIe 1x? QNAP's engineers should have thought about this. It would be quite odd if something like that would slip through.
 
I was able to work around the problem and I'd like to post a short summary in a separate post.
Forcing the problematic SATA controller and its disks to SATA II mode helped me and I completed a scrub without any timeouts.

I put the following in my /boot/device.hints:
Code:
# force SATA2 on all drives on second SATA controller
hint.ahcich.2.sata_rev=2
hint.ahcich.3.sata_rev=2
...

Confirming that this worked for me. There are a couple of other workarounds if you scan this bug report:

 
What became of this? Is this believed to be a 'user hardware issue' or a FreeBSD kernel bug issue?

EDIT 2019-10-06:
I ask because my 1-year old brand new machine was rock solid. Then I noticed it crashed early this summer and I thought wait and see. Last night I tried to do a bunch of TARs which was always a non-issue event. Last night I could not get a single one to complete and I tired quite a few times; system freeze and I found these messages:

Code:
Oct  5 17:01:32 freedom-s001 kernel: ahcich1: Timeout on slot 6 port 0
Oct  5 17:01:32 freedom-s001 kernel: ahcich1: is 00000000 cs ffffffbf ss ffffffff rs ffffffff tfd 40 serr 00000000 cmd 0000c617
Oct  5 17:01:32 freedom-s001 kernel: (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 40 e4 31 40 cb 00 00 00 00 00
Oct  5 17:01:32 freedom-s001 kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
Oct  5 17:01:32 freedom-s001 kernel: (ada1:ahcich1:0:0:0): Retrying command

So I did this:
Code:
# force SATA2 on all drives on second SATA controller
hint.ahcich.1.sata_rev=2

And so far so good, all TARs done without issue.

I realize I could have a hardware issue, but a little while later I went to 11.2-RELEASE. Based on all the other disucssion I'm inclined to think this is a FreeBSD OS bug, and not a hardware issue. However if I am wrong and this issue has been proven to not be a OS issue I want to know so I can go troubleshooting further.

Note I did reseat cables last night which made no difference but I am willing to go buy and replace these 1-year old cables with something higher quality if need be. Hard drives are all brand new too and are WDs. OS drive is black, and 'nas' drives are red.
 
I realize I could have a hardware issue, but a little while later I went to 11.2-RELEASE. Based on all the other disucssion I'm inclined to think this is a FreeBSD OS bug, and not a hardware issue. However if I am wrong and this issue has been proven to not be a OS issue I want to know so I can go troubleshooting further.
Does this always happen on the same device (slot 6 port 0)? Are the drives directly cabled for both power and data, or is there some sort of backplane involved? If you reference the drives by GUID or label instead of by actual device name, you could swap the drive on port 6 with one from a different port and see if the problem stays or moves.
 
Sorry for the delayed responce.
What makes the ASMedia controller undesirable in your opinion?
I frequent the FreeNAS and pfSense forums and I follow recommendations found there.
Not a very brand'ish guy but Disk Controllers and Network Adapters are two areas where I do buy favored brands.
I realize with this appliance you might not have a choice.

And how could I check if the used slot is PCIe 1x?
How about looking at the pciconf output.
pciconf -lvcbe ahci1
 
How about looking at the pciconf output.
pciconf -lvcbe ahci1
Code:
ahci1@pci0:1:0:0:       class=0x010601 card=0x10601b21 chip=0x06251b21 rev=0x01 hdr=0x00
    vendor     = 'ASMedia Technology Inc.'
    class      = mass storage
    subclass   = SATA
    bar   [24] = type Memory, range 32, base 0x91380000, size 8192, enabled
    cap 05[50] = MSI supports 1 message enabled with 1 message
    cap 01[78] = powerspec 3  supports D0 D3  current D0
    cap 10[80] = PCI-Express 2 legacy endpoint max data 256(512) RO NS
                 link x2(x2) speed 5.0(5.0) ASPM disabled(L0s/L1)
    ecap 0002[100] = VC 1 max VC0
    ecap 0001[800] = AER 1 0 fatal 0 non-fatal 0 corrected
  PCI-e errors = Correctable Error Detected
Keep in mind that I forced the controller in SATA II mode as described above.
 
Does this always happen on the same device (slot 6 port 0)? Are the drives directly cabled for both power and data, or is there some sort of backplane involved? If you reference the drives by GUID or label instead of by actual device name, you could swap the drive on port 6 with one from a different port and see if the problem stays or moves.

Yes seems to be the same device, but, its the only one that takes a pounding when doing the TARs. I suppose I could try doing some TARing on another disk. The SATA cables are one per drive, and each cable goes to a dedicated port on the motherboard. The power cables I do believe are two drives per cable, which if very common in power supply designs for home retail grade machines. My machine is more or less a gaming grade machine, made for home use; nas, tinkering around, etc. CPUs sit idle most of the time. Only periodic(8), and TARing seem to hit the IO hard.

My machine in case you are interested:
Ryzen 5 1600 machine
 
Now I remember why I dislike ASMedia.
ahci1: AHCI v1.31 with 12 6Gbps ports, Port Multiplier not supported
So you have 12 ports on the controller.
cap 10[80] = PCI-Express 2 legacy endpoint max data 256(512) RO NS link x2(x2) speed 5.0(5.0) ASPM disabled(L0s/L1)
With only PCIe-2 bus and it is running at x2.
So looking here you will notice that PCIe 2@x2 only provides 1000Megabytes/sec
12 drives at 500Megabytes/sec would need 6000Megabytes/sec aggregated.
Now your 4 spinning rust drives probably don't max out the bus, but if you had SSD's then 2 drives would max out that bus.

Who does that? Thinking back Asmedia also sells a 4 port SATA3 card for PCIe x1 slot.
Utterly ridiculous. 4 drives need 2000Megabytes/sec. and x1 slot only provides 500Megabytes/sec.
Really underprovisioned in my opinion.

Like I say I don't think this is related but it is not what I want on my NAS.

Edit: I see on the Qnap page this blurb that makes me laugh.
and SATA 6Gb/s, the TS-453Bmini drives up to 225MB/s read speed
This line with its blatent typo is scary.
Expand the total raw storage capacity up to 12 drivers with the economical UX-800P QNAP expansion enclosure
 
Yes, but it may make the system more sensitive to bad cables.

Example: In a large SAS installation, we had some external SAS cables (3m long) that worked fine, until we started turning on T10DIF (i.e., checksums on the disks). Then they failed. Software changes can make hardware behave differently.
 
So is the consensus that this is a hardware issue on individual machines, and is not an OS bug? I find it hard to believe that a 1 year old machine could have a cable go bad already, but then mark my own words, I did say I bought consumer grade stuff.

I'll try some cable and port swapping over the next couple weeks when I have time to see if the trouble tracks. My sport fishing is wrapping up, but my wife is now hinting for home projects, but I should have more time. :)
 
Now I remember why I dislike ASMedia.

So you have 12 ports on the controller.

With only PCIe-2 bus and it is running at x2.
So looking here you will notice that PCIe 2@x2 only provides 1000Megabytes/sec
12 drives at 500Megabytes/sec would need 6000Megabytes/sec aggregated.
Now your 4 spinning rust drives probably don't max out the bus, but if you had SSD's then 2 drives would max out that bus.

Who does that? Thinking back Asmedia also sells a 4 port SATA3 card for PCIe x1 slot.
Utterly ridiculous. 4 drives need 2000Megabytes/sec. and x1 slot only provides 500Megabytes/sec.
Really underprovisioned in my opinion.

Like I say I don't think this is related but it is not what I want on my NAS.

Edit: I see on the Qnap page this blurb that makes me laugh.

This line with its blatent typo is scary.
Good points! I'll keep it in mind. As you wrote though, I don't think my 2 HDDs are going to max out the bus. The other 2 HDDs hang on a different controller.
A I reported above, the NAS scrubs fine, when I force the ASMedia attached drives to SATA II mode. I don't see any performance hit that matters.
In future - I'll watch out for such issues when shopping around.
 
So is the consensus that this is a hardware issue on individual machines, and is not an OS bug? I find it hard to believe that a 1 year old machine could have a cable go bad already, but then mark my own words, I did say I bought consumer grade stuff.

I'll try some cable and port swapping over the next couple weeks when I have time to see if the trouble tracks. My sport fishing is wrapping up, but my wife is now hinting for home projects, but I should have more time. :)
No, it is not the consensus. As I reported above - a software change worked around the bug for me. I forced the drives to talk SATA II and the issue is gone. The drive speed stayed about the same.
It certainly looks like a driver issue but I don't know which driver and how to debug it.
 
Stuff can go bad after a year, there might be strain on the cables which causes bad connections etc. I'd try updating bios (if it's a sata controller on the motherboard), hba firmware and possibly replace the cable(s) initially.
 
Code:
Oct  5 17:01:32 freedom-s001 kernel: ahcich1: Timeout on slot 6 port 0
Oct  5 17:01:32 freedom-s001 kernel: ahcich1: is 00000000 cs ffffffbf ss ffffffff rs ffffffff tfd 40 serr 00000000 cmd 0000c617
Oct  5 17:01:32 freedom-s001 kernel: (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 40 e4 31 40 cb 00 00 00 00 00
Oct  5 17:01:32 freedom-s001 kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
Oct  5 17:01:32 freedom-s001 kernel: (ada1:ahcich1:0:0:0): Retrying command

So I did this:
Code:
# force SATA2 on all drives on second SATA controller
hint.ahcich.1.sata_rev=2

Hmmmp, got another freeze on heavy disk usage today. So looks like the code tweak above made no difference. Cable swapping coming soon; got a bunch of stuff to do first.

Actually if I was to order some high performance heavy duty rock solid made of gold cables, could a couple of you folks recommend a couple brands? I would order from Amazon.ca or Newegg.ca likely.
 
Back
Top