ZFS MARVELL 88SE9230 broken?

gustopn · Oct 18, 2019

My Marvell 88SE9230 card has been producing the following errors:

Code:

Oct 18 05:14:54 constance kernel: (ada5:ahcich7:0:0:0): Error 5, Periph was invalidated
Oct 18 05:14:54 constance kernel: (ada5:ahcich7:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 c8 3e ae 40 07 01 00 00 00 00
Oct 18 05:14:54 constance kernel: (ada5:ahcich7:0:0:0): CAM status: Command timeout
Oct 18 05:14:54 constance kernel: (ada5:ahcich7:0:0:0): Error 5, Periph was invalidated
Oct 18 05:15:41 constance kernel: ahcich7: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Oct 18 05:15:41 constance kernel: ahcich7: Poll timeout on slot 7 port 0
Oct 18 05:15:41 constance kernel: ahcich7: is 00000000 cs f00000ff ss 00000040 rs 00000080 tfd 80 serr 00000000 cmd 10001b17
Oct 18 05:15:41 constance kernel: (aprobe0:ahcich7:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Oct 18 05:15:41 constance kernel: (aprobe0:ahcich7:0:0:0): CAM status: Command timeout
Oct 18 05:15:41 constance kernel: (aprobe0:ahcich7:0:0:0): Error 5, Retries exhausted
Oct 18 05:16:11 constance kernel: ahcich7: Timeout on slot 8 port 0
Oct 18 05:16:11 constance kernel: ahcich7: is 00000000 cs f00003ff ss 00000340 rs 00000300 tfd 80 serr 00000000 cmd 10001b17
Oct 18 05:16:11 constance kernel: (ada5:ahcich7:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 70 6c b3 40 07 01 00 00 00 00
Oct 18 05:16:11 constance kernel: (ada5:ahcich7:0:0:0): CAM status: Command timeout
Oct 18 05:16:11 constance kernel: (ada5:ahcich7:0:0:0): Error 5, Periph was invalidated
Oct 18 05:16:11 constance kernel: (ada5:ahcich7:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 d8 d4 de 40 0b 01 00 00 00 00
Oct 18 05:16:11 constance kernel: (ada5:ahcich7:0:0:0): CAM status: Unconditionally Re-queue Request
Oct 18 05:16:11 constance kernel: (ada5:ahcich7:0:0:0): Error 5, Periph was invalidated
Oct 18 05:16:11 constance kernel: (ada5:ahcich7:0:0:0): Periph destroyed
Oct 18 05:16:57 constance kernel: ahcich7: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Oct 18 05:16:57 constance kernel: ahcich7: Poll timeout on slot 10 port 0
Oct 18 05:16:57 constance kernel: ahcich7: is 00000000 cs f00007ff ss 00000340 rs 00000400 tfd 80 serr 00000000 cmd 10001b17
Oct 18 05:16:57 constance kernel: (aprobe0:ahcich7:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Oct 18 05:16:57 constance kernel: (aprobe0:ahcich7:0:0:0): CAM status: Command timeout
Oct 18 05:16:57 constance kernel: (aprobe0:ahcich7:0:0:0): Error 5, Retries exhausted

Of course, that caused my ZFS pool to go into a degraded state. At first I noticed only one disk dropping out, so I thought the HDD might be faulty,
but then I noticed that it kicked out ALL 4 of the disks that were connected to one specific PCIe card.
So I replaced that card, and I hope the problem does not come back while I am waiting for a new one (I found an old spare ASMedia card, which I am using now).
At first I suspected the Marvell driver, but I have 2 Marvell cards, and the other one is

Code:

ahci0: <Marvell 88SE9215 AHCI SATA controller> port 0xe050-0xe057,0xe040-0xe043,0xe030-0xe037,0xe020-0xe023,0xe000-0xe01f mem 0x91510000-0x915107ff irq 16 at device 0.0 on pci1
ahci0: AHCI v1.00 with 4 6Gbps ports, Port Multiplier supported with FBS

The suspected faulty card, which I have now replaced, was

Code:

/var/log/messages.4.bz2:Mar 27 00:14:01 constance kernel: ahci1: <Marvell 88SE9230 AHCI SATA controller> port 0xd050-0xd057,0xd040-0xd043,0xd030-0xd037,0xd020-0xd023,0xd000-0xd01f mem 0x91310000-0x913107ff irq 17 at device 0.0 on pci2
/var/log/messages.4.bz2:Mar 27 00:14:01 constance kernel: ahci1: AHCI v1.20 with 8 6Gbps ports, Port Multiplier not supported
/var/log/messages.4.bz2:Mar 27 00:14:01 constance kernel: ahci1: quirks=0x900<NOBSYRES,ALTSIG>

In its place I now have

Code:

ahci1: <ASMedia ASM1062 AHCI SATA controller> port 0xd050-0xd057,0xd040-0xd043,0xd030-0xd037,0xd020-0xd023,0xd000-0xd01f mem 0x91410000-0x914101ff irq 17 at device 0.0 on pci2
ahci1: AHCI v1.20 with 2 6Gbps ports, Port Multiplier supported
ahci1: quirks=0xc00000<NOCCS,NOAUX>

and I have seen no errors so far.
My first theory was a bug in the FreeBSD 12.0/12.1 driver. I cannot rule it out yet, but look at this:

This looks strange, and I have a feeling that we have a torn chip here. Has ANYONE seen ANYTHING remotely like this? I have not.
Look at the back:

I remember that there was NOTHING on the chip, no glue, nothing at all. And all of a sudden there is SOMETHING on it.
Does a hardware failure show itself like this?
Thanks!

SirDice · Oct 18, 2019

Does the chip itself get hot? I mean too hot to keep your finger on it? Be careful not to burn your finger though, I did this once and had a DIL-16 imprinted on my finger for a couple of months. If it's super hot then it's likely broken.

gustopn · Oct 18, 2019

I could not try that, because I have 3 PCIe cards stacked above each other, so there is no way to put a finger on the chip while it is running. One would need to put the card into another computer as the only card and then measure the temperature.
Those chips should have a sensor in them ;-) That would be great!

SirDice · Oct 18, 2019

Some time ago I had a cheap Promise ATA-300 card with 4 SATA ports. It had been working fine for some years and suddenly started producing similar weird bus errors and completely dropping the ZFS pool that was attached to all four ports. The controller chip was extremely hot (too hot to be considered 'normal'). It would work fine with only one disk but as soon as I added a second it would start acting up again. Trashed the card, and bought a proper LSI (Avago/Broadcom nowadays) HBA card. Should have bought that sooner.

gustopn · Oct 18, 2019

Yes, I remember Promise controllers, but that was not "some time ago", that was 20 years ago!!!! LOL!
Yes, mine shows some weird kind of green(ish) fluid coming out of the top and bottom of the controller chip.
And yes, it also works quite normally right after boot, but under load (zfs scrub) it dropped out after a while. As you can see, that happened at 5 AM, when I was certainly not up, so it must have been the scrub job.

ralphbsz · Oct 19, 2019

The chip very likely does have a sensor in it, but it probably requires specialized software to read.

Your two pictures in the original post are not visible. But the first text output already tells me that something very wrong is going on. I agree with the idea of replacing the card, based on the suspicion of a broken card. You could also try putting a cooling fan near it to massively increase the airflow and see whether it helps. There are some PCIe cards that need increased airflow.

Which brings me to an anecdote: In a job a while ago, we were using several very high-powered cards (SAS and Infiniband) in a server, and due to a variety of firmware incompatibilities and bugs, the fans were turned to low speed. One SAS card failed, and we noticed a bad smell in the server room. When reviewing the logs, we found that the card had reported a chip temperature of 106 degrees C, and then was never heard from again. The PC board around the chip had turned brown.

toorski · Oct 19, 2019

gustopn said:
Yes, mine shows some weird kind of green(ish) fluid coming out of the top and bottom of the controller chip.

ralphbsz said:
One SAS card failed, and we noticed a bad smell in the server room. When reviewing the logs, we found that the card had reported a chip temperature of 106 degrees C, and then was never heard from again. The PC board around the chip had turned brown.

LMFAO

gustopn · Oct 19, 2019

ralphbsz said:
The chip very likely does have a sensor in it, but it probably requires specialized software to read.

Your two pictures in the original post are not visible. But the first text output already tells me that something very wrong is going on. I agree with the idea of replacing the card, based on the suspicion of a broken card. You could also try putting a cooling fan near it to massively increase the airflow and see whether it helps. There are some PCIe cards that need increased airflow.

Which brings me to an anecdote: In a job a while ago, we were using several very high-powered cards (SAS and Infiniband) in a server, and due to a variety of firmware incompatibilities and bugs, the fans were turned to low speed. One SAS card failed, and we noticed a bad smell in the server room. When reviewing the logs, we found that the card had reported a chip temperature of 106 degrees C, and then was never heard from again. The PC board around the chip had turned brown.

Now they should be visible.
Here is my theory again, shown graphically:

One can see in the "all_ahcich.txt" attachment that the problems started suddenly on Sep 10 03:23:34.
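
For anyone who wants to do the same kind of check on their own logs, here is a minimal Python sketch (not the tool used to produce the attachment) that scans syslog-style kernel lines and reports, per `ahcichN` channel, when the first timeout-like message appeared and how many there were. The function name `summarize_timeouts`, the regular expression, and the default year are my own assumptions. Syslog timestamps carry no year, so it has to be supplied.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches kernel lines such as:
#   Oct 18 05:15:41 constance kernel: ahcich7: Poll timeout on slot 7 port 0
#   Oct 18 05:14:54 constance kernel: (ada5:ahcich7:0:0:0): CAM status: Command timeout
LINE_RE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2}\s+\d{1,2}\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:"
    r".*?\b(?P<chan>ahcich\d+)\b.*?(?:[Tt]imeout|not ready)"
)


def summarize_timeouts(lines, year=2019):
    """Return {channel: (first_timestamp, count)} for timeout-like lines.

    `year` is an assumption: classic syslog timestamps do not include one.
    """
    first_seen = {}
    counts = defaultdict(int)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a timeout / "device not ready" message
        stamp = datetime.strptime(
            f"{year} {match['ts']}", "%Y %b %d %H:%M:%S"
        )
        chan = match["chan"]
        counts[chan] += 1
        if chan not in first_seen or stamp < first_seen[chan]:
            first_seen[chan] = stamp
    return {chan: (first_seen[chan], counts[chan]) for chan in counts}


if __name__ == "__main__":
    # A few lines copied from the first post, used as sample input.
    sample = [
        "Oct 18 05:14:54 constance kernel: (ada5:ahcich7:0:0:0): CAM status: Command timeout",
        "Oct 18 05:14:54 constance kernel: (ada5:ahcich7:0:0:0): Error 5, Periph was invalidated",
        "Oct 18 05:15:41 constance kernel: ahcich7: AHCI reset: device not ready after 31000ms (tfd = 00000080)",
        "Oct 18 05:15:41 constance kernel: ahcich7: Poll timeout on slot 7 port 0",
    ]
    for chan, (first, count) in sorted(summarize_timeouts(sample).items()):
        print(chan, first.isoformat(sep=" "), count)
```

If all affected channels belong to the same controller and share roughly the same first timestamp, that points at the card rather than at a single disk.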

gustopn · Oct 19, 2019

I just came across this one

D-FENS · Oct 19, 2019

Did you check out this forum thread: https://forums.freebsd.org/threads/kernel-ahcich-timeout-in-slot.51868/page-3#post-423317
I had similar error messages with my SATA drives and I found a workaround.

gustopn · Oct 19, 2019

roccobaroccoSC said:
Did you check out this forum thread: https://forums.freebsd.org/threads/kernel-ahcich-timeout-in-slot.51868/page-3#post-423317
I had similar error messages with my SATA drives and I found a workaround.

Yes, I looked at it, but now I have the ASMedia card in, and I have started a scrub on all ports (all drives), and there are no timeouts any more.
Of course, the workaround of limiting the speed to SATA2 and disabling MSI might have helped on my broken Marvell card too,
but that is no solution. I am now running with the ASMedia card and waiting for a Marvell replacement without RAID, which I ordered from Amazon (the old card was too old for an RMA).
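
For reference, the workaround mentioned above (slower SATA speed, MSI off) maps onto two loader tunables documented in ahci(4): `hint.ahci.X.msi` and `hint.ahcich.X.sata_rev`. The small Python sketch below only prints the lines one would add to /boot/loader.conf. The helper name `ahci_workaround_hints` is hypothetical, and the unit numbers in the example (controller 1, channel 7) are assumptions taken from the log above, so check your own dmesg before using them.

```python
def ahci_workaround_hints(controller_unit, channel_units, sata_rev=2):
    """Build loader tunable lines for the MSI / SATA speed workaround.

    controller_unit: N in ahciN (assumed; read it from dmesg)
    channel_units:   list of N in ahcichN (assumed; read them from dmesg)
    sata_rev:        1, 2 or 3 for 1.5, 3 or 6 Gbps, as described in ahci(4)
    """
    if sata_rev not in (1, 2, 3):
        raise ValueError("sata_rev must be 1, 2 or 3")
    # msi=0 disables Message Signaled Interrupts for that controller.
    lines = [f'hint.ahci.{controller_unit}.msi="0"']
    # sata_rev limits the maximum negotiated SATA revision per channel.
    for chan in channel_units:
        lines.append(f'hint.ahcich.{chan}.sata_rev="{sata_rev}"')
    return lines


if __name__ == "__main__":
    for line in ahci_workaround_hints(1, [7]):
        print(line)
```

The tunables are read at boot, so a reboot is needed after changing them. As said above, this would only mask the problem on a card that is physically damaged.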
