Solved: SATA controller going bad?

I'm running FreeBSD 10.1-RELEASE on an HP MicroServer N36L.
uname -a
Code:
FreeBSD microserver 10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 21:02:49 UTC 2014     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
The machine has a total of 12 hard disks, 4 of which are connected to the AMD SATA controller on the motherboard, and the other 8 to two external eSATA enclosures (4 disks in each) using a Silicon Image SATA controller.
dmesg
Code:
ahci0: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem 0xfe5ffc00-0xfe5fffff irq 19 at device 17.0 on pci0
ahci0: AHCI v1.20 with 4 3Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
[...]
siis0: <SiI3132 SATA controller> port 0xd800-0xd87f mem 0xfe8ffc00-0xfe8ffc7f,0xfe8f8000-0xfe8fbfff irq 18 at device 0.0 on pci2
siisch0: <SIIS channel> at channel 0 on siis0
siisch1: <SIIS channel> at channel 1 on siis0
camcontrol devlist
Code:
<SAMSUNG HD203WI 1AN10003>         at scbus0 target 0 lun 0 (pass0,ada0)
<SAMSUNG HD203WI 1AN10003>         at scbus0 target 1 lun 0 (pass1,ada1)
<SAMSUNG HD203WI 1AN10003>         at scbus0 target 2 lun 0 (pass2,ada2)
<Hitachi HDS5C3020BLE630 MZ4OAAB0>  at scbus0 target 4 lun 0 (pass3,ada3)
<Port Multiplier 37261095 1706>    at scbus0 target 15 lun 0 (pass4,pmp0)
<SAMSUNG HD203WI 1AN10003>         at scbus1 target 0 lun 0 (pass5,ada4)
<SAMSUNG HD204UI 1AQ10001>         at scbus1 target 1 lun 0 (pass6,ada5)
<SAMSUNG HD203WI 1AN10003>         at scbus1 target 2 lun 0 (pass7,ada6)
<Hitachi HDS5C3020BLE630 MZ4OAAB0>  at scbus1 target 3 lun 0 (pass8,ada7)
<Port Multiplier 37261095 1706>    at scbus1 target 15 lun 0 (pass9,pmp1)
<SAMSUNG HD204UI 1AQ10001>         at scbus2 target 0 lun 0 (pass10,ada8)
<SAMSUNG HD204UI 1AQ10001>         at scbus3 target 0 lun 0 (pass11,ada9)
<SAMSUNG HD204UI 1AQ10001>         at scbus4 target 0 lun 0 (pass12,ada10)
<SAMSUNG HD204UI 1AQ10001>         at scbus5 target 0 lun 0 (pass13,ada11)
<OCZ RALLY2 8.07>                  at scbus8 target 0 lun 0 (pass14,da0)
The disks are set up in a ZFS pool consisting of 3 RAID-Z1 vdevs of 4 disks each.
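For context, a pool with this layout corresponds roughly to the following (a sketch only, not the original creation command; the labels match the zpool status output further down):
Code:
# Sketch: 3 x 4-disk RAID-Z1 vdevs using glabel(8) device names
zpool create backup \
    raidz1 label/disk1 label/disk2 label/disk3 label/disk4 \
    raidz1 label/disk5 label/disk6 label/disk7 label/disk12 \
    raidz1 label/disk8 label/disk9 label/disk10 label/disk11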

While performing a periodic scrub of the pool a few days ago, I started seeing multiple timeouts on ahcich1, ahcich2 and ahcich3.

dmesg
Code:
ahcich3: Timeout on slot 0 port 0
ahcich3: is 00000002 cs 00000000 ss 00000000 rs 00000001 tfd 50 serr 00000000 cmd 00006017
(aprobe2:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe2:ahcich3:0:0:0): CAM status: Command timeout
(aprobe2:ahcich3:0:0:0): Retrying command
ahcich1: Timeout on slot 8 port 0
ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00000100 tfd 50 serr 00000000 cmd 00006817
(aprobe1:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe1:ahcich1:0:0:0): CAM status: Command timeout
(aprobe1:ahcich1:0:0:0): Retrying command
ahcich2: Timeout on slot 28 port 0
ahcich2: is 00000002 cs 00000000 ss 00000000 rs 10000000 tfd 50 serr 00000000 cmd 00007c17
(aprobe0:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich2:0:0:0): CAM status: Command timeout
(aprobe0:ahcich2:0:0:0): Retrying command
Eventually, after enough timeouts, the disks were dropped.

camcontrol devlist
Code:
<SAMSUNG HD203WI 1AN10003>         at scbus0 target 0 lun 0 (pass0,ada0)
<SAMSUNG HD203WI 1AN10003>         at scbus0 target 1 lun 0 (pass1,ada1)
<SAMSUNG HD203WI 1AN10003>         at scbus0 target 2 lun 0 (pass2,ada2)
<Hitachi HDS5C3020BLE630 MZ4OAAB0>  at scbus0 target 4 lun 0 (pass3,ada3)
<Port Multiplier 37261095 1706>    at scbus0 target 15 lun 0 (pass4,pmp0)
<SAMSUNG HD203WI 1AN10003>         at scbus1 target 0 lun 0 (pass5,ada4)
<SAMSUNG HD204UI 1AQ10001>         at scbus1 target 1 lun 0 (pass6,ada5)
<SAMSUNG HD203WI 1AN10003>         at scbus1 target 2 lun 0 (pass7,ada6)
<Hitachi HDS5C3020BLE630 MZ4OAAB0>  at scbus1 target 3 lun 0 (pass8,ada7)
<Port Multiplier 37261095 1706>    at scbus1 target 15 lun 0 (pass9,pmp1)
<SAMSUNG HD204UI 1AQ10001>         at scbus2 target 0 lun 0 (pass10,ada8)
<OCZ RALLY2 8.07>                  at scbus8 target 0 lun 0 (pass14,da0)
zpool status
Code:
  pool: backup
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: scrub in progress since Fri Dec  5 03:02:35 2014
        2.77T scanned out of 18.9T at 51.6M/s, 90h54m to go
        0 repaired, 14.66% done
config:

        NAME              STATE     READ WRITE CKSUM
        backup            ONLINE       2     0     0
          raidz1-0        ONLINE      14    10     0
            label/disk1   ONLINE       0     0     0
            label/disk2   ONLINE       4     1     0
            label/disk3   ONLINE       4    10     0
            label/disk4   ONLINE       4     9     0
          raidz1-1        ONLINE       0     0     0
            label/disk5   ONLINE       0     0     0
            label/disk6   ONLINE       0     0     0
            label/disk7   ONLINE       0     0     0
            label/disk12  ONLINE       0     0     0
          raidz1-2        ONLINE       0     0     0
            label/disk8   ONLINE       0     0     0
            label/disk9   ONLINE       0     0     0
            label/disk10  ONLINE       0     0     0
            label/disk11  ONLINE       0     0     0

errors: 2 data errors, use '-v' for a list
At this point, I was unable to stop the scrub job since some of the disks had gone missing. After a reboot, the three previously missing devices were detected again, and I was able to stop the scrub.
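For reference, stopping the scrub once the devices were back was just a one-liner:
Code:
# Cancel the in-progress scrub on the pool
zpool scrub -s backup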

In order to narrow down the problem, I swapped the disks between the MicroServer's internal bays and one of the enclosures, and started a new scrub. The job failed again (around the 25% mark this time), when ahcich1, ahcich2 and ahcich3 started timing out, despite having different disks connected to them.

As none of the disks seem suspect according to smartctl, I'm inclined to believe that the on-board SATA controller has gone bad. Since I've only seen the timeouts under high load and at (seemingly) random times, I reckon the issue might be thermal-related.
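For the record, the SMART check was nothing fancy, roughly along these lines:
Code:
# Rough sketch: print SMART health and attributes for each ada disk
for d in /dev/ada[0-9] /dev/ada1[01]; do
    echo "=== ${d}"
    smartctl -H -A ${d}
done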

Does anyone have any insights into what might be going on, or what I might still try to fix the issue?

Re-flashing the controller firmware (i.e. the entire system BIOS) comes to mind, but I'm somewhat sceptical as to whether it'd have any effect. Unfortunately, I'm unable to swap the SATA breakout cable since its other end is soldered onto the backplane.
 
User23, that could be the case. With Samsung disks on an AMD SATA controller, I once had to disable AHCI. Also, the cables may be a cause of trouble.
 
If only the Samsung disks are failing at the onboard SATA ports, your problem is probably related to the Samsung disk firmware: http://knowledge.seagate.com/articles/en_US/FAQ/223631en

The thing is that the three drives that initially failed (or rather, seemed to fail) were HD204UIs (the KB link is for the HD203WI), whose firmware I already updated back in 2011 (https://forums.freebsd.org/threads/ahci-device-timeouts-while-performing-zfs-scrub.24189/). This is the first time I've experienced any trouble with them since then.

Secondly, it would seem that the HD203WIs are already running an updated firmware version (1AN10003) that isn't affected by the issue detailed in the KB article: http://fredsherbet.com/2014/05/upgr...rives-hd203wi-and-hd204ui/#comment-1682042232

Thanks for your input, though.
 
Also, the cables may be a cause of trouble.

Unfortunately, swapping the breakout cable might be a bit of a pain, since the connectors on the backplane-end aren't standard SATA connectors (http://forums.overclockers.com.au/showpost.php?p=13327990&postcount=1457). Thankfully, they aren't soldered on (as I previously thought), but I'll probably need to rummage around on eBay for a bit to find a spare one.

As another option, I thought about shelling out for a cheap SATA controller (à la http://eu.startech.com/Cards-Adapte...Controller-Card-Mini-SAS-SFF-8087~PEXSAT34SFF, which retails for about £50 on Amazon). Then again, it might be worth trying a new cable first, assuming I can source a replacement.
 
Can you try the following:

# echo 'hint.ahci.0.msi="0"' >> /boot/loader.conf
# shutdown -r now

Then scrub the pool again.
 
Can you try the following:

# echo 'hint.ahci.0.msi="0"' >> /boot/loader.conf
# shutdown -r now

Then scrub the pool again.

Thanks for the tip. I haven't had the time to try this out yet, but what would disabling MSI achieve?
 
Looks like disabling MSI worked on another HP Microserver

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195349

Well, that's interesting! The 10.1 upgrade is the only major change I've made to the system between the last monthly scrub and the timeouts.

The reason I asked what the setting affected was that I've not needed to use it before with the same hardware, and I'd rather not use a system setting to work around failing hardware. Of course, if – as now it seems – this is a genuine regression, then I don't have any misgivings about using it. :)
 
Right, I added
Code:
hint.ahci.0.msi="0"
to /boot/loader.conf, rebooted and started a new scrub.
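(To confirm the hint took effect, one can check which interrupt mode the controller ended up using; with MSI disabled, ahci0 should appear on the shared legacy IRQ, irq 19 on this board per the dmesg above, rather than on an MSI vector:)
Code:
# Check the interrupt assignment after the reboot
vmstat -i | grep ahci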

This time around, I began to see similar errors on the Silicon Image controller for all four disks (ada[0-3]) in one of the external SATA enclosures:

Code:
(ada2:siisch0:0:2:0): READ_FPDMA_QUEUED. ACB: 60 f8 68 6b 81 40 0a 00 00 00 00 00
(ada2:siisch0:0:2:0): CAM status: ATA Status Error
(ada2:siisch0:0:2:0): ATA status: 41 (DRDY ERR), error: 84 (ICRC ABRT )
(ada2:siisch0:0:2:0): RES: 41 84 00 00 00 40 00 00 00 00 00
(ada2:siisch0:0:2:0): Retrying command
[...]
(ada1:siisch0:0:1:0): READ_FPDMA_QUEUED. ACB: 60 b0 e0 c5 e5 40 0a 00 00 00 00 00
(ada1:siisch0:0:1:0): CAM status: ATA Status Error
(ada1:siisch0:0:1:0): ATA status: 41 (DRDY ERR), error: 84 (ICRC ABRT )
(ada1:siisch0:0:1:0): RES: 41 84 00 00 00 40 00 00 00 00 00
(ada1:siisch0:0:1:0): Retrying command
[...]
(ada0:siisch0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b0 70 8c a6 40 0b 00 00 00 00 00
(ada0:siisch0:0:0:0): CAM status: ATA Status Error
(ada0:siisch0:0:0:0): ATA status: 41 (DRDY ERR), error: 84 (ICRC ABRT )
(ada0:siisch0:0:0:0): RES: 41 84 00 00 00 40 00 00 00 00 00
(ada0:siisch0:0:0:0): Retrying command
[...]
(ada3:siisch0:0:4:0): READ_FPDMA_QUEUED. ACB: 60 00 08 a5 16 40 0d 00 00 01 00 00
(ada3:siisch0:0:4:0): CAM status: ATA Status Error
(ada3:siisch0:0:4:0): ATA status: 51 (DRDY SERV ERR), error: 84 (ICRC ABRT )
(ada3:siisch0:0:4:0): RES: 51 84 b7 a5 16 40 0d 00 00 51 00
...
siisch0: port is not ready (timeout 1000ms) status = 00002000
Eventually, the entire enclosure was disconnected. Note that the disks in question are not the same four (nor even all Samsungs) that originally timed out on the AMD controller.

Code:
pmp0 at siisch0 bus 0 scbus0 target 15 lun 0
pmp0: <Port Multiplier 37261095 1706> detached
ada0 at siisch0 bus 0 scbus0 target 0 lun 0
ada0: <SAMSUNG HD203WI 1AN10003> s/n XXXXXXXXXXXXXX detached
ada1 at siisch0 bus 0 scbus0 target 1 lun 0
ada1: <SAMSUNG HD203WI 1AN10003> s/n XXXXXXXXXXXXXX detached
ada2 at siisch0 bus 0 scbus0 target 2 lun 0
ada2: <SAMSUNG HD203WI 1AN10003> s/n XXXXXXXXXXXXXX detached
ada3 at siisch0 bus 0 scbus0 target 4 lun 0
ada3: <Hitachi HDS5C3020BLE630 MZ4OAAB0> s/n XXXXXXXXXXXXXX detached
Next, I tried disabling MSI interrupts for the SiI controller as well (as per the siis(4) man page): hint.siis.0.msi="0". This, however, had no effect: all four disks started spewing timeout errors during boot.
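At this point /boot/loader.conf contained both hints, i.e. something like:
Code:
# Disable MSI for both the AMD (ahci) and Silicon Image (siis) controllers
hint.ahci.0.msi="0"
hint.siis.0.msi="0"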

ZFS, on the other hand, was desperately trying to resilver the four failed disks, but the resilver hung after the disks were disconnected again.

zpool status
Code:
  pool: backup
state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Dec 13 22:50:18 2014
        5.87T scanned out of 18.9T at 2.22M/s, (scan is slow, no estimated time)
        3M resilvered, 31.07% done
config:

        NAME              STATE     READ WRITE CKSUM
        backup            ONLINE       0     0     0
          raidz1-0        ONLINE       0     0     0
            label/disk1   ONLINE       0     0     0
            label/disk2   ONLINE       0     0     0
            label/disk3   ONLINE       0     0     0
            label/disk4   ONLINE       0     0     0
          raidz1-1        ONLINE       0     0     0
            label/disk5   ONLINE       3     0     0  (resilvering)
            label/disk6   ONLINE       8     0     0  (resilvering)
            label/disk7   ONLINE       3     0     0  (resilvering)
            label/disk12  ONLINE       3     0     0  (resilvering)
          raidz1-2        ONLINE       0     0     0
            label/disk8   ONLINE       0     0     0
            label/disk9   ONLINE       0     0     0
            label/disk10  ONLINE       0     0     0
            label/disk11  ONLINE       0     0     0

errors: 536 data errors, use '-v' for a list
I was able to export the pool in order to pause the resilver, but any subsequent access to the pool (e.g. a zpool import) results in a torrent of timeout errors.
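That is, roughly:
Code:
# Export the pool to pause the hung resilver...
zpool export backup
# ...but re-importing it immediately brings the timeouts back
zpool import backup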

At this point it's starting to look like the pool is borked beyond repair. I guess I'm going to have to destroy it, check the disks using sysutils/smartmontools and restore everything from backups.
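The plan for checking the disks is more or less the following (sketch):
Code:
# Install smartmontools and kick off long self-tests on each disk
pkg install smartmontools
smartctl -t long /dev/ada0    # repeat for the remaining disks
# ...once the tests finish (several hours), review the results:
smartctl -l selftest /dev/ada0
smartctl -A /dev/ada0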
 
How are things physically/logically cabled together? It appears you are using port multipliers in the external enclosures. Are those cabled up 1 SATA lane per drive slot? Or do all 4 drives share a single SATA lane? IOW, is there a single SATA cable connecting the enclosure to the controller, 4 separate SATA cables, or a single 8087 SFF cable?

IIRC, Silicon Image (SiI) controllers don't play nicely with SATA port multipliers where multiple drives share a single SATA lane. And the ATA subsystem can't handle the situation where a single drive behind the port multiplier has issues (it can't send reset requests to a single drive, only to the port multiplier).

If you remove the drives from the enclosures and plug them all directly into the SATA controllers (1 port / 1 cable per drive), do the problems disappear?
 
How are things physically/logically cabled together? It appears you are using port multipliers in the external enclosures. Are those cabled up 1 SATA lane per drive slot? Or do all 4 drives share a single SATA lane? IOW, is there a single SATA cable connecting the enclosure to the controller, 4 separate SATA cables, or a single 8087 SFF cable?

IIRC, Silicon Image (SiI) controllers don't play nicely with SATA port multipliers where multiple drives share a single SATA lane. And the ATA subsystem can't handle the situation where a single drive behind the port multiplier has issues (it can't send reset requests to a single drive, only to the port multiplier).

If you remove the drives from the enclosures and plug them all directly into the SATA controllers (1 port / 1 cable per drive), do the problems disappear?

The enclosures were connected to the controller using a single eSATA cable per enclosure, viz. all the drives in an enclosure were sharing a single SATA lane.

I've since migrated the pool onto new hardware, which has a discrete SATA port for each drive, and it turned out that one of the drives had started failing. After replacing it everything is working fine again. So your diagnosis was spot on, thanks!
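For completeness, swapping out the failing drive boiled down to a zpool replace; the labels below are placeholders, not the real device names:
Code:
# Replace the failed member with the freshly labelled new disk
# (label/diskN and label/diskN_new are hypothetical placeholders)
zpool replace backup label/diskN label/diskN_new
zpool status backup    # watch the resilver progress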
 