I'm running FreeBSD 10.1-RELEASE on an HP MicroServer N36L.
The machine has a total of 12 hard disks: 4 are connected to the AMD SATA controller on the motherboard, and the other 8 sit in two external eSATA enclosures (4 disks in each, behind a port multiplier) attached to a Silicon Image SiI3132 SATA controller.
The disks are set up in a ZFS pool consisting of 3 RAID-Z1 vdevs of 4 disks each.
While performing a periodic scrub of the pool a few days ago, I started seeing multiple timeouts on ahcich1, ahcich2 and ahcich3.
Eventually, after enough timeouts, the three affected disks were dropped.
At this point, I was unable to stop the scrub job since some of the disks had gone missing. After a reboot, the three previously missing devices were detected again, and I was able to stop the scrub.
In order to narrow down the problem, I moved all the disks from the MicroServer to one of the enclosures and vice versa, and started a new scrub. The job failed again (around the 25% mark this time), when ahcich1, ahcich2 and ahcich3 started timing out, despite having different disks connected to them.
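To check that the timeouts really follow the channels rather than the disks, the "Timeout on slot" messages can be tallied per channel. Here is a minimal sketch that reads kernel log text on stdin; count_timeouts is just a name I picked, and the pattern assumes the message format shown in my dmesg output.

```sh
#!/bin/sh
# count_timeouts: read kernel log text on stdin and print
# "<channel> <number of timeout messages>" for each AHCI channel.
# Assumes lines of the form "ahcichN: Timeout on slot S port P".
count_timeouts() {
    sed -n 's/^\(ahcich[0-9][0-9]*\): Timeout on slot.*/\1/p' |
        LC_ALL=C sort |
        uniq -c |
        awk '{ print $2, $1 }'
}
```

It could be run as `dmesg | count_timeouts` before and after swapping the disks; if the same channels show up both times, the disks themselves become less likely culprits.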
As none of the disks seem suspect according to smartctl, I'm inclined to believe that the on-board SATA controller has gone bad. Since I've only seen the timeouts under high load and at (seemingly) random times, I reckon the issue might be thermal-related.
Does anyone have any insights into what might be going on, or what I might still try to fix the issue?
Re-flashing the controller firmware (i.e. the entire system BIOS) comes to mind, but I'm somewhat sceptical as to whether it'd have any effect. Unfortunately, I'm unable to swap the SATA breakout cable since its other end is soldered onto the backplane.
uname -a
Code:
FreeBSD microserver 10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 21:02:49 UTC 2014 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
dmesg
Code:
ahci0: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem 0xfe5ffc00-0xfe5fffff irq 19 at device 17.0 on pci0
ahci0: AHCI v1.20 with 4 3Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
[...]
siis0: <SiI3132 SATA controller> port 0xd800-0xd87f mem 0xfe8ffc00-0xfe8ffc7f,0xfe8f8000-0xfe8fbfff irq 18 at device 0.0 on pci2
siisch0: <SIIS channel> at channel 0 on siis0
siisch1: <SIIS channel> at channel 1 on siis0
camcontrol devlist
Code:
<SAMSUNG HD203WI 1AN10003> at scbus0 target 0 lun 0 (pass0,ada0)
<SAMSUNG HD203WI 1AN10003> at scbus0 target 1 lun 0 (pass1,ada1)
<SAMSUNG HD203WI 1AN10003> at scbus0 target 2 lun 0 (pass2,ada2)
<Hitachi HDS5C3020BLE630 MZ4OAAB0> at scbus0 target 4 lun 0 (pass3,ada3)
<Port Multiplier 37261095 1706> at scbus0 target 15 lun 0 (pass4,pmp0)
<SAMSUNG HD203WI 1AN10003> at scbus1 target 0 lun 0 (pass5,ada4)
<SAMSUNG HD204UI 1AQ10001> at scbus1 target 1 lun 0 (pass6,ada5)
<SAMSUNG HD203WI 1AN10003> at scbus1 target 2 lun 0 (pass7,ada6)
<Hitachi HDS5C3020BLE630 MZ4OAAB0> at scbus1 target 3 lun 0 (pass8,ada7)
<Port Multiplier 37261095 1706> at scbus1 target 15 lun 0 (pass9,pmp1)
<SAMSUNG HD204UI 1AQ10001> at scbus2 target 0 lun 0 (pass10,ada8)
<SAMSUNG HD204UI 1AQ10001> at scbus3 target 0 lun 0 (pass11,ada9)
<SAMSUNG HD204UI 1AQ10001> at scbus4 target 0 lun 0 (pass12,ada10)
<SAMSUNG HD204UI 1AQ10001> at scbus5 target 0 lun 0 (pass13,ada11)
<OCZ RALLY2 8.07> at scbus8 target 0 lun 0 (pass14,da0)
dmesg (during the scrub)
Code:
ahcich3: Timeout on slot 0 port 0
ahcich3: is 00000002 cs 00000000 ss 00000000 rs 00000001 tfd 50 serr 00000000 cmd 00006017
(aprobe2:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe2:ahcich3:0:0:0): CAM status: Command timeout
(aprobe2:ahcich3:0:0:0): Retrying command
ahcich1: Timeout on slot 8 port 0
ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00000100 tfd 50 serr 00000000 cmd 00006817
(aprobe1:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe1:ahcich1:0:0:0): CAM status: Command timeout
(aprobe1:ahcich1:0:0:0): Retrying command
ahcich2: Timeout on slot 28 port 0
ahcich2: is 00000002 cs 00000000 ss 00000000 rs 10000000 tfd 50 serr 00000000 cmd 00007c17
(aprobe0:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich2:0:0:0): CAM status: Command timeout
(aprobe0:ahcich2:0:0:0): Retrying command
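As a sanity check on the log above, the slot in each "Timeout on slot N" line can be compared against the register dump that follows it. The sketch below assumes that rs is the driver's bitmask of running slots and that bits 12:8 of cmd hold the controller's current command slot (the CCS field of the AHCI PxCMD register). That is my reading of the output, not something I have confirmed, and decode_rs and decode_ccs are hypothetical helper names.

```sh
#!/bin/sh
# Assumption: "rs" is a 32-bit mask with one bit per running command slot.
# decode_rs HEX: print the slot numbers whose bits are set in HEX.
decode_rs() {
    mask=$((0x$1))
    slot=0
    out=""
    while [ "$slot" -lt 32 ]; do
        if [ $(( (mask >> slot) & 1 )) -eq 1 ]; then
            out="${out:+$out }$slot"
        fi
        slot=$((slot + 1))
    done
    echo "$out"
}

# Assumption: bits 12:8 of "cmd" are the current command slot.
# decode_ccs HEX: print that 5-bit field as a decimal slot number.
decode_ccs() {
    echo $(( (0x$1 >> 8) & 0x1f ))
}
```

For the three samples above, `decode_rs 00000100` and `decode_ccs 00006817` both give 8, and the other two pairs give 0 and 28, matching the slot named in each timeout message. So the dumps at least look internally consistent.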
camcontrol devlist (after the disks were dropped)
Code:
<SAMSUNG HD203WI 1AN10003> at scbus0 target 0 lun 0 (pass0,ada0)
<SAMSUNG HD203WI 1AN10003> at scbus0 target 1 lun 0 (pass1,ada1)
<SAMSUNG HD203WI 1AN10003> at scbus0 target 2 lun 0 (pass2,ada2)
<Hitachi HDS5C3020BLE630 MZ4OAAB0> at scbus0 target 4 lun 0 (pass3,ada3)
<Port Multiplier 37261095 1706> at scbus0 target 15 lun 0 (pass4,pmp0)
<SAMSUNG HD203WI 1AN10003> at scbus1 target 0 lun 0 (pass5,ada4)
<SAMSUNG HD204UI 1AQ10001> at scbus1 target 1 lun 0 (pass6,ada5)
<SAMSUNG HD203WI 1AN10003> at scbus1 target 2 lun 0 (pass7,ada6)
<Hitachi HDS5C3020BLE630 MZ4OAAB0> at scbus1 target 3 lun 0 (pass8,ada7)
<Port Multiplier 37261095 1706> at scbus1 target 15 lun 0 (pass9,pmp1)
<SAMSUNG HD204UI 1AQ10001> at scbus2 target 0 lun 0 (pass10,ada8)
<OCZ RALLY2 8.07> at scbus8 target 0 lun 0 (pass14,da0)
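Spotting the missing devices by eye is error-prone with this many disks, so the two camcontrol devlist captures can also be compared mechanically. This is a rough sketch assuming each listing has been saved to a file; the function names and the two file arguments are my own invention.

```sh
#!/bin/sh
# devnames FILE: print the non-pass device names (ada0, da0, pmp0, ...)
# taken from the trailing "(passN,name)" field of each devlist line.
devnames() {
    sed -n 's/.*(\([^)]*\))$/\1/p' "$1" |
        tr ',' '\n' |
        grep -v '^pass' |
        LC_ALL=C sort
}

# missing_devices BEFORE AFTER: print names present in BEFORE but not in AFTER.
missing_devices() {
    before=$(mktemp)
    after=$(mktemp)
    devnames "$1" > "$before"
    devnames "$2" > "$after"
    LC_ALL=C comm -23 "$before" "$after"
    rm -f "$before" "$after"
}
```

With the first listing saved as, say, devlist.before and this one as devlist.after, `missing_devices devlist.before devlist.after` prints the devices that vanished from the bus.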
zpool status
Code:
  pool: backup
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: scrub in progress since Fri Dec 5 03:02:35 2014
        2.77T scanned out of 18.9T at 51.6M/s, 90h54m to go
        0 repaired, 14.66% done
config:

        NAME              STATE     READ WRITE CKSUM
        backup            ONLINE       2     0     0
          raidz1-0        ONLINE      14    10     0
            label/disk1   ONLINE       0     0     0
            label/disk2   ONLINE       4     1     0
            label/disk3   ONLINE       4    10     0
            label/disk4   ONLINE       4     9     0
          raidz1-1        ONLINE       0     0     0
            label/disk5   ONLINE       0     0     0
            label/disk6   ONLINE       0     0     0
            label/disk7   ONLINE       0     0     0
            label/disk12  ONLINE       0     0     0
          raidz1-2        ONLINE       0     0     0
            label/disk8   ONLINE       0     0     0
            label/disk9   ONLINE       0     0     0
            label/disk10  ONLINE       0     0     0
            label/disk11  ONLINE       0     0     0

errors: 2 data errors, use '-v' for a list
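To pull out only the entries with non-zero counters from a listing like the one above, something along these lines might help. It is a small awk sketch that reads zpool status output on stdin; zpool_errors is a made-up name, and it assumes the NAME/STATE/READ/WRITE/CKSUM column layout shown here.

```sh
#!/bin/sh
# zpool_errors: print "name read write cksum" for every row of the config
# table whose READ, WRITE or CKSUM counter is non-zero.
zpool_errors() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The "errors:" summary line marks its end.
        /^errors:/ { in_config = 0 }
        in_config && NF >= 5 && ($3 + $4 + $5) > 0 { print $1, $3, $4, $5 }
    '
}
```

It would be used as `zpool status backup | zpool_errors`. For the output above it lists the pool itself, raidz1-0, and label/disk2 through label/disk4, i.e. only the vdev whose disks were dropped.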