Hi,
I'm running 8.2-RELEASE with the ahci driver loaded at boot time on an HP MicroServer. Four Samsung HD204UI drives are attached via an ATI IXP700 SATA controller and combined into a simple (striped) ZFS pool:
Code:
FreeBSD microserver 8.2-RELEASE FreeBSD 8.2-RELEASE #0: Thu Feb 17 02:41:51 UTC 2011
root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
Code:
ahci_load="YES"
Code:
ahci0: <ATI IXP700 AHCI SATA controller> port 0xd000-0xd007,0xc000-0xc003,0xb000-0xb007,0xa000-0xa003,0x9000-0x900f mem 0xfe6ffc00-0xfe6fffff irq 19 at device 17.0 on pci0
ahci0: [ITHREAD]
ahci0: AHCI v1.20 with 4 3Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich0: [ITHREAD]
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich1: [ITHREAD]
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich2: [ITHREAD]
ahcich3: <AHCI channel> at channel 3 on ahci0
ahcich3: [ITHREAD]
Code:
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
ada2: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
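The pool sits on glabel(8) labels. For reference, it was created more or less like this (reconstructed from memory, so the exact mapping of adaN to labels may not be accurate):
Code:
# glabel label disk1 ada0
# glabel label disk2 ada1
# glabel label disk3 ada2
# glabel label disk4 ada3
# zpool create backup label/disk1 label/disk2 label/disk3 label/disk4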
If I attempt to scrub the pool, at some point one of the disks times out (this has happened twice so far, once on ahcich0/ada0 and once on ahcich1/ada1), resulting in a flood of messages like the following in /var/log/messages:
Code:
Jun 1 22:48:59 microserver kernel: ahcich1: Timeout on slot 1
Jun 1 22:48:59 microserver kernel: ahcich1: is 00000000 cs 000007f8 ss 000007fe rs 000007fe tfd 40 serr 00000000
Jun 1 22:48:59 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun 1 22:49:45 microserver kernel: ahcich1: Timeout on slot 10
Jun 1 22:49:45 microserver kernel: ahcich1: is 00000000 cs 00000400 ss 00000000 rs 00000400 tfd 80 serr 00000000
Jun 1 22:49:45 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun 1 22:50:31 microserver kernel: ahcich1: Timeout on slot 10
Jun 1 22:50:31 microserver kernel: ahcich1: is 00000000 cs 00000400 ss 00000000 rs 00000400 tfd 80 serr 00000000
Jun 1 22:50:31 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun 1 22:50:31 microserver kernel: (ada1:ahcich1:0:0:0): lost device
Jun 1 22:51:34 microserver kernel: ahcich1: Timeout on slot 10
Jun 1 22:51:34 microserver kernel: ahcich1: is 00000000 cs 000ffc00 ss 000ffc00 rs 000ffc00 tfd 80 serr 00000000
Jun 1 22:51:34 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun 1 22:51:34 microserver kernel: ahcich1: Poll timeout on slot 19
Jun 1 22:51:34 microserver kernel: ahcich1: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 80 serr 00000000
Jun 1 22:52:36 microserver kernel: ahcich1: Timeout on slot 19
Jun 1 22:52:36 microserver kernel: ahcich1: is 00000000 cs 1ff80000 ss 1ff80000 rs 1ff80000 tfd 80 serr 00000000
Jun 1 22:52:36 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun 1 22:52:36 microserver kernel: ahcich1: Poll timeout on slot 28
Jun 1 22:52:36 microserver kernel: ahcich1: is 00000000 cs 10000000 ss 00000000 rs 10000000 tfd 80 serr 00000000
Jun 1 22:53:38 microserver kernel: ahcich1: Timeout on slot 28
Jun 1 22:53:38 microserver kernel: ahcich1: is 00000000 cs f000003f ss f000003f rs f000003f tfd 80 serr 00000000
Jun 1 22:53:38 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun 1 22:53:38 microserver kernel: ahcich1: Poll timeout on slot 5
Jun 1 22:53:38 microserver kernel: ahcich1: is 00000000 cs 00000020 ss 00000000 rs 00000020 tfd 80 serr 00000000
Jun 1 22:54:41 microserver kernel: ahcich1: Timeout on slot 5
Jun 1 22:54:41 microserver kernel: ahcich1: is 00000000 cs 00007fe0 ss 00007fe0 rs 00007fe0 tfd 80 serr 00000000
Jun 1 22:54:41 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun 1 22:54:41 microserver kernel: ahcich1: Poll timeout on slot 14
Jun 1 22:54:41 microserver kernel: ahcich1: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd 80 serr 00000000
Jun 1 22:54:41 microserver root: ZFS: vdev I/O failure, zpool=backup path=/dev/label/disk2 offset=270336 size=8192 error=6
Jun 1 22:54:41 microserver root: ZFS: vdev I/O failure, zpool=backup path=/dev/label/disk2 offset=2000398327808 size=8192 error=6
Jun 1 22:54:41 microserver root: ZFS: vdev I/O failure, zpool=backup path=/dev/label/disk2 offset=2000398589952 size=8192 error=6
Judging from camcontrol output, after the timeouts the offending disk is taken offline:
# camcontrol devlist
Code:
<SAMSUNG HD204UI 1AQ10001> at scbus0 target 0 lun 0 (ada0,pass0)
<SAMSUNG HD204UI 1AQ10001> at scbus2 target 0 lun 0 (ada2,pass2)
<SAMSUNG HD204UI 1AQ10001> at scbus3 target 0 lun 0 (ada3,pass3)
This, in turn, results in the scrub job hanging indefinitely:
# zpool status
Code:
pool: backup
state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: http://www.sun.com/msg/ZFS-8000-HC
scrub: scrub in progress for 13h5m, 16.73% done, 65h7m to go
config:
        NAME           STATE     READ WRITE CKSUM
        backup         ONLINE      40     0     0
          label/disk1  ONLINE       0     0     0
          label/disk2  ONLINE      83     0     0
          label/disk3  ONLINE       0     0     0
          label/disk4  ONLINE       0     0     0
errors: 1 data errors, use '-v' for a list
Furthermore, I'm unable to abort the scrub:
# zpool scrub -s backup
Code:
cannot scrub backup: pool I/O is currently suspended
After a hard reboot, all the disks came back online, and smartctl reports that they are all in good health:
# smartctl -H /dev/ada1
Code:
smartctl 5.40 2010-10-16 r3189 [FreeBSD 8.2-RELEASE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
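To rule out the disks themselves, I'm also planning to run a long self-test on each drive (shown here for ada1; the same would apply to the others):
Code:
# smartctl -t long /dev/ada1      # start an offline long self-test
# smartctl -l selftest /dev/ada1  # check the result once it completes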
Unfortunately, since the scrub resumes after the reboot, it will eventually hang again at some point. Because I've only encountered this problem while scrubbing, my theory is that the 5400 RPM drives buckle under the sustained I/O load, causing the SATA controller or the ahci driver to conclude that the drive is not responding.
My first question: how can I stop the scrub job? Should I reboot again and try to cancel the scrub with zpool before a timeout occurs?
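In other words, is the right approach something along these lines, issued quickly after the reboot before the I/O load ramps up again?
Code:
# zpool scrub -s backup   # cancel the resumed scrub
# zpool status            # confirm it is no longer running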
Secondly, how can I prevent these timeouts from recurring? Are there any ahci driver parameters or ZFS kernel tunables I could try?
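For example, would raising the command timeout or limiting the number of outstanding I/Os per disk help? I'm thinking of trying something like the following in /boot/loader.conf, though these values are guesses on my part and I'm not sure they are even the right knobs:
Code:
# Candidate tunables -- guesses, not verified:
# Raise the per-command timeout from the 30 s default:
kern.cam.ada.default_timeout="60"
# Queue fewer concurrent I/Os per vdev so the drives are not saturated during the scrub:
vfs.zfs.vdev.max_pending="10"
# Last resort: force the flaky channel down to SATA 1.5Gbps:
#hint.ahcich.1.sata_rev="1"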