AHCI device timeouts while performing ZFS scrub

pva · Jun 2, 2011

Hi,

I'm running 8.2-RELEASE with the ahci driver loaded at boot time on an HP MicroServer with four Samsung HD204UI drives attached to a simple (striped) ZFS pool via an ATI IXP700 SATA controller:

Code:

FreeBSD microserver 8.2-RELEASE FreeBSD 8.2-RELEASE #0: Thu Feb 17 02:41:51 UTC 2011 
    root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64

Code:

ahci_load="YES"

Code:

ahci0: <ATI IXP700 AHCI SATA controller> port 0xd000-0xd007,0xc000-0xc003,0xb000-0xb007,0xa000-0xa003,0x9000-0x900f mem 0xfe6ffc00-0xfe6fffff irq 19 at device 17.0 on pci0
ahci0: [ITHREAD]
ahci0: AHCI v1.20 with 4 3Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich0: [ITHREAD]
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich1: [ITHREAD]
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich2: [ITHREAD]
ahcich3: <AHCI channel> at channel 3 on ahci0
ahcich3: [ITHREAD]

Code:

ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
ada2: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)

If I attempt to scrub the pool, at some point one of the disks will time out (this has happened twice, once on ahcich0/ada0 and once on ahcich1/ada1), resulting in a boatload of the following kind of messages in /var/log/messages:

Code:

Jun  1 22:48:59 microserver kernel: ahcich1: Timeout on slot 1
Jun  1 22:48:59 microserver kernel: ahcich1: is 00000000 cs 000007f8 ss 000007fe rs 000007fe tfd 40 serr 00000000
Jun  1 22:48:59 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun  1 22:49:45 microserver kernel: ahcich1: Timeout on slot 10
Jun  1 22:49:45 microserver kernel: ahcich1: is 00000000 cs 00000400 ss 00000000 rs 00000400 tfd 80 serr 00000000
Jun  1 22:49:45 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun  1 22:50:31 microserver kernel: ahcich1: Timeout on slot 10
Jun  1 22:50:31 microserver kernel: ahcich1: is 00000000 cs 00000400 ss 00000000 rs 00000400 tfd 80 serr 00000000
Jun  1 22:50:31 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun  1 22:50:31 microserver kernel: (ada1:ahcich1:0:0:0): lost device
Jun  1 22:51:34 microserver kernel: ahcich1: Timeout on slot 10
Jun  1 22:51:34 microserver kernel: ahcich1: is 00000000 cs 000ffc00 ss 000ffc00 rs 000ffc00 tfd 80 serr 00000000
Jun  1 22:51:34 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun  1 22:51:34 microserver kernel: ahcich1: Poll timeout on slot 19
Jun  1 22:51:34 microserver kernel: ahcich1: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 80 serr 00000000
Jun  1 22:52:36 microserver kernel: ahcich1: Timeout on slot 19
Jun  1 22:52:36 microserver kernel: ahcich1: is 00000000 cs 1ff80000 ss 1ff80000 rs 1ff80000 tfd 80 serr 00000000
Jun  1 22:52:36 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun  1 22:52:36 microserver kernel: ahcich1: Poll timeout on slot 28
Jun  1 22:52:36 microserver kernel: ahcich1: is 00000000 cs 10000000 ss 00000000 rs 10000000 tfd 80 serr 00000000
Jun  1 22:53:38 microserver kernel: ahcich1: Timeout on slot 28
Jun  1 22:53:38 microserver kernel: ahcich1: is 00000000 cs f000003f ss f000003f rs f000003f tfd 80 serr 00000000
Jun  1 22:53:38 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun  1 22:53:38 microserver kernel: ahcich1: Poll timeout on slot 5
Jun  1 22:53:38 microserver kernel: ahcich1: is 00000000 cs 00000020 ss 00000000 rs 00000020 tfd 80 serr 00000000
Jun  1 22:54:41 microserver kernel: ahcich1: Timeout on slot 5
Jun  1 22:54:41 microserver kernel: ahcich1: is 00000000 cs 00007fe0 ss 00007fe0 rs 00007fe0 tfd 80 serr 00000000
Jun  1 22:54:41 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun  1 22:54:41 microserver kernel: ahcich1: Poll timeout on slot 14
Jun  1 22:54:41 microserver kernel: ahcich1: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd 80 serr 00000000
Jun  1 22:54:41 microserver root: ZFS: vdev I/O failure, zpool=backup path=/dev/label/disk2 offset=270336 size=8192 error=6
Jun  1 22:54:41 microserver root: ZFS: vdev I/O failure, zpool=backup path=/dev/label/disk2 offset=2000398327808 size=8192 error=6
Jun  1 22:54:41 microserver root: ZFS: vdev I/O failure, zpool=backup path=/dev/label/disk2 offset=2000398589952 size=8192 error=6

Judging from camcontrol output, after the timeouts the offending disk is taken offline:
# camcontrol devlist

Code:

<SAMSUNG HD204UI 1AQ10001>         at scbus0 target 0 lun 0 (ada0,pass0)
<SAMSUNG HD204UI 1AQ10001>         at scbus2 target 0 lun 0 (ada2,pass2)
<SAMSUNG HD204UI 1AQ10001>         at scbus3 target 0 lun 0 (ada3,pass3)

This, in turn, results in the scrub job hanging indefinitely:

# zpool status

Code:

  pool: backup
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 13h5m, 16.73% done, 65h7m to go
config:

	NAME           STATE     READ WRITE CKSUM
	backup         ONLINE      40     0     0
	  label/disk1  ONLINE       0     0     0
	  label/disk2  ONLINE      83     0     0
	  label/disk3  ONLINE       0     0     0
	  label/disk4  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

Furthermore, I'm unable to abort the scrub:

# zpool scrub -s backup

Code:

cannot scrub backup: pool I/O is currently suspended

After performing a hard reboot, all the disks came back online again. Furthermore, smartctl reports that they are all in good health:
# smartctl -H /dev/ada1

Code:

smartctl 5.40 2010-10-16 r3189 [FreeBSD 8.2-RELEASE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Unfortunately, since the scrub job is resumed after the boot, it will eventually hang again at some point.

Because I've only encountered this problem while performing a ZFS scrub, my theory is that the 5400 RPM drives buckle under the high IO load, causing the SATA controller or the ahci driver to think that the drive is not responding.

My first question is: How can I stop the scrub job? Should I reboot again and try to stop the scrub using zpool before a timeout occurs?

Secondly, how should I attempt to prevent these timeouts from occurring in the future? Are there any ahci driver parameters or ZFS kernel tunables I could try?

AndyUKG · Jun 2, 2011

Hi,

you can stop a scrub with:
# zpool scrub -s poolname

WRT the problem, I doubt it's a case of the disks not handling it because they are 5400rpm or consumer grade. I'd guess it's most likely a hardware issue or some issue with the AHCI driver and your specific controller. Have you checked if there is a firmware upgrade available for the SATA controller? Also, when the disk disappears from the server (i.e. not visible from camcontrol), is it always the same disk? If yes, you can try replacing that disk...

cheers Andy.

pva · Jun 2, 2011

Hello Andy,

and thanks for your reply!

AndyUKG said:
you can stop a scrub with:
# zpool scrub -s poolname

As I wrote above, I already tried this, but the command failed because "pool I/O is currently suspended". I was, however, able to stop the scrub after rebooting again and immediately issuing the stop command before the job had had a chance to get hung again.

Code:

  pool: backup
 state: ONLINE
 scrub: scrub stopped after 0h2m with 0 errors on Thu Jun  2 16:24:30 2011
config:

	NAME           STATE     READ WRITE CKSUM
	backup         ONLINE       0     0     0
	  label/disk1  ONLINE       0     0     0
	  label/disk2  ONLINE       0     0     0
	  label/disk3  ONLINE       0     0     0
	  label/disk4  ONLINE       0     0     0

errors: No known data errors

Code:

<SAMSUNG HD204UI 1AQ10001>         at scbus0 target 0 lun 0 (ada0,pass0)
<SAMSUNG HD204UI 1AQ10001>         at scbus1 target 0 lun 0 (ada1,pass1)
<SAMSUNG HD204UI 1AQ10001>         at scbus2 target 0 lun 0 (ada2,pass2)
<SAMSUNG HD204UI 1AQ10001>         at scbus3 target 0 lun 0 (ada3,pass3)

AndyUKG said:
WRT the problem, I doubt it's a case of the disks not handling it because they are 5400rpm or consumer grade. I'd guess it's most likely a hardware issue or some issue with the AHCI driver and your specific controller. Have you checked if there is a firmware upgrade available for the SATA controller? Also, when the disk disappears from the server (i.e. not visible from camcontrol), is it always the same disk? If yes, you can try replacing that disk...

I've already upgraded the server's BIOS to the latest available version (there's no separate firmware update available for the SATA controller).

I also don't think the disk is faulty, because I've seen these timeouts first on ahcich0/ada0 and after rebooting for the first time on ahcich1/ada1.

I wonder whether anybody else has run into similar issues with the IXP700 controller and the ahci driver? Can any further information be gleaned from the timeout error messages as to whether the problem might be related to driver/hardware compatibility?
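
For what it's worth, the hex fields in those messages can at least be decoded mechanically. Below is a minimal Python sketch. It assumes (from my reading of the driver, not from anything confirmed in this thread) that cs, ss and rs are 32-bit masks with one bit per command slot, and that the low byte of tfd is the ATA status register. The function names are made up for illustration.

```python
# Standard ATA status register bits (assumed to be the low byte of tfd).
ATA_STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}


def slots(mask_hex):
    """Return the command slot numbers whose bits are set in a 32-bit hex mask."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(32) if mask & (1 << bit)]


def status_flags(tfd_hex):
    """Name the ATA status bits set in the low byte of a tfd value."""
    status = int(tfd_hex, 16) & 0xFF
    return [name for bit, name in ATA_STATUS_BITS.items() if status & bit]


# Values copied from the first two timeout messages in my log.
print(slots("000007f8"))    # cs at 22:48:59
print(slots("00000400"))    # cs at 22:49:45
print(status_flags("80"))   # tfd at 22:49:45
```

For the 22:49:45 message this gives slot 10 as the only outstanding command, which matches "Timeout on slot 10", with the BSY bit set in the status. As far as I can tell that only says the command never completed, which fits a stuck drive just as well as a controller or driver problem, so it doesn't settle the question on its own.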

AndyUKG · Jun 2, 2011

You could try disabling AHCI and running a scrub to help zero in on the issue...

carlton_draught · Jun 2, 2011

pva said:
Judging from camcontrol output, after the timeouts the offending disk is taken offline:
# camcontrol devlist

Code:

<SAMSUNG HD204UI 1AQ10001>         at scbus0 target 0 lun 0 (ada0,pass0)
<SAMSUNG HD204UI 1AQ10001>         at scbus2 target 0 lun 0 (ada2,pass2)
<SAMSUNG HD204UI 1AQ10001>         at scbus3 target 0 lun 0 (ada3,pass3)

...

After performing a hard reboot, all the disks came back online again. Furthermore, smartctl reports that they are all in good health:
# smartctl -H /dev/ada1

Code:

smartctl 5.40 2010-10-16 r3189 [FreeBSD 8.2-RELEASE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Unfortunately, since the scrub job is resumed after the boot, it will eventually hang again at some point.

Because I've only encountered this problem while performing a ZFS scrub, my theory is that the 5400 RPM drives buckle under the high IO load, causing the SATA controller or the ahci driver to think that the drive is not responding.

Secondly, how should I attempt to prevent these timeouts from occurring in the future? Are there any ahci driver parameters or ZFS kernel tunables I could try?

I suspect the Samsungs might be your problem. While I haven't used Samsung yet with ZFS, I've had bad sector trouble with a lot of the ones I had (2TB). I had one mounted as /home on my Ubuntu home PC, and it started getting bad sectors. When I tried to access some of the files, it would take FOREVER, the screen would go gray as it does. Sound familiar? From what I can tell reading here and other forums, this is default behaviour for the Samsungs. They spend a lot of time trying to read a sector before giving up.

My bet is that your Samsung drive is dying; it's just that the -H option is an ankle-high hurdle for HDD health. I bet that there are indications of something rotten in the state of Denmark if you look a bit closer. Just the fact that it is only one drive timing out and not the others should be a huge clue that it is a drive problem and not a system problem.

Let's see the output of (if ada1 is your bad drive)
# smartctl -a /dev/ada1

And in the meantime, do the following to force your drive to have all sectors read and tested.
# smartctl -t long /dev/ada1

After that completes, show us again:
# smartctl -a /dev/ada1

This thread points to what your problem is. Scroll through the pro-Samsung opinions to get to the last post. May also be related to this issue, however I'd thoroughly check the drive health first.

tingo · Jun 2, 2011

carlton_draught said:
I suspect the Samsungs might be your problem. While I haven't used Samsung yet with ZFS,

Well, I use Samsung drives with ZFS, and without ZFS, and have done so for some years now. I like Samsung drives because they are quiet and reliable (all drives will fail eventually, but Samsung drives aren't bad in that regard). YMMV.

pva · Jun 3, 2011

carlton_draught said:
Let's see the output of (if ada1 is your bad drive)
# smartctl -a /dev/ada1

(Please find attached the complete smartctl logs, I've only reproduced the relevant portions here.)

Code:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   068   068   025    Pre-fail  Always       -       9777
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       42
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       900
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       42
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       31
191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   064   000    Old_age   Always       -       30 (Min/Max 20/34)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       21
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       43

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

carlton_draught said:
And in the meantime, do the following to force your drive to have all sectors read and tested.
# smartctl -t long /dev/ada1

After that completes, show us again:
# smartctl -a /dev/ada1

Code:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   055   055   000    Old_age   Always       -       19271
  3 Spin_Up_Time            0x0023   068   068   025    Pre-fail  Always       -       9777
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       42
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       911
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       42
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       31
191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   064   000    Old_age   Always       -       29 (Min/Max 20/34)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       22
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       43

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       906         -

It seems the drive passes the extended offline self-test without any errors. The only difference I can discern in the SMART statistics is an increase in the raw Multi_Zone_Error_Rate value. This, however, is probably of no consequence. I also ran a self-test on ada0, which had also reported timeouts, and it too passed with flying colours.
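
Comparing the two tables by eye is easy to get wrong, so here is a rough Python sketch that diffs them instead. It assumes the row layout shown above (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value); the function names are my own, and only a handful of rows are copied in to keep it short.

```python
def parse_attributes(text):
    """Map attribute name -> (normalized value, raw value) from smartctl-style rows."""
    attrs = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = (int(fields[3]), " ".join(fields[9:]))
    return attrs


def changed(before, after):
    """Return {name: (before, after)} for attributes that differ between snapshots."""
    return {name: (before[name], after[name])
            for name in before
            if name in after and before[name] != after[name]}


# A few rows copied from the two smartctl -a outputs above.
BEFORE = """
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       900
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       21
"""
AFTER = """
  2 Throughput_Performance  0x0026   055   055   000    Old_age   Always       -       19271
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       911
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       22
"""

for name, (old, new) in changed(parse_attributes(BEFORE), parse_attributes(AFTER)).items():
    print(name, old, "->", new)
```

On these rows it also flags Throughput_Performance and Power_On_Hours. The latter is just the 11 hours that passed between the two runs, and I'd guess the former was simply updated by the self-test itself, but I haven't verified that.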

carlton_draught said:
This thread points to what your problem is. Scroll through the pro-Samsung opinions to get to the last post. May also be related to this issue, however I'd thoroughly check the drive health first.

OTOH, sub.mesa (of ZFSguru fame) states in a discussion on consumer drives and TLER in response to the statement you referred me to:

[O]n FreeBSD the ATA/CAM stack controls the timeouts, and is set to progressively increase the timeouts as they occur; before the disk will be detached. This means that your disk should not be detached due to a simple bad sector timeout, as common on desktop systems. It doesn't use a fixed timeout value; but rather keeps initial timeout low to report to ZFS and anything that lies 'beyond' that disk in the GEOM framework, while retrying with a higher timeout value as they occur; until finally failing if recovery time has expired.

Based on this and the SMART test results, I'm even more inclined to pin this on the SATA controller/ahci driver.

Thanks for the help, though, it's really useful to be able to narrow down the group of potential suspects. A real whodunnit, this!

pva · Jun 3, 2011

AndyUKG said:
You could try disabling AHCI and running a scrub to help zero in on the issue...

I googled around some more, and stumbled upon this thread, where someone reports having seen the same kind of symptoms on a HP MicroServer with Samsung disks. Apparently, a fix has made it into 8-STABLE, but I'm hesitant to switch to it (this is my first FreeBSD install and I have no clue as to how stable the STABLE branch really is).

Since others are also reporting that the old ATA driver without CAM integration should work better with the IXP700, I'll try that next.

Apparently, switching drivers will cause the device names to change. What effect will this have on the ZFS pool? I'm assuming at least the GEOM labels will point to the wrong devices.

# glabel list

Code:

Geom name: ada0
Providers:
1. Name: label/disk1
   Mediasize: 2000398933504 (1.8T)
   Sectorsize: 512
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 3907029167
   length: 2000398933504
   index: 0
Consumers:
1. Name: ada0
   Mediasize: 2000398934016 (1.8T)
   Sectorsize: 512
   Mode: r1w1e2

Geom name: ada1
Providers:
1. Name: label/disk2
   Mediasize: 2000398933504 (1.8T)
   Sectorsize: 512
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 3907029167
   length: 2000398933504
   index: 0
Consumers:
1. Name: ada1
   Mediasize: 2000398934016 (1.8T)
   Sectorsize: 512
   Mode: r1w1e2

Geom name: ada2
Providers:
1. Name: label/disk3
   Mediasize: 2000398933504 (1.8T)
   Sectorsize: 512
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 3907029167
   length: 2000398933504
   index: 0
Consumers:
1. Name: ada2
   Mediasize: 2000398934016 (1.8T)
   Sectorsize: 512
   Mode: r1w1e2

Geom name: ada3
Providers:
1. Name: label/disk4
   Mediasize: 2000398933504 (1.8T)
   Sectorsize: 512
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 3907029167
   length: 2000398933504
   index: 0
Consumers:
1. Name: ada3
   Mediasize: 2000398934016 (1.8T)
   Sectorsize: 512
   Mode: r1w1e2

olav · Jun 3, 2011

To me it sounds like a controller issue. I have Samsung drives and they are very reliable.
Do you run smartctl in the background? Try to disable it. Some controllers have problems with querying SMART data while under load.

aragon · Jun 3, 2011

pva said:
Apparently, switching drivers will cause the device names to change. What effect will this have on the ZFS pool? I'm assuming at least the GEOM labels will point to the wrong devices.

It should be no problem. The labels will still be fine - that's the point of labels.
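
The glabel list output quoted above hints at why: each label provider is exactly one 512-byte sector shorter than the disk underneath it, which fits glabel keeping its metadata in the last sector of the disk itself rather than anywhere tied to the device name. Here is a quick arithmetic check in Python, using only numbers from that output and the boot messages (the variable names are mine):

```python
SECTOR_SIZE = 512              # "Sectorsize: 512" in glabel list
DISK_SECTORS = 3907029168      # sector count from the ada0 boot message
DISK_BYTES = 2000398934016     # consumer Mediasize (ada0)
LABEL_BYTES = 2000398933504    # provider Mediasize (label/disk1)
LABEL_SECTORS = 3907029167     # provider seclength

# The raw disk size is simply sector count times sector size.
assert DISK_SECTORS * SECTOR_SIZE == DISK_BYTES

# The labelled provider is shorter than the disk by exactly one sector.
reserved_bytes = DISK_BYTES - LABEL_BYTES
reserved_sectors = DISK_SECTORS - LABEL_SECTORS
print(reserved_bytes, reserved_sectors)
```

Since the label lives on the disk, GEOM should find it again whatever name the new driver gives the device.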

pva · Jun 4, 2011

olav said:
Do you run smartctl in the background? Try to disable it. Some controllers have problems with querying SMART data while under load.

Yeah, I do. I stopped smartd and began a new scrub job, but it still failed due to device timeouts on ahcich0 (sic!), albeit with a different status this time around:

Code:

  pool: backup
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress for 0h15m, 0.96% done, 26h23m to go
config:

        NAME           STATE     READ WRITE CKSUM
        backup         ONLINE       0     0     0
          label/disk1  ONLINE      34     0     0
          label/disk2  ONLINE       0     0     0
          label/disk3  ONLINE       0     0     0
          label/disk4  ONLINE       0     0     0

pva · Jun 4, 2011

pva said:
Since others are also reporting that the old ATA driver without CAM integration should work better with the IXP700, I'll try that next.

Switching to the old ata(4) driver did the trick: the scrub job completed without errors.

Code:

  pool: backup
 state: ONLINE
 scrub: scrub completed after 7h54m with 0 errors on Sat Jun  4 19:27:19 2011
config:

	NAME           STATE     READ WRITE CKSUM
	backup         ONLINE       0     0     0
	  label/disk1  ONLINE       0     0     0
	  label/disk2  ONLINE       0     0     0
	  label/disk3  ONLINE       0     0     0
	  label/disk4  ONLINE       0     0     0

errors: No known data errors

# atacontrol list

Code:

...
ATA channel 2:
    Master:  ad4 <SAMSUNG HD204UI/1AQ10001> SATA revision 2.x
    Slave:       no device present
ATA channel 3:
    Master:  ad6 <SAMSUNG HD204UI/1AQ10001> SATA revision 2.x
    Slave:       no device present
ATA channel 4:
    Master:  ad8 <SAMSUNG HD204UI/1AQ10001> SATA revision 2.x
    Slave:       no device present
ATA channel 5:
    Master: ad10 <SAMSUNG HD204UI/1AQ10001> SATA revision 2.x
    Slave:       no device present

Fortunately, for my purposes, stability is more important than performance, so I don't mind using the old driver even though I have to give up hot swapping and NCQ. (The box in question is an off-site backup server connected to the Internet over a 3G link, so disk IO is definitely not a bottleneck.) I guess I'll have to give ahci(4) another go in a future release.

jem · Jun 23, 2011

Be aware also that Samsung HD204UI disks manufactured before December 2010 have a firmware bug that can cause bad sectors to be reported under certain conditions.

There is a firmware update available, but it doesn't update the firmware revision reported by the drive, so you have no way of knowing if your disks are affected besides the manufacture date.

See the links in the following:

Code:

smartd[778]: Device: /dev/ada0, WARNING: Using smartmontools or hdparm with this
smartd[778]: drive may result in data loss due to a firmware bug.
smartd[778]: ****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
smartd[778]: Buggy and fixed firmware report same version number!
smartd[778]: See the following web pages for details:
smartd[778]: http://www.samsung.com/global/business/hdd/faqView.do?b2b_bbs_msg_id=386
smartd[778]: http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

I think I got lucky as I also have a MicroServer with four of these disks, running in AHCI mode, and haven't seen any problems.

pva · Jun 23, 2011

jem said:
Be aware also that Samsung HD204UI disks manufactured before December 2010 have a firmware bug that can cause bad sectors to be reported under certain conditions.

There is a firmware update available, but it doesn't update the firmware revision reported by the drive, so you have no way of knowing if your disks are affected besides the manufacture date.

Thanks for the tip. I just checked the manufacturing dates of my disks, and they have all been manufactured either in December 2010 (2010.12) or January 2011 (2011.01).

I think I got lucky as I also have a MicroServer with four of these disks, running in AHCI mode, and haven't seen any problems.

I actually stumbled upon your dmesg(8) output while googling for a solution, but was left wondering whether you'd experienced the same kind of problems with your setup. Seems like I got my answer after all.

Have you, by any chance, run a scrub on your pool yet?

I don't know whether it's significant or not, but the disks that previously reported errors (disk1 and disk2 below) are full:

# zpool iostat -v

Code:

                  capacity     operations    bandwidth
pool            used  avail   read  write   read  write
-------------  -----  -----  -----  -----  -----  -----
backup         5.66T  1.59T     41     22  5.18M  2.25M
  label/disk1  1.81T  11.7M      0      1  13.4K  9.45K
  label/disk2  1.81T   232M      0      1  9.30K  12.7K
  label/disk3  1.04T   795G     20     10  2.60M  1.13M
  label/disk4  1.00T   831G     20     10  2.56M  1.11M
-------------  -----  -----  -----  -----  -----  -----

jem · Jun 25, 2011

I ran a scrub just after my previous post. It completed without errors in 30 minutes.

Incidentally, you do realize that your pool isn't raidz, right? It's just striping over the disks with no parity. If one disk dies, you'll most likely lose all your data. Compare:

Code:

        NAME           STATE     READ WRITE CKSUM
        backup         ONLINE       0     0     0
          label/disk1  ONLINE      34     0     0
          label/disk2  ONLINE       0     0     0
          label/disk3  ONLINE       0     0     0
          label/disk4  ONLINE       0     0     0

with:

Code:

        NAME        STATE     READ WRITE CKSUM
        pool0       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0

I'm also wondering why two of your disks are showing 1.04 and 1.00TB capacities. Don't you have four 2TB/1.81TiB disks like me?

pva · Jun 25, 2011

jem said:
I ran a scrub just after my previous post. It completed without errors in 30 minutes.

Funny. I should probably update the firmware on my disks for good measure.

On my system the scrub took a little over 7 hours to complete, so I'm assuming your pool is a little less full than mine.

Incidentally, you do realize that your pool isn't raidz, right? It's just striping over the disks with no parity. If one disk dies, you'll most likely lose all your data.

Yes, I'm well aware that I'm running a simple striped pool. This is a conscious choice of preferring raw capacity and cost over redundancy.

At the risk of veering even more off-topic, I'll elaborate some more:

As I've previously mentioned in this thread, the box in question is an offsite backup server. So, should one or more disks fail, I could just replace the failed disk(s), lug the box back over to the source site and copy over the original data from the source server (a Netgear NAS running their proprietary version of RAID-5).

I've considered switching to a raidz (or raidz2) pool once I've bought a 4 or 5 disk eSATA enclosure and some more disks to go with it. I reckon I'll have to do this sooner rather than later since the NAS has approximately 9 TiB of space and the backup server only has 7.2 TiB's worth.

Moreover, the data being backed up consists of my personal media collection which I've ripped from optical media I own. Hence, in the unlikely event of the proverbial excrement hitting the equally hypothetical fan, i.e. both my NAS and backup server failing simultaneously, I could still re-rip everything from the original media. This would obviously entail spending a few months' worth of free evenings re-ripping everything, but this is a risk I'm willing to run.

I'm also wondering why two of your disks are showing 1.04 and 1.00TB capacities. Don't you have four 2TB/1.81TiB disks like me?

Note that it's the used column that you're referring to. If you add up these values with the values from the avail column, they work out to 1.8TiB. (The reason disk3 and disk4 are only about half full is because I added them to the pool a few months after the first pair.)
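
To spell out the arithmetic, here is a small Python sketch using the figures from my zpool iostat -v and glabel list output. I'm assuming zpool's T and G suffixes are binary (powers of 1024); the variable names are just for illustration.

```python
TIB = 2 ** 40          # bytes in one TiB
GIB_PER_TIB = 1024

# Size of one labelled disk, taken from the glabel list Mediasize.
label_bytes = 2000398933504
size_tib = round(label_bytes / TIB, 2)
size_tb = round(label_bytes / 10 ** 12, 2)

# used + avail per device from zpool iostat -v, converted to TiB.
disk3_tib = 1.04 + 795 / GIB_PER_TIB
disk4_tib = 1.00 + 831 / GIB_PER_TIB
pool_tib = 5.66 + 1.59

print(size_tib, size_tb)
print(round(disk3_tib, 2), round(disk4_tib, 2), round(pool_tib, 2))
```

Both disks come out at roughly 1.8 TiB each, in line with the 2 TB on the label, and the pool total of about 7.25 TiB is where my 7.2 TiB figure comes from.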

jem · Jun 26, 2011

Oops, sorry for the oversight.

pva · Jul 7, 2011

pva said:
Funny. I should probably update the firmware on my disks for good measure.

Well, it would seem that contrary to previous reports (e.g. on the smartmontools mailing list), disks manufactured in December 2010 still need the firmware patch after all.

I had a chance to patch the firmware on my disks yesterday. Afterwards, I re-enabled ahci(4) and scrubbed the pool. The two disks that had previously failed (and were manufactured in December 2010) presented no trouble this time around.

tingo · Jul 7, 2011

Useful and interesting info. Thanks for mentioning it.

borov · Jul 8, 2011

Funny. I have the same Microserver.

No problem with scrub with a RAIDZ pool on these drives in AHCI mode:

Code:

<WDC WD15EADS-00P8B0 01.00A01>     at scbus0 target 0 lun 0 (ada0,pass0)
<WDC WD15EADS-00P8B0 01.00A01>     at scbus1 target 0 lun 0 (ada1,pass1)
<Hitachi HDS5C3020ALA632 ML6OA580>  at scbus2 target 0 lun 0 (ada2,pass2)

But I have a problem with TIMEOUTS when scrubbing the boot pool located on an Intel (actually Kingston) X25-E 32GB SSD:

Code:

ad1: TIMEOUT - READ_DMA retrying (1 retry left) LBA=9526168
ad1: TIMEOUT - READ_DMA retrying (1 retry left) LBA=9526176
ad1: TIMEOUT - READ_DMA retrying (1 retry left) LBA=9526186

The SSD is installed in the optical drive bay, so it runs in ATA mode.
