FreeBSD 8.3 w/ ZFS 5 disk timeouts/issues.

salival · Feb 5, 2013

Gentlemen,

My server consists of the following:

CPU: Intel E6600 2.4 GHz Core 2 Duo
MB: Gigabyte GA P35DS3R
HD: 5x WD20EARX in a RAID-Z1.

When I built this server it worked flawlessly: great speeds over my network and 0 issues. Then I noticed that I would occasionally get AHCI timeouts in dmesg, and in the most extreme cases the array would fault.

When I then ran a scrub, it would find no errors, or at the very most 200-850 KB of data would be resilvered. The server would then run for a few days to a few weeks before anything else cropped up.

Cue a few months back: the array faulted, and one of my 2TB drives was showing SMART errors and had dropped itself from the array. Hurrah, I thought, there's my issue. So I purchased another 2TB drive and sent the damaged drive in for RMA. After a week or so I heard back from the computer store; they said the drive tested fine and there was nothing wrong with it. The array was back to normal, but now it occasionally does the same thing on a different port (it was ada2, now it is ada0).

I've tried replacing the SATA cables with no joy, but I do notice that when I get a timeout error or the array drops, I can power the server down, unplug and replug the SATA connectors, and then have no issues for a while. Could this be a bad controller, or just bad cables?

Attached are some printouts from the server. The status/resilver information was captured tonight when the array went down. Once again, I reseated the SATA connectors, powered the server back up, and ran a scrub with 0 issues.

Thanks,
Scott

Code:

ahcich2: Timeout on slot 13 port 0
ahcich2: is 00000000 cs 00000000 ss 00006000 rs 00006000 tfd 40 serr 00880000 cmd 0004ce17
ahcich2: Timeout on slot 29 port 0
ahcich2: is 00000000 cs 00000006 ss e0000007 rs e0000007 tfd c0 serr 00880000 cmd 0004c117
ahcich2: Timeout on slot 9 port 0
ahcich2: is 00000000 cs 01000000 ss 01ff0200 rs 01ff0200 tfd c0 serr 00880000 cmd 0004d817
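
For anyone reading the log above, here is a rough sketch of how the serr value could be decoded. It assumes the serr field is the raw AHCI port SError register (PxSERR) and that the diagnostic bits 16-26 follow the layout in the AHCI 1.x specification; the bit names in the table are my reading of that spec, and the function name decode_serr is made up for this example.

```python
# Assumed names for the diagnostic bits (16-26) of the AHCI PxSERR register.
# This table reflects one reading of the AHCI 1.x spec; verify before relying on it.
SERR_DIAG_BITS = {
    16: "PhyRdy change",
    17: "Phy internal error",
    18: "Comm wake",
    19: "10B to 8B decode error",
    20: "Disparity error",
    21: "CRC error",
    22: "Handshake error",
    23: "Link sequence error",
    24: "Transport state transition error",
    25: "Unknown FIS type",
    26: "Exchanged",
}


def decode_serr(hex_value):
    """Return the names of the diagnostic bits set in a serr hex string."""
    value = int(hex_value, 16)
    return [
        name
        for bit, name in sorted(SERR_DIAG_BITS.items())
        if value & (1 << bit)
    ]


# The serr value reported on every timeout line in the dmesg excerpt.
print(decode_serr("00880000"))
```

Under those assumptions, 00880000 has bits 19 and 23 set. Both are reported at the SATA link level, which would at least be consistent with reseating the connectors helping, though it does not by itself rule out the controller.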

Code:

[root@gateway ~]# zpool status
  pool: storage
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 177M in 0h0m with 0 errors on Tue Feb  5 19:03:06 2013
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0    ONLINE      10 11.0K    18
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada4    ONLINE       0     0     0

errors: No known data errors
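
To keep an eye on which device is accumulating errors between faults, a small script along these lines could flag any row with non-zero counters. It is only a sketch: it assumes the column layout shown in the output above (NAME, STATE, READ, WRITE, CKSUM), the function name devices_with_errors is hypothetical, and the counters are kept as strings because values like 11.0K are abbreviated by zpool.

```python
# Sample text copied from the config section of the zpool status output above.
SAMPLE = """\
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0    ONLINE      10 11.0K    18
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada4    ONLINE       0     0     0

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Map device name -> counters for rows whose READ/WRITE/CKSUM are not all 0."""
    flagged = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True            # header row: device rows follow
        elif in_table and not fields:
            in_table = False           # a blank line ends the device table
        elif in_table and len(fields) >= 5:
            counters = fields[2:5]     # READ, WRITE, CKSUM as printed
            if any(c != "0" for c in counters):
                flagged[fields[0]] = dict(zip(("read", "write", "cksum"), counters))
    return flagged


print(devices_with_errors(SAMPLE))
```

In practice the text would come from running zpool status (for example via subprocess) rather than from a pasted sample.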

dmesg information: http://pastebin.com/yiLY06hE
