Recurring data errors on a ZFS mirror pool

Hi everybody,

I have a small home server running FreeBSD with ZFS on all disks. I created "rpool" over a year ago, and over the last two months the error rate has been increasing. I have checked the S.M.A.R.T. data but cannot find anything out of the ordinary that would explain these recurring errors. Most of the errors go away after a simple ZFS scrub, but sometimes they remain and I have to fall back on snapshots.

This is an example of how the root pool (a mirror of two disks) looked this morning:

rpool, before ZFS scrub
Code:
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress for 0h0m, 2.43% done, 0h21m to go
config:

        NAME         STATE     READ WRITE CKSUM
        rpool        ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            aacd0p3  ONLINE       0     0     0
            aacd1p3  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        //bin/tcsh
        //sbin/devd
        /usr/local/man/whatis
        /usr/local/lib/xcdroast-0.98/bin
        /usr/bin/strip
        rpool/usr:<0x27d3b>
        /usr/local/bin/bash
        rpool/usr:<0x34a57>
        rpool/usr:<0x34a5c>
        rpool/usr:<0x34a5d>
        /usr/local/bin/libtool
        /usr/local/lib/libruby18.so.18
        /usr/bin/nm
        /usr/ports/distfiles/autoconf-2.68.tar.bz2
        /usr/ports/devel/autoconf/ruby18.core
        /usr/ports/distfiles/teTeX/tetex-texmf-3.0.tar.gz

rpool, after ZFS scrub
Code:
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 0h17m with 2 errors on Sun Jul 17 10:24:24 2011
config:

        NAME         STATE     READ WRITE CKSUM
        rpool        ONLINE       0     0     2
          mirror     ONLINE       0     0     4
            aacd0p3  ONLINE       0     0    10  42.5K repaired
            aacd1p3  ONLINE       0     0     7  18K repaired

errors: Permanent errors have been detected in the following files:

        //devd.core
        /usr/ports/devel/autoconf/ruby18.core

Most of the affected files were written only once (during installation), and now they suddenly have data errors. I'm also surprised that both disks show errors and not just one. If one disk consistently had errors, I'd simply have it replaced, but in this case I'm confused.
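
To compare the per-device counters from one scrub to the next, the config table of `zpool status` can be pulled apart with a few lines of Python. This is only a rough sketch: the function name `parse_vdev_errors` and the `SAMPLE` text are made up for illustration, and it assumes the column layout shown in the output above (NAME, STATE, READ, WRITE, CKSUM) with plain integer counters.

```python
def parse_vdev_errors(status_text):
    """Return {vdev_name: (read, write, cksum)} from `zpool status` text.

    Assumes the table starts at the 'NAME STATE READ WRITE CKSUM' header,
    ends at the first blank line, and uses plain integer counters.
    """
    rows = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True
            continue
        if not in_config:
            continue
        if not fields:
            break  # a blank line ends the config table
        # Trailing notes such as "42.5K repaired" are ignored.
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            rows[fields[0]] = tuple(int(f) for f in fields[2:5])
    return rows


# Config table copied from the "after ZFS scrub" output above.
SAMPLE = """\
        NAME         STATE     READ WRITE CKSUM
        rpool        ONLINE       0     0     2
          mirror     ONLINE       0     0     4
            aacd0p3  ONLINE       0     0    10  42.5K repaired
            aacd1p3  ONLINE       0     0     7  18K repaired

errors: Permanent errors have been detected in the following files:
"""

errors = parse_vdev_errors(SAMPLE)
for name, (read, write, cksum) in errors.items():
    if read or write or cksum:
        print(f"{name}: read={read} write={write} cksum={cksum}")
```

Fed with the saved output of each scrub, this makes it easy to see whether the checksum errors stay on one disk or keep showing up on both.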

One other thing that just occurred to me:
I reconfigured the system two months ago, and all the disks that show data errors are connected to an Adaptec RAID 3805 HBA (8 in total: 2 root pool disks and 6 data pool disks). I don't use any of the RAID features of the Adaptec card and let ZFS handle the RAID setup.

I appreciate any ideas that help me to identify the source of the problem.

Have a nice day everyone!
 
ana5azi said:
I appreciate any ideas that help me to identify the source of the problem.
It would help if you could post the version of FreeBSD you're running (and, if it is -STABLE or -CURRENT, the date on which you last updated the source tree), as well as the ZFS filesystem version and pool version (these are normally displayed at boot time, or when the ZFS kernel modules are loaded). Please also post the brand, model number and firmware version of the disk drives and disk controller(s) you are using.
 
Hi,

I'm running FreeBSD 8.2-RELEASE-p2, and the last ports upgrade was done last Friday (15 July).
ZFS filesystem version 4
ZFS storage pool version 15

Phew, and now the tricky stuff ...

The "rpool" is made of two mirrored HDDs, connected to the Adaptec RAID 3805
Code:
Model Family:     Seagate Momentus 7200.4 series, 500GB
Device Model:     ST9500420AS
Firmware Version: 0002SDM1

The "dpool" is made of six HDDs utilizing RAIDZ1, also connected to the Adaptec RAID 3805
Code:
Model Family:     Seagate Barracuda 7200.12 family, 1TB
Device Model:     ST31000528AS
Firmware Version: CC38

The information about the Adaptec Raid HBA:
Code:
aac0: <Adaptec RAID 3805> mem 0xfe800000-0xfe9fffff irq 18 at device 14.0 on pci3
aac0: Enabling 64-bit address support
aac0: Enable Raw I/O
aac0: Enable 64-bit array
aac0: New comm. interface enabled
aac0: [ITHREAD]
aac0: Adaptec 3805, aac driver 2.1.9-1


The "mpool" consists of four HDDs as RAIDZ1, directly connected to the controller of the mainboard, which is an Asus M4A785TD-V.
Code:
Model Family:     Seagate Barracuda LP, 2TB
Device Model:     ST32000542AS
Firmware Version: CC34

The information about the on-board controller:
Code:
atapci0: <ATI IXP700/800 SATA300 controller> port ... mem 0xfe4ffc00-0xfe4fffff irq 22 at device 17.0 on pci0
atapci0: AHCI v1.10 controller with 6 3Gbps ports, PM supported


The only pool that stays clean all the time is the "mpool", which is connected directly to the mainboard. When I first set up the system, I did not use the Adaptec HBA, and the system ran for about 6 months without a single data error. Due to performance issues I switched from the cheap HBA that I used at first to the Adaptec 3805, and I have had "write caching" deactivated on the HBA from the start.

Enjoy your day!
 