ZFS errors, what do you make of this?

I'm not sure what to make of this exactly. I originally saw this message:
Code:
  pool: wonspool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P

and I had one checksum error.

So I ran a scrub (I hadn't run one in maybe a month; I'm thinking now that was too long), and this is the result:
Code:
  pool: wonspool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 9h0m with 0 errors on Thu Dec 17 17:01:11 2009
config:

	NAME             STATE     READ WRITE CKSUM
	wonspool         ONLINE       0     0     0
	  raidz1         ONLINE       0     0     0
	    label/won0   ONLINE       0     0     2  85.5K repaired
	    label/won1   ONLINE       0     0     1  42.5K repaired
	    label/won2   ONLINE       0     0     1  43K repaired
	    label/won3   ONLINE       0     0     0
	  raidz1         ONLINE       0     0     0
	    label/won4   ONLINE       0     0     0
	    label/won5   ONLINE       0     0     2  85.5K repaired
	    label/won6   ONLINE       0     0     6  256K repaired
	    label/won7   ONLINE       0     0     0
	  raidz1         ONLINE       0     0     0
	    label/won8   ONLINE       0     0     6  256K repaired
	    label/won9   ONLINE       0     0     4  171K repaired
	    label/won10  ONLINE       0     0     4  171K repaired
	    label/won11  ONLINE       0     0    29  1.13M repaired
	logs             ONLINE       0     0     0
	  label/zil0     ONLINE       0     0     0

errors: No known data errors



Now, I can tell ZFS worked exactly as it was designed here, but should I be worried about label/won11 with its 29 errors? Or should I be more worried that they are spread out across the devices like this?


Also, is it possible to find out more info about which files were messed up?
 
Checksum errors you don't have to worry about too much. A checksum error means that what's on disk doesn't match the checksum for that block, and that's possibly just some flipped bits somewhere in the storage stack (disk, controller, RAM, CPU, etc).

Read/write errors are what you really need to worry about. Once those start to climb, you need to think about replacing the drive.

However, considering you have repairs done to almost every disk in the pool, I'd start looking for causes. Heat? Dust? Bad RAM? etc. You should not have that many errors across that many disks all at the same time. :)

On our 24-drive boxes, we rarely see more than 1 drive with errors at any one time and those are generally read errors due to dying drives (RAID controller shows ECC errors on the drive).
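
To make the checksum vs. read/write distinction easier to track from one scrub to the next, here is a rough Python sketch that pulls the per-device counters out of the `zpool status` config table. The function names (`parse_counters`, `flagged`) are made up for illustration, the sample rows are copied from the output earlier in the thread, and the parser assumes plain integer counters as shown there.

```python
import re

# A few rows copied from the scrub output earlier in the thread (trimmed).
SAMPLE = """\
    NAME             STATE     READ WRITE CKSUM
    wonspool         ONLINE       0     0     0
      raidz1         ONLINE       0     0     0
        label/won0   ONLINE       0     0     2  85.5K repaired
        label/won3   ONLINE       0     0     0
        label/won11  ONLINE       0     0    29  1.13M repaired
"""

# name, state, then three integer counters; the header row fails the \d+ match.
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_counters(text):
    """Return {name: (read, write, cksum)} for each row of the config table."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            name, _state, read, write, cksum = m.groups()
            rows[name] = (int(read), int(write), int(cksum))
    return rows


def flagged(rows):
    """Split devices into (read/write errors, checksum-only errors)."""
    io = sorted(n for n, (r, w, _c) in rows.items() if r or w)
    cksum_only = sorted(n for n, (r, w, c) in rows.items() if c and not (r or w))
    return io, cksum_only


if __name__ == "__main__":
    io, cksum_only = flagged(parse_counters(SAMPLE))
    print("read/write errors:", io)
    print("checksum errors only:", cksum_only)
```

To run it against a live pool, the sample text could be swapped for the output of `zpool status wonspool` (for example via `subprocess.run`); saving the result after each scrub makes it easy to see whether the same devices keep showing up.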
 
Keep an eye on it. :) If there continue to be errors across all the drives in subsequent scrubs, you may want to start investigating replacement hardware.
 
phoenix said:
Keep an eye on it. :) If there continue to be errors across all the drives in subsequent scrubs, you may want to start investigating replacement hardware.

OK, I'll run another scrub and see what happens. I also noticed this when it hit an error:
Code:
Dec 16 19:29:50 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won11 offset=674551107072 size=43520
Dec 17 03:22:06 wonslung-raidz sudo: wonslung : TTY=pts/3 ; PWD=/var ; USER=root ; COMMAND=/usr/bin/su
Dec 17 03:30:00 wonslung-raidz sudo: wonslung : TTY=pts/4 ; PWD=/usr/home/wonslung ; USER=root ; COMMAND=/usr/bin/su
Dec 17 07:56:59 wonslung-raidz sudo: wonslung : TTY=pts/5 ; PWD=/usr/home/wonslung ; USER=root ; COMMAND=/usr/bin/su
Dec 17 14:30:34 wonslung-raidz kernel: hptrr: [0,0] completion error, flags=84
Dec 17 14:30:37 wonslung-raidz kernel:
Dec 17 14:30:37 wonslung-raidz kernel: hptrr: ATA regs: error 4, sector count 80, LBA low 501e, LBA mid ff, LBA high b, device 0, status 41
Dec 17 14:30:37 wonslung-raidz kernel:
Dec 17 14:30:37 wonslung-raidz kernel: hptrr: start channel [0,0]
Dec 17 14:30:37 wonslung-raidz kernel: hptrr: channel [0,0] started successfully
Dec 17 14:30:41 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won11 offset=687593111552 size=2048
Dec 17 14:34:07 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won1 offset=693255964160 size=43520
Dec 17 14:35:10 wonslung-raidz kernel: hptrr: [0,0] completion error, flags=84
Dec 17 14:35:11 wonslung-raidz kernel:
Dec 17 14:35:11 wonslung-raidz kernel: hptrr: ATA regs: error 4, sector count 80, LBA low 5091, LBA mid bc, LBA high d5, device 0, status 41
Dec 17 14:35:11 wonslung-raidz kernel:
Dec 17 14:35:11 wonslung-raidz kernel: hptrr: start channel [0,0]
Dec 17 14:35:11 wonslung-raidz kernel: hptrr: channel [0,0] started successfully
Dec 17 14:35:07 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won11 offset=690794025472 size=44032
Dec 17 14:35:07 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won11 offset=690794069504 size=43520
Dec 17 14:35:21 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won8 offset=692907514368 size=44032
Dec 17 14:36:00 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won2 offset=693494694912 size=44032
Dec 17 14:36:20 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won11 offset=693110102016 size=43520
Dec 17 14:36:22 wonslung-raidz kernel: hptrr: [0,0] completion error, flags=84
Dec 17 14:36:22 wonslung-raidz kernel:
Dec 17 14:36:22 wonslung-raidz kernel: hptrr: ATA regs: error 40, sector count e8b0, LBA low 50fd, LBA mid 6a, LBA high 28, device 40, status 41
Dec 17 14:36:22 wonslung-raidz kernel:
Dec 17 14:36:22 wonslung-raidz kernel: hptrr: start channel [0,0]
Dec 17 14:36:22 wonslung-raidz kernel: hptrr: channel [0,0] started successfully
Dec 17 15:07:30 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won11 offset=662447910400 size=44032
Dec 17 15:07:31 wonslung-raidz kernel: hptrr: [0,0] completion error, flags=84
Dec 17 15:07:31 wonslung-raidz kernel:
Dec 17 15:07:31 wonslung-raidz kernel: hptrr: ATA regs: error 4, sector count 80, LBA low 4dba, LBA mid 9c, LBA high 1e, device 0, status 41
Dec 17 15:07:31 wonslung-raidz kernel:
Dec 17 15:07:31 wonslung-raidz kernel: hptrr: start channel [0,0]
Dec 17 15:07:30 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won11 offset=662447954432 size=43520
Dec 17 15:07:31 wonslung-raidz kernel: hptrr: channel [0,0] started successfully
Dec 17 15:13:27 wonslung-raidz kernel: hptrr: [0,0] completion error, flags=84
Dec 17 15:13:28 wonslung-raidz kernel:
Dec 17 15:13:28 wonslung-raidz kernel: hptrr: ATA regs: error 40, sector count 1040, LBA low 4e04, LBA mid 1c, LBA high e7, device 40, status 41
Dec 17 15:13:28 wonslung-raidz kernel:
Dec 17 15:13:28 wonslung-raidz kernel: hptrr: start channel [0,0]
Dec 17 15:13:28 wonslung-raidz kernel: hptrr: channel [0,0] started successfully
Dec 17 15:13:28 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won11 offset=677761676800 size=43520
Dec 17 15:20:17 wonslung-raidz kernel: hptrr: [0,0] completion error, flags=84
Dec 17 15:20:18 wonslung-raidz kernel:
Dec 17 15:20:18 wonslung-raidz kernel: hptrr: ATA regs: error 40, sector count 1888, LBA low 4fe1, LBA mid 2, LBA high e7, device 40, status 41
Dec 17 15:20:18 wonslung-raidz kernel:
Dec 17 15:20:18 wonslung-raidz kernel: hptrr: start channel [0,0]
Dec 17 15:20:18 wonslung-raidz kernel: hptrr: channel [0,0] started successfully
Dec 17 15:20:18 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won11 offset=686350598144 size=44032
Dec 17 15:20:18 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won11 offset=686350642176 size=43520
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won6 offset=684323852800 size=44032
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won6 offset=684323896832 size=43520
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won9 offset=692884438016 size=44032
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won9 offset=692884482048 size=43520
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won0 offset=693483714048 size=44032
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won0 offset=693483758080 size=43520
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won8 offset=692890314752 size=44032
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won8 offset=692890358784 size=43520
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won5 offset=684335797248 size=44032
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won5 offset=684335841280 size=43520
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won6 offset=684339430400 size=44032
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won6 offset=684339474432 size=43520
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won8 offset=692901019136 size=44032
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won8 offset=692901063168 size=43520
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won10 offset=692902463488 size=44032
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won10 offset=692902507520 size=43520
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won9 offset=692904958976 size=44032
Dec 17 15:34:25 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won9 offset=692905003008 size=43520
 
Looks like ATA timeouts or ECC errors or something like that, between the controller and the hard drive. hptrr is resetting the ATA channel a lot. Something's definitely not right, but I can't really tell if it's the controller, the disk, or both.
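
To put numbers on "a lot", a small Python sketch like the one below can tally the checksum mismatches per device and the hptrr channel restarts per channel. The `tally` helper is a made-up name, the two regular expressions are written against the log lines quoted above (other driver versions may word them differently), and the sample is just four lines copied from that log.

```python
import re
from collections import Counter

# Patterns written against the /var/log/messages lines quoted in this thread.
CKSUM = re.compile(r"ZFS: checksum mismatch, zpool=\S+ path=(\S+)")
RESET = re.compile(r"hptrr: start channel \[(\d+),(\d+)\]")


def tally(lines):
    """Count checksum mismatches per device path and restarts per channel."""
    mismatches = Counter()
    resets = Counter()
    for line in lines:
        m = CKSUM.search(line)
        if m:
            mismatches[m.group(1)] += 1
            continue
        m = RESET.search(line)
        if m:
            resets[(int(m.group(1)), int(m.group(2)))] += 1
    return mismatches, resets


# Four lines copied from the log above.
SAMPLE = [
    "Dec 16 19:29:50 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won11 offset=674551107072 size=43520",
    "Dec 17 14:30:37 wonslung-raidz kernel: hptrr: start channel [0,0]",
    "Dec 17 14:30:41 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won11 offset=687593111552 size=2048",
    "Dec 17 14:34:07 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won1 offset=693255964160 size=43520",
]

if __name__ == "__main__":
    mismatches, resets = tally(SAMPLE)
    for path, count in mismatches.most_common():
        print(path, count)
    for channel, count in resets.items():
        print("channel", channel, "restarted", count, "time(s)")
```

Feeding it the whole of `/var/log/messages` instead of `SAMPLE` gives the full picture. In the excerpt above every restart is on channel [0,0], while the mismatches are spread over many devices.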
 
Man, something is definitely wrong.

I tried to run another scrub today and it seemed to do fine for a while, then all of a sudden the RAID card's warning buzzer went off.

I forgot the exact error (the beeping was so loud and annoying that I restarted the machine), but it said something along the lines of
"Synchronize Cache Failed!"

and zpool status shows a ton of I/O errors on a single disk.

These are the only actual errors I can pull from /var/log/messages:
Code:
Dec 22 10:54:49 wonslung-raidz root: ZFS: checksum mismatch, zpool=wonspool path=/dev/label/won11 offset=698021728256 size=43520
Dec 22 10:54:49 wonslung-raidz kernel: hptrr: [0,0] completion error, flags=84
Dec 22 10:54:49 wonslung-raidz kernel: 
Dec 22 10:54:49 wonslung-raidz kernel: hptrr: ATA regs: error 4, sector count 10a0, LBA low 5101, LBA mid e8, LBA high 44, device 40, status 41
Dec 22 10:54:49 wonslung-raidz kernel: 
Dec 22 10:54:49 wonslung-raidz kernel: hptrr: start channel [0,0]
Dec 22 10:54:49 wonslung-raidz root: ZFS: vdev I/O failure, zpool=wonspool path=/dev/label/won11 offset=262144 size=8192 error=22
Dec 22 10:54:49 wonslung-raidz kernel: hptrr: fail to start channel [0,0]
Dec 22 10:54:49 wonslung-raidz kernel: hptrr: device disconnected on channel [0,0]
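
The "ATA regs" lines can be read a little further. The sketch below is only a rough decoder: it assumes hptrr prints the registers in hexadecimal (values such as "e8b0" and "ff" suggest so) and that they follow the standard ATA error/status register bit layout. The `decode` helper and the bit tables are illustrative and list only the commonly cited bits.

```python
import re

# Standard ATA register bits (subset). Assumed to apply to hptrr's output.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

REGS = re.compile(r"hptrr: ATA regs: error ([0-9a-f]+), .*status ([0-9a-f]+)")


def names(value, table):
    """Return the names of the bits in `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def decode(line):
    """Decode the error and status registers of one hptrr 'ATA regs' line."""
    m = REGS.search(line)
    if not m:
        return None
    error = int(m.group(1), 16)   # assumption: printed in hex
    status = int(m.group(2), 16)
    return {"error": names(error, ERROR_BITS),
            "status": names(status, STATUS_BITS)}


if __name__ == "__main__":
    line = ("Dec 22 10:54:49 wonslung-raidz kernel: hptrr: ATA regs: error 4, "
            "sector count 10a0, LBA low 5101, LBA mid e8, LBA high 44, "
            "device 40, status 41")
    print(decode(line))
```

Under those assumptions, "error 4, status 41" reads as a command aborted by a drive that is ready and reporting an error, and the "error 40" lines from the earlier scrub read as uncorrectable data errors.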
 