Why does ZFS keep resilvering at every reboot?

Hi,

I have a FreeBSD 8.2 box running a ZFS mirror over two hard drives (drive0 and drive1). One of them had a problem and was replaced using zpool replace. After the resilvering completed, the status would still show
Code:
Replacing drive0/old and drive0
After a reboot, the same status would still be there and the resilvering would restart from the beginning.
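
For reference, the replace was done roughly like this (gpt/disk0 being the GPT label of the replaced drive in my setup):
Code:
# zpool replace zroot gpt/disk0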

I then detached both drive0 and drive0/old and attached drive0.
Resilvering started again and completed. The status looked normal except for three errors in some files.
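The detach/attach sequence went roughly like this (with gpt/disk1 as the healthy side of the mirror):
Code:
# zpool detach zroot gpt/disk0/old
# zpool detach zroot gpt/disk0
# zpool attach zroot gpt/disk1 gpt/disk0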

But when I rebooted the machine, the resilvering started again, and it has kept restarting on every reboot since.

Why is this happening? And how do I stop it?
 
You might want to install sysutils/smartmontools, then run
# smartctl -t short /dev/whatever
for both drives, wait a couple of minutes, then check drive health by running
# smartctl -H /dev/whatever
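Once the short tests finish, you can also inspect the self-test log and the full SMART attributes:
# smartctl -l selftest /dev/whatever
# smartctl -a /dev/whatever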

It might also help people understand the situation if you provide the output of # zpool status (make sure to paste it inside [code] tags; see the forum's BBCode help for the list of available tags).
 
Thank you for the advice. Both drives are reporting a healthy status. This is the output of [CMD]zpool status[/CMD]:
Code:
  pool: zroot
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 1h13m with 3 errors on Sun Jun 24 09:44:32 2012
config:

	NAME           STATE     READ WRITE CKSUM
	zroot          ONLINE       0     0     3
	  mirror       ONLINE       0     0     8
	    gpt/disk1  ONLINE       0     0     8
	    gpt/disk0  ONLINE       0     0     8  161G resilvered

errors: 3 data errors, use '-v' for a list

So if I reboot the machine, it will start resilvering disk0 all over again.
 
Hi,

Well, I didn't. I did not think it could be related, as I have seen this type of error before with another array (corrupted files). It is supposed to indicate that there are not enough valid copies left to repair the data. I usually replaced the files, or deleted them if they were not important. In any case, it never caused the drive to resilver at every reboot.
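
For what it's worth, the way I dealt with them before was roughly this:
Code:
# zpool status -v zroot
(restore the listed files from backup, or delete them if unimportant)
# zpool clear zroot
# zpool scrub zroot
and the errors would be gone after the next scrub.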

In case it helps, this is the complete message:
Code:
pool: zroot
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 1h13m with 3 errors on Sun Jun 24 09:44:32 2012
config:

	NAME           STATE     READ WRITE CKSUM
	zroot          ONLINE       0     0     3
	  mirror       ONLINE       0     0     8
	    gpt/disk1  ONLINE       0     0     8
	    gpt/disk0  ONLINE       0     0     8  161G resilvered

errors: Permanent errors have been detected in the following files:

        <0x15f>:<0x0>
        <0x16f>:<0x3038>
        <0x175>:<0x10cc>

By the way, running zpool scrub zroot also triggers a full resilver of the drive, even though everything seemed fine with it.
One thing I noticed, though, is that the CKSUM counts for the two drives and for the mirror are identical (8), but different from the count for the pool itself (3). Is this normal?
 
One or more devices has experienced an error resulting in data corruption.

Either ZFS is broken and telling lies (reporting errors where there are none) OR your hardware (drive/cable/controller/PSU/other) is broken.

Given that I'd trust end-to-end checksumming further than I'd trust SMART results, I'd be backing up all my data, swapping the drive cables, and running proper hardware tests on the drives and the motherboard controller.
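
For the drive tests, a long SMART self-test on each disk would be a reasonable start, e.g.
# smartctl -t long /dev/whatever
(it can take a few hours; check the result afterwards with smartctl -l selftest), followed by another zpool scrub once the cables have been swapped.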


I guess it comes down to this question: do you trust ZFS? If so, it is telling you that your drives are incurring errors; whether the cause is the cable, the controller, or the disk surface is another question to answer...
 