ZFS scrub/resilver confusion

Greetings all,

I have two hard drives, each split into a 10 GB slice for rpool and a 20 GB free-hog slice for datapool; both pools are arranged as mirrors. The configuration worked without problems until I decided it was a good idea to run periodic scrubs.

First, after a scrub,

Code:
zpool status datapool
reported too many errors on one of the hard drives and marked it FAULTED. Ditto for the rpool.

I have cleared the errors:

Code:
zpool clear datapool
zpool clear rpool

and the

Code:
zpool status

reported that the hard drives were resilvered, ONLINE, and working with no known data errors.

I have repeated the exercise with the same results. What puzzles me is that, as I understand it, the scrubbing and resilvering mechanisms are substantially the same. So why does one report errors and the other not? Or is one of the drives failing and being repaired in the background?

Any insight would be appreciated.

Kindest regards,

M
 
Scrubbing and resilvering aren't the same thing. Resilvering copies data from one disk to the other in order to regain consistency. As long as there's at least one good copy of a block that matches its checksum, that copy is written to wherever else it needs to be.

Scrubbing reads all the data in the pool and checks it against the checksums to make sure that everything is as it should be, repairing what it can from a good copy along the way.

I'm sure there's more to it than that, but in simple terms that's what's going on.

Issuing a scrub command should not result in errors under normal conditions. This is the part of your situation which is problematic. It could be a disk going bad, a faulty controller, something up with ZFS, or probably other things, but for whatever reason ZFS is finding corrupted data. A scrub will turn up any errors that are there; that is by design, and it is unlikely that the command itself is causing any corruption.

Resilvering fixing the problem is to be expected; that's what it's there for. But I'd be really careful until I found the cause, and I'd definitely make sure that there are adequate backups.
EDIT: If it's the same disk each time, I'd seriously consider replacing it. A hard disk that can't be trusted is worse than one that is outright broken.
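
For what it's worth, the scrub-and-check cycle described above can be sketched roughly as follows. This is only a sketch: the pool name datapool comes from the first post, while the run helper and the DRY_RUN variable are hypothetical additions so that the script only prints the commands by default. zpool scrub and zpool status -v are the standard commands.

```shell
#!/bin/sh
# Sketch only: pool name taken from the thread; override with POOL=...
POOL="${POOL:-datapool}"

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Start a scrub; it runs in the background and returns immediately.
run zpool scrub "$POOL"

# Show per-device READ/WRITE/CKSUM counters and any files with errors.
# Re-run this until the scan line reports that the scrub has finished.
run zpool status -v "$POOL"
```

Run it with DRY_RUN=0 to actually execute the commands. If the CKSUM or READ counters keep climbing on the same device after each scrub, that points at that disk (or its cable or port) rather than at the pool as a whole.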
 
hedwards,

thank you very much for the explanation. Yes, it is the same hard drive that is reporting the errors, so it would support your argument that it is a problem with the hard disk.

Before I replace it, is there a tool to check its functionality? Shouldn't smartmontools report a problem?

Kindest regards,

M
 
It could be something else, but the hard drive dying is the most likely cause; when it's the controller or software, it usually isn't the same disk every time. I'd probably go to the manufacturer's website and see what tools they use for determining that. They should also give you some indication as to whether or not you can just return the disk.

Usually that software is reliable enough that the manufacturer will use it to decide whether or not to replace the drive. The main problem with SMART is that it tends to miss a lot, and unless the drive flags a problem, you might not know about it.

EDIT: The reason I suggested replacing it is that it's typically cheaper than data recovery. Also, if you don't already have a UPS, you might consider getting one; the failure rate on my HDDs plummeted when I picked one up.
 
hedwards,

thank you once again for the response. I will see what tools the manufacturer has.

I am afraid that you are correct about smartmontools. I now recall that on a different machine smartmontools did not report any problems, but a built-in test flagged a drive as faulty.

I do have a backup; I learned that the hard way after narrowly escaping data loss.

Kindest regards,

M
 
Do an extended self-test with # smartctl -t long drive (where drive is the device name). It might take a while, but it should be a better indicator of problems than just looking at the SMART attributes.
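
A rough sketch of the whole sequence, in case it helps. The device path below is only a placeholder and will differ on your system, and the run helper with its DRY_RUN switch is a hypothetical wrapper so that nothing touches the disk unless you ask for it. The smartctl options themselves (-t long, -l selftest, -H) are standard smartmontools ones; some controllers may additionally need a -d device type option.

```shell
#!/bin/sh
# Placeholder device path (assumption); set DISK to the suspect drive.
DISK="${DISK:-/dev/sdX}"

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Start the extended (long) self-test; the drive runs it in the
# background and smartctl prints an estimated completion time.
run smartctl -t long "$DISK"

# Once the test has had time to finish, read the self-test log.
run smartctl -l selftest "$DISK"

# Overall health verdict as reported by the drive itself.
run smartctl -H "$DISK"
```

A read failure recorded in the self-test log is a much stronger sign that the drive should be replaced than a passing overall health status is a sign that it is fine.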
 