Solved: ZFS mount failed with error 5

Hi all, thanks for your help in advance!

After a power failure last night, my FreeBSD server will no longer boot correctly. It is a Dell R710 with six disks in a raidz2 behind a PERC RAID controller, with each drive exposed as its own mfi device. I know this is not the optimal controller for ZFS, but it is the machine I have. It has been running fine for years. I am not sure exactly which version of FreeBSD is on it, but I believe it is 11.

When booting, it hangs in mountroot with the message: "Mounting from zfs:zroot/ROOT/default failed with error 5."
The loader variable is:

vfs.root.mountfrom=zfs:/zroot/ROOT/default
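For reference, my understanding of the normal setup, which is what I will double-check first: a default root-on-ZFS install sets the variable without the leading slash, and the same string can be typed by hand at the mountroot prompt to retry the mount (names are from my layout):

# /boot/loader.conf on a default root-on-ZFS install
vfs.root.mountfrom="zfs:zroot/ROOT/default"

# the same dataset can be retyped at the prompt to retry the mount
mountroot> zfs:zroot/ROOT/default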

Hitting ? at the mountroot prompt shows a large number of devices, including:
/gpt/zfs[0-5]
/gpt/gptboot[0-5]
mfid[0-5]
mfid[0-5]p[1-3]
and a few others for the CD and USB drives.
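In case the partitioning details help, the labels can be mapped back to partitions from the rescue stick with the standard tools (mfid0 is just the first of the six drives):

# show GPT partitions and their labels on one drive
gpart show -l mfid0

# list every GEOM label the system sees
glabel status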

Booting from a USB stick and running "zpool import", I can see that the pool is online with all disks online.
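For anyone else in this situation, a read-only import seems like the low-risk way to inspect the pool from rescue media before trying anything destructive (standard zpool options; /mnt is just a mount point):

# list pools available for import, without importing anything
zpool import

# import read-only under an alternate root so nothing is written
zpool import -o readonly=on -R /mnt zroot

# check pool health and any recorded errors
zpool status -v zroot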

I am at a loss as to what to do, and I want to be careful not to lose data from the pool. I don't know what else to post here, but I am totally open to collecting other data if it will help resolve the problem.
 
I went back to confirm the pool really was intact, and in attempting to import it to check, I found that it was corrupted, even though it reported as online. The error said to import it with -F, that I would lose the final six seconds of data, and then to scrub the pool.

I imported it with "zpool import -f -F -R /mnt zroot", which went fine, and it is now scrubbing. Hopefully it will boot again when the scrub completes.
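For anyone following along, the full sequence as I understand the flags (-f forces the import, -F rewinds past the damaged transactions, -R sets an alternate root):

# rewind import: discards the last few transaction groups
zpool import -f -F -R /mnt zroot

# verify the pool afterwards
zpool scrub zroot
zpool status zroot

# detach cleanly before rebooting
zpool export zroot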
 
Posting to help those googling, since this thread is pretty much the only proper hit I found for this error.

For the first time ever I had a ZFS pool fail to mount, and with this error code (a datacentre went crazy with power cycles).

I had to boot rescue media and run the import with the force and recovery flags. It reported that it rolled back 30 seconds of data, so 30 seconds of writes were lost.
I then exported the pool and rebooted normally.
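One thing worth adding for the googlers: -F has a dry-run companion flag, so you can see whether the rewind would succeed before committing to it (standard zpool import flags; the pool name is an example):

# dry run: report whether recovery is possible, without performing it
zpool import -F -n zroot

# the real recovery import, then a clean export before rebooting
zpool import -F zroot
zpool export zroot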

I did run a scrub after the reboot, but no issues were found.

I am wondering whether the forced import with rollback should be the default behaviour, or at least configurable behaviour in rc.conf; it is pretty much what InnoDB does anyway. An automated mount with rollback is preferable to a non-booting system, where the solution is going to be the same anyway.
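To make the idea concrete, here is a rough sketch of what "configurable in rc.conf" could look like for non-root pools as a local rc.d script. To be clear, this is entirely hypothetical: the script, its rcvar, and the pool list are all made up, nothing like it ships with FreeBSD, and it could not help the root pool anyway, since that is mounted by the kernel before rc runs (that part would need loader support).

#!/bin/sh
# zpool_rewind -- HYPOTHETICAL sketch only; no such script ships with FreeBSD.
# Tries a rewind import for non-root pools that fail a normal import.
#
# PROVIDE: zpool_rewind
# BEFORE: zfs

. /etc/rc.subr

name="zpool_rewind"
rcvar="zpool_rewind_enable"
start_cmd="zpool_rewind_start"

# pools to handle would be listed in rc.conf, e.g.:
#   zpool_rewind_enable="YES"
#   zpool_rewind_pools="tank data"
zpool_rewind_start()
{
	for pool in ${zpool_rewind_pools}; do
		# try a normal import first; fall back to a rewind import
		zpool import "${pool}" || zpool import -F "${pool}"
	done
}

load_rc_config $name
run_rc_command "$1"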
 
For the first time ever I had a ZFS pool fail to mount, and with this error code (a datacentre went crazy with power cycles).
Error code 5 is EIO, an I/O error. It means the underlying hardware was unable to read data during the mount, and ZFS forwarded that error up. In the case of error 5, the usual options at the hardware level are: (a) try to debug the root cause and fix it; (b) retry the reads and hope they succeed next time (surprisingly common); or (c) ignore the problem, start writing to the disk again, and the unreadable data will be overwritten (and therefore become readable) sooner or later.

It seems you did some combination of (b) and (c). Hope it works in the long run.
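For reference, the number itself comes straight from the errno table, which is easy to confirm:

# error 5 is EIO in the system headers
grep -w EIO /usr/include/sys/errno.h
# -> #define EIO 5   /* Input/output error */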
 
And from another angle, losing data should never be a default.
The command that recovers loses that data regardless.

So it's a choice between losing the inconsistent data automatically and recovering (i.e. InnoDB behaviour),
or failing to boot, system downtime, booting into rescue media, and then losing the same data anyway when recovering manually.

What am I missing here? Do you have a magic wand that recovers inconsistent data without data loss?

How is it different from, say, the automated filesystem checks and repairs that legacy filesystems run during boot?
 
Error code 5 is EIO, an I/O error. It means the underlying hardware was unable to read data during the mount, and ZFS forwarded that error up. In the case of error 5, the usual options at the hardware level are: (a) try to debug the root cause and fix it; (b) retry the reads and hope they succeed next time (surprisingly common); or (c) ignore the problem, start writing to the disk again, and the unreadable data will be overwritten (and therefore become readable) sooner or later.

It seems you did some combination of (b) and (c). Hope it works in the long run.
It wasn't a case of retrying and just getting lucky; the behaviour was consistently reproducible.

Retrying a normal mount failed every time.
A forced mount worked every time, with exactly the same amount of rollback.
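That consistency fits damage confined to the last few transaction groups. For completeness, the rewind can be pushed further back if the default recovery is not enough (these are standard, if sparsely documented, zpool import options; the pool name is an example):

# default recovery: steps back a few txgs until the pool imports
zpool import -F tank

# extreme rewind: searches much further back for a valid txg (last resort)
zpool import -F -X tank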

If error code 5 is supposed to mean what you say, then we have a bigger problem in that it is being misreported. The issue happened because the datacentre was liberal with power-cycling the server. The root cause is already known: inconsistency in the data due to a sudden, abrupt loss of power.
 
What am I missing here? Do you have a magic wand that recovers inconsistent data without data loss?
What you are missing is that different people have different goals and priorities.
Not everyone would be happy with what you propose.
You can customize your own system for yourself, by all means.
 
The root cause is already known: inconsistency in the data due to a sudden, abrupt loss of power.
No, the root cause is not known.
ZFS is resilient to sudden power losses if it runs on proper and properly configured hardware.
Apparently at least one of those is not true for your server.
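"Properly configured" mostly means no volatile write cache between ZFS and the disks that lies about flushes. On an mfi(4) controller the current settings can at least be inspected (mfiutil is in the base system; the volume name is an example):

# list logical volumes and their cache policy
mfiutil show volumes

# inspect cache settings for one volume
mfiutil cache mfid0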
 
If error code 5 is supposed to mean what you say, then we have a bigger problem in that it is being misreported.
I don't think so. The likely explanation is that somewhere below the filesystem there was an I/O error. That could, for example, be the disk interface (SATA or SAS) working only partially and sometimes hitting communication errors, or it could be one of the disks having an unreadable block that happens to be one ZFS needs for startup (mounting). We can't know the details when all that comes out at the top is error 5.
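If you want to chase it, the usual places to look below ZFS are these (device names are examples; smartctl comes from the sysutils/smartmontools port, and the exact pass-through syntax depends on the controller):

# per-device read/write/checksum error counters as ZFS recorded them
zpool status -v

# kernel messages from the disk and controller layers
dmesg | grep -i mfi

# SMART data for a drive behind the RAID controller (pass-through syntax varies)
smartctl -a -d megaraid,0 /dev/mfid0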
 
What you are missing is that different people have different goals and priorities.
Not everyone would be happy with what you propose.
You can customize your own system for yourself, by all means.
I asked how to change the behaviour, and you just responded that I can. If you know how, please share that information.
 
No, the root cause is not known.
ZFS is resilient to sudden power losses if it runs on proper and properly configured hardware.
Apparently at least one of those is not true for your server.
That might be the case, but the failure coincided exactly with the power-loss event.
 