ZFS cannot import '' I/O error

Hi

I'm hoping someone can assist me with this. I have a single backup device which is refusing to mount. There was no power failure per se, but a shutdown was executed with the force option. At the time it would have been doing reads, not writes; I was running a directory comparison before I resynced.

Upon reboot I constantly get this:
***
cannot import 'rbackup': I/O error
        Destroy and re-create the pool from
        a backup source.
***
A straight zpool import shows this:
***
   pool: rbackup
     id: 11441342042684107067
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        rbackup     ONLINE
          sde       ONLINE
        cache
          sdf3
***

I tried zpool import -nfFX -R /rbackup/ rbackup and it ran for a few minutes and returned with no error. Sadly this did not solve the issue.
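
(Worth noting: combined with -F, the -n flag makes zpool import a dry run, so the command above only reports whether a rewind could succeed and never actually modifies or imports the pool, which would explain why nothing changed. A sketch of the two variants:)
***
# -f forces the import, -F rewinds to an earlier txg if needed, -X widens
# that search ("extreme rewind"), and -n turns -F into a report-only dry run:
zpool import -nfFX -R /rbackup rbackup   # dry run: checks recoverability only
zpool import -fFX -R /rbackup rbackup    # real attempt: may discard recent txgs
***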

Keep in mind I was running on Linux/Manjaro with a 4.17 kernel at the time. I'm asking here because I booted my BSD box and it gets the same error. And I've used ZFS longer on BSD ;).
 
It's possible the pool is so corrupted that there's nothing else to do but destroy it and recreate it. ZFS is great and is capable of fixing a lot of errors on its own, but it's not foolproof, and certain errors just cannot be fixed. It's also very much possible the pool is corrupted because the disk itself is bad.
 
Thanks for the fast response.

I thought as much, but I did a dd clone to another 4 TB drive and got no errors. I then also tried mounting the "new" drive, with the same results. With respect to "ZFS is great": I'd agree if the filesystem was being written to, but in this case it was only doing reads, so I would expect it to be more resilient. As I said, this wasn't even a power failure; the last user "write" to the drive was about a month ago.
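
(For reference, a block-level clone along those lines might look like the following; this is a sketch using GNU dd on Linux, with /dev/sdX and /dev/sdY as placeholders for the source and spare drives:)
***
# Copy the whole disk; conv=noerror,sync continues past unreadable sectors
# and pads them with zeros so offsets on the copy stay aligned:
dd if=/dev/sdX of=/dev/sdY bs=1M conv=noerror,sync status=progress
***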
 
Trying to decide whether the root cause is a hardware problem with the disk, or not:
So ZFS says "I/O error".
But running dd works perfectly, without any I/O errors.
Those two statements seem to contradict each other.

I think that when any disk I/O errors happen, they are logged in /var/log/messages. Can you look there for any recent error messages?
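
For instance, something like this (the grep pattern is only a rough filter):
***
grep -iE 'error|timeout|retry' /var/log/messages | tail -n 50
***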

And can you please run smartctl on the disk drive and look at the output to see whether the disk is reporting any I/O errors, or worse?
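
For example (smartmontools; substitute your actual device node for /dev/ada0):
***
smartctl -a /dev/ada0        # full SMART report, including the drive's error log
smartctl -t short /dev/ada0  # run a short self-test, then re-check with -a
***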

Why am I asking all this? Because if it is a hardware problem, you MIGHT be able to fix it. For example, it could be a bad SATA cable, or a loose power connection on the disk drive. That's repairable, and then you get your data back. Or you MIGHT be able to understand the root cause, and know that it will never get fixed, for example if the disk drive itself has announced pending failure and has lots of sector errors. On the other hand, if it is corruption of on-disk structures by ZFS, then it's more scary.
 
SMART reports no errors and the disk is OK. As I mentioned, I took the drive to another PC and dd'd it. Both the new and the old drive get the same error. Is there a way I can roll back to a specific "transaction"?
 
zpool import -T txg
I'd suggest -F, then -FX, then -T, in that order, having first taken a backup of the original media.
Going back to the "I/O error": does the device have a disk label/partition table, and if so, is it valid? And as ralphbsz suggests, check for more detail in /var/log/messages.
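
(Concretely, and strictly as a sketch: -T is a hidden rewind option in some OpenZFS builds, and the device paths below are placeholders. One way to find candidate txgs is zdb's label/uberblock dump:)
***
gpart show                      # FreeBSD: sanity-check the partition tables
zdb -l /dev/da0                 # dump the ZFS labels (guid, state, txg)
zdb -ul /dev/da0 | grep txg     # list uberblocks with their transaction groups
zpool import -f -o readonly=on -T <N> rbackup   # try the pool at txg N
***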
 
So it turns out a faulty SATA cable on the host appears to have caused this. Needless to say it was only reading... as no "new" information had been written for around a month, I'm still hoping to recover the data. The dd was done on another PC.

Running -FX on both the old original drive and the new dd'd drive, I get the following after around 10 minutes:
***
cannot import 'rbackup': one or more devices is currently unavailable
***
I did notice some errors in the log about task "l2arc_feed:682 blocked for more than 120 seconds".

I thought maybe it was complaining about the cache device, but that is present; I tried with -FXm and got the same result.
I'm not familiar with -T txg?
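
(A side note on -m: it tolerates a missing log (ZIL) vdev, not a missing cache vdev; cache/L2ARC devices are auxiliary and are never required for import, so it's expected that -m makes no difference here:)
***
zpool import -fFXm rbackup   # -m: proceed even if a log vdev is absent
***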
 
Well, according to zpool(8) (man zpool) there is no such thing as a -T option for import, only for iostat. But you said you were doing this on Linux? Maybe that's what bds is referring to, I don't know. Having said that, -X is also unknown in that respect. Why quote Linux commands here anyway? That serves no purpose and will only confuse others; let's stick to FreeBSD here.

Anyway, the issue at hand can be somewhat explained. Even though dd might be perfectly capable of reading all the sectors of a disk, this doesn't automatically imply that the data it reads is fully consistent with whatever the filesystem expects of it. So although dd does exclude a few causes, it isn't foolproof. It is most certainly no proof at all that the filesystem is consistent.

I also think you're performing some very risky operations on that pool. If a pool is at fault and you value your data, then definitely don't start mounting things read/write immediately or try to force recovery mode. That's taking unnecessary risks.

First: # zpool import -NfR /mnt -o readonly=yes zbackup, for example. This will try to import the pool, but read-only and without automatically mounting any filesystems. That way you can run zpool status -v to check on the pool (you might want to share its output) and then zfs list to check whether it actually picked up any filesystem(s). If so, you can use zfs mount to try to mount those; see zfs(8) for details.

That way you might be able to set up a situation where you can actually (safely) access your data.
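
Spelled out as a sequence (the pool name follows this thread; the dataset name in the last line is purely illustrative):
***
zpool import -N -f -R /mnt -o readonly=on rbackup   # import, mount nothing
zpool status -v rbackup                             # inspect the pool's health
zfs list -r rbackup                                 # which datasets came back?
zfs mount rbackup/data                              # then mount one at a time
***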
 
Thanks ShelLuser. With respect to the dd: it's done first, so that any attempts at mounting etc. mean I have a full image which I can muck with to my heart's content ;).

Running zpool import -NfR /mnt -o readonly=yes rbackup gives:
***
cannot import 'rbackup': I/O error
        Destroy and re-create the pool from
        a backup source.
***
So sadly zpool status -v only shows my other pools.
 
In terms of /var/log/messages:

There are quite a few messages like this when running -FX:

***
Aug 22 14:38:44 ZFS[82511]: vdev state changed, pool_guid=11441342042684107067 vdev_guid=16960454535649514919
Aug 22 14:38:44 ZFS[82550]: vdev state changed, pool_guid=11441342042684107067 vdev_guid=16960454535649514919
Aug 22 14:38:44 ZFS[82597]: vdev state changed, pool_guid=11441342042684107067 vdev_guid=16960454535649514919
Aug 22 14:38:44 ZFS[82710]: vdev state changed, pool_guid=11441342042684107067 vdev_guid=16960454535649514919
Aug 22 14:38:45 ZFS[82762]: vdev state changed, pool_guid=11441342042684107067 vdev_guid=16960454535649514919
***

And once again followed by "one or more devices is currently unavailable".
Keep in mind this is a single drive which was using one of the partitions on an SSD as a cache (L2ARC) device. That SSD is present.
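
(One way to dig further into "one or more devices is currently unavailable" is to compare the vdev_guid from those log lines against the on-disk ZFS labels; the device nodes below follow the thread's Linux names and are placeholders:)
***
zdb -l /dev/sde  | grep -E 'guid|state|txg'   # labels on the data disk
zdb -l /dev/sdf3 | grep -E 'guid|state|txg'   # labels on the SSD cache partition
***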
 