Solved ZFS kernel panic: Solaris(panic): blkptr at 0xfffffe006bf0d4a8 DVA 0 has invalid OFFSET 70368744177664

Hi,

I have a problem with my ZFS pool(s). My FreeBSD machine, running 12.1-RELEASE-p7, reboots every day with a kernel panic; the panic string is: Solaris(panic): blkptr at 0xfffffe006bf0d4a8 DVA 0 has invalid OFFSET 70368744177664.

I researched a bit on the web and found this might be related to ZFS checksum inconsistencies.

I ran zdb -bcsvL <zfs_pool>, and it reports the following error:

Code:
Traversing all blocks to verify checksums ...

20.3G completed (  80MB/s) estimated time remaining: 0hr 04min 09sec        zdb_blkptr_cb: Got error 122 reading <89, 1291486, 0, 0> DVA[0]=<0:1234dfa000:1000> [L0 ZFS plain file] fletcher4 lz4 LE contiguous unique single size=c800L/1000P birth=9193909L/9193909P fill=1 cksum=5d22642c31:f3e9d9c7a26a:15e92acac404c64:6ae12a949dd01798 -- skipping
40.0G completed (  78MB/s) estimated time remaining: 0hr 00min 00sec
Error counts:

        errno  count
          122  1

The pool was created on Jan 19, 2019, according to
Code:
zfs get creation zroot
NAME   PROPERTY  VALUE                   SOURCE
zroot  creation  Sa. Jan. 19 17:02 2019  -

I have tried setting the tunable vfs.zfs.recover=1, but it doesn't mitigate the error when running zdb.
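
For reference, I set it like this (a sketch; it is a boot-time tunable, and on some releases the sysctl is also writable at runtime):

Code:
# In /boot/loader.conf, takes effect after a reboot:
vfs.zfs.recover=1

# Or at runtime, if the sysctl is writable on your release:
sysctl vfs.zfs.recover=1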

Any ideas on how I could solve this? Recreating the pool is not really an option, since this is my FreeBSD root file system.

Best regards,
rforberger
 
I have now found that the affected machine always reboots with the above kernel panic whenever a Bacula backup is launched on one particular ZFS filesystem.

I guess the ZFS filesystem is broken, as zpool status reports:

Code:
 # zpool status -v home_rz3
  pool: home_rz3
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub canceled on Sun Jul 12 19:53:50 2020
config:

        NAME        STATE     READ WRITE CKSUM
        home_rz3    ONLINE       0     0     0
          raidz3-0  ONLINE       0     0     0
            da8     ONLINE       0     0     0
            da9     ONLINE       0     0     0
            da10    ONLINE       0     0     0
            da11    ONLINE       0     0     0
            da12    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        home_rz3:<0x0>

Any ideas on how to fix these "Permanent errors"? Recreating the pool is not really an option, since I don't have enough free disks available.
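
For reference, the standard first steps for this state look roughly like the following; note that a scrub can only repair blocks from the pool's redundancy, and errors still listed afterwards could not be fixed that way:

Code:
zpool scrub home_rz3         # re-read and verify every block in the pool
zpool status -v home_rz3     # re-check the error list once the scrub finishes
zpool clear home_rz3         # reset the error counters afterwards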
 
Hi Jose,
I don't know if these are hardware errors. I don't see any disk errors in dmesg, and the disks are not that old either, about a year.

How else can I check whether the pool's devices have hardware errors?
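
In case it helps, the usual way to check from the guest would be sysutils/smartmontools, assuming the disks actually expose SMART to the VM (under a hypervisor they often do not, so checking on the host may be necessary):

Code:
pkg install smartmontools
smartctl -H /dev/da8    # overall health self-assessment
smartctl -a /dev/da8    # full attributes, self-test and error logs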
 
Jose
I have now run the SMART tests. The FreeBSD system is virtualized on VMware ESXi, but I don't see any SMART errors for the affected physical disks on the ESXi hosts.
 
Sorry, I don't know much about ESXi, and I don't plan to learn. I got over proprietary Unices in the late '90s.
 
I cannot recreate the zpool on different disks using zfs send ... | zfs receive ..., as the LiveCD also crashes with a kernel panic with the same panic string:

Solaris(panic): blkptr at 0xfffffe006bf0d4a8 DVA 0 has invalid OFFSET 70368744177664.
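
For context, the transfer attempt looked roughly like this, assuming the affected pool is home_rz3 from above (the snapshot name and target pool are made up for illustration); presumably the send panics while walking the damaged block pointer:

Code:
zfs snapshot -r home_rz3@rescue
zfs send -R home_rz3@rescue | zfs receive -F newpool/home_rz3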

Even accessing the files on the zpool directly causes this kernel panic.

I have tried scrubbing the pool, but without any positive result.

Any ideas on how I can repair the ZFS pool? I thought ZFS was supposed to be very robust.

Best regards,
rforberger
 
JohnnySorocil
OK, thanks, that might be an idea. But I don't want to risk breaking the zpool with -STABLE or -CURRENT, since at least -CURRENT might not be stable enough. I have important data on this zpool.

Any other ideas?
 
I have now found that when I set vfs.zfs.recover=1, the kernel panic turns into a WARNING and the system no longer reboots.

But now I get the panic message as a warning plus an input/output error when trying to read the affected zpool. I have tried zfs send ... | zfs receive ... as well as transferring the data directly with bsdtar; both result in an input/output error.
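
The bsdtar attempt was essentially a pipe copy, something like this (source and destination paths are illustrative):

Code:
# Archive from the broken pool to stdout, extract on the target:
bsdtar -cf - -C /home_rz3 . | bsdtar -xpf - -C /mnt/rescue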

Any chance of reading the data completely anyway?

Maybe with a Linux LiveCD and ZFS support?

I don't think it's a hardware error though, since my disks are reporting as healthy.

I need to access the data from the broken zpool.
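
The idea with the Linux LiveCD would be a forced, read-only import, roughly like this (ZFS on Linux 0.8.x syntax; the alternate root path is illustrative):

Code:
# Import read-only under an alternate root, so nothing is written to the pool:
zpool import -f -o readonly=on -R /mnt/rescue home_rz3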
 
OK, I have now pulled the files and directories off the broken pool via rsync. Some files had input/output errors, so I might be missing a few.
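
The copy was essentially this (the destination path is made up); rsync reports each file that fails with an input/output error and carries on with the rest:

Code:
rsync -avH /home_rz3/ /mnt/rescue/home_rz3/
# rsync exits non-zero on a partial transfer, so check its
# output to see which files were skipped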

I re-created the pool on the same devices and it's running fine.

I could not read the broken zpool with zfs send ... from a Linux LiveCD with ZFS on Linux support (version 0.8.4), nor from FreeBSD 13-CURRENT.
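
For completeness, the rebuild was essentially this, using the same devices shown in the zpool status output above (the rescue path is illustrative):

Code:
zpool destroy home_rz3
zpool create home_rz3 raidz3 da8 da9 da10 da11 da12
rsync -avH /mnt/rescue/home_rz3/ /home_rz3/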
 