ZFS panic every 2-3 days

I've been using this ZFS pool since FreeBSD 7.4 (AFAIR) as my home storage, but a few months ago the host started panicking every 2-3 days. What's strange is that it can (and routinely does) complete a zfs scrub with zero problems.

Code:
panic: Solaris(panic): z: blkptr at 0xfffffe014c1602e0 DVA 0 has invalid VDEV 16775171
cpuid = 4
time = 1749674875
KDB: stack backtrace:
#0 0xffffffff80ba8f1d at kdb_backtrace+0x5d
#1 0xffffffff80b5aa11 at vpanic+0x161
#2 0xffffffff80b5a8a3 at panic+0x43
#3 0xffffffff8236e8cf at vcmn_err+0xdf
#4 0xffffffff82467f25 at zfs_panic_recover+0x55
#5 0xffffffff82533e60 at zfs_blkptr_verify_log+0x130
#6 0xffffffff82533b41 at zfs_blkptr_verify+0x251
#7 0xffffffff82534125 at zio_free+0x25
#8 0xffffffff823fb620 at dsl_dataset_block_kill+0x2a0
#9 0xffffffff823cec0f at dbuf_write_done+0x4f
#10 0xffffffff823b2aeb at arc_write_done+0x38b
#11 0xffffffff8253b5ce at zio_done+0xc7e
#12 0xffffffff82535258 at zio_execute+0x38
#13 0xffffffff80bbe4d2 at taskqueue_run_locked+0x182
#14 0xffffffff80bbf722 at taskqueue_thread_loop+0xc2
#15 0xffffffff80b13641 at fork_exit+0x81
#16 0xffffffff81024dee at fork_trampoline+0xe
Uptime: 1d21h50m13s
Dumping 5192 out of 16232 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

What could be causing it?

Is there a way to know which vdev has the given number? Maybe a device is suddenly disappearing from /dev? But that should just degrade the pool, not cause a panic.

BTW: I'm using a ZIL and an L2ARC on an SSD to give the system a bit of speed, but I already tried removing both, and the behavior was exactly the same: a panic every 2-3 days.
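For reference, removing them looked roughly like this (tank and the gpt labels stand in for my real pool and SSD partitions):

Code:
# zpool remove tank gpt/slog0     # drop the log (ZIL/SLOG) device
# zpool remove tank gpt/cache0    # drop the cache (L2ARC) device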
 
What is the status of the zpool?

# zpool status <Name of the Pool>

Look into the debug.log

# less /var/log/debug.log

Perhaps you can find some information there?
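If the pool has recorded any errors, the verbose form also lists the affected files (replace tank with your pool name):

Code:
# zpool status -v tank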
 
Status is ONLINE, no errors on any vdev (it is a RAID-Z1), and the scrub completes successfully with 0 errors.
debug.log has only a few unrelated usbhid-ups warnings from nut.
 
Unfortunately that error message appears twice in the code, one occurrence right after the other:

1) vdevid >= spa->spa_root_vdev->vdev_children

2) vd == NULL

So we don't know which one it is unless you hack up the code to make the error messages unique.
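If you have the system sources installed, a grep should turn up both occurrences (the path assumes a stock FreeBSD source tree; adjust if yours lives elsewhere):

Code:
# grep -n "has invalid VDEV" /usr/src/sys/contrib/openzfs/module/zfs/zio.c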

The only idea I can offer is to go through the disks one by one: drop a device, wipe it, and re-add it. Maybe that wipes out the bad block. But it is risky: if you hit another error while a disk is out of the pool, your array goes offline.
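Per disk, that would look roughly like this (tank and da1 are placeholders; let each resilver finish before touching the next disk, and you may need zpool labelclear or -f if old labels survive the wipe):

Code:
# zpool offline tank da1
# dd if=/dev/zero of=/dev/da1 bs=1m count=100    # clobber the start of the disk
# zpool replace tank da1
# zpool status tank                              # watch the resilver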
 
Is there a way to know which vdev has the given number?
Unless you have more than 16775171 vdevs in your pool, no vdev has that number.
"zfs_panic_recover" in the stack trace suggests that there is a knob (sysctl, tunable) that you can set to not panic when the problem is detected.
It seems from the stack trace that the problematic block was being overwritten anyways.
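If memory serves, on FreeBSD that knob is exposed as vfs.zfs.recover (please verify the exact name on your version):

Code:
# sysctl vfs.zfs.recover=1                       # log instead of panicking
# echo 'vfs.zfs.recover=1' >> /etc/sysctl.conf   # persist across reboots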
 
Unfortunately that error message appears twice in the code, one occurrence right after the other.
Good catch… I guess it can only be the first one, though: the vd in the second check comes from vdev_t *vd = spa->spa_root_vdev->vdev_child[vdevid]; and since vdev ids are sequential, as Andriy noted, an id that large would make that array access go out of bounds rather than yield a NULL.
 
Just got a new one… the vdev number is not changing, so I guess it's not a memory problem (not during reads at least; it might have been during the write of that block). I still wonder why that block never shows up in a full scrub, but always comes up after ~47h of "idle" usage of the NAS.
Code:
panic: Solaris(panic): z: blkptr at 0xfffffe014784bc60 DVA 0 has invalid VDEV 16775171
cpuid = 0
time = 1749833369
KDB: stack backtrace:
#0 0xffffffff80ba8f1d at kdb_backtrace+0x5d
#1 0xffffffff80b5aa11 at vpanic+0x161
#2 0xffffffff80b5a8a3 at panic+0x43
#3 0xffffffff8234b8cf at vcmn_err+0xdf
#4 0xffffffff82444f25 at zfs_panic_recover+0x55
#5 0xffffffff82510e60 at zfs_blkptr_verify_log+0x130
#6 0xffffffff82510b41 at zfs_blkptr_verify+0x251
#7 0xffffffff82511125 at zio_free+0x25
#8 0xffffffff823d8620 at dsl_dataset_block_kill+0x2a0
#9 0xffffffff823abc0f at dbuf_write_done+0x4f
#10 0xffffffff8238faeb at arc_write_done+0x38b
#11 0xffffffff825185ce at zio_done+0xc7e
#12 0xffffffff82512258 at zio_execute+0x38
#13 0xffffffff80bbe4d2 at taskqueue_run_locked+0x182
#14 0xffffffff80bbf722 at taskqueue_thread_loop+0xc2
#15 0xffffffff80b13641 at fork_exit+0x81
#16 0xffffffff81024dee at fork_trampoline+0xe
Uptime: 1d19h53m13s
 
Seems like you have a vmcore; you can use kgdb to learn more about the block pointer.
Like its other DVAs, its properties, where it resides, etc.
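Something along these lines (this assumes kgdb from the gdb package, a kernel with debug symbols so the blkptr_t type is known, and the blkptr address taken from the panic message; the vmcore number will vary):

Code:
# kgdb /boot/kernel/kernel /var/crash/vmcore.0
(kgdb) bt
(kgdb) print/x *(blkptr_t *)0xfffffe014784bc60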
 
Actually… I got no core under /var/crash; I never noticed.
I use encrypted swap, but that's been supported for a while, and dumpon -l shows the correct device (stripped of the .eli suffix), so I'll have to debug that first.
Oh, I see: it's not well documented, but GELI swap needs to be mounted with sw,late for savecore to run.
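For reference, the fstab line now looks something like this (the device name is just an example; the important bit is the late option, which lets savecore read the dump off the raw device before it is reused as encrypted swap):

Code:
# Device          Mountpoint  FStype  Options  Dump  Pass
/dev/ada0p2.eli   none        swap    sw,late  0     0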
Waiting for next panic.
 