ZFS panic every 2-3 days

I've been using this ZFS pool since FreeBSD 7.4 (AFAIR) as my home storage, but a few months ago the host started panicking every 2-3 days. What's strange is that it can (and routinely does) complete a zfs scrub with zero problems.

Code:
panic: Solaris(panic): z: blkptr at 0xfffffe014c1602e0 DVA 0 has invalid VDEV 16775171
cpuid = 4
time = 1749674875
KDB: stack backtrace:
#0 0xffffffff80ba8f1d at kdb_backtrace+0x5d
#1 0xffffffff80b5aa11 at vpanic+0x161
#2 0xffffffff80b5a8a3 at panic+0x43
#3 0xffffffff8236e8cf at vcmn_err+0xdf
#4 0xffffffff82467f25 at zfs_panic_recover+0x55
#5 0xffffffff82533e60 at zfs_blkptr_verify_log+0x130
#6 0xffffffff82533b41 at zfs_blkptr_verify+0x251
#7 0xffffffff82534125 at zio_free+0x25
#8 0xffffffff823fb620 at dsl_dataset_block_kill+0x2a0
#9 0xffffffff823cec0f at dbuf_write_done+0x4f
#10 0xffffffff823b2aeb at arc_write_done+0x38b
#11 0xffffffff8253b5ce at zio_done+0xc7e
#12 0xffffffff82535258 at zio_execute+0x38
#13 0xffffffff80bbe4d2 at taskqueue_run_locked+0x182
#14 0xffffffff80bbf722 at taskqueue_thread_loop+0xc2
#15 0xffffffff80b13641 at fork_exit+0x81
#16 0xffffffff81024dee at fork_trampoline+0xe
Uptime: 1d21h50m13s
Dumping 5192 out of 16232 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

What could be causing it?

Is there a way to know which vdev has the given number? Maybe a device is suddenly disappearing from /dev? But that should just degrade the pool, not cause a panic.

BTW: I'm using a ZIL and an L2ARC on an SSD to give the system a bit of speed, but I already tried removing both, and the behavior was exactly the same: a panic every 2-3 days.
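For reference, removing them looked roughly like this (tank and the gpt labels stand in for my real pool and SSD partitions):

Code:
# zpool remove tank gpt/slog0     # drop the log (ZIL/SLOG) device
# zpool remove tank gpt/cache0    # drop the cache (L2ARC) device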
 
What is the status of the zpool?

# zpool status <Name of the Pool>

Look into the debug.log

# less /var/log/debug.log

Perhaps you can find some information there?
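If the pool has recorded any errors, the verbose form also lists the affected files (replace tank with your pool name):

Code:
# zpool status -v tank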
 
Status is ONLINE, no errors on any vdev (it is a RAID-Z1), and the scrub completes successfully with 0 errors.
debug.log has only a few unrelated usbhid-ups warnings from nut.
 
Unfortunately that error message appears twice in the code, one occurrence right after the other:

1) vdevid >= spa->spa_root_vdev->vdev_children

2) vd == NULL

So we don't know which one it is unless you hack up the code to make the error messages unique.
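If you have the system sources installed, a grep should turn up both occurrences (the path assumes a stock FreeBSD source tree; adjust if yours lives elsewhere):

Code:
# grep -n "has invalid VDEV" /usr/src/sys/contrib/openzfs/module/zfs/zio.c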

The only idea I can offer is to go through the disks one by one: drop a device, wipe it, and re-add it. Maybe that wipes out the bad block. But it is risky: if you hit another error while a disk is out of the pool, your array goes offline.
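Per disk, that would look roughly like this (tank and da1 are placeholders; let each resilver finish before touching the next disk, and you may need zpool labelclear or -f if old labels survive the wipe):

Code:
# zpool offline tank da1
# dd if=/dev/zero of=/dev/da1 bs=1m count=100    # clobber the start of the disk
# zpool replace tank da1
# zpool status tank                              # watch the resilver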
 
Is there a way to know which vdev has the given number?
Unless you have more than 16775171 vdevs in your pool, no vdev has that number.
"zfs_panic_recover" in the stack trace suggests that there is a knob (sysctl, tunable) that you can set to not panic when the problem is detected.
It seems from the stack trace that the problematic block was being overwritten anyways.
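If memory serves, on FreeBSD that knob is exposed as vfs.zfs.recover (please verify the exact name on your version):

Code:
# sysctl vfs.zfs.recover=1                       # log instead of panicking
# echo 'vfs.zfs.recover=1' >> /etc/sysctl.conf   # persist across reboots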
 
Unfortunately that error message appears twice in the code, one occurrence right after the other.
Good catch… I guess it can only be the first one, though: the vd in the second check comes from vdev_t *vd = spa->spa_root_vdev->vdev_child[vdevid]; and since vdev ids are sequential, as Andriy noted, an id that large would make that array access go out of bounds rather than yield a NULL.
 
Just got a new one… the vdev number is not changing, so I guess it's not a memory problem (not during reads at least; it might have been during the write of that block). I still wonder why that block never shows up in a full scrub, but always comes up after ~47h of "idle" usage of the NAS.
Code:
panic: Solaris(panic): z: blkptr at 0xfffffe014784bc60 DVA 0 has invalid VDEV 16775171
cpuid = 0
time = 1749833369
KDB: stack backtrace:
#0 0xffffffff80ba8f1d at kdb_backtrace+0x5d
#1 0xffffffff80b5aa11 at vpanic+0x161
#2 0xffffffff80b5a8a3 at panic+0x43
#3 0xffffffff8234b8cf at vcmn_err+0xdf
#4 0xffffffff82444f25 at zfs_panic_recover+0x55
#5 0xffffffff82510e60 at zfs_blkptr_verify_log+0x130
#6 0xffffffff82510b41 at zfs_blkptr_verify+0x251
#7 0xffffffff82511125 at zio_free+0x25
#8 0xffffffff823d8620 at dsl_dataset_block_kill+0x2a0
#9 0xffffffff823abc0f at dbuf_write_done+0x4f
#10 0xffffffff8238faeb at arc_write_done+0x38b
#11 0xffffffff825185ce at zio_done+0xc7e
#12 0xffffffff82512258 at zio_execute+0x38
#13 0xffffffff80bbe4d2 at taskqueue_run_locked+0x182
#14 0xffffffff80bbf722 at taskqueue_thread_loop+0xc2
#15 0xffffffff80b13641 at fork_exit+0x81
#16 0xffffffff81024dee at fork_trampoline+0xe
Uptime: 1d19h53m13s
 
Seems like you have a vmcore; you can use kgdb to learn more about the block pointer.
Like its other DVAs, its properties, where it resides, etc.
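Something along these lines (this assumes kgdb from the gdb package, a kernel with debug symbols so the blkptr_t type is known, and the blkptr address taken from the panic message; the vmcore number will vary):

Code:
# kgdb /boot/kernel/kernel /var/crash/vmcore.0
(kgdb) bt
(kgdb) print/x *(blkptr_t *)0xfffffe014784bc60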
 
Actually… I got no core under /var/crash; I never noticed.
I use encrypted swap, but that's been supported for a while, and dumpon -l shows the correct device (stripped of the .eli suffix), so I'll have to debug that first.
Oh, I see: it's not well documented, but GELI swap needs to be mounted with sw,late for savecore to run.
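For reference, the fstab line now looks something like this (the device name is just an example; the important bit is the late option, which lets savecore read the dump off the raw device before it is reused as encrypted swap):

Code:
# Device          Mountpoint  FStype  Options  Dump  Pass
/dev/ada0p2.eli   none        swap    sw,late  0     0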
Waiting for next panic.
 