Kernel panic when deleting ZFS snapshot

mheppner


Recently, I've been getting a kernel panic every morning, and I realized it happens at the same time cron runs sysutils/zfsnap to delete old snapshots. I narrowed it down to one snapshot that was causing the issue, zroot@2017-05-21_01.00.00--2w. Since that snapshot was taken recursively from the parent dataset, I have no idea which child dataset is the actual culprit. I can reproduce the panic by trying to delete that specific snapshot. Other snapshots can be deleted. The pool is scrubbed weekly and hasn't had any errors.

I'm not very experienced with kernel panics, so how can I figure out what's going on? This just seems like a generic paging error. Is this a hardware issue, maybe a bad memory stick? Hard drive failure? SMART tests look fine and I just replaced the memory not too long ago. I updated the OS a few weeks ago, but the issue is still happening.


Code:
FreeBSD freebsd 11.0-RELEASE-p10 FreeBSD 11.0-RELEASE-p10 #0 r318606: Mon May 22 00:36:40 EDT 2017     root@freebsd:/usr/obj/usr/src/sys/GENERIC  amd64

panic: page fault
Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address   = 0x48
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff822048e3
stack pointer           = 0x28:0xfffffe045991b640
frame pointer           = 0x28:0xfffffe045991b6f0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 5 (txg_thread_enter)
trap number             = 12
panic: page fault
cpuid = 2
KDB: stack backtrace:
#0 0xffffffff80b24477 at kdb_backtrace+0x67
#1 0xffffffff80ad97e2 at vpanic+0x182
#2 0xffffffff80ad9653 at panic+0x43
#3 0xffffffff80fa1d51 at trap_fatal+0x351
#4 0xffffffff80fa1f43 at trap_pfault+0x1e3
#5 0xffffffff80fa14ec at trap+0x26c
#6 0xffffffff80f845a1 at calltrap+0x8
#7 0xffffffff82205e84 at dsl_destroy_snapshot_sync_impl+0x894
#8 0xffffffff82206577 at dsl_destroy_snapshot_sync+0x97
#9 0xffffffff822097e4 at dsl_sync_task_sync+0xc4
#10 0xffffffff8220851b at dsl_pool_sync+0x3cb
#11 0xffffffff82227fae at spa_sync+0x7ce
#12 0xffffffff82231549 at txg_sync_thread+0x389
#13 0xffffffff80a90455 at fork_exit+0x85
#14 0xffffffff80f84ade at fork_trampoline+0xe
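One way to get past "generic paging error": FreeBSD can save a crash dump on panic, and kgdb can then resolve that instruction pointer to a source line. A minimal sketch, assuming the default dump paths and a swap-backed dump device (none of this is from the thread itself):

```shell
# /etc/rc.conf: let the kernel pick a swap-backed dump device
dumpdev="AUTO"

# After the next panic and reboot, savecore(8) places the dump in /var/crash.
# With kgdb (from devel/gdb) the trap address can then be symbolized:
#   kgdb /boot/kernel/kernel /var/crash/vmcore.0
#   (kgdb) list *0xffffffff822048e3
```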
 

ShelLuser


I narrowed it down to one snapshot that was causing the issue, zroot@2017-05-21_01.00.00--2w. Since that snapshot was taken recursively from the parent dataset, I have no idea which child dataset is the actual culprit.
But you can look that up: zfs list -rt snapshot zroot | less.
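Building on that: a recursive snapshot has the same name on every child dataset, so you can enumerate the per-dataset snapshots and try destroying them one at a time; whichever zfs destroy panics names the bad dataset. A sketch of that bisection (the child dataset names below are simulated stand-ins; on the real system you would pipe the actual zfs list output instead of printf):

```shell
SNAP='2017-05-21_01.00.00--2w'
# On the real system the list would come from:
#   zfs list -H -o name -t snapshot -r zroot | grep "@${SNAP}\$"
# Simulated output of that command, turned into one destroy per dataset:
printf '%s\n' \
  "zroot@${SNAP}" \
  "zroot/usr@${SNAP}" \
  "zroot/var@${SNAP}" |
  sed "s/^/zfs destroy /"
```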

Note that I'm not familiar with sysutils/zfsnap; I never bothered with external tools because making and removing snapshots is so trivial that it's easy to script yourself.

I can reproduce the panic by trying to delete that specific snapshot. Other snapshots can be deleted. The pool is scrubbed weekly and hasn't had any errors.
Is the scrubbing finished when these operations are performed?

Still, generally speaking, errors like these are caused by hardware issues. What is the actual status of the pools? Is everything still healthy?
 

mheppner


But you can look that up: zfs list -rt snapshot zroot | less.
I should have been clearer. I can certainly view all of the snapshots, and from the 10 or so that I tried, I could delete them without any issues. I found the parent one listed above that caused the kernel panic, but I don't want to try all of the child ones to figure out which is the culprit.

Note that I'm not familiar with sysutils/zfsnap; I never bothered with external tools because making and removing snapshots is so trivial that it's easy to script yourself.
That's pretty much all zfsnap is: a handy shell script that automates creating and deleting snapshots using the normal zfs commands.
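For reference, the naming scheme involved: zfsnap embeds the creation time and a TTL directly in the snapshot name, so the deletion pass only has to parse names rather than keep state. A tiny sketch of that scheme (format inferred from the snapshot name in this thread):

```shell
created=$(date +%Y-%m-%d_%H.%M.%S)   # e.g. 2017-05-21_01.00.00
ttl='2w'                             # keep for two weeks
name="zroot@${created}--${ttl}"
echo "$name"
# 'zfs snapshot -r "$name"' would create it on every child dataset,
# which is why one recursive name can hide many per-dataset snapshots.
```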

Is the scrubbing finished when these operations are performed?

Still, generally speaking, errors like these are caused by hardware issues. What is the actual status of the pools? Is everything still healthy?
Yes, I've let the pool scrub itself twice, and it's still having the issue. The pool has always been in HEALTHY state. The SMART tests on all the drives look fine to me and smartmontools hasn't flagged anything yet.
 