Help request: debugging a kernel crash (ZFS/ZIL related)

Hi,

Every few days my ZFS-based fileserver dies with a spontaneous reboot, with stack traces that point, I believe, to an issue with ZFS and the ZIL.

Kernel - '12.1-STABLE r363542 GENERIC amd64'

Crashes result in vm faults, such as:
Code:
panic: vm_fault: fault on nofault entry, addr: 0xffffffff80b90000
cpuid = 0
time = 1595900728
KDB: stack backtrace:
#0 0xffffffff80c00d95 at kdb_backtrace+0x65
#1 0xffffffff80bb535b at vpanic+0x17b
#2 0xffffffff80bb51d3 at panic+0x43
#3 0xffffffff80ee8ed2 at vm_fault+0x24d2
#4 0xffffffff80ee68e0 at vm_fault_trap+0x60
#5 0xffffffff81084a7c at trap_pfault+0x19c
#6 0xffffffff81083f76 at trap+0x286
#7 0xffffffff8105c918 at calltrap+0x8
#8 0xffffffff82576d64 at rangelock_enter+0x4f4
#9 0xffffffff8257b5b6 at zfs_get_data+0x156
#10 0xffffffff8254b9ee at zil_commit_impl+0xafe
#11 0xffffffff82581365 at zfs_freebsd_fsync+0xa5
#12 0xffffffff8120501b at VOP_FSYNC_APV+0x7b
#13 0xffffffff80c909b1 at kern_fsync+0x191
#14 0xffffffff81085487 at amd64_syscall+0x387
#15 0xffffffff8105d23e at fast_syscall_common+0xf8
It appears that we've got issues with ZFS itself. This box has been running FreeBSD for a long time; the root pool was created back on 2015-09-19 (from zpool history):
Code:
2015-09-19.14:31:21 zpool create -o altroot=/mnt -O canmount=off -m none zroot raidz /dev/gpt/disk0 /dev/gpt/disk1 /dev/gpt/disk2 /dev/gpt/disk3

It's been through multiple OS upgrades since then, and the zpool has been kept at the latest version via 'zpool upgrade' as and when new features are added.
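For the record, that's easy to double-check (a sketch; running 'zpool upgrade' with no arguments reports pools that aren't using all supported features):
Code:
zpool upgrade                        # lists pools with features still to enable
zpool get all zroot | grep feature@  # per-feature state: disabled / enabled / active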

zpools are scrubbed every 30 days, with one due soon (a periodic.conf sketch for this cadence follows the status output):
Code:
zsh/2 1348 [2] % zpool status
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 12:11:24 with 0 errors on Sat Jul 25 19:53:11 2020
config:

        NAME           STATE     READ WRITE CKSUM
        zroot          ONLINE       0     0     0
          raidz1-0     ONLINE       0     0     0
            gpt/disk0  ONLINE       0     0     0
            gpt/disk1  ONLINE       0     0     0
            gpt/disk2  ONLINE       0     0     0
            gpt/disk3  ONLINE       0     0     0

errors: No known data errors
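For reference, a 30-day cadence like this can be expressed with the base system's periodic(8) scrub job; a sketch of /etc/periodic.conf (knob names from /etc/defaults/periodic.conf, the per-pool threshold line is assumed for illustration):
Code:
# /etc/periodic.conf
daily_scrub_zfs_enable="YES"          # check daily, scrub when the threshold lapses
daily_scrub_zfs_pools="zroot"         # empty would mean all imported pools
daily_scrub_zfs_zroot_threshold="30"  # days between scrubs for zroot (default is 35)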
I can't see any issues with the underlying disks themselves; they are subjected to SMART tests at regular intervals and are replaced when they start failing.
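The checks are along these lines (a sketch using smartmontools from ports; ada0 is an illustrative device name):
Code:
smartctl -t long /dev/ada0       # start a long offline self-test
smartctl -H /dev/ada0            # overall SMART health verdict
smartctl -l selftest /dev/ada0   # self-test log; look for read failures
smartctl -A /dev/ada0            # attributes: reallocated/pending sector counts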

I find it hard to believe there is a new issue with the kernel itself, so I suspect something else is at play: perhaps failing RAM? I'd be keen to diagnose this, though, before spending any cash and creating yet more electronic landfill.
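The obvious first checks I can think of (a sketch; the hw.mca sysctl assumes the kernel's machine-check support, and memtest86+ comes from ports):
Code:
grep -i -e 'machine check' -e 'MCA:' /var/log/messages  # any machine-check events logged?
sysctl hw.mca.count                                     # MCA records seen since boot
# plus an offline pass with memtest86+ booted from a USB stick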

Can anyone recommend a direction to take my investigation?
 
Here's a better backtrace, via kgdb.
Code:
(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:371
#2  0xffffffff80bb4f75 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:451
#3  0xffffffff80bb53b3 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:880
#4  0xffffffff80bb51d3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:807
#5  0xffffffff80ee8ed2 in vm_fault (map=<optimized out>, vaddr=<unavailable>, fault_type=2 '\002', fault_flags=0, m_hold=0x0)
    at /usr/src/sys/vm/vm_fault.c:727
#6  0xffffffff80ee68e0 in vm_fault_trap (map=0xfffff80003001000, vaddr=<optimized out>, fault_type=<optimized out>, fault_flags=0, signo=0x0, 
    ucode=0x0) at /usr/src/sys/vm/vm_fault.c:574
#7  0xffffffff81084a7c in trap_pfault (frame=0xfffffe005df67840, usermode=false, signo=<unavailable>, ucode=<unavailable>)
    at /usr/src/sys/amd64/amd64/trap.c:828
#8  0xffffffff81083f76 in trap (frame=0xfffffe005df67840) at /usr/src/sys/amd64/amd64/trap.c:407
#9  <signal handler called>
#10 0xffffffff82470a32 in avl_insert (tree=0xfffff8012750a6d8, new_data=0xfffff80155f2cf00, where=<optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/common/avl/avl.c:511
#11 0xffffffff82576d64 in rangelock_enter (rl=<optimized out>, off=<optimized out>, len=<optimized out>, type=<optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_rlock.c:192
#12 0xffffffff8257b5b6 in zfs_get_data (arg=<optimized out>, lr=0xfffffe005cda0788, buf=0x0, lwb=0xfffff8000dd8cdc0, zio=0xfffff8002f46a850)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:1341
#13 0xffffffff8254b9ee in zil_lwb_commit (zilog=0xfffff8000de63000, itx=0xfffff800be157800, lwb=0xfffff8000dd8cdc0)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:1665
#14 zil_process_commit_list (zilog=0xfffff8000de63000) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:2243
#15 zil_commit_writer (zilog=0xfffff8000de63000, zcw=<optimized out>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:2376
#16 zil_commit_impl (zilog=0xfffff8000de63000, foid=<optimized out>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:2890
#17 0xffffffff82581365 in zfs_fsync (vp=<optimized out>, syncflag=<optimized out>, cr=<optimized out>, ct=<optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:2634
#18 zfs_freebsd_fsync (ap=<optimized out>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:5093
#19 0xffffffff8120501b in VOP_FSYNC_APV (vop=0xffffffff8262d180 <zfs_vnodeops>, a=0xfffffe005df67bb8) at vnode_if.c:1331
#20 0xffffffff80c909b1 in VOP_FSYNC (vp=0xfffff800667bc1e0, waitfor=<error reading variable: Cannot access memory at address 0x1>, 
    td=0xfffff8019cba7740) at ./vnode_if.h:549
#21 kern_fsync (td=0xfffff8019cba7740, fd=<optimized out>, fullsync=true) at /usr/src/sys/kern/vfs_syscalls.c:3404
#22 0xffffffff81085487 in syscallenter (td=0xfffff8019cba7740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:144
#23 amd64_syscall (td=0xfffff8019cba7740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1167
#24 <signal handler called>
#25 0x000000080a0432ea in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffdf3f6c98
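For completeness, the backtrace was produced by pointing kgdb at the saved dump (a sketch, assuming dumpdev="AUTO" in /etc/rc.conf so savecore(8) writes dumps into the default /var/crash, with kgdb installed from the devel/gdb port):
Code:
kgdb /boot/kernel/kernel /var/crash/vmcore.0
(kgdb) bt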
 