Hi,
Every few days my ZFS-based fileserver dies with a spontaneous reboot, with stack traces pointing, I believe, to an issue with ZFS and the ZIL.
Kernel - '12.1-STABLE r363542 GENERIC amd64'
Crashes result in VM faults; the panic message and backtrace from one of them are included further down.
It appears that we've got issues with ZFS itself. This box has been running FreeBSD for a long time; the root pool was created back on 2015-09-19.14:31:21.
It has been through multiple upgrades since then: both the OS and the pool are kept at the latest version, with 'zpool upgrade' run as and when new features are added.
The pools are scrubbed every 30 days, with one due soon; the most recent scrub finished with 0 errors (zpool status output is further down).
I can't see any issues with the underlying disks themselves: they are subjected to SMART tests at regular intervals and are replaced when failing.
I find it hard to believe there is a new issue with the kernel, so I suspect something else is at play. Perhaps I have failing RAM? I'd be keen to diagnose this, though, before spending any cash and creating yet more electronic landfill.
Can anyone recommend any direction I can take my investigations?
Panic message and backtrace from one of the crashes:
Code:
panic: vm_fault: fault on nofault entry, addr: 0xffffffff80b90000
cpuid = 0
time = 1595900728
KDB: stack backtrace:
#0 0xffffffff80c00d95 at kdb_backtrace+0x65
#1 0xffffffff80bb535b at vpanic+0x17b
#2 0xffffffff80bb51d3 at panic+0x43
#3 0xffffffff80ee8ed2 at vm_fault+0x24d2
#4 0xffffffff80ee68e0 at vm_fault_trap+0x60
#5 0xffffffff81084a7c at trap_pfault+0x19c
#6 0xffffffff81083f76 at trap+0x286
#7 0xffffffff8105c918 at calltrap+0x8
#8 0xffffffff82576d64 at rangelock_enter+0x4f4
#9 0xffffffff8257b5b6 at zfs_get_data+0x156
#10 0xffffffff8254b9ee at zil_commit_impl+0xafe
#11 0xffffffff82581365 at zfs_freebsd_fsync+0xa5
#12 0xffffffff8120501b at VOP_FSYNC_APV+0x7b
#13 0xffffffff80c909b1 at kern_fsync+0x191
#14 0xffffffff81085487 at amd64_syscall+0x387
#15 0xffffffff8105d23e at fast_syscall_common+0xf8
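To line the crash up against other events on the box (scrubs, cron jobs, backups), the "time" field in the panic can be decoded. This is only a small sketch, and it assumes that the value is seconds since the Unix epoch in UTC; the function name is purely illustrative.

```python
from datetime import datetime, timezone

# Value copied from the "time = ..." line of the panic message.
PANIC_TIME = 1595900728


def panic_time_utc(epoch_seconds: int) -> datetime:
    """Convert an epoch-seconds value to a timezone-aware UTC datetime.

    Assumption: the panic's "time" field is seconds since the Unix epoch.
    """
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    print(panic_time_utc(PANIC_TIME).isoformat())
```

If that assumption holds, this prints 2020-07-28T01:45:28+00:00, which would put this crash roughly two days after the last completed scrub shown in the zpool status output.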
The root pool was created with:
2015-09-19.14:31:21 zpool create -o altroot=/mnt -O canmount=off -m none zroot raidz /dev/gpt/disk0 /dev/gpt/disk1 /dev/gpt/disk2 /dev/gpt/disk3
Current pool status:
Code:
zsh/2 1348 [2] % zpool status
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 12:11:24 with 0 errors on Sat Jul 25 19:53:11 2020
config:

        NAME           STATE     READ WRITE CKSUM
        zroot          ONLINE       0     0     0
          raidz1-0     ONLINE       0     0     0
            gpt/disk0  ONLINE       0     0     0
            gpt/disk1  ONLINE       0     0     0
            gpt/disk2  ONLINE       0     0     0
            gpt/disk3  ONLINE       0     0     0

errors: No known data errors