Fatal trap 12 crash on R14.2-P1

Howdy,

I woke up yesterday to my internet connection being down and noticed that my firewall (FreeBSD with pf) was not responding. Since I had several calls that morning, I did not get much chance to troubleshoot, so I put my backup FreeBSD 14.1 box online to get working again. I hooked up the failed box to a monitor this evening and started looking for clues, and this is the only thing I was able to find:

Code:
Feb 11 01:15:48 system kernel:
Feb 11 01:15:48 system syslogd: last message repeated 1 times
Feb 11 01:15:48 system kernel: Fatal trap 12: page fault while in kernel mode
Feb 11 01:15:48 system kernel: cpuid = 3; apic id = 06
Feb 11 01:15:48 system kernel: fault virtual address    = 0x40000000000
Feb 11 01:15:48 system kernel: fault code        = supervisor read data, page not present
Feb 11 01:15:48 system kernel: instruction pointer    = 0x20:0xffffffff82839df8
Feb 11 01:15:48 system kernel: stack pointer            = 0x28:0xfffffe00d95d0e90
Feb 11 01:15:48 system kernel: frame pointer            = 0x28:0xfffffe00d95d0ec0
Feb 11 01:15:48 system kernel: code segment        = base 0x0, limit 0xfffff, type 0x1b
Feb 11 01:15:48 system kernel:             = DPL 0, pres 1, long 1, def32 0, gran 1
Feb 11 01:15:48 system kernel: processor eflags    =
Feb 11 01:16:35 system syslogd: kernel boot file is /boot/kernel/kernel
Feb 11 01:16:35 system kernel: ---<<BOOT>>---

I am running:
Code:
uname -a
FreeBSD system 14.2-RELEASE-p1 FreeBSD 14.2-RELEASE-p1 GENERIC amd64
CPU: Intel(R) N100 (806.40-MHz K8-class CPU) (Alder Lake)

This is the second problem I have had with this machine, a Beelink Mini PC EQ Series. The first issue was a kernel crash, but since I upgraded to 14.2 I have not had that issue again.

Not sure if anyone can get a sense of what could have happened from the messages above, but I want to enable crash dumps in case this happens again and turns out to be a bug. Is this all I need to add to my rc.conf?

Code:
dumpdev="AUTO"
dumpdir="/var/crash"
savecore_enable="YES"
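
A quick way to double-check that these three knobs actually ended up in the file is sketched below. It is only a sketch under stated assumptions: the helper names `parse_rc_conf` and `missing_knobs` are made up for illustration, the parser handles only simple `name="value"` lines, and the wanted values are just the ones from this post, not a confirmed complete list.

```python
import re

# Knobs and values copied from the post; whether they are sufficient
# is an open question here, not something this script can confirm.
WANTED = {"dumpdev": "AUTO", "dumpdir": "/var/crash", "savecore_enable": "YES"}

# Matches name=value, name="value" or name='value', with an optional trailing comment.
LINE_RE = re.compile(r'^([A-Za-z_][A-Za-z0-9_]*)=(["\']?)(.*?)\2\s*(?:#.*)?$')


def parse_rc_conf(text):
    """Return {name: value} for simple assignment lines, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match:
            settings[match.group(1)] = match.group(3)
    return settings


def missing_knobs(text, wanted=WANTED):
    """Return the wanted knobs that are absent or set to a different value.

    The comparison is exact (case-sensitive), which is stricter than rc.conf itself.
    """
    found = parse_rc_conf(text)
    return {name: value for name, value in wanted.items() if found.get(name) != value}


sample = 'dumpdev="AUTO"\ndumpdir="/var/crash"\n'
print(missing_knobs(sample))
```

On the real box the input would be the contents of /etc/rc.conf instead of the `sample` string.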

Thank you for your help!
 
It happened again but I have more information now:

Code:
Feb 20 02:54:24 system dhclient[23683]: New Broadcast Address (igc0): 136.50.207.255
Feb 20 02:54:24 system dhclient[23687]: New Routers (igc0): 136.50.192.1
Feb 20 07:04:59 system syslogd: kernel boot file is /boot/kernel/kernel
Feb 20 07:04:59 system kernel:
Feb 20 07:04:59 system syslogd: last message repeated 1 times
Feb 20 07:04:59 system kernel: Fatal trap 12: page fault while in kernel mode
Feb 20 07:04:59 system kernel: cpuid = 1; apic id = 02
Feb 20 07:04:59 system kernel: fault virtual address    = 0xc8
Feb 20 07:04:59 system kernel: fault code               = supervisor read data, page not present
Feb 20 07:04:59 system kernel: instruction pointer      = 0x20:0xffffffff80c247c3
Feb 20 07:04:59 system kernel: stack pointer            = 0x28:0xfffffe00d948fbe0
Feb 20 07:04:59 system kernel: frame pointer            = 0x28:0xfffffe00d948fc10
Feb 20 07:04:59 system kernel: code segment             = base 0x0, limit 0xfffff, type 0x1b
Feb 20 07:04:59 system kernel:                  = DPL 0, pres 1, long 1, def32 0, gran 1
Feb 20 07:04:59 system kernel: processor eflags = interrupt enabled, resume, IOPL = 0
Feb 20 07:04:59 system kernel: current process          = 18 (syncer)
Feb 20 07:04:59 system kernel: rdi: fffffe001d84b840 rsi: fffff80003f3a0e0 rdx: 0000000000000001
Feb 20 07:04:59 system kernel: rcx: 0000000000000000  r8: fffff80003dceea0  r9: fffff80003dd8ea0
Feb 20 07:04:59 system kernel: rax: 0000000032e7e900 rbx: fffffe001d84b840 rbp: fffffe00d948fc10
Feb 20 07:04:59 system kernel: r10: fffff80003dd2b40 r11: 0000000000000008 r12: fffff80003f3a0e0
Feb 20 07:04:59 system kernel: r13: fffff80003f3a128 r14: 0000000000000001 r15: fffff80003f3a148
Feb 20 07:04:59 system kernel: trap number              = 12
Feb 20 07:04:59 system kernel: panic: page fault
Feb 20 07:04:59 system kernel: cpuid = 1
Feb 20 07:04:59 system kernel: time = 1740056642
Feb 20 07:04:59 system kernel: KDB: stack backtrace:
Feb 20 07:04:59 system kernel: #0 0xffffffff80b8b88d at kdb_backtrace+0x5d
Feb 20 07:04:59 system kernel: #1 0xffffffff80b3dc11 at vpanic+0x131
Feb 20 07:04:59 system kernel: #2 0xffffffff80b3dad3 at panic+0x43
Feb 20 07:04:59 system kernel: #3 0xffffffff81025a0b at trap_fatal+0x40b
Feb 20 07:04:59 system kernel: #4 0xffffffff81025a56 at trap_pfault+0x46
Feb 20 07:04:59 system kernel: #5 0xffffffff80ffc388 at calltrap+0x8
Feb 20 07:04:59 system kernel: #6 0xffffffff80c24efd at reassignbuf+0x16d
Feb 20 07:04:59 system kernel: #7 0xffffffff80c00419 at bdirty+0x39
Feb 20 07:04:59 system kernel: #8 0xffffffff80c0025b at bdwrite+0x7b
Feb 20 07:04:59 system kernel: #9 0xffffffff80e656c2 at ffs_update+0x352
Feb 20 07:04:59 system kernel: #10 0xffffffff80e9204e at ffs_sync+0x60e
Feb 20 07:04:59 system kernel: #11 0xffffffff80c2f8bf at sync_fsync+0x10f
Feb 20 07:04:59 system kernel: #12 0xffffffff80c2e4b7 at sched_sync+0x487
Feb 20 07:04:59 system kernel: #13 0xffffffff80af760f at fork_exit+0x7f
Feb 20 07:04:59 system kernel: #14 0xffffffff80ffd3ee at fork_trampoline+0xe
Feb 20 07:04:59 system kernel: Timeout initializing vt_vga
Feb 20 07:04:59 system kernel: Uptime: 6d15h20m14s
Feb 20 07:04:59 system kernel: Dumping 1267 out of 16121 MB:..2%..11%..21%..31%..41%..51%..61%..71%..81%..91%---<<BOOT>>---

I was not able to find much on the functions reassignbuf, bdirty, bdwrite, etc. Are these network buffer or regular memory related functions? I was wondering if anyone could shed more light on what’s causing this. I searched different keywords in the bug tool but did not find a match. There is another post on reassignbuf, but it was related to a Realtek driver. I am running 2x Intel I226-V (igc driver). Any extra pointers would help!
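
The `time =` and `Uptime:` lines in the panic output above can be decoded to pin down when the panic happened and when the box had last booted. A minimal sketch, assuming `time` is seconds since the Unix epoch; the helper `parse_uptime` is made up for illustration and results are shown in UTC only:

```python
import re
from datetime import datetime, timedelta, timezone


def parse_uptime(text):
    """Turn a string like '6d15h20m14s' into a timedelta (every part is optional)."""
    match = re.fullmatch(r"(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if not text or match is None:
        raise ValueError(f"unrecognised uptime string: {text!r}")
    days, hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


# Values copied from the panic message above.
panic_time = datetime.fromtimestamp(1740056642, tz=timezone.utc)
uptime = parse_uptime("6d15h20m14s")
boot_time = panic_time - uptime

print("panic at:", panic_time.isoformat())
print("booted at:", boot_time.isoformat())
```

Comparing the decoded panic time with the syslog timestamps helps separate the moment of the crash from the moment the messages were written out after reboot.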
 
ffs_sync is in the filesystem code. The faulting address of 0xc8 implies this is a dereference of a null pointer to a struct. Most likely cause: your disk is corrupted, or your memory is bad, or both.
 
The various b... functions are buffer cache. They are being called here by a file system function for the FFS file system.

A corrupted disk should not cause a kernel fault. If there were no bugs, then any data read from disk (even if it is file system internal metadata) would be checked before being dereferenced, and would not cause a page fault in the kernel. So the possible explanations are memory errors or a bug. My educated guess is a memory error, since bugs in FFS (and the buffer cache it uses) should be very rare these days.
 
In my experience with FreeBSD, sometimes (not very often) a disk in need of fsck causes a panic. Remedy: reboot in single-user and run fsck until you are certain that the disk is good.
A panic (controlled exit of the kernel with a clear error message) is sort of OK, not great, but one can live with it. A page fault is not OK, since for an end user, knowing what to do is virtually impossible. But I understand that the world is not perfect.
 
No matter the state of the filesystem, i.e. no matter what state the on-disk structures are in, the kernel should not panic when accessing them. If it does, it's a nasty bug. If you do have a crash dump available (and it seems you do), please open a PR for it. Bear in mind that the crash dump may include sensitive data.

It's impossible to tell what code your first crash happened in due to lack of information (other than the suspicious vaddr). The %rip could be pointing into a kernel module; it doesn't seem to be within the kernel itself.
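
To illustrate the "kernel or module" question: `kldstat` lists the load address and size of the kernel and of each loaded module, so an instruction pointer can be matched against those ranges. The sketch below uses the two instruction pointers from the logs in this thread, but the address ranges are hypothetical placeholders (not taken from this machine) and `owner_of` is a made-up helper name.

```python
# (name, base address, size in bytes). HYPOTHETICAL values for illustration only;
# real ones would come from `kldstat` on the affected system.
MODULES = [
    ("kernel", 0xFFFFFFFF80200000, 0x1F00000),
    ("hypothetical_module.ko", 0xFFFFFFFF82800000, 0x100000),
]


def owner_of(address, modules):
    """Return the name of the image whose [base, base + size) range holds address."""
    for name, base, size in modules:
        if base <= address < base + size:
            return name
    return None


first_rip = 0xFFFFFFFF82839DF8   # instruction pointer from the Feb 11 crash
second_rip = 0xFFFFFFFF80C247C3  # instruction pointer from the Feb 20 crash

for rip in (first_rip, second_rip):
    print(hex(rip), "->", owner_of(rip, MODULES))
```

With real ranges in place of the placeholders, a result other than `kernel` for the first address would point at the module to look into.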

The other crash report does have more information. The crash happened in buf_vlist_add, here:
Code:
(kgdb) x/12i 0xffffffff80c247c3
   0xffffffff80c247c3 <buf_vlist_add+67>:    cmp    0xc8(%rcx),%rax
   0xffffffff80c247ca <buf_vlist_add+74>:    jle    0xffffffff80c2483d <buf

In this case it is, as atax1a mentioned, dereferencing a NULL pointer. It seems it was walking internal buffer structures. Your dump info also shows %rcx being 0.
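
The arithmetic behind that reading can be sketched as follows. The operand `0xc8(%rcx)` means "the memory at %rcx + 0xc8", so with %rcx = 0 the CPU touches address 0xc8, which matches the reported fault virtual address. The `FakeBuf` layout below is purely hypothetical (it is not the real kernel structure and says nothing about which field was involved); it only shows how a field at offset 0xc8 behind a NULL pointer produces that address.

```python
import ctypes


class FakeBuf(ctypes.Structure):
    # HYPOTHETICAL layout: 0xc8 bytes of padding followed by one 8-byte field.
    _fields_ = [
        ("padding", ctypes.c_ubyte * 0xC8),
        ("some_field", ctypes.c_int64),
    ]


def effective_address(base, displacement):
    """Address accessed by a base+displacement operand, wrapped to 64 bits."""
    return (base + displacement) & 0xFFFFFFFFFFFFFFFF


rcx = 0x0  # value of %rcx in the register dump from the panic
fault_address = effective_address(rcx, FakeBuf.some_field.offset)
print(hex(fault_address))
```

A small fault address like this is the usual signature of reading a member through a NULL struct pointer, as opposed to the first crash, where the address 0x40000000000 does not fit that pattern.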

But that's as much as one can do without the crash dump. Debugging FS is not my cup of coffee (still have splay tree nightmares back from uni days ;) ), so that PR is the best way to go.
Also, I guess this information doesn't help you much anyway because it doesn't tell you what you as a user could do about it. And this structure could be a victim of other issues (both SW and HW). It would help a lot if you could reproduce the issue.
 
As I was re-reading this thread I noticed I showed you a bad offset into the structure. The code where it crashed is correct though. I can't recheck it now as I don't have access to FreeBSD at the moment. But it bothers me that I see that mistake.
edit: I've removed the incorrect quotes. It was walking internal buffer structures; it's enough to leave it at that. I've set up a test VM with 14.2 (p1 kernel) on UFS and will stress it for a day or so.
 