Other Debugging crash

jbo@ · Oct 12, 2021

I just experienced something I never experienced in all my years of using FreeBSD before: A crash!

I've been compiling some small code base (C, C++, cmake) with devel/jetbrains-clion while listening to some music via www/firefox when the system suddenly froze and then automatically rebooted.

How would I go about understanding what happened?

Machine specs:

Intel Core i7-8086K
64 GB DDR memory (non-ECC)
Nvidia Quadro P5000
NVMe SSD
FreeBSD 13.0-RELEASE

Here's uname -a

Code:

FreeBSD fbsd_beefy01 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #0: Tue Aug 24 07:33:27 UTC 2021     root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64

Here are the first few lines of /var/crash/core.txt.0:

Code:

fbsd_beefy01 dumped core - see /var/crash/vmcore.0

Tue Oct 12 17:54:24 CEST 2021

FreeBSD fbsd_beefy01 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #0: Tue Aug 24 07:33:27 UTC 2021     root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64

panic: privileged instruction fault

GNU gdb (GDB) 11.1 [GDB v11.1 for FreeBSD]
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd13.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:


Fatal trap 1: privileged instruction fault while in kernel mode
cpuid = 5; apic id = 05
instruction pointer    = 0x20:0xffffffff80f275ed
stack pointer           = 0x0:0xfffffe01c65a6840
frame pointer           = 0x0:0xfffffe01c65a6930
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process        = 34881 (cc1)
trap number        = 1
panic: privileged instruction fault
cpuid = 5
time = 1634053989
KDB: stack backtrace:
#0 0xffffffff80c574c5 at kdb_backtrace+0x65
#1 0xffffffff80c09ea1 at vpanic+0x181
#2 0xffffffff80c09d13 at panic+0x43
#3 0xffffffff8108b1b7 at trap_fatal+0x387
#4 0xffffffff8108a67e at trap+0x8e
#5 0xffffffff81061958 at calltrap+0x8
#6 0xffffffff80f2741d at vm_fault_trap+0x6d
#7 0xffffffff8108b3b8 at trap_pfault+0x1f8
#8 0xffffffff8108a9ed at trap+0x3fd
#9 0xffffffff81061958 at calltrap+0x8
Uptime: 3h34m36s
Dumping 3404 out of 65415 MB:..1%..11%..21%..31%..41% (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort) ..51% (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort) ..61% (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort) ..71% (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort) ..81% (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort) ..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55    /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c09a96 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80c09f10 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80c09d13 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff8108b1b7 in trap_fatal (frame=0xfffffe01c65a6780, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:915
#6  0xffffffff8108a67e in trap (frame=0xfffffe01c65a6780)
    at /usr/src/sys/amd64/amd64/trap.c:212
#7  <signal handler called>
#8  0xffffffff80f275ed in vm_fault (map=<optimized out>,
    map@entry=0xfffffe019ccd89f0, vaddr=vaddr@entry=34433785856,
    fault_type=fault_type@entry=2 '\002', fault_flags=<optimized out>,
    fault_flags@entry=0, m_hold=<optimized out>, m_hold@entry=0x0)
    at /usr/src/sys/vm/vm_fault.c:1302
#9  0xffffffff80f2741d in vm_fault_trap (map=0xfffffe019ccd89f0,
    vaddr=<optimized out>, vaddr@entry=34433785856,
    fault_type=<optimized out>, fault_flags=fault_flags@entry=0,
    signo=0xfffffe01c65a6ac4, ucode=0xfffffe01c65a6ac0)
    at /usr/src/sys/vm/vm_fault.c:631
#10 0xffffffff8108b3b8 in trap_pfault (frame=frame@entry=0xfffffe01c65a6b00,
    usermode=true, signo=0x0, signo@entry=0xfffffe01c65a6ac4,
    ucode=ucode@entry=0xfffffe01c65a6ac0)
    at /usr/src/sys/amd64/amd64/trap.c:817
#11 0xffffffff8108a9ed in trap (frame=0xfffffe01c65a6b00)
    at /usr/src/sys/amd64/amd64/trap.c:340
#12 <signal handler called>
#13 0x0000000802b89403 in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffbec8
(kgdb)

grahamperrin · Oct 13, 2021

UFS or ZFS?

I don't know whether __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55 was significant, but for reference:

<https://github.com/freebsd/freebsd-...94729cdad8c2/sys/amd64/include/pcpu_aux.h#L55> (line 55 of the source code).

jbo@ · Oct 13, 2021

Argentum · Oct 13, 2021

jbodenmann said:
How would I go about understanding what happened?

Can you reproduce it if you try the same things again?

_martin · Oct 13, 2021

The kernel crashed because of a privileged instruction fault while in kernel mode. It happened on CPU 5 while cc1 was running. In userspace (frame #13, 0x802b89403) a signal handler was called to handle a trap. The reason for it (frame #8) was a write to a page that is not writable (fault type 2). It seems, though, that during that time another signal handler was called. With so many optimized-out entries it's hard to say what happened. It seems it got a trap of unknown origin (reserved/unknown).

This is a very good candidate for a PR. As always, it helps if you can replicate this.

mark_j · Oct 13, 2021

I would add that trap 1 is often caused by physical memory errors, so it would be wise to run a memory tester, just in case.

_martin · Oct 13, 2021

It actually seems the crash happened because the 2nd handler jumped to an incorrect location in the kernel code. A highly unlikely event, but then it did receive an unknown trap. The crashing IP was 0xffffffff80f275ed, but disassembling vm_fault in my VM shows this (truncated code):

Code:

   0xffffffff80f275e0 <+160>:    cmp    r12d,0x6
   0xffffffff80f275e4 <+164>:    jne    0xffffffff80f28557 <vm_fault+4119>
   0xffffffff80f275ea <+170>:    mov    BYTE PTR [rbp-0x90],r15b
   0xffffffff80f275f1 <+177>:    mov    rsi,QWORD PTR [rbp-0xa0]
   0xffffffff80f275f8 <+184>:    movzx  edx,r14b

To confirm we have the same patch level can you run this:

Code:

gdb /boot/kernel/kernel
set height 60
disass vm_fault

And paste the code around 0xffffffff80f275e0.

Argentum · Oct 14, 2021

mark_j said:
I would add that trap 1 is often caused by physical memory errors, so it would be wise to run a memory tester, just in case.

This is just what I wanted to say: a random-bit hardware error is a candidate. 64 GB is a lot of memory, and if it is not error-corrected (desktop) then a random bit error becomes possible. A cosmic particle hitting the right place at the right time may be the root cause.

jbo@ · Oct 14, 2021

Argentum said:
Can you reproduce it if you try the same things again?

It happened three more times that day. At one point I stopped firing up firefox for listening to music and just went straight to compiling, and it happened again.
Now to the very unfortunate part: I wasn't smart enough to realize that I should keep a copy of the code base that I was compiling handy.

I was working on some embedded firmware (targeting STM32H750, using devel/cmake and devel/gcc-arm-embedded). The work I was doing at the moment of the crash was refactoring a project which was originally make-based to use cmake. I do have a hunch that this only happened when I was dicking around with passing the linker script to the compiler/linker.

The crash, which happened four times in total, was consistent: it always crashed when it was just (almost?) done compiling and just about to call the linker.

I'll try to reproduce this.

_martin said:
To confirm we have the same patch level can you run this:

Code:

gdb /boot/kernel/kernel
set height 60
disass vm_fault

And paste the code around 0xffffffff80f275e0.

Here you go:

Code:

   0xffffffff80f275ba <+122>:    mov    $0xffffffff,%eax
   0xffffffff80f275bf <+127>:    mov    %rax,-0xc8(%rbp)
   0xffffffff80f275c6 <+134>:    xor    %eax,%eax
   0xffffffff80f275c8 <+136>:    mov    %rax,-0xd0(%rbp)
   0xffffffff80f275cf <+143>:    movl   $0x0,-0xa8(%rbp)
   0xffffffff80f275d9 <+153>:    jmp    0xffffffff80f275ea <vm_fault+170>
   0xffffffff80f275db <+155>:    nopl   0x0(%rax,%rax,1)
   0xffffffff80f275e0 <+160>:    cmp    $0x6,%r12d
   0xffffffff80f275e4 <+164>:    jne    0xffffffff80f28557 <vm_fault+4119>
   0xffffffff80f275ea <+170>:    mov    %r15b,-0x90(%rbp)
   0xffffffff80f275f1 <+177>:    mov    -0xa0(%rbp),%rsi
   0xffffffff80f275f8 <+184>:    movzbl %r14b,%edx
   0xffffffff80f275fc <+188>:    mov    %rbx,%rdi
   0xffffffff80f275ff <+191>:    lea    -0x40(%rbp),%rcx
   0xffffffff80f27603 <+195>:    lea    -0x60(%rbp),%r8

In case this is relevant:

Code:

jbo@fbsd_beefy01 /u/h/jbo> gdb --version
GNU gdb (GDB) 11.1 [GDB v11.1 for FreeBSD]
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

_martin · Oct 14, 2021

Thanks for the output. It's what I suspected (we do have the same kernel): $rip is in the middle of an instruction. This is what was executed then:

Code:

(kgdb) x/4i 0xffffffff80f275ed
   0xffffffff80f275ed <vm_fault+173>:   jo     0xffffffff80f275ee <vm_fault+174>
   0xffffffff80f275ef <vm_fault+175>:   (bad)
   0xffffffff80f275f0 <vm_fault+176>:   dec    DWORD PTR [rax-0x75]
   0xffffffff80f275f3 <vm_fault+179>:   mov    ch,0x60
(kgdb) x/3i 0xffffffff80f275ee
   0xffffffff80f275ee <vm_fault+174>:   (bad)
   0xffffffff80f275ef <vm_fault+175>:   (bad)
   0xffffffff80f275f0 <vm_fault+176>:   dec    DWORD PTR [rax-0x75]
(kgdb)

It really doesn't matter whether the jump was taken or not; either location has a bad instruction. Hence trap 1, an illegal instruction.
What is interesting (well, at least for me) is what happened in the 2nd signal. Even if there was a double fault it should not be a problem.
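
As a rough illustration of why kgdb prints a jo at vm_fault+173, the sketch below decodes the crashing address by hand. The 7-byte encoding is my own reconstruction from the `mov %r15b,-0x90(%rbp)` line in the disassembly, not bytes read from the vmcore, and the helper name decode_jo is made up for this example.

```python
# Hand-assembled encoding of `mov %r15b,-0x90(%rbp)` at vm_fault+170.
# Assumption: reconstructed from the disassembly, not read from the dump.
INSN_START = 0xffffffff80f275ea
INSN_BYTES = bytes([
    0x44,                    # REX.R prefix (needed to name r15b)
    0x88,                    # opcode: MOV r/m8, r8
    0xbd,                    # ModRM: mod=10, reg=r15b, rm=rbp
    0x70, 0xff, 0xff, 0xff,  # disp32 = -0x90, little-endian
])


def decode_jo(addr):
    """Return the target of a 2-byte `jo rel8` at addr, or None (hypothetical helper)."""
    off = addr - INSN_START
    if not 0 <= off <= len(INSN_BYTES) - 2:
        return None
    opcode, rel8 = INSN_BYTES[off], INSN_BYTES[off + 1]
    if opcode != 0x70:  # 0x70 is the opcode for JO rel8
        return None
    # rel8 is signed and relative to the address of the next instruction
    disp = rel8 - 0x100 if rel8 >= 0x80 else rel8
    return addr + 2 + disp


crash_ip = 0xffffffff80f275ed
target = decode_jo(crash_ip)
print(f"offset into the real instruction: {crash_ip - INSN_START}")
print(f"jo target: {target:#x}")
```

Starting three bytes into the mov, the CPU sees the first displacement bytes 70 ff, which happen to decode as a jo with displacement -1. That lines up with the `jo 0xffffffff80f275ee` shown in the kgdb output.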

Now, no matter what, it doesn't hurt to run an offline memtest and let the machine go through it for a few hours, just to rule out a memory issue.

Do you still have the coredump available? Could you do this (assuming vmcore.0 is still the crash you started this thread with): kgdb /boot/kernel/kernel /var/crash/vmcore.0 and

Code:

f 8
x/4i $pc
f 7
x/4i $pc
f 5
x/4i $pc
i r

jbo@ · Oct 14, 2021

As mentioned, the crash happened three more times. Unfortunately I was (or rather still am) not experienced enough to know how to preserve any crash-relevant information in a useful manner. For the future, I take it that I can just archive /var/crash?

That being said, the last version of /var/crash/core.txt.0 is unchanged from the excerpt in my first post (same panic, same backtrace).

From the same "crash data", here's kgdb /boot/kernel/kernel /var/crash/vmcore.0:

Code:

(kgdb) f 8
#8  0xffffffff80f275ed in vm_fault (map=<optimized out>, map@entry=0xfffffe019ccd89f0, vaddr=vaddr@entry=34433785856, fault_type=fault_type@entry=2 '\002', fault_flags=<optimized out>,
    fault_flags@entry=0, m_hold=<optimized out>, m_hold@entry=0x0) at /usr/src/sys/vm/vm_fault.c:1302
1302    /usr/src/sys/vm/vm_fault.c: No such file or directory.
(kgdb) x/4i $pc
=> 0xffffffff80f275ed <vm_fault+173>:    jo     0xffffffff80f275ee <vm_fault+174>
   0xffffffff80f275ef <vm_fault+175>:    (bad)
   0xffffffff80f275f0 <vm_fault+176>:    decl   -0x75(%rax)
   0xffffffff80f275f3 <vm_fault+179>:    mov    $0x60,%ch
(kgdb) f 7
#7  <signal handler called>
(kgdb) x/4i $pc
=> 0xffffffff81061958 <calltrap+8>:    jmp    0xffffffff81064340 <doreti>
   0xffffffff8106195d <calltrap+13>:    nopl   (%rax)
   0xffffffff81061960 <alltraps_noen_u>:    mov    %rdi,(%rsp)
   0xffffffff81061964 <alltraps_noen_u+4>:    mov    %gs:0x20,%rdi
(kgdb) f 5
#5  0xffffffff8108b1b7 in trap_fatal (frame=0xfffffe01c65a6780, eva=0) at /usr/src/sys/amd64/amd64/trap.c:915
915    /usr/src/sys/amd64/amd64/trap.c: No such file or directory.
(kgdb) x/4i $pc
=> 0xffffffff8108b1b7:    nopw   0x0(%rax,%rax,1)
   0xffffffff8108b1c0 <trap_pfault>:    push   %rbp
   0xffffffff8108b1c1 <trap_pfault+1>:    mov    %rsp,%rbp
   0xffffffff8108b1c4 <trap_pfault+4>:    push   %r15
(kgdb) i r
rax            <unavailable>
rbx            0x0                 0
rcx            <unavailable>
rdx            <unavailable>
rsi            <unavailable>
rdi            <unavailable>
rbp            0xfffffe01c65a6660  0xfffffe01c65a6660
rsp            0xfffffe01c65a6610  0xfffffe01c65a6610
r8             <unavailable>
r9             <unavailable>
r10            <unavailable>
r11            <unavailable>
r12            0x0                 0
r13            0xfffffe01c65a6780  -2191400474752
r14            0x1                 1
r15            0xffffffff8121d3ee  -2128489490
rip            0xffffffff8108b1b7  0xffffffff8108b1b7
eflags         <unavailable>
cs             0x20                32
ss             0x28                40
ds             <unavailable>
es             <unavailable>
fs             <unavailable>
gs             <unavailable>
fs_base        <unavailable>
gs_base        <unavailable>
(kgdb)

And in case this is helpful in any way:

Code:

jbo@fbsd_beefy01 ~> ls -l /var/crash
total 2471958
-rw-r--r--  1 root  wheel           2 Oct 12 19:21 bounds
-rw-r--r--  1 root  wheel     5778141 Oct 12 17:54 core.txt.0
-rw-r--r--  1 root  wheel     5780557 Oct 12 18:57 core.txt.1
-rw-r--r--  1 root  wheel     1619033 Oct 12 19:11 core.txt.2
-rw-r--r--  1 root  wheel     1563847 Oct 12 19:17 core.txt.3
-rw-r--r--  1 root  wheel      127562 Oct 12 19:21 core.txt.4
-rw-------  1 root  wheel         497 Oct 12 17:54 info.0
-rw-------  1 root  wheel         480 Oct 12 18:57 info.1
-rw-------  1 root  wheel         507 Oct 12 19:10 info.2
-rw-------  1 root  wheel         507 Oct 12 19:17 info.3
-rw-------  1 root  wheel         480 Oct 12 19:21 info.4
lrwxr-xr-x  1 root  wheel           6 Oct 12 19:21 info.last -> info.4
-rw-r--r--  1 root  wheel           5 Apr  9  2021 minfree
-rw-------  1 root  wheel  3569807360 Oct 12 17:54 vmcore.0
-rw-------  1 root  wheel  3269287936 Oct 12 18:57 vmcore.1
-rw-------  1 root  wheel  2837680128 Oct 12 19:11 vmcore.2
-rw-------  1 root  wheel  2728898560 Oct 12 19:17 vmcore.3
-rw-------  1 root  wheel  2642509824 Oct 12 19:21 vmcore.4
lrwxr-xr-x  1 root  wheel           8 Oct 12 19:21 vmcore.last -> vmcore.4

_martin · Oct 14, 2021

Yes, just keep the contents of /var/crash. As you can see from the ls -l output you pasted, the crash number increases automatically and older dumps are preserved. If you keep crashing you may run out of free space, though.
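
If you want a copy of the crash data somewhere else, a minimal sketch could look like the one below. The function name archive_crash_dir and the archive file name are made up for illustration, and reading the vmcore files needs root since they are mode 0600.

```python
import shutil
import tarfile
from pathlib import Path


def archive_crash_dir(crash_dir, dest_dir):
    """Pack crash_dir into dest_dir/crash-archive.tar.gz (hypothetical helper)."""
    crash_dir, dest_dir = Path(crash_dir), Path(dest_dir)
    # Sum regular files only; info.last and vmcore.last are symlinks.
    needed = sum(
        p.stat().st_size
        for p in crash_dir.iterdir()
        if p.is_file() and not p.is_symlink()
    )
    free = shutil.disk_usage(dest_dir).free
    # Conservative check: the compressed archive is normally smaller than this.
    if needed > free:
        raise OSError(f"need about {needed} bytes, only {free} free in {dest_dir}")
    out = dest_dir / "crash-archive.tar.gz"
    with tarfile.open(out, "w:gz") as tar:
        # Symlinks are stored as symlinks, not followed.
        tar.add(crash_dir, arcname=crash_dir.name)
    return out


# Example call (run as root; the destination path is only a placeholder):
# archive_crash_dir("/var/crash", "/some/backup/dir")
```

The free-space check is there because of the point above: five dumps of 2.6 to 3.5 GB each add up quickly.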

Now, looking at your backtrace again, I see frame #8 is where the issue occurred. It's kind of a d'oh moment, but I only just saw that. So trap -> page fault -> write to RO page -> ??
At least that's my take on it so far. But why that happened is really interesting.

Did the system always crash on the same issue? grep Panic\ S /var/crash/info.*

jbo@ · Oct 14, 2021

_martin said:
Did the system always crash on the same issue? grep Panic\ S /var/crash/info.*

Here you go:

Code:

jbo@fbsd_beefy01 ~ [2]> sudo grep Panic\ S /var/crash/info.*
/var/crash/info.0:  Panic String: privileged instruction fault
/var/crash/info.1:  Panic String: page fault
/var/crash/info.2:  Panic String: Unrecoverable machine check exception
/var/crash/info.3:  Panic String: Unrecoverable machine check exception
/var/crash/info.4:  Panic String: page fault
/var/crash/info.last:  Panic String: page fault

To an untrained novice like me it would seem that the crashes are always related to memory (access?).
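
For what it's worth, here is a small sketch that tallies the panic strings from the grep output above. The function name tally_panics is made up for this example, and the info.last line is skipped on the assumption that it is just the symlink to info.4 seen in the directory listing.

```python
from collections import Counter

# The grep output pasted above, copied verbatim.
GREP_OUTPUT = """\
/var/crash/info.0:  Panic String: privileged instruction fault
/var/crash/info.1:  Panic String: page fault
/var/crash/info.2:  Panic String: Unrecoverable machine check exception
/var/crash/info.3:  Panic String: Unrecoverable machine check exception
/var/crash/info.4:  Panic String: page fault
/var/crash/info.last:  Panic String: page fault
"""


def tally_panics(text):
    """Count panic strings per crash, ignoring the info.last symlink."""
    counts = Counter()
    for line in text.splitlines():
        path, sep, rest = line.partition(":")
        if not sep or path.endswith(".last"):
            continue
        label, _, panic = rest.partition(":")
        if label.strip() == "Panic String":
            counts[panic.strip()] += 1
    return counts


counts = tally_panics(GREP_OUTPUT)
for panic, n in counts.most_common():
    print(f"{n} x {panic}")
```

So that is three different panic types across five crashes of the same workload.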

Unfortunately, I am currently unable to perform a memory check on this machine. I won't be able to do that until mid next week.
While this is analytically irrelevant: this machine has been my daily driver for all my professional endeavours for the past >3 years.
I've only used Windows 10 on this machine up until a few weeks ago (now a dual boot).

Argentum · Oct 14, 2021

jbodenmann said:
Unfortunately, I am currently unable to perform a memory check on this machine. I won't be able to do that until mid next week.
While this is analytically irrelevant: this machine has been my daily driver for all my professional endeavours for the past >3 years.
I've only used Windows 10 on this machine up until a few weeks ago (now a dual boot).

Not sure, of course, but it still looks like a hardware issue. I may be wrong, but it looks very much so. And if it is repeatable, it is not just a random bit error, but likely a fault. In your position I would suspect the GPU first and then the RAM. With 64 GB of RAM, you can easily remove half of it temporarily and check whether the issue persists. If it does, put that half back, remove the other 32 GB, and try again.
Also, you could temporarily replace the Quadro with something else and see if the issue persists. Nvidia drivers are closed source, and that could explain why there is no symbol table match in the trace (just a hypothesis).

_martin · Oct 14, 2021

A system can crash for many reasons, both SW and HW related. Here cc1 wrote out of bounds (the write crossed the page boundary), which caused the page fault. Not a problem per se; the process will die.

Here, though, the problem occurred in vm_fault(). The way it happened (a jump to an improper location) does suggest the possibility of a HW issue. The goto label RetryFault is at address 0xffffffff80f275ea and your system jumped to 0xffffffff80f275ed. Interesting comparison in the last byte:

Code:

0xea  1110 1010
0xed  1110 1101

Which is quite an interesting problem. So that memtest is really a good idea.
For comparison, could you share the backtrace of the last crash (#4)?
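
The last-byte comparison above can be checked with a quick XOR. This is only a sketch of the arithmetic; the variable names are mine.

```python
expected = 0xffffffff80f275ea  # RetryFault label, per the disassembly
actual = 0xffffffff80f275ed    # crashing instruction pointer

diff = expected ^ actual       # bits set where the two addresses differ
print(f"xor of the addresses: {diff:#x}")
print(f"last byte pattern:    {diff & 0xff:08b}")
print(f"bits that differ:     {bin(diff).count('1')}")
```

The two addresses differ in the three lowest bits (010 versus 101), so this is not what a single flipped bit in the address would look like.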

Argentum · Oct 14, 2021

_martin said:
A system can crash for many reasons, both SW and HW related. Here cc1 wrote out of bounds (the write crossed the page boundary), which caused the page fault. Not a problem per se; the process will die.

Here, though, the problem occurred in vm_fault(). The way it happened (a jump to an improper location) does suggest the possibility of a HW issue. The goto label RetryFault is at address 0xffffffff80f275ea and your system jumped to 0xffffffff80f275ed. Interesting comparison in the last byte:

Code:

0xea  1110 1010
0xed  1110 1101

Which is quite an interesting problem. So that memtest is really a good idea.
For comparison, could you share the backtrace of the last crash (#4)?

That is why I suspect a HW issue here. No application in user space can crash the kernel, including cc1.

_martin · Oct 14, 2021

Argentum said:
No application in user space can crash the kernel

Should not, but for sure can and does, due to bugs in the kernel.

It's worth checking out the other crashes to see the stack traces there. It does sound plausible that those crashes are due to memory issues.

jbo@ · Oct 14, 2021

Thanks for all your efforts guys!

Unfortunately I won't be able to provide any more information on this matter until Wednesday/Thursday next week. I won't have access to that machine in question until then.
I'll certainly run a proper memtest next week as well.

_martin · Oct 14, 2021

jbodenmann This is interesting:

Code:

MCA: Bank 0, Status 0x9400004000040150
MCA: Global Cap 0x0000000000000c0c, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x906ea, APIC ID 8
MCA: CPU 8 COR (1) ICACHE L0 IRD error
MCA: Address 0x1ffff80bbf800

Your system crashed just before this message. This is a HW-reported error. Have a look at sysutils/mcelog to see if you can extract any other information.
The L0 icache (CPU) is interesting and fits the observed issue 100%.
(I was given the core file to examine; I saw the message there.)
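
For reference, the Status value in that message can be picked apart by hand. The sketch below assumes the usual IA32_MCi_STATUS layout from the Intel SDM (flag bits 63 down to 57, corrected error count in bits 52:38, MCA error code in bits 15:0, cache hierarchy codes of the form 000F 0001 RRRR TTLL); the bit positions are written from memory and the function name decode_status is made up, so check them against the manual before relying on this.

```python
STATUS = 0x9400004000040150  # "MCA: Bank 0, Status" from the message above

# Assumed IA32_MCi_STATUS flag bits (Intel SDM layout, from memory).
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
KIND = {0: "ICACHE", 1: "DCACHE", 2: "GCACHE"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def decode_status(status):
    """Return (flags, corrected_count, description) for a status word."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    corrected = (status >> 38) & 0x7fff  # corrected error count, bits 52:38
    code = status & 0xffff               # MCA error code, bits 15:0
    desc = None
    if (code & 0xef00) == 0x0100:        # cache hierarchy: 000F 0001 RRRR TTLL
        req = REQUEST.get((code >> 4) & 0xf, "?")
        kind = KIND.get((code >> 2) & 0x3, "?")
        level = LEVEL[code & 0x3]
        desc = f"{kind} {level} {req} error"
    return flags, corrected, desc


flags, corrected, desc = decode_status(STATUS)
print("flags:", " ".join(flags))
print("corrected count:", corrected)
print("error:", desc)
```

Under those assumptions the UC flag is clear and the count is 1, which agrees with the "COR (1) ICACHE L0 IRD error" line the kernel printed: an instruction fetch error in the instruction cache that the hardware reported as corrected.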

Argentum · Oct 15, 2021

_martin said:
Should not, but for sure can and does, due to bugs in the kernel.

Memory management is implemented in CPU hardware and has been tested throughout many FreeBSD versions and installations. A user space program crashing the kernel is very unlikely here.

_martin · Oct 15, 2021

Argentum said:
A user space program crashing the kernel is very unlikely here.

In general you are incorrect, as I mentioned above. A userspace program crashing a kernel is a thing (how do you think the vast majority of exploits work?).

In this case it is most likely HW related. But I already mentioned that above; I did analyse the crash. So I don't see the reason for your statement here.

Argentum · Oct 15, 2021

_martin said:
In general you are incorrect, as I mentioned above. A userspace program crashing a kernel is a thing (how do you think the vast majority of exploits work?).

Have you filed a Problem Report on that?

_martin · Oct 15, 2021

Argentum said:
Have you filed a Problem Report on that?

On what? That FreeBSD's kernel (like that of any other operating system I'm aware of) can be crashed from userspace?

Argentum · Oct 15, 2021

_martin said:
On what? That FreeBSD's kernel (like that of any other operating system I'm aware of) can be crashed from userspace?

Other operating systems are out of scope here, but the community assumes that anybody who finds a bug will report it in https://bugs.freebsd.org/bugzilla/

facedebouc · Oct 15, 2021

Argentum said:
Other operating systems are out of scope here, but the community assumes that anybody who finds a bug will report it in https://bugs.freebsd.org/bugzilla/

It's not a bug, it's a feature ;-)