Other Debugging crash

_martin · Oct 15, 2021

facedebouc said:
It's not a bug, it's a feature ;-)

Indeed it is. To a byte.

I don't think there's any point in going deeper on this topic; all the information is in this thread for future readers.

jbo@ · Oct 16, 2021

_martin said:
I don't think there's any point in going deeper on this topic; all the information is in this thread for future readers.

I appreciate the efforts that the community put into looking at this crash.
Unfortunately, there's little I can do until late next week when I can run some hardware tests (memory tests, CPU stress tests, ...) on the machine in question.

I'll certainly report back here ASAP!

jbo@ · Oct 20, 2021

I'm able to access the machine in question again.

I ran a full pass of memtest86 over lunch. No errors showed up. I will run the full test suite with four passes overnight.

Fiddling around with sysutils/mcelog:

Code:

jbo@fbsd_beefy01 /u/h/jbo> mcelog --no-dmi --asci --file /var/crash/core.txt.0
Hardware event. This is not a software error.
CPU 8 BANK 0 
ADDR 1ffff80bbf800 
MCG status:
STATUS 9400004000040150 MCGSTATUS 0
MCGCAP c0c APICID 8 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 158 Step 10

One thing I'd like to mention is that I do update the CPU's microcode. Incidentally, I have never done that on any FreeBSD machine before. I do it via /boot/loader.conf:

Code:

cpu_microcode_load="YES"
cpu_microcode_name="/boot/firmware/intel-ucode.bin"

Next thing will be a CPU stress test. Any recommendations there?

_martin · Oct 20, 2021

sysutils/stress seems to be the right tool for this; I've never used it on FreeBSD though.

I don't have much experience with MCE errors on x86 platforms. The MCE log I pasted sounds like an issue on the CPU side. I'm assuming this purely from the icache error, which would fit the crash scenario perfectly. But the bank number confuses me, as if it pointed to a memory bank.
How many crashes/dumps do you have? Check for all occurrences of these errors and compare them (how different are they?).
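
On the bank question: as far as I know, an MCA "bank" is a group of machine-check reporting registers (control, status, address, misc) that belongs to a hardware unit inside the CPU, not a DRAM bank. The status word from the mcelog output can be decoded by hand. Below is a minimal Python sketch, assuming the architectural MCi_STATUS layout from the Intel SDM; the bit positions and code tables are reproduced from memory and should be checked against the manual, and `decode_status` is a made-up helper name.

```python
# Sketch: decode an IA32_MCi_STATUS value, assuming the architectural
# layout described in the Intel SDM. decode_status() is a hypothetical helper.

FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Fields of the cache-hierarchy compound error code: 0000 0001 RRRR TTLL
REQUESTS = ["ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
TYPES = ["I", "D", "G"]             # TT: instruction, data, generic
LEVELS = ["L0", "L1", "L2", "LG"]   # LL: cache level


def decode_status(status):
    """Split a 64-bit status word into flags, codes and a short description."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF                            # bits 15:0, MCA error code
    info = {
        "flags": flags,
        "mca_code": code,
        "model_specific": (status >> 16) & 0xFFFF,    # bits 31:16
        # bits 52:38; only meaningful when the CPU reports corrected counts
        "corrected_count": (status >> 38) & 0x7FFF,
        "kind": "unknown",
    }
    if code == 0x0400:
        info["kind"] = "internal timer error"
    elif (code & 0xFF00) == 0x0100:
        rrrr, tt, ll = (code >> 4) & 0xF, (code >> 2) & 3, code & 3
        if rrrr < len(REQUESTS) and tt < len(TYPES):
            info["kind"] = "cache %s %s %s" % (TYPES[tt], LEVELS[ll], REQUESTS[rrrr])
    return info


if __name__ == "__main__":
    # STATUS value from the mcelog output above
    print(decode_status(0x9400004000040150))
```

Under these assumptions the value decodes to a valid, enabled, corrected (UC clear) instruction-cache fetch error, which points at the CPU's own cache rather than at a memory module.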

Is this a workstation? It wouldn't hurt to remove all memory modules and the CPU, clean the socket, repaste the cooler and run the test again.

jbo@ · Oct 20, 2021

Yes, this is a workstation. I would prefer not to do anything on the system until the issue is actually identified (partly because I am curious to figure it out and partly because this is my main workhorse).

I have been running (and still am) sysutils/stress for > 1h and the system is still running/stable.
The CPU temperature never exceeds 56.0C. There's a massive Noctua NH-D15 cooler on there. While I get your comment regarding cleaning sockets this would at least show that cooling performance is adequate.

Also, my main workloads tend to be rather CPU intensive. I'd argue that most of my work is already stress-testing the CPU.

Argentum · Oct 20, 2021

jbodenmann said:
I have been running (and still am) sysutils/stress for > 1h and the system is still running/stable.
The CPU temperature never exceeds 56.0C. There's a massive Noctua NH-D15 cooler on there. While I get your comment regarding cleaning sockets this would at least show that cooling performance is adequate.

Personally I suspect the Quadro in this case. It might not even be the actual hardware, but the GPU-driver combination. As I have written here before, if you could just temporarily swap the GPU for some other model, that would give a good comparison point.

jbo@ · Oct 20, 2021

Argentum said:
Personally I suspect the Quadro in this case. It might not even be the actual hardware, but the GPU-driver combination. As I have written here before, if you could just temporarily swap the GPU for some other model, that would give a good comparison point.

Unfortunately these days I have few GPUs just "lying around". The only real options I'd have would be an old Quadro K2000, a GTX 1080 or a Quadro M1000 if really necessary.
However, I'd need a way of reproducing the crash first, otherwise this won't tell me much, as I haven't experienced a crash since the original post. Is there a mechanism/test we can run to provoke the problem?

_martin · Oct 20, 2021

You could do a make -j12 buildworld just to stress the machine. Build other projects you work on in parallel. Try to crash the machine with whatever workloads you normally run on it.
Oh, in my last post I forgot to mention: you can also read the dmesg from a vmcore by running: dmesg -M /var/crash/vmcore.{N}.

_martin · Oct 20, 2021

Could you install devel/hwloc2 using your preferred method of installing packages and show the output of lstopo-no-graphics?

jbo@ · Oct 25, 2021

I've run the full memtest86 test suite three times over three different nights. I have been stress testing the CPU for days at a time with synthetic workloads, as well as intentionally pushing the system harder while doing my everyday work on the machine. I did some 3D rendering on the GPU...

Nothing shows up. The system is rock solid just as I knew it before.
This, together with the fact that the crash happened every single time gcc-arm-embedded invoked the linker, starts to tell me that this might not necessarily be a hardware fault. Of course I understand that all the signs are pointing that way, though.

I really wish I had kept a copy of the code base that was crashing on linking...

_martin said:
Could you install devel/hwloc2 using your preferred method of installing packages and show the output of lstopo-no-graphics?

Here you go:

Code:

jbo@fbsd_beefy01 /u/h/jbo> sudo lstopo-no-graphics
Failed to initialize LevelZero in ze_init(): 2013265921
Machine (62GB total)
  Package L#0
    NUMANode L#0 (P#0 62GB)
    L3 L#0 (12MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#4)
        PU L#5 (P#5)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#6)
        PU L#7 (P#7)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#8)
        PU L#9 (P#9)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#10)
        PU L#11 (P#11)
  HostBridge
    PCIBridge
      PCI 01:00.0 (VGA)
    PCIBridge
      PCI 02:00.0 (NVMExp)
    PCI 00:17.0 (SATA)
    PCIBridge
      PCI 03:00.0 (NVMExp)
    PCIBridge
      PCI 04:00.0 (Ethernet)
    PCIBridge
      PCIBridge
        PCIBridge
          PCI 08:00.0 (Ethernet)
        PCIBridge
          PCI 09:00.0 (Ethernet)
    PCI 00:1f.6 (Ethernet)

_martin said:
Oh, in my last post I forgot to mention: you can also read the dmesg from a vmcore by running: dmesg -M /var/crash/vmcore.{N}.

Here are the last few messages of each vmcore:

/var/crash/vmcore.0:

Code:

Fatal trap 1: privileged instruction fault while in kernel mode
cpuid = 5; apic id = 05
instruction pointer    = 0x20:0xffffffff80f275ed
stack pointer           = 0x0:0xfffffe01c65a6840
frame pointer           = 0x0:0xfffffe01c65a6930
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process        = 34881 (cc1)
trap number        = 1
panic: privileged instruction fault
cpuid = 5
time = 1634053989
KDB: stack backtrace:
#0 0xffffffff80c574c5 at kdb_backtrace+0x65
#1 0xffffffff80c09ea1 at vpanic+0x181
#2 0xffffffff80c09d13 at panic+0x43
#3 0xffffffff8108b1b7 at trap_fatal+0x387
#4 0xffffffff8108a67e at trap+0x8e
#5 0xffffffff81061958 at calltrap+0x8
#6 0xffffffff80f2741d at vm_fault_trap+0x6d
#7 0xffffffff8108b3b8 at trap_pfault+0x1f8
#8 0xffffffff8108a9ed at trap+0x3fd
#9 0xffffffff81061958 at calltrap+0x8
Uptime: 3h34m36s

/var/crash/vmcore.1:

Code:

Fatal trap 12: page fault while in kernel mode
cpuid = 7; apic id = 07
fault virtual address    = 0xffffffffffffff83
fault code        = supervisor write data, page not present
instruction pointer    = 0x20:0xffffffff8108b55a
stack pointer           = 0x0:0xfffffe02098d1ae0
frame pointer           = 0x0:0xfffffe02098d1af0
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process        = 13671 (cc1)
trap number        = 12
panic: page fault
cpuid = 7
time = 1634057457
KDB: stack backtrace:
#0 0xffffffff80c574c5 at kdb_backtrace+0x65
#1 0xffffffff80c09ea1 at vpanic+0x181
#2 0xffffffff80c09d13 at panic+0x43
#3 0xffffffff8108b1b7 at trap_fatal+0x387
#4 0xffffffff8108b20f at trap_pfault+0x4f
#5 0xffffffff8108a86d at trap+0x27d
#6 0xffffffff81061958 at calltrap+0x8
#7 0xffffffff81061958 at calltrap+0x8
Uptime: 56m52s

/var/crash/vmcore.2:

Code:

MCA: Bank 0, Status 0x9400004000040150
MCA: Global Cap 0x0000000000000c0c, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x906ea, APIC ID 8
MCA: CPU 8 COR (1) ICACHE L0 IRD error
MCA: Address 0x1ffff80f29480
MCA: Bank 0, Status 0x9400004000040150
MCA: Global Cap 0x0000000000000c0c, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x906ea, APIC ID 4
MCA: CPU 4 COR (1) ICACHE L0 IRD error
MCA: Address 0x1ffff80f27a80
MCA: Bank 0, Status 0x9400004000040150
MCA: Global Cap 0x0000000000000c0c, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x906ea, APIC ID 6
MCA: CPU 6 COR (1) ICACHE L0 IRD error
MCA: Address 0x1ffff80c26843
MCA: Bank 4, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000000c0c, Status 0x0000000000000005
MCA: Vendor "GenuineIntel", ID 0x906ea, APIC ID 3
MCA: CPU 3 UNCOR PCC internal timer error
MCA: Address 0x1014690
MCA: Misc 0x1014690
timeout stopping cpus
panic: Unrecoverable machine check exception
cpuid = 3
time = 1634058589
KDB: stack backtrace:
Uptime: 12m39s

/var/crash/vmcore.3:

Code:

MCA: Bank 0, Status 0x9400004000040150
MCA: Global Cap 0x0000000000000c0c, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x906ea, APIC ID 9
MCA: CPU 9 COR (1) ICACHE L0 IRD error
MCA: Address 0x1ffff80be7ad4
MCA: Bank 0, Status 0x9400004000040150
MCA: Global Cap 0x0000000000000c0c, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x906ea, APIC ID 4
MCA: CPU 4 COR (1) ICACHE L0 IRD error
MCA: Address 0x1ffff80f35f89
MCA: Bank 4, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000000c0c, Status 0x0000000000000005
MCA: Vendor "GenuineIntel", ID 0x906ea, APIC ID 3
MCA: CPU 3 UNCOR PCC internal timer error
MCA: Address 0x1014654
MCA: Misc 0x1014654
timeout stopping cpus
panic: Unrecoverable machine check exception
cpuid = 3
time = 1634058971
KDB: stack backtrace:
Uptime: 5m21s

/var/crash/vmcore.4:

Code:

Fatal trap 12: page fault while in kernel mode
cpuid = 7; apic id = 07
fault virtual address    = 0xffffffffffffff85
fault code        = supervisor write data, page not present
instruction pointer    = 0x20:0xffffffff80d01103
stack pointer           = 0x28:0xfffffe01a2b2f660
frame pointer           = 0x28:0xfffffe01a2b2f6d0
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process        = 4455 (cbsd)
trap number        = 12
panic: page fault
cpuid = 7
time = 1634059188
KDB: stack backtrace:
#0 0xffffffff80c574c5 at kdb_backtrace+0x65
#1 0xffffffff80c09ea1 at vpanic+0x181
#2 0xffffffff80c09d13 at panic+0x43
#3 0xffffffff8108b1b7 at trap_fatal+0x387
#4 0xffffffff8108b20f at trap_pfault+0x4f
#5 0xffffffff8108a86d at trap+0x27d
#6 0xffffffff81061958 at calltrap+0x8
#7 0xffffffff80d00c63 at vn_io_fault_doio+0x43
#8 0xffffffff80cfcb5c at vn_io_fault1+0x15c
#9 0xffffffff80cfa234 at vn_io_fault+0x1a4
#10 0xffffffff80c76798 at dofilewrite+0x88
#11 0xffffffff80c7630c at sys_write+0xbc
#12 0xffffffff8108babc at amd64_syscall+0x10c
#13 0xffffffff8106227e at fast_syscall_common+0xf8
Uptime: 2m34s

Note how the uptimes went down as I started to be able to reproduce the problem. Once I continued working on the code base I was refactoring (mainly fixing some linker script logic) the problems vanished.

Please let me know if there's anything else I can provide.
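
To put numbers on the shrinking uptimes, here is a small Python sketch that lines up the `time =` stamps and `Uptime:` values from the five excerpts above. It assumes `time =` is the Unix epoch second at the moment of the panic; `uptime_seconds` and `gaps` are hypothetical helper names.

```python
import re
from datetime import datetime, timezone

# (dump number, "time =" epoch seconds, "Uptime:" string), copied from the
# five dmesg excerpts above.
DUMPS = [
    (0, 1634053989, "3h34m36s"),
    (1, 1634057457, "56m52s"),
    (2, 1634058589, "12m39s"),
    (3, 1634058971, "5m21s"),
    (4, 1634059188, "2m34s"),
]

UNITS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def uptime_seconds(text):
    """Convert an uptime string such as '3h34m36s' to seconds."""
    return sum(int(n) * UNITS[u] for n, u in re.findall(r"(\d+)([dhms])", text))


def gaps(dumps):
    """Seconds between consecutive panics."""
    times = [t for _, t, _ in dumps]
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    for n, t, up in DUMPS:
        when = datetime.fromtimestamp(t, tz=timezone.utc)
        print(f"vmcore.{n}: {when:%Y-%m-%d %H:%M:%S} UTC, "
              f"uptime {uptime_seconds(up)} s")
    print("gaps between panics (s):", gaps(DUMPS))
```

All five panics fall within roughly 87 minutes of each other, and each gap is only a little longer than the uptime of the following boot, which fits a tight crash, reboot, rebuild, crash loop.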

_martin · Oct 25, 2021

You didn't do any OS upgrade since then, correct? Just so that my VM is still on the same version as yours. It wouldn't hurt to have the backtrace for the given crashes (you did paste the bt for crash 0). Crashes 1 and 4 faulted on an obviously bad address.
The output you pasted for crashes 2 and 3 is missing information from the beginning, so I can't say what the machine was doing. But you can see the MCE being fired; that's the smoking gun here.
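
For comparing the MCA records across dumps, a short Python sketch that tallies the `MCA: CPU ...` summary lines from the vmcore.2 and vmcore.3 excerpts by physical core. The mapping `core = cpu // 2` is an assumption: it takes the kernel's CPU numbers to match the P# numbers in the lstopo output (PU 0 and 1 on core 0, PU 2 and 3 on core 1, and so on). `tally` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# "MCA: CPU ..." summary lines copied from the vmcore.2 and vmcore.3 excerpts.
LOG = """\
MCA: CPU 8 COR (1) ICACHE L0 IRD error
MCA: CPU 4 COR (1) ICACHE L0 IRD error
MCA: CPU 6 COR (1) ICACHE L0 IRD error
MCA: CPU 3 UNCOR PCC internal timer error
MCA: CPU 9 COR (1) ICACHE L0 IRD error
MCA: CPU 4 COR (1) ICACHE L0 IRD error
MCA: CPU 3 UNCOR PCC internal timer error
"""

# CPU number, severity ("COR (n)" or "UNCOR [PCC]"), error description
LINE = re.compile(r"MCA: CPU (\d+) (COR \(\d+\)|UNCOR(?: PCC)?) (.+)")


def tally(log):
    """Map each error description to the sorted list of cores reporting it."""
    cores = defaultdict(set)
    for line in log.splitlines():
        m = LINE.match(line)
        if m:
            cpu, _severity, error = m.groups()
            cores[error].add(int(cpu) // 2)   # assumed sibling pairing
    return {error: sorted(found) for error, found in cores.items()}


if __name__ == "__main__":
    for error, found in tally(LOG).items():
        print(f"{error}: cores {found}")
```

With that mapping, the corrected icache errors come from three different cores, while the uncorrected internal timer error is reported by CPU 3 in both dumps.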

For crash 4, a jump was not done properly. Original code:

Code:

   0xffffffff80d01100 <+224>:   mov    rdi,QWORD PTR [rbp-0x30]
   0xffffffff80d01104 <+228>:   test   rdi,rdi
   0xffffffff80d01107 <+231>:   je     0xffffffff80d01120 <vn_write+256>

You ended up at 0xffffffff80d01103, which is

Code:

(kgdb) x/i 0xffffffff80d01103
   0xffffffff80d01103 <vn_write+227>:   ror    BYTE PTR [rax-0x7b],1
(kgdb)

So it could be that $rax - 0x7b was 0xffffffffffffff85.

Much the same goes for crash 1:

Code:

   0xffffffff8108b557 <+39>:    ret
   0xffffffff8108b558 <+40>:    mov    rdi,rbx
   0xffffffff8108b55b <+43>:    add    rsp,0x8
   0xffffffff8108b55f <+47>:    pop    rbx

You ended up at 0xffffffff8108b55a, which is:

Code:

(kgdb) x/12i 0xffffffff8108b55a
   0xffffffff8108b55a <trap_check+42>:  fisttp WORD PTR [rax-0x7d]
   0xffffffff8108b55d <trap_check+45>:  (bad)

It would be interesting to know what gcc was doing to rub the CPU the wrong way, but I'd put my wager on a faulty CPU.
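
The arithmetic behind both cases can be checked quickly. A minimal Python sketch, using only the addresses and displacements shown above; `implied_rax` and `effective_address` are hypothetical helper names, and the conclusion about rax is an inference, since no register dump was posted.

```python
MASK64 = (1 << 64) - 1


def effective_address(base, disp):
    """base + signed displacement, wrapped to 64 bits as the CPU would."""
    return (base + disp) & MASK64


def implied_rax(disp, fault_addr):
    """Value rax must have held for [rax+disp] to hit fault_addr."""
    return (fault_addr - disp) & MASK64


# (name, start of the intended instruction, faulting instruction pointer,
#  displacement of the mis-decoded instruction, fault virtual address)
CASES = [
    ("crash 4", 0xffffffff80d01100, 0xffffffff80d01103, -0x7b, 0xffffffffffffff85),
    ("crash 1", 0xffffffff8108b558, 0xffffffff8108b55a, -0x7d, 0xffffffffffffff83),
]

if __name__ == "__main__":
    for name, insn_start, rip, disp, fault in CASES:
        rax = implied_rax(disp, fault)
        print(f"{name}: rip is {rip - insn_start} byte(s) into the intended "
              f"instruction, implied rax = {rax:#x}")
```

In both crashes the instruction pointer sits on the last byte of the preceding mov instead of on an instruction boundary, and both fault addresses are exactly what `[rax-0x7b]` and `[rax-0x7d]` produce when rax is 0.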

Thanks for sharing the hwloc output. I can't comment on it too much; I wanted to see the output so I could compare it to something I was reading. I can't interpret the MCE logs very well, so it's here for "archiving" purposes for now.

jbo@ · Oct 25, 2021

_martin said:
You didn't do any OS upgrade since then, correct? Just so that my VM is still on the same version as yours.

As far as I know I didn't. Here's uname -a from right now:

Code:

jbo@fbsd_beefy01 /u/h/j/p/malloy (main)> uname -a
FreeBSD fbsd_beefy01 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #0: Tue Aug 24 07:33:27 UTC 2021     root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64

_martin said:
It wouldn't hurt to have the backtrace for given crashes (you did paste bt for crash 0).

Are these the backtraces listed in /var/crash/core.txt.{N}?

_martin said:
It would be interesting to know what gcc was doing to rub the CPU the wrong way, but I'd put my wager on a faulty CPU.

Any ideas on how to provoke this? The machine in question has taken quite a beating over the last few days, running a multitude of different stress tests and regular workloads, intentionally running poudriere builds alongside my other builds, and so on.

_martin · Oct 25, 2021

Yes, it's the same. You can get those backtraces from the debugging session or from the file. It's safe to assume, though, that it's the same kind of issue, as 3 out of the 5 dumps you shared have this problem (dumps 0, 1 and 4).

Well, you said it yourself -- it seems you had issues when you were cross-compiling. Grab some similar project and try to cross-compile it on this machine.

I can imagine it's bugging you (pun intended) that you can't trigger it with the stress test. Note that you are experiencing issues deep in the kernel, in the trap handler, a routine that is called pretty much all the time. While not impossible, it's not likely that you found a bug there. You have MCA errors (the HW is telling you it has a problem with itself). You are crashing because you keep jumping incorrectly in the code.

For the stress part I'd focus on memory allocations, so stress -m 262144 or something like that. You have plenty of RAM, so you need to really stress it. Or use --vm-bytes to allocate larger chunks of memory.
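
A note on the options: as far as I know, in stress the -m/--vm value is the number of memory workers and --vm-bytes is the amount each worker allocates (256MB by default), so -m 262144 would ask for 262144 worker processes. A minimal Python sketch of sizing the two for this box (62GB, 12 hardware threads per the lstopo output); the 75% fraction and the one hour timeout are arbitrary assumptions, and `stress_command` is a hypothetical helper name.

```python
def stress_command(total_gb, workers, fraction=0.75, timeout="1h"):
    """Build a stress(1) command line that spreads `fraction` of RAM
    evenly over `workers` memory workers."""
    per_worker_mb = int(total_gb * 1024 * fraction / workers)
    return ["stress", "--vm", str(workers),
            "--vm-bytes", f"{per_worker_mb}M",
            "--timeout", timeout]


if __name__ == "__main__":
    # 62 GB total and 12 PUs, as reported by lstopo-no-graphics above
    print(" ".join(stress_command(62, 12)))
```

Leaving a quarter of the RAM free is only meant to keep the machine from swapping itself to death; raise the fraction if the goal is to touch as much memory as possible.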

jbo@ · Nov 8, 2021

I was really hoping that after some time I'd get another crash I can report with potentially more options but nope...

As before this machine is running rock solid. It's a dual boot with FreeBSD 13.0-RELEASE and Windows 10. I switch between the two a lot.

I've been throwing everything I've got at this machine. Stress tests of all kinds & extents, manual workloads, synthetic workloads, working on it while also running stress tests - both in Windows and FreeBSD. Absolutely nothing is happening.

I even overclocked the system and it's still running rock solid through all batteries of tests and "just working" on it.

_martin · Nov 8, 2021

I hate these kinds of heisenbugs, but that's life. It seems this bug was triggered by some sort of i-cache issue. But why the arm gcc was rubbing it just the right way is really hard to say. You do have proof that an MCA was logged, though, and that is a warning. I don't think this is a 'self-healing' problem (meant as a little joke). That's why I suggested reseating the CPU and memory modules and doing a little cleanup there too. Maybe the sun was really affecting your CPU after all ;-).

jbo@ · Dec 24, 2021

Unfortunately I don't have any directly helpful news on this. I have been using this particular machine every single day for 8 to 14 hours non-stop (sometimes also with Windows 10, not always just FreeBSD 13). As this is my main workhorse I literally throw everything at it and it is usually not just idling around. The system is rock solid.

As mentioned in my previous post I also overclocked the system (CPU / RAM) and left it at that ever since and it just continues rocking day after day without a single issue.

I haven't touched the actual hardware of this machine in over two years. I didn't re-socket or re-paste anything. These crashes happened - now they don't anymore.
I still run the occasional stress test & memtest on this machine overnight because I'm paranoid. Nothing to report tho.

_martin · Dec 24, 2021

Thanks for reporting back. That compilation job you had must have rubbed the bug in just the right way. I guess there's no point in testing it any further. Maybe the bug will resurface when you stop looking for it.