NMI CPU panics under a certain load

I've set up FreeBSD 13.1 on rather old hardware: an Intel SE7320SP2 motherboard with two dual-core Xeon CPUs and a SATA II HDD. There is also a StarTech PCI 1000Base-T NIC.
I'm using it as a proxy for a dedicated LAN and a spare ISP, and the server goes down when loaded up considerably, rebooting after an NMI panic.
kgdb /boot/kernel/kernel /var/crash/vmcore.% shows me an NMI on several CPU IDs, predominantly on the 2nd Xeon (CPU2 and CPU3), with ISA 28 and EISA FF.
Can I suspect an aberration or malfunction of the VRM, or even of the CPU itself? Or do these panics (10 events over almost 4 months) have another cause?
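For reference, this is roughly how I'm reading the dumps; the vmcore index below is only an example, the real files sit in /var/crash:

# the dump number will differ; check /var/crash for the actual file
kgdb /boot/kernel/kernel /var/crash/vmcore.3
(kgdb) bt            # backtrace of the thread that hit the panic
(kgdb) info threads  # see what the other CPUs were doing at the time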
 
A general suggestion: it's always better to avoid ambiguity by sharing the actual log / error / etc. messages.
 
A similar panic event occurred just minutes ago. Here's the KGDB output from it.
panic_backtrace.JPG

UPD: after 1h07m and 42m respectively, two more similar panic events occurred with NMI/cpu3. Here cpu3 means either the 2nd Xeon processor or its first core, the second core being cpu2.
 
There are multiple sources / possibilities for an NMI. In general, it can be hard to pinpoint a specific cause.
You can try searching your motherboard's manual for NMI -- there are BIOS settings, etc. -- and see if you can narrow down the possibilities.
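As a small sanity check (not board-specific): FreeBSD 13.x should have a machdep.panic_on_nmi sysctl that decides how the kernel reacts to an NMI, which at least confirms why you get a panic rather than a drop into the debugger:

# show the knob's description and current value (OID name assumed present on 13.x)
sysctl -d machdep.panic_on_nmi
sysctl machdep.panic_on_nmi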
 
The manual states that it can be either a CPU IErr or a thermal trip. The first Irwindale Xeon has been affected only once, a couple of months ago; nearly all panics arise from NMI/cpu3 and NMI/cpu2, both of which mean the second Irwindale (cpu3 is the CPU itself and cpu2 is its additional thread), and the duration of the event is shorter than expected for a thermal event. Could it be an IErr signal from that CPU indicating that it has detected some other malfunction?
Also, when I loaded the server's memory (4x2 GB FB-DIMM DDR2) with an nmap TCP test, the system panicked on a memory issue -- could these facts be linked? Could it work like this: under a certain load the CPU addresses a particular memory range that is bad or contains one or more bad locations; the processor encounters the memory failure, asserts IErr, that triggers the NMI, and the panic follows.
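To rule the RAM in or out, I suppose I can run a userspace pass with memtester from ports first (the size below is only an example, leaving headroom for the running system), and boot memtest86+ when the box can be taken down:

pkg install memtester
# lock and repeatedly test ~2 GB of RAM; pick a size that leaves room for the OS
memtester 2048M 3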
 
Modern machines and CPUs have something called the Machine Check Architecture (MCA). I'm wondering if these old machines simply triggered an NMI when such an error happened.
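You could check what the kernel reports; assuming the hw.mca sysctl tree and MCA logging are present on that CPU generation, something like this shows whether any machine-check records were captured:

# does the kernel have MCA support active, and has it logged any records?
sysctl hw.mca
dmesg | grep -i -e mca -e "machine check"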

I honestly wouldn't spend too much time on it. This is decades-old hardware, already written off many times over. It's probably cheaper just to buy a new server to replace this old beast. Even if you spend the time and find out the mainboard has gone bad, you'd still need to replace it, which would be quite a challenge.
 
Does the IPMI log say anything about a received NMI? What about dmesg/syslog? Is the only NMI message you see the one shown in the picture?
The 0x28 status comes from the 6300 ICH, though the important bits are checked in the handler itself.
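If the BMC is reachable, the SEL would be the first thing I'd check for a matching IERR / thermal / front-panel NMI entry; with ipmitool from packages and the ipmi(4) driver loaded, roughly:

kldload ipmi          # if not already loaded
ipmitool sel elist    # dump the BMC's system event log
ipmitool sel elist | grep -i -e nmi -e ierr -e therm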

At this point, and with the information shared, it could be anything.
 