Solved: RAM parity error, likely hardware failure. panic: NMI indicates hardware failure

I upgraded a FreeBSD 12.2 server to 12.3. After the upgrade the server got into a boot loop; these are the last messages before it reboots:

Code:
cpu0: <ACPI CPU> numa-domain 0 on acpi0
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
uart0: console (115200,n,8,1)
uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0
orm0: <ISA Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xcdfff pnpid ORM0000 on isa0
NMI ISA a0, EISA ff
RAM parity error, likely hardware failure.
panic: NMI indicates hardware failure
cpuid = 54
time = 1
KDB: stack backtrace:
Uptime: 1s
I had no issues prior to the upgrade. The memory was not touched and no hardware changes were made during the upgrade.
Does anyone have an idea?
 
When you have memory problems it's wise to eliminate the easy things first.

I would get a completely different operating system, e.g. Linux, on a USB stick and see if it boots. [Edit: memtest86 is probably much easier.]

If the memory problems persist, then you can be reasonably sure it's hardware-related. If so...

Clean the contacts and re-seat each module. Using anti-static strap and mat (or an empty anti-static bag if you don't have a mat):
  1. take out the memory module and rest it on the mat (avoid touching any conductors);
  2. use a high quality pencil eraser to clean the gold contacts; and
  3. firmly re-seat the module back into the motherboard.
If the problem persists, can you reduce the number of memory modules installed? If so, do so. Mark each module uniquely, and then test each module, either alone or in combination, and in each permissible slot (the motherboard manual will tell you how the slots can be used). The idea is to test each module in each permissible slot, and record every result. This will often identify a faulty module (and far less often, a faulty slot).
 
The server boots from a CentOS live CD without issues. I also tried FreeBSD 12.3 and 13 live images; both fail with the same error as above.
 
The OS doesn't change any of the hardware settings, so it's not possible for FreeBSD to break your system. What is possible is that FreeBSD has a slightly higher memory usage than CentOS, thus hitting the faulty memory more easily.

Did you run a memtest yet? I suggest using sysutils/memtest86 and burning the image to a USB stick. Then boot from that stick to test the memory.
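As a rough sketch of that step (the image filename and target device are assumptions; check dmesg for your stick's actual device node, and remember that dd destroys everything on the target):

```shell
# WARNING: dd overwrites the target device completely.
# /dev/da0 and the image filename are assumptions; confirm the USB stick's
# node with "dmesg" or "geom disk list" before running this.
dd if=memtest86-usb.img of=/dev/da0 bs=1m conv=sync status=progress
```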
 
What is possible is that FreeBSD has a slightly higher memory usage than CentOS, thus hitting the faulty memory more easily.
To add on here, take note of what gpw928 is also saying. By moving memory modules around you may be able to see if the error is associated with a specific module. But you need to keep good track of what module is where.
The first thing I typically try is simply reseating the modules: don't take them out, just press each one down into the connector along its whole length.
 
The first thing I typically try is simply reseating the modules: don't take them out, just press each one down into the connector along its whole length.
Swapping them around is a good suggestion; I always test that first, to see if the memory error moves along with the module. Swapping also reseats them (you have to take them out). Kill two birds with one stone.
 
The server is in a data center which is quite far away from where I am.
Update on what I did: I downgraded the server firmware. Now the OS boots, networking starts, and the server replies to ICMP, BUT ssh and other network services do not work.
I have a console screen on the server, though it is very slow; it takes 5 to 10 minutes to get feedback after each keypress.
 
The server
No ECC memory? Because ECC would cause MCA errors to be logged when it detects errors. ECC typically is able to correct those errors but the module would still need to be replaced.
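If the machine does have ECC, corrected errors would show up as MCA records in the kernel message buffer; a quick (hedged) way to check:

```shell
# Corrected ECC errors are reported via Machine Check Architecture records,
# which FreeBSD logs as "MCA: ..." lines in the kernel message buffer.
dmesg | grep -i mca
```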

I have a console screen on the server
Through IPMI? Modern IPMI managed servers can boot a disk remotely. It's horribly slow to do over the internet but it'll work nonetheless. Good enough to boot that memtest image.
 
I like all the memory swapping ideas. That is where I would start.
Determine if a module or socket error is reproducible.

I wanted to add a step before re-seating: blowing out the chassis.
Take compressed air and blow it out with all RAM modules installed. Gently.
If you can isolate an individual socket giving errors, inspect the socket with a magnifier.
Perhaps use compressed air here as well; a dust bunny might have lodged in the socket. Very gentle air.
 
The fix (or workaround) for my issue was:
  1. downgrade the server firmware to a previous version where FreeBSD boots and stays up
  2. it takes a few seconds for the server to reach the default gateway, which was long enough for network services, including ssh, to fail at startup
  3. I was able to work with the console: run a command, close the console, reopen the IPMI console, voila, the screen is refreshed. I had to close and re-open the console after each command, but I was able to start ssh (and finish the upgrade via ssh)
  4. implement netwait: now the server waits up to 60 s before starting network services; within that time the default gateway becomes available, and ssh as well as the other network services work. Yay!
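The netwait step above maps to the netwait rc script. A minimal sketch of the relevant /etc/rc.conf lines (the gateway IP and interface name are placeholders for this server's actual values):

```shell
# /etc/rc.conf -- netwait delays network-dependent services until a host answers ping
netwait_enable="YES"     # run the netwait rc script at boot
netwait_ip="192.0.2.1"   # placeholder: put the default gateway's IP here
netwait_timeout="60"     # give up after 60 seconds
netwait_if="em0"         # placeholder: interface whose link to wait for (optional)
```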
As I mentioned, the server is far away from me, so doing any sort of physical work on it is not possible, at least not until I travel to where the server is. That has to be sorted out eventually, though.
 