Same MCA errors on two different machines

rs-joy · Apr 12, 2018

We have two Supermicro X9DRi-F storage servers running FreeNAS with FreeBSD 11.1-STABLE.
Since upgrading from FreeBSD 9 to 11 we have encountered strange behavior quite similar to this thread.
The systems run fine for a few hours to a few days, then without an obvious trigger start spamming MCA memory errors.
Unfortunately the system then becomes unresponsive and can only be restarted with a power reset.
I followed the troubleshooting process in the thread above, and the memory addresses in the errors are not allocated to any physical memory. They simply don't exist.

How is this possible?
Is there a workaround?

SirDice · Apr 12, 2018

Can you post the errors? MCA errors are usually also logged to /var/log/messages. Identifying the faulty module is a little tricky and requires some interpretation. It would also be helpful to post the full output of dmidecode(8) to Pastebin (or a similar service).

In any case, MCA errors almost always point to hardware failures. When the system is freshly booted not all memory is being used. After some time of running most of the memory will be used for things like filesystem caches, process caches and a few other things. Judging by the issues you're having I suspect memory errors and probably one or more broken memory modules.

ondra_knezour · Apr 12, 2018

The last time I saw MCA memory errors, the cause was the server overheating. X9 is not the newest series, and you see the errors only after the machine has had some time to warm up, so I would check the fans and remove dust first. If you have a way to log the memory thermal sensor readings, it may help to assess what is going on there.

rs-joy · Apr 13, 2018

Thanks for your quick replies.

Here are the messages:
https://paste.ee/p/jUBej

And here is the dmidecode
https://paste.ee/p/BpLbb

What I noticed is that the errors occur in the 0x00C00000000 range, but the last (possible) memory address on this board is 0x00BFFFFFFFF.

The system event log shows this error: Assertion: Memory| Event = Correctable ECC@DIMM?6(CPU1)

I don't think this is a hardware failure, because it occurred after updating the operating system on two separate machines.

Could this maybe be solved with a BIOS update?
I think we are running a few versions behind.

SirDice · Apr 13, 2018

rs-joy said:
What I noticed is that the errors occur in the 0x00C00000000 range, but the last (possible) memory address on this board is 0x00BFFFFFFFF.

Memory is split up into two 24GB segments:

Code:

Handle 0x002E, DMI type 19, 31 bytes
Memory Array Mapped Address
	Starting Address: 0x00000000000
	Ending Address: 0x005FFFFFFFF
	Range Size: 24 GB
	Physical Array Handle: 0x002D
	Partition Width: 1

Code:

Handle 0x0040, DMI type 19, 31 bytes
Memory Array Mapped Address
	Starting Address: 0x00600000000
	Ending Address: 0x00BFFFFFFFF
	Range Size: 24 GB
	Physical Array Handle: 0x003F
	Partition Width: 1

That gives the machine a total of 48GB. Is this correct? If the machine is supposed to have more memory, then that would be the first clue.
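To double-check the arithmetic, here is a minimal Python sketch that recomputes the size of each range from the start and end addresses in the dmidecode output and tests whether a given address falls inside either of them. The helper names `range_size_gib` and `is_mapped` are made up for illustration; only the addresses come from this thread.

```python
# DMI type 19 ranges copied from the dmidecode output (inclusive bounds).
RANGES = [
    (0x00000000000, 0x005FFFFFFFF),
    (0x00600000000, 0x00BFFFFFFFF),
]

GIB = 1 << 30


def range_size_gib(start, end):
    """Size of an inclusive address range in GiB."""
    return (end - start + 1) / GIB


def is_mapped(addr):
    """True if addr lies inside one of the DMI-reported ranges."""
    return any(start <= addr <= end for start, end in RANGES)


total_gib = sum(range_size_gib(start, end) for start, end in RANGES)
print(total_gib)                  # 48.0
print(is_mapped(0x00BFFFFFFFF))   # True, last mapped address
print(is_mapped(0x00C00000000))   # False, first address past the mapping
```

Anything at or above 0x00C00000000 is outside what the firmware reports as mapped, which matches the observation quoted above.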

Also have a closer look: not all errors are in that strange non-existent range. I also see errors in other regions:

Code:

Apr 13 00:08:13 freenas01 MCA: Bank 9, Status 0xcc016250000800c1
Apr 13 00:08:13 freenas01 MCA: Global Cap 0x0000000001000c12, Status 0x0000000000000000
Apr 13 00:08:13 freenas01 MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 39
Apr 13 00:08:13 freenas01 MCA: CPU 19 COR (1417) OVER MS channel 1 memory error
Apr 13 00:08:13 freenas01 MCA: Address 0x871048240
Apr 13 00:08:13 freenas01 MCA: Misc 0x122100000000208c

Apr 13 00:08:13 freenas01 MCA: Global Cap 0x0000000001000c12, Status 0x0000000000000000
Apr 13 00:08:13 freenas01 MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 32
Apr 13 00:08:13 freenas01 MCA: CPU 12 COR (80) OVER RD channel 1 memory error
Apr 13 00:08:13 freenas01 MCA: Address 0x6cc56d840
Apr 13 00:08:13 freenas01 MCA: Misc 0x140080886

So, I'd probably start with replacing the ones you can match to a specific memory module. COR means it has corrected an error, so ECC does what it's supposed to do. You should definitely replace those modules.
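If you want to sort through a long log, something like the following Python sketch could pull the addresses out and separate the ones inside the mapped memory from the ones beyond it. It assumes the "MCA: Address 0x..." line format shown above and takes 0x00BFFFFFFFF from the dmidecode output as the last mapped address; the function names are hypothetical and the sample text is just the address-bearing lines from the excerpt.

```python
import re

# Sample lines copied from the log excerpt in this thread.
LOG = """\
Apr 13 00:08:13 freenas01 MCA: CPU 19 COR (1417) OVER MS channel 1 memory error
Apr 13 00:08:13 freenas01 MCA: Address 0x871048240
Apr 13 00:08:13 freenas01 MCA: CPU 12 COR (80) OVER RD channel 1 memory error
Apr 13 00:08:13 freenas01 MCA: Address 0x6cc56d840
"""

ADDR_RE = re.compile(r"MCA: Address (0x[0-9a-fA-F]+)")
LAST_MAPPED = 0x00BFFFFFFFF  # ending address of the second DMI range


def mca_addresses(text):
    """Return every address reported on an 'MCA: Address' line."""
    return [int(m.group(1), 16) for m in ADDR_RE.finditer(text)]


def split_by_mapping(addrs, last_mapped=LAST_MAPPED):
    """Split addresses into (inside mapped memory, beyond mapped memory)."""
    inside = [a for a in addrs if a <= last_mapped]
    outside = [a for a in addrs if a > last_mapped]
    return inside, outside


inside, outside = split_by_mapping(mca_addresses(LOG))
print([hex(a) for a in inside])
print([hex(a) for a in outside])
```

Run against the full /var/log/messages, the first list is the set of addresses worth trying to match to a module.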

rs-joy · Apr 14, 2018

I scanned through the error log again and it seems there are errors from all populated modules controlled by the second CPU.
(ranges 0x6, 0x7, 0x8, 0x9, 0xA, 0xB)
(these correspond to the populated E1, E2, F1, F2, G1, H1 slots on the board)
We did a swap and the issue persisted.
So I would think the CPU is bad.
The next step I will attempt is to swap the CPUs.
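
For reference, this is roughly how I grouped the addresses, written as a small Python sketch. It assumes each range is the 4 GiB block selected by the address bits above bit 31, and that the blocks map one-to-one onto the slots in the order listed above. That mapping is only my reading of the log; memory interleaving may well make it inaccurate. The names `SLOT_BY_RANGE` and `slot_for` are made up.

```python
# Assumed mapping from 4 GiB block index to DIMM slot (my reading, not verified).
SLOT_BY_RANGE = {
    0x6: "E1",
    0x7: "E2",
    0x8: "F1",
    0x9: "F2",
    0xA: "G1",
    0xB: "H1",
}


def range_index(addr):
    """Index of the 4 GiB block containing addr."""
    return addr >> 32


def slot_for(addr):
    """Slot assumed to back addr, or None if the block is not in the table."""
    return SLOT_BY_RANGE.get(range_index(addr))


print(slot_for(0x871048240))    # F1
print(slot_for(0x6cc56d840))    # E1
print(slot_for(0x00C00000000))  # None, beyond the last mapped block
```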
