Server Crashing, MCA: CPU x UNCOR memory error

Hi all,

I've got a server that has been rock-solid for several years but started crashing 3 days ago. It has crashed 3 times in those 3 days. Two of the crashes were overnight, when very little would have been happening; the other was in the early morning, shortly after a reboot, and also at a time of little activity.

Two of the times the system was completely frozen: no error messages on the screen, nothing apparent in the logs, and completely unresponsive to ping, keyboard, etc.

One time (the middle time), the screen was filled with errors like:

Code:
MCA: Vendor "GenuineIntel", ID 0x106a5, APIC ID 7
MCA: CPU 7 UNCOR PCC AC channel 0 memory error
MCA: Misc 0x0
MCA: Bank 8, Status 0xba0000000008000b0
MCA: Global Cap 0x00000000000001c09, Status 0x0000000000000004

This message seems to be repeated for each of the CPUs (cpu0-cpu15).

Am I correct that this is an ECC error and that the DIMM in bank 8 is throwing some errors? The computer has 8x2gb memory sticks.

Other than removing what I believe is the offending DIMM, is there anything else to do?

Here's the output of dmidecode:

Code:
# dmidecode 2.11
SMBIOS 2.6 present.

Handle 0x002B, DMI type 16, 15 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: Multi-bit ECC
	Maximum Capacity: 96 GB
	Error Information Handle: Not Provided
	Number Of Devices: 4

Handle 0x002D, DMI type 17, 28 bytes
Memory Device
	Array Handle: 0x002B
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 2048 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM0
	Bank Locator: BANK0
	Type: Other
	Type Detail: Other
	Speed: 1333 MHz
	Manufacturer: Manufacturer00
	Serial Number: SerNum00
	Asset Tag: AssetTagNum0
	Part Number: ModulePartNumber00
	Rank: Unknown

Handle 0x002F, DMI type 17, 28 bytes
Memory Device
	Array Handle: 0x002B
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: No Module Installed
	Form Factor: DIMM
	Set: None
	Locator: DIMM1
	Bank Locator: BANK1
	Type: Other
	Type Detail: Other
	Speed: 1333 MHz
	Manufacturer: Manufacturer01
	Serial Number: SerNum01
	Asset Tag: AssetTagNum1
	Part Number: ModulePartNumber01
	Rank: Unknown

Handle 0x0031, DMI type 17, 28 bytes
Memory Device
	Array Handle: 0x002B
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 2048 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM2
	Bank Locator: BANK2
	Type: Other
	Type Detail: Other
	Speed: 1333 MHz
	Manufacturer: Manufacturer02
	Serial Number: SerNum02
	Asset Tag: AssetTagNum2
	Part Number: ModulePartNumber02
	Rank: Unknown

Handle 0x0033, DMI type 17, 28 bytes
Memory Device
	Array Handle: 0x002B
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 2048 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM3
	Bank Locator: BANK3
	Type: Other
	Type Detail: Other
	Speed: 1333 MHz
	Manufacturer: Manufacturer03
	Serial Number: SerNum03
	Asset Tag: AssetTagNum3
	Part Number: ModulePartNumber03
	Rank: Unknown

Handle 0x0035, DMI type 16, 15 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: Multi-bit ECC
	Maximum Capacity: 96 GB
	Error Information Handle: Not Provided
	Number Of Devices: 4

Handle 0x0037, DMI type 17, 28 bytes
Memory Device
	Array Handle: 0x0035
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 2048 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM0
	Bank Locator: BANK0
	Type: Other
	Type Detail: Other
	Speed: 1333 MHz
	Manufacturer: Manufacturer00
	Serial Number: SerNum00
	Asset Tag: AssetTagNum0
	Part Number: ModulePartNumber00
	Rank: Unknown

Handle 0x0039, DMI type 17, 28 bytes
Memory Device
	Array Handle: 0x0035
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: No Module Installed
	Form Factor: DIMM
	Set: None
	Locator: DIMM1
	Bank Locator: BANK1
	Type: Other
	Type Detail: Other
	Speed: 1333 MHz
	Manufacturer: Manufacturer01
	Serial Number: SerNum01
	Asset Tag: AssetTagNum1
	Part Number: ModulePartNumber01
	Rank: Unknown

Handle 0x003B, DMI type 17, 28 bytes
Memory Device
	Array Handle: 0x0035
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 2048 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM2
	Bank Locator: BANK2
	Type: Other
	Type Detail: Other
	Speed: 1333 MHz
	Manufacturer: Manufacturer02
	Serial Number: SerNum02
	Asset Tag: AssetTagNum2
	Part Number: ModulePartNumber02
	Rank: Unknown

Handle 0x003D, DMI type 17, 28 bytes
Memory Device
	Array Handle: 0x0035
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 2048 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM3
	Bank Locator: BANK3
	Type: Other
	Type Detail: Other
	Speed: 1333 MHz
	Manufacturer: Manufacturer03
	Serial Number: SerNum03
	Asset Tag: AssetTagNum3
	Part Number: ModulePartNumber03
	Rank: Unknown
 
Kuzbad said:
Am I correct that this is an ECC error and that the DIMM in bank 8 is throwing some errors? The computer has 8x2gb memory sticks.
If nothing changed software-wise (no updates, no new ports, etc.), a hardware fault would be the most likely cause. It should be fairly simple to test: just remove the offending memory stick and see if that improves things.
 
Server boards usually keep an event log for things like this memory error; maybe you can find out more about it in the BIOS.

Take a look at the mainboard manual for the memory configuration. I guess you need to remove the RAM modules in pairs, and if the mainboard has a dual-socket CPU layout you may need to remove 4 modules (2 pairs).
 
Kuzbad said:
Code:
MCA: Vendor "GenuineIntel", ID 0x106a5, APIC ID 7
MCA: CPU 7 UNCOR PCC AC channel 0 memory error
MCA: Misc 0x0
MCA: Bank 8, Status 0xba0000000008000b0
MCA: Global Cap 0x00000000000001c09, Status 0x0000000000000004
Am I correct that this is an ECC error and that the DIMM in bank 8 is throwing some errors? The computer has 8x2gb memory sticks.
"UNCOR PCC AC" means that it is an uncorrectable error, the processor context is corrupt, and that it was an address or command error. "Bank" in this case refers to the memory controller bank, which may or may not be associated with memory module 8.

I'm more concerned with the "AC" status, because this often indicates a problem with the memory controller. My documentation is older than your Core i7-9xx / Xeon W35xx CPU, so I can't parse the model-specific fields accurately.
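
For anyone who wants to check the flags by hand, here is a rough Python sketch of how the architectural part of the Status value breaks down. It only covers the generic IA32_MCi_STATUS flag bits and the compound "memory controller error" code; the model-specific fields are not decoded. The function and dictionary names are made up for illustration. Note also that the Status value as posted has 17 hex digits, one more than a 64-bit register can hold, so the example assumes a stray zero in the middle and uses 0xba000000008000b0; that is a guess, although the top flag byte and the low 16 bits do not depend on it.

```python
# Sketch only: decodes the architectural IA32_MCi_STATUS bits, nothing
# model-specific. All names here are illustrative, not from any library.

MC_STATUS_VAL = 1 << 63    # register holds valid error information
MC_STATUS_OVER = 1 << 62   # an earlier error was overwritten
MC_STATUS_UC = 1 << 61     # uncorrected error (printed as "UNCOR")
MC_STATUS_EN = 1 << 60     # reporting for this error was enabled
MC_STATUS_MISCV = 1 << 59  # the MISC register is valid
MC_STATUS_ADDRV = 1 << 58  # the ADDR register is valid
MC_STATUS_PCC = 1 << 57    # processor context corrupt (printed as "PCC")

# Memory controller errors use the compound code 000F 0000 1MMM CCCC,
# where MMM is the transaction type and CCCC is the channel.
MEM_TRANSACTION = {
    0: "generic/undefined",
    1: "read",
    2: "write",
    3: "address/command (AC)",
    4: "scrub",
}


def decode_mc_status(status):
    """Return a dict describing the architectural fields of one status value."""
    code = status & 0xFFFF  # MCA error code lives in bits 15:0
    result = {
        "valid": bool(status & MC_STATUS_VAL),
        "overflow": bool(status & MC_STATUS_OVER),
        "uncorrected": bool(status & MC_STATUS_UC),
        "enabled": bool(status & MC_STATUS_EN),
        "misc_valid": bool(status & MC_STATUS_MISCV),
        "addr_valid": bool(status & MC_STATUS_ADDRV),
        "pcc": bool(status & MC_STATUS_PCC),
        "memory_error": False,
    }
    # Ignore bit 12 (the F bit) and test for the 0000 1xxx xxxx pattern.
    if (code & 0xEF80) == 0x0080:
        channel = code & 0xF
        result["memory_error"] = True
        result["transaction"] = MEM_TRANSACTION.get((code >> 4) & 0x7, "reserved")
        result["channel"] = None if channel == 0xF else channel  # 0xF = unspecified
    return result


if __name__ == "__main__":
    # ASSUMPTION: the posted value with one middle zero dropped.
    for name, value in decode_mc_status(0xBA000000008000B0).items():
        print(f"{name}: {value}")
```

With the assumed value, the uncorrected and PCC flags come out set and the error code decodes as an address/command memory error on channel 0, which lines up with the "UNCOR PCC AC channel 0 memory error" line the kernel printed. The "Bank 8" in the message is just the index of the status register that was read, which the sketch does not use at all.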

I would suggest running something like Memtest86+ to see if you can trigger the error in a more predictable fashion. If you can, I'd suggest shuffling all of the memory modules one slot over and seeing if the "bank" stays the same or changes. If it stays the same, you probably have a CPU / motherboard issue.

Have you checked your CPU fan to make sure it turns freely and is spinning normally when the system is operating? If you decide to use a compressed-gas duster to clean it, make sure the system is powered off and cold before spraying it, and also hold the fan blades to keep the fan from spinning while you spray it. Failing to heed these warnings can result in damage to the fan, CPU, or both.
 