MCA Errors

Hello,
I've noticed several suspicious messages in the system log:
Code:
<2>1 2022-08-15T22:52:42.537378+00:00 alpha kernel - - - MCA: Bank 8, Status 0x88000040000200cf
<2>1 2022-08-15T22:52:42.538644+00:00 alpha kernel - - - MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
<2>1 2022-08-15T22:52:42.538705+00:00 alpha kernel - - - MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 49
<2>1 2022-08-15T22:52:42.538744+00:00 alpha kernel - - - MCA: CPU 19 COR (1) MS channel ?? memory error
<2>1 2022-08-15T22:52:42.538781+00:00 alpha kernel - - - MCA: Misc 0x1020408000057100
<2>1 2022-08-16T22:46:18.037014+00:00 alpha kernel - - - MCA: Bank 8, Status 0x88000040000200cf
<2>1 2022-08-16T22:46:18.038387+00:00 alpha kernel - - - MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
<2>1 2022-08-16T22:46:18.038450+00:00 alpha kernel - - - MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 49
<2>1 2022-08-16T22:46:18.038490+00:00 alpha kernel - - - MCA: CPU 19 COR (1) MS channel ?? memory error
<2>1 2022-08-16T22:46:18.038529+00:00 alpha kernel - - - MCA: Misc 0x1020408000056c00

If these are memory errors, were they corrected? Also, there seems to be no information that would identify the exact module location.
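
The Status value itself can be unpacked to check both points. The sketch below assumes the architectural IA32_MCi_STATUS bit layout from the Intel SDM; the function and key names are made up for illustration, and the corrected-error count field is only meaningful on CPUs that advertise it in the Global Cap register.

```python
# Sketch: decode an IA32_MCi_STATUS value using the architectural bit layout
# from the Intel SDM machine-check chapter. Names here are illustrative only.

# MMM sub-field of a memory controller error code (000F 0000 1MMM CCCC).
MMM_NAMES = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status):
    def bit(n):
        return bool((status >> n) & 1)

    code = status & 0xFFFF
    decoded = {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),       # False means the error was corrected
        "enabled": bit(60),
        "misc_valid": bit(59),        # the Misc register holds extra data
        "addr_valid": bit(58),        # False means no address was latched
        "context_corrupt": bit(57),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": code,
    }
    # Memory controller errors match 000F 0000 1MMM CCCC (F = filter bit 12).
    if (code & 0xEF80) == 0x0080:
        channel = code & 0xF
        decoded["memory_op"] = MMM_NAMES.get((code >> 4) & 0x7, "reserved")
        # Channel value 15 means "channel not specified".
        decoded["channel"] = None if channel == 0xF else channel
    return decoded


if __name__ == "__main__":
    for key, value in decode_mci_status(0x88000040000200CF).items():
        print(f"{key}: {value}")
```

For the value logged above, this reports a valid, corrected error (uncorrected is False) with a count of 1, raised by memory scrubbing, with no channel specified and no valid address. That agrees with the kernel's own "COR (1) MS channel ??" summary and would explain why no address line is printed.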
 
This might help:

 
Not really: there is no address entry here, and the channel is not detected either.
Also, ipmitool sel elist shows no memory-related messages.

Could it be something else, if not memory?
 
If you search the internet for that message, it seems to indicate an issue with bank 8 of your RAM.

What is your RAM set-up? What is the motherboard?
 
Asus Z8NR-D12 with 2 Xeons E56xx
There is no BANK 8 in dmidecode output:
Code:
abishai@alpha:~ % doas dmidecode -t memory
# dmidecode 3.4
Scanning /dev/mem for entry point.
SMBIOS 2.5 present.

Handle 0x0036, DMI type 16, 15 bytes
Physical Memory Array
    Location: System Board Or Motherboard
    Use: System Memory
    Error Correction Type: Multi-bit ECC
    Maximum Capacity: 96 GB
    Error Information Handle: Not Provided
    Number Of Devices: 12

Handle 0x0038, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_A1
    Bank Locator: BANK0
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: 2FBC69FC
    Asset Tag: AssetTagNum0
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x003A, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_A2
    Bank Locator: BANK1
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: 39FF2843
    Asset Tag: AssetTagNum1
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x003C, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_B1
    Bank Locator: BANK2
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: 2BBC69FC
    Asset Tag: AssetTagNum2
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x003E, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_B2
    Bank Locator: BANK3
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: B7C57AE0
    Asset Tag: AssetTagNum3
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x0040, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_C1
    Bank Locator: BANK0
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: E26F71DF
    Asset Tag: AssetTagNum4
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x0042, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_C2
    Bank Locator: BANK1
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: 4ABB69FC
    Asset Tag: AssetTagNum5
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x0044, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_D1
    Bank Locator: BANK2
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: 26BC69FC
    Asset Tag: AssetTagNum6
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x0046, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_D2
    Bank Locator: BANK3
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: F33D69E7
    Asset Tag: AssetTagNum7
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x0048, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_E1
    Bank Locator: BANK0
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: 3ABC69FC
    Asset Tag: AssetTagNum8
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x004A, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_E2
    Bank Locator: BANK1
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: 1E6F71DF
    Asset Tag: AssetTagNum9
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x004C, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_F1
    Bank Locator: BANK2
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: 31BC69FC
    Asset Tag: AssetTagNum10
    Part Number: 36JSZF1G72PZ-1G4D1

Handle 0x004E, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x0036
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 44 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_F2
    Bank Locator: BANK3
    Type: DDR3
    Type Detail: None
    Speed: 1333 MT/s
    Manufacturer: Micron      
    Serial Number: C96F71DF
    Asset Tag: AssetTagNum11
    Part Number: 36JSZF1G72PZ-1G4D1
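
A possible reason BANK8 does not show up here: the "Bank" in an MCA record is usually a machine-check register bank inside the CPU, not a DIMM bank locator. The sketch below decodes the logged Global Cap value, assuming the architectural IA32_MCG_CAP layout from the Intel SDM; the function and key names are illustrative.

```python
# Sketch: decode IA32_MCG_CAP (the "Global Cap" value in the MCA record).
# Bit positions follow the architectural layout in the Intel SDM.

def decode_mcg_cap(cap):
    return {
        "bank_count": cap & 0xFF,                      # number of MC banks
        "ctl_present": bool((cap >> 8) & 1),           # IA32_MCG_CTL exists
        "ext_present": bool((cap >> 9) & 1),           # extended state MSRs
        "cmci_present": bool((cap >> 10) & 1),         # corrected-error interrupt
        "threshold_status_present": bool((cap >> 11) & 1),
        "software_recovery": bool((cap >> 24) & 1),
    }


if __name__ == "__main__":
    cap = decode_mcg_cap(0x0000000000001C09)
    print(cap)
    print("machine-check banks:", list(range(cap["bank_count"])))
```

With 0x1c09 the bank count is 9, so the CPU exposes machine-check banks 0 through 8, and Bank 8 is simply the last of them. Which hardware unit reports through that bank is model-specific; on this CPU generation it is commonly described as the integrated memory controller, but that is worth confirming against the processor documentation.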

Did you try to swap the memory modules and check if the message bank is different?
Well, since there have been only 2 messages so far, I'm hoping it was some random aberration. Without knowing the module location, all of the modules would have to be moved. Also, I don't understand why the address information is not available; other people usually get the address and can pinpoint the bank by address range.

Can we narrow the search by assuming that, if CPU 19 is on the second die, BANK8 must be connected to it?
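
One way to test the second-die assumption is to split the logged APIC ID 49 into package, core, and thread fields. The sketch below is only a guess at the layout: the field widths (1 SMT bit, 4 core bits) are assumptions for this CPU family and should be confirmed from CPUID leaf 0xB on the machine itself. The function name is hypothetical.

```python
# Sketch: split an x2APIC-style ID into (package, core, smt) fields.
# ASSUMPTION: 1 SMT bit and 4 core bits; real widths come from CPUID leaf 0xB.

def split_apic_id(apic_id, smt_bits=1, core_bits=4):
    smt = apic_id & ((1 << smt_bits) - 1)
    core = (apic_id >> smt_bits) & ((1 << core_bits) - 1)
    package = apic_id >> (smt_bits + core_bits)
    return package, core, smt


if __name__ == "__main__":
    package, core, smt = split_apic_id(49)
    print(f"package={package} core={core} smt={smt}")
```

Under those assumed widths, APIC ID 49 falls in package 1, i.e. the second socket. If that holds, the error would come from the second processor's memory controller, which narrows the search to the DIMM slots wired to that socket; which slots those are has to be taken from the board manual.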
 
Are MCA errors the same thing as MCE exceptions? We've been pursuing that issue on Linux (AMD Ryzen machines) for >= 2 years. Random reboots due to mostly fake errors. It was all firmware related and the latest firmware upgrades by e.g. ASUS resolved many of them. The thing is that those errors also showed some issues with memory banks but all of them were fake. The errors were related to higher CPU C-states.
 
Three days ago I received about a hundred of these messages within a second, and then silence. I rebooted the server and found these messages under the IPMI/BMC BIOS entry.

Still, I can't figure out the root of the problem. It looks like it triggered 2 sensors (Memory and OEM Memory), but there is still no module location.
Maybe someone has seen these messages before and knows how to read them?

For now, I swapped the CPU1 and CPU2 memory modules and booted the server. The sporadic nature of the errors gives me hope that it is an electrical contact issue and that moving the modules around will help.


Attachments: 1662196159900.png, 1662196183700.png, 1662196379300.png, 1662196395000.png
 