MCA errors

I'm going to bring this up with my vendor, but I'm trying to figure out if the logs are telling me anything more than "bad DIMM".

So I had a box running 12.2-RELEASE-p7 lock up. On logging into the IP-KVM, I found a mess of MCA logs on the console, but the host was unresponsive: no keyboard response, a software shutdown did nothing, and I had to hard-reset it. That was 5 days ago, and the box hasn't done anything odd since.

Board is a Supermicro X11SPW-TF.

What seems odd is the sheer volume of messages. It looks like they were spewed out in just over a minute and a half (first one at 17:46:05, last one at 17:47:44), and it's quite a few:

Code:
[root@clweb2 /home/spork]# bzgrep 'MCA:' /var/log/messages.0.bz2 | wc -l
  539977

In digging around on how to match this to a particular DIMM, I saw people noting that the address of the error is helpful, and that it's usually a single address. Not so much here: 81,184 lines contain an address (and I assume 81,184 is the number of MCA errors logged as well), and sorting for unique addresses leaves 77,524 distinct ones. This seems odd based on what I'm reading about MCA errors.

Code:
[root@clweb2 /home/spork]# bzgrep "MCA: Address" /var/log/messages.0.bz2 | wc -l
81184
[root@clweb2 /home/spork]# bzgrep "MCA: Address" /var/log/messages.0.bz2 |sort -u | wc -l
77524
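Incidentally, if a handful of failing cells were responsible, you'd expect a few addresses to repeat heavily, and `sort | uniq -c | sort -rn` will rank them. A minimal sketch with stand-in log lines (against the real log it would be `bzgrep "MCA: Address" /var/log/messages.0.bz2 | sort | uniq -c | sort -rn | head`):

```shell
# Rank addresses by how often they repeat; the printf lines below
# stand in for the real bzgrep output.
printf '%s\n' \
    'MCA: Address 0x107ffe80c0' \
    'MCA: Address 0x107ffe80c0' \
    'MCA: Address 0x165629c00' |
    sort | uniq -c | sort -rn | head
# The repeated address sorts to the top along with its count.
```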

If I pipe all these logs through mcelog, I get variations on this, with one line always being "CPU 0 BANK 7" or "CPU 0 BANK 13":

Code:
Hardware event. This is not a software error.
CPU 0 BANK 7
MISC 200000c000001086 ADDR 165629c00
MCG status:
M2M: MscodDataRdErr
STATUS dc001f8001010090 MCGSTATUS 0
MCGCAP 7000c14 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 85 Step 4

Hardware event. This is not a software error.
CPU 0 BANK 13
MISC 9000081c3830086 ADDR f348f88c0
MCG status:
MemCtrl: Corrected patrol scrub error
STATUS cc000200000800c0 MCGSTATUS 0
MCGCAP 7000c14 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 85 Step 4

Not shown in the mcelog output are messages like this, which indicate the kernel ran out of space for recording these events:

Code:
MCA: Unable to allocate space for an event.

In the IPMI SEL logs I have a few interesting things:

Code:
  11 | 07/02/2021 | 20:14:31 | Fan #0x42 | Lower Critical going low  | Asserted
  12 | 07/02/2021 | 20:14:31 | Fan #0x42 | Lower Non-recoverable going low  | Asserted
  13 | 07/02/2021 | 20:14:37 | Fan #0x42 | Lower Non-recoverable going low  | Deasserted
  14 | 07/02/2021 | 20:14:37 | Fan #0x42 | Lower Critical going low  | Deasserted
  15 | 07/02/2021 | 20:16:16 | Processor #0xff | IERR () | Asserted
  16 | 07/14/2021 | 00:38:50 | Power Supply #0xc9 | Failure detected () | Asserted
  17 | 07/14/2021 | 00:39:05 | Power Supply #0xc9 | Failure detected () | Deasserted
  18 | 07/14/2021 | 18:09:12 | Power Supply #0xc8 | Failure detected () | Asserted
  19 | 07/14/2021 | 18:09:30 | Power Supply #0xc8 | Failure detected () | Deasserted
  1a | 07/16/2021 | 17:48:06 | Memory | Uncorrectable ECC (@DIMMA1(CPU1)) | Asserted

The first four entries, on 7/2/21, were me taking the lid off to examine why a LAN port wasn't latching, then putting the case back together. The "Processor #0xff | IERR () | Asserted" entry is from a few minutes after that, but of note, the box just kept chugging along. The four PSU warnings were me verifying that I could query PSU status with ipmitool.

The log on 7/16 ( 1a | 07/16/2021 | 17:48:06 | Memory | Uncorrectable ECC (@DIMMA1(CPU1)) | Asserted) seems to point to a bad DIMM, but then again that brings me back to the big question - is it hardware or not? How trustworthy is this MCA logging?

Any suggestions on other things to look at?
 
One way to find out whether it's the DIMM or the mainboard is to record the address that's giving errors now, then move every DIMM to another slot (swap everything around). If the address moves with the DIMM, the DIMM is faulty; if it stays put, suspect the board or the slot.

is it hardware or not? How trustworthy is this MCA logging?
They are always hardware errors; the machine-check banks are filled in by the CPU itself, not by software.
 
Yeah, I saw some talk about using the address line in other posts, but considering I have 77K+ addresses in that log (all within a minute and a half), and I can't make the box do this on demand, it's hard to try that. I noticed other people with this issue seem to have only a handful of memory addresses; that's part of what has me confused, since I didn't see anyone with something similar happening.
 
I noticed other people with this issue seem to have only a handful of memory addresses
It's going to depend on the "density" of the DIMM. Bigger DIMMs with fewer chips on them have a higher density of bits (more bits crammed into fewer chips). If one of those chips fails, a higher-density DIMM would throw more memory errors than a lower-density one.
 
I had memory errors on a board recently. All it took was to reseat the memory chip.
Have you tried to re-insert your problem module?
 
It's going to depend on the "density" of the DIMM. Bigger DIMMs with fewer chips on them have a higher density of bits (more bits crammed into fewer chips). If one of those chips fails, a higher-density DIMM would throw more memory errors than a lower-density one.
This is where I get stuck. My brain doesn't do hex, so I can't see an easy way to check whether all these addresses fall within one range. It would certainly be interesting to do that and see if they all fall within one DIMM; after all, the server event log flags a single DIMM. It would also be nice if I could figure out a way to trigger this.

For all I know, my vendor might just ship me another board populated with the same config.
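Incidentally, the hex part needn't be done by brain: POSIX shell arithmetic understands 0x constants directly. A minimal sketch, where the range is a made-up 16 GB window rather than anything taken from this board:

```shell
# $(( )) accepts hex constants, so range checks need no manual conversion.
addr=0x165629c00    # one of the addresses from the mcelog output above
lo=0x000000000      # hypothetical start of a 16 GB window
hi=0x3FFFFFFFF      # hypothetical end of the same window
if [ $((addr >= lo && addr <= hi)) -eq 1 ]; then
    echo "in range"
else
    echo "out of range"
fi
# prints "in range" (0x165629c00 is about 5.6 GB)
```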
 
I had memory errors on a board recently. All it took was to reseat the memory chip.
Have you tried to re-insert your problem module?
The box is currently doing redundant stuff, so I absolutely plan on pulling it out of the rack and giving it a once-over next time I'm there.
 
My brain doesn't do hex, so I can't see an easy way to check whether all these addresses fall within one range.
A range is all you need anyway. Look at the output from sysutils/dmidecode; it will tell you which address range is used by which DIMM. That will let you track down the failing one.
 
A range is all you need anyway. Look at the output from sysutils/dmidecode; it will tell you which address range is used by which DIMM. That will let you track down the failing one.
OK, so at this point I'm just doing this as an exercise in figuring out the relationship between the logs and the dmidecode output. The vendor is looking at this stuff too, and I think they're just going to ship a new DIMM, and we both agree on which DIMM it is based on the server event log in the IPMI device:

Memory | Uncorrectable ECC (@DIMMA1(CPU1)) | Asserted
The manual shows a slot labelled DIMMA1, so I think we're good.

Anyhow, for reference, here's some info out of dmidecode. First, each DIMM along with the object one level up from it (some grouping of DIMMs, I guess; it shows a size of 32 GB, while each DIMM is 16 GB). I think the "Handle" for each DIMM is what I cross-reference with the next block.

1st Group/Bank(?)

Handle 0x0021, DMI type 19, 31 bytes
Memory Array Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x007FFFFFFFF
Range Size: 32 GB
Physical Array Handle: 0x0020
Partition Width: 2

Handle 0x0022, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0020
Locator: DIMMA1
Bank Locator: P0_Node0_Channel0_Dimm0
Manufacturer: Samsung
Serial Number: 0371A838
Asset Tag: DIMMA1_AssetTag (date:19/08)
Part Number: M393A2K40CB2-CTD

Handle 0x0024, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0020
Locator: DIMMB1
Bank Locator: P0_Node0_Channel1_Dimm0
Manufacturer: Samsung
Serial Number: 0371A8C3
Asset Tag: DIMMB1_AssetTag (date:19/08)
Part Number: M393A2K40CB2-CTD

2nd Group/Bank(?)

Handle 0x0029, DMI type 19, 31 bytes
Memory Array Mapped Address
Starting Address: 0x00800000000
Ending Address: 0x00FFFFFFFFF
Range Size: 32 GB
Physical Array Handle: 0x0028
Partition Width: 2

Handle 0x002A, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0028
Locator: DIMMD1
Bank Locator: P0_Node1_Channel0_Dimm0
Manufacturer: Samsung
Serial Number: 4297DXXX
Asset Tag: DIMMD1_AssetTag (date:19/47)
Part Number: M393A2K40CB2-CTD

Handle 0x002C, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0028
Locator: DIMME1
Bank Locator: P0_Node1_Channel1_Dimm0
Manufacturer: Samsung
Serial Number: 4297DXXX
Asset Tag: DIMME1_AssetTag (date:19/47)
Part Number: M393A2K40CB2-CTD

Elsewhere in the dmidecode output I have this, which I think lays out an address range for each DIMM. The "Physical Device Handle" here matches the DIMM handles in the prior snippet.

Handle 0x0031, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x003FFFFFFFF

Range Size: 16 GB
Physical Device Handle: 0x0022
Memory Array Mapped Address Handle: 0x0030
Partition Row Position: 1
Interleave Position: 1
Interleaved Data Depth: 2

Handle 0x0032, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00400000000
Ending Address: 0x007FFFFFFFF

Range Size: 16 GB
Physical Device Handle: 0x0024
Memory Array Mapped Address Handle: 0x0030
Partition Row Position: 1
Interleave Position: 2
Interleaved Data Depth: 2

Handle 0x0034, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x003FFFFFFFF

Range Size: 16 GB
Physical Device Handle: 0x002A
Memory Array Mapped Address Handle: 0x0033
Partition Row Position: 1
Interleave Position: 1
Interleaved Data Depth: 2

Handle 0x0035, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00400000000
Ending Address: 0x007FFFFFFFF

Range Size: 16 GB
Physical Device Handle: 0x002C
Memory Array Mapped Address Handle: 0x0033
Partition Row Position: 1
Interleave Position: 2
Interleaved Data Depth: 2


So I think I can posit that DIMMA1 with a handle of 0x0022 matches the "Memory Device Mapped Address" range of Starting Address: 0x00000000000 and Ending Address: 0x003FFFFFFFF.
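Assuming that's the right cross-reference, a little awk can tabulate handle-to-range from the Type 20 records; the heredoc below is a trimmed stand-in for `dmidecode -t 20` output (which needs root to produce):

```shell
# Print "device-handle start-end" for each Type 20 record.
awk '
    /^Handle/           { handle = $2; sub(/,$/, "", handle) }
    /Starting Address:/ { start = $3 }
    /Ending Address:/   { printf "%s %s-%s\n", handle, start, $3 }
' <<'EOF'
Handle 0x0031, DMI type 20, 35 bytes
        Starting Address: 0x00000000000
        Ending Address: 0x003FFFFFFFF
Handle 0x0032, DMI type 20, 35 bytes
        Starting Address: 0x00400000000
        Ending Address: 0x007FFFFFFFF
EOF
```

That prints one line per device handle with its range, which is the table I've been building by eye above.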

But then I look at just a small snippet of my MCA logs and...

(first log line)
Jul 16 17:46:05 clweb2 kernel: MCA: Address 0x106eac7800
Jul 16 17:46:05 clweb2 kernel: MCA: Address 0x107ffe80c0
Jul 16 17:46:05 clweb2 kernel: MCA: Address 0x107ffe94c0
Jul 16 17:46:05 clweb2 kernel: MCA: Address 0x107ffe98c0
Jul 16 17:46:05 clweb2 kernel: MCA: Address 0x107ffea800
Jul 16 17:46:05 clweb2 kernel: MCA: Address 0x107ffeacc0
Jul 16 17:46:05 clweb2 kernel: MCA: Address 0x107ffeb080
Jul 16 17:46:05 clweb2 kernel: MCA: Address 0x107ffee440
Jul 16 17:46:05 clweb2 kernel: MCA: Address 0x107ffee480
Jul 16 17:46:05 clweb2 kernel: MCA: Address 0x107ffef000
... (about 77,000 more lines of addresses)
Jul 16 17:47:44 clweb2 kernel: MCA: Address 0xf394324c0
Jul 16 17:47:44 clweb2 kernel: MCA: Address 0xf39432cc0
Jul 16 17:47:44 clweb2 kernel: MCA: Address 0xf394334c0
Jul 16 17:47:44 clweb2 kernel: MCA: Address 0xf39433cc0
Jul 16 17:47:44 clweb2 kernel: MCA: Address 0xf394344c0
Jul 16 17:47:44 clweb2 kernel: MCA: Address 0xf39434cc0
Jul 16 17:47:44 clweb2 kernel: MCA: Address 0xf394354c0
Jul 16 17:47:44 clweb2 kernel: MCA: Address 0xf39435cc0
(last log line)

Now those don't appear to fall within any of the ranges above, so that's where I'm scratching my head (or, again, I don't do hex, and therefore I'm not reading any of that right).
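For what it's worth, bucketing the first and last logged addresses against the two Type 19 "Memory Array Mapped Address" ranges (0x0 through 0x7FFFFFFFF, and 0x800000000 through 0xFFFFFFFFF) suggests the 0xf3... batch does land inside the second 32 GB array, while the 0x10... batch sits above both; whether the latter is the BIOS remapping memory around the sub-4 GB PCI hole is speculation on my part:

```shell
# Compare two of the logged addresses against the Type 19
# "Memory Array Mapped Address" ranges from the dmidecode output.
for addr in 0x106eac7800 0xf394324c0; do
    if [ $((addr <= 0x7FFFFFFFF)) -eq 1 ]; then
        echo "$addr: first 32 GB array (DIMMA1/DIMMB1)"
    elif [ $((addr <= 0xFFFFFFFFF)) -eq 1 ]; then
        echo "$addr: second 32 GB array (DIMMD1/DIMME1)"
    else
        echo "$addr: above both arrays"
    fi
done
# prints:
# 0x106eac7800: above both arrays
# 0xf394324c0: second 32 GB array (DIMMD1/DIMME1)
```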
 