I'm going to bring this up with my vendor, but I'm trying to figure out if the logs are telling me anything more than "bad DIMM".
I had a box running 12.2-RELEASE-p7 lock up. When I logged into the IP-KVM, the console was full of MCA logs and the host was unresponsive: no keyboard response, a software shutdown did nothing, and I had to reboot. This was 5 days ago, and the box hasn't done anything odd since.
Board is a Supermicro X11SPW-TF.
What seems odd is the sheer volume of messages. They were all spewed out in about a minute and a half (first one at 17:46:05, last one at 17:47:44, so 99 seconds), and there are a lot of them:
Code:
[root@clweb2 /home/spork]# bzgrep 'MCA:' /var/log/messages.0.bz2 | wc -l
539977
While digging around on how to match this to a particular DIMM, I saw people note that the address of the error is helpful, and that it is usually a single address. Not so much here: there are 81,184 address lines (I assume that is also the number of MCA errors logged), and if I sort them for uniqueness, I seem to have 77,524 unique addresses. This seems odd based on what I'm reading about MCA errors.
Code:
[root@clweb2 /home/spork]# bzgrep "MCA: Address" /var/log/messages.0.bz2 | wc -l
81184
[root@clweb2 /home/spork]# bzgrep "MCA: Address" /var/log/messages.0.bz2 |sort -u | wc -l
77524
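One caveat on the count above: `sort -u` compares whole log lines, so if each line carries a syslog timestamp prefix, the same address logged at two different seconds is counted twice. A minimal Python sketch that pulls out only the address before counting is below. The line shape `MCA: Address 0x<hex>` and the function names are assumptions for illustration; adjust the pattern to whatever the log actually contains.

```python
import bz2
import re
from collections import Counter

# Assumed line shape: "<syslog prefix> MCA: Address 0x<hex>" (the 0x is optional here).
ADDR_RE = re.compile(r"MCA: Address\s+(?:0x)?([0-9a-fA-F]+)")


def address_counts(lines):
    """Count occurrences of each address, ignoring everything before 'MCA:'."""
    counts = Counter()
    for line in lines:
        match = ADDR_RE.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return counts


def address_counts_from_bz2(path):
    """Hypothetical helper: run address_counts over a bzip2-compressed log file."""
    with bz2.open(path, "rt", errors="replace") as handle:
        return address_counts(handle)
```

Calling `address_counts_from_bz2("/var/log/messages.0.bz2")` and then `most_common(10)` on the result would show whether a handful of addresses dominate or whether the errors really are spread across tens of thousands of locations.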
If I pipe all these logs through mcelog, I get variations on this, with one line always being "CPU 0 BANK 7" or "CPU 0 BANK 13":
Code:
Hardware event. This is not a software error.
CPU 0 BANK 7
MISC 200000c000001086 ADDR 165629c00
MCG status:
M2M: MscodDataRdErr
STATUS dc001f8001010090 MCGSTATUS 0
MCGCAP 7000c14 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 85 Step 4
Hardware event. This is not a software error.
CPU 0 BANK 13
MISC 9000081c3830086 ADDR f348f88c0
MCG status:
MemCtrl: Corrected patrol scrub error
STATUS cc000200000800c0 MCGSTATUS 0
MCGCAP 7000c14 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 85 Step 4
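The STATUS values can also be picked apart by hand, which helps when deciding whether an event was corrected or uncorrected. The sketch below assumes the architectural IA32_MCi_STATUS bit layout from the Intel SDM; the function name and dictionary keys are made up for illustration. The corrected-error count field (bits 52:38) is only meaningful when bit 10 of MCGCAP is set, which is the case for the 7000c14 value shown here.

```python
def decode_mci_status(status):
    """Split an IA32_MCi_STATUS value into its architectural fields (assumed layout)."""

    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # VAL: register holds a valid error
        "overflow": bit(62),         # OVER: an earlier error was overwritten or lost
        "uncorrected": bit(61),      # UC: error was not corrected by hardware
        "enabled": bit(60),          # EN: error reporting was enabled for this bank
        "misc_valid": bit(59),       # MISCV: the MISC register is valid
        "addr_valid": bit(58),       # ADDRV: the ADDR register is valid
        "context_corrupt": bit(57),  # PCC: processor context may be corrupt
        "corrected_count": (status >> 38) & 0x7FFF,      # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": status & 0xFFFF,               # bits 15:0
    }
```

Fed the BANK 7 status above (`0xdc001f8001010090`), this reports `uncorrected` False, `overflow` True, and a corrected count of 126. The BANK 13 status (`0xcc000200000800c0`) decodes to `uncorrected` False, `overflow` True, and a corrected count of 8.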
Not shown in mcelog output are messages like this that indicate something ran out of space for recording these errors:
Code:
MCA: Unable to allocate space for an event.
In the IPMI SEL logs I have a few interesting things:
Code:
11 | 07/02/2021 | 20:14:31 | Fan #0x42 | Lower Critical going low | Asserted
12 | 07/02/2021 | 20:14:31 | Fan #0x42 | Lower Non-recoverable going low | Asserted
13 | 07/02/2021 | 20:14:37 | Fan #0x42 | Lower Non-recoverable going low | Deasserted
14 | 07/02/2021 | 20:14:37 | Fan #0x42 | Lower Critical going low | Deasserted
15 | 07/02/2021 | 20:16:16 | Processor #0xff | IERR () | Asserted
16 | 07/14/2021 | 00:38:50 | Power Supply #0xc9 | Failure detected () | Asserted
17 | 07/14/2021 | 00:39:05 | Power Supply #0xc9 | Failure detected () | Deasserted
18 | 07/14/2021 | 18:09:12 | Power Supply #0xc8 | Failure detected () | Asserted
19 | 07/14/2021 | 18:09:30 | Power Supply #0xc8 | Failure detected () | Deasserted
1a | 07/16/2021 | 17:48:06 | Memory | Uncorrectable ECC (@DIMMA1(CPU1)) | Asserted
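To line the SEL up against the MCA burst, the entries can be parsed and filtered by time. This is only a sketch: it assumes the pipe-delimited layout shown above, that the BMC clock and the host clock roughly agree, and that the 17:46:05 to 17:47:44 burst happened on 07/16/2021. The function names and the ten-minute slack are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def parse_sel_line(line):
    """Parse one 'id | date | time | sensor | event | state' SEL line, or return None."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) < 6:
        return None
    try:
        when = datetime.strptime(f"{parts[1]} {parts[2]}", "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {
        "id": parts[0],
        "time": when,
        "sensor": parts[3],
        "event": parts[4],
        "state": parts[5],
    }


def events_near(lines, start, end, slack=timedelta(minutes=10)):
    """Return parsed SEL entries within `slack` of the [start, end] window."""
    hits = []
    for line in lines:
        entry = parse_sel_line(line)
        if entry and start - slack <= entry["time"] <= end + slack:
            hits.append(entry)
    return hits
```

With `start=datetime(2021, 7, 16, 17, 46, 5)` and `end=datetime(2021, 7, 16, 17, 47, 44)`, only entry `1a` from the list above falls inside the window; it was logged 22 seconds after the last MCA message.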
The first four entries on 7/2/21 are from me taking the lid off to examine why a LAN port was not latching and then putting the lid back on. The "Processor #0xff | IERR () | Asserted" entry came about two minutes later, but notably the box just kept chugging along. The four PSU warnings are from me verifying that I could query PSU status with ipmitool.
The entry on 7/16 (1a | 07/16/2021 | 17:48:06 | Memory | Uncorrectable ECC (@DIMMA1(CPU1)) | Asserted) seems to point to a bad DIMM, but that brings me back to the big question: is it hardware or not? How trustworthy is this MCA logging?
Any suggestions on other things to look at?