Same MCA errors on two different machines

SirDice

We have several SuperMicro machines and they're all running fine, except two. They appear to run FreeBSD 9.3-RELEASE just fine, but we get a lot of MCA errors in the logs. Most likely these are memory errors and the affected modules will need to be replaced.

One machine is a couple of years old, so it's not entirely unexpected:
Code:
Oct 12 05:18:26 db2.example.com MCA: Bank 7, Status 0xcc0d16c000010091
Oct 12 05:18:26 db2.example.com MCA: Global Cap 0x0000000001000c1b, Status 0x0000000000000000
Oct 12 05:18:26 db2.example.com MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 36
Oct 12 05:18:26 db2.example.com MCA: CPU 12 COR (13403) OVER RD channel 1 memory error
Oct 12 05:18:26 db2.example.com MCA: Address 0x343353cec0
Oct 12 05:18:26 db2.example.com MCA: Misc 0x142661286
Oct 12 05:19:53 db2.example.com MCA: Bank 7, Status 0xcc010100000400a1
Oct 12 05:19:53 db2.example.com MCA: Global Cap 0x0000000001000c1b, Status 0x0000000000000000
Oct 12 05:19:53 db2.example.com MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 51
Oct 12 05:19:53 db2.example.com MCA: CPU 21 COR (1028) OVER WR channel 1 memory error
Oct 12 05:19:53 db2.example.com MCA: Address 0x343353ce80
Oct 12 05:19:53 db2.example.com MCA: Misc 0x80289686
Oct 12 05:19:53 db2.example.com MCA: Bank 9, Status 0x8c000050000800c0
Oct 12 05:19:53 db2.example.com MCA: Global Cap 0x0000000001000c1b, Status 0x0000000000000000
Oct 12 05:19:53 db2.example.com MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 50
Oct 12 05:19:53 db2.example.com MCA: CPU 20 COR (1) MS channel 0 memory error
Oct 12 05:19:53 db2.example.com MCA: Address 0x3efbda5500
Oct 12 05:19:53 db2.example.com MCA: Misc 0x90000000000208c
Oct 12 05:19:53 db2.example.com MCA: Bank 7, Status 0xcc010100000400a1
Oct 12 05:19:53 db2.example.com MCA: Global Cap 0x0000000001000c1b, Status 0x0000000000000000
Oct 12 05:19:53 db2.example.com MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 37
Oct 12 05:19:53 db2.example.com MCA: CPU 13 COR (1028) OVER WR channel 1 memory error
Oct 12 05:19:53 db2.example.com MCA: Address 0x343353ce80
Oct 12 05:19:53 db2.example.com MCA: Misc 0x80289686
Oct 12 05:19:53 db2.example.com MCA: Bank 9, Status 0x8c000050000800c0
Oct 12 05:19:53 db2.example.com MCA: Global Cap 0x0000000001000c1b, Status 0x0000000000000000
Oct 12 05:19:53 db2.example.com MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 49
Oct 12 05:19:53 db2.example.com MCA: CPU 19 COR (1) MS channel 0 memory error
Oct 12 05:19:53 db2.example.com MCA: Address 0x3efbda5500
Oct 12 05:19:53 db2.example.com MCA: Misc 0x90000000000208c
Oct 12 05:19:53 db2.example.com MCA: Bank 9, Status 0x8c000050000800c0
Oct 12 05:19:53 db2.example.com MCA: Global Cap 0x0000000001000c1b, Status 0x0000000000000000
Oct 12 05:19:53 db2.example.com MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 36
Oct 12 05:19:53 db2.example.com MCA: CPU 12 COR (1) MS channel 0 memory error
Oct 12 05:19:53 db2.example.com MCA: Address 0x3efbda5500
Oct 12 05:19:53 db2.example.com MCA: Misc 0x90000000000208c
Oct 12 06:18:26 db2.example.com MCA: Bank 7, Status 0xcc15b98000010091
Oct 12 06:18:26 db2.example.com MCA: Global Cap 0x0000000001000c1b, Status 0x0000000000000000
Oct 12 06:18:26 db2.example.com MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 36
Oct 12 06:18:26 db2.example.com MCA: CPU 12 COR (22246) OVER RD channel 1 memory error
So it looks like banks 7, 9, and 10 are broken. Running sysutils/mcelog doesn't tell me much more than that.

The other machine is fairly new (new enough to still be under warranty):
Code:
Oct  8 15:48:25 db4.example.com MCA: Bank 7, Status 0xcc00008000010090
Oct  8 15:48:25 db4.example.com MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Oct  8 15:48:25 db4.example.com MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 18
Oct  8 15:48:25 db4.example.com MCA: CPU 14 COR (2) OVER RD channel 0 memory error
Oct  8 15:48:25 db4.example.com MCA: Address 0x50ac272080
Oct  8 15:48:25 db4.example.com MCA: Misc 0x1523aba86
Oct  8 15:48:25 db4.example.com MCA: Bank 9, Status 0x8c000051000800c0
Oct  8 15:48:25 db4.example.com MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Oct  8 15:48:25 db4.example.com MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 19
Oct  8 15:48:25 db4.example.com MCA: CPU 15 COR (1) MS channel 0 memory error
Oct  8 15:48:25 db4.example.com MCA: Address 0x50ac272000
Oct  8 15:48:25 db4.example.com MCA: Misc 0x122940200020228c
Oct  8 15:48:25 db4.example.com MCA: Bank 7, Status 0xcc00008000010090
Oct  8 15:48:25 db4.example.com MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Oct  8 15:48:25 db4.example.com MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 17
Oct  8 15:48:25 db4.example.com MCA: CPU 13 COR (2) OVER RD channel 0 memory error
Oct  8 15:48:25 db4.example.com MCA: Address 0x50ac272080
Oct  8 15:48:25 db4.example.com MCA: Misc 0x1523aba86
Oct  8 15:48:25 db4.example.com MCA: Bank 9, Status 0x8c000051000800c0
Oct  8 15:48:25 db4.example.com MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Oct  8 15:48:25 db4.example.com MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 16
Oct  8 15:48:25 db4.example.com MCA: CPU 12 COR (1) MS channel 0 memory error
Oct  8 15:48:25 db4.example.com MCA: Address 0x50ac272000
Oct  8 15:48:25 db4.example.com MCA: Misc 0x122940200020228c
Oct  8 16:41:06 db4.example.com MCA: Bank 7, Status 0x8c00004000010090
Oct  8 16:41:06 db4.example.com MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Oct  8 16:41:06 db4.example.com MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 16
Oct  8 16:41:06 db4.example.com MCA: CPU 12 COR (1) RD channel 0 memory error
Oct  8 16:41:06 db4.example.com MCA: Address 0x50ac272080
Oct  8 16:41:06 db4.example.com MCA: Misc 0x150181886
Oct  8 17:30:50 db4.example.com MCA: Bank 9, Status 0x8c000051000800c0
Oct  8 17:30:50 db4.example.com MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Oct  8 17:30:50 db4.example.com MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 19
Again, it looks like banks 7, 9, and 10 are broken (I may not have pasted everything).

Now, I can imagine one bank dying. Three in one machine, while not entirely impossible, is just unlikely. And the same banks failing on two different machines, one old and one new? Extremely unlikely.

So, I'm wondering if this might be something else. I'd also like to find out which physical bank corresponds to the bank numbers in the MCA errors; sysutils/mcelog doesn't tell me much more than can already be learned from the log.
 
Example dmidecode(8) output for one memory socket in some older Supermicro box:
Code:
Handle 0x0012, DMI type 17, 27 bytes
Memory Device
        Array Handle: 0x0011
        Error Information Handle: No Error
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 2048 MB
        Form Factor: DIMM
        Set: 1
        Locator: DIMM#1A
        Bank Locator: Bank 1
        Type: DDR2
        Type Detail: Synchronous
        Speed: 667 MHz
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
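
If memory serves, that output comes from sysutils/dmidecode and can be limited to just the memory device entries:
Code:
# DMI type 17 covers the individual memory devices (DIMMs); "--type memory"
# also pulls in the memory array records.
dmidecode -t 17
dmidecode --type memory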
 
Yes, that was the first tool I tried. It does show the physical location, but I can't find any relation between the bank numbers in the MCA reports and the output from dmidecode. So I still don't know exactly which DIMM(s) are broken.

Code:
Handle 0x0057, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x0041
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: P1-DIMMD2
        Bank Locator: P0_Node0_Channel3_Dimm1
        Type: DDR3
        Type Detail: Registered (Buffered)
        Speed: 1066 MHz
        Manufacturer: Hynix Semiconductor
        Serial Number: 4F22F777
        Asset Tag: DimmD2_AssetTag
        Part Number: HMT42GR7AFR4C-RD
        Rank: 1
        Configured Clock Speed: 1066 MHz

These boxes have all their banks filled and are constantly running databases. So I can't just randomly try swapping DIMMs.
 
This states that DMI identification is provided to a custom script when an error event is triggered, but I don't have a machine at hand to test it on.

Particularly interesting fields from the file above are:
# LOCATION      Consolidated location as a single string
# DMI_LOCATION  DIMM location from DMI/SMBIOS if available
# DMI_NAME      DIMM identifier from DMI/SMBIOS if available
# DIMM          DIMM number reported by hardware
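
Untested, since I don't have a machine at hand, but a trigger script that just records those fields somewhere visible could be as simple as:
Code:
#!/bin/sh
# Untested sketch of a DIMM-error trigger: log which DIMM mcelog blames,
# using the environment variables described above.
logger -t mcelog "corrected memory error: DIMM=${DIMM} LOCATION=${LOCATION} DMI_NAME=${DMI_NAME} DMI_LOCATION=${DMI_LOCATION}"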
 
I may have just found a way, or at least good enough for an educated guess :)

Looking at one MCA error:
Code:
Oct 12 05:19:53 db2.example.com MCA: Address 0x3efbda5500
I assume that address is the physical address of the error; every error in that particular bank seems to have the same address. If I then look through the dmidecode(8) output:
Code:
Handle 0x0064, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x03C00000000
        Ending Address: 0x03FFFFFFFFF
        Range Size: 16 GB
        Physical Device Handle: 0x0063
        Memory Array Mapped Address Handle: 0x005C
        Partition Row Position: 1
This entry appears to cover that address range. It refers to a physical device:
Code:
Handle 0x0063, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x005B
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: P2-DIMMF1
        Bank Locator: P1_Node1_Channel1_Dimm0
        Type: DDR3
        Type Detail: Registered (Buffered)
        Speed: 1066 MHz
        Manufacturer: Hynix Semiconductor
        Serial Number: 2892A78C
        Asset Tag: DimmF1_AssetTag
        Part Number: HMT42GR7AFR4C-RD
        Rank: 1
        Configured Clock Speed: 1066 MHz
And that provides me with a physical location on the board.
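
In short, the lookup boils down to this (hand-matching the ranges; the 0x0063 handle is of course specific to this box):
Code:
# List the physical address ranges per DIMM (DMI type 20) and find the range
# that contains the MCA address, 0x3efbda5500 in this case:
dmidecode -t 20
# Then look up the "Physical Device Handle" that entry points to (0x0063 here)
# among the DIMM records (DMI type 17):
dmidecode -t 17 | grep -A 20 'Handle 0x0063'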
 
Lucky you :) From what I see on my machine, there is one type 20 device which consists of four type 17 devices (DIMMs), and the HW handle points to the motherboard device, probably the north bridge.
 
I may have just found a way, or at least good enough for an educated guess :)
[snip]
And that provides me with a physical location on the board.
The first log you posted also shows errors at 0x343353ce80, which would seem to be in a different module.

When you see oddball errors happening in more than one place (which could be multiple things in one system, or across several systems), the first thing to do is to see what the affected places have in common and whether anything changed between when they were working and when they started reporting errors. Have you ruled out the usual suspects (cooling, power, etc.)? If the systems are in the same rack (or racks close to each other), was any work performed on any systems in those racks?

Since these are Supermicro systems on the large side, you probably have the IPMI option on the motherboards. Do you monitor / graph the data that IPMI provides (example with nice graphs here)? The memory modules in (at least) the first system also have on-board thermal monitoring, which you should be able to examine in the "Server Health" (or similar) menu in the IPMI web interface.

You can also look at the dmidecode output to see if the affected memory has consecutive (or close) serial numbers, and then check the other memory in the system to see if it is in the same serial number range. If it is, and you are actually experiencing multiple module failures (as opposed to thermal problems or similar), you may want to proactively replace all of those modules while you have the system down for maintenance. The cost of the extra modules is likely to be small compared to the cost of unplanned downtime. Plus, if you replace all of the memory you will know that if the problem recurs, the issue is elsewhere. That also gives you a set of memory to run tests on, in a dedicated test system.
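
Something along these lines should be enough to eyeball the slot/serial pairs (assuming sysutils/dmidecode is installed):
Code:
# List slot locators and serial numbers for all DIMMs, to spot modules from
# the same production batch (consecutive or nearby serial numbers).
dmidecode -t 17 | grep -E 'Locator|Serial Number'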
 
Bugger, looks like it's quite broken. I rechecked everything and the old server has errors on all but one of the modules attached to CPU2. There were no errors with the DIMMs on CPU1. This afternoon I have to be at the co-location anyway. I'm going to remove the DIMMs and see how things go from there.

For future reference (and anyone else that comes across this post), the "Locator" (P1-DIMMD2 for example) of the dmidecode output corresponds with markings on the mainboard.
 
Since these are Supermicro systems on the large side, you probably have the IPMI option on the motherboards. Do you monitor / graph the data that IPMI provides (example with nice graphs here)? The memory modules in (at least) the first system also have on-board thermal monitoring, which you should be able to examine in the "Server Health" (or similar) menu in the IPMI web interface.
They have IPMI but unfortunately it's not used. I am going to look for thermal data though; the datacenter these machines are in is quite warm and it may be the heat that's killing them. If I can query the data from the command line I can probably script something for Zabbix to graph.
 
Although IPMI isn't configured or used, I can use it to gather some information. The old system doesn't have temperatures for the DIMMs, the new system does:
Code:
root@db4:~ # ipmitool sensor
CPU1 Temp        | 40.000     | degrees C  | ok    | 0.000     | 0.000     | 0.000     | 93.000    | 98.000    | 98.000
CPU2 Temp        | 49.000     | degrees C  | ok    | 0.000     | 0.000     | 0.000     | 93.000    | 98.000    | 98.000
PCH Temp         | 38.000     | degrees C  | ok    | -11.000   | -8.000    | -5.000    | 90.000    | 95.000    | 100.000
System Temp      | 27.000     | degrees C  | ok    | -9.000    | -7.000    | -5.000    | 80.000    | 85.000    | 90.000
Peripheral Temp  | 34.000     | degrees C  | ok    | -9.000    | -7.000    | -5.000    | 80.000    | 85.000    | 90.000
Vcpu1VRM Temp    | 32.000     | degrees C  | ok    | -9.000    | -7.000    | -5.000    | 95.000    | 100.000   | 105.000
Vcpu2VRM Temp    | 38.000     | degrees C  | ok    | -9.000    | -7.000    | -5.000    | 95.000    | 100.000   | 105.000
VmemABVRM Temp   | 34.000     | degrees C  | ok    | -9.000    | -7.000    | -5.000    | 95.000    | 100.000   | 105.000
VmemCDVRM Temp   | 35.000     | degrees C  | ok    | -9.000    | -7.000    | -5.000    | 95.000    | 100.000   | 105.000
VmemEFVRM Temp   | 36.000     | degrees C  | ok    | -9.000    | -7.000    | -5.000    | 95.000    | 100.000   | 105.000
VmemGHVRM Temp   | 39.000     | degrees C  | ok    | -9.000    | -7.000    | -5.000    | 95.000    | 100.000   | 105.000
P1-DIMMA1 Temp   | 34.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P1-DIMMA2 Temp   | 32.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P1-DIMMA3 Temp   | 31.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P1-DIMMB1 Temp   | 32.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P1-DIMMB2 Temp   | 32.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P1-DIMMB3 Temp   | 31.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P1-DIMMC1 Temp   | 34.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P1-DIMMC2 Temp   | 32.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P1-DIMMC3 Temp   | 31.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P1-DIMMD1 Temp   | 32.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P1-DIMMD2 Temp   | 32.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P1-DIMMD3 Temp   | 31.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P2-DIMME1 Temp   | 33.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P2-DIMME2 Temp   | 34.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P2-DIMME3 Temp   | 35.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P2-DIMMF1 Temp   | 35.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P2-DIMMF2 Temp   | 37.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P2-DIMMF3 Temp   | 38.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P2-DIMMG1 Temp   | 33.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P2-DIMMG2 Temp   | 34.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P2-DIMMG3 Temp   | 35.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P2-DIMMH1 Temp   | 35.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P2-DIMMH2 Temp   | 37.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
P2-DIMMH3 Temp   | 38.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000
FAN1             | 4200.000   | RPM        | ok    | 300.000   | 500.000   | 700.000   | 25300.000 | 25400.000 | 25500.000
FAN2             | na         |            | na    | na        | na        | na        | na        | na        | na
FAN3             | na         |            | na    | na        | na        | na        | na        | na        | na
FAN4             | 4100.000   | RPM        | ok    | 300.000   | 500.000   | 700.000   | 25300.000 | 25400.000 | 25500.000
FAN5             | na         |            | na    | na        | na        | na        | na        | na        | na
FAN6             | na         |            | na    | na        | na        | na        | na        | na        | na
FANA             | na         |            | na    | na        | na        | na        | na        | na        | na
FANB             | 3900.000   | RPM        | ok    | 300.000   | 500.000   | 700.000   | 25300.000 | 25400.000 | 25500.000
FANC             | na         |            | na    | na        | na        | na        | na        | na        | na
12V              | 11.937     | Volts      | ok    | 10.173    | 10.299    | 10.740    | 12.945    | 13.260    | 13.386
5VCC             | 4.974      | Volts      | ok    | 4.246     | 4.298     | 4.480     | 5.390     | 5.546     | 5.598
3.3VCC           | 3.282      | Volts      | ok    | 2.789     | 2.823     | 2.959     | 3.554     | 3.656     | 3.690
VBAT             | 3.078      | Volts      | ok    | 2.376     | 2.480     | 2.584     | 3.494     | 3.598     | 3.676
Vcpu1            | 1.809      | Volts      | ok    | 1.242     | 1.260     | 1.395     | 1.899     | 2.088     | 2.106
Vcpu2            | 1.809      | Volts      | ok    | 1.242     | 1.260     | 1.395     | 1.899     | 2.088     | 2.106
VDIMMAB          | 1.191      | Volts      | ok    | 0.948     | 0.975     | 1.047     | 1.344     | 1.425     | 1.443
VDIMMCD          | 1.191      | Volts      | ok    | 0.948     | 0.975     | 1.047     | 1.344     | 1.425     | 1.443
VDIMMEF          | 1.191      | Volts      | ok    | 0.948     | 0.975     | 1.047     | 1.344     | 1.425     | 1.443
VDIMMGH          | 1.191      | Volts      | ok    | 0.948     | 0.975     | 1.047     | 1.344     | 1.425     | 1.443
5VSB             | 4.974      | Volts      | ok    | 4.246     | 4.298     | 4.480     | 5.390     | 5.546     | 5.598
3.3VSB           | 3.299      | Volts      | ok    | 2.789     | 2.823     | 2.959     | 3.554     | 3.656     | 3.690
1.5V PCH         | 1.518      | Volts      | ok    | 1.320     | 1.347     | 1.401     | 1.644     | 1.671     | 1.698
1.2V BMC         | 1.209      | Volts      | ok    | 1.020     | 1.047     | 1.092     | 1.344     | 1.371     | 1.398
1.05V PCH        | 1.041      | Volts      | ok    | 0.870     | 0.897     | 0.942     | 1.194     | 1.221     | 1.248
Chassis Intru    | 0x0        | discrete   | 0x0000| na        | na        | na        | na        | na        | na
PS1 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na
PS2 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na
AOC_SAS Temp     | 63.000     | degrees C  | ok    | -11.000   | -8.000    | -5.000    | 95.000    | 100.000   | 105.000
This is fairly easy to get into Zabbix. Maybe the older system can provide it too, but I may need to update the BIOS and IPMI firmware. In any case, all temperatures look to be well within spec, so overheating isn't the problem.
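
Something like this, wrapped in a Zabbix UserParameter, should do the trick (an untested sketch; the script path and key name are just examples):
Code:
#!/bin/sh
# Untested sketch: print the numeric reading of one named IPMI sensor so the
# Zabbix agent can graph it, e.g. in zabbix_agentd.conf:
#   UserParameter=ipmi.sensor[*],/usr/local/bin/ipmi_sensor.sh "$1"
ipmitool sensor reading "$1" | awk -F'|' '{ gsub(/ /, "", $2); print $2 }'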

Thanks Terry_Kennedy for the hint :D
 
Although IPMI isn't configured or used, I can use it to gather some information. The old system doesn't have temperatures for the DIMMs, the new system does:
The IPMI firmware on (at least some) Supermicro boards can get into a mode where it doesn't report the memory temperature, even if the modules have thermal sensors. Supermicro support told me to "update" the firmware to the same (latest) version while leaving the "preserve configuration" option unchecked. That was a bit of a pain as I had to reconfigure all of my IPMI settings, but it did get the memory temperature to appear.
 
I may have just found a way, or at least good enough for an educated guess :)
I assume that address is the physical address of the error; every error in that particular bank seems to have the same address. If I then look through the dmidecode(8) output:

Did this assumption turn out correct? It seems correct to me, and I'm diagnosing a similar issue. This posting is by far the most helpful one I've found on the topic.

Also, any advice on when ECC corrections become a "problem"? I mean, the point of ECC is to correct errors, so if it happens, say, once or twice a year, one can consider it normal operation. What kind of threshold indicates that a memory stick is going bad? In my specific case, it happened three times over a period of 2.5 hours yesterday, all with the same physical address.
 
Did this assumption turn out correct? It seems correct to me, and I'm diagnosing a similar issue. This posting is by far the most helpful one I've found on the topic.
My assumption was spot on. You can relate the address in the MCA error to the address ranges reported by dmidecode(8) to find the exact memory module.

Also, any advice on when ECC corrections become a "problem"? I mean, the point of ECC is to correct errors, so if it happens, say, once or twice a year, one can consider it normal operation.
Any correction is actually a failure; it just means ECC is doing what it's supposed to do: covering up failures. Normal operation doesn't produce ECC corrections. If it happens just once or twice a year, it simply means the machine never really uses all of its memory, so you're less likely to run into the "bad" part.
 