Memory failure?

IPTRACE

Well-Known Member

Reaction score: 24
Messages: 321

Hello!

What do the following errors mean?

Code:
Jan  3 19:13:15 hpv kernel: MCA: Bank 13, Status 0x8c000051000800c0
Jan  3 19:13:15 hpv kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Jan  3 19:13:15 hpv kernel: MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 48
Jan  3 19:13:15 hpv kernel: MCA: CPU 30 COR (1) MS channel 0 memory error
Jan  3 19:13:15 hpv kernel: MCA: Address 0x35c9cbf780
Jan  3 19:13:15 hpv kernel: MCA: Misc 0x918c2000200228c
.......
Jan  5 16:06:59 hpv kernel: MCA: Bank 8, Status 0x8c00004000010090
Jan  5 16:06:59 hpv kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Jan  5 16:06:59 hpv kernel: MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 48
Jan  5 16:06:59 hpv kernel: MCA: CPU 30 COR (1) RD channel 0 memory error
Jan  5 16:06:59 hpv kernel: MCA: Address 0x35c9cbf740
Jan  5 16:06:59 hpv kernel: MCA: Misc 0x152606086
Jan  5 16:06:59 hpv kernel: MCA: Bank 8, Status 0x8c00004000010090
Jan  5 16:06:59 hpv kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Jan  5 16:06:59 hpv kernel: MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 49
Jan  5 16:06:59 hpv kernel: MCA: CPU 31 COR (1) RD channel 0 memory error
Jan  5 16:06:59 hpv kernel: MCA: Address 0x35c9cbf740
Jan  5 16:06:59 hpv kernel: MCA: Misc 0x152606086
 
IPTRACE

I found the tool dmidecode and matched the MCA addresses against the corresponding memory address range.
Is this correct?

Code:
Jan  5 16:06:59 hpv kernel: MCA: Bank 8, Status 0x8c00004000010090
Jan  5 16:06:59 hpv kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Jan  5 16:06:59 hpv kernel: MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 48
Jan  5 16:06:59 hpv kernel: MCA: CPU 30 COR (1) RD channel 0 memory error
Jan  5 16:06:59 hpv kernel: MCA: Address 0x35c9cbf740
Code:
Handle 0x004B, DMI type 19, 31 bytes
Memory Array Mapped Address
        Starting Address: 0x02FFFA00000
        Ending Address: 0x03FFFFFFFFF
        Range Size: 65542 MB
        Physical Array Handle: 0x004A
        Partition Width: 2

Handle 0x004C, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x004A
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM_P1_G0
        Bank Locator: P1_Node1_Channel2_Dimm0
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2133 MHz
        Manufacturer: SK Hynix
        Serial Number: 80F1D447
        Asset Tag: DIMM_P1_G0_AssetTag
        Part Number: HMA84GL7MMR4N-TF
        Rank: 4
        Configured Clock Speed: 2133 MHz
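
A minimal sketch of the check described above, using only the addresses from the kernel log and the range from the DMI type 19 record. The helper names `in_range` and `range_size_mb` are made up for illustration. Note that a type 19 record describes the address range of the whole memory array, and with "Partition Width: 2" the range may be interleaved across devices, so a match here narrows things down to the array rather than proving a specific DIMM.

```python
# Values copied from the kernel log and the dmidecode type 19 record.
MCA_ADDRESSES = [0x35c9cbf780, 0x35c9cbf740]
RANGE_START = 0x02FFFA00000
RANGE_END = 0x03FFFFFFFFF  # dmidecode reports an inclusive ending address


def in_range(addr, start=RANGE_START, end=RANGE_END):
    """Return True if addr lies inside the inclusive range [start, end]."""
    return start <= addr <= end


def range_size_mb(start=RANGE_START, end=RANGE_END):
    """Size of the inclusive range in MB (1 MB = 1024 * 1024 bytes)."""
    return (end - start + 1) // (1024 * 1024)


for addr in MCA_ADDRESSES:
    print(f"{addr:#x} inside mapped range: {in_range(addr)}")
print(f"range size: {range_size_mb()} MB")
```

Both addresses fall inside the range, and the computed size agrees with the "Range Size: 65542 MB" line reported by dmidecode.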

Is this the same memory bank as above?
Code:
Jan  3 19:13:15 hpv kernel: MCA: Bank 13, Status 0x8c000051000800c0
Jan  3 19:13:15 hpv kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Jan  3 19:13:15 hpv kernel: MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 48
Jan  3 19:13:15 hpv kernel: MCA: CPU 30 COR (1) MS channel 0 memory error
Jan  3 19:13:15 hpv kernel: MCA: Address 0x35c9cbf780
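
One way to compare the two reported addresses is to look at how far apart they are. The sketch below assumes 64-byte cache lines and 4 KiB pages (neither is stated in the log), and `same_block` is a hypothetical helper. It only shows closeness in physical address space: with channel interleaving, neighbouring cache lines do not necessarily sit on the same DIMM.

```python
CACHE_LINE = 64   # assumption: 64-byte cache lines on this CPU
PAGE_SIZE = 4096  # assumption: 4 KiB base pages


def same_block(a, b, block_size):
    """Return True if both addresses fall into the same aligned block."""
    return a // block_size == b // block_size


jan3 = 0x35c9cbf780  # Bank 13, "MS channel 0" event
jan5 = 0x35c9cbf740  # Bank 8, "RD channel 0" event

print("distance in bytes:", abs(jan3 - jan5))
print("same cache line:", same_block(jan3, jan5, CACHE_LINE))
print("same page:", same_block(jan3, jan5, PAGE_SIZE))
```

The addresses are 0x40 bytes apart, so under these assumptions they are two neighbouring cache lines within the same page.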
 
ASX

Guest


What do the following errors mean?

There is a utility to decode those messages: sysutils/mcelog

mcelog --ascii [ paste your log to STDIN ]

I ran it on your log and got:
Code:
CPU 30 BANK 13
MISC 918c2000200228c ADDR 35c9cbf780
MCG status:
MemCtrl: Corrected patrol scrub error
STATUS 8c000051000800c0 MCGSTATUS 0
MCGCAP 7000c16 APICID 30 SOCKETID 0
CPUID Vendor Intel Family 6 Model 63
Hardware event. This is not a software error.
CPU 30 BANK 8
MISC 152606086 ADDR 35c9cbf740
MCG status:
STATUS 8c00004000010090 MCGSTATUS 0
MCGCAP 7000c16 APICID 30 SOCKETID 0
CPUID Vendor Intel Family 6 Model 63

--> MemCtrl: Corrected patrol scrub error

As I understand it, a memory error was detected and corrected.
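
That reading can be cross-checked against the raw Status values in the log. The following is a rough sketch based on the IA32_MCi_STATUS layout in the Intel SDM (VAL bit 63, OVER bit 62, UC bit 61, MISCV bit 59, ADDRV bit 58, corrected error count in bits 52:38, MCA error code in bits 15:0). The function name `decode_mci_status` and the dictionary keys are made up for illustration, and the corrected-count field is only meaningful when the CPU reports CMCI support in MCG_CAP.

```python
# Memory controller error codes have the form 000F 0000 1MMM CCCC,
# where MMM is the operation and CCCC is the channel.
MEM_OPS = {0: "GEN", 1: "RD", 2: "WR", 3: "AC", 4: "MS"}


def decode_mci_status(status):
    """Decode a few architectural fields of an MCi_STATUS value."""
    code = status & 0xFFFF
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": code,
    }
    # Ignore bit 12 (the F bit) when matching the memory controller pattern.
    if (code & 0xEF80) == 0x0080:
        info["mem_op"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        info["channel"] = code & 0xF  # 0xF means "channel not specified"
    return info


for status in (0x8c000051000800c0, 0x8c00004000010090):
    print(f"{status:#018x}", decode_mci_status(status))
```

For both values the uncorrected bit is clear and the corrected count is 1, which lines up with the "COR (1)" text in the kernel messages; the operation fields decode to MS (memory scrubbing) and RD (memory read) on channel 0.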

mcelog should also be able to report the location of the RAM module (see the --dmi option), but obviously I can't run that from my machine.
 

SirDice

Administrator
Staff member
Moderator

Reaction score: 13,147
Messages: 39,756

I found the tool dmidecode and matched the MCA addresses against the corresponding memory address range.
Is this correct?
Yep. Those are memory errors. They were corrected by ECC, so there's no immediate problem, but the module does need to be replaced. And it looks like you have found the correct modules.
 