RAM issue

last1 · Feb 18, 2021

It appears I have an issue with a RAM module or two. I see this in messages:

Code:

Feb  5 14:40:17 dfs12 kernel: MCA: Bank 7, Status 0x8c00004000010093
Feb  5 14:40:17 dfs12 kernel: MCA: Global Cap 0x0000000001000c17, Status 0x0000000000000000
Feb  5 14:40:17 dfs12 kernel: MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 0
Feb  5 14:40:17 dfs12 kernel: MCA: CPU 0 COR (1) RD channel 3 memory error
Feb  5 14:40:17 dfs12 kernel: MCA: Address 0x5d419fdc0
Feb  5 14:40:17 dfs12 kernel: MCA: Misc 0x40185286
Feb  7 15:02:57 dfs12 kernel: MCA: Bank 7, Status 0x8c00004000010093
Feb  7 15:02:57 dfs12 kernel: MCA: Global Cap 0x0000000001000c17, Status 0x0000000000000000
Feb  7 15:02:57 dfs12 kernel: MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 0
Feb  7 15:02:57 dfs12 kernel: MCA: CPU 0 COR (1) RD channel 3 memory error
Feb  7 15:02:57 dfs12 kernel: MCA: Address 0x5d419fdc0
Feb  7 15:02:57 dfs12 kernel: MCA: Misc 0x1421ad486
Feb 14 08:57:11 dfs12 kernel: MCA: Bank 12, Status 0x8c000041000800c3
Feb 14 08:57:11 dfs12 kernel: MCA: Global Cap 0x0000000001000c17, Status 0x0000000000000000
Feb 14 08:57:11 dfs12 kernel: MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 0
Feb 14 08:57:11 dfs12 kernel: MCA: CPU 0 COR (1) MS channel 3 memory error
Feb 14 08:57:11 dfs12 kernel: MCA: Address 0x5d419fdc0
Feb 14 08:57:11 dfs12 kernel: MCA: Misc 0x90840800080028c
Feb 14 13:42:46 dfs12 kernel: MCA: Bank 7, Status 0x8c00004000010093
Feb 14 13:42:46 dfs12 kernel: MCA: Global Cap 0x0000000001000c17, Status 0x0000000000000000
Feb 14 13:42:46 dfs12 kernel: MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 0
Feb 14 13:42:46 dfs12 kernel: MCA: CPU 0 COR (1) RD channel 3 memory error
Feb 14 13:42:46 dfs12 kernel: MCA: Address 0x5d419fdc0
Feb 14 13:42:46 dfs12 kernel: MCA: Misc 0x1421cae86
Feb 14 14:57:35 dfs12 kernel: MCA: Bank 7, Status 0x8c00004000010093
Feb 14 14:57:35 dfs12 kernel: MCA: Global Cap 0x0000000001000c17, Status 0x0000000000000000
Feb 14 14:57:35 dfs12 kernel: MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 0
Feb 14 14:57:35 dfs12 kernel: MCA: CPU 0 COR (1) RD channel 3 memory error
Feb 14 14:57:35 dfs12 kernel: MCA: Address 0x5d419fdc0
Feb 14 14:57:35 dfs12 kernel: MCA: Misc 0x152180086

Correlating the address 0x5d419fdc0 with the dmidecode output points to this DIMM, whose mapped address range contains it:

Code:

Handle 0x0043, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x002F
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16 GB
        Form Factor: DIMM
        Set: None
        Locator: P1-DIMMD1
        Bank Locator: P0_Node0_Channel3_Dimm0
        Type: DDR3
        Type Detail: Registered (Buffered)
        Speed: 1333 MT/s
        Manufacturer: Samsung
        Serial Number: 13A2597E
        Asset Tag: DimmD1_AssetTag
        Part Number: M393B2G70BH0-YH9
        Rank: 2
        Configured Memory Speed: 1333 MT/s

Handle 0x0044, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x00400000000
        Ending Address: 0x007FFFFFFFF
        Range Size: 16 GB
        Physical Device Handle: 0x0043
        Memory Array Mapped Address Handle: 0x0030
        Partition Row Position: 1
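
The address match can be checked with a few lines of arithmetic. This is only a sketch: the constants are copied from the log and dmidecode output above, and `in_range` is a made-up helper name. It also assumes the type 20 range describes a contiguous, non-interleaved region, which firmware does not always guarantee, so treat the result as a hint. The "channel 3" in the MCA lines and "Channel3" in the Bank Locator point the same way.

```python
# Values copied from the MCA log and the dmidecode type 20 record above.
MCA_ADDRESS = 0x5d419fdc0
START = 0x00400000000   # Starting Address
END = 0x007FFFFFFFF     # Ending Address (inclusive)


def in_range(address, start, end):
    """Return True if address lies in the inclusive range [start, end]."""
    return start <= address <= end


if __name__ == "__main__":
    # Size of the mapped range in GiB, to compare with "Range Size: 16 GB".
    print("range size (GiB):", (END - START + 1) // 2**30)
    # Does the reported error address fall inside this DIMM's range?
    print("address in range:", in_range(MCA_ADDRESS, START, END))
    # Offset of the error address from the start of the range.
    print("offset into range:", hex(MCA_ADDRESS - START))
```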

What kind of error is that? Is it an actual error or a warning? The memory is ECC, so I'm not sure what's happening. (Two different banks are mentioned, 7 and 12, but with the same address.)

I am also seeing some CRC errors with ZFS and I wonder whether this could be related.

the3ajm · Feb 18, 2021

Have you run a memory test from your system BIOS diagnostic tool?

im · Feb 18, 2021

last1 said:
The memory is ECC

Your memory module appears to be registered (buffered) ECC.
ECC can usually correct only single-bit memory errors.
But an ECC module can still fail completely, just like any other memory.

If a module has failed, you should replace it as soon as possible.

SirDice · Feb 18, 2021

ECC detected memory errors and did what it's supposed to do, namely correct them (COR). You still need to replace those DIMMs though; the corrections are only saving you from imminent failure.

If you didn't have ECC but plain regular memory you would not see those errors, you just get random crashes of applications, spontaneous resets and all sorts of wonderful and unstable weirdness.
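
The "corrected" reading can be cross-checked against the raw Status values in the log. The sketch below assumes the architectural IA32_MCi_STATUS layout from Intel's Software Developer's Manual (VAL in bit 63, UC in bit 61, corrected error count in bits 52:38, and the memory controller error code pattern `0000 0000 1MMM CCCC` in the low 16 bits). The function name and dictionary keys are made up for illustration, and the model-specific bits (31:16) are left undecoded.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
MC_STATUS_VAL = 1 << 63    # register contains a valid error
MC_STATUS_OVER = 1 << 62   # an earlier error was overwritten
MC_STATUS_UC = 1 << 61     # error was NOT corrected
MC_STATUS_ADDRV = 1 << 58  # the Address register is valid

# MMM field of a memory controller error code.
MEM_OPS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}


def decode_status(status):
    """Decode the architectural parts of a 64-bit MCA status value."""
    code = status & 0xFFFF
    info = {
        "valid": bool(status & MC_STATUS_VAL),
        "overflow": bool(status & MC_STATUS_OVER),
        "uncorrected": bool(status & MC_STATUS_UC),
        "address_valid": bool(status & MC_STATUS_ADDRV),
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": code,
    }
    # Memory controller errors match 0000 0000 1MMM CCCC.
    if code & 0xFF80 == 0x0080:
        info["mem_op"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        info["channel"] = None if channel == 0xF else channel  # 0xF: unspecified
    return info


if __name__ == "__main__":
    # The two distinct Status values from the log (banks 7 and 12).
    for status in (0x8c00004000010093, 0x8c000041000800c3):
        print(hex(status), decode_status(status))
```

Both values decode with the UC bit clear, which agrees with the COR in the kernel messages. The bank 7 value decodes as a read on channel 3 and the bank 12 value as a scrub on channel 3, matching the RD and MS labels.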

Snurg · Feb 18, 2021

SirDice said:
If you didn't have ECC but plain regular memory you would not see those errors, you just get random crashes of applications, spontaneous resets and all sorts of wonderful and unstable weirdness.

It is particularly nice when you find files that have been altered by memory or bus errors.
The joy peaks when you find that a file you know was once intact is clobbered in all your backups, too.

This led me to use ZFS and avoid non-ECC hardware where possible, and regularly stash away backups I never touch again.

last1 · Feb 18, 2021

SirDice said:
ECC detected memory errors and did what it's supposed to do, namely correct them (COR). You still need to replace those DIMMs though; the corrections are only saving you from imminent failure.

If you didn't have ECC but plain regular memory you would not see those errors, you just get random crashes of applications, spontaneous resets and all sorts of wonderful and unstable weirdness.

Is it possible that some corruption is still taking place despite the corrections?

I am asking because I have an app that stores files on top of ZFS and computes and stores its own CRC for each file. For the past few weeks this app has been throwing CRC errors, but ZFS hasn't reported anything yet.

SirDice · Feb 18, 2021

last1 said:
Is it possible that some corruption is still taking place despite the corrections?

Only if you get MCA messages telling you it was unable to correct the error. As long as error correction works, the memory should behave as if nothing happened; that's what it's supposed to do.
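
One way to look for uncorrected events is to scan the log and classify each record by the UC bit (bit 61) of its Status value, rather than relying on the message wording. This is a rough sketch that assumes the line format shown in the first post; `summarize` is a made-up name, and records without an Address line are simply skipped.

```python
import re
from collections import Counter

BANK_RE = re.compile(r"MCA: Bank (\d+), Status (0x[0-9a-fA-F]+)")
ADDR_RE = re.compile(r"MCA: Address (0x[0-9a-fA-F]+)")
MC_STATUS_UC = 1 << 61  # set when the error was not corrected


def summarize(lines):
    """Count MCA records per (address, 'corrected' | 'uncorrected')."""
    counts = Counter()
    uncorrected = None  # UC flag of the record currently being read
    for line in lines:
        bank = BANK_RE.search(line)
        if bank:
            uncorrected = bool(int(bank.group(2), 16) & MC_STATUS_UC)
            continue
        addr = ADDR_RE.search(line)
        if addr and uncorrected is not None:
            kind = "uncorrected" if uncorrected else "corrected"
            counts[(addr.group(1), kind)] += 1
            uncorrected = None  # wait for the next Bank line
    return counts


# Two records copied from the log in the first post.
SAMPLE = [
    "Feb  5 14:40:17 dfs12 kernel: MCA: Bank 7, Status 0x8c00004000010093",
    "Feb  5 14:40:17 dfs12 kernel: MCA: Global Cap 0x0000000001000c17, Status 0x0000000000000000",
    "Feb  5 14:40:17 dfs12 kernel: MCA: CPU 0 COR (1) RD channel 3 memory error",
    "Feb  5 14:40:17 dfs12 kernel: MCA: Address 0x5d419fdc0",
    "Feb 14 08:57:11 dfs12 kernel: MCA: Bank 12, Status 0x8c000041000800c3",
    "Feb 14 08:57:11 dfs12 kernel: MCA: Address 0x5d419fdc0",
]

if __name__ == "__main__":
    for (address, kind), n in sorted(summarize(SAMPLE).items()):
        print(address, kind, n)
```

To run it against a real log, pass an open file to `summarize`; the log path depends on the syslog setup. Any "uncorrected" entry would be the case described above where corruption becomes possible.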
