finding faulty RAM in FreeNAS machine

kibsd · Sep 11, 2023

Hi there,
I have a FreeNAS machine that unfortunately shows a lot of memory errors. More than one module seems to be affected: one was reported in the Supermicro BIOS, and I have already removed it.

In /var/log/messages I still get a lot of

Code:

Sep  7 02:07:20 freenas MCA: CPU 0 COR (1) MS channel 0 memory error
Sep  7 02:07:20 freenas MCA: Address 0x6e0bfb900
Sep  7 02:07:20 freenas MCA: Misc 0x1221000000000086
Sep  7 02:07:20 freenas MCA: Bank 15, Status 0x8c000040000800c0
Sep  7 02:07:20 freenas MCA: Global Cap 0x0000000007000c14, Status 0x0000000000000000
Sep  7 02:07:20 freenas MCA: Vendor "GenuineIntel", ID 0x50657, APIC ID 0
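
The Status value in the log above can be decoded by hand. The following is a minimal sketch in Python, assuming the architectural IA32_MCi_STATUS bit layout from the Intel SDM (the corrected-error count field is only meaningful when bit 10 of Global Cap is set, which is the case here). The function and dictionary names are made up for this example and are not part of FreeBSD or any tool.

```python
# Sketch: decode an IA32_MCi_STATUS value as printed in the "Status" field.
# Bit positions are the architectural ones from the Intel SDM (assumption:
# they apply unchanged to this CPU). All names here are illustrative.

MEMORY_ERROR_TYPES = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_mca_status(status):
    """Return the main architectural fields of a 64-bit MCA status value."""
    mcacod = status & 0xFFFF  # bits 15:0, architectural error code
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mcacod": mcacod,
    }
    # Memory controller errors use the pattern 0000 0000 1MMM CCCC.
    if (mcacod & 0xFF80) == 0x0080:
        mmm = (mcacod >> 4) & 0x7
        channel = mcacod & 0xF
        info["memory_error_type"] = MEMORY_ERROR_TYPES.get(mmm, "reserved")
        # 0xF means "channel not specified".
        info["channel"] = None if channel == 0xF else channel
    return info


if __name__ == "__main__":
    for key, value in decode_mca_status(0x8C000040000800C0).items():
        print(f"{key}: {value}")
```

For the Bank 15 status above this reports a valid, corrected error with a count of 1, of type "memory scrubbing error" on channel 0, which agrees with the "COR (1) MS channel 0" summary line. Note that the channel number alone does not identify a DIMM slot.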

As I read in this thread, it is possible to locate the faulty DIMM via the reported error address, which should fall within one of the address ranges shown under Memory Device Mapped Address in the dmidecode output. Unfortunately, in my case this is not clear, as there are two sets of DIMMs that have the same address ranges. Also, at some earlier stage the error addresses even pointed completely outside the range of my Memory Device Mapped Addresses. It therefore seems that some conversion/calculation is involved, maybe via the memory controllers.

Please help!
Here you go with the memory handles straight out of dmidecode. This was originally 64 GB of RAM in 8 sticks.

Memory Array Mapped Addresses:

Code:

Handle 0x0021, DMI type 19, 31 bytes
Memory Array Mapped Address
    Starting Address: 0x00000000000
    Ending Address: 0x007FFFFFFFF
    Range Size: 32 GB
    Physical Array Handle: 0x0020
    Partition Width: 4
    
Handle 0x0029, DMI type 19, 31 bytes
Memory Array Mapped Address
    Starting Address: 0x00800000000
    Ending Address: 0x00FFFFFFFFF
    Range Size: 32 GB
    Physical Array Handle: 0x0028
    Partition Width: 4
    
    
Handle 0x0030, DMI type 19, 31 bytes
Memory Array Mapped Address
    Starting Address: 0x01000000000
    Ending Address: 0x017FFFFFFFF
    Range Size: 32 GB
    Physical Array Handle: 0x0020
    Partition Width: 0

Handle 0x0035, DMI type 19, 31 bytes
Memory Array Mapped Address
    Starting Address: 0x01800000000
    Ending Address: 0x01FFFFFFFFF
    Range Size: 32 GB
    Physical Array Handle: 0x0028
    Partition Width: 0

Memory Device Mapped Addresses (which point to the handles of the actual physical memory sticks):

Code:

Handle 0x0031, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00000000000
    Ending Address: 0x001FFFFFFFF
    Range Size: 8 GB
    Physical Device Handle: 0x0022
    Memory Array Mapped Address Handle: 0x0030
    Partition Row Position: 1
    Interleave Position: 1
    Interleaved Data Depth: 1

Handle 0x0032, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00200000000
    Ending Address: 0x003FFFFFFFF
    Range Size: 8 GB
    Physical Device Handle: 0x0023
    Memory Array Mapped Address Handle: 0x0030
    Partition Row Position: 1
    Interleave Position: 1
    Interleaved Data Depth: 1

Handle 0x0033, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00400000000
    Ending Address: 0x005FFFFFFFF
    Range Size: 8 GB
    Physical Device Handle: 0x0024
    Memory Array Mapped Address Handle: 0x0030
    Partition Row Position: 1

Handle 0x0034, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00600000000
    Ending Address: 0x007FFFFFFFF
    Range Size: 8 GB
    Physical Device Handle: 0x0026
    Memory Array Mapped Address Handle: 0x0030
    Partition Row Position: 1
    



Handle 0x0036, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00000000000
    Ending Address: 0x001FFFFFFFF
    Range Size: 8 GB
    Physical Device Handle: 0x002A
    Memory Array Mapped Address Handle: 0x0035
    Partition Row Position: 1
    Interleave Position: 1
    Interleaved Data Depth: 1

Handle 0x0037, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00200000000
    Ending Address: 0x003FFFFFFFF
    Range Size: 8 GB
    Physical Device Handle: 0x002B
    Memory Array Mapped Address Handle: 0x0035
    Partition Row Position: 1
    Interleave Position: 1
    Interleaved Data Depth: 1

Handle 0x0038, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00400000000
    Ending Address: 0x005FFFFFFFF
    Range Size: 8 GB
    Physical Device Handle: 0x002C
    Memory Array Mapped Address Handle: 0x0035
    Partition Row Position: 1

Handle 0x0039, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00600000000
    Ending Address: 0x007FFFFFFFF
    Range Size: 8 GB
    Physical Device Handle: 0x002E
    Memory Array Mapped Address Handle: 0x0035
    Partition Row Position: 1
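
The naive lookup described above can be sketched in Python as follows. The ranges and handles are copied from the dmidecode output in this post; the function name is made up for illustration. The sketch assumes the MCA address is a plain system physical address and that there is no interleaving, which is exactly the assumption that seems not to hold on this board.

```python
# Sketch: find every Memory Device Mapped Address range (DMI type 20)
# that contains a given MCA error address. Data copied from the post;
# names are illustrative only.

# (type 20 handle, physical device handle, start address, end address)
DEVICE_RANGES = [
    (0x0031, 0x0022, 0x00000000000, 0x001FFFFFFFF),
    (0x0032, 0x0023, 0x00200000000, 0x003FFFFFFFF),
    (0x0033, 0x0024, 0x00400000000, 0x005FFFFFFFF),
    (0x0034, 0x0026, 0x00600000000, 0x007FFFFFFFF),
    (0x0036, 0x002A, 0x00000000000, 0x001FFFFFFFF),
    (0x0037, 0x002B, 0x00200000000, 0x003FFFFFFFF),
    (0x0038, 0x002C, 0x00400000000, 0x005FFFFFFFF),
    (0x0039, 0x002E, 0x00600000000, 0x007FFFFFFFF),
]


def candidate_devices(address):
    """Return (mapped-address handle, device handle) pairs covering address."""
    return [
        (handle, device)
        for handle, device, start, end in DEVICE_RANGES
        if start <= address <= end
    ]


if __name__ == "__main__":
    for handle, device in candidate_devices(0x6E0BFB900):
        print(f"mapped handle 0x{handle:04X} -> device handle 0x{device:04X}")
```

For 0x6e0bfb900 this returns two candidates, device handles 0x0026 and 0x002E, so the lookup narrows eight sticks down to two but cannot decide between them.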

sko · Sep 11, 2023

1. TrueNAS is not FreeBSD and hence not supported here: https://forums.freebsd.org/threads/ghostbsd-pfsense-truenas-and-all-other-freebsd-derivatives.7290/

2. is this ECC RAM? Also, given they are only 8GB in size and assuming this is server RAM, how old are they/is the whole system?

6502 · Sep 11, 2023

memtest86/memtest86+ will show you the DIMM

kibsd · Sep 11, 2023

Hi there,

sko said:
1. TrueNAS is not FreeBSD and hence not supported here: https://forums.freebsd.org/threads/ghostbsd-pfsense-truenas-and-all-other-freebsd-derivatives.7290/

hmmm, that's a bit sad

maybe my question is quite general enough to be answered!?

I took over the system - don't know the actual age. It is a Supermicro X11SPL-F board. From what I heard I roughly guess 10 years of age ..

It is ECC RAM .. I already had Memtest86+ running for a while and unfortunately it did not catch any errors. The errors mostly occur when the server gets some load on its disks ..

6502 · Sep 11, 2023

Run memtest86(+) for 3-4 hours. I had a case where a RAM error occurred only after about 3 hours. In the first hour the RAM looked OK. Heat is probably a factor. Note that memtest86 and memtest86+ are different tools.

SirDice · Sep 11, 2023

It's ECC memory; memtest isn't going to show memory errors because ECC is correcting them. Running those tests will only elicit more MCA messages. They won't show up as memory errors in a test because ECC does what it's supposed to do: correct those errors.

cracauer@ · Sep 11, 2023

memtest86(+) allows you to turn ECC off.

Overall, however, I always had to fall back to physically removing DIMMs to identify faulty ones.

memtest is also a no-load test. A test such as SuperPI is better at uncovering RAM problems such as timing problems.

kibsd · Sep 11, 2023

ah, ok, so I'll check tomorrow with turning ECC off!

cracauer@ Checking each DIMM to find out is what I actually tried to avoid

Why didn't it work out in your case?

I'll take a look at SuperPI!

cracauer@ · Sep 11, 2023

kibsd said:
cracauer@ Checking each DIMM to find out is what I actually tried to avoid Why didn't it work out in your case?

I was too dumb to map messages to which DIMM they refer to

smithi · Sep 11, 2023

kibsd said:
The errors mostly occur when the server gets some load on its disks ..

Tried another power supply?

PMc · Sep 11, 2023

When I got such errors, it read
MCA: Address 0x18d5b1b40 (Mode: Physical Address, LSB: 6)
(note the "Physical Address" mention - whatever that means...)

I then found these addresses to match those in sysctl vm.phys_segs, and the latter ones are shown with their NUMA domains.
And I know which memory is in which NUMA domain (because they have different sizes). That reduced the problem to two possible sticks, which then were easy to swap around.
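
The "(Mode: Physical Address, LSB: 6)" annotation can be derived from the Misc value of an MCA record. Below is a minimal Python sketch, assuming the architectural IA32_MCi_MISC layout from the Intel SDM (bits 5:0 hold the least significant valid address bit, bits 8:6 the address mode; these fields are only defined when software error recovery is reported in Global Cap). The names are made up for this example.

```python
# Sketch: decode the address-related fields of an IA32_MCi_MISC value.
# Layout per the Intel SDM (assumption: it applies to the CPU in question).
# All names are illustrative.

ADDRESS_MODES = {
    0: "segment offset",
    1: "linear address",
    2: "physical address",
    3: "memory address",
    7: "generic",
}


def decode_mca_misc(misc):
    """Return (lsb, address mode name) for a 64-bit MCA misc value."""
    lsb = misc & 0x3F          # bits 5:0, lowest valid bit of the address
    mode = (misc >> 6) & 0x7   # bits 8:6, address mode
    return lsb, ADDRESS_MODES.get(mode, "reserved")


def align_to_lsb(address, lsb):
    """Clear the address bits below lsb, which carry no information."""
    return address & ~((1 << lsb) - 1)


if __name__ == "__main__":
    lsb, mode = decode_mca_misc(0x1221000000000086)
    print(f"Mode: {mode}, LSB: {lsb}")
    print(hex(align_to_lsb(0x6E0BFB900, lsb)))
```

For the Misc value from the first post (0x1221000000000086) this gives LSB 6 and physical address mode, meaning the reported address is a physical address that is valid down to a 64-byte granularity.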

kibsd · Sep 13, 2023

well .. that's not going so well. I checked with sysctl vm.phys_segs and dmesg, but it seems I only have NUMA domain "0", if any at all.

Memtest86 (Passmark) which even in the free version supports ECC error reporting will not show any errors.

The power supply is a rather interesting idea -- there are redundant power supplies installed in the chassis. How would I test those - just remove one at a time? As they both draw power, which adds up to the total consumption in regular / dual mode, I supposed they would compensate for each other's instabilities!?

No more ideas on how to map the reported addresses to the actual DIMMs via the Memory Array Mapped Addresses?

ralphbsz · Sep 14, 2023

The last time this happened to me (on an IBM/Lenovo rackmount server), I used Lenovo documentation and the BIOS together. There was a way to look at error counters from the BIOS, and map them to physical DIMMs. You may want to look for Supermicro documentation.

richardtoohey2 · Sep 14, 2023

The last time this happened to me was on Supermicro but I didn't find anything obvious to help map the error message to the appropriate physical bank. The server was just entering use so it was easiest just to swap out the two RAM sticks (not so easy if you've got 4 or 8 sticks). And one quiet day (ha!) I'll try and figure out how to find which of the sticks was bad.

blanchet · Sep 14, 2023

Supermicro X11 motherboards have a Baseboard Management Controller (BMC) with a web interface on a dedicated network interface.
Connect to this web interface and then check the System Event Log (SEL). It will tell you which DIMM has encountered the errors.

kibsd · Sep 14, 2023

I had already worked my way through the management interface some time ago, and indeed I found errors about a faulty module there.

I removed it but unfortunately now I don't see any errors in the BMC anymore but just in my /var/log/messages.

Just found out that this software from Supermicro might be working with my board: https://www.supermicro.com/de/solutions/management-software/super-diagnostics-offline

I think I'll give it a shot, as it seems to include memory testing ..
