Mem Error since 11.2-RELENG

Hello guys,

I updated to release 11.2, and two of my HP servers now report memory errors.
Code:
DL380G7:
MCA: Bank 8, Status 0x88000040000200cf
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 33
MCA: CPU 13 COR (1) MS channel ?? memory error
MCA: Misc 0x204081000016000

On the DL580G7 there are too many messages to read, and the machine crashes. What happened with 11.2?

best regards ré
 
Sounds like you've just got bad memory. Those machines are pretty old now, aren't they? Check the built-in management interface, if you have it.
 
Note that this is ECC memory and the error was corrected (COR). So a memory test is unlikely to find failures, though the test may trigger more MCA messages.
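
To make the "corrected" reading concrete, here is a minimal sketch that decodes the status value from the first post. The function name decode_mci_status is made up for illustration, and the bit positions are my reading of the IA32_MCi_STATUS layout in the Intel SDM, so check them against the manual before relying on them.

```python
# Hypothetical helper: decode a few fields of an MCA bank status word.
# Bit positions are assumed from the Intel SDM description of IA32_MCi_STATUS.

MEM_TRANSACTION = {0: "GEN", 1: "RD", 2: "WR", 3: "AC", 4: "MS"}

def decode_mci_status(status: int) -> dict:
    code = status & 0xFFFF  # architectural MCA error code (bits 15:0)
    info = {
        "valid": bool((status >> 63) & 1),        # VAL
        "overflow": bool((status >> 62) & 1),     # OVER
        "uncorrected": bool((status >> 61) & 1),  # UC
        "misc_valid": bool((status >> 59) & 1),   # MISCV
        "addr_valid": bool((status >> 58) & 1),   # ADDRV
        # Bits 52:38; only meaningful on CPUs that report corrected counts.
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_error_code": code,
        "memory_error": None,
    }
    # Memory controller errors have the form 000F 0000 1MMM CCCC.
    if code & 0xEF80 == 0x0080:
        mmm = (code >> 4) & 0x7
        cccc = code & 0xF
        channel = None if cccc == 0xF else cccc  # 0xF means "not specified"
        info["memory_error"] = (MEM_TRANSACTION.get(mmm, "??"), channel)
    return info

if __name__ == "__main__":
    for key, value in decode_mci_status(0x88000040000200CF).items():
        print(f"{key}: {value}")
```

For the value above, the UC bit is clear and the corrected count is 1, which lines up with the "COR (1) MS channel ??" line in the kernel output: a corrected memory-scrubbing error with no channel specified.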

The error may have been there before the upgrade too, and nobody noticed it. Now that you've upgraded the system, you're looking at the logs more closely, so it only "appears" that these errors are due to the upgrade.
 
FWIW, I see these periodically: about once every 7-8 months on one of my servers. I recently replaced the memory, doubling both the capacity and the speed of the RAM, and then saw one a week later (on the new memory). It might be worth noting that the L1/L2/L3 caches are also memory and can also throw these errors. I suspect (in my case) that the CPU is working exceptionally hard at the time (heavy load), and that the cache memory isn't performing correctly while the CPU is so hot.
Unless they become a frequent occurrence, you can probably disregard the message(s).
As to the FreeBSD upgrade: it may well be that the upgrade introduced something that makes these (tempfails) more evident.

HTH

--Chris
EDIT
I should also note that a failing PSU can cause this (poor-quality electricity, e.g. failing diode(s)). RAM and cache are especially sensitive to the quality of the electricity.
 
MCA errors are reported when the CPU detects a possible hardware problem. I've seen these for the L1 cache on my machine. They are usually informational, but they do indicate that there may be a problem. Since CPUs now contain the memory controller, it looks like the CPU registered an error with the memory, which was then corrected. I wouldn't worry too much about it unless it happens frequently, which would indicate faulty or flaky hardware (memory, CPU, mainboard, PSU, etc.).
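
If you want to judge "frequently" with numbers rather than by eye, a rough sketch like the following tallies MCA events per bank from saved log text. The function name count_mca_events is hypothetical, the regex only assumes the "MCA: Bank N, Status 0x..." line format shown in the first post, and treating bit 61 as the "uncorrected" flag is my reading of the Intel SDM. Where the log text comes from (for example /var/log/messages) is also an assumption about your setup.

```python
import re
from collections import Counter

# Matches lines such as: "MCA: Bank 8, Status 0x88000040000200cf"
BANK_RE = re.compile(r"MCA: Bank (\d+), Status (0x[0-9a-fA-F]+)")

def count_mca_events(log_text: str) -> Counter:
    """Count MCA events per (bank, kind), where kind is corrected/uncorrected."""
    counts = Counter()
    for match in BANK_RE.finditer(log_text):
        bank = int(match.group(1))
        status = int(match.group(2), 16)
        # Assumed: bit 61 (UC) set means the error was not corrected.
        kind = "uncorrected" if (status >> 61) & 1 else "corrected"
        counts[(bank, kind)] += 1
    return counts

if __name__ == "__main__":
    sample = (
        "MCA: Bank 8, Status 0x88000040000200cf\n"
        "MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000\n"
        "MCA: Bank 8, Status 0x88000040000200cf\n"
    )
    for (bank, kind), n in sorted(count_mca_events(sample).items()):
        print(f"bank {bank}: {n} {kind}")
```

A handful of corrected events per year on one bank is the "don't worry" case described above; a steadily growing count, or any uncorrected events, is the point at which swapping hardware starts to make sense.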
 
These days, memory cells are small enough to be corrupted by radiation. How is the sun storm activity? And yes, I _am_ serious.
 
I know you are, as I am well aware of that myself. A cosmic ray has enough energy to flip a bit in a memory cell. So yes, space weather is becoming very important to admins as well.

Space weather, not just for power grids, satellites, and communications operators/providers any more.
 