Some sort of hardware failure?

stratacast1 · Aug 17, 2017

I have an HP Proliant ML110 G6, and I came out to my server this morning to find the fans running at 100% and the system unresponsive. I did a hard shutdown, started it back up, and got this error:

Server Asset Text :/
A multiple bit memory error has occurred:
Host Bus

What does that even mean? Did my motherboard just fail or something?

ralphbsz · Aug 18, 2017

Can you check whether that particular model of computer supports ECC memory, and whether ECC memory is actually installed?

In most cases, ECC memory can detect and correct a single-bit error, reliably detect and flag all two-bit errors, and detect most multi-bit errors. This error message means that an uncorrectable memory error (2 or more bits) has occurred. That means your memory, or the memory interface, or some connection is defective.
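To make the "correct one bit, detect two bits" behaviour concrete, here is a toy sketch in Python. It uses an extended Hamming code over a 4-bit value, which is an assumption chosen for readability: real ECC DIMMs protect much wider words (typically 64 data bits plus 8 check bits), and this is not how the ML110 G6 memory controller is actually implemented. The function names `encode` and `decode` are made up for the illustration.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (list of 0/1 values).

    Index 0 holds an overall parity bit; indices 1..7 form a Hamming(7,4)
    code with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # make the total parity even
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    parity_odd = sum(code) % 2 == 1

    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: treat it as a single-bit error and fix it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)

    one_flip = list(word)
    one_flip[5] ^= 1
    print(decode(one_flip))   # single-bit error

    two_flips = list(one_flip)
    two_flips[2] ^= 1
    print(decode(two_flips))  # double-bit error
```

The second case is the toy equivalent of the "multiple bit memory error" message: the hardware can tell the data is wrong but has no way to know which bits to flip back.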

If your motherboard cannot support ECC, or no ECC memory is installed, that error means that something is completely confused, since non-ECC memory normally has no way to detect bit errors in the first place.

If ECC memory is installed, that error means that the memory has failed. In many cases, that is actually caused by contact problems. I would start by opening the computer, putting on a static safety strap and grounding yourself (*), and then pulling and reseating all the memory DIMMs, one at a time. In extreme cases, it may even help to clean any slight corrosion from the DIMM contacts with an eraser (the eraser end of a pencil works fine); just blow the eraser dust off afterwards. If that doesn't help, then it was probably not a contact or corrosion problem, and you have had a memory DIMM or motherboard failure.

(* Footnote: Please do ground yourself with an anti-static armband. A few years ago we got a new computer that had a huge amount of memory installed, I think it was 768 GB, and I had to rearrange the DIMMs for best performance. This was at a time when such amounts of memory were still considered extraordinary, and required very expensive DIMMs. I was not grounded, standing on a plastic ladder (the computer was rack-mounted high up), and I ended up destroying a DIMM with static electricity. That was a mistake that cost our group several thousand dollars. Don't make that mistake.)

stratacast1 · Aug 19, 2017

Hey ralphbsz, yes the server supports ECC and is using ECC memory. I purchased a new module of RAM to upgrade my system from a 2x4 config to a 4x4 config, and it ran for a week so perhaps one of my DIMMs did just gradually die from static shock? My currently running 2x4 config is fine, as I removed my 2 other sticks. One of the sticks originally came with the server and worked for the week I used it (until I upgraded to 8), and after a week this (supposedly new) stick of RAM seems to be causing errors. Maybe I will reseat all my memory, shut my server down and run memtest off of a live bootable image.

ralphbsz · Aug 19, 2017

Reseating sounds good. Memtest might help, although a memory test usually doesn't stress the memory as much as an extreme workload does. Since this is a high-quality motherboard from a respectable vendor, I would suggest running the diagnostics built into the motherboard; they tend to be better at finding problems. They can also pinpoint the problem to a specific DIMM to replace, while the operating-system memtest will only tell you that "something is wrong somewhere".

Hopefully your problem can be found and fixed (by reseating, cleaning, or replacing a DIMM). These intermittent memory problems can be very very annoying, and if they are rare enough, virtually impossible to fix. In my previous job we had a high-end server (two-socket, dozens of cores, 256GB memory, half a dozen high-end PCIe IO cards) installed at a customer site, and every few days it would go down with a hardware fault. After months of local service technicians fiddling with it (and replacing all DIMMs in the process), it was decided to take the whole server (including DIMMs and PCIe cards), throw it in the trash, and install a brand-new one. The original problem was never found. After a while, throwing a new $30K server at the problem is cheaper than dozens of customer visits by field service.

leebrown66 · Aug 19, 2017

I've found memtest to be less than reliable in recent years. I had a bad RAM module which memtest running for 3 days didn't detect, but a kernel compile would fail in different places every time.

lol, didn't have 768GB of RAM though, I don't know if even a kernel compile would use all that memory

stratacast1 · Aug 20, 2017

Hahaha woooowww..welp I did some investigation and I'm wondering if 1 of my RAM slots is corroded. All my RAM tested reliable (including the new one) on known-working slots. However, the moment I put sticks in the other 2 memory slots, errors fly wild. There was some dust in one slot but I cleaned it out and it's still throwing errors. So my plan is to figure out how to clean the slot (contemplating the eraser method or isopropyl alcohol method) and see if that works. Otherwise, my mobo is apparently only $60-$80 on eBay.
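The swap-testing logic used here (a stick that passes in any slot is presumed good, a slot that passes with any stick is presumed good) can be written down as a small Python sketch. The DIMM and slot labels and the `isolate` function are hypothetical, and the results in the example are made up to mirror the pattern described above, not actual test logs. It also ignores effects such as a bad memory channel or problems that only appear when more slots are populated at once.

```python
def isolate(observations):
    """Sort failures into suspect DIMMs and suspect slots.

    observations: list of (dimm, slot, passed) tuples from swap testing.
    A DIMM or slot counts as "known good" if it took part in at least
    one passing test.
    """
    good_dimms = {dimm for dimm, slot, passed in observations if passed}
    good_slots = {slot for dimm, slot, passed in observations if passed}

    suspect_dimms, suspect_slots = set(), set()
    for dimm, slot, passed in observations:
        if passed:
            continue
        if dimm in good_dimms and slot not in good_slots:
            suspect_slots.add(slot)   # known-good stick fails here: blame the slot
        elif slot in good_slots and dimm not in good_dimms:
            suspect_dimms.add(dimm)   # fails in a known-good slot: blame the stick
        # Anything else is ambiguous and needs more swapping.
    return sorted(suspect_dimms), sorted(suspect_slots)


if __name__ == "__main__":
    # Hypothetical results: every stick passes in slots 1-2 and fails in slots 3-4.
    results = []
    for dimm in ["A", "B", "C", "D"]:
        results.append((dimm, "slot1", True))
        results.append((dimm, "slot2", True))
        results.append((dimm, "slot3", False))
        results.append((dimm, "slot4", False))
    print(isolate(results))
```

With results shaped like that, no stick is implicated and both remaining slots are, which points at the slots or the board behind them rather than the RAM.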
