Solved MCA: CPU internal parity error, need help

I first installed 11.1-RELEASE on an old laptop with a minimal Openbox setup without any issues. Now I have installed 11.1-STABLE on an i7-3770 box. While building ports there were a lot of restarts and core dumps. I noticed "MCA: CPU 4 COR (16) over internal parity errors". Thread/mca.....that.40386 seems to be the same problem as mine. I have not overclocked the CPU, and I don't know how to set Vcore on the fly.

After so many failed attempts I finally managed to install the ports xorg, openbox, tint etc., but only after installing some of the dependent ports as packages. I know this is not good. But the only choice left is to go the pure pkg route, with no building world etc. This may be a hardware problem, and I need your help.
 
If you haven't overclocked or set some weird voltages in the BIOS then it's most likely a hardware defect. Building from source tends to push hardware quite hard so anything that's just a little bit "off" is going to cause issues.
 
Please remember that overclocking can permanently damage CPUs.

Don't be surprised if you ran Windows on that computer before without noticing anything: Windows just suppresses such errors.
Defective CPUs appear to run flawlessly this way.

I once had a 3-core Athlon and a BIOS that allowed activating the disabled core.
That core then showed L1 cache parity errors. But Windows never complained.
If you are lucky, your BIOS allows you to disable the defective core...
 
I had an issue like this before, but it's rare to see it. I didn't even know that this was a thing until I saw error messages on the console. You haven't posted the entire message, but the usual culprit for an MCE is L1 cache parity errors. Most errors are correctable, so there is generally no problem. Sometimes you do get an error which is caused by a transient event, like a cosmic ray striking an on-chip memory cell. Yes, it happens, but it is exceedingly rare.

You may have a defective CPU core that is causing problems. The message will indicate which core is causing the problem and what the problem is. In your case, it appears to be core 4? If other cores are also producing messages, you could have a faulty power supply or mainboard. On the mainboard, there is a power converter which converts the source voltages into the VCore voltage that the CPU requires. Check the capacitors (the tall cylindrical parts standing upright on the board) and see if the tops of any of them are bulging out. If any of them are, or you see brown stuff on top, or the top of the can has split open, then you have bad capacitors and they are screwing with the VCore voltage stability.
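The "one core versus several cores" check described above can be sketched in a few lines of Python. This is only an illustration: it assumes nothing about the log format beyond the "MCA: CPU <n>" prefix quoted in the first post, and the function names and the sample log are hypothetical.

```python
import re
from collections import Counter

# Matches the "MCA: CPU <n>" prefix quoted in the first post. Everything
# after the CPU number is ignored, since its exact layout is not assumed.
MCA_CPU = re.compile(r"MCA: CPU (\d+)\b")


def count_mca_by_cpu(lines):
    """Count MCA report lines per CPU number."""
    counts = Counter()
    for line in lines:
        match = MCA_CPU.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def verdict(counts):
    """Apply the rule of thumb from the post above."""
    if not counts:
        return "no MCA reports found"
    if len(counts) == 1:
        return "single CPU affected: suspect that core"
    return "several CPUs affected: suspect power supply or mainboard"


# Hypothetical log excerpt; only the MCA line itself is taken from the thread.
SAMPLE_LOG = [
    "MCA: CPU 4 COR (16) over internal parity errors",
    "some unrelated kernel message",
    "MCA: CPU 4 COR (16) over internal parity errors",
]

if __name__ == "__main__":
    counts = count_mca_by_cpu(SAMPLE_LOG)
    for cpu, n in sorted(counts.items()):
        print(f"CPU {cpu}: {n} report(s)")
    print(verdict(counts))
```

To use it on a real machine, feed it the lines of the system log (for example /var/log/messages) instead of SAMPLE_LOG. The verdict is only the rule of thumb from the post, not a diagnosis.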

A faulty power supply can cause all kinds of problems. Excessive switching noise on the power rails can cause lots of problems in digital circuitry. Power dropouts can disrupt computer operations and screw with contents of registers, SRAM cache, and such. Memory not so much though (believe it or not, dynamic memory does maintain its information for a short period after power is removed. In fact, it was shown that you could remove memory modules from a machine that is powered on, and then place the modules in a memory reader to dump the contents.).

Beyond that, you probably have a defective CPU and probably should replace it.

Here are a couple of pictures to show you what to look for.

[Attached image: bulged caps.jpg]

[Attached image: e_VGA_7600_GT_Sacon_caps_2.jpg]
 
Please note that bad caps don't need to be bulging.
They can leak without bulging, too.

You'll see this when you examine the board carefully.
Make sure that you don't confuse the liquid with the brown goo glue.

Last weekend I scavenged an external SCSI drive case with power supply from the university electronics junk container, as I needed a case for my new BBB.
According to the date stamp it was 19 years old, so I proactively opened the power supply to inspect the caps.
One was leaky, recognizable by the liquid that had flowed onto the PCB. But it was not bulging!
After I removed the defective cap, the leak was easily visible from the underside.
The liquid on the PCB can easily be cleaned up with isopropyl alcohol (IPA).

I still have the cap in my electronics trash bucket, so I can photograph it and post the photo.
 
Building from source tends to push hardware quite hard so anything that's just a little bit "off" is going to cause issues
Sure, Sir, you pointed me in the right direction. Snurg, it is true that big fat Windows and other players mislead people about OC, for only a minor or even negative gain.
I disabled OC in the BIOS and it seems that things have improved. htop shows that all the CPUs are fully loaded simultaneously and that much more RAM is used (~2.5GB).
That bit of "off" from the OC had me puzzled. Probably it was set up for games. Now I am trying to build palemoon after some ports updates today. It is bedtime now; I will start again tomorrow.
 
Just so that you know, when overclocking, you also need to increase VCore to compensate for it because the CPU requires more power to run at the higher clock rate. This can damage the CPU, but if your concern is performance above hardware reliability.... Power = Volts * Amps, and with CMOS technology, transistors only draw power when switching. If the clock is zero, then the current draw is in the picoamps which is just the leakage current. So if you switch the transistors faster, there's not enough current available to fully switch states, which is why you need to increase VCore.
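To put rough numbers on this: the usual first-order model for CMOS switching power is P ≈ C · V² · f, so power scales linearly with the clock and with the square of VCore. The sketch below only evaluates that ratio. It deliberately ignores leakage current and temperature effects, and the function name and the example figures are made up for illustration.

```python
def relative_dynamic_power(freq_ratio, vcore_ratio):
    """Return P_new / P_old for P ~ C * V^2 * f, with C held constant."""
    if freq_ratio <= 0 or vcore_ratio <= 0:
        raise ValueError("ratios must be positive")
    return freq_ratio * vcore_ratio ** 2


if __name__ == "__main__":
    # Hypothetical overclock: 10% more clock with 5% more VCore.
    ratio = relative_dynamic_power(1.10, 1.05)
    print(f"dynamic power rises by about {100 * (ratio - 1):.0f}%")
```

Under this model a 10% overclock with a 5% VCore bump costs roughly 21% more switching power, all of which ends up as heat in the CPU and load on the mainboard's power converter.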
 
Sorry for the off-topic, Maelstorm:
Aside from what you mentioned already, I wonder whether it makes sense at all nowadays to overclock processors, except maybe the highest speed class of a series.
Nowadays the dies are sorted meticulously into narrow ranges depending on their sweet spots.
And even when you use the fastest of a series, it is not at all certain that there is much margin left to overclock, because manufacturers are known to keep and pile up the best selected dies, so they have some inventory ready when they announce a faster variant.

I sometimes have the feeling that manufacturers support overclocking because they sell more that way, thanks to all the silicon that breaks from it.
I once looked into an overclocker forum and it was amazing how many people talked about broken CPUs and mobos they were going to replace asap...

I wish there were more information about improving system stability and reducing power consumption by underclocking...
For example, I'd love to know whether underclocking DRAM to increase non-ECC reliability makes sense at all, unless you have a BIOS that permits increasing the cycle lengths above the values stored in the SPD...
 
I also wish there were more awareness of stability and damage issues around overclocking. Normal (Windows) users ignore small abnormalities, damage their boxes or parts that way, and then blame others. In general I would rather run my car below full capacity for long periods and at full throttle only for short periods.
 
Memory nowadays operates at a speed of about 20MHz - 50MHz or so. Memory speed has increased only incrementally over the years due to the inherent nature of DRAM. It is physically impossible to read/write the cells any faster and maintain reliability, so what they do is read/write more memory in a given memory cycle. Hence the architectural/organizational improvements such as DDR levels and multi-channel communication paths. A read command will bring back something like 16 or 32 words of memory (a word being the native CPU bitwidth). Current designs read memory in blocks because it must go through the caches to get inside the CPU. So a block of 8 words on a 64-bit CPU is the size of one cache line in the L1 cache of most 64-bit CPUs (assuming a 64-byte cache line). The caches are designed to exploit temporal and spatial locality to minimize cache misses. This is stuff that manufacturers will not tell you.
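The cache line arithmetic above is easy to check. The short Python sketch below uses only the two figures stated in the post (64-bit words, 64-byte cache lines); the helper names and the example addresses are made up for illustration.

```python
WORD_BITS = 64          # native word size assumed in the post
CACHE_LINE_BYTES = 64   # cache line size assumed in the post

WORD_BYTES = WORD_BITS // 8
WORDS_PER_LINE = CACHE_LINE_BYTES // WORD_BYTES


def line_base(address):
    """Return the start address of the cache line containing address."""
    return address & ~(CACHE_LINE_BYTES - 1)


def same_line(a, b):
    """True if both byte addresses fall into the same cache line."""
    return line_base(a) == line_base(b)


if __name__ == "__main__":
    print(f"{WORDS_PER_LINE} words of {WORD_BYTES} bytes per cache line")
    # Neighbouring words share a line, so one memory read serves both.
    print(same_line(0x1000, 0x1038))
    # The next word starts a new line and needs another memory read.
    print(same_line(0x1000, 0x1040))
```

This is what spatial locality means in practice: after one miss brings in the whole line, accesses to the other seven words in that line are served from the cache.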

Running the memory below spec is somewhat iffy because you are tampering with timing. Not just clock speed and such, but chip-to-chip timing which is critical to getting signals from one place to another with a reasonable amount of integrity. If the BIOS supports it, you *COULD* run slower than the recommended clock, but you increase the cache miss penalty because you are making the system wait longer for that data. Memory hardware runs at the speed of the front side bus. The FSB clock doesn't change, but the timings will.

Memory is actually quite reliable these days. There really is no need to underclock it. My recommendation is to just use what the SPD on the DIMMs says to use. The manufacturers are not going to steer you wrong in this regard. The settings that they specify ensure reliability and performance. It's a trade-off... but then again, so is everything else. Besides, the memory manufacturers make a name for themselves based on the reliability of their products. If word gets out that their products are not reliable, they will eventually fold.
 
As a normal user I was not aware of those implications for memory. We only go by some forum posts, and we have to depend on the manufacturer for maintenance and replacement. Changing parameters in a modern BIOS is so easy that it tempts us. I learned this the hard way, before any damage was done to my box. The DIMM speed was my fault, but the CPU OC (by 0.1) was advertised by the manufacturer. So keeping all the parts in sync is important, and that is also a trade-off.
 