[SOLVED] Weird MCA errors

Posting here about a solved problem in the hope that it will be useful to someone.
I was seeing the errors below in /var/log/messages (and on the console and in dmesg.today).
This went on for months without any noticeable effect on performance.
The system is mostly idle, just running jails for a backup service.
The OS release was 13.x, and upgrading to the 14.1 release didn't help.

Code:
MCA: CPU 0 COR EN OVER GCACHE L2 EVICT error
MCA: Address 0x10630800
MCA: Bank 0, Status 0xd400400068000136
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x100f43, APIC ID 0
MCA: CPU 0 COR EN OVER DCACHE L2 DRD error
MCA: Address 0x3ac228800
MCA: Bank 1, Status 0x9400000000000151
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x100f43, APIC ID 0
MCA: CPU 0 COR EN ICACHE L1 IRD error
MCA: Address 0xffff80e38830
MCA: Bank 2, Status 0xd40040000000018a
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x100f43, APIC ID 0
MCA: CPU 0 COR EN OVER GCACHE L2 SNOOP error
MCA: Address 0xdff8800
MCA: Bank 0, Status 0xd400400068000136
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x100f43, APIC ID 0
MCA: CPU 0 COR EN OVER DCACHE L2 DRD error
MCA: Address 0xefd0800
MCA: Bank 1, Status 0xd000000000000171
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
I ran them through `mcelog` and didn't get much insight into why they were there.
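For anyone who wants to read the Status words by hand, here is a rough Python sketch. It assumes the architectural x86 MCi_STATUS layout (flag bits 57 to 63, and the `0000 0001 RRRR TTLL` pattern for cache errors in the low 16 bits). The function and table names are made up for this example, and it is not the kernel's or `mcelog`'s actual code.

```python
# Sketch: decode an x86 MCi_STATUS word for cache (memory hierarchy) errors.
# decode_status and the tables below are illustrative names, not a real API.

FLAGS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected; when clear the error was corrected (COR)
    (60, "EN"),     # reporting for this error was enabled
    (59, "MISCV"),  # MISC register is valid
    (58, "ADDRV"),  # Address value is valid
    (57, "PCC"),    # processor context corrupt
]
KIND = ["ICACHE", "DCACHE", "GCACHE", "?"]          # TT field
LEVEL = ["L0", "L1", "L2", "LG"]                    # LL field
REQUEST = ["ERR", "RD", "WR", "DRD", "DWR", "IRD",  # RRRR field
           "PREFETCH", "EVICT", "SNOOP"]


def decode_status(status):
    """Return (list of set flag names, cache error text or None)."""
    flags = [name for bit, name in FLAGS if (status >> bit) & 1]
    code = status & 0xFFFF
    desc = None
    if (code & 0xFF00) == 0x0100:  # 0000 0001 RRRR TTLL
        rrrr = (code >> 4) & 0xF
        tt = (code >> 2) & 0x3
        ll = code & 0x3
        req = REQUEST[rrrr] if rrrr < len(REQUEST) else "?"
        desc = f"{KIND[tt]} {LEVEL[ll]} {req}"
    return flags, desc


print(decode_status(0xd400400068000136))
# (['VAL', 'OVER', 'EN', 'ADDRV'], 'DCACHE L2 DRD')
print(decode_status(0x9400000000000151))
# (['VAL', 'EN', 'ADDRV'], 'ICACHE L1 IRD')
```

UC is clear in both words, which corresponds to COR (corrected) in the log. The decoded text matches the "CPU 0 ..." line that follows each Bank line rather than the one before it, which suggests that each record in the paste starts at its Bank line.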
What was weird: while they looked like hardware errors, they were popping up at 5-minute intervals, almost to the second, e.g. 03:42:58, 03:47:58, etc.
Hardware errors are not usually that periodic.
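A quick way to confirm that kind of periodicity is to compute the gaps between the timestamps. This is a minimal sketch using only the two times quoted above; `gaps_seconds` is a made-up helper, and it assumes all timestamps fall on the same day (no midnight wrap).

```python
from datetime import datetime


def gaps_seconds(stamps, fmt="%H:%M:%S"):
    """Return the gaps in seconds between consecutive HH:MM:SS timestamps."""
    times = [datetime.strptime(s, fmt) for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


print(gaps_seconds(["03:42:58", "03:47:58"]))  # [300]
```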
The only thing that runs at a 5-minute interval is `atrun` (see /etc/cron.d/at), and there was nothing in /var/spool to run.
And the system is pretty much idle, running just a couple of jails.
I disabled atrun and it didn't help.
Today I pretty much gave up and decided it was a hardware problem: the old AMD Phenom II CPU has cache issues and it's time to replace it.
As I started collecting info on the CPU, DRAM and motherboard through `dmidecode`, I noticed that the clock speed was reported as 1000 MHz, while I was pretty sure I hadn't bought anything with a number that round.
So I rebooted, got into the BIOS and, sure enough, somewhere in "advanced settings" the system bus clock was set to 200 and the multiplier to x5. Probably I tried to overclock it a long time ago, failed, and it switched to "Fail Safe" settings. So I changed it to "Optimized default" and the CPU speed changed to 3200 MHz, as it is supposed to be.
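The numbers line up if the BIOS values are read as MHz. A small sketch of the arithmetic follows; the units and the idea that the bus clock stayed at 200 under "Optimized default" are assumptions, and the function names are made up.

```python
def core_mhz(bus_mhz, multiplier):
    """Core clock is the bus (reference) clock times the multiplier."""
    return bus_mhz * multiplier


def implied_multiplier(core, bus_mhz):
    """Multiplier needed to reach a given core clock from a bus clock."""
    return core / bus_mhz


print(core_mhz(200, 5))               # 1000, the "Fail Safe" speed
print(implied_multiplier(3200, 200))  # 16.0, assuming the bus stayed at 200
```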
After the reboot, all MCA errors disappeared.

TL;DR: nanoseconds matter; a wrong CPU speed can cause L1-L2-L3 cache error messages. And I believe they were real cache read errors.
 