Kernel log message regarding a CPU error

What does this error mean? My kernel log messages:
Code:
+++ /tmp/security.973bREs6	2011-06-16 03:02:51.000000000 +0200
+Timecounter "TSC" frequency 1833143813 Hz quality 800
+MCA: Bank 0, Status 0x946b400000000136
+MCA: Global Cap 0x0000000000000104, Status 0x0000000000000000
+MCA: Vendor "AuthenticAMD", ID 0x681, APIC ID 0
+MCA: CPU 0 COR DCACHE L2 DRD error
+MCA: Address 0x2759a00
 
Seeker said:
What does this error mean?
I'm not very familiar with the AMD implementation of the Machine Check Architecture, but it looks like your primary CPU (CPU 0) is experiencing correctable errors in its secondary (L2) data cache. It is rare for a CPU to fail partially due to an internal fault; normally they either work or they don't. This problem could be caused by a flaky power supply or a broken CPU cooler. Is this a commercially assembled system or a homebuilt one? If homebuilt, did you use proper thermal compound between the CPU and the cooler? Has this always happened, or did it start after the system had been running reliably for some time?
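The `COR` in the log comes straight from bits in the MCA status register. As a minimal sketch in Python, assuming the architectural MCi_STATUS bit layout described in the Intel SDM (Vol. 3, Machine-Check Architecture) also holds for these AMD values, the flag bits and the 16-bit MCA error code can be pulled out of the `Bank 0, Status` value posted above. The helper name `decode_status` is hypothetical, and bits 56:16, which carry additional and partly model-specific information, are ignored.

```python
# Architectural bit positions in a 64-bit MCi_STATUS register (Intel SDM,
# Vol. 3); assumed here to apply to the AMD status values as well.
# Bits 56:16 are ignored in this sketch.
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error; clear means corrected (logged as COR)
    60: "EN",     # error reporting was enabled for this error
    59: "MISCV",  # MCi_MISC holds additional information
    58: "ADDRV",  # MCi_ADDR holds the address of the error
    57: "PCC",    # processor context may be corrupt
}


def decode_status(status: int) -> dict:
    """Split a raw MCi_STATUS value into flag names and the MCA error code."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "corrected": "UC" not in flags,
        "mca_error": status & 0xFFFF,  # bits 15:0, architectural error code
    }


if __name__ == "__main__":
    # Value copied from the "MCA: Bank 0, Status ..." line in the first post.
    d = decode_status(0x946B400000000136)
    print(d["flags"], "COR" if d["corrected"] else "UNCOR", hex(d["mca_error"]))
```

For `0x946b400000000136` this reports `VAL`, `EN` and `ADDRV` set and `UC` clear (hence `COR`), with error code `0x136`; `ADDRV` being set fits the `MCA: Address` line that follows in the log.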
 
Homebuilt one. Everything was set up long ago, and I haven't touched the cooler in two years.

I've just opened the case to see if the cooler is spinning. It is.

The error hasn't appeared again, and the system is stable.
 
I've had something similar happen to me. My AMD CPU would throw Machine Check Exceptions complaining about ECC errors while fetching data from random places in memory. The system was initially as stable as usual, but it gradually got worse, especially under heavy CPU usage (games and the like).

I ran memcheck, which showed no errors but would hang or reboot the computer after a random amount of time. I tried removing memory modules one at a time, but that got me nowhere. I tried a spare PSU, but that didn't change anything either.

Finally, I fixed it by buying a new CPU. The system has been completely stable ever since.

I suspect it was a faulty L2 cache. So although it's rare, it can happen.

By the way, the CPU was an AMD Athlon 64 3000+.

Anyway, you could try stress-testing the CPU and memory. Memtest can't hurt. But if you don't run into any stability problems even under heavy load, I suppose it isn't anything to worry about; perhaps a bit just got flipped by a cosmic ray or something similar.
 
For the record, here is a second error, logged after a long period of time:
Code:
+MCA: Bank 2, Status 0x940040000000017a
+MCA: Global Cap 0x0000000000000104, Status 0x0000000000000000
+MCA: Vendor "AuthenticAMD", ID 0x681, APIC ID 0
+MCA: CPU 0 COR GCACHE L2 EVICT error
+MCA: Address 0x27ec700
The system is still stable.
I run very heavy compilations lasting more than an hour, and they all pass fine.
 
Would be worth trying another power supply. Also inspect the motherboard capacitors for visible failure (bulging), particularly the ones next to the processor. If the problem is caused by either of those, it will increase as components continue to degrade.
 
wblock said:
Would be worth trying another power supply. Also inspect the motherboard capacitors for visible failure (bulging), particularly the ones next to the processor. If the problem is caused by either of those, it will increase as components continue to degrade.
Agreed. The error is in a different part of the CPU this time, so I definitely think it is something external to the CPU causing the machine checks.

Last time:
Code:
MCA: CPU 0 COR DCACHE L2 DRD error

This time:
Code:
MCA: CPU 0 COR GCACHE L2 EVICT error
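The two lines differ in the low 16 bits of their status values, `0x0136` and `0x017a`. Both match the "cache hierarchy" compound error code pattern `0000 0001 RRRR TTLL` described in the Intel SDM's MCA chapter, which the AMD values here appear to follow as well. A minimal Python sketch with a hypothetical helper `decode_cache_error` reproduces the wording of both log lines from those codes:

```python
# Field names for the cache-hierarchy compound error code, whose low 16 bits
# follow the pattern 0000 0001 RRRR TTLL. Bit 12 (Intel's correction-report
# filtering bit) is masked off before matching.
TRANSACTION = {0: "I", 1: "D", 2: "G"}        # TT: instruction, data, generic
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}  # LL: cache level, LG = generic
REQUEST = {                                    # RRRR: request type
    0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
    5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP",
}


def decode_cache_error(mca_error: int) -> str:
    """Render a cache-hierarchy error code in the same style as the log lines."""
    if (mca_error & 0xEF00) != 0x0100:
        raise ValueError(f"{mca_error:#06x} is not a cache hierarchy error code")
    tt = (mca_error >> 2) & 0x3
    ll = mca_error & 0x3
    rrrr = (mca_error >> 4) & 0xF
    return (f"{TRANSACTION.get(tt, '?')}CACHE {LEVEL[ll]} "
            f"{REQUEST.get(rrrr, '???')}")


if __name__ == "__main__":
    # Low 16 bits of the two AMD status values posted in this thread.
    for code in (0x0136, 0x017A):
        print(f"{code:#06x} -> {decode_cache_error(code)}")
```

Decoded this way, both errors report the L2 level; what changes is the transaction type (`D`, data, versus `G`, generic) and the request (`DRD`, data read, versus `EVICT`, cache line eviction).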
 
To jump in, I got this today:

Code:
(17:33:48) Aug 30 17:02:22 IP2 kernel: MMMMCCCAA:CA:: A B B:aaB nBkn aaknn kk  5, S5tat, u0Sst0 a0,,x tS utSsabt t2au0t0s0u xs00 x00b2x10b0b22000000008
 01002000048001404e210000000f080e0000f800
Aug 30 17:02:22 IP2 kernel: MCA: Global Cap 0x0000000000000806, Status 0x0000000000000005
Aug 30 17:02:22 IP2 kernel: MCA: Global Cap 0x0000000000000806, Status 0x0000000000000004
Aug 30 17:02:22 IP2 kernel: MCA: Vendor "GenuineIntel", ID 0x10676, APIC ID 5
Aug 30 17:02:22 IP2 kernel: MCA: Global Cap 0x0000000000000806, Status 0x0000000000000005
Aug 30 17:02:22 IP2 kernel: 
Aug 30 17:02:22 IP2 kernel: MCA: CPU 5 
Aug 30 17:02:22 IP2 kernel: MCA: Vendor "GenuineIntel", ID 0x10676, APIC ID 6
Aug 30 17:02:22 IP2 kernel: MCA: Global Cap 0x0000000000000806, Status 0x0000000000000004
Aug 30 17:02:22 IP2 kernel: MCA: CPU 6 
Aug 30 17:02:22 IP2 kernel: UNCOR UNCOR PCC PCC BUSLG BUSL0 ???Source ERR  ERR OtherMemory
Aug 30 17:02:22 IP2 kernel: 
Aug 30 17:02:22 IP2 kernel: MCA: Vendor "GenuineIntel", ID 0x10676, APIC ID 7
Aug 30 17:02:22 IP2 kernel: MCA: Vendor "GenuineIntel", ID 0x10676, APIC ID 4
Aug 30 17:02:22 IP2 kernel: 
Aug 30 17:02:22 IP2 kernel: MCA: CPU 4 
Aug 30 17:02:23 IP2 kernel: 
Aug 30 17:02:23 IP2 kernel: MCA: CPU 7 
Aug 30 17:02:23 IP2 kernel: UNCOR 
Aug 30 17:02:23 IP2 kernel: Fatal trap 28: machine check trap while in kernel modePCC 
Aug 30 17:02:23 IP2 kernel: cpuid = 5; 
Aug 30 17:02:23 IP2 kernel: MCA: Bank 5, Status 0xb200000044100e0fapic id = 05
Aug 30 17:02:23 IP2 kernel: MCA: Global Cap 0x0000000000000806, Status 0x0000000000000004
Aug 30 17:02:23 IP2 kernel: instruction pointer = 0x20:0xffffffff80895796
Aug 30 17:02:23 IP2 kernel: 
Aug 30 17:02:23 IP2 kernel: MCA: Vendor "GenuineIntel", ID 0x10676, APIC ID 6
Aug 30 17:02:23 IP2 kernel: stack pointer               = 0x28:0xffffff80000ebb40UNCOR 
Aug 30 17:02:23 IP2 kernel: frame pointer               = 0x28:0xffffff80000ebb50
Aug 30 17:02:23 IP2 kernel: 
Aug 30 17:02:23 IP2 kernel: MCA: CPU 6 
Aug 30 17:02:23 IP2 kernel: code segment                = base 0x0, limit 0xfffff, type 0x1bPCC 
Aug 30 17:02:23 IP2 kernel:  DPL 0, pres 1, long 1, def32 0, gran 1BUSLG UNCOR 
Aug 30 17:02:23 IP2 kernel:  Aug 30 17:02:23 IP2 kernel: processor eflags    = ???interrupt enabled,  ERR IOPL = 0

Running 8.1-STABLE r215391 with MySQL 5.1.49 on UFS, with 42 GB of RAM (most of it used as cache) and an 8-core Xeon.
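Because several CPUs printed at the same time, most of the lines above are interleaved and cannot be reliably untangled, but the `MCA: Bank 5, Status 0xb200000044100e0f` value survived intact. As a minimal Python sketch, assuming the bus and interconnect compound error code layout `0000 1PPT RRRR IILL` from the Intel SDM and using a hypothetical helper `decode_bus_status`, it decodes as an uncorrected error with PCC (processor context corrupt) set. That combination would be consistent with the kernel stopping with "Fatal trap 28: machine check trap" rather than continuing.

```python
# Decode "MCA: Bank 5, Status 0xb200000044100e0f" from the Intel log.
# The low 16 bits follow the bus/interconnect pattern 0000 1PPT RRRR IILL.
PARTICIPATION = {0: "Source", 1: "Responder", 2: "Observer", 3: "???"}  # PP
INTERFACE = {0: "Memory", 1: "???", 2: "I/O", 3: "Other"}                # II
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}                             # LL
REQUEST = {                                                              # RRRR
    0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
    5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP",
}


def decode_bus_status(status: int) -> str:
    """Return the UC/PCC flags plus the bus-error fields, in the log's order."""
    words = ["UNCOR" if (status >> 61) & 1 else "COR"]  # bit 61: UC
    if (status >> 57) & 1:                              # bit 57: PCC
        words.append("PCC")
    code = status & 0xFFFF
    if (code & 0xE800) != 0x0800:                       # bit 11 marks bus errors
        raise ValueError(f"{code:#06x} is not a bus/interconnect error code")
    words.append("BUS" + LEVEL[code & 0x3])
    words.append(PARTICIPATION[(code >> 9) & 0x3])
    words.append(REQUEST.get((code >> 4) & 0xF, "???"))
    words.append(INTERFACE[(code >> 2) & 0x3])
    if (code >> 8) & 1:                                 # T: request timed out
        words.append("timed out")
    return " ".join(words)


if __name__ == "__main__":
    print(decode_bus_status(0xB200000044100E0F))
```

Every word this produces (`UNCOR`, `PCC`, `BUSLG`, `???`, `ERR`, `Other`) appears somewhere in the garbled `UNCOR UNCOR PCC PCC ...` line above; `???` appears to be how the log renders the generic participation value (PP = 11).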
 