Some Kernel traps 28

Hi!

I'm getting some kernel traps. The machine was working correctly, and now I'm getting nearly one kernel trap a day. These are some of the traps:

Code:
kernel: MCA: Bank 1, Status 0xb600000000000181
kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000004
kernel: MCA: Vendor "AuthenticAMD", ID 0x40fb2, APIC ID 0
kernel: MCA: CPU 0 UNCOR PCC ICACHE L1 SNOOP error
kernel: MCA: Address 0xb61780
kernel:
kernel:
kernel: Fatal trap 28: machine check trap while in kernel mode
kernel:
kernel: cpuid = 0; apic id = 00
kernel: instruction pointer     = 0x20:0xffffffff8088eff6
kernel: stack pointer           = 0x28:0xffffff8000039b00
kernel: frame pointer           = 0x28:0xffffff8000039b10
kernel: code segment            = base 0x0, limit 0xfffff, type 0x1b
kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
kernel: processor eflags        = interrupt enabled, IOPL = 0
kernel: current process         = 11 (idle: cpu0)
And:

Code:
kernel: MCA: Bank 1, Status 0xb600000000000181
kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000004
kernel: MCA: Vendor "AuthenticAMD", ID 0x40fb2, APIC ID 0
kernel: MCA: CPU 0 UNCOR PCC ICACHE L1 SNOOP error
kernel: MCA: Address 0xb68400
kernel:
kernel:
kernel: Fatal trap 28: machine check trap while in kernel mode
kernel: cpuid = 0; apic id = 00
kernel: instruction pointer     = 0x20:0xffffffff805c9fd0
kernel: stack pointer           = 0x28:0xffffff80000399d0
kernel: frame pointer           = 0x28:0xffffff8000039a00
kernel: code segment            = base 0x0, limit 0xfffff, type 0x1b
kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
kernel: processor eflags        = IOPL = 0
kernel: current process         = 11 (idle: cpu0)
kernel: trap number             = 28
kernel: panic: machine check trap
kernel: cpuid = 0
kernel: KDB: stack backtrace:
kernel: #0 0xffffffff805f4e0e at kdb_backtrace+0x5e
kernel: #1 0xffffffff805c2d07 at panic+0x187
kernel: #2 0xffffffff808ac630 at trap_fatal+0x290
kernel: #3 0xffffffff808acc19 at trap+0x109
kernel: #4 0xffffffff80894fe4 at calltrap+0x8
kernel: #5 0xffffffff8089bab6 at lapic_handle_timer+0x196
kernel: #6 0xffffffff80895b3d at Xtimerint+0x8d
kernel: #7 0xffffffff801f443a at acpi_cpu_idle+0x20a
kernel: #8 0xffffffff805e770f at sched_idletd+0x11f
kernel: #9 0xffffffff805994f8 at fork_exit+0x118
kernel: #10 0xffffffff808954ae at fork_trampoline+0xe
kernel: Uptime: 16h21m59s
kernel: Physical memory: 4066 MB
kernel: Dumping 552 MB: 537 521 505 489 473 457 441 425 409 393 377 361 345 329 313 297 281 265 249 233 217 201 185 169 153 137 121 105 89 73 57 41 25 9
Any idea? I'm lost.
 
These are hardware errors (MCA). Check to see if the fan on your CPU still works and isn't clogged up with dust.
 
I've opened the server, cleaned out all the dust, and replaced the thermal paste on the CPU cooler. It worked OK for some hours, and then I got another trap. I've remembered that this started after a freebsd-update to 8.2-RELEASE-p3. More ideas?

Thanks!
 
juanjico said:
I've opened the server, cleaned out all the dust, and replaced the thermal paste on the CPU cooler. It worked OK for some hours, and then I got another trap. I've remembered that this started after a freebsd-update to 8.2-RELEASE-p3. More ideas?
I still think you have a CPU problem. It is reporting an uncorrectable error in its cache. It is certainly possible that some sequence of code that FreeBSD uses is triggering it, but even if that's the case, it is a CPU problem: modern CPUs are supposed to either correctly process or reject the instruction stream, not machine check with an uncorrectable error. That CPU ID comes back as an "AMD Athlon 64 X2" from 2006, so I wouldn't expect this sort of thing (there are some people who like to run engineering samples of the latest CPUs, where this is more common).
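
To see why the kernel calls this an uncorrectable error, the flag bits at the top of the Status value can be decoded by hand. The following is a minimal Python sketch, assuming the standard x86 MCi_STATUS layout (bit 63 VAL down to bit 57 PCC). The name `decode_status_flags` is made up for illustration and is not part of FreeBSD.

```python
# Flag bits in the upper part of an x86 MCi_STATUS register.
# Assumption: standard architectural layout, bits 63..57.
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context corrupt
}


def decode_status_flags(status: int) -> list[str]:
    """Return the names of the flag bits that are set, highest bit first."""
    return [name for bit, name in FLAGS.items() if (status >> bit) & 1]


# Status value copied from the first post in this thread.
print(decode_status_flags(0xb600000000000181))
```

For the status in the first post this gives VAL, UC, EN, ADDRV and PCC, which lines up with the "UNCOR PCC" text and the "Address" line in the log.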

I'd suggest checking for a BIOS update for your motherboard (sometimes BIOS updates include microcode updates). If there is no BIOS update for your motherboard, or if that doesn't fix it, perhaps you can download the microcode patch file from the AMD web site and install it via cpucontrol(8). Note that a patch loaded this way does not persist: it probably needs to be reapplied at each boot, and definitely each time the system is power cycled.
 
Another possibility is that the power supply is close to failing and the CPU isn't getting the correct voltage for operation. I'm not familiar with voltage monitoring in FreeBSD, so others can fill in here.
 
Thanks for your replies.

Terry, by "CPU problem" do you mean a hardware or physical problem? I don't understand why the server worked perfectly for years and now, after the upgrade to -p3, has started to fail with kernel traps.

Maybe the new kernel in the -p3 upgrade has code that this relatively old CPU can't handle? If I downgrade to -p0, might that solve the problem?

I'll try to do a BIOS upgrade and replace the PSU to minimize factors.

Thanks!
 
These traps usually indicate problems in the hardware. What you could also try is removing some of the main memory to reduce the power draw a little. The trap is caused by a cache snooping error.
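
The "ICACHE L1 SNOOP" text comes from the low 16 bits of the Status value. Here is a rough Python sketch of that decoding, assuming the plain x86 cache-hierarchy error code layout 0000 0001 RRRR TTLL. The helper name and the tables are only an illustration, not the kernel's own code.

```python
# Field tables for the cache-hierarchy form 0000 0001 RRRR TTLL.
LEVELS = ["L0", "L1", "L2", "LG"]                 # LL: cache level (LG = generic)
KINDS = ["ICACHE", "DCACHE", "GCACHE"]            # TT: instruction, data, generic
REQUESTS = ["ERR", "RD", "WR", "DRD", "DWR",
            "IRD", "PREFETCH", "EVICT", "SNOOP"]  # RRRR: request type


def decode_cache_error(status):
    """Return (kind, level, request), or None if not a plain cache error."""
    code = status & 0xFFFF               # the MCA error code is the low 16 bits
    if (code & 0xFF00) != 0x0100:        # high byte must be 0000 0001
        return None
    level = code & 0x3                   # LL field
    kind = (code >> 2) & 0x3             # TT field
    request = (code >> 4) & 0xF          # RRRR field
    if kind >= len(KINDS) or request >= len(REQUESTS):
        return None                      # reserved encodings are not decoded
    return KINDS[kind], LEVELS[level], REQUESTS[request]


# Status value copied from the first post in this thread.
print(decode_cache_error(0xb600000000000181))
```

With the status from the first post, the low 16 bits are 0x0181, which decodes to an instruction cache, level 1, snoop request, the same as the kernel's own summary line.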

But could anyone come up with a valid explanation for why the instruction cache reports a snooping error*? Is someone writing to the code space, or how else could this come about?

*: Apart from being defective, but then I would wager it would trap much more frequently.
 
Another common hardware failure: many motherboards use prone-to-fail electrolytic capacitors. Failing capacitors cause intermittent but increasing problems. The ones filtering CPU power are right next to the processor, and the extra heat causes them to fail more quickly. So post the model of motherboard (some are famous for bad capacitors), and do an inspection. Look for electrolytic capacitors with bulging tops or leaking goo. A Google image search on "bulging capacitor" will show many horrifying examples.

That said, I saw one MCA error not too long ago on an 8-STABLE system:
Code:
MCA: Bank 3, Status 0x902000830001010a
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x10676, APIC ID 0
MCA: CPU 0 COR GCACHE L2 ERR error

That's a Pentium E5200 on a Gigabyte G31M-ES2L motherboard. No panic, just that error in the log.
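
If you want to keep an eye on whether the logged errors are corrected or not, the summary lines can be pulled out of the log with a few lines of Python. This is only a sketch: the pattern is based on nothing more than the two "MCA: CPU ..." lines quoted in this thread, and `parse_summary` is a hypothetical helper name.

```python
import re

# Assumption: summary lines look like the two examples in this thread,
# "MCA: CPU <n> COR|UNCOR [PCC] <description> error".
SUMMARY = re.compile(r"MCA: CPU (\d+) (UNCOR|COR)\b(.*) error$")


def parse_summary(line):
    """Split one MCA summary line into fields, or return None if it is not one."""
    match = SUMMARY.search(line.strip())
    if match is None:
        return None
    words = match.group(3).split()
    return {
        "cpu": int(match.group(1)),
        "corrected": match.group(2) == "COR",
        "pcc": "PCC" in words,
        "detail": " ".join(w for w in words if w != "PCC"),
    }


# The two summary lines quoted in this thread.
for line in (
    "kernel: MCA: CPU 0 UNCOR PCC ICACHE L1 SNOOP error",
    "MCA: CPU 0 COR GCACHE L2 ERR error",
):
    print(parse_summary(line))
```

The first line comes back as uncorrected with PCC set (the panic case), the second as a corrected L2 error, which fits with it only showing up in the log.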
 
The motherboard is an ASRock ALiveNF6G-VSTA. There's no BIOS upgrade available, and all the capacitors look fine on visual inspection.

I've replaced the PSU and have had no traps so far. I'll report back in a few days.

Thanks!
 