Unrecoverable machine check exception

Hello,

Two of my servers have been crashing for the last few weeks. The first one is built from OEM PC parts and is about six months old. The second one is an IBM server and is only one month old. Both servers run FreeBSD 9.1-p5. The first server had been working without any problems for months. The only change I made was updating the system from FreeBSD 9.1-RELEASE to FreeBSD 9.1-p5 with the freebsd-update tool about a month ago.

The two servers crash for different reasons. I ran memtest and the RAM is okay. I reinstalled FreeBSD 9.1-RELEASE on a brand new hard disk and updated the OS to FreeBSD 9.1-p5 after installation, but the problem continues. The PC-based server sometimes crashes, but more often it just stops responding. The IBM server always gives an "Unrecoverable machine check exception" error.

PC-based server:
Code:
# dmesg | grep ACPI
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
ACPI APIC Table: <ALASKA A M I>
ACPI Warning: FADT (revision 5) is longer than ACPI 2.0 version, truncating length 268 to 244 (20110527/tbfadt-320)
ACPI Error: [RAMB] Namespace lookup failure, AE_NOT_FOUND (20110527/psargs-392)
ACPI Exception: AE_NOT_FOUND, Could not execute arguments for [RAMW] (Region) (20110527/nsinit-380)

# dmesg | grep Warning
atrtc0: Warning: Couldn't map I/O.

The IBM server:
Code:
# dmesg | grep ACPI
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
ACPI APIC Table: <IBM    BROMOLOW>
ACPI Warning: Invalid length for Pm1aControlBlock: 32, using default 16 (20110527/tbfadt-638)

# dmesg | grep Warning
atrtc0: Warning: Couldn't map I/O.

# more /var/crash/core.txt.0
panic: Unrecoverable machine check exception

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
MCA: Bank 5, Status 0xfe20000000021136
MCA: Global Cap 0x0000000000000c09, Status 0x0000000000000005
MCA: Vendor "GenuineIntel", ID 0x306a9, APIC ID 1
MCA: CPU 1 UNCOR PCC OVER DCACHE L2 DRD error
MCA: Address 0x635c5080
MCA: Misc 0x70c0000086
MCA: Bank 5, Status 0xfe20000000021136
MCA: Global Cap 0x0000000000000c09, Status 0x0000000000000005
MCA: Vendor "GenuineIntel", ID 0x306a9, APIC ID 0
MCA: CPU 0 UNCOR PCC OVER DCACHE L2 DRD error
MCA: Address 0x635c5080
MCA: Misc 0x70c0000086
panic: Unrecoverable machine check exception
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff809208d6 at kdb_backtrace+0x66
#1 0xffffffff808ea8ee at panic+0x1ce
#2 0xffffffff80c640be at mca_intr+0xae
#3 0xffffffff80bd8a82 at trap+0x282
#4 0xffffffff80bc318f at calltrap+0x8
#5 0xffffffff80343eaa at acpi_cpu_idle+0x24a
#6 0xffffffff80bc7355 at cpu_idle_acpi+0x45
#7 0xffffffff80bc9b0c at cpu_idle+0x6c
#8 0xffffffff8091241c at sched_idletd+0x24c
#9 0xffffffff808bba1f at fork_exit+0x11f
#10 0xffffffff80bc36be at fork_trampoline+0xe
Uptime: 14h31m58s

Another problem with the IBM server: if the ipmi.ko kernel module is loaded, then when I try to shut down or reboot the server, it suddenly powers off without unmounting the disks or stopping running jobs.

I thought about power problems, but all of the other servers (other brands) are working without any problems, and I think the UPS is also okay. I also want to note that the server room's ambient temperature is 18 C (64 F) and the humidity is about 40%.

I know that these brands don't officially support FreeBSD, but you guys are doing great things for compatibility. If you think these problems may be caused by FreeBSD's driver support, I will try the servers with Linux, which is officially supported by the manufacturers. If you think these errors are caused by hardware problems, even though the servers are almost brand new, I will send them to the distributors for replacement.

I would just like to hear your thoughts so that I don't waste time. What can I do in this situation?

Thanks in advance.
 
jailed said:
I would just like to hear your thoughts so that I don't waste time. What can I do in this situation?
Your IBM server is having an uncorrectable error in its on-chip Level 2 data cache.

This normally indicates a CPU problem or a problem with the power supply / motherboard. I think it is safe to say it isn't from overclocking (I doubt IBM allows configuring that). I think your CPU is an Intel Core CPU, so the system is probably too new to be affected by the "capacitor plague" of some years ago.

It is theoretically possible for this to be triggered by software tripping over a chip[set] erratum (the notices generally say "Workaround: Don't do that"), but I would expect newer FreeBSD releases to be able to deal with more errata, not fewer, so I don't see how the update triggered this.
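
If you want to check the reading of the dump yourself, the flags on the "MCA: CPU ..." line can be derived from the Status value. Below is a minimal sketch in Python. It assumes the architectural bit layout of IA32_MCi_STATUS as described in the machine-check chapter of the Intel SDM; the helper name decode_mca_status and the small table of simple error codes are my own choices for illustration, not anything FreeBSD provides, and the model-specific bits are ignored.

```python
# Sketch only: decode the architectural part of an IA32_MCi_STATUS value.
# Bit positions follow the machine-check chapter of the Intel SDM;
# decode_mca_status() is a made-up helper name, not a FreeBSD interface.

FLAGS = [
    (63, "VAL"),    # status register holds a valid error
    (62, "OVER"),   # a previous error was overwritten
    (61, "UC"),     # error was not corrected by hardware
    (60, "EN"),     # reporting of this error was enabled
    (59, "MISCV"),  # the Misc value is valid
    (58, "ADDRV"),  # the Address value is valid
    (57, "PCC"),    # processor context may be corrupt
]

# A small subset of the simple error codes; extend as needed.
SIMPLE_CODES = {
    0x0005: "internal parity error",
    0x0400: "internal timer error",
}

# Fields of a cache hierarchy error code: 000F 0001 RRRR TTLL
REQUESTS = ["ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
TYPES = ["I", "D", "G"]            # instruction, data, generic
LEVELS = ["L0", "L1", "L2", "LG"]  # LG = generic level


def decode_mca_status(status):
    """Return (list of flag names, description of the low 16-bit error code)."""
    flags = [name for bit, name in FLAGS if (status >> bit) & 1]
    code = status & 0xFFFF
    if code in SIMPLE_CODES:
        desc = SIMPLE_CODES[code]
    elif (code & 0xEF00) == 0x0100:  # cache hierarchy error, F bit ignored
        rrrr = (code >> 4) & 0xF
        tt = (code >> 2) & 0x3
        ll = code & 0x3
        if rrrr < len(REQUESTS) and tt < len(TYPES):
            desc = f"{TYPES[tt]}CACHE {LEVELS[ll]} {REQUESTS[rrrr]} error"
        else:
            desc = f"cache error with reserved fields, code 0x{code:04x}"
    else:
        desc = f"not decoded by this sketch, code 0x{code:04x}"
    return flags, desc


if __name__ == "__main__":
    flags, desc = decode_mca_status(0xfe20000000021136)
    print(" ".join(flags), "->", desc)
```

For the Status value 0xfe20000000021136 from the IBM server, this yields the flags VAL OVER UC EN MISCV ADDRV PCC and the description "DCACHE L2 DRD error", which agrees with the "UNCOR PCC OVER DCACHE L2 DRD error" line in the dump: an uncorrected data read error in the L2 data cache, with processor context corrupt.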
 
Hello,

I recently talked with IBM. We ran hardware tests with IBM's software, and the results are okay; there is no problem with the hardware. Just two days after I opened this thread, IBM published a firmware update. When I read the change log, I saw a reported bug in virtualization support that causes random resets on VMware ESX. I'm not sure, but since FreeBSD's jail support is a form of virtualization, maybe this is what causes the problem.

Yesterday I updated the firmware on the server. The uptime is now 1 day 12 hours and there has been no problem so far. I won't be sure that the problem is solved until at least 10 days have passed.

If it continues, I will first try FreeBSD 9.2-RELEASE when it's announced. If it still continues, I will have to switch to VMware, Xen or Linux, because the hardware is okay and passed all the stress tests.

Because the hardware is okay and there's nothing to do now besides waiting, I'm marking the thread as solved.

Thank you very much for your reply.

Sincerely.
 
Hello,

I wanted to update the thread to share more information about the problem. Even after I upgraded the firmware on both servers, both of them continue to give the same error.

OEM server, CPU: Core i3-3220 (Ivy Bridge), non-ECC DDR3 RAM
Code:
Unread portion of the kernel message buffer:
 ID 0x306a9, APIC ID 3
MCA: CPU 3 COR (1) internal parity error
MCA: Bank 0, Status 0x9000004000010005
MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x306a9, APIC ID 2
MCA: CPU 2 COR (1) internal parity error
MCA: Bank 3, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000005
MCA: Vendor "GenuineIntel", ID 0x306a9, APIC ID 0
MCA: CPU 0 UNCOR PCC internal timer error
MCA: Address 0x3fff808d94f0
MCA: Misc 0x3ffff
panic: Unrecoverable machine check exception
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80920bb6 at kdb_backtrace+0x66
#1 0xffffffff808eabce at panic+0x1ce
#2 0xffffffff80c643ee at mca_intr+0xae
#3 0xffffffff80bd8db2 at trap+0x282
#4 0xffffffff80bc34bf at calltrap+0x8
#5 0xffffffff80982ea4 at vputx+0x84
#6 0xffffffff809746d9 at lookup+0x2d9
#7 0xffffffff80975979 at namei+0x4e9
#8 0xffffffff8098a076 at kern_accessat+0xd6
#9 0xffffffff80bd7e46 at amd64_syscall+0x546
#10 0xffffffff80bc37a7 at Xfast_syscall+0xf7
Uptime: 1d18h36m58s

IBM server, CPU: Xeon E3-1230V2 (Ivy Bridge), ECC DDR3 RAM
Code:
Unread portion of the kernel message buffer:
MCA: Bank 5, Status 0xfe20000000021136
MCA: Global Cap 0x0000000000000c09, Status 0x0000000000000005
MCA: Vendor "GenuineIntel", ID 0x306a9, APIC ID 1
MCA: CPU 1 UNCOR PCC OVER DCACHE L2 DRD error
MCA: Address 0x3a5c5080
MCA: Misc 0xb046000086
panic: Unrecoverable machine check exception
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff809208d6 at kdb_backtrace+0x66
#1 0xffffffff808ea8ee at panic+0x1ce
#2 0xffffffff80c640be at mca_intr+0xae
#3 0xffffffff80bd8a82 at trap+0x282
#4 0xffffffff80bc318f at calltrap+0x8
#5 0xffffffff80343eaa at acpi_cpu_idle+0x24a
#6 0xffffffff80bc7355 at cpu_idle_acpi+0x45
#7 0xffffffff80bc9b0c at cpu_idle+0x6c
#8 0xffffffff8091241c at sched_idletd+0x24c
#9 0xffffffff808bba1f at fork_exit+0x11f
#10 0xffffffff80bc36be at fork_trampoline+0xe
Uptime: 3d1h49m10s

I've upgraded the systems from FreeBSD 9.1-RELEASE-p5 to FreeBSD 9.1-RELEASE-p6. The results are the same.

Since I couldn't find any other solution and all diagnostics show that the servers have no hardware problems, I will use Linux on these servers.

Thanks for your help.
 