Hello,
Two of my servers have started crashing over the last few weeks. The first is built from OEM PC parts and is about six months old. The second is an IBM server and is only one month old. Both run FreeBSD 9.1-p5. The first server worked without any problems for months. The only change I made was updating it from FreeBSD 9.1-RELEASE to FreeBSD 9.1-p5 with the freebsd-update tool about a month ago.
The two servers crash for different reasons. I ran memtest and the RAM is okay. I reinstalled FreeBSD 9.1-RELEASE on a brand-new hard disk and updated the OS to FreeBSD 9.1-p5 after installation, but the problem continues. The PC-based server sometimes crashes but more often just stops responding. The IBM server always gives an "Unrecoverable machine check exception" error.
PC-based server:
Code:
# dmesg | grep ACPI
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
ACPI APIC Table: <ALASKA A M I>
ACPI Warning: FADT (revision 5) is longer than ACPI 2.0 version, truncating length 268 to 244 (20110527/tbfadt-320)
ACPI Error: [RAMB] Namespace lookup failure, AE_NOT_FOUND (20110527/psargs-392)
ACPI Exception: AE_NOT_FOUND, Could not execute arguments for [RAMW] (Region) (20110527/nsinit-380)
# dmesg | grep Warning
atrtc0: Warning: Couldn't map I/O.
The IBM server:
Code:
# dmesg | grep ACPI
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
ACPI APIC Table: <IBM BROMOLOW>
ACPI Warning: Invalid length for Pm1aControlBlock: 32, using default 16 (20110527/tbfadt-638)
# dmesg | grep Warning
atrtc0: Warning: Couldn't map I/O.
# more /var/crash/core.txt.0
panic: Unrecoverable machine check exception
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Unread portion of the kernel message buffer:
MCA: Bank 5, Status 0xfe20000000021136
MCA: Global Cap 0x0000000000000c09, Status 0x0000000000000005
MCA: Vendor "GenuineIntel", ID 0x306a9, APIC ID 1
MCA: CPU 1 UNCOR PCC OVER DCACHE L2 DRD error
MCA: Address 0x635c5080
MCA: Misc 0x70c0000086
MCA: Bank 5, Status 0xfe20000000021136
MCA: Global Cap 0x0000000000000c09, Status 0x0000000000000005
MCA: Vendor "GenuineIntel", ID 0x306a9, APIC ID 0
MCA: CPU 0 UNCOR PCC OVER DCACHE L2 DRD error
MCA: Address 0x635c5080
MCA: Misc 0x70c0000086
panic: Unrecoverable machine check exception
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff809208d6 at kdb_backtrace+0x66
#1 0xffffffff808ea8ee at panic+0x1ce
#2 0xffffffff80c640be at mca_intr+0xae
#3 0xffffffff80bd8a82 at trap+0x282
#4 0xffffffff80bc318f at calltrap+0x8
#5 0xffffffff80343eaa at acpi_cpu_idle+0x24a
#6 0xffffffff80bc7355 at cpu_idle_acpi+0x45
#7 0xffffffff80bc9b0c at cpu_idle+0x6c
#8 0xffffffff8091241c at sched_idletd+0x24c
#9 0xffffffff808bba1f at fork_exit+0x11f
#10 0xffffffff80bc36be at fork_trampoline+0xe
Uptime: 14h31m58s
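As a cross-check on the dump above, here is a minimal Python sketch that decodes the Bank 5 status value using the architectural IA32_MCi_STATUS bit layout from the Intel SDM. The function name `decode_mci_status` and the table names are my own, not part of any FreeBSD tool, and only the architectural flag bits and the cache-hierarchy error format are handled; the other model-specific bits are left alone.

```python
# Sketch: decode an IA32_MCi_STATUS value (architectural fields only).
# Names here are illustrative, not taken from FreeBSD or any library.

FLAGS = [
    (63, "VAL"),    # register contains a valid error
    (62, "OVER"),   # an earlier error was overwritten (overflow)
    (61, "UC"),     # error was not corrected by hardware
    (60, "EN"),     # error reporting was enabled for this bank
    (59, "MISCV"),  # the MISC register holds valid data
    (58, "ADDRV"),  # the ADDR register holds a valid address
    (57, "PCC"),    # processor context may be corrupted
]

# Fields of the "cache hierarchy" compound error code: 000F 0001 RRRR TTLL
REQUESTS = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
            5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TYPES = {0: "I", 1: "D", 2: "G"}             # instruction, data, generic
LEVELS = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def decode_mci_status(status):
    """Return the set flags, the error codes and, if applicable, cache details."""
    flags = [name for bit, name in FLAGS if (status >> bit) & 1]
    mca_code = status & 0xFFFF             # bits 15:0, architectural error code
    model_code = (status >> 16) & 0xFFFF   # bits 31:16, model-specific code

    cache = None
    # Cache hierarchy errors: bits 15:13 are 0, bits 11:8 are 0001.
    # Bit 12 (F) is ignored by the mask.
    if (mca_code & 0xEF00) == 0x0100:
        cache = {
            "request": REQUESTS.get((mca_code >> 4) & 0xF, "?"),
            "type": TYPES.get((mca_code >> 2) & 0x3, "?"),
            "level": LEVELS[mca_code & 0x3],
        }
    return {"flags": flags, "mca_code": mca_code,
            "model_code": model_code, "cache": cache}


if __name__ == "__main__":
    result = decode_mci_status(0xFE20000000021136)  # Bank 5 status from the dump
    print(" ".join(result["flags"]))
    print(hex(result["mca_code"]), result["cache"])
```

For the status value in my dump this agrees with the kernel's own summary line: an uncorrected (UC) error with PCC and OVER set, reported as a data read (DRD) in the L2 data cache.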
Another problem with the IBM server: if the ipmi.ko kernel module is loaded, then when I shut down or reboot the server, it powers off abruptly without stopping jobs or unmounting the disks.
I considered power problems, but all of the other servers (other brands) work without any problem, and I think the UPS is okay too. I should also note that the server room's ambient temperature is 18 °C (64 °F) and the humidity is about 40%.
I know that these brands don't officially support FreeBSD, but you guys are doing great things for compatibility. If you think these problems may be caused by FreeBSD's driver support, I will try the servers with Linux, which is officially supported by the manufacturers. If you think these errors are hardware problems, even though the servers are almost brand new, I will send them to the distributors for replacement.
I'd just like to hear your thoughts so that I don't waste time. What can I do in this situation?
Thanks in advance.