FreeBSD intermittent crash!

shahzaib

Member


Messages: 87

#1
Hi,

We're new to FreeBSD as well as to this forum, so please pardon any mistakes.

We switched to FreeBSD recently because of its improved ARC caching and asynchronous performance, but so far our experience with it has not been very good. It crashes every 2-3 days and we're unable to track down the problem. The server specs are pretty high:


Supermicro X5690 (12 cores, 24 threads - 2u)
96GB RAM
12x3TB RAID-10 (HBA-LSI9211)
X8DT3 Board
Supermicro PS- 902-1R 900W

Here is a screenshot of the most recent crash:

http://prntscr.com/9er3pk

One thing worth mentioning: there is no load on the server before it goes down, and free RAM is usually around 12GB.

Please, guys, help us resolve this issue. It's really killing us.
 

SirDice

Administrator
Staff member
Administrator
Moderator

Thanks: 5,508
Messages: 25,688

#2
Which version of FreeBSD?
 

shahzaib

Member


Messages: 87

#3
Thanks for the quick response. Here it is:
Code:
FreeBSD 10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 21:02:49 UTC 2014     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
 

PacketMan

Aspiring Daemon

Thanks: 108
Messages: 782

#4
I'll let the more experienced folks suggest what could be the cause of the crash, but it looks like you are using a base, unpatched version of 10.1-RELEASE. Consider applying the patches. One of my machines, for example, is running 10.1-RELEASE-p24, and it's likely there is a higher patch level available. It is possible the cause of your issue has already been addressed, and maintaining an up-to-date, patched system is good practice.

Start reading this whole document as time permits, but patching and updating are found in Section 23.2.

http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/
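
For reference, the patch level is the -pN suffix of the release string, and an unpatched install has no suffix at all. A minimal sketch for pulling it out in sh, assuming the string comes from uname -r (which reports the running kernel); the function name patch_level is made up for illustration:

```sh
# Print the patch level encoded in a FreeBSD release string.
# "10.1-RELEASE-p24" -> 24, "10.1-RELEASE" (no suffix) -> 0.
patch_level() {
    case "$1" in
        *-p[0-9]*) echo "${1##*-p}" ;;   # strip everything up to the last "-p"
        *)         echo 0 ;;             # no -pN suffix: base, unpatched release
    esac
}
```

For example, patch_level "$(uname -r)" would print 0 on the unpatched system shown above. The updates themselves are applied with freebsd-update fetch followed by freebsd-update install, as described in the Handbook.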
 

SirDice

Administrator
Staff member
Administrator
Moderator

Thanks: 5,508
Messages: 25,688

#6
You may want to check for hardware errors, the mca_intr call refers to Machine Check Architecture. But I'm not sure if it's the cause of the panic(9).

You said it crashes every 2-3 days, but the picture shows it has been up almost 12 days.
 

shahzaib

Member


Messages: 87

#7
Right, thanks. Regarding updating, I've found tons of patches and am about to update now, but one point is very important before the upgrade takes place: is there any chance of zpool corruption after the upgrade? We have around 16TB of data in the zpool. Sorry for the newbie question, but I am new to FreeBSD.
 

SirDice

Administrator
Staff member
Administrator
Moderator

Thanks: 5,508
Messages: 25,688

#8
There shouldn't be a risk to your existing pools, but it's always a good idea to have proper backups of course.
 

ondra_knezour

Aspiring Daemon

Thanks: 162
Messages: 710

#9
You may encounter a minor shock if:
  • The system boots from the given pool
  • ZFS was upgraded between the version you had before and the one you upgraded to
  • You run # zpool upgrade and forget to also upgrade the boot code

This may render your system unbootable, because the boot code would not be able to read the ZFS filesystem from which the system has to boot. However, data will not be lost, and you can fix it by booting a live system new enough to contain the same version of ZFS as your new pool and running # gpart bootcode <required params>.
 

shahzaib

Member


Messages: 87

#10
I am trying to update the system using freebsd-update(8) install, but the output is really insane:

Code:
Installing updates...install: ///usr/src/contrib/file/magic/Magdir/kerberos: No such file or directory
install: ///usr/src/contrib/file/magic/Magdir/meteorological: No such file or directory
install: ///usr/src/contrib/file/magic/Magdir/qt: No such file or directory
install: ///usr/src/contrib/ntp/html/drivers/driver40-ja.html: No such file or directory
install: ///usr/src/contrib/ntp/html/drivers/driver45.html: No such file or directory
install: ///usr/src/contrib/ntp/html/drivers/driver46.html: No such file or directory
install: ///usr/src/contrib/ntp/html/drivers/mx4200data.html: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/accopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/audio.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/authopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/clockopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/command.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/config.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/confopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/external.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/hand.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/install.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/manual.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/misc.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/miscopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/monopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/refclock.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/special.txt: No such file or directory
install: ///usr/src/contrib/ntp/include/declcond.h: No such file or directory
install: ///usr/src/contrib/ntp/include/intreswork.h: No such file or directory
install: ///usr/src/contrib/ntp/include/lib_strbuf.h: No such file or directory
install: ///usr/src/contrib/ntp/include/libntp.h: No such file or directory
install: ///usr/src/contrib/ntp/include/ntp_assert.h: No such file or directory
 

da1

Aspiring Daemon

Thanks: 92
Messages: 864

#11
As a workaround, just create the missing directories and run the fetch and install commands again.
 

ANOKNUSA

Aspiring Daemon

Thanks: 359
Messages: 671

#12
Or, since all the errors are in the source code distribution that the OP isn't using anyway, just configure freebsd-update(8) to exclude the source code. See freebsd-update.conf(5) for info on that.

EDIT: Unless the source code is being used for something. I shouldn't presume too much, sorry.
 

feld

New Member
Developer

Thanks: 1
Messages: 8

#13
ondra_knezour said:
"You may encounter minor shock if
- system boots from given pool
- ZFS was upgraded between version you had before and you upgraded into
- you run # zpool upgrade
- and you forget to also upgrade boot code

"This may render your system unbootable, because boot code would not be able to read the ZFS filesystem from which system has to boot. However data will not be lost and you can fix it by booting live system new enough to contain the same version of the ZFS as your new pool and run # gpart bootcode <required params>."

I haven't seen a zpool upgrade happen within security/errata patches. If he upgraded from 10.1-RELEASE to 10.2-RELEASE this might be the case.

All of my research on MCA panics reported to the mailing lists so far seems to indicate a hardware issue -- usually the processor.

da1 said:
"As a workaround, just create the missing directories and run the fetch and install commands again."

You can just remove src from Components in /etc/freebsd-update.conf to make those messages go away with future updates.
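
A rough sketch of that edit, assuming the stock file has a line such as "Components src world kernel". The helper name drop_src_component is hypothetical, and it only prints the modified file so the result can be reviewed before anything is overwritten:

```sh
# Print a freebsd-update.conf with "src" removed from the Components line.
# $1 is the path to the config file; the file itself is not modified.
drop_src_component() {
    awk '$1 == "Components" {
             out = $1
             # Rebuild the line from every component except "src".
             for (i = 2; i <= NF; i++)
                 if ($i != "src") out = out " " $i
             print out
             next
         }
         { print }' "$1"
}
```

One way to use it: drop_src_component /etc/freebsd-update.conf > /tmp/freebsd-update.conf.new, then compare the two files with diff before copying the new one into place.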
 

shahzaib

Member


Messages: 87

#14
Thanks, guys, for the workaround. I created the missing directories, updated, and rebooted the OS.

Code:
[root@cw001 ~]# uname -a
FreeBSD 10.1-RELEASE-p24 FreeBSD 10.1-RELEASE-p24 #0: Mon Nov  2 12:17:28 UTC 2015     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
 

shahzaib

Member


Messages: 87

#15
feld said:
"I haven't seen a zpool upgrade happen within security/errata patches. If he upgraded from 10.1-RELEASE to 10.2-RELEASE this might be the case.

"All of my research on MCA panics reported to the mailing lists so far seem to indicate a hardware issue -- usually processor.

"You can just remove src from Components in /etc/freebsd-update.conf to make those messages go away with future updates."

Thanks for the tips. I'll monitor the server to see if it crashes again.
 

PacketMan

Aspiring Daemon

Thanks: 108
Messages: 782

#16
Of course, and don't forget that SirDice asked you to check for hardware errors. Regardless of whether you have crashes or not, updating to the latest patch release is good practice.
 

_martin

Aspiring Daemon

Thanks: 115
Messages: 651

#17
You're missing the important information regarding the crash in the picture: the panic message. Only the backtrace is shown.
Do you have dumps configured? If so, you can find the text info in /var/crash/core.txt.$N by default, where N is the number of the last crash.

If you don't have it set, look at dumpdev in the /etc/rc.conf configuration.

Does it crash regularly? (A 12-day uptime doesn't fit the "crashes every two days" criterion, though.)
Is some heavy job scheduled to run during that period? You said no, but were you logged in and monitoring just before it crashed?

When it comes to FreeBSD, I'd push for version 10.2, as the guys are improving performance with every release.

As SirDice mentioned already, do check for HW issues, especially with non-ECC RAM. Running Memtest86+ for a few hours, etc., could reveal possible memory issues.
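
To pull just the panic string and the MCA records out of a crash summary, something like the following sketch may help. It assumes the relevant lines start with "panic:", "MCA:" or "cpuid = ", which may not hold for every kind of panic; panic_summary is a made-up name:

```sh
# Crash dumps are enabled with dumpdev="AUTO" in /etc/rc.conf (see rc.conf(5)).
#
# Print only the panic string, the MCA records and the reporting CPU
# from a crash summary. $1 is the path to a core.txt.N file,
# for example /var/crash/core.txt.0.
panic_summary() {
    grep -E '^(panic:|MCA:|cpuid = )' "$1"
}
```

Posting only this filtered output, rather than the whole core.txt, also avoids sharing more system detail than necessary.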
 

shahzaib

Member


Messages: 87

#18
Thanks for the detailed answer. Yes, dump is configured and I can find a big core.txt.0 text file. I don't know how to debug it in order to find the cause of the crash, so I am attaching it here.
 


_martin

Aspiring Daemon

Thanks: 115
Messages: 651

#19
I was looking for the panic string only. Information in the core.txt is confidential to some extent. Nowadays I'd be more paranoid than not.
Remove the attachment.

The interesting part is:

Code:
panic: Unrecoverable machine check exception

Unread portion of the kernel message buffer:
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 5
MCA: CPU 5 UNCOR PCC internal timer error
MCA: Address 0x802bf6e59
MCA: Misc 0x0
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 4
MCA: CPU 4 UNCOR PCC internal timer error
MCA: Address 0x802bf6e59
MCA: Misc 0x0
panic: Unrecoverable machine check exception
cpuid = 7
You can actually search the forums here for this MCA string. At first look I would focus on CPU and memory.

Now, is it a false alarm or is it really a problem with the HW? I don't know right now; I would need to google around too. Some searches suggest an issue with the firmware (BIOS) on the motherboard. I'd check that too (compare the firmware/BIOS version of the board to the vendor's latest update, etc.).
 

kpa

Beastie's Twin

Thanks: 1,673
Messages: 6,084

#20
My first assumption would be that there is really something wrong with the hardware. I would take the issue to the manufacturer of the motherboard and look at the documentation and their web support.
 

shahzaib

Member


Messages: 87

#21
Guys, the same server rebooted on its own again, and the zpool didn't even mount itself although it is enabled in rc.conf and loaded in loader.conf. Here is the panic log:

Code:
panic: Unrecoverable machine check exception

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 3
MCA: CPU 3 UNCOR PCC internal timer error
MCA: Address 0x802bf6a69
MCA: Misc 0x0
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 2
MCA: CPU 2 UNCOR PCC internal timer error
MCA: Address 0x802bf6a69
MCA: Misc 0x0
panic: Unrecoverable machine check exception
cpuid = 8
KDB: stack backtrace:
#0 0xffffffff80962d90 at kdb_backtrace+0x60
#1 0xffffffff80927eb5 at panic+0x155
#2 0xffffffff80e3bfeb at mca_intr+0x6b
#3 0xffffffff80d24c09 at trap+0x99
#4 0xffffffff80d0aec2 at calltrap+0x8
#5 0xffffffff80361eea at acpi_cpu_idle+0x13a
#6 0xffffffff80d0f89f at cpu_idle_acpi+0x3f
#7 0xffffffff80d0f940 at cpu_idle+0x90
#8 0xffffffff80953585 at sched_idletd+0x1d5
#9 0xffffffff808f88fa at fork_exit+0x9a
#10 0xffffffff80d0b3fe at fork_trampoline+0xe
----------------------

Where should I look? :( Some people are suggesting disabling the MCA panic by adding hw.mca.enabled=0 to /boot/loader.conf.

Please help :(
 

SirDice

Administrator
Staff member
Administrator
Moderator

Thanks: 5,508
Messages: 25,688

#22
Don't disable MCA, it's reporting hardware errors. You can use sysutils/mcelog to translate those MCA messages:
Code:
dice@test:~ % mcelog --ascii --no-syslog
mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 3
MCA: CPU 3 UNCOR PCC internal timer error
MCA: Address 0x802bf6a69
MCA: Misc 0x0
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 2
MCA: CPU 2 UNCOR PCC internal timer error
MCA: Address 0x802bf6a69
MCA: Misc 0x0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 BANK 5
MISC 0 ADDR 802bf6a69
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 3 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
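
The flags mcelog prints can be cross-checked by hand from the Status value. The sketch below assumes the IA32_MCi_STATUS layout from the Intel SDM (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC, bits 15..0 the MCA error code). decode_mci_status is a hypothetical helper, not part of mcelog, and it only looks at those fields:

```sh
# Decode the flag bits (63..57) and the MCA error code (bits 15..0) of an
# IA32_MCi_STATUS value given as 16 hex digits, with or without "0x".
decode_mci_status() {
    hex=${1#0x}
    if [ "${#hex}" -ne 16 ]; then
        echo "expected 16 hex digits" >&2
        return 1
    fi
    # Work on the top byte only, so no 64-bit shell arithmetic is needed.
    top=$(printf '%s' "$hex" | cut -c1-2)
    top=$((0x$top))
    code=$(printf '%s' "$hex" | cut -c13-16)
    # Print the description in $2 when the mask in $1 is set in the top byte.
    flag() { if [ $(( top & $1 )) -ne 0 ]; then echo "$2"; fi; }
    flag 128 "VAL   status register valid"
    flag 64  "OVER  an earlier error was overwritten"
    flag 32  "UC    uncorrected error"
    flag 16  "EN    error reporting enabled"
    flag 8   "MISCV MISC register valid"
    flag 4   "ADDRV ADDR register valid"
    flag 2   "PCC   processor context corrupt"
    echo "MCA error code 0x$code"
    case "$code" in
        0400) echo "simple error code: internal timer error" ;;
    esac
}
```

For 0xbe00000000800400 this lists VAL, UC, EN, MISCV, ADDRV and PCC with error code 0x0400, which agrees with the mcelog output above.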
 

shahzaib

Member


Messages: 87

#23
Thanks, here is the output :

Code:
[root@cw001 /var/crash]# mcelog --no-dmi --ascii --file core.txt.1 
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 BANK 5 
MISC 0 ADDR 802bf6a69 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 3 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 5 
MISC 0 ADDR 802bf6a69 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 BANK 5 
MISC 0 ADDR 802bf6a69 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 3 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 5 
MISC 0 ADDR 802bf6a69 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44
 

SirDice

Administrator
Staff member
Administrator
Moderator

Thanks: 5,508
Messages: 25,688

#24
You have hardware errors. No amount of fiddling with software settings is going to change that fact.