Random crashes - reason unknown

Hello all FreeBSD fans,

I have a nasty problem: my server crashes at random intervals, and until today I didn't even have an error message that could point me to where the problem might be.

I use 6.2-RELEASE mainly as a mail server, running postfix with dovecot. During today's crash I managed to record the error message shown on the screen:
Sleeping thread (tid 100255 pid 67940) owns a non-sleepable lock
panic: sleeping thread
cpuid=0
Uptime 4d.....
Cannot dump. No dump device defined
Automatic reboot in 15 seconds - press a key on the console to abort


Fatal double fault:
eip=0xc0622198
esp=0xe797a000
ebp=0xe797a00c
cpuid=1; apicid=01

rebooting...
cpu_reset: Stopping, other CPUs

I'm not sure what this message means, but to me it looks like a hardware problem. Another possibility is that the crash is caused by some faulty process, but how could a non-root process crash the whole server and OS? As far as I know, FreeBSD is one of the most stable and reliable operating systems in this respect.

Where could the problem be: the hardware, the OS, or the installed software?

Any help is appreciated, thanks in advance!
 
@trev, I'll read this document to see if it is useful and after that will report back.

@sniper007, I've started memtest and it displays a lot of loops like this:
Loop 17:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok

Every loop looks identical to the others. How many loops does it have to pass? Do I have to press Ctrl+C to stop memtest, or will it stop by itself?
 
Hi guys,

I've read the troubleshooting page, but it didn't help me. It was about installation issues, whereas my system is installed and working. The problem is that it restarts spontaneously from time to time and I can't figure out the reason. The last time, I managed to write down this error message by hand on paper before the system started loading and the message disappeared.

Any other ideas or suggestions?
 
You need a swap partition, so that the kernel can create a dump. From the dump you can create a backtrace with kgdb.
 
OK, I've prepared my system for dumps. Added this to rc.conf:
dumpdev="AUTO" # Device to crashdump to (device name, AUTO, or NO).
dumpdir="/var/crash" # Directory where crash dumps are to be stored
savecore_flags="" # Used if dumpdev is enabled above, and present.

Now I'll wait for the next crash, but can you tell me what to do with the contents of /var/crash after that?
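
A quick sanity check of the settings above can be sketched in sh. The helper names dumpdev_setting and dumpdev_enabled are made up for this example and are not FreeBSD tools. The sketch only parses an rc.conf-style file and assumes the last dumpdev= line wins, as with ordinary shell variable assignment:

```shell
#!/bin/sh
# Hypothetical helpers (not part of FreeBSD): inspect the dumpdev
# setting in an rc.conf-style file without touching the system.

# Print the value of the last uncommented dumpdev= line, e.g. AUTO.
dumpdev_setting() {
    conf="${1:-/etc/rc.conf}"
    line=$(grep -E '^[[:space:]]*dumpdev=' "$conf" | tail -n 1)
    value=${line#*=}      # drop everything through the first '='
    value=${value%%#*}    # drop a trailing comment, if any
    # Strip quotes and whitespace around the value.
    printf '%s\n' "$value" | tr -d "\"' \t"
}

# Succeed only if dumpdev is set to something other than NO.
dumpdev_enabled() {
    value=$(dumpdev_setting "$1")
    case "$value" in
        ""|[Nn][Oo]) return 1 ;;
        *)           return 0 ;;
    esac
}

# Example use (assumes the default /etc/rc.conf location):
#   dumpdev_enabled /etc/rc.conf && echo "crash dumps are configured"
```

On the real system, swapinfo(8) lists the configured swap devices, and after a panic savecore(8) should leave files such as vmcore.0 in dumpdir at the next boot.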
 
Not from the CD. I have it in /usr/local/bin/memtest, but I can't remember whether I installed it or it came by default.

Which CD do you mean?
 
Yes, I've found this memtest86 in the ports and now have to wait for the next crash to test it.

In the meantime, I've replaced the old memory banks with new ones, so if the problem was the memory, it has been solved radically.

I'll report back when I have more info.

Thanks a lot for your help ;)
 
Today my system crashed again with the new memory banks.

After reading this page http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-gdb.html, all I could do was run:
cd /usr/obj/usr/src/sys/MYKERNEL/
kgdb kernel.debug /var/crash/vmcore.0

kgdb output was:
hostname# kgdb kernel.debug /var/crash/vmcore.0
[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd".

Unread portion of the kernel message buffer:

trap number = 12
panic: page fault
cpuid = 1
Uptime: 6d13h21m25s
Dumping 3774 MB (2 chunks)
chunk 0: 1MB (151 pages) ... ok
chunk 1: 3774MB (966112 pages) 3758 3742 3726 3710 3694 3678 3662 3646 3630 3614 3598 3582 3566 3550 3534 3518 3502 3486 3470 3454 3438 3422 3406 3390 3374 3358 3342 3326 3310 3294 3278 3262 3246 3230 3214 3198 3182 3166 3150 3134 3118 3102 3086 3070 3054 3038 3022 3006 2990 2974 2958 2942 2926 2910 2894 2878 2862 2846 2830 2814 2798 2782 2766 2750 2734 2718 2702 2686 2670 2654 2638 2622 2606 2590 2574 2558 2542 2526 2510 2494 2478 2462 2446 2430 2414 2398 2382 2366 2350 2334 2318 2302 2286 2270 2254 2238 2222 2206 2190 2174 2158 2142 2126 2110 2094 2078 2062 2046 2030 2014 1998 1982 1966 1950 1934 1918 1902 1886 1870 1854 1838 1822 1806 1790 1774 1758 1742 1726 1710 1694 1678 1662 1646 1630 1614 1598 1582 1566 1550 1534 1518 1502 1486 1470 1454 1438 1422 1406 1390 1374 1358 1342 1326 1310 1294 1278 1262 1246 1230 1214 1198 1182 1166 1150 1134 1118 1102 1086 1070 1054 1038 1022 1006 990 974 958 942 926 910 894 878 862 846 830 814 798 782 766 750 734 718 702 686 670 654 638 622 606 590 574 558 542 526 510 494 478 462 446 430 414 398 382 366 350 334 318 302 286 270 254 238 222 206 190 174 158 142 126 110 94 78 62 46 30 14

#0 doadump () at pcpu.h:165
165 __asm __volatile("movl %%fs:0,%0" : "=r" (td));
(kgdb) quit

And here I need your help, the last quote means nothing to me :(

What should I do?
 
Please post the output of the [cmd=(kgdb)]bt[/cmd] command, entered at the prompt.
 
Here is the output you requested:
(kgdb) bt
#0 doadump () at pcpu.h:165
#1 0xc062aec6 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:409
#2 0xc062b1ed in panic (fmt=0xc08dc012 "%s") at /usr/src/sys/kern/kern_shutdown.c:565
#3 0xc0808be0 in trap_fatal (frame=0xe93f2c50, eva=260) at /usr/src/sys/i386/i386/trap.c:837
#4 0xc0808396 in trap (frame=
{tf_fs = -923860984, tf_es = -881131480, tf_ds = -381747160, tf_edi = -923831552, tf_esi = 4, tf_ebp = -381735780, tf_isp = -381735812, tf_ebx = -910496164, tf_edx = 6, tf_ecx = 0, tf_eax = 1, tf_trapno = 12, tf_err = 0, tf_eip = -1067310867, tf_cs = 32, tf_eflags = 65538, tf_esp = -919156552, tf_ss = 4})
at /usr/src/sys/i386/i386/trap.c:270
#5 0xc07f565a in calltrap () at /usr/src/sys/i386/i386/exception.s:139
#6 0xc06220ed in _mtx_lock_sleep (m=0xc9baee5c, tid=3371135744, opts=0, file=0x0, line=0) at /usr/src/sys/kern/kern_mutex.c:546
#7 0xc0671352 in unp_gc (arg=0x0, pending=2) at /usr/src/sys/kern/uipc_usrreq.c:1714
#8 0xc064bab7 in taskqueue_run (queue=0xc8e77200) at /usr/src/sys/kern/subr_taskqueue.c:257
#9 0xc064bf9a in taskqueue_thread_loop (arg=0x1) at /usr/src/sys/kern/subr_taskqueue.c:376
#10 0xc0614609 in fork_exit (callout=0xc064bf08 <taskqueue_thread_loop>, arg=0xc09ba248, frame=0xe93f2d38) at /usr/src/sys/kern/kern_fork.c:821
#11 0xc07f56bc in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:208
(kgdb)
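
To make a long backtrace like the one above easier to compare across crashes, it can be reduced to one "frame, function, location" line per frame. This is only a sketch: bt_summary is a hypothetical helper, and it assumes the bt output was saved to a text file by hand. Frames whose location is printed on a later line (such as #4 here) are listed without a location.

```shell
#!/bin/sh
# Hypothetical helper: condense saved kgdb "bt" output to
# "#N function file:line" lines. Input is a plain text file.
bt_summary() {
    awk '
        /^#[0-9]+ / {
            # With an address the layout is "#N 0xADDR in name (...)",
            # without one it is "#N name (...)".
            name = ($3 == "in") ? $4 : $2
            loc = ""
            # The source location, if present, follows a final " at ".
            if (match($0, / at [^ ]+$/))
                loc = substr($0, RSTART + 4)
            if (loc != "")
                print $1, name, loc
            else
                print $1, name
        }
    ' "$1"
}

# Example use (bt.txt is an assumed file name):
#   bt_summary bt.txt
```

In the trace above, the frames just below calltrap are the interesting ones: the page fault happened in _mtx_lock_sleep, called from unp_gc in uipc_usrreq.c.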
 
Hi Yavor:

I'm sorry to hear about the problems you have been experiencing. A number of bugs in the UNIX domain socket code were fixed between FreeBSD 6.2 and 6.4, with at least a couple that might include symptoms such as those you are experiencing. Is it possible for you to upgrade to a more recent 6.x (ideally 6.4) to see if that corrects the problem? If you can't do a full system upgrade, it should be possible to upgrade the kernel and modules alone; however, I think moving it entirely forward to 6.4 would be preferable.

Thanks
 
Thanks for the answer rwatson,

I had been thinking about an upgrade too, but I've never done one before, and since this system is serving 1500 email clients I'm a little concerned about the upgrade process and whether it is going to go smoothly.

I have to read more about the upgrade details, and any advice is welcome.

I'll report back when I have more info.
 
Well, if you have a spare machine, try to make a clone of the "production" box and update the test machine first to see what the upgrade process looks like. After all, 6.2->6.4 shouldn't be painful. If you have any specific questions, just ask.
 
Unfortunately, I don't have a spare machine.. :(

What are my upgrade options? Can I upgrade from 6.2 directly to 6.4, or do I have to go to 6.3 first and then to 6.4?

Also, if the upgrade fails for some reason, is there a way to roll back?

Thanks a lot for your help.
 
The easiest way for you (if you run a GENERIC kernel) might be to use the freebsd-update(8) tool. However, the version shipped with FreeBSD 6.2 does not support the [cmd=]upgrade[/cmd] option, so you will have to obtain a newer version. Please see http://www.freebsd.org/releases/6.3R/announce.html and follow the instructions described there. FYI, the freebsd-update(8) utility supports the [cmd=]rollback[/cmd] option.

Also, you may want to go through http://www.freebsd.org/doc/en/books/handbook/updating-upgrading-freebsdupdate.html.

Speaking for myself, I have used the freebsd-update(8) tool only once; I prefer doing source upgrades. Also, please see Robert's reply; he has a good point.
 
Hi Yavor:

It should be possible to update straight to FreeBSD 6.4 without going via 6.3. If you do a source update, you should be able to try out a 6.4 kernel+modules without upgrading userspace or applications, which is easy to back out. Make sure you do a "cp -r /boot/kernel /boot/kernel.backup" to keep a copy of the 6.2 kernel around in case you decide to roll back. Most problems with upgrades will occur as a result of kernel changes, perhaps due to a change in device driver support, so this is a good way to do upgrades generally. FWIW, upgrades within a major release are generally low-risk and straightforward; it's major version upgrades that tend to be trickier.
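
The kernel-only route described above could look roughly like the following sh sketch. It assumes /usr/src already holds the 6.4 sources, and it uses the MYKERNEL config name mentioned earlier in the thread. The run and upgrade_kernel helpers are made up for illustration; setting DRY_RUN only prints the commands, so the sequence can be reviewed before anything is changed.

```shell
#!/bin/sh
# Hypothetical helpers for a kernel+modules-only source upgrade.
# Assumption: the source tree in /usr/src has already been updated
# to the target release.

# Run a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

upgrade_kernel() {
    kernconf="${1:-GENERIC}"   # kernel config name, e.g. MYKERNEL
    srcdir="${2:-/usr/src}"    # location of the updated sources

    # Keep a copy of the running kernel so it can be restored later.
    run cp -r /boot/kernel /boot/kernel.backup || return 1
    run cd "$srcdir" || return 1
    # Build and install the kernel and its modules only.
    run make buildkernel KERNCONF="$kernconf" || return 1
    run make installkernel KERNCONF="$kernconf" || return 1
}

# Example use, printing the plan without executing it:
#   DRY_RUN=1; upgrade_kernel MYKERNEL
```

Userspace and ports are left untouched by this sequence, which is what makes it easy to back out: if the new kernel misbehaves, the saved copy in /boot/kernel.backup can be booted or copied back.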
 
OK... and to make things more complicated, my kernel is customized.

What is the difference with the GENERIC?
 
Sorry, I meant to say: what is the difference between upgrading a custom kernel and a generic one?

Something is not very clear to me. When I upgrade from 6.2 to 6.4, do I have to:
1. Switch my kernel to generic
2. Upgrade the generic kernel + kernel modules, libraries etc..
3. portupgrade -a to rebuild the installed software to work with the new libraries from point 2

Am I on the right track?
 