FreeBSD server crashing (crash dump)

hello
I have a FreeBSD 6.2-RELEASE-p8 server (http+mysql+pop3+imap+smtp) that has been crashing about twice a week for the past month or so
before that, it ran fine for at least 8 months without a single crash

I've already replaced the RAM and the PSU, but the problem keeps happening

I managed to capture "top" output less than 10 seconds before the server crashed (I had a script saving it periodically):
Code:
last pid: 61576;  load averages:  2.04,  2.38,  2.45  up 9+12:13:49    14:42:43
199 processes: 2 running, 196 sleeping, 1 zombie

Mem: 1245M Active, 347M Inact, 281M Wired, 100M Cache, 112M Buf, 31M Free
Swap: 2048M Total, 2096K Used, 2046M Free


  PID USERNAME   THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
16167 mysql       19  20    0   488M 85556K kserel 0  53.3H 27.88% mysqld
61398 apache       1   4    0   168M 33868K sbwait 0   0:01  5.89% httpd
92707 apache       1   4    0   169M 64084K sbwait 0   1:51  3.42% httpd

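A periodic snapshot logger of the kind mentioned above could look roughly like the sketch below. The original script was not posted, so everything here is an assumption: the function names, the log path, the 10-second interval, and the `top -b -d 1` invocation in the usage comment are all illustrative. The log is synced to disk after every snapshot so the last few captures have a chance of surviving a panic.

```python
import os
import subprocess
import time
from collections import deque


def take_snapshot(cmd):
    """Run cmd once and return its stdout, prefixed with a timestamp line."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    out = subprocess.run(cmd, capture_output=True, text=True, check=False).stdout
    return f"=== {stamp} ===\n{out}"


def log_snapshots(cmd, path, interval=10.0, keep=6, rounds=None):
    """Keep the last `keep` snapshots of `cmd` in `path`.

    rounds=None loops forever; a number stops after that many snapshots.
    """
    recent = deque(maxlen=keep)  # older snapshots fall off automatically
    done = 0
    while rounds is None or done < rounds:
        recent.append(take_snapshot(cmd))
        tmp = path + ".tmp"
        with open(tmp, "w") as fh:
            fh.write("\n".join(recent))
            fh.flush()
            os.fsync(fh.fileno())  # push the data to disk before a possible crash
        os.replace(tmp, path)  # swap in the new file in one step
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)


# Hypothetical usage (command and path are assumptions, adjust for your system):
# log_snapshots(["top", "-b", "-d", "1"], "/var/tmp/top-snapshots.log")
```

Keeping only a handful of recent snapshots keeps the file small, and the newest entry is the one closest to the crash.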
and here's the dump:

Code:
Unread portion of the kernel message buffer:
kernel trap 12 with interrupts disabled


Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 06
fault virtual address   = 0x104
fault code              = supervisor read, page not present
instruction pointer     = 0x20:0xc067a45d
stack pointer           = 0x28:0xe4f58c90
frame pointer           = 0x28:0xe4f58c9c
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = resume, IOPL = 0
current process         = 5 (thread taskq)
trap number             = 12
panic: page fault
cpuid = 2
Uptime: 9d12h14m29s
Physical memory: 2039 MB
Dumping 338 MB: 323 307 291 275 259 243 227 211 195 179 163 147 131 115 99 83 67 51 35 19 3

#0  doadump () at pcpu.h:165
165             __asm __volatile("movl %%fs:0,%0" : "=r" (td));
(kgdb) backtrace
#0  doadump () at pcpu.h:165
#1  0xc0683236 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:409
#2  0xc068355d in panic (fmt=0xc08d7a75 "%s") at /usr/src/sys/kern/kern_shutdown.c:565
#3  0xc0889c70 in trap_fatal (frame=0xe4f58c50, eva=260) at /usr/src/sys/i386/i386/trap.c:837
#4  0xc0889426 in trap (frame=
      {tf_fs = -968949752, tf_es = -967507928, tf_ds = -453705688, tf_edi = -968921088, tf_esi = 4, tf_ebp = -453669732, tf_isp = -453669764, 
tf_ebx = -960082340, tf_edx = 6, tf_ecx = 0, tf_eax = 1, tf_trapno = 12, tf_err = 0, tf_eip = -1066949539, tf_cs = 32, tf_eflags = 65538, 
tf_esp = -941363984, tf_ss = 4})
    at /usr/src/sys/i386/i386/trap.c:270
#5  0xc087604a in calltrap () at /usr/src/sys/i386/i386/exception.s:139
#6  0xc067a45d in _mtx_lock_sleep (m=0xc6c64e5c, tid=3326046208, opts=0, file=0x0, line=0) at /usr/src/sys/kern/kern_mutex.c:546
#7  0xc06c97c2 in unp_gc (arg=0x0, pending=1) at /usr/src/sys/kern/uipc_usrreq.c:1714
#8  0xc06a3edf in taskqueue_run (queue=0xc64c9080) at /usr/src/sys/kern/subr_taskqueue.c:257
#9  0xc06a43c2 in taskqueue_thread_loop (arg=0x1) at /usr/src/sys/kern/subr_taskqueue.c:376
#10 0xc066c979 in fork_exit (callout=0xc06a4330 <taskqueue_thread_loop>, arg=0xc09d7048, frame=0xe4f58d38) at /usr/src/sys/kern/kern_fork.c:821
#11 0xc08760ac in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:208

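The negative numbers in the trap frame are 32-bit register values that kgdb prints as signed integers. A small Python sketch, using only values copied from the dump above, converts them back to hex so they can be matched against the panic message. The helper names `u32` and `decode` are made up for illustration.

```python
def u32(value):
    """Reinterpret a signed 32-bit integer as unsigned."""
    return value & 0xFFFFFFFF


# Values copied from frame #3 (eva) and frame #4 (trap frame) of the backtrace.
frame = {
    "tf_eip": -1066949539,
    "tf_ebp": -453669732,
    "eva": 260,
}


def decode(values):
    """Return each value formatted as an 8-digit hex string."""
    return {name: f"0x{u32(v):08x}" for name, v in values.items()}


for name, text in decode(frame).items():
    print(f"{name:8s} {text}")
```

Here tf_eip decodes to 0xc067a45d, the reported instruction pointer inside _mtx_lock_sleep, tf_ebp to the reported frame pointer 0xe4f58c9c, and eva=260 is the fault virtual address 0x104. A fault address that small usually means a structure member was read through a NULL (or nearly NULL) pointer, which tends to point toward a kernel bug rather than failing hardware, though it does not rule hardware out.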
any idea whether I should try replacing some other hardware, or whether this may be a software/kernel problem?

the DC offered to swap my entire box, keeping only the HDDs
this server's uptime is very important to me, so I'm looking for the lowest-risk way to debug this

thanks
 
The dump appears to suggest it's related to xpt_thrd, which is part of CAM (SCSI, USB), xpt(4). Anything special going on in that area?
 
Code:
current process         = 5 (thread taskq)

Checked on three different systems here .. it's always:

Code:
   5  ??  DL     0:00.00 [xpt_thrd]

Etcetera ;)
 
Hmm.. strange, on my 7.1 release there is
5 ?? DL 0:00,00 [system_taskq]
And there are some lines about tasks/threads in the 1st post.. Could it be a task scheduler failure? I can't find anything about xpt or similar on my system...
Maybe try running the server with the services disabled one by one?
 
Ah, I should've looked beyond the first three systems ;)

On two others:

Code:
    5  ??  DL     0:00.00 [kqueue taskq]
Code:
    5  ??  DL     0:00.05 [thread taskq]
 
DutchDaemon said:
The dump appears to suggest it's related to xpt_thrd, which is part of CAM (SCSI, USB), xpt(4). Anything special going on in that area?

any ideas on how I could check that?
I see that /dev/xpt0 exists
and I also found this process:
Code:
root        21  0.0  0.0     0     8  ??  WL   Mon02PM   0:00.00 [swi2: cambio]


for taskq I have:
Code:
root         5  0.0  0.0     0     8  ??  DL   Mon02PM   0:55.67 [thread taskq]
root         9  0.0  0.0     0     8  ??  DL   Mon02PM   0:00.00 [kqueue taskq]
root        19  0.0  0.0     0     8  ??  WL   Mon02PM   0:00.01 [swi6: Giant taskq]

thanks
 
It's not xpt, my bad.

All I know about 'thread taskq' is in here: taskqueue(9). I don't know whether the panics are due to e.g. scheduling, locking, or threaded apps misbehaving. I guess a developer should look at this.
 
well
if it's really a kernel problem I guess I should try upgrading to 6.3 or 7 and report if it persists
maybe it's already fixed

thanks
 
Lem0nHead said:
well
if it's really a kernel problem I guess I should try upgrading to 6.3 or 7 and report if it persists
maybe it's already fixed

thanks

Look at your hardware again. I had a similar problem, and it turned out to be a bulging capacitor on the motherboard near the memory.
 
Lem0nHead said:
hello
I have a FreeBSD 6.2-RELEASE-p8 server (http+mysql+pop3+imap+smtp) that has been crashing about twice a week for the past month or so
before that, it ran fine for at least 8 months without a single crash

This looks similar to the issue mentioned in the 6.2 errata (according to your dump the box panics in unp_gc, so I just googled that). See:

http://people.freebsd.org/~bmah/relnotes/6.2-RELEASE/errata.pdf

There is a patch; you could try it.
 
thanks, that's good news since I already upgraded to 6.4
I most likely won't need to upgrade to 7 then, since it's probably fixed in 6.4 :)
 