System randomly panics

jurrie · Nov 23, 2008

Hi,

I recently installed FreeBSD 7 on my home file+web server. However, it randomly panics and reboots and I have no idea how to resolve the issue.

In my /var/crash, dump info is stored after each panic. Is it possible to use this to determine the actual problem? Each info.* file states the panic is due to a "Page fault".

If needed, I could upload one of the dump files, but they easily exceed 100 MB even when gzipped. I don't have full details on the system at this point (I'd need to open it up), but it's an AM2 Sempron system running 64-bit FreeBSD.

Any pointers are appreciated.

Speedy · Nov 23, 2008

Most likely bad RAM. If possible, try removing RAM sticks one at a time and see if the problem persists.

ale · Nov 23, 2008

http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html

anomie · Nov 23, 2008

jurrie said:
Each info.* file states the panic is due to a "Page fault".

Test your RAM - memtest86+

For a quicker method of identifying the problem (or eliminating RAM as a cause), swap it with new RAM.

Barnon · Nov 24, 2008

random panics

You should test ALL of the hardware, especially the integrity of
your disk drives. Download and burn the appropriate disk test and
recertification utility from your disk drive manufacturer, IBM,
Seagate, Maxtor, etc. I can not stress enough that people need
to periodically test their drives for growing errors. These
errors can cause major problems and are usually disk area
dependent.

Mel_Flynn · Nov 24, 2008

Please, people. Page faults are not a synonym for bad hardware, and random is only random until the relation is found.

FYI, the most common page fault currently is yanking out a mounted USB device. One can only start guessing with a trace.

jurrie · Nov 25, 2008

I did test my hard drives not too long ago (the server had been running Gentoo Linux for quite some time before I tried my hand at BSD). I did switch the memory when installing, so I will run a memtest. If that doesn't give me any answers, I'll put some of my desktop RAM into the server and see if it runs without sudden reboots.

If none of the above helps, I may have to dig into tracing or whatever is needed. Time is not my friend though, so I may end up trying to live with the problem for a while or giving up altogether. I'll report if I find anything. Thanks all for replying so far.

ale · Nov 25, 2008

Did you have a look at the link I've posted?

Mel_Flynn · Nov 26, 2008

jurrie said:
Time is not my friend though, so I may end up trying to live with the problem for a while or giving up altogether.

If time is not your friend, investigating wild guesses is counterproductive. Memory errors are much more likely to manifest as segmentation faults in userland than as kernel panics.
Simply run the trace; there is no need to interpret it. Just save the output to a file using script(1) and post it here.

jurrie · Nov 26, 2008

If by trace you mean section "10.2 Debugging a Kernel Crash Dump with kgdb" in the link ale posted, that was next on my list. I ran a memtest86+ yesterday which said my memory is fine.

I'll run the kgdb today when I get home from work, which should be some 7 hours from now.

@ale: yes, I did read your link, thanks

Mel_Flynn · Nov 26, 2008

Yes, since you already have a dump, you can simply run kgdb /var/crash/vmcore.0. Unless you commented out the debug option in your kernel config, the output should provide hints.
You can tell if your kernel was built with debug (it's the default) using:

Code:

# ls -l /boot/kernel/kernel*
-r-xr-xr-x  1 root  wheel   9851499 Nov 25 02:32 /boot/kernel/kernel
-r-xr-xr-x  1 root  wheel  29488610 Nov 25 02:32 /boot/kernel/kernel.symbols

jurrie · Nov 26, 2008

While I did edit my Linux kernel configs, I have not touched FreeBSD's, nor do I intend to do stuff like that again (unless it means getting something to work which wouldn't work otherwise). I recall confirming my kernel has debugging enabled, so I should be able to post a trace a few hours from now.

Thanks for the confirmation.

jurrie · Nov 26, 2008

Okay I did the trace using

Code:

kgdb kernel.symbols /var/crash/vmcore.8

The first output for all vmcore files is similar to

Code:

Unread portion of the kernel message buffer:


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x8
fault code              = supervisor read data, page not present
instruction pointer     = 0x8:0xffffffff80722663
stack pointer           = 0x10:0xffffffffae6d78b0
frame pointer           = 0x10:0xffffff0001d9e778
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1247 (sh)
trap number             = 12
panic: page fault
cpuid = 0
Uptime: 3h3m27s
Physical memory: 2035 MB
Dumping 234 MB: 219 203 187 171 155 139 123 107 91 75 59 43 27 11

#0  doadump () at pcpu.h:194
194     pcpu.h: No such file or directory.
        in pcpu.h

I searched Google for the "pcpu.h: No such file or directory" line and saw posts of people typing "bt" at the kgdb prompt, so here's the output from that:

Code:

#0  doadump () at pcpu.h:194
#1  0x0000000000000004 in ?? ()
#2  0xffffffff804776c9 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:409
#3  0xffffffff80477acd in panic (fmt=0x104 <Address 0x104 out of bounds>) at /usr/src/sys/kern/kern_shutdown.c:563
#4  0xffffffff8072edd4 in trap_fatal (frame=0xffffff0001a8d000, eva=18446742974228633808) at /usr/src/sys/amd64/amd64/trap.c:724
#5  0xffffffff8072f1a5 in trap_pfault (frame=0xffffffffae6d7800, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:641
#6  0xffffffff8072fae8 in trap (frame=0xffffffffae6d7800) at /usr/src/sys/amd64/amd64/trap.c:410
#7  0xffffffff8071575e in calltrap () at /usr/src/sys/amd64/amd64/exception.S:169
#8  0xffffffff80722663 in pmap_remove_entry (pmap=0xffffff0001d9e778, m=0xffffff007f2e6030, va=34368897024)
    at /usr/src/sys/amd64/amd64/pmap.c:1833
#9  0xffffffff8072706d in pmap_enter (pmap=0xffffff0001d9e778, va=34368897024, m=0xffffff007cbaccd0, prot=3 '\003', wired=0)
    at /usr/src/sys/amd64/amd64/pmap.c:2342
#10 0xffffffff80680856 in vm_fault (map=0xffffff0001d9e680, vaddr=34368897024, fault_type=2 '\002', fault_flags=Variable "fault_flags" is not available.
)
    at /usr/src/sys/vm/vm_fault.c:882
#11 0xffffffff8072f03f in trap_pfault (frame=0xffffffffae6d7c70, usermode=1) at /usr/src/sys/amd64/amd64/trap.c:618
#12 0xffffffff8072fc88 in trap (frame=0xffffffffae6d7c70) at /usr/src/sys/amd64/amd64/trap.c:309
#13 0xffffffff8071575e in calltrap () at /usr/src/sys/amd64/amd64/exception.S:169
#14 0x000000080051f7e9 in ?? ()
Previous frame inner to this frame (corrupt stack?)

All other vmcore.* files have a similar trace. Is this the info people need to determine the flaw, or should I do more?

Mel_Flynn · Nov 26, 2008

The useful info starts at frame 14, which seems to be corrupted.

Please type:

Code:

list *0xffffffff80722663

Also, you have sources in /usr/src, correct?

And finally, load /boot/kernel/kernel.symbols (though I figured this should happen automatically).

jurrie · Nov 26, 2008

The output of the list command:

Code:

(kgdb) list *0xffffffff80722663
0xffffffff80722663 is in pmap_remove_entry (/usr/src/sys/amd64/amd64/pmap.c:1837).
1832            TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) {
1833                    if (pmap == PV_PMAP(pv) && va == pv->pv_va)
1834                            break;
1835            }
1836            KASSERT(pv != NULL, ("pmap_remove_entry: pv not found"));
1837            TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
1838            m->md.pv_list_count--;
1839            if (TAILQ_EMPTY(&m->md.pv_list))
1840                    vm_page_flag_clear(m, PG_WRITEABLE);
1841            free_pv_entry(pmap, pv);

I do have /usr/src, yes. I don't know what you mean by loading kernel.symbols. If you mean in relation to kgdb, I had to do that when I launched it, along with the reference to the vmcore.* file. See the first code snippet of my previous post.

Mel_Flynn · Nov 26, 2008

Doesn't make sense. If there's no backtrace that goes beyond calltrap(), then either the stack got corrupted, there's some bug in kgdb, or the symbols aren't being loaded properly.

Code:

cd /usr/obj/usr/src/sys/GENERIC
kgdb kernel.debug /var/crash/vmcore.0
# In kgdb:
bt

If the above directory does not exist, you will have to do a "make buildkernel". Doing "make installkernel" afterwards doesn't hurt either.

If there's still nothing after calltrap(), it's possible this can only be debugged using DDB.

With memory being fine, the likely hardware cause is power supply, but you'd see that on other operating systems too.

jurrie · Nov 26, 2008

Alas, after doing all the above (I did not have the /usr/obj/usr/src/sys/GENERIC directory), the output remains the same. The power supply isn't one I would boast about, but I'm not too fond of randomly replacing hardware to try to eliminate a problem.

Should I redo the "make buildkernel" after adding "options KDB" and "options DDB" to the /usr/src/sys/amd64/conf/GENERIC file? Neither is present in there at the moment.

I wish I could try more now, but I have to go. I hope you can answer the above question. Thank you very much for your time.

Oh, and I had three dumps displaying some more info, maybe it's a hint.

Code:

#0  doadump () at pcpu.h:194
#1  0x0000000000000004 in ?? ()
#2  0xffffffff804776c9 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:409
#3  0xffffffff80477acd in panic (fmt=0x104 <Address 0x104 out of bounds>) at /usr/src/sys/kern/kern_shutdown.c:563
#4  0xffffffff80680f63 in vm_fault (map=0xffffff0001000000, vaddr=18446744072219361280, fault_type=1 '\001', fault_flags=0)
    at /usr/src/sys/vm/vm_fault.c:275
#5  0xffffffff8072f1bf in trap_pfault (frame=0xffffffffae60c590, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:630
#6  0xffffffff8072fae8 in trap (frame=0xffffffffae60c590) at /usr/src/sys/amd64/amd64/trap.c:410
#7  0xffffffff8071575e in calltrap () at /usr/src/sys/amd64/amd64/exception.S:169
#8  0xffffffffae5e0fda in ?? ()
#9  0xffffffff9a0c72b8 in ?? ()
#10 0xffffff004b1d5354 in ?? ()
#11 0xffffff004b1d5358 in ?? ()
#12 0x0000000000017ff8 in ?? ()
#13 0xffffff0001424000 in ?? ()
#14 0xffffffff9a0c7220 in ?? ()
#15 0xffffff0001387800 in ?? ()
#16 0xffffffff9a7b2120 in ?? ()
#17 0xffffffffa5713040 in ?? ()
#18 0xffffff0001388c00 in ?? ()
#19 0x0000000000001000 in ?? ()
#20 0x0000000000000000 in ?? ()
#21 0x0000000000000000 in ?? ()
#22 0x0000000000000000 in ?? ()
#23 0x0000000000000000 in ?? ()
#24 0xffffff004b1cd7c0 in ?? ()
#25 0xffffff004b1cd7c0 in ?? ()
#26 0xffffffff9a69c6a0 in ?? ()
#27 0xffffffff9a0c7220 in ?? ()
#28 0xffffff004b1d5300 in ?? ()
#29 0x0000000000017ff8 in ?? ()
#30 0xffffff0001387800 in ?? ()
#31 0x0000000000001000 in ?? ()
#32 0xffffffffae60c80c in ?? ()
#33 0xffffffffae60c760 in ?? ()
#34 0xffffffffae5de478 in ?? ()
#35 0xffffffffae60c840 in ?? ()
#36 0xffffff0001a58500 in ?? ()
#37 0x0000e40c0000e40c in ?? ()
#38 0x0000000000000000 in ?? ()
#39 0x0000000000000000 in ?? ()
#40 0x000000000000e40c in ?? ()
#41 0xffffff004b1d5300 in ?? ()
#42 0xffffffffae60c7bc in ?? ()
#43 0xffffffffae60c840 in ?? ()
#44 0xffffffffae5dea48 in ?? ()
#45 0x0000007f00000001 in ?? ()
#46 0x0000000100000000 in ?? ()
#47 0xffffffffae60c8b0 in ?? ()
#48 0xffffff0001a58500 in ?? ()
#49 0xffffff0001387800 in ?? ()
#50 0xffffff004b1cd7c0 in ?? ()
#51 0xffffffff9c1bf000 in ?? ()
#52 0x0000000200017ff8 in ?? ()
#53 0x00000001fffffbf3 in ?? ()
#54 0xfffffbf300000000 in ?? ()
#55 0x0000000000000038 in ?? ()
#56 0x00000000ffff1bf4 in ?? ()
#57 0x0000000000000000 in ?? ()
#58 0x0000000000000050 in ?? ()
#59 0x0000000000202122 in ?? ()
#60 0xffffffff8047e969 in uiomove (cp=0x2, n=32785, uio=0xffffffffae60c710) at /usr/src/sys/kern/kern_subr.c:170
#61 0xffffffffae5e6459 in ?? ()
#62 0x000000000000e410 in ?? ()
#63 0x000000004b1cf1f0 in ?? ()
#64 0xffffffffae60ca10 in ?? ()
#65 0xffffff004b1cd7c0 in ?? ()
#66 0xffffffffae60cb00 in ?? ()
#67 0xffffff004b1d5300 in ?? ()
#68 0xffffff0001387800 in ?? ()
#69 0x000000010000007f in ?? ()
#70 0x00010000007f0001 in ?? ()
#71 0x000000000e400000 in ?? ()
#72 0x000000004b1cf340 in ?? ()
#73 0x000000000000e40f in ?? ()
#74 0x0000000000000000 in ?? ()
#75 0xffffffffae5e95c0 in ?? ()
#76 0x0000000000000000 in ?? ()
#77 0x0000000000000000 in ?? ()
#78 0xffffffffae60cb00 in ?? ()
#79 0xffffff00014e2340 in ?? ()
#80 0xffffffffae60ca10 in ?? ()
#81 0xffffffff80772175 in VOP_WRITE_APV (vop=0xe40c, a=0x1000) at vnode_if.c:691
#82 0xffffffff804ffbd1 in vn_write (fp=0xffffff0001a374b0, uio=0xc, active_cred=Variable "active_cred" is not available.
) at vnode_if.h:373
#83 0xffffffff804abe68 in dofilewrite (td=0xffffff00014e2340, fd=4, fp=0xffffff0001a374b0, auio=0xffffffffae60cb00, offset=Variable "offset" is not available.
)
    at file.h:254
#84 0xffffffff804ac16e in kern_writev (td=0xffffff00014e2340, fd=4, auio=0xffffffffae60cb00)
    at /usr/src/sys/kern/sys_generic.c:401
#85 0xffffffff804ac1ec in write (td=0xffffffffa72d7000, uap=0x8000) at /usr/src/sys/kern/sys_generic.c:317
#86 0xffffffff8072f427 in syscall (frame=0xffffffffae60cc70) at /usr/src/sys/amd64/amd64/trap.c:852
#87 0xffffffff8071596b in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:290
#88 0x0000000800713d3c in ?? ()
Previous frame inner to this frame (corrupt stack?)

Mel_Flynn · Nov 26, 2008

Hmm, questionable. This points to file system corruption, which could be either a result of the frequent panics or their cause.

Reboot in single user and run fsck -y. Let it finish.

Could you post /boot/loader.conf, if present, and a dmesg (as a txt attachment would be best)?
It's my pumpkin time, but maybe someone else can help you in the meantime.

jurrie · Nov 27, 2008

I was looking at my blog posts around the date of the dump I posted last. I was still running some ext3 disks back then and concluded ext3 was the source of the crashes. They did occur far less frequently after I reformatted the disks to UFS.

So your conclusion about fs corruption is probably correct. Since reformatting to UFS, these extra lines have disappeared from the dumps, though. So I guess they aren't relevant anymore.

I left the computer on today so I can log on via SSH. I attached the dmesg to this post. My /boot/loader.conf is empty, so there's not much to post concerning that.

Mel_Flynn · Nov 27, 2008

There's a red flag in your dmesg regarding the High Precision Event Timer, but it may be a red herring. Your disks are all working correctly, so the only thing I can think of (aside from random hardware) is that it's caused by a loadable module that's not being loaded at the time of debugging.

Can you provide kldstat output?

The debugging would then be:

Code:

kgdb /var/crash/vmcore.0
# in kgdb repeat for each module reported by kldstat
load /boot/kernel/name_of_module.symbols
# try the backtrace
bt
# go to frame 14, the one after calltrap(), if it's not listed as ??
frame 14
list

jurrie · Nov 27, 2008

After seeing the output in my terminal and assuming you are correct that it is either a loadable module or hardware, I'd have to conclude it is some hardware problem.

kldstat output:

Code:

[root@merry /home/jurrie]# kldstat
Id Refs Address            Size     Name
 1    1 0xffffffff80100000 ac6f08   kernel

which would result in a

Code:

(kgdb) load /boot/kernel/kernel.symbols

But this probably does not make sense, judging from kgdb's reply:

Code:

You can't do that when your target is `kernel'

I guess typing any further is pointless, but I still did:

Code:

#13 0xffffffff8071575e in calltrap () at /usr/src/sys/amd64/amd64/exception.S:169
#14 0x000000080051f7e9 in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb) frame 14
#14 0x000000080051f7e9 in ?? ()
(kgdb) list
169             call    trap
170             MEXITCOUNT
171             jmp     doreti                  /* Handle any pending ASTs */
172
173             /*
174              * alltraps_noen entry point.  Unlike alltraps above, we want to
175              * leave the interrupts disabled.  This corresponds to
176              * SDT_SYS386IGT on the i386 port.
177              */
178             SUPERALIGN_TEXT

Somehow I get the feeling this is not something one can solve remotely.

Mel_Flynn · Nov 27, 2008

I'm afraid not. The stack gets too corrupted. This is part of exception.S. Only DDB might shed some light on it.
To clarify, it panics because a memory page is being removed that never got added (or was already removed, but not properly). But who did this is what points to the culprit, and that bit gets lost in the crash dump.

jurrie · Nov 27, 2008

"Crap", is all I can say :-(

I read a bit about this DDB thing, but it looks way out of my league. I guess I have to settle for random reboots or replace the OS and reformat all data disks again some time. Ugh.

Thanks very much for helping out so far, I really appreciate it.

Mel_Flynn · Nov 27, 2008

On the plus side, if another OS works correctly, you can pretty much rule out hardware.

I would insert the "upgrade to latest snapshot version to see if it's fixed" customer service line, but without so much as a clue, that's not really helpful.

You could also file a PR with the information you provided here (kgdb traces, dmesg and brand/model of server) and maybe a link to the thread. It's possible a FreeBSD dev not on this forum has a similar setup and instantly recognizes the problem.

jurrie · Dec 1, 2008

Back again.

For the past few days I have left the server running. During lunch today at work I checked the uptime and noticed it had reset and was at 7 minutes, meaning the server had rebooted again. After checking the /var/crash directory, I did not see a new vmcore. Seems I was out of disk space (d'oh!). The most recent vmcore was from Nov 8th. So I cleaned up in the hope of it crashing again later on and seeing something different in the dumps, compared to the traces I posted earlier.

It just crashed on me again (while I was compiling something and burning a DVD) and I got a new vmcore. There is a difference! Though I do not know if it says anything. I really hope someone can make something out of this trace *stares at Mel_Flynn*

Code:

[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd".

Unread portion of the kernel message buffer:


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0xffff800004003908
fault code              = supervisor read data, page not present
instruction pointer     = 0x8:0xffffffff8072436c
stack pointer           = 0x10:0xffffffffae6d2a00
frame pointer           = 0x10:0x4000000000
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 5551 (sed)
trap number             = 12
panic: page fault
cpuid = 0
Uptime: 4h42m45s
Physical memory: 2035 MB
Dumping 311 MB: 296 280 264 248 232 216 200 184 168 152 136 120 104 88 72 56 40 24 8

#0  doadump () at pcpu.h:194
194             __asm __volatile("movq %%gs:0,%0" : "=r" (td));
(kgdb) bt
#0  doadump () at pcpu.h:194
#1  0x0000000000000004 in ?? ()
#2  0xffffffff804776c9 in boot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:409
#3  0xffffffff80477acd in panic (fmt=0x104 <Address 0x104 out of bounds>)
    at /usr/src/sys/kern/kern_shutdown.c:563
#4  0xffffffff8072edd4 in trap_fatal (frame=0xffffff00018bc9c0,
    eva=18446742974225594576) at /usr/src/sys/amd64/amd64/trap.c:724
#5  0xffffffff8072f1a5 in trap_pfault (frame=0xffffffffae6d2950, usermode=0)
    at /usr/src/sys/amd64/amd64/trap.c:641
#6  0xffffffff8072fae8 in trap (frame=0xffffffffae6d2950)
    at /usr/src/sys/amd64/amd64/trap.c:410
#7  0xffffffff8071575e in calltrap ()
    at /usr/src/sys/amd64/amd64/exception.S:169
#8  0xffffffff8072436c in pmap_remove_pages (pmap=0xffffff0001ab7778)
    at /usr/src/sys/amd64/amd64/pmap.c:388
#9  0xffffffff80688238 in vmspace_exit (td=0xffffff00018bc9c0)
    at /usr/src/sys/vm/vm_map.c:404
#10 0xffffffff804577ac in exit1 (td=0xffffff00018bc9c0, rv=0)
    at /usr/src/sys/kern/kern_exit.c:294
#11 0xffffffff80458b5e in sys_exit (td=Variable "td" is not available.
) at /usr/src/sys/kern/kern_exit.c:98
#12 0xffffffff8072f427 in syscall (frame=0xffffffffae6d2c70)
    at /usr/src/sys/amd64/amd64/trap.c:852
#13 0xffffffff8071596b in Xfast_syscall ()
    at /usr/src/sys/amd64/amd64/exception.S:290
#14 0x00000008006a8b3c in ?? ()
Previous frame inner to this frame (corrupt stack?)

I have followed earlier instructions again, going to the line after calltrap():

Code:

(kgdb) frame 8
#8  0xffffffff8072436c in pmap_remove_pages (pmap=0xffffff0001ab7778)
    at /usr/src/sys/amd64/amd64/pmap.c:388
388             return (PTmap + ((va >> PAGE_SHIFT) & mask));
(kgdb) list
383     PMAP_INLINE pt_entry_t *
384     vtopte(vm_offset_t va)
385     {
386             u_int64_t mask = ((1ul << (NPTEPGSHIFT + NPDEPGSHIFT + NPDPEPGSHIFT + NPML4EPGSHIFT)) - 1);
387
388             return (PTmap + ((va >> PAGE_SHIFT) & mask));
389     }
390
391     static __inline pd_entry_t *
392     vtopde(vm_offset_t va)

*crosses fingers*
