Solved FreeBSD 11.1 has begun to crash accidentally - how to debug?

Petr Fischer · Jan 12, 2018

Snurg said:
To find instabilities, there is a better method: make buildkernel+buildworld.

Yes, and after this test I will have my own _debug_ kernel.

I will try it, good test.

Just to be sure: can I debug kernel panics with the stock official kernel? Can I just install the kernel debug symbols?

max21 · Jan 12, 2018

ZFS or not, I may never use it ... if you haven't already, search for this:

Code:

 0  0xffffffff80a6b98a in doadump

You might be lucky because you only need to understand one line. I remember Windows crash dumps having hundreds of lines like this, which is why I never bothered to learn debugging, but now I will check it out since FreeBSD made it so simple and you found it. Success or not, good job!

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=223699

max21 · Jan 12, 2018

Petr Fischer said:
Can I just install kernel debug symbols?

I'm no jack of all UNIX like you, but FreeBSD comes with everything out of the box, even if in bits and pieces. So I guess it's safe to say sure you can, and if not, just install the whole thing... why not, if you want to learn debugging anyway? It's not that big, and you don't have to enable everything, I bet!

Pay attention to the install screen and enable everything you need, then after the install disable or remove what you don't. For anything extra, do a search and read about it, and if the FreeBSD docs don't cover it, go see what Linux has to say about it, then install the port.

Petr Fischer · Jan 13, 2018

I found a way to install the kernel debug symbols, installed them, and updated to the latest patch level (11.1-RELEASE-p4, same as my kernel). The actual output from kgdb is this:

Code:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x40
fault code      = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff8260f700
stack pointer           = 0x0:0xfffffe0466418580
frame pointer           = 0x0:0xfffffe04664185b0
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process     = 1895 (Compositor)
trap number     = 12
panic: page fault
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80aadac7 at kdb_backtrace+0x67
#1 0xffffffff80a6bba6 at vpanic+0x186
#2 0xffffffff80a6ba13 at panic+0x43
#3 0xffffffff80edf832 at trap_fatal+0x322
#4 0xffffffff80edf889 at trap_pfault+0x49
#5 0xffffffff80edf0c6 at trap+0x286
#6 0xffffffff80ec36d1 at calltrap+0x8
#7 0xffffffff824f4aff at i915_gem_evict_something+0x9f   <-----------
#8 0xffffffff824f0f5f at i915_gem_object_pin+0x37f
#9 0xffffffff824eeac9 at i915_gem_pager_populate+0x1a9
#10 0xffffffff80d57c99 at vm_fault_hold+0x1179
#11 0xffffffff80d56ad5 at vm_fault+0x75
#12 0xffffffff80edf927 at trap_pfault+0xe7
#13 0xffffffff80edf170 at trap+0x330
#14 0xffffffff80ec36d1 at calltrap+0x8
Uptime: 10m22s
Dumping 1205 out of 16271 MB:..2%..11%..22%..31%..42%..51%..62%..71%..81%..91%

<loading symbols - stripped>

#0  doadump (textdump=<value optimized out>) at pcpu.h:222
    in pcpu.h
222 pcpu.h: No such file or directory.
(kgdb)

It looks like the source of the panics is the i915 Intel HD Graphics driver.
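For anyone reading a similar crash log, here is a minimal sketch of how the marked frame can be picked out mechanically. It assumes the KDB frame format shown in the output above; the names `FRAME_RE`, `parse_frames`, and `first_frame_below_trap` are made up for this illustration and are not part of kgdb or any FreeBSD tool. The result is only the innermost frame KDB recorded below the trap handler, which is a starting point for a search, not proof of where the bug is.

```python
import re

# One KDB frame looks like:
#   #7 0xffffffff824f4aff at i915_gem_evict_something+0x9f
FRAME_RE = re.compile(r"#(\d+)\s+(0x[0-9a-fA-F]+)\s+at\s+(\w+)\+(0x[0-9a-fA-F]+)")


def parse_frames(text):
    """Return (index, address, symbol, offset) tuples for every KDB frame line."""
    return [(int(n), int(addr, 16), sym, int(off, 16))
            for n, addr, sym, off in FRAME_RE.findall(text)]


def first_frame_below_trap(frames):
    """Return the frame listed right after the first 'calltrap' entry, or None.

    Frames are listed innermost first, so everything up to and including
    calltrap is panic/trap handling. The next entry is the innermost frame
    KDB recorded for the code that was interrupted by the trap.
    """
    for i, (_, _, sym, _) in enumerate(frames):
        if sym == "calltrap" and i + 1 < len(frames):
            return frames[i + 1]
    return None


# A few lines copied from the backtrace above.
SAMPLE = """\
#5 0xffffffff80edf0c6 at trap+0x286
#6 0xffffffff80ec36d1 at calltrap+0x8
#7 0xffffffff824f4aff at i915_gem_evict_something+0x9f
#8 0xffffffff824f0f5f at i915_gem_object_pin+0x37f
"""

frame = first_frame_below_trap(parse_frames(SAMPLE))
print(frame[2])  # i915_gem_evict_something
```

The symbol name it prints is what you would then search for in the bug tracker or on the mailing lists.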

ralphbsz · Jan 14, 2018

That's going to be hard to debug. The page fault happens when trying to evict something (duh), which is an operation that's analogous to free'ing or delete'ing memory. That means that the data structure that records what memory is in use has become corrupted. The routine that page faults is most likely the innocent victim, and not incorrect. There could be many reasons for this corruption. The most likely is a software bug. The classic mistake here is that something in memory is free'ed twice; it helps to have parallelism and a broken locking design, and then two threads both try to free the same thing. The other nasty option is that the data structure has been overwritten by someone following a wild pointer. And finally, it could just be a (hardware) memory error.

To debug this, you probably want to add code to this routine that checks that the data structures it uses are valid and, if they aren't, prints them (so you can see what has happened to them). To do that, you need to understand the data structure, and have an accurate definition of what "valid" means. For someone not familiar with that code, this is a lot of work.

Hence, my recommendation: Search on the web if similar problems have been reported, and/or find the maintainer of that particular piece of code, and/or open a kernel bug report.
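To make the idea concrete, here is a toy model in Python, not the i915 code and not kernel C. `ToyPool` and its `check` method are invented for this sketch. It shows a double free quietly corrupting a free list, a validity check that reports the damage, and how the visible symptom shows up later in an unrelated place.

```python
class ToyPool:
    """Toy fixed-size block pool; a stand-in, not the i915 GEM allocator."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.free_list = list(range(nblocks))  # indices of free blocks

    def alloc(self):
        return self.free_list.pop()

    def free(self, block):
        # Deliberately unchecked: a double free goes unnoticed here.
        self.free_list.append(block)

    def check(self):
        """Return a list of invariant violations; empty means the pool looks valid."""
        problems = []
        seen = set()
        for block in self.free_list:
            if not 0 <= block < self.nblocks:
                problems.append(f"block {block} out of range")
            elif block in seen:
                problems.append(f"block {block} is on the free list twice")
            seen.add(block)
        return problems


pool = ToyPool(4)
a = pool.alloc()
pool.free(a)
pool.free(a)          # the bug: the same block is freed twice
print(pool.check())   # ['block 3 is on the free list twice']

b = pool.alloc()
c = pool.alloc()
print(b == c)         # True: two owners of one block
```

Whoever uses `b` or `c` next is the innocent victim. The hard part in the real driver is writing down what "valid" means for its structures, which is what `check` does trivially here.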

Petr Fischer · Jan 15, 2018

I sent a bug report.
With your help I learned to find something in kernel crashlogs. Thanks. For now, done.
