Solved Fatal trap 10: debug exception while in kernel mode

iulian · Jul 27, 2021

Any clue what produced this kind of crash?

SirDice · Jul 27, 2021

Maybe if you posted the whole thing we can try and figure out what's causing it.

iulian · Jul 27, 2021

SirDice said:
Maybe if you posted the whole thing we can try and figure out what's causing it.

It is a pretty big kld that I'm working on, and the crash happens while I step through the kld's code line by line (after hitting a breakpoint); if I execute the code without the debugger, there is no crash.

Avoiding the debugger is not an option!

Why does the crash happen only when I use kgdb?

SirDice · Jul 27, 2021

It's not the kernel module I was asking about but the information that's printed alongside the trap message.

iulian · Jul 27, 2021

Sorry for the misunderstanding, here it is:

Code:

Unread portion of the kernel message buffer:


Fatal trap 10: debug exception while in kernel mode
cpuid = 1; apic id = 01
instruction pointer     = 0x20:0xffffffff82f19f63
stack pointer           = 0x28:0xfffffe00005b5570
frame pointer           = 0x28:0xfffffe00005b55a0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = trace trap, interrupt enabled, IOPL = 0
current process         = 897 (nltest)
trap number             = 10
panic: debug exception
cpuid = 1
time = 1627400554
KDB: stack backtrace:
Uptime: 6m36s
Dumping 164 out of 978 MB:..10%..20%..30%..39%..49%..59%..68%..78%..88%..98%

0xffffffff81125876 in doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:370
370             savectx(&dumppcb);

Code:

>>> bt
#0  0xffffffff81125876 in doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:370
#1  0xffffffff81125433 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:451
#2  0xffffffff81125e3c in vpanic (fmt=0xffffffff81b9fb56 "%s", ap=0xfffffe0000bebb40) at /usr/src/sys/kern/kern_shutdown.c:880
#3  0xffffffff81125b80 in panic (fmt=0xffffffff81b9fb56 "%s") at /usr/src/sys/kern/kern_shutdown.c:807
#4  0xffffffff818a2286 in trap_fatal (frame=0xfffffe0000bebf30, eva=0) at /usr/src/sys/amd64/amd64/trap.c:921
#5  0xffffffff818a1828 in trap (frame=0xfffffe0000bebf30) at /usr/src/sys/amd64/amd64/trap.c:593
#6  <signal handler called>
#7  0xffffffff82f19f63 in netlink_send (so=0xfffff80006d2f000, flags=0, m=0xfffff80006714e00, nam=0x0, control=0x0, td=0xfffff800035a0740) at bsd_nlsock.c:314
#8  0xffffffff8121df82 in sosend_generic (so=0xfffff80006d2f000, addr=0x0, uio=0xfffffe00005b58f0, top=0xfffff80006714e00, control=0x0, flags=0, td=0xfffff800035a0740) at /usr/src/sys/kern/uipc_socket.c:1581
#9  0xffffffff8121e256 in sosend (so=0xfffff80006d2f000, addr=0x0, uio=0xfffffe00005b58f0, top=0x0, control=0x0, flags=0, td=0xfffff800035a0740) at /usr/src/sys/kern/uipc_socket.c:1627
#10 0xffffffff8122a1a8 in kern_sendit (td=0xfffff800035a0740, s=3, mp=0xfffffe00005b59d0, flags=0, control=0x0, segflg=UIO_USERSPACE) at /usr/src/sys/kern/uipc_syscalls.c:800
#11 0xffffffff8122a62c in sendit (td=0xfffff800035a0740, s=3, mp=0xfffffe00005b59d0, flags=0) at /usr/src/sys/kern/uipc_syscalls.c:725
#12 0xffffffff8122a4ef in sys_sendto (td=0xfffff800035a0740, uap=0xfffff800035a0b00) at /usr/src/sys/kern/uipc_syscalls.c:843
#13 0xffffffff818a3c79 in syscallenter (td=0xfffff800035a0740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:144
#14 0xffffffff818a336b in amd64_syscall (td=0xfffff800035a0740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1163
#15 <signal handler called>
#16 0x00000008003ef21a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe9e8

mark_j · Jul 27, 2021

Do you have a /usr/lib/debug/boot/kernel/kernel.debug?

iulian · Jul 27, 2021

mark_j said:
Do you have a /usr/lib/debug/boot/kernel/kernel.debug?

Yes I have, as you can see:

root@vm2:/usr # ls -lh /usr/lib/debug/boot/kernel/kernel.debug
-r-xr-xr-x  1 root  wheel    94M Jul 27 19:26 /usr/lib/debug/boot/kernel/kernel.debug

mark_j · Jul 27, 2021

Kernel debugging via forums is, well, basically impossible. It's up to you, because to debug it properly we'd need to know what you were doing first off. Then we need the vmcore relevant to the crash plus the /boot/kernel and the kernel.debug.

Looking at what you've posted, frame 7 invokes bsd_nlsock.c which is some program but not the kernel.

The rest of the stuff is sockets, so my guess would be some sort of mbuf issue, either not enough, bad memory or bad programming.

0x7fffffffe9e8 is a wacky address; perhaps a pointer off to nowhere? You could try:
addr2line -afi -e /usr/lib/debug/boot/kernel/kernel.debug 0x7fffffffe9e8

ralphbsz · Jul 28, 2021

iulian said:
#7 0xffffffff82f19f63 in netlink_send (so=0xfffff80006d2f000, flags=0, m=0xfffff80006714e00, nam=0x0, control=0x0, td=0xfffff800035a0740) at bsd_nlsock.c:314
...
Backtrace stopped: Cannot access memory at address 0x7fffffffe9e8

OK, so the problem occurred at the instruction at address 0xfff...f63, which is in the function netlink_send().

Get a link map, and find out where that function starts. Compile that function with an assembly listing, and find out which instruction sits at ....f63. Find out what that instruction is, what it does, and what pointers it relies on. Then figure out where those pointers come from.

richardtoohey2 · Jul 28, 2021

mark_j said:
Looking at what you've posted, frame 7 invokes bsd_nlsock.c which is some program but not the kernel.

I know less than nothing about this, but isn't there a clue in the first bit of pasted output:

iulian said:
current process = 897 (nltest)

Does that mean a program called nltest was running? Or am I reading too much into the bsd_nlsock.c and nltest?

iulian · Jul 28, 2021

richardtoohey2 said:
Does that mean a program called nltest was running? Or am I reading too much into the bsd_nlsock.c and nltest?

Yes, nltest is a userland program that creates a socket and writes to it in order to trigger netlink_send.

iulian · Jul 28, 2021

Also, I should have mentioned that the crash is not consistent: it happens in different places after I set a breakpoint on netlink_send.
And I am wondering: why this strange behaviour?

ralphbsz · Jul 28, 2021

Most likely because netlink_send() relies on some pointers (could be in memory, could be in registers). Those pointers are likely set to bad values, sometimes. So what you need to do is to find the actual piece of code that crashes, write down what information it needs, and then trace back where that information comes from. For me, that would be most efficiently done by tracing backwards in the source code, simply reading it: Debugging with paper and pencil.

You say it happens "in different places". What do you mean by that: Different instruction addresses? Different functions? Can you figure out what they have in common?

Does the crash happen if you don't set a breakpoint? If setting a breakpoint causes the crash, that narrows it down. My favorite hunch at that point would be that some piece of data needs to be updated under lock (for consistency), and that locking is missing. Without a breakpoint, updates or reading that data happens fast enough that the lack of locking usually doesn't cause problems; with breakpoints, the timing changes massively, and you get race conditions.

iulian · Jul 29, 2021

ralphbsz said:
Does the crash happen if you don't set a breakpoint? If setting a breakpoint causes the crash, that narrows it down. My favorite hunch at that point would be that some piece of data needs to be updated under lock (for consistency), and that locking is missing. Without a breakpoint, updates or reading that data happens fast enough that the lack of locking usually doesn't cause problems; with breakpoints, the timing changes massively, and you get race conditions.

No, when I run without a breakpoint, the code runs without any crash, so I think your hunch is right. But do you have any idea why the breakpoint interferes with the locks? I would appreciate any further reference or explanation with more details.

ralphbsz · Jul 30, 2021

Hunch: Setting a breakpoint changes the timing. The code as written and executed is not actually correct, but happens to work in normal timing. If the timing changes, it becomes incorrect.

Classic example: We have a variable X, which could be as simple as a single bit. Thread A (could be an interrupt handler) changes it at random times. Thread B (could be normal foreground stuff) does the following: It checks the value of the variable, and depending on the value, it does something that depends on the value. For example, the value could be the number of bytes in an input buffer; the interrupt handler occasionally increases it, and foreground code removes as many bytes from the buffer as it thinks there are in it. For example, the foreground code might think that after it removes that many bytes, the buffer should be empty. In normal operation, if the foreground code is really fast, this will nearly always be true, because the probability that an extra byte arrives (the interrupt handler increases the number) is extremely small. If you put a breakpoint into the foreground code, that probability increases massively.

Code:

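#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAPACITY 256  /* Fixed size: let's not worry about space allocation */

static volatile size_t n_bytes_in_queue;  /* Shared with the interrupt handler */
static uint8_t queue[QUEUE_CAPACITY];

extern uint8_t read_byte_from_hardware(void);  /* Placeholder: get it from the hardware */
extern void process_byte(uint8_t b);           /* Placeholder: do something useful with the byte */

void interrupt_handler(void) {
    uint8_t new_byte = read_byte_from_hardware();
    queue[n_bytes_in_queue] = new_byte;  /* Append the new byte */
    ++n_bytes_in_queue;
}

void empty_queue(void) {
    size_t how_many_bytes_to_remove = n_bytes_in_queue;
    /* Put a breakpoint after here, and you have a bug: data will be left in the queue */
    for (size_t i = 0; i < how_many_bytes_to_remove; ++i) {
        process_byte(queue[i]);
    }
    n_bytes_in_queue = 0;  /* This is where the bug gets catastrophic, and turns into data loss. */
}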
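
To make the window visible without real interrupts, here is a small single-threaded sketch that replays the same bug deterministically. Everything in it is made up for illustration: fake_interrupt() stands in for the interrupt handler, and the pause_hook callback stands in for "the debugger stopped here and a byte arrived in the meantime".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_QUEUE_CAPACITY 16

static uint8_t sim_queue[SIM_QUEUE_CAPACITY];
static size_t sim_bytes_in_queue;   /* the separately stored count */
static size_t sim_bytes_processed;  /* how many bytes the consumer handled */

/* Stand-in for the interrupt handler: append one byte. */
void fake_interrupt(uint8_t new_byte) {
    assert(sim_bytes_in_queue < SIM_QUEUE_CAPACITY);
    sim_queue[sim_bytes_in_queue++] = new_byte;
}

/* Same logic as empty_queue(), with a hook that runs inside the race window. */
void empty_queue_with_pause(void (*pause_hook)(void)) {
    size_t how_many_bytes_to_remove = sim_bytes_in_queue;  /* snapshot */
    if (pause_hook != NULL)
        pause_hook();                         /* the queue may grow here */
    sim_bytes_processed += how_many_bytes_to_remove;  /* count them as processed */
    sim_bytes_in_queue = 0;                   /* discards anything that arrived late */
}

/* The "interrupt" that fires while the consumer is stopped. */
void late_byte(void) {
    fake_interrupt(0x42);
}

/* Returns how many queued bytes were never processed. */
size_t lost_bytes(int with_pause) {
    sim_bytes_in_queue = 0;
    sim_bytes_processed = 0;
    fake_interrupt(1);
    fake_interrupt(2);
    size_t produced = 2 + (with_pause ? 1u : 0u);
    empty_queue_with_pause(with_pause ? late_byte : NULL);
    return produced - sim_bytes_processed;
}
```

Calling lost_bytes(0) models the fast path, where nothing arrives inside the window; lost_bytes(1) models the breakpoint case, where one byte arrives after the snapshot and is thrown away by the final reset.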

One way to fix the above code is with locking: Make sure the two functions never run at the same time. That can be easy, for example DI (disable interrupts) and EI (enable interrupts) around the body of empty_queue(). There are other ways to do this, for example using a lock-free data structure for the queue, and not storing the number of bytes in the queue separately from the queue content.
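
As a rough sketch of the lock-free variant, here is a single-producer, single-consumer ring buffer using C11 atomics. The names (byte_ring, ring_push, ring_pop, ring_drain) are hypothetical, and it assumes exactly one producer (the interrupt handler) and one consumer; it is not how any particular kernel does it. The point is that there is no separate byte count to go stale: the fill level is always derived from the two indices.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 256u  /* must be a power of two so index wraparound stays consistent */

struct byte_ring {
    uint8_t data[RING_SIZE];
    atomic_size_t head;  /* next slot to write; only the producer stores it */
    atomic_size_t tail;  /* next slot to read; only the consumer stores it */
};

void ring_init(struct byte_ring *r) {
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side (the "interrupt handler"). Returns false if the ring is full. */
bool ring_push(struct byte_ring *r, uint8_t b) {
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;
    r->data[head % RING_SIZE] = b;
    /* Publish the byte only after it has been written. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false if the ring is empty. */
bool ring_pop(struct byte_ring *r, uint8_t *out) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->data[tail % RING_SIZE];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Drain until empty; handle may be NULL. A byte that arrives mid-drain is
 * either picked up by this loop or left queued, but never discarded. */
size_t ring_drain(struct byte_ring *r, void (*handle)(uint8_t)) {
    size_t n = 0;
    uint8_t b;
    while (ring_pop(r, &b)) {
        if (handle != NULL)
            handle(b);
        ++n;
    }
    return n;
}
```

With this layout, stopping the consumer at a breakpoint only delays the drain. There is no "reset the count to zero" step that could wipe out bytes pushed while the consumer was paused.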

iulian · Jul 30, 2021

Thank you for the above example, it makes things more clear!

iulian · Aug 10, 2021

I finally solved it. The whole problem was that options GDB and options DDB were missing from my kernel configuration, sys/amd64/conf/MY_KERNEL in my case.

TL;DR
I asked myself whether this crash appears only in my module, so I also tried with another, empty module, and the problem occurred there too. I concluded that something was wrong with my kernel, so I asked on #bsddevs on the EFnet IRC server; after some time mhorne suggested that I should use options GDB.

Also, I added options DDB because this blog mentions it.
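
For anyone landing here later, the debugger-related lines in a custom kernel configuration file could look roughly like the fragment below. Only options GDB and options DDB come from the fix described above; the options KDB and makeoptions DEBUG=-g lines are an assumption based on what the stock GENERIC configuration normally carries, so check them against your own config rather than copying blindly.

```text
# Fragment of sys/amd64/conf/MY_KERNEL (sketch, not a complete config)
makeoptions	DEBUG=-g	# Build kernel with gdb(1) debug symbols (assumed, as in GENERIC)
options 	KDB		# Kernel debugger framework (assumed, as in GENERIC)
options 	DDB		# Interactive in-kernel debugger
options 	GDB		# Remote GDB backend
```

After changing the file, the kernel has to be rebuilt and installed for the options to take effect.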

richardtoohey2 · Aug 10, 2021

Thanks for reporting back. It's always good to see the solution, and it helps people in the future.
