"Maybe if you posted the whole thing we can try and figure out what's causing it."
It is a pretty big kld that I'm working on, and the crash happens while I step through the kld code line by line (after hitting a breakpoint). If I execute the code without the debugger, there is no crash.
Unread portion of the kernel message buffer:
Fatal trap 10: debug exception while in kernel mode
cpuid = 1; apic id = 01
instruction pointer = 0x20:0xffffffff82f19f63
stack pointer = 0x28:0xfffffe00005b5570
frame pointer = 0x28:0xfffffe00005b55a0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = trace trap, interrupt enabled, IOPL = 0
current process = 897 (nltest)
trap number = 10
panic: debug exception
cpuid = 1
time = 1627400554
KDB: stack backtrace:
Uptime: 6m36s
Dumping 164 out of 978 MB:..10%..20%..30%..39%..49%..59%..68%..78%..88%..98%
0xffffffff81125876 in doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:370
370 savectx(&dumppcb);
>>> bt
#0 0xffffffff81125876 in doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:370
#1 0xffffffff81125433 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:451
#2 0xffffffff81125e3c in vpanic (fmt=0xffffffff81b9fb56 "%s", ap=0xfffffe0000bebb40) at /usr/src/sys/kern/kern_shutdown.c:880
#3 0xffffffff81125b80 in panic (fmt=0xffffffff81b9fb56 "%s") at /usr/src/sys/kern/kern_shutdown.c:807
#4 0xffffffff818a2286 in trap_fatal (frame=0xfffffe0000bebf30, eva=0) at /usr/src/sys/amd64/amd64/trap.c:921
#5 0xffffffff818a1828 in trap (frame=0xfffffe0000bebf30) at /usr/src/sys/amd64/amd64/trap.c:593
#6 <signal handler called>
#7 0xffffffff82f19f63 in netlink_send (so=0xfffff80006d2f000, flags=0, m=0xfffff80006714e00, nam=0x0, control=0x0, td=0xfffff800035a0740) at bsd_nlsock.c:314
#8 0xffffffff8121df82 in sosend_generic (so=0xfffff80006d2f000, addr=0x0, uio=0xfffffe00005b58f0, top=0xfffff80006714e00, control=0x0, flags=0, td=0xfffff800035a0740) at /usr/src/sys/kern/uipc_socket.c:1581
#9 0xffffffff8121e256 in sosend (so=0xfffff80006d2f000, addr=0x0, uio=0xfffffe00005b58f0, top=0x0, control=0x0, flags=0, td=0xfffff800035a0740) at /usr/src/sys/kern/uipc_socket.c:1627
#10 0xffffffff8122a1a8 in kern_sendit (td=0xfffff800035a0740, s=3, mp=0xfffffe00005b59d0, flags=0, control=0x0, segflg=UIO_USERSPACE) at /usr/src/sys/kern/uipc_syscalls.c:800
#11 0xffffffff8122a62c in sendit (td=0xfffff800035a0740, s=3, mp=0xfffffe00005b59d0, flags=0) at /usr/src/sys/kern/uipc_syscalls.c:725
#12 0xffffffff8122a4ef in sys_sendto (td=0xfffff800035a0740, uap=0xfffff800035a0b00) at /usr/src/sys/kern/uipc_syscalls.c:843
#13 0xffffffff818a3c79 in syscallenter (td=0xfffff800035a0740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:144
#14 0xffffffff818a336b in amd64_syscall (td=0xfffff800035a0740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1163
#15 <signal handler called>
#16 0x00000008003ef21a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe9e8
"Do you have a /usr/lib/debug/boot/kernel/kernel.debug?"
Yes I have, as you can see:
root@vm2:/usr # ls -lh /usr/lib/debug/boot/kernel/kernel.debug
-r-xr-xr-x 1 root wheel 94M Jul 27 19:26 /usr/lib/debug/boot/kernel/kernel.debug
addr2line -afi -e /usr/lib/debug/boot/kernel/kernel.debug 0x7fffffffe9e8
#7 0xffffffff82f19f63 in netlink_send (so=0xfffff80006d2f000, flags=0, m=0xfffff80006714e00, nam=0x0, control=0x0, td=0xfffff800035a0740) at bsd_nlsock.c:314
...
Backtrace stopped: Cannot access memory at address 0x7fffffffe9e8
OK, so the problem occurred at the instruction at address 0xfff...f63, which is in the function netlink_send().
"Looking at what you've posted, frame 7 invokes bsd_nlsock.c which is some program but not the kernel."
I know less than nothing about this, but isn't there a clue in the first bit of pasted output:
current process = 897 (nltest)
"Does that mean a program called nltest was running? Or am I reading too much into the bsd_nlsock.c and nltest?"
Yes, nltest is a userland program that creates a socket and writes to it in order to trigger netlink_send.
"Does the crash happen if you don't set a breakpoint? If setting a breakpoint causes the crash, that narrows it down. My favorite hunch at that point would be that some piece of data needs to be updated under lock (for consistency), and that locking is missing. Without a breakpoint, updating or reading that data happens fast enough that the lack of locking usually doesn't cause problems; with breakpoints, the timing changes massively, and you get race conditions."
No, when I run without a breakpoint the code runs without any crash, so I think your hunch is right. Do you have any idea why the breakpoint interferes with the locking? I would appreciate any further references or a more detailed explanation.
int n_bytes_in_queue;
byte* queue; // Let's not worry about space allocation

void interrupt_handler() {
    byte new_byte = ... get it from the hardware;
    ++n_bytes_in_queue;
    queue.append(new_byte);
}

void empty_queue() {
    int how_many_bytes_to_remove = n_bytes_in_queue;
    // Put a breakpoint after here, and you have a bug: bytes that arrive from now on will be left in the queue
    if (how_many_bytes_to_remove > 0) {
        for (int i = 0; i < how_many_bytes_to_remove; ++i) {
            byte b = queue.pop();
            ... do something useful with b
        }
    }
    n_bytes_in_queue = 0; // This is where the bug gets catastrophic, and turns into data loss.
}
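For comparison, here is a minimal userland sketch of the same queue with the missing locking added. It is only an illustration under assumptions: it uses POSIX threads rather than kernel primitives, a thread stands in for the interrupt handler, and every name in it (queue_lock, enqueue_byte, producer, run_demo, QUEUE_CAP, N_PRODUCED) is made up for this example. The point is that the count is read, the bytes are taken out, and the count is reset while holding one lock, so a pause at any point (a breakpoint, for instance) cannot lose data.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <string.h>

#define QUEUE_CAP  1024     /* hypothetical fixed capacity */
#define N_PRODUCED 100000   /* hypothetical number of bytes to push through */

static unsigned char queue[QUEUE_CAP];
static int n_bytes_in_queue;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the interrupt handler. Returns 1 if the byte was queued,
 * 0 if the queue was full. The count and the data change together,
 * under the lock. */
static int enqueue_byte(unsigned char b)
{
    int ok = 0;

    pthread_mutex_lock(&queue_lock);
    if (n_bytes_in_queue < QUEUE_CAP) {
        queue[n_bytes_in_queue++] = b;
        ok = 1;
    }
    pthread_mutex_unlock(&queue_lock);
    return ok;
}

/* Copies out everything currently queued and resets the count while
 * holding the lock. Returns the number of bytes copied into out. */
static int empty_queue(unsigned char *out)
{
    int n;

    pthread_mutex_lock(&queue_lock);
    n = n_bytes_in_queue;
    assert(n >= 0 && n <= QUEUE_CAP);
    memcpy(out, queue, (size_t)n);
    n_bytes_in_queue = 0;
    pthread_mutex_unlock(&queue_lock);
    return n;   /* the caller processes out[0..n-1] outside the lock */
}

/* Producer thread: retries until every byte has been accepted. */
static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < N_PRODUCED; ) {
        if (enqueue_byte((unsigned char)i))
            i++;
        else
            sched_yield();  /* queue full, let the consumer run */
    }
    return NULL;
}

/* Runs producer and consumer concurrently; returns the bytes consumed. */
static long run_demo(void)
{
    pthread_t t;
    unsigned char buf[QUEUE_CAP];
    long consumed = 0;

    pthread_create(&t, NULL, producer, NULL);
    while (consumed < N_PRODUCED) {
        int n = empty_queue(buf);
        if (n == 0)
            sched_yield();  /* nothing queued yet */
        consumed += n;
    }
    pthread_join(t, NULL);
    return consumed;
}
```

In kernel code the same idea would be expressed with the kernel's own locking primitives (on FreeBSD, see mutex(9)) instead of pthread mutexes, and the right kind of lock depends on the context the code runs in.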
options GDB and options DDB were missing from the kernel configuration, sys/amd64/conf/MY_KERNEL in my case.
options GDB. options DDB, because this blog mentions it.
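For reference, a kernel configuration with the debugger options enabled might look roughly like the fragment below. Only options GDB and options DDB come from this thread; the options KDB and makeoptions DEBUG=-g lines are an assumption based on what the stock GENERIC configuration contains, so check them against the configuration file for your release.

```
# Hypothetical excerpt from sys/amd64/conf/MY_KERNEL
makeoptions	DEBUG=-g	# assumption: build with debug symbols, as GENERIC does
options 	KDB		# assumption: kernel debugger framework the backends plug into
options 	DDB		# interactive in-kernel debugger backend
options 	GDB		# remote GDB backend
```

After changing the configuration, the kernel has to be rebuilt and installed before the options take effect.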