Jails: kernel panic after upgrade to FreeBSD 13

Greetings,
I recently encountered a problem with my jails (I use iocage to manage them) which leads to a crash of the host system.
I've been using this setup for quite a while without similar issues, but after upgrading to 13, directing network traffic in or out of my jails leads to a kernel panic.

So I have a jail running Zabbix which crashes my server when I try to access its web UI.
The same happens when I try to access anything from inside the jail (connected via console):
Code:
root@phcn-zabbix:~ # curl www.google.de
client_loop: send disconnect: Connection reset
boom, host is down.

The only special thing in my setup I can think of is that I use vxlan interfaces (and fibs) to separate my networks.
I also have bhyve-based virtual machines on the same vxlan, which still work perfectly after the upgrade.
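For reference, a minimal sketch of that kind of setup (the interface names, VNI, addresses, and fib numbers here are illustrative placeholders, not my exact configuration):

```shell
# Create a vxlan interface on top of the physical NIC (all values illustrative)
ifconfig vxlan0 create vxlanid 42 vxlanlocal 192.0.2.1 \
    vxlangroup 239.0.0.42 vxlandev igb0
ifconfig vxlan0 inet 10.0.42.1/24 up

# Jail traffic does its route lookups in a separate routing table;
# this requires net.fibs > 1, e.g. net.fibs=2 in /boot/loader.conf
setfib 1 route add default 10.0.42.254
```

The jails then get their interfaces placed on that vxlan, with their default route in the non-default fib.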

I already tried upgrading from 12.2 to 13 in April with the same result, rolled back my boot environment, and tried again yesterday; the outcome is identical.
As soon as I'm back to 12.2 everything works just fine.

Has anyone encountered a similar issue or has an idea what to do?
I might just try setting up a test jail via ezjail, or without any of these helper scripts, to check whether it's an iocage-only problem, but I feel like a jail should not be able to crash the host at all.

Unfortunately there aren't many normal logs of these events, and I'm not skilled enough to do something useful with the core dumps.
I do see a great number of these login_getclass errors:
Code:
Jun 25 23:26:21 Server1 jail[2984]: login_getclass: unknown class 'root'
Jun 25 23:26:21 Server1 jail[2985]: login_getclass: unknown class 'root'
Jun 25 23:26:21 Server1 jail[2986]: login_getclass: unknown class 'root'
Jun 25 23:26:21 Server1 jail[2987]: login_getclass: unknown class 'root'
Jun 25 23:26:21 Server1 jail[2988]: login_getclass: unknown class 'root'
Jun 25 23:26:56 Server1 jail[3617]: login_getclass: unknown class 'root'
Jun 25 23:26:56 Server1 jail[3618]: login_getclass: unknown class 'root'
Jun 25 23:26:56 Server1 jail[3619]: login_getclass: unknown class 'root'
Jun 25 23:26:56 Server1 jail[3620]: login_getclass: unknown class 'root'
Jun 25 23:26:56 Server1 jail[3621]: login_getclass: unknown class 'root'
Jun 25 23:26:56 Server1 jail[3622]: login_getclass: unknown class 'root'
Jun 25 23:28:21 Server1 jexec[3916]: login_getclass: unknown class 'root'
Jun 25 23:29:25 Server1 syslogd: kernel boot file is /boot/kernel/kernel
Jun 25 23:29:25 Server1 kernel: ---<<BOOT>>---
Jun 25 23:29:25 Server1 kernel: Copyright (c) 1992-2021 The FreeBSD Project.
Jun 25 23:29:25 Server1 kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Jun 25 23:29:25 Server1 kernel:         The Regents of the University of California. All rights reserved.
Jun 25 23:29:25 Server1 kernel: FreeBSD is a registered trademark of The FreeBSD Foundation.
Jun 25 23:29:25 Server1 kernel: FreeBSD 13.0-RELEASE-p1 #0: Wed May 26 22:15:09 UTC 2021
Jun 25 23:29:25 Server1 kernel:     [email]root@amd64-builder.daemonology.net[/email]:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
Jun 25 23:29:25 Server1 kernel: FreeBSD clang version 11.0.1 ([email]git@github.com[/email]:llvm/llvm-project.git llvmorg-11.0.1-0-g43ff75f2c3fe)
Jun 25 23:29:25 Server1 kernel: VT(vga): resolution 640x480
Jun 25 23:29:25 Server1 kernel: CPU: AMD Ryzen 7 3700X 8-Core Processor              (3600.07-MHz K8-class CPU)

Code:
root@Server1  ~ grep -r -i "panic" /var/log
/var/log/messages:Apr 18 14:27:03 Server1 savecore[1700]: reboot after panic: double fault
/var/log/messages:Apr 18 14:28:35 Server1 savecore[1685]: reboot after panic: double fault
/var/log/messages:Apr 18 14:30:23 Server1 savecore[1685]: reboot after panic: double fault
/var/log/messages:Apr 18 14:57:27 Server1 savecore[1681]: reboot after panic: double fault
/var/log/messages:Apr 18 14:59:31 Server1 savecore[1679]: reboot after panic: double fault
/var/log/messages:Apr 20 19:56:25 Server1 savecore[1693]: reboot after panic: double fault
/var/log/messages:Jun 25 23:03:04 Server1 savecore[1691]: reboot after panic: double fault
/var/log/messages:Jun 25 23:13:31 Server1 savecore[1678]: reboot after panic: double fault
/var/log/daemon.log:Apr 18 14:27:03 Server1 savecore[1700]: reboot after panic: double fault
/var/log/daemon.log:Apr 18 14:28:35 Server1 savecore[1685]: reboot after panic: double fault
/var/log/daemon.log:Apr 18 14:30:23 Server1 savecore[1685]: reboot after panic: double fault
/var/log/daemon.log:Apr 18 14:57:27 Server1 savecore[1681]: reboot after panic: double fault
/var/log/daemon.log:Apr 18 14:59:31 Server1 savecore[1679]: reboot after panic: double fault
/var/log/daemon.log:Apr 20 19:56:25 Server1 savecore[1693]: reboot after panic: double fault
/var/log/daemon.log:Jun 25 23:03:04 Server1 savecore[1691]: reboot after panic: double fault
/var/log/daemon.log:Jun 25 23:13:31 Server1 savecore[1678]: reboot after panic: double fault
 
At /var/crash, are there core.txt.⋯ files?
Yes, there are core.txt.X, info.X, and vmcore.X files from one of the latest crashes.
I haven't managed to open the core.txt files, but the vmcore is accessible and contains the following:

Code:
root@Server1  ~ cat /var/crash/core.txt.7
Unable to find a kernel debugger.
Please install the devel/gdb port or gdb package.

root@Server1  ~ kgdb /boot/kernel/kernel /var/crash/vmcore.7
GNU gdb (GDB) 10.1 [GDB v10.1 for FreeBSD]
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd13.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:

Fatal double fault
rip 0xffffffff80da0eda rsp 0xfffffe00dfa72fc0 rbp 0xfffffe00dfa73040
rax 0xf0441070603e6aac rdx 0 rbx 0xfffffe00dfa73140
rcx 0 rsi 0x2 rdi 0x2
r8 0x73c04106 r9 0 r10 0x9c00
r11 0x58 r12 0x73c04106 r13 0x14
r14 0 r15 0x101a8c0 rflags 0x10286
cs 0x20 ss 0x28 ds 0x3b es 0x3b fs 0x13 gs 0x1b
fsbase 0x800240120 gsbase 0xffffffff82810000 kgsbase 0
cpuid = 0; apic id = 00
panic: double fault
cpuid = 0
time = 1624655548
KDB: stack backtrace:
Uptime: 9m33s
Dumping 1285 out of 32659 MB:..2%..12%..22%..32%..42%..52%..62%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
 
My kernel panicked on FreeBSD 13.0 as well. I removed many customized settings and am using the GENERIC kernel for now, because I'm not sure what caused it, and it works.

Others have said that upgrading to 13.0 caused issues for them too; perhaps it has to do with drivers or ACPI. There also tend to be lots of bugs when moving to a .0 release.
 
core.txt.⋯ files are plain text.
root@Server1 ~ cat /var/crash/core.txt.7
Unable to find a kernel debugger.
Please install the devel/gdb port or gdb package.

I don't know; either that's the whole content of the file, in which case it doesn't help at all, or it's telling me to install gdb, which I already have.
 
Those don't look good. Check your /etc/login.conf on both the jail and host side. I suspect there's a merge gone wrong here.
That's a good direction, thank you. login.conf on the host does look broken indeed. Seems like I messed up merging during the upgrade.


Code:
root@Server1  ~ grep -v '^#' /etc/login.conf               



default:\
    :passwd_format=sha512:\
    :copyright=/etc/COPYRIGHT:\
<<<<<<< current version
    :welcome=/etc/motd:\
<<<<<<< current version
    :setenv=MAIL=/var/mail/$,BLOCKSIZE=K:\
    :setenv=IOCAGE_LOGFILE=/var/log/iocage.log,IOCAGE_COLOR=TRUE:\
=======
=======
    :welcome=/var/run/motd:\
>>>>>>> 13.0-RELEASE
    :setenv=BLOCKSIZE=K:\
    :mail=/var/mail/$:\
>>>>>>> 12.2-RELEASE
    :path=/sbin /bin /usr/sbin /usr/bin /usr/local/sbin /usr/local/bin ~/bin:\
    :nologin=/var/run/nologin:\
    :cputime=unlimited:\
    :datasize=unlimited:\
    :stacksize=unlimited:\
    :memorylocked=64K:\
    :memoryuse=unlimited:\
    :filesize=unlimited:\
    :coredumpsize=unlimited:\
    :openfiles=unlimited:\
    :maxproc=unlimited:\
    :sbsize=unlimited:\
    :vmemoryuse=unlimited:\
    :swapuse=unlimited:\
    :pseudoterminals=unlimited:\
    :kqueues=unlimited:\
    :umtxp=unlimited:\
    :priority=0:\
    :ignoretime@:\
    :umask=022:\
    :charset=UTF-8:\
    :lang=C.UTF-8:

standard:\
    :tc=default:
xuser:\
    :tc=default:
staff:\
    :tc=default:

daemon:\
    :path=/sbin /bin /usr/sbin /usr/bin /usr/local/sbin /usr/local/bin:\
    :mail@:\
    :memorylocked=128M:\
    :tc=default:
news:\
    :tc=default:
dialer:\
    :tc=default:

root:\
    :ignorenologin:\
    :memorylocked=unlimited:\
    :tc=default:

russian|Russian Users Accounts:\
    :charset=UTF-8:\
    :lang=ru_RU.UTF-8:\
    :tc=default:


Please run the command below. When deletion occurs, is there any report of a missing file?

pkg delete -fy gdb && pkg install -y gdb
Code:
root@Server1  ~ pkg delete -fy gdb && pkg install -y gdb
Checking integrity... done (0 conflicting)
Deinstallation has been requested for the following 1 packages (of 0 packages in the universe):

Installed packages to be REMOVED:
    gdb: 10.1_1

Number of packages to be removed: 1

The operation will free 52 MiB.
[1/1] Deinstalling gdb-10.1_1...
[1/1] Deleting files for gdb-10.1_1: 100%
Updating FreeBSD repository catalogue...
FreeBSD repository is up to date.
All repositories are up to date.
Checking integrity... done (0 conflicting)
The following 1 package(s) will be affected (of 0 checked):

New packages to be INSTALLED:
    gdb: 10.1_1

Number of packages to be installed: 1

The process will require 52 MiB more space.
[1/1] Installing gdb-10.1_1...
[1/1] Extracting gdb-10.1_1: 100%
No missing files reported
I just noticed I am able to open one of the core.txt files: it is /var/crash/core.txt.5, which happens to be the only one of reasonable size.
Code:
root@Server1  ~ ls -lah /var/crash/core.txt.*
-rw-r--r--  1 root  wheel    84B Apr 18 14:27 /var/crash/core.txt.0
-rw-r--r--  1 root  wheel    84B Apr 18 14:28 /var/crash/core.txt.1
-rw-r--r--  1 root  wheel    84B Apr 18 14:30 /var/crash/core.txt.2
-rw-r--r--  1 root  wheel    84B Apr 18 14:57 /var/crash/core.txt.3
-rw-r--r--  1 root  wheel    84B Apr 18 14:59 /var/crash/core.txt.4
-rw-r--r--  1 root  wheel   147K Apr 20 19:56 /var/crash/core.txt.5
-rw-r--r--  1 root  wheel    84B Jun 25 23:03 /var/crash/core.txt.6
-rw-r--r--  1 root  wheel    84B Jun 25 23:13 /var/crash/core.txt.7
 
… Try and fix that first. …

+1

The current /etc/login.conf before anything else.

Retrospective

… tried upgrading 12.2 to 13 in April with the same result, rolled back …

-rw-r--r-- 1 root wheel 147K Apr 20 19:56 /var/crash/core.txt.5

This one might be of interest, if you'd like to share some or all of its content.

To include the backtrace(s), probably everything above the first divider line.
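If it helps, a small shell helper like this should pull out just that portion (assuming the divider lines in core.txt start with a run of dashes, which is how crashinfo(8) usually separates sections):

```shell
# Print everything above the first divider line (leading dashes) in a core.txt
extract_head() {
    awk '/^----/ { exit } { print }' "$1"
}

# Demonstrated on a stand-in file; the real target would be /var/crash/core.txt.5
printf 'panic: double fault\nbacktrace lines...\n------------------------------\nps -axlww output...\n' > /tmp/core.sample
extract_head /tmp/core.sample
```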

 
That doesn't look good. Try and fix that first. I have a sneaky suspicion it's the cause of the crashes.
I fixed these by copying a clean login.conf and running
root@Server1 ~ cap_mkdb /etc/login.conf

rebooted afterwards, now i have no more 'unknown class' errors in my logs.

thanks for your help, unfortunately the kernel panic persists.

+1

The current /etc/login.conf before anything else.

Retrospective

This one might be of interest, if you'd like to share some or all of its content.

To include the backtrace(s), probably everything above the first divider line.


Sure, I don't mind sharing it; I can't do much with that information myself. Also, I finally realized how the core.txt files work: I have to run kgdb on the vmcore file first to populate the text files...

It's too many characters for a single post, so I had to put it elsewhere:
 
rebooted afterwards, now i have no more 'unknown class' errors in my logs.
OK, that's good. At least you eliminated a bunch of errors and warnings; those are never good. I personally like to keep my logs clean so actual errors and warnings are clearer and don't get drowned out by a bunch of useless ones.

It's too many characters for a single post, so I had to put it elsewhere:
Neat trick: cat somereallylongtextfile.txt | nc termbin.com 9999
 
This instruction at line 55 is basically there to fool the compiler into assuming the per-CPU thread information on the stack won't change. (Apologies if I didn't get the vernacular absolutely correct.) It's a one-off call by the kernel, not applicable to userland, and if it triggers a fault it basically means your memory is corrupt.
Something's changed the stack and BOOM!! That's your double fault. 30-love. :)
 
This instruction at line 55 is basically there to fool the compiler into assuming the per-CPU thread information on the stack won't change. (Apologies if I didn't get the vernacular absolutely correct.) It's a one-off call by the kernel, not applicable to userland, and if it triggers a fault it basically means your memory is corrupt.
Something's changed the stack and BOOM!! That's your double fault. 30-love. :)
Thank you for the explanation.

I'm going to test with different memory anyway, but do you have an idea why it doesn't happen on 12.2?
The same instruction seems to be there as well.
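One way to compare would be to disassemble around the faulting rip on each kernel, e.g. (the address is the one from the panic above; kgdb here is the one from devel/gdb and should pass standard gdb options through, otherwise the same x/8i command can be typed at the (kgdb) prompt):

```shell
# Disassemble the instructions around the rip reported in the panic
kgdb -q /boot/kernel/kernel -ex 'x/8i 0xffffffff80da0eda' -ex 'quit'

# Repeat against the 12.2 kernel in the old boot environment, e.g.:
# kgdb -q /path/to/12.2-BE/boot/kernel/kernel -ex 'x/8i <12.2 rip>' -ex 'quit'
```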
 
I looked at capra's backtrace, and given it's all related to taskqueue(9)'s handling of TCP traffic between the jail(s) and host, that would seem to be a reasonable place to start.
BTW, I think you're making a pretty solid assumption that a jail should not crash the host.

The fact there's a double fault means the stack just got overflowed by something like a recursion. Is it always failing at the same point, with the same register (rcx, rip, etc.) information?

This could be a bug in the kernel and should probably be raised as such. Especially so if it can be reproduced.

Can we assume this is FreeBSD 13 Release and not some kernel with WITNESS set?
 
Oh, I was sitting on that post for a long time and in the meantime you've answered one key question. No, I still reckon memory is not the issue here, that this is a genuine bug. After all, as you say, the same situation running 12.2 just works.
 
Frame 7 seems to be a good place to start when looking into the issue: fib4_lookup at /usr/src/sys/netinet/in_fib.c:144. It seems a signal handler is called at that point, which is where the fault occurs (hence the double fault).
This is very likely a bug due to reasons mark_j already said above.

What is nice is you can reproduce this. Please open a PR and share the information there.
 
I looked at capra's backtrace, and given it's all related to taskqueue(9)'s handling of TCP traffic between the jail(s) and host, that would seem to be a reasonable place to start.
BTW, I think you're making a pretty solid assumption that a jail should not crash the host.

The fact there's a double fault means the stack just got overflowed by something like a recursion. Is it always failing at the same point, with the same register (rcx, rip etc) information?

This could be a bug in the kernel and should probably be raised as such. Especially so if it can be reproduced.

Can we assume this is FreeBSD 13 Release and not some kernel with WITNESS set?
Yes, this is FreeBSD 13.0-RELEASE:
Code:
root@Server1  ~ freebsd-version -ku
13.0-RELEASE-p1
13.0-RELEASE-p2
root@Server1  ~ uname -a
FreeBSD Server1 13.0-RELEASE-p1 FreeBSD 13.0-RELEASE-p1 #0: Wed May 26 22:15:09 UTC 2021     root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64

Edit: actually, I'm not sure if the registers are the same.

Code:
root@Server1  ~ grep "rip\|rcx" /var/crash/core.txt.8 -r /var/crash/core.txt.*
/var/crash/core.txt.8:rip 0xffffffff80da0eda rsp 0xfffffe00dfa7cfc0 rbp 0xfffffe00dfa7d040
/var/crash/core.txt.8:rcx 0 rsi 0x2 rdi 0x2
/var/crash/core.txt.8:ripcb:                  488, 1045113,       0,       0,       0,   0,   0,   0
/var/crash/core.txt.8:rip6:
/var/crash/core.txt.8:em0: Using 1024 TX descriptors and 1024 RX descriptors
/var/crash/core.txt.8:igb0: Using 1024 TX descriptors and 1024 RX descriptors
/var/crash/core.txt.8:rip 0xffffffff80da0eda rsp 0xfffffe00dfa7cfc0 rbp 0xfffffe00dfa7d040
/var/crash/core.txt.8:rcx 0 rsi 0x2 rdi 0x2
/var/crash/core.txt.5:rip 0xffffffff80d439cc rsp 0xfffffe00dfa73000 rbp 0xfffffe00dfa73020
/var/crash/core.txt.5:rcx 0 rsi 0xfffffe00dfa73070 rdi 0xfffff80006541c00
/var/crash/core.txt.5:ripcb:                  488, 1045113,       0,       8,       2,   0,   0,   0
/var/crash/core.txt.5:rip6:
/var/crash/core.txt.5:em0: Using 1024 TX descriptors and 1024 RX descriptors
/var/crash/core.txt.5:igb0: Using 1024 TX descriptors and 1024 RX descriptors
/var/crash/core.txt.5:rip 0xffffffff80d439cc rsp 0xfffffe00dfa73000 rbp 0xfffffe00dfa73020
/var/crash/core.txt.5:rcx 0 rsi 0xfffffe00dfa73070 rdi 0xfffff80006541c00
/var/crash/core.txt.8:rip 0xffffffff80da0eda rsp 0xfffffe00dfa7cfc0 rbp 0xfffffe00dfa7d040
/var/crash/core.txt.8:rcx 0 rsi 0x2 rdi 0x2
/var/crash/core.txt.8:ripcb:                  488, 1045113,       0,       0,       0,   0,   0,   0
/var/crash/core.txt.8:rip6:
/var/crash/core.txt.8:em0: Using 1024 TX descriptors and 1024 RX descriptors
/var/crash/core.txt.8:igb0: Using 1024 TX descriptors and 1024 RX descriptors
/var/crash/core.txt.8:rip 0xffffffff80da0eda rsp 0xfffffe00dfa7cfc0 rbp 0xfffffe00dfa7d040
/var/crash/core.txt.8:rcx 0 rsi 0x2 rdi 0x2
 
Honestly it's hard to debug unless there's a core file to look at. Do you have a core file that you can look at with ddb? This would certainly help debug the issue, but this is all on your time. I would like to see the callq that leads to the jmp/jne of rcx. Sorry for the gobbledygook. o_O

The real issue here is not the fault itself, which as _martin points out is generated by /usr/src/sys/netinet/in_fib.c, but that this fault was not handled and instead ended up in the "double fault" handler (triple faults are possible, but nothing beyond that, as the processor will reset; see https://www.amd.com/system/files/TechDocs/24593.pdf page 533 if you need some light reading before bed. ;))

So backtraces are just a no-go.

If it was me personally debugging this, I would build a kernel with these options:
options DDB
options GDB
options QUEUE_MACRO_DEBUG_TRASH
(I am not sure if if_vxlan.c uses queue(3); I will check.)

and maybe even
options WITNESS
options WITNESS_KDB
options WITNESS_SKIPSPIN

Then boot with -d and wait for the magic to happen.
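If it helps, a sketch of how such a debug kernel could be built (assuming a stock source tree under /usr/src; the config name DEBUG is just a placeholder):

```shell
# Derive a debug config from GENERIC and build/install it
cd /usr/src/sys/amd64/conf
cat > DEBUG <<'EOF'
include GENERIC
ident   DEBUG
options DDB
options GDB
options QUEUE_MACRO_DEBUG_TRASH
options WITNESS
options WITNESS_KDB
options WITNESS_SKIPSPIN
EOF

cd /usr/src
make -j"$(sysctl -n hw.ncpu)" buildkernel KERNCONF=DEBUG
make installkernel KERNCONF=DEBUG
```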

If you post a bug report, you'll need to do the heavy lifting because a developer is unlikely to have the exotic set-up you've got. :)
 