Kernel panic with netgraph and mpd5.8

Donald Baud · Jul 12, 2016

Hi, I'm running an L2TP LNS using net/mpd5, and the server panics about once every 24 hours. This is a new project replacing a Cisco 7206 (700 sessions, 800 Mbit/s). It looks like it crashes somewhere in the netgraph.ko module. Could someone please help me troubleshoot this issue? It always crashes around the same location: instruction pointer = 0x20:0xffffffff81c3828d
The crash happens at random times, not necessarily under heavy load.

- using plain GENERIC kernel

Code:

10.3-RELEASE-p4 FreeBSD 10.3-RELEASE-p4 #0: Sat May 28 12:23:44 UTC 2016 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64

- # kldstat

Code:

Id Refs Address            Size     Name
 1   32 0xffffffff80200000 17bc6a8  kernel
 2    2 0xffffffff81c11000 114db    ipfw.ko
 3    1 0xffffffff81c23000 d32f     dummynet.ko
 4    1 0xffffffff81c31000 3831     ng_socket.ko
 5    8 0xffffffff81c35000 ba02     netgraph.ko
 6    1 0xffffffff81c41000 2b99     ng_mppc.ko
 7    1 0xffffffff81c44000 80c      rc4.ko
 8    1 0xffffffff81c45000 23dc     vmmemctl.ko
 9    1 0xffffffff81c48000 397d     ng_l2tp.ko
10    1 0xffffffff81c4c000 4b04     ng_ksocket.ko
11    1 0xffffffff81c51000 17d6     ng_tee.ko
12    1 0xffffffff81c53000 40d2     ng_iface.ko
13    1 0xffffffff81c58000 5829     ng_ppp.ko
14    1 0xffffffff81c5e000 18b1     ng_tcpmss.ko

- /etc/rc.conf

Code:

mpd_enable="YES" 
quagga_daemons="zebra ospfd" 
devd_enable="NO" 
ipv6_network_interfaces="none" 
ip6addrctl_enable="NO"

- /etc/sysctl.conf

Code:

net.inet.ip.fastforwarding=1 
hw.intr_storm_threshold=40000 
net.graph.maxdgram=524288 
net.graph.recvspace=524288

- /boot/loader.conf

Code:

net.graph.maxdata=20480 
net.graph.maxalloc=20480

- grep kernel: /var/log/messages

Code:

Jul 12 04:18:05 mybox syslogd: kernel boot file is /boot/kernel/kernel 
Jul 12 04:18:05 mybox kernel: 
Jul 12 04:18:05 mybox kernel: 
Jul 12 04:18:05 mybox kernel: Fatal trap 9: general protection fault while in kernel mode 
Jul 12 04:18:05 mybox kernel: cpuid = 0; apic id = 00 
Jul 12 04:18:05 mybox kernel: instruction pointer  = 0x20:0xffffffff81c3828d 
Jul 12 04:18:05 mybox kernel: stack pointer  = 0x28:0xfffffe0174da8380 
Jul 12 04:18:05 mybox kernel: frame pointer  = 0x28:0xfffffe0174da83c0 
Jul 12 04:18:05 mybox kernel: code segment  = base 0x0, limit 0xfffff, type 0x1b 
Jul 12 04:18:05 mybox kernel: = DPL 0, pres 1, long 1, def32 0, gran 1 
Jul 12 04:18:05 mybox kernel: processor eflags  = interrupt enabled, resume, IOPL = 0 
Jul 12 04:18:05 mybox kernel: current process  = 659 (ng_queue3) 
Jul 12 04:18:05 mybox kernel: trap number  = 9 
Jul 12 04:18:05 mybox kernel: panic: general protection fault 
Jul 12 04:18:05 mybox kernel: cpuid = 0 
Jul 12 04:18:05 mybox kernel: KDB: stack backtrace: 
Jul 12 04:18:05 mybox kernel: #0 0xffffffff8098e390 at kdb_backtrace+0x60 
Jul 12 04:18:05 mybox kernel: #1 0xffffffff80951066 at vpanic+0x126 
Jul 12 04:18:05 mybox kernel: #2 0xffffffff80950f33 at panic+0x43 
Jul 12 04:18:05 mybox kernel: #3 0xffffffff80d55f7b at trap_fatal+0x36b 
Jul 12 04:18:05 mybox kernel: #4 0xffffffff80d55bfd at trap+0x77d 
Jul 12 04:18:05 mybox kernel: #5 0xffffffff80d3b8d2 at calltrap+0x8 
Jul 12 04:18:05 mybox kernel: #6 0xffffffff81c49606 at ng_l2tp_rcvdata_lower+0x946 
Jul 12 04:18:05 mybox kernel: #7 0xffffffff81c370ca at ng_apply_item+0x21a 
Jul 12 04:18:05 mybox kernel: #8 0xffffffff81c36d1a at ng_snd_item+0x38a 
Jul 12 04:18:05 mybox kernel: #9 0xffffffff81c4d3e2 at ng_ksocket_incoming2+0x2f2 
Jul 12 04:18:05 mybox kernel: #10 0xffffffff81c36f62 at ng_apply_item+0xb2 
Jul 12 04:18:05 mybox kernel: #11 0xffffffff81c38d39 at ngthread+0x1b9 
Jul 12 04:18:05 mybox kernel: #12 0xffffffff8091a4ea at fork_exit+0x9a 
Jul 12 04:18:05 mybox kernel: #13 0xffffffff80d3be0e at fork_trampoline+0xe 
Jul 12 04:18:05 mybox kernel: Uptime: 1d19h53m59s 
Jul 12 04:18:05 mybox kernel: Dumping 560 out of 6119 MB:..3%..12%..23%..32%..43%..52%..63%..72%..83%..92% 
Jul 12 04:18:05 mybox kernel: Dump complete 
Jul 12 04:18:05 mybox kernel: [...]
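
One way to check which module the instruction pointer falls in is to compare it against the load addresses and sizes from the kldstat output above. The following is a minimal sketch, assuming the modules were loaded at the same addresses when the panic happened; the helper names (parse_kldstat, locate) are made up for illustration.

```python
# Module table copied from the kldstat output in this post.
KLDSTAT = """\
 1 32 0xffffffff80200000 17bc6a8 kernel
 2  2 0xffffffff81c11000 114db ipfw.ko
 3  1 0xffffffff81c23000 d32f dummynet.ko
 4  1 0xffffffff81c31000 3831 ng_socket.ko
 5  8 0xffffffff81c35000 ba02 netgraph.ko
 6  1 0xffffffff81c41000 2b99 ng_mppc.ko
 7  1 0xffffffff81c44000 80c rc4.ko
 8  1 0xffffffff81c45000 23dc vmmemctl.ko
 9  1 0xffffffff81c48000 397d ng_l2tp.ko
10  1 0xffffffff81c4c000 4b04 ng_ksocket.ko
11  1 0xffffffff81c51000 17d6 ng_tee.ko
12  1 0xffffffff81c53000 40d2 ng_iface.ko
13  1 0xffffffff81c58000 5829 ng_ppp.ko
14  1 0xffffffff81c5e000 18b1 ng_tcpmss.ko
"""


def parse_kldstat(text):
    """Return a list of (name, base, size); Address and Size are hex."""
    modules = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a 5-column module row.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        _id, _refs, base, size, name = fields
        modules.append((name, int(base, 16), int(size, 16)))
    return modules


def locate(addr, modules):
    """Return (module name, offset into module) or None if no module matches."""
    for name, base, size in modules:
        if base <= addr < base + size:
            return name, addr - base
    return None


modules = parse_kldstat(KLDSTAT)
fault = 0xffffffff81c3828d  # instruction pointer from the panic message
name, offset = locate(fault, modules)
print(f"{fault:#x} is in {name} at offset {offset:#x}")
# prints: 0xffffffff81c3828d is in netgraph.ko at offset 0x328d
```

With these numbers the faulting address lands inside netgraph.ko, which matches the impression that the crash is in that module.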

Donald Baud · Jul 13, 2016

Another crash happened after less than 24 hours of uptime.
This time, I'm removing all the tunings in sysctl.conf and loader.conf (net.graph....)

Could someone please explain to me how to decipher a kernel panic message like this one:

Code:

Jul 12 18:55:22 mybox syslogd: kernel boot file is /boot/kernel/kernel
Jul 12 18:55:22 mybox kernel:
Jul 12 18:55:22 mybox kernel:
Jul 12 18:55:22 mybox kernel: Fatal trap 9: general protection fault while in kernel mode
Jul 12 18:55:22 mybox kernel: cpuid = 0; apic id = 01
Jul 12 18:55:22 mybox kernel: instruction pointer   = 0x20:0xffffffff81c31f40
Jul 12 18:55:22 mybox kernel: stack pointer    = 0x28:0xfffffe017359a510
Jul 12 18:55:22 mybox kernel: frame pointer    = 0x28:0xfffffe017359a560
Jul 12 18:55:22 mybox kernel: code segment     = base 0x0, limit 0xfffff, type 0x1b
Jul 12 18:55:22 mybox kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
Jul 12 18:55:22 mybox kernel: processor eflags   = interrupt enabled, resume, IOPL = 0
Jul 12 18:55:22 mybox kernel: current process     = 656 (mpd5)
Jul 12 18:55:22 mybox kernel: trap number     = 9
Jul 12 18:55:22 mybox kernel: panic: general protection fault
Jul 12 18:55:22 mybox kernel: cpuid = 0
Jul 12 18:55:22 mybox kernel: KDB: stack backtrace:
Jul 12 18:55:22 mybox kernel: #0 0xffffffff8098e390 at kdb_backtrace+0x60
Jul 12 18:55:22 mybox kernel: #1 0xffffffff80951066 at vpanic+0x126
Jul 12 18:55:22 mybox kernel: #2 0xffffffff80950f33 at panic+0x43
Jul 12 18:55:22 mybox kernel: #3 0xffffffff80d55f7b at trap_fatal+0x36b
Jul 12 18:55:22 mybox kernel: #4 0xffffffff80d55bfd at trap+0x77d
Jul 12 18:55:22 mybox kernel: #5 0xffffffff80d3b8d2 at calltrap+0x8
Jul 12 18:55:22 mybox kernel: #6 0xffffffff81c395b6 at ng_add_hook+0x106
Jul 12 18:55:22 mybox kernel: #7 0xffffffff81c390bb at ng_mkpeer+0x3b
Jul 12 18:55:22 mybox kernel: #8 0xffffffff81c37275 at ng_apply_item+0x3c5
Jul 12 18:55:22 mybox kernel: #9 0xffffffff81c36d1a at ng_snd_item+0x38a
Jul 12 18:55:22 mybox kernel: #10 0xffffffff81c319f1 at ngc_send+0x221
Jul 12 18:55:22 mybox kernel: #11 0xffffffff809cc2d6 at sosend_generic+0x476
Jul 12 18:55:22 mybox kernel: #12 0xffffffff809d2635 at kern_sendit+0x245
Jul 12 18:55:22 mybox kernel: #13 0xffffffff809d2959 at sendit+0x129
Jul 12 18:55:22 mybox kernel: #14 0xffffffff809d281d at sys_sendto+0x4d
Jul 12 18:55:22 mybox kernel: #15 0xffffffff80d5694f at amd64_syscall+0x40f
Jul 12 18:55:22 mybox kernel: #16 0xffffffff80d3bbbb at Xfast_syscall+0xfb
Jul 12 18:55:22 mybox kernel: Uptime: 14h36m50s
Jul 12 18:55:22 mybox kernel: Dumping 327 out of 6119 MB:..5%..15%..25%..35%..44%..54%..64%..74%..84%..93%
Jul 12 18:55:22 mybox kernel: Dump complete

I know I should use kgdb as per the FreeBSD Handbook:

Code:

# cd /usr/obj/usr/src/sys/GENERIC
# kgdb kernel.debug /var/crash/vmcore.last

but I'm getting an error from kgdb:

Code:

# kgdb kernel.debug /var/crash/vmcore.last
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Cannot access memory at address 0x4
(kgdb)

Why is it giving this error:

Code:

Cannot access memory at address 0x4

Is it because the kernel modules like netgraph.ko and others are not statically loaded?
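
On the question of how to read such a message: each backtrace line has the form "#N address at symbol+offset", meaning the recorded address lies offset bytes past the start of the named function. Frames #0 to #5 (kdb_backtrace through calltrap) are the panic and trap handling path, so the frames after calltrap show the code path that led to the fault. Below is a rough sketch that pulls those frames out of a few of the log lines above; the function names (parse_frames, frames_after_trap) are made up for illustration.

```python
import re

# A few lines copied from the second panic message above.
LOG = """\
Jul 12 18:55:22 mybox kernel: #4 0xffffffff80d55bfd at trap+0x77d
Jul 12 18:55:22 mybox kernel: #5 0xffffffff80d3b8d2 at calltrap+0x8
Jul 12 18:55:22 mybox kernel: #6 0xffffffff81c395b6 at ng_add_hook+0x106
Jul 12 18:55:22 mybox kernel: #7 0xffffffff81c390bb at ng_mkpeer+0x3b
Jul 12 18:55:22 mybox kernel: #8 0xffffffff81c37275 at ng_apply_item+0x3c5
"""

# "#N 0xADDR at symbol+0xOFFSET"
FRAME = re.compile(r"#(\d+) (0x[0-9a-f]+) at (\w+)\+(0x[0-9a-f]+)")


def parse_frames(text):
    """Return a list of (index, address, symbol, offset) tuples."""
    frames = []
    for match in FRAME.finditer(text):
        index, addr, symbol, offset = match.groups()
        frames.append((int(index), int(addr, 16), symbol, int(offset, 16)))
    return frames


def frames_after_trap(frames):
    """Drop everything up to and including the calltrap frame."""
    for pos, (_index, _addr, symbol, _offset) in enumerate(frames):
        if symbol == "calltrap":
            return frames[pos + 1:]
    return frames


for index, addr, symbol, offset in frames_after_trap(parse_frames(LOG)):
    # address - offset gives the start address of the function.
    print(f"#{index} {symbol} starts at {addr - offset:#x} (+{offset:#x})")
```

Here the first frame after calltrap is ng_add_hook, called from ng_mkpeer, so this panic happened while a new netgraph node was being attached.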

SirDice · Jul 13, 2016

Donald Baud said:
Is it because the kernel modules like netgraph.ko and others are not statically loaded?

I can't help much with the rest but I may be able to answer this. Everything is actually a module. Even the ones that have been 'statically' linked in the kernel. So there shouldn't be a difference if the module is 'built-in' or loaded as a module afterwards.

As for the crashes, I would suggest creating a PR for it. There are very few developers on this board, unfortunately. Creating the PR should make sure the issue gets reported to the right people. The crash appears to be in netgraph, not in the mpd port; mpd just seems to trigger it.

kpa · Jul 13, 2016

The only difference there could be between a loaded module and a statically linked module is the time of initialization. Statically linked modules and those dynamically loaded modules that are loaded from loader.conf(5) are 100% equivalent at kernel startup; all of them get initialized in the same sequence. Outside that group are the kernel modules that are loaded later in the boot sequence from rc(8) scripts, and they are the ones that might behave differently compared to the first group of modules.

Donald Baud · Jul 17, 2016

Just to give a follow-up on this thread.
Here is the latest development in troubleshooting this issue.

So to summarize, we have RELENG-10.3-p5 + mpd5.8 (netgraph) + quagga-ospf,
giving a chronic (daily) crash of the server with 300 L2TP tunnels / 300-800 sessions and about 400-800 Mbit/s.
The crash seems to be caused by a race condition in netgraph.

I've submitted a bug report, or rather updated the latest PR:
Bug 199096 - Kernel panic after some time using mpd (netgraph) and ipfw
Bug 176401 - [netgraph] page fault in netgraph
Bug 154286 - [netgraph] [panic] 8.2-PRERELEASE panic in netgraph
Bug 154091 - [netgraph] [panic] netgraph, unaligned mbuf?
Bug 153497 - [netgraph] netgraph panic due to race conditions

Now as a workaround I am currently trying the following:
/boot/loader.conf

Code:

# this is to boost mpd5 netgraph performance (default was 4096)
net.graph.maxdata=65536
net.graph.maxalloc=65536

# attempting to use multithreading for interrupts
# use as many threads as cpu cores (in this case 2)
net.isr.numthreads=2
net.isr.defaultqlimit=4096
net.isr.bindthreads=1
net.isr.maxthreads=2
net.isr.dispatch=deferred

/etc/sysctl.conf

Code:

# notice I had to boost maxsockbuf, in order to boost maxdgram and recvspace
kern.ipc.nmbclusters=400000
kern.ipc.maxsockbuf=83886080
net.graph.maxdgram=8388608
net.graph.recvspace=8388608
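
The comment above implies that the netgraph socket buffer sizes are bounded by kern.ipc.maxsockbuf. Here is a minimal sketch that checks a sysctl.conf fragment under that assumption. It uses the simple rule "value must not exceed maxsockbuf"; the kernel's effective limit may be somewhat lower, so treat this as a sanity check only. The function names are made up for illustration.

```python
# Settings copied from the /etc/sysctl.conf block above.
SYSCTL_CONF = """\
# notice I had to boost maxsockbuf, in order to boost maxdgram and recvspace
kern.ipc.nmbclusters=400000
kern.ipc.maxsockbuf=83886080
net.graph.maxdgram=8388608
net.graph.recvspace=8388608
"""


def parse_sysctl_conf(text):
    """Parse name=value lines into a dict of ints, ignoring comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = int(value.strip())
    return settings


def oversized_buffers(settings,
                      limit_key="kern.ipc.maxsockbuf",
                      buffer_keys=("net.graph.maxdgram", "net.graph.recvspace")):
    """Return the buffer settings that exceed the assumed limit."""
    limit = settings[limit_key]
    return [key for key in buffer_keys if settings.get(key, 0) > limit]


settings = parse_sysctl_conf(SYSCTL_CONF)
# An empty list means no buffer size exceeds kern.ipc.maxsockbuf.
print(oversized_buffers(settings))
```

With the values above, both buffer sizes are ten times smaller than maxsockbuf, so the list is empty.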

Also, I replaced net/quagga(ospf) with net/bird(ospf) which seems to be much more lightweight.
I am waiting to see how stable this is and will eventually submit an article on the "how-to section" of this forum.

PwrSurge · Oct 26, 2023

Donald Baud said:
Also, I replaced net/quagga(ospf) with net/bird(ospf) which seems to be much more lightweight.
I am waiting to see how stable this is and will eventually submit an article on the "how-to section" of this forum.

I'm also interested in using FreeBSD as an L2TP LNS server alternative to Cisco for PPPoE-based DSL connections. @Donald Baud, were you able to eventually get your system running stable? If so, please post the package versions and configuration files used. A how-to would also be great if you have the time.

SirDice · Oct 27, 2023

PwrSurge said:
@Donald Baud were you able to eventually get your system running stable?

Last seen May 5, 2021
Don't expect an answer anytime soon.

This thread has been dead for almost 7 years. 10.3 has been EoL since 2018.

If you have a specific problem, I suggest creating a new thread and posting the issues you're having. For what it's worth, I have a FreeBSD firewall machine (13.2-RELEASE) for my fiber connection using PPPoE with mpd5. It pushes 900-950 Mbps of throughput, which is close to the max of my fiber connection (1 Gbps).
