Random Crash

Gamecreature · Oct 18, 2021

_martin, thanks for your suggestion.
I changed the optimization to normal and added the extra incoming rule for the webproxy.
(There already was an inbound rule for this, but I used your example, so $webserver_sto and $tcp_state are not used for now.)

covacat · Oct 18, 2021

you can look at the pd argument of the pf_test_rule call
it should point to a pf_pdesc structure (defined in pfvar.h) and you can extract the packet details (source, dest, tcp header, etc)
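
A rough sketch of that inspection, assuming a crash dump saved by savecore under /var/crash and kgdb installed. The default frame number 14 is only a placeholder (it matches the pf_test_rule frame in the kgdb backtrace posted later in this thread), the file name pf_pdesc.gdb is made up for illustration, and the field names (src, dst, hdr.tcp) are as recalled from pfvar.h on 13.0, so check them against your own source tree:

```shell
#!/bin/sh
# Sketch: write a gdb command file that selects the pf_test_rule frame and
# prints the packet descriptor. Paths and frame number are assumptions.

kernel="${KERNEL:-/boot/kernel/kernel}"
vmcore="${VMCORE:-/var/crash/vmcore.last}"
frame="${FRAME:-14}"    # take the real number from your own "bt" output
cmds="${TMPDIR:-/tmp}/pf_pdesc.gdb"

# Unquoted heredoc so that $frame is expanded into the command file.
cat > "$cmds" <<EOF
frame $frame
print *pd
print *pd->src
print *pd->dst
print *pd->hdr.tcp
EOF

# Only print the next step; nothing is run against the dump here.
echo "kgdb $kernel $vmcore   # then at the (kgdb) prompt: source $cmds"
```

If pd shows up as optimized out in that frame, the same pointer may still be visible as an argument in a neighbouring frame such as pf_return.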

_martin · Oct 22, 2021

Did you get any more crashes yet? I'm curious..

Gamecreature · Oct 22, 2021

_martin said:
Did you get any more crashes yet? I'm curious..

Code:

# uptime
10:49PM  up 4 days, 13:36, 1 user, load averages: 0.91, 1.02, 1.09

Still running.

Looking good for now.
Once it has run for 7 days I will enable sto and tcp state again, to check whether it really is the optimization rule.

_martin · Oct 22, 2021

Ok, good. Please keep us posted.

Thanks.

Gamecreature · Oct 27, 2021

Code:

# uptime
 2:31PM  up 9 days,  5:18, 1 user, load averages: 1.08, 1.26, 1.16

Still running, so that looks promising. I will now put $webserver_sto and $tcp_state back into the webproxy rule (without the optimize aggressive).

_martin · Nov 4, 2021

Did you apply the optimization yet ?

Gamecreature · Nov 4, 2021

_martin no, I am now still running with 'optimization normal'. All other options are back to normal.

Code:

# uptime
 4:14PM  up 17 days,  8:01, 1 user, load averages: 0.85, 0.96, 0.92

If you're interested, I can enable aggressive again to test if this causes the crash.
(at the moment I'm very happy it is still running)

_martin · Nov 4, 2021

Please do. Thanks.

Gamecreature · Nov 4, 2021

_martin said:
Please do. Thanks.

Just changed this! And now we wait

Gamecreature · Nov 6, 2021

Well, I just got a new crash. Optimization aggressive really seems to be causing this.

Code:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff81065ba6
stack pointer           = 0x28:0xfffffe00841e90c0
frame pointer           = 0x28:0xfffffe00841e90d0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi4: clock (0))
trap number             = 12
panic: page fault
cpuid = 0
time = 1636153153
KDB: stack backtrace:
#0 0xffffffff80c574c5 at kdb_backtrace+0x65
#1 0xffffffff80c09ea1 at vpanic+0x181
#2 0xffffffff80c09d13 at panic+0x43
#3 0xffffffff8108b1b7 at trap_fatal+0x387
#4 0xffffffff8108b20f at trap_pfault+0x4f
#5 0xffffffff8108a86d at trap+0x27d
#6 0xffffffff81061958 at calltrap+0x8
#7 0xffffffff81065ab7 at in_cksum_skip+0x77
#8 0xffffffff82956329 at in4_cksum+0x59
#9 0xffffffff829373d0 at pf_return+0x270
#10 0xffffffff82931351 at pf_test_rule+0x1d71
#11 0xffffffff8292cd11 at pf_test+0x17c1

_martin · Nov 6, 2021

Crash seems to be the same, that's good. Fault happened on the same address. Would you be willing to do this test with only one virtual CPU ?
Now is really a good time to open a PR.
I started the VM I created when I saw your thread, I'm downloading some random torrents in the jail inside that VM. I was not able to trigger the crash before though. I've increased the CPU amount from 2 to 4.

Gamecreature · Nov 6, 2021

_martin, unfortunately this VPS runs at a hosting company (TransIP), so I don't have control over the number of CPUs.
Another strange fact is that another VPS also runs a similar firewall configuration with optimization aggressive and it doesn't happen there.
It only happens on this production VPS... (Did your VM crash?)

_martin · Nov 6, 2021

Didn't you mention that the other VM is on a slightly older version? Since for this test we need vtnet devices (or at least that was the initial assumption), I'm using VirtualBox as the hypervisor. My VM didn't crash; all torrents were downloaded without a problem several times. My uptime is around 9 hrs. Tests are still running.
Just to be sure - can you confirm you're still on 13.0-RELEASE-p4 ?

Gamecreature · Nov 6, 2021

That's true, the previous VM had an older version. But I just upgraded it (and re-enabled optimization aggressive).
Let's see if it crashes...

Yes, the machine still runs 13.0-RELEASE-p4

Code:

# freebsd-version -kur
13.0-RELEASE-p4
13.0-RELEASE-p4
13.0-RELEASE-p4

_martin · Nov 9, 2021

Gamecreature and I exchanged a few PMs; he was able to trigger the crash in a VM and simplify the PF config. We did some tests and found that a few things need to be set to trigger the bug, but once those are set you can crash the system within a second or two. This is not related to the hypervisor (i.e. the network driver) nor to the number of CPUs.

The behavior is very similar, if not identical, to what is described in the fairly recent PR 254419. There's a link to PR 259645, where the issue is being solved.

Cath O'Deray · Nov 10, 2021

Thanks,

_martin said:
PR 254419. … PR 259645

Respectively:

Fatal trap 12: page fault while in kernel mode, nginx + sendfile on (Closed FIXED)
crash in_cksumdata (sys/amd64/amd64/in_cksum.c:113) via in4_cksum (sys/netpfil/pf/in4_cksum.c:117) after FreeBSD 13.0 p5 update (Open)

Gamecreature · Jan 9, 2022

Yesterday I got a similar crash again (with the optimization setting set to normal).
Uptime 63 days..

Code:

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff81065ba6
stack pointer           = 0x28:0xfffffe00841ed0c0
frame pointer           = 0x28:0xfffffe00841ed0d0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi4: clock (0))
trap number             = 12
panic: page fault
cpuid = 3
time = 1641679917
KDB: stack backtrace:
#0 0xffffffff80c574c5 at kdb_backtrace+0x65
#1 0xffffffff80c09ea1 at vpanic+0x181
#2 0xffffffff80c09d13 at panic+0x43
#3 0xffffffff8108b1b7 at trap_fatal+0x387
#4 0xffffffff8108b20f at trap_pfault+0x4f
#5 0xffffffff8108a86d at trap+0x27d
#6 0xffffffff81061958 at calltrap+0x8
#7 0xffffffff81065ab7 at in_cksum_skip+0x77
#8 0xffffffff82956329 at in4_cksum+0x59
#9 0xffffffff829373d0 at pf_return+0x270
#10 0xffffffff82931351 at pf_test_rule+0x1d71
#11 0xffffffff8292cd11 at pf_test+0x17c1
#12 0xffffffff82945bff at pf_check_out+0x1f
#13 0xffffffff80d42137 at pfil_run_hooks+0x97
#14 0xffffffff80db2f21 at ip_output+0xb61
#15 0xffffffff80dc9664 at tcp_output+0x1b04
#16 0xffffffff80dd80df at tcp_timer_rexmt+0x59f
#17 0xffffffff80c25b0d at softclock_call_cc+0x13d
Uptime: 63d23h10m22s
Dumping 1393 out of 8152 MB:..2%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
warning: Source file is more recent than executable.
55              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c09a96 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80c09f10 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80c09d13 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff8108b1b7 in trap_fatal (frame=0xfffffe00841ed000, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:915
#6  0xffffffff8108b20f in trap_pfault (frame=frame@entry=0xfffffe00841ed000,
    usermode=false, signo=<optimized out>, signo@entry=0x0,
    ucode=<optimized out>, ucode@entry=0x0)
    at /usr/src/sys/amd64/amd64/trap.c:732
#7  0xffffffff8108a86d in trap (frame=0xfffffe00841ed000)
    at /usr/src/sys/amd64/amd64/trap.c:398
#8  <signal handler called>
#9  0xffffffff81065ba6 in in_cksumdata (buf=<optimized out>,
    len=len@entry=612) at /usr/src/sys/amd64/amd64/in_cksum.c:113
#10 0xffffffff81065ab7 in in_cksum_skip (m=0xfffff8002b889800, len=612,
    skip=<optimized out>) at /usr/src/sys/amd64/amd64/in_cksum.c:224
#11 0xffffffff82956329 in in4_cksum (m=0x0, nxt=<optimized out>,
    nxt@entry=6 '\006', off=0, len=<optimized out>)
    at /usr/src/sys/netpfil/pf/in4_cksum.c:117
#12 0xffffffff829373d0 in pf_check_proto_cksum (m=0xfffff800a42e8100,
    off=<optimized out>, len=0, p=6 '\006', af=2 '\002')
    at /usr/src/sys/netpfil/pf/pf.c:5844
#13 pf_return (r=r@entry=0xfffff800356cc000, nr=<optimized out>,
    nr@entry=0xfffff80181f40c00, pd=pd@entry=0xfffffe00841ed6d0,
    sk=<optimized out>, off=<optimized out>, off@entry=20, m=<optimized out>,
    m@entry=0xfffff800a42e8100, th=0xfffffe00841ed7a0,
    kif=0xfffff80028166200, bproto_sum=8593, bip_sum=0, hdrlen=20,
    reason=0xfffffe00841ed55e) at /usr/src/sys/netpfil/pf/pf.c:2654
#14 0xffffffff82931351 in pf_test_rule (rm=rm@entry=0xfffffe00841ed770,
    sm=sm@entry=0xfffffe00841ed788, direction=direction@entry=2,
    kif=kif@entry=0xfffff80028166200, m=m@entry=0xfffff800a42e8100, off=20,
    pd=0xfffffe00841ed6d0, am=0xfffffe00841ed760, rsm=0xfffffe00841ed750,
    inp=0xfffff8000fd8a988) at /usr/src/sys/netpfil/pf/pf.c:3641
#15 0xffffffff8292cd11 in pf_test (dir=<optimized out>, dir@entry=2,
    pflags=<optimized out>, ifp=<optimized out>, m0=<optimized out>,
    m0@entry=0xfffffe00841ed948, inp=0xfffff8000fd8a988)
    at /usr/src/sys/netpfil/pf/pf.c:6005
#16 0xffffffff82945bff in pf_check_out (m=0xfffffe00841ed948, ifp=0x0,
    flags=612, ruleset=<optimized out>, inp=0x0)
    at /usr/src/sys/netpfil/pf/pf_ioctl.c:4516
#17 0xffffffff80d42137 in pfil_run_hooks (head=<optimized out>, p=...,
    ifp=0xfffff80003687800, flags=flags@entry=131072,
    inp=inp@entry=0xfffff8000fd8a988) at /usr/src/sys/net/pfil.c:187
#18 0xffffffff80db2f21 in ip_output_pfil (mp=0xfffffe00841ed948,
    ifp=0xfffff80003687800, flags=0, inp=0xfffff8000fd8a988,
    dst=0xfffff8000fd8ab30, fibnum=<optimized out>, error=<optimized out>)
    at /usr/src/sys/netinet/ip_output.c:130
#19 ip_output (m=m@entry=0xfffff800a42e8100, opt=<optimized out>,
    ro=<optimized out>, flags=0, imo=imo@entry=0x0, inp=<optimized out>)
    at /usr/src/sys/netinet/ip_output.c:705
#20 0xffffffff80dc9664 in tcp_output (tp=0xfffffe01226e34d8)
    at /usr/src/sys/netinet/tcp_output.c:1492
#21 0xffffffff80dd80df in tcp_timer_rexmt (xtp=0xfffffe01226e34d8)
    at /usr/src/sys/netinet/tcp_timer.c:879
#22 0xffffffff80c25b0d in softclock_call_cc (c=0xfffffe01226e3760,
    cc=cc@entry=0xffffffff81ca8200 <cc_cpu>, direct=direct@entry=0)
    at /usr/src/sys/kern/kern_timeout.c:696
#23 0xffffffff80c25f99 in softclock (arg=0xffffffff81ca8200 <cc_cpu>)
    at /usr/src/sys/kern/kern_timeout.c:816
#24 0xffffffff80bcafdd in intr_event_execute_handlers (p=<optimized out>,
    ie=0xfffff8000364c700) at /usr/src/sys/kern/kern_intr.c:1168
#25 ithread_execute_handlers (p=<optimized out>, ie=0xfffff8000364c700)
    at /usr/src/sys/kern/kern_intr.c:1181
#26 ithread_loop (arg=arg@entry=0xfffff80003650dc0)
    at /usr/src/sys/kern/kern_intr.c:1269
#27 0xffffffff80bc7dde in fork_exit (
    callout=0xffffffff80bcad90 <ithread_loop>, arg=0xfffff80003650dc0,
    frame=0xfffffe00841edd40) at /usr/src/sys/kern/kern_fork.c:1069
#28 <signal handler called>

_martin · Jan 9, 2022

Frame 9 is where the issue occurred; most likely buf is NULL (the same as we saw before). Gamecreature, please update PR 254419 and restate the current setup (OS version, confirm generic kernel, etc.). I'm assuming you had kern.ipc.mb_use_ext_pgs set to 0 during the crash, correct? Please state this in the PR too.

Gamecreature · Jan 9, 2022

Oh I see this option is on:

Code:

root@core2:~ # sysctl kern.ipc.mb_use_ext_pgs
kern.ipc.mb_use_ext_pgs: 1

_martin · Jan 9, 2022

Oh well, it's expected behavior for the time being. As a workaround you should set it to 0 as Mark mentioned in the PR.
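
A minimal sketch of that workaround, meant to be run as root on the affected host. It assumes the OID can be changed at runtime and that /etc/sysctl.conf is where you keep persistent settings; the helper name persist_line and the SYSCTL_CONF override are made up for illustration:

```shell
#!/bin/sh
# Sketch: switch off unmapped (ext_pgs) mbufs now and keep it off after reboot.

oid="kern.ipc.mb_use_ext_pgs"
conf="${SYSCTL_CONF:-/etc/sysctl.conf}"   # override only for a dry run

# Print the line that makes the setting persistent.
persist_line() {
    printf '%s=0\n' "$oid"
}

# Touch the system only where the OID actually exists.
if sysctl -n "$oid" >/dev/null 2>&1; then
    sysctl "$oid=0"
    # Append only if the file has no line for this OID yet; an existing
    # line (for example one setting it to 1) has to be edited by hand.
    grep -q "^$oid=" "$conf" 2>/dev/null || persist_line >> "$conf"
else
    echo "$oid not available here; nothing changed"
fi
```

Afterwards, sysctl kern.ipc.mb_use_ext_pgs should report 0, the opposite of the value shown in the post above.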

Gamecreature · Jan 10, 2022

_martin Thank you. I will disable this option until it's fixed.