Random Crash

OP
Gamecreature

Member

Reaction score: 9
Messages: 34

_martin, thanks for your suggestion.
I changed the optimization to normal and added the extra incoming rule for the webproxy.
(There already was an inbound rule for this, but I used your example, so $webserver_sto and $tcp_state are not used for now.)
 

covacat

Daemon

Reaction score: 537
Messages: 1,101

You can look at the pd argument of the pf_test_rule call.
It should point to a pf_pdesc structure (defined in pfvar.h), from which you can extract the packet details (source, destination, TCP header, etc.).
 
OP
Gamecreature

Member

Reaction score: 9
Messages: 34

Did you get any more crashes yet? I'm curious..
Code:
# uptime
10:49PM  up 4 days, 13:36, 1 user, load averages: 0.91, 1.02, 1.09

Still running. ☺️
Looking good for now.
Once it has run for 7 days I will enable $webserver_sto and $tcp_state again, to check whether it really is the optimization setting.
 
OP
Gamecreature

Member

Reaction score: 9
Messages: 34

Code:
# uptime
 2:31PM  up 9 days,  5:18, 1 user, load averages: 1.08, 1.26, 1.16

Still running, so that looks promising. I will now put $webserver_sto and $tcp_state back on the webproxy rule (without optimization aggressive).
 
OP
Gamecreature

Member

Reaction score: 9
Messages: 34

_martin, no, I am still running with 'optimization normal'. All other options are back to normal.

Code:
# uptime
 4:14PM  up 17 days,  8:01, 1 user, load averages: 0.85, 0.96, 0.92

If you're interested, I can enable aggressive again to test whether this causes the crash.
(At the moment I'm very happy it is still running.)
 
OP
Gamecreature

Member

Reaction score: 9
Messages: 34

Well, I just got a new crash. Optimization aggressive really seems to be causing this.

Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff81065ba6
stack pointer           = 0x28:0xfffffe00841e90c0
frame pointer           = 0x28:0xfffffe00841e90d0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi4: clock (0))
trap number             = 12
panic: page fault
cpuid = 0
time = 1636153153
KDB: stack backtrace:
#0 0xffffffff80c574c5 at kdb_backtrace+0x65
#1 0xffffffff80c09ea1 at vpanic+0x181
#2 0xffffffff80c09d13 at panic+0x43
#3 0xffffffff8108b1b7 at trap_fatal+0x387
#4 0xffffffff8108b20f at trap_pfault+0x4f
#5 0xffffffff8108a86d at trap+0x27d
#6 0xffffffff81061958 at calltrap+0x8
#7 0xffffffff81065ab7 at in_cksum_skip+0x77
#8 0xffffffff82956329 at in4_cksum+0x59
#9 0xffffffff829373d0 at pf_return+0x270
#10 0xffffffff82931351 at pf_test_rule+0x1d71
#11 0xffffffff8292cd11 at pf_test+0x17c1
 

_martin

Daemon

Reaction score: 412
Messages: 1,248

The crash seems to be the same, that's good. The fault happened at the same address. Would you be willing to do this test with only one virtual CPU?
Now is really a good time to open a PR.
I started the VM I created when I saw your thread, and I'm downloading some random torrents in the jail inside that VM. I was not able to trigger the crash before, though. I've increased the number of CPUs from 2 to 4.
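
A quick way to check that two panics are the same crash is to compare the faulting instruction pointer and the symbol+offset frames of the two KDB backtraces. The following is a minimal Python sketch, not something that was used in this thread. The helper names `panic_signature` and `same_crash` are made up for illustration, and the regular expressions assume the panic message layout shown in the post above.

```python
import re

# Matches KDB backtrace lines such as "#7 0xffffffff81065ab7 at in_cksum_skip+0x77".
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-f]+\s+at\s+(\S+)\+0x([0-9a-f]+)$")
# Matches "instruction pointer     = 0x20:0xffffffff81065ba6".
IP_RE = re.compile(r"^instruction pointer\s*=\s*(0x[0-9a-f]+:0x[0-9a-f]+)$")


def panic_signature(text):
    """Return (instruction_pointer, [(symbol, offset), ...]) for one panic message."""
    ip = None
    frames = []
    for line in text.splitlines():
        line = line.strip()
        m = IP_RE.match(line)
        if m:
            ip = m.group(1)
            continue
        m = FRAME_RE.match(line)
        if m:
            frames.append((m.group(2), int(m.group(3), 16)))
    return ip, frames


def same_crash(a, b):
    """True if both panics fault at the same instruction and agree on every
    frame over the depth that both backtraces cover."""
    ip_a, frames_a = panic_signature(a)
    ip_b, frames_b = panic_signature(b)
    depth = min(len(frames_a), len(frames_b))
    return (
        ip_a is not None
        and ip_a == ip_b
        and depth > 0
        and frames_a[:depth] == frames_b[:depth]
    )
```

Usage would be along the lines of `same_crash(open("panic1.txt").read(), open("panic2.txt").read())`, where the two file names are placeholders for saved copies of the panic messages. Only the common backtrace depth is compared, because the console output can be cut off at different frames.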
 
OP
Gamecreature

Member

Reaction score: 9
Messages: 34

_martin, unfortunately this VPS runs at a hosting company (TransIP), so I don't have control over the number of CPUs.
Another strange fact is that another VPS also runs a similar firewall configuration with optimization aggressive, and it doesn't happen there.
It only happens on this production VPS... (Did your VM crash?)
 

_martin

Daemon

Reaction score: 412
Messages: 1,248

Didn't you mention that the other VM is on a slightly older version? For this test we need to have vtnet devices (or at least that was the initial assumption); I'm using VirtualBox as the hypervisor. My VM didn't crash; all torrents were downloaded without a problem several times. My uptime is around 9 hrs. Tests are still running.
Just to be sure: can you confirm you're still on 13.0-RELEASE-p4?
 
OP
Gamecreature

Member

Reaction score: 9
Messages: 34

That's true, the previous VM had an older version. But I just upgraded it (and re-enabled optimization aggressive).
Let's see if it crashes...

Yes, the machine still runs 13.0-RELEASE-p4:
Code:
# freebsd-version -kur
13.0-RELEASE-p4
13.0-RELEASE-p4
13.0-RELEASE-p4
 

_martin

Daemon

Reaction score: 412
Messages: 1,248

Gamecreature and I exchanged a few PMs; he was able to trigger the crash in a VM and simplify the PF config. We did some tests and found that a few things need to be set to trigger the bug. But once those are set, you can crash the system within a second or two. This is not related to the hypervisor (i.e. the network driver) nor to the number of CPUs.

The behavior is very similar to, if not the same as, what is described in the fairly recent PR 254419. There's a link to PR 259645, where the issue is being solved.
 
OP
Gamecreature

Member

Reaction score: 9
Messages: 34

Yesterday I got a similar crash again (with the optimization setting set to normal).
Uptime: 63 days.

Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff81065ba6
stack pointer           = 0x28:0xfffffe00841ed0c0
frame pointer           = 0x28:0xfffffe00841ed0d0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi4: clock (0))
trap number             = 12
panic: page fault
cpuid = 3
time = 1641679917
KDB: stack backtrace:
#0 0xffffffff80c574c5 at kdb_backtrace+0x65
#1 0xffffffff80c09ea1 at vpanic+0x181
#2 0xffffffff80c09d13 at panic+0x43
#3 0xffffffff8108b1b7 at trap_fatal+0x387
#4 0xffffffff8108b20f at trap_pfault+0x4f
#5 0xffffffff8108a86d at trap+0x27d
#6 0xffffffff81061958 at calltrap+0x8
#7 0xffffffff81065ab7 at in_cksum_skip+0x77
#8 0xffffffff82956329 at in4_cksum+0x59
#9 0xffffffff829373d0 at pf_return+0x270
#10 0xffffffff82931351 at pf_test_rule+0x1d71
#11 0xffffffff8292cd11 at pf_test+0x17c1
#12 0xffffffff82945bff at pf_check_out+0x1f
#13 0xffffffff80d42137 at pfil_run_hooks+0x97
#14 0xffffffff80db2f21 at ip_output+0xb61
#15 0xffffffff80dc9664 at tcp_output+0x1b04
#16 0xffffffff80dd80df at tcp_timer_rexmt+0x59f
#17 0xffffffff80c25b0d at softclock_call_cc+0x13d
Uptime: 63d23h10m22s
Dumping 1393 out of 8152 MB:..2%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
warning: Source file is more recent than executable.
55              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c09a96 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80c09f10 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80c09d13 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff8108b1b7 in trap_fatal (frame=0xfffffe00841ed000, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:915
#6  0xffffffff8108b20f in trap_pfault (frame=frame@entry=0xfffffe00841ed000,
    usermode=false, signo=<optimized out>, signo@entry=0x0,
    ucode=<optimized out>, ucode@entry=0x0)
    at /usr/src/sys/amd64/amd64/trap.c:732
#7  0xffffffff8108a86d in trap (frame=0xfffffe00841ed000)
    at /usr/src/sys/amd64/amd64/trap.c:398
#8  <signal handler called>
#9  0xffffffff81065ba6 in in_cksumdata (buf=<optimized out>,
    len=len@entry=612) at /usr/src/sys/amd64/amd64/in_cksum.c:113
#10 0xffffffff81065ab7 in in_cksum_skip (m=0xfffff8002b889800, len=612,
    skip=<optimized out>) at /usr/src/sys/amd64/amd64/in_cksum.c:224
#11 0xffffffff82956329 in in4_cksum (m=0x0, nxt=<optimized out>,
    nxt@entry=6 '\006', off=0, len=<optimized out>)
    at /usr/src/sys/netpfil/pf/in4_cksum.c:117
#12 0xffffffff829373d0 in pf_check_proto_cksum (m=0xfffff800a42e8100,
    off=<optimized out>, len=0, p=6 '\006', af=2 '\002')
    at /usr/src/sys/netpfil/pf/pf.c:5844
#13 pf_return (r=r@entry=0xfffff800356cc000, nr=<optimized out>,
    nr@entry=0xfffff80181f40c00, pd=pd@entry=0xfffffe00841ed6d0,
    sk=<optimized out>, off=<optimized out>, off@entry=20, m=<optimized out>,
    m@entry=0xfffff800a42e8100, th=0xfffffe00841ed7a0,
    kif=0xfffff80028166200, bproto_sum=8593, bip_sum=0, hdrlen=20,
    reason=0xfffffe00841ed55e) at /usr/src/sys/netpfil/pf/pf.c:2654
#14 0xffffffff82931351 in pf_test_rule (rm=rm@entry=0xfffffe00841ed770,
    sm=sm@entry=0xfffffe00841ed788, direction=direction@entry=2,
    kif=kif@entry=0xfffff80028166200, m=m@entry=0xfffff800a42e8100, off=20,
    pd=0xfffffe00841ed6d0, am=0xfffffe00841ed760, rsm=0xfffffe00841ed750,
    inp=0xfffff8000fd8a988) at /usr/src/sys/netpfil/pf/pf.c:3641
#15 0xffffffff8292cd11 in pf_test (dir=<optimized out>, dir@entry=2,
    pflags=<optimized out>, ifp=<optimized out>, m0=<optimized out>,
    m0@entry=0xfffffe00841ed948, inp=0xfffff8000fd8a988)
    at /usr/src/sys/netpfil/pf/pf.c:6005
#16 0xffffffff82945bff in pf_check_out (m=0xfffffe00841ed948, ifp=0x0,
    flags=612, ruleset=<optimized out>, inp=0x0)
    at /usr/src/sys/netpfil/pf/pf_ioctl.c:4516
#17 0xffffffff80d42137 in pfil_run_hooks (head=<optimized out>, p=...,
    ifp=0xfffff80003687800, flags=flags@entry=131072,
    inp=inp@entry=0xfffff8000fd8a988) at /usr/src/sys/net/pfil.c:187
#18 0xffffffff80db2f21 in ip_output_pfil (mp=0xfffffe00841ed948,
    ifp=0xfffff80003687800, flags=0, inp=0xfffff8000fd8a988,
    dst=0xfffff8000fd8ab30, fibnum=<optimized out>, error=<optimized out>)
    at /usr/src/sys/netinet/ip_output.c:130
#19 ip_output (m=m@entry=0xfffff800a42e8100, opt=<optimized out>,
    ro=<optimized out>, flags=0, imo=imo@entry=0x0, inp=<optimized out>)
    at /usr/src/sys/netinet/ip_output.c:705
#20 0xffffffff80dc9664 in tcp_output (tp=0xfffffe01226e34d8)
    at /usr/src/sys/netinet/tcp_output.c:1492
#21 0xffffffff80dd80df in tcp_timer_rexmt (xtp=0xfffffe01226e34d8)
    at /usr/src/sys/netinet/tcp_timer.c:879
#22 0xffffffff80c25b0d in softclock_call_cc (c=0xfffffe01226e3760,
    cc=cc@entry=0xffffffff81ca8200 <cc_cpu>, direct=direct@entry=0)
    at /usr/src/sys/kern/kern_timeout.c:696
#23 0xffffffff80c25f99 in softclock (arg=0xffffffff81ca8200 <cc_cpu>)
    at /usr/src/sys/kern/kern_timeout.c:816
#24 0xffffffff80bcafdd in intr_event_execute_handlers (p=<optimized out>,
    ie=0xfffff8000364c700) at /usr/src/sys/kern/kern_intr.c:1168
#25 ithread_execute_handlers (p=<optimized out>, ie=0xfffff8000364c700)
    at /usr/src/sys/kern/kern_intr.c:1181
#26 ithread_loop (arg=arg@entry=0xfffff80003650dc0)
    at /usr/src/sys/kern/kern_intr.c:1269
#27 0xffffffff80bc7dde in fork_exit (
    callout=0xffffffff80bcad90 <ithread_loop>, arg=0xfffff80003650dc0,
    frame=0xfffffe00841edd40) at /usr/src/sys/kern/kern_fork.c:1069
#28 <signal handler called>
 

_martin

Daemon

Reaction score: 412
Messages: 1,248

Frame 9 is where the issue occurred; most likely buf is NULL (same as we experienced before). Gamecreature, please update PR 254419 and restate the current setup (OS version, confirm generic kernel, etc.). I'm assuming you had kern.ipc.mb_use_ext_pgs set to 0 during the crash, correct? Please state this in the PR too.
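
To illustrate why a NULL buf would match "fault virtual address = 0x0": in_cksumdata reads len bytes starting at buf to build the Internet checksum, so with buf NULL and a non-zero len (612 in frame 9) the read would start at address 0. The following is a userland Python sketch of the checksum arithmetic only, assuming the standard 16-bit ones'-complement sum from RFC 1071. It is not the kernel implementation, and the function name `internet_checksum` is hypothetical.

```python
def internet_checksum(buf, length):
    """16-bit ones'-complement checksum over the first `length` bytes of `buf`.

    `buf` may be None only when `length` is 0; this mirrors the kernel case
    where a NULL buffer with a non-zero length cannot be read.
    """
    if length > 0 and buf is None:
        raise ValueError("no buffer for a non-zero length")
    data = bytes(buf[:length]) if length > 0 else b""
    if len(data) != length:
        raise ValueError("buffer is shorter than the requested length")
    if len(data) % 2:
        data += b"\x00"  # pad an odd-length buffer with one zero byte

    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF


# Worked example bytes from RFC 1071: the folded sum is 0xddf2,
# so the checksum (its complement) is 0x220d.
print(hex(internet_checksum(bytes.fromhex("0001f203f4f5f6f7"), 8)))
```

In Python the missing buffer shows up as an exception; in the kernel there is no such check on this path, which is why the same situation ends in a page fault.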
 
OP
Gamecreature

Member

Reaction score: 9
Messages: 34

Oh, I see this option is on:

Code:
root@core2:~ # sysctl kern.ipc.mb_use_ext_pgs
kern.ipc.mb_use_ext_pgs: 1
 