Random Crash

Gamecreature · Oct 10, 2021

Since updating FreeBSD from 12.2 to 13.0, I get the following crash every few days.

Code:

myserver.com dumped core - see /var/crash/vmcore.9

Sun Oct 10 11:49:14 CEST 2021

FreeBSD myserver.com 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #0: Tue Aug 24 07:33:27 UTC 2021     root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64

panic: page fault

GNU gdb (GDB) 10.2 [GDB v10.2 for FreeBSD]
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd13.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:


Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff81065ba6
stack pointer           = 0x28:0xfffffe00841ed0c0
frame pointer           = 0x28:0xfffffe00841ed0d0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi4: clock (0))
trap number             = 12
panic: page fault
cpuid = 3
time = 1633859213
KDB: stack backtrace:
#0 0xffffffff80c574c5 at kdb_backtrace+0x65
#1 0xffffffff80c09ea1 at vpanic+0x181
#2 0xffffffff80c09d13 at panic+0x43
#3 0xffffffff8108b1b7 at trap_fatal+0x387
#4 0xffffffff8108b20f at trap_pfault+0x4f
#5 0xffffffff8108a86d at trap+0x27d
#6 0xffffffff81061958 at calltrap+0x8
#7 0xffffffff81065ab7 at in_cksum_skip+0x77
#8 0xffffffff82956329 at in4_cksum+0x59
#9 0xffffffff829373d0 at pf_return+0x270
#10 0xffffffff82931351 at pf_test_rule+0x1d71
#11 0xffffffff8292cd11 at pf_test+0x17c1
#12 0xffffffff82945bff at pf_check_out+0x1f
#13 0xffffffff80d42137 at pfil_run_hooks+0x97
#14 0xffffffff80db2f21 at ip_output+0xb61
#15 0xffffffff80dc9664 at tcp_output+0x1b04
#16 0xffffffff80dd80df at tcp_timer_rexmt+0x59f
#17 0xffffffff80c25b0d at softclock_call_cc+0x13d
Uptime: 5d8h2m12s
Dumping 1365 out of 8152 MB:..2%..11%..22%..31%..42%..51%..61%..71%..81%..91%

Every crash is the same (always pid 12, with a stack trace that seems to be related to the PF firewall).
It's tricky to debug, because it's an essential live server.
I don't have a clue why or how this can happen.

I've got another server running (with almost the same configuration), but it doesn't happen on the other server.
They are both running in the same vps-hosting environment, so I don't think it is faulty hardware.

I hope somebody can help me find a solution for this problem.
Thanks!

Cath O'Deray · Oct 10, 2021

UFS or ZFS?

Is a debug kernel available?

Gamecreature · Oct 10, 2021

ZFS is used. I don’t think I have a debug kernel. I’m using the GENERIC binary kernel. (Fetched via freebsd-update).

Cath O'Deray · Oct 10, 2021

Thanks. Scrubbed recently?

Gamecreature · Oct 10, 2021

I see I didn’t. Last time was 28 July.
Just started a new scrub… I will post an update when done. Thanks!

Argentum · Oct 10, 2021

Gamecreature said:
ZFS is used. I don’t think I have a debug kernel. I’m using the GENERIC binary kernel. (Fetched via freebsd-update).

Try to build your own kernel and see if the issue persists.

Gamecreature · Oct 10, 2021

Scrub just completed without any errors.

# zpool status
pool: zroot
state: ONLINE
scan: scrub repaired 0B in 01:50:37 with 0 errors on Sun Oct 10 21:48:30 2021

Cath O'Deray · Oct 11, 2021

Packages from quarterly, or latest?

pkg -vv | grep url

Gamecreature · Oct 11, 2021

grahamperrin Quarterly it seems:

Code:

# pkg -vv | grep url
    url             : "pkg+http://pkg.FreeBSD.org/FreeBSD:13:amd64/quarterly"

Argentum,

It's a live production server, which I would like to keep updating with freebsd-update. Experimenting with a custom kernel on a live server doesn't feel very comfortable. (Resources are pretty scarce on this VPS.)

Another note: several jails are running, connected to the public IPs (IPv4 and IPv6) via pf rdr/nat rules. (I'm using the iocage jail manager.)

Argentum · Oct 11, 2021

Gamecreature said:
Argentum,

It's a live production server, which I would like to keep updating with freebsd-update. Experimenting with a custom kernel on a live server doesn't feel very comfortable. (Resources are pretty scarce on this VPS.)

Another note: several jails are running, connected to the public IPs (IPv4 and IPv6) via pf rdr/nat rules. (I'm using the iocage jail manager.)

This is a matter of taste, and I do not want to start a dispute, but I have always upgraded live servers from source, compiling world and kernel. I think this is a safer option than a binary upgrade. Consider this: the fact that it builds without errors on your machine gives some extra confidence. Multiple kernels can be installed side by side, and if you have ZFS, you can also take snapshots before the update and use bectl.

I would say that upgrading from source is safer than a binary upgrade.

mark_j · Oct 11, 2021

Maybe try turning off receive checksumming? ifconfig ie0 -rxcsum

(Or whatever your interface is called)

Gamecreature · Oct 11, 2021

markj Thanks! This is an interesting option. I changed it with `ifconfig vtnet0 -rxcsum`. Before and after:

Code:

vtnet0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=4c07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,LINKSTATE,TXCSUM_IPV6>

Code:

vtnet0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=4c03ba<TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,LINKSTATE,TXCSUM_IPV6>
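
As a side note, the two `options=` masks above can be compared bit by bit. The following is a minimal Python sketch that uses only the hex values and flag names from the two ifconfig outputs; the helper name `parse_options` is made up for illustration.

```python
import re

# The two "options=" lines copied from the ifconfig output above.
BEFORE = "options=4c07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,LINKSTATE,TXCSUM_IPV6>"
AFTER = "options=4c03ba<TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,LINKSTATE,TXCSUM_IPV6>"


def parse_options(line):
    """Split an ifconfig options line into (bitmask, set of flag names)."""
    match = re.search(r"options=([0-9a-f]+)<([^>]*)>", line)
    if match is None:
        raise ValueError("not an ifconfig options line: %r" % line)
    return int(match.group(1), 16), set(match.group(2).split(","))


mask_before, names_before = parse_options(BEFORE)
mask_after, names_after = parse_options(AFTER)

# XOR leaves only the bits that differ between the two masks.
changed_bits = mask_before ^ mask_after
removed = sorted(names_before - names_after)
added = sorted(names_after - names_before)

print("changed bits: 0x%x" % changed_bits)
print("removed:", removed)
print("added:", added)
```

Two bits differ, and the names that disappeared are RXCSUM and LRO, so on this vtnet0 interface `-rxcsum` apparently switched off LRO as well.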

Now I have to wait, I guess. (A crash happens roughly once every 1-7 days; it varies.)

Btw, the VPS provider also has a predefined setting (which looks like it's related) in sysctl.conf:

Code:

# An issue in the current virtio drivers for FreeBSD (used for IO in the virtualized environment),
# specifically with TCP segment offloading (TSO), results in very poor network performance.
# FOR PROPER FUNCTIONING OF YOUR NETWORK CONNECTION, DO NOT REMOVE LINE BELOW
net.inet.tcp.tso=0

mark_j · Oct 11, 2021

Sure, but poor network performance shouldn't equal a crash.

Are you saving the cores from these crashes, i.e. are crash dumps enabled? It is worth your while reading this.

When it crashes, take the instruction pointer address (0xffffffff81065ba6 in the example above) and pass it to addr2line:

addr2line -fia -e /usr/lib/debug/boot/kernel/kernel.debug 0xffffffff81065ba6

This will show the source location of the faulting code as it pertains to your kernel.

If we, the forum community, can't help, then having the vmcore etc. would help with debugging if you decide to raise a PR for it.
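
The addr2line step above can be scripted for repeated crashes. Here is a minimal Python sketch, assuming the panic message has the same "selector:address" layout as in the first post; the function names are hypothetical, and the script only builds the command line without running it.

```python
import re
import shlex

# Panic line copied from the first post.
PANIC_LINE = "instruction pointer     = 0x20:0xffffffff81065ba6"
# Debug symbol path as used in the addr2line command above.
DEBUG_KERNEL = "/usr/lib/debug/boot/kernel/kernel.debug"


def faulting_address(line):
    """Return the address part of a 'selector:address' instruction pointer line."""
    match = re.search(r"instruction pointer\s*=\s*0x[0-9a-f]+:(0x[0-9a-f]+)", line)
    if match is None:
        raise ValueError("no instruction pointer in: %r" % line)
    return match.group(1)


def addr2line_command(address, debug_kernel=DEBUG_KERNEL):
    """Build the same addr2line invocation as above for a given address."""
    return ["addr2line", "-fia", "-e", debug_kernel, address]


address = faulting_address(PANIC_LINE)
print(shlex.join(addr2line_command(address)))
```

The printed command can then be pasted into a root shell on the machine that has the matching kernel.debug installed.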

Gamecreature · Oct 11, 2021

"network performance shouldn't equal a crash": Absolutely true.

I save the vmcores (well, at least the last 10 of them).
I tried your example, but I get line 0. (I don't know whether I have the correct/latest GENERIC debug info installed.)

Code:

root@server:/var/crash # addr2line -fia -e /usr/lib/debug/boot/kernel/kernel.debug 0xffffffff81065ba6
0xffffffff81065ba6
in_cksumdata
/usr/src/sys/amd64/amd64/in_cksum.c:0

The core.txt file, though, is very detailed and tells me it's line 113:

#9 0xffffffff81065ba6 in in_cksumdata (buf=<optimized out>,
len=len@entry=612) at /usr/src/sys/amd64/amd64/in_cksum.c:113

Code:

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55      /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c09a96 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80c09f10 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80c09d13 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff8108b1b7 in trap_fatal (frame=0xfffffe00841ed000, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:915
#6  0xffffffff8108b20f in trap_pfault (frame=frame@entry=0xfffffe00841ed000,
    usermode=false, signo=<optimized out>, signo@entry=0x0,
    ucode=<optimized out>, ucode@entry=0x0)
    at /usr/src/sys/amd64/amd64/trap.c:732
#7  0xffffffff8108a86d in trap (frame=0xfffffe00841ed000)
    at /usr/src/sys/amd64/amd64/trap.c:398
#8  <signal handler called>
#9  0xffffffff81065ba6 in in_cksumdata (buf=<optimized out>,
    len=len@entry=612) at /usr/src/sys/amd64/amd64/in_cksum.c:113
#10 0xffffffff81065ab7 in in_cksum_skip (m=0xfffff80194667400, len=612,
    skip=<optimized out>) at /usr/src/sys/amd64/amd64/in_cksum.c:224
#11 0xffffffff82956329 in in4_cksum (m=0x0, nxt=<optimized out>,
    nxt@entry=6 '\006', off=0, len=<optimized out>)
    at /usr/src/sys/netpfil/pf/in4_cksum.c:117
#12 0xffffffff829373d0 in pf_check_proto_cksum (m=0xfffff8022eab6300,
    off=<optimized out>, len=0, p=6 '\006', af=2 '\002')
    at /usr/src/sys/netpfil/pf/pf.c:5844
#13 pf_return (r=r@entry=0xfffff8002b7dd800, nr=<optimized out>,
    nr@entry=0xfffff80182e3d400, pd=pd@entry=0xfffffe00841ed6d0,
    sk=<optimized out>, off=<optimized out>, off@entry=20, m=<optimized out>,
    m@entry=0xfffff8022eab6300, th=0xfffffe00841ed7a0,
    kif=0xfffff80015329d00, bproto_sum=62438, bip_sum=0, hdrlen=20,
    reason=0xfffffe00841ed55e) at /usr/src/sys/netpfil/pf/pf.c:2654

For now I will wait and see if the crash happens again. (I disabled the rxcsum option).

Argentum · Oct 11, 2021

You probably cannot disable pf on your server, but in the source (if you have the same latest 13.0), that line (pf.c:2654) is:

Code:

                if (pf_check_proto_cksum(m, off, len, IPPROTO_TCP, af))
                        REASON_SET(reason, PFRES_PROTCKSUM);
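
For context: according to the backtrace, `pf_check_proto_cksum()` ends up in `in4_cksum()` and `in_cksumdata()`, which compute the 16-bit ones'-complement Internet checksum. Below is a minimal Python sketch of that checksum over a flat byte buffer. It is not the kernel's mbuf-chain implementation, and the function name is made up for illustration.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum over data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Sample bytes from the numerical example in RFC 1071.
sample = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
checksum = internet_checksum(sample)
print(hex(checksum))

# A receiver sums the data plus the checksum; a valid packet yields 0.
verified = internet_checksum(sample + checksum.to_bytes(2, "big"))
print(verified)
```

The kernel version walks the packet data the same way, which is why a bad buffer pointer shows up as a page fault inside the checksum routine.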

Gamecreature · Oct 11, 2021

Argentum The firewall is indeed very essential, for directing the traffic to the correct jails

mark_j · Oct 11, 2021

Hmmm, might be: https://cgit.freebsd.org/src/commit/?id=fa6d101e5f67246a6804577a9532676eae64c049

What are your system versions? Just in case, post the output of freebsd-version -kr and uname too (from the host, not the jail(s)).

Gamecreature · Oct 11, 2021

Interesting! Thanks..

Code:

# freebsd-version -kru
13.0-RELEASE-p4
13.0-RELEASE-p4
13.0-RELEASE-p4

Code:

# uname -a
FreeBSD my.server.com 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #0: Tue Aug 24 07:33:27 UTC 2021     root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64

_martin · Oct 11, 2021

You said you have the same setup on the other server but you are not able to reproduce it there, correct? These are physical servers, aren't they? Are you using a built-in NIC, or do you have an external NIC in this setup? If external, is it possible for you to do the test with the NICs swapped? I.e. take the NIC from the original crashing server and put it into the server where you are not able to reproduce the crash. Toggle back the original settings (if you changed something in the meantime) and observe the situation.

SirDice · Oct 11, 2021

_martin said:
These are physical servers, aren't they?

vtnet(4) interfaces imply they're VMs.

Gamecreature · Oct 11, 2021

_martin said:
You said you have the same setup on the other server but you are not able to reproduce it there, correct? These are physical servers, aren't they? Are you using a built-in NIC, or do you have an external NIC in this setup? If external, is it possible for you to do the test with the NICs swapped? I.e. take the NIC from the original crashing server and put it into the server where you are not able to reproduce the crash. Toggle back the original settings (if you changed something in the meantime) and observe the situation.

They are VMs, so I cannot swap the NICs ;-)
But thanks for the suggestion

_martin · Oct 11, 2021

Oops, I stand corrected then. I was focusing on the fact that you can't reproduce the error on the other machine. Are you able to control where those VMs are running? Are they both running on the same host when the crash occurs?

Gamecreature · Oct 11, 2021

_martin said:
Oops, I stand corrected then. I was focusing on the fact that you can't reproduce the error on the other machine. Are you able to control where those VMs are running? Are they both running on the same host when the crash occurs?

I don't have any control over that. In theory they run the same tech stack, but I cannot guarantee that. The fact is that they are running in two different data centers. (The provider is transip.nl / bladeVPS.)

I see my earlier statement was not correct: the server without crashes is running a slightly older version (FreeBSD 13.0-RELEASE-p1). I haven't updated that server since I got into trouble with the other one.

SirDice · Oct 11, 2021

I've been running 13.0-RELEASE-p4 on a TransIP VPS for some time now, also with the vtnet(4) and PF combination. Never had an issue with it. I do have TSO turned off on vtnet(4) though; the combination with PF caused really slow transfers until I turned off TSO.

Gamecreature · Oct 11, 2021

SirDice said:
Running a 13.0-RELEASE-p4 on a TransIP VPS for some time now. Also using vtnet(4) and PF combination. Never had an issue with it.

SirDice I know. Another TransIP server (PerformanceVPS) also runs without problems on 13.0-RELEASE-p4 (clean v13 install). I also expect the first server to run without problems, but I think it is safer to resolve the issues with the other server first. (Those two servers were (freebsd-)updated from 12.2.)
