Fatal trap 12: page fault while in kernel mode ... during network operations

marcinkk · May 17, 2021

Can anybody make sense of the information below?

Code:

Fatal trap 12: page fault while in kernel mode
cpuid = 14; apic id = 42
fault virtual address   = 0x70
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80da00b3
stack pointer           = 0x28:0xfffffe02036bd8c0
frame pointer           = 0x28:0xfffffe02036bd8c0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi1: netisr 0)
trap number             = 12
panic: page fault
cpuid = 14
time = 1621242908
KDB: stack backtrace:
#0 0xffffffff80c57345 at kdb_backtrace+0x65
#1 0xffffffff80c09d21 at vpanic+0x181
#2 0xffffffff80c09b93 at panic+0x43
#3 0xffffffff8108b187 at trap_fatal+0x387
#4 0xffffffff8108b1df at trap_pfault+0x4f
#5 0xffffffff8108a83d at trap+0x27d
#6 0xffffffff810617a8 at calltrap+0x8
#7 0xffffffff80ddf0f8 at udp_input+0x338
#8 0xffffffff80dafc15 at ip_input+0x125
#9 0xffffffff80d3fa7b at swi_net+0x12b
#10 0xffffffff80bcae5d at ithread_loop+0x24d
#11 0xffffffff80bc7c5e at fork_exit+0x7e
#12 0xffffffff8106282e at fork_trampoline+0xe
Uptime: 9d22h28m6s


Fatal trap 12: page fault while in kernel mode
cpuid = 10; apic id = 2a
fault virtual address   = 0x28
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80d451d1
stack pointer           = 0x28:0xfffffe02036d69c0
frame pointer           = 0x28:0xfffffe02036d6a10
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi1: netisr 0)
trap number             = 12
panic: page fault
cpuid = 10
time = 1621243393
KDB: stack backtrace:
#0 0xffffffff80c57345 at kdb_backtrace+0x65
#1 0xffffffff80c09d21 at vpanic+0x181
#2 0xffffffff80c09b93 at panic+0x43
#3 0xffffffff8108b187 at trap_fatal+0x387
#4 0xffffffff8108b1df at trap_pfault+0x4f
#5 0xffffffff8108a83d at trap+0x27d
#6 0xffffffff810617a8 at calltrap+0x8
#7 0xffffffff80d4dd41 at rts_input+0x61
#8 0xffffffff80d3fa7b at swi_net+0x12b
#9 0xffffffff80bcae5d at ithread_loop+0x24d
#10 0xffffffff80bc7c5e at fork_exit+0x7e
#11 0xffffffff8106282e at fork_trampoline+0xe
Uptime: 2m30s


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 20
fault virtual address   = 0x70
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80d24abc
stack pointer           = 0x28:0xfffffe017d013830
frame pointer           = 0x28:0xfffffe017d013880
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (if_io_tqg_0)
trap number             = 12
panic: page fault
cpuid = 0
time = 1621243823
KDB: stack backtrace:
#0 0xffffffff80c57345 at kdb_backtrace+0x65
#1 0xffffffff80c09d21 at vpanic+0x181
#2 0xffffffff80c09b93 at panic+0x43
#3 0xffffffff8108b187 at trap_fatal+0x387
#4 0xffffffff8108b1df at trap_pfault+0x4f
#5 0xffffffff8108a83d at trap+0x27d
#6 0xffffffff810617a8 at calltrap+0x8
#7 0xffffffff80d3f2da at netisr_dispatch_src+0xca
#8 0xffffffff80d23eb9 at ether_input+0x69
#9 0xffffffff80d3ba03 at iflib_rxeof+0xc63
#10 0xffffffff80d35d42 at _task_fn_rx+0x72
#11 0xffffffff80c55dad at gtaskqueue_run_locked+0x15d
#12 0xffffffff80c55a4c at gtaskqueue_thread_loop+0xac
#13 0xffffffff80bc7c5e at fork_exit+0x7e
#14 0xffffffff8106282e at fork_trampoline+0xe
Uptime: 1m49s


Fatal trap 12: page fault while in kernel mode
cpuid = 6; apic id = 26
fault virtual address   = 0x1c
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80c9b94a
stack pointer           = 0x28:0xfffffe0257f726d0
frame pointer           = 0x28:0xfffffe0257f72740
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1926 (vtnet-3:0 tx)
trap number             = 12
panic: page fault
cpuid = 6
time = 1621259350
KDB: stack backtrace:
#0 0xffffffff80c57345 at kdb_backtrace+0x65
#1 0xffffffff80c09d21 at vpanic+0x181
#2 0xffffffff80c09b93 at panic+0x43
#3 0xffffffff8108b187 at trap_fatal+0x387
#4 0xffffffff8108b1df at trap_pfault+0x4f
#5 0xffffffff8108a83d at trap+0x27d
#6 0xffffffff810617a8 at calltrap+0x8
#7 0xffffffff821398c3 at bridge_input+0x193
#8 0xffffffff80d24cb8 at ether_nh_input+0x218
#9 0xffffffff80d3f2da at netisr_dispatch_src+0xca
#10 0xffffffff80d23eb9 at ether_input+0x69
#11 0xffffffff80d2916d at tunwrite+0x4fd
#12 0xffffffff80aa994a at devfs_write_f+0xda
#13 0xffffffff80c76618 at dofilewrite+0x88
#14 0xffffffff80c7653e at sys_writev+0x6e
#15 0xffffffff8108ba8c at amd64_syscall+0x10c
#16 0xffffffff810620ce at fast_syscall_common+0xf8
Uptime: 3m8s


Fatal trap 12: page fault while in kernel mode
cpuid = 22; apic id = 4a
fault virtual address   = 0x1c
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80c9b94a
stack pointer           = 0x28:0xfffffe02ad9ae6d0
frame pointer           = 0x28:0xfffffe02ad9ae740
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1938 (vtnet-3:0 tx)
trap number             = 12
panic: page fault
cpuid = 22
time = 1621259879
KDB: stack backtrace:
#0 0xffffffff80c57345 at kdb_backtrace+0x65
#1 0xffffffff80c09d21 at vpanic+0x181
#2 0xffffffff80c09b93 at panic+0x43
#3 0xffffffff8108b187 at trap_fatal+0x387
#4 0xffffffff8108b1df at trap_pfault+0x4f
#5 0xffffffff8108a83d at trap+0x27d
#6 0xffffffff810617a8 at calltrap+0x8
#7 0xffffffff82d2f8c3 at bridge_input+0x193
#8 0xffffffff80d24cb8 at ether_nh_input+0x218
#9 0xffffffff80d3f2da at netisr_dispatch_src+0xca
#10 0xffffffff80d23eb9 at ether_input+0x69
#11 0xffffffff80d2916d at tunwrite+0x4fd
#12 0xffffffff80aa994a at devfs_write_f+0xda
#13 0xffffffff80c76618 at dofilewrite+0x88
#14 0xffffffff80c7653e at sys_writev+0x6e
#15 0xffffffff8108ba8c at amd64_syscall+0x10c
#16 0xffffffff810620ce at fast_syscall_common+0xf8
Uptime: 2m43s


Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 23
fault virtual address   = 0x18
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff805f9629
stack pointer           = 0x28:0xfffffe020386ca20
frame pointer           = 0x28:0xfffffe020386caa0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (irq56: bce0)
trap number             = 12
panic: page fault
cpuid = 3
time = 1621261253
KDB: stack backtrace:
#0 0xffffffff80c57345 at kdb_backtrace+0x65
#1 0xffffffff80c09d21 at vpanic+0x181
#2 0xffffffff80c09b93 at panic+0x43
#3 0xffffffff8108b187 at trap_fatal+0x387
#4 0xffffffff8108b1df at trap_pfault+0x4f
#5 0xffffffff8108a83d at trap+0x27d
#6 0xffffffff810617a8 at calltrap+0x8
#7 0xffffffff80bcae5d at ithread_loop+0x24d
#8 0xffffffff80bc7c5e at fork_exit+0x7e
#9 0xffffffff8106282e at fork_trampoline+0xe
Uptime: 17m10s


Fatal trap 12: page fault while in kernel mode
cpuid = 10; apic id = 2a
fault virtual address   = 0x18
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80c9b880
stack pointer           = 0x28:0xfffffe024fd5f580
frame pointer           = 0x28:0xfffffe024fd5f5f0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 2232 (sshd)
trap number             = 12
panic: page fault
cpuid = 10
time = 1621266438
KDB: stack backtrace:
#0 0xffffffff80c57345 at kdb_backtrace+0x65
#1 0xffffffff80c09d21 at vpanic+0x181
#2 0xffffffff80c09b93 at panic+0x43
#3 0xffffffff8108b187 at trap_fatal+0x387
#4 0xffffffff8108b1df at trap_pfault+0x4f
#5 0xffffffff8108a83d at trap+0x27d
#6 0xffffffff810617a8 at calltrap+0x8
#7 0xffffffff80dc8a63 at tcp_output+0x10b3
#8 0xffffffff80ddab89 at tcp_usr_send+0x229
#9 0xffffffff80ca9053 at sosend_generic+0x633
#10 0xffffffff80ca94e0 at sosend+0x50
#11 0xffffffff80c7efd9 at soo_write+0x49
#12 0xffffffff80c76618 at dofilewrite+0x88
#13 0xffffffff80c7618c at sys_write+0xbc
#14 0xffffffff8108ba8c at amd64_syscall+0x10c
#15 0xffffffff810620ce at fast_syscall_common+0xf8
Uptime: 1m43s

Some additional context:
- before the first crash I changed from ipfw nat to natd
- after the last crash I reverted to ipfw nat and set the firewall type to OPEN

Code:

# sysctl hw.model hw.machine hw.ncpu
hw.model: AMD Opteron(TM) Processor 6234
hw.machine: amd64
hw.ncpu: 24

# uname -a
FreeBSD kappa 13.0-RELEASE FreeBSD 13.0-RELEASE #0 releng/13.0-n244733-ea31abc261f: Fri Apr  9 04:24:09 UTC 2021     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64

And a while before the first crash I accidentally created a loop on the switch :/

Regards,
Marcin

PMc · May 18, 2021

Before searching further, I would take a thorough look at the hardware concerned (memory, bus, network card, BIOS settings), and probably try to reproduce it on another machine. Then I would look at the offload features of the concerned network card.

mark_j · May 18, 2021

What's your system's setup? A page fault trap means something went wrong bringing a page of memory from swap. Have you checked dmesg?

marcinkk · May 18, 2021

PMc said:
Then I would look at the offload features of the concerned network card.

Did you mean TSO? Some time ago I opened this thread: gigabit-transfer-problems-with-intel-igb.79788.

Today I completely disabled TSO (net.inet.tcp.tso=0 in sysctl.conf). I've enabled the firewall with in-kernel NAT, and it has been stable for some time. I've also checked the natd configuration again: I didn't notice a crash, but the network didn't work.

BTW: I don't know what I'm doing wrong, but in short: I've set firewall_type="CLIENT" in /etc/rc.conf. When I change firewall_nat_enable="YES" to natd_enable="YES" (and the related lines) and run service ipfw restart, everything seems to work as expected. But after a reboot the network does not work.

mark_j said:
What's your system's setup? A page fault trap means something went wrong bringing a page of memory from swap. Have you checked dmesg?

Swap is configured but I think it should not be used, especially 3 minutes after reboot:

Code:

# freecolor -m -o
             total       used       free     shared    buffers     cached
Mem:        127654      12333     115321          0          0          0
Swap:        16384          0      16384

_martin · May 19, 2021

When you look at the fault virtual address you can see it's a bogus address:

marcinkk said:
fault virtual address = 0x70

marcinkk said:
fault virtual address = 0x28

Hence the trap. This has nothing to do with swap (the page could be anywhere). In the stack trace output, the frame listed right after calltrap (one frame below it) shows where the issue occurred. The fault doesn't hit the same spot every time, which suggests that the data the kernel is handling is corrupted.
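To make that reading of the reports repeatable, here is a minimal Python sketch that pulls the fault address, the instruction pointer and the frame listed after calltrap out of one saved report. The function name `summarize_panic`, the dictionary keys and the 4 KiB page size are my own assumptions for illustration, not part of any FreeBSD tool. A fault address below one page is flagged because addresses that low usually come from reading a field through a NULL pointer. Also note that in the reports above the instruction pointer differs from the address in the frame after calltrap, so that frame appears to be the caller of the code that actually faulted.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, as is usual on amd64


def summarize_panic(text):
    """Extract key fields from one 'Fatal trap 12' report (illustrative helper)."""

    def field(name):
        # Matches lines such as "fault virtual address   = 0x70".
        m = re.search(rf"^{re.escape(name)}[ \t]*=[ \t]*(.+)$", text, re.MULTILINE)
        return m.group(1).strip() if m else None

    fault = field("fault virtual address")
    fault_address = int(fault, 16) if fault else None

    # "0x20:0xffffffff80da00b3" -> keep only the offset after the selector.
    ip = field("instruction pointer")
    instruction_pointer = ip.rsplit(":", 1)[-1] if ip else None

    # Backtrace lines look like "#7 0xffffffff80ddf0f8 at udp_input+0x338".
    frames = re.findall(
        r"^#\d+[ \t]+0x[0-9a-fA-F]+[ \t]+at[ \t]+(\S+)[ \t]*$", text, re.MULTILINE
    )
    frame_after_calltrap = None
    for i, name in enumerate(frames):
        if name.startswith("calltrap+") and i + 1 < len(frames):
            frame_after_calltrap = frames[i + 1]
            break

    return {
        "fault_address": fault_address,
        # A fault inside the first page is typical of NULL + small field offset.
        "near_null": fault_address is not None and fault_address < PAGE_SIZE,
        "instruction_pointer": instruction_pointer,
        "current_process": field("current process"),
        "frame_after_calltrap": frame_after_calltrap,
    }


# Shortened copy of the first report from this thread.
SAMPLE = """\
Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x70
instruction pointer     = 0x20:0xffffffff80da00b3
current process         = 12 (swi1: netisr 0)
#6 0xffffffff810617a8 at calltrap+0x8
#7 0xffffffff80ddf0f8 at udp_input+0x338
#8 0xffffffff80dafc15 at ip_input+0x125
"""

if __name__ == "__main__":
    for key, value in summarize_panic(SAMPLE).items():
        print(f"{key}: {value}")
```

Run over each report in turn, this makes it easy to see that the frame after calltrap changes from crash to crash (udp_input, rts_input, netisr_dispatch_src, bridge_input, tcp_output) while the fault address always stays within the first page.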

Faulty hardware could be a possibility too (e.g. faulty memory).

The best option you have is to open a PR for the issue and let the developers have a look.

mark_j · May 20, 2021

Yes, sorry, I wrote swap when I meant virtual address space.

mark_j · May 20, 2021

I also wonder: are you running ZFS? (Or did I miss that?)

marcinkk · May 20, 2021

Yes, ZFS. On "single disk", but this "single disk" is RAID on SmartArray P410.

Memory: 128 GB DDR3 ECC.

BTW: Reading around on the Internet (sorry, I lost the link), I found that NIC offload features such as TSO, LRO and others can improve performance but can also cause problems when the NIC firmware is too old or incompatible with them. Given that and my experience with igb cards, I disabled these features. The system has been stable for 2 days... Is there a way to check a NIC for compatibility with TSO and the other offload features?

SirDice · May 20, 2021

marcinkk said:
Yes, ZFS. On "single disk", but this "single disk" is RAID on SmartArray P410.

That's a bad way to use ZFS. Do NOT use hardware RAID in combination with ZFS. You will not have error correction (one of the best features of ZFS).

marcinkk · May 20, 2021

P410 does not have JBOD mode.
Thanks to ZFS I have, for example, the compression option.
All RAID features - redundancy (I have RAID1 and RAID5) - seem to work well on SmartArray.

SirDice · May 20, 2021

marcinkk said:
All RAID features - redundancy (I have RAID1 and RAID5) - seem to work well on SmartArray.

Yes, but you will only have error detection, no error correction. Hardware RAID is only helpful if a whole drive fails. It's not going to help with file corruption.
