FreeBSD 12.0-RELEASE-p1 unstable?

mg16373 · Dec 22, 2018

Today the server (running bhyve and nothing else) has rebooted three times. Every time, the "syncer" kernel thread seems to be the issue.
I never had problems like this before, and I have upgraded servers running FreeBSD 10.4 and 11.2-pX with more complex configurations. It seems to be a bug in FreeBSD 12 at this time.

Code:

Fatal trap 12: page fault while in kernel mode
cpuid = 7; apic id = 07
fault virtual address   = 0x40000000410
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80b7ad1d
stack pointer           = 0x28:0xfffffe009a513830
frame pointer           = 0x28:0xfffffe009a5138a0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 33 (syncer)
trap number             = 12
panic: page fault
cpuid = 7
time = 1545478613
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff810749c9 at trap_pfault+0x49
#5 0xffffffff81073fee at trap+0x29e
#6 0xffffffff8104f1d5 at calltrap+0x8
#7 0xffffffff80c6e93e at mnt_vnode_next_active+0x29e
#8 0xffffffff80c6d068 at vfs_msync+0x278
#9 0xffffffff80c7077e at sync_fsync+0xee
#10 0xffffffff811f991e at VOP_FSYNC_APV+0x7e
#11 0xffffffff80c6fee5 at sched_sync+0x415
#12 0xffffffff80b5bf33 at fork_exit+0x83
#13 0xffffffff810501be at fork_trampoline+0xe
Uptime: 4d18h55m56s
Dumping 4998 out of 32587 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%---<<BOOT>>---
Copyright (c) 1992-2018 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 12.0-RELEASE r341666 GENERIC amd64
FreeBSD clang version 6.0.1 (tags/RELEASE_601/final 335540) (based on LLVM 6.0.1)
VT(vga): resolution 640x480
CPU: Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz (3500.07-MHz K8-class CPU)

tingo · Dec 24, 2018

After installing FreeBSD 12.0-release (on a new partition) on my main workstation I had two kernel core dumps.

Code:

root@kg-core1# cat /var/crash/info.0
Dump header from device: /dev/ada0p3
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 1569280000
  Blocksize: 512
  Compression: none
  Dumptime: Sun Dec 23 13:36:19 2018
  Hostname: kg-core1.kg4.no
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 12.0-RELEASE r341666 GENERIC
  Panic String: handle_disk_io_initiation: Unexpected type ???
  Dump Parity: 1070871860
  Bounds: 0
  Dump Status: good
root@kg-core1# cat /var/crash/info.1
Dump header from device: /dev/ada0p3
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 1667612672
  Blocksize: 512
  Compression: none
  Dumptime: Sun Dec 23 14:04:51 2018
  Hostname: kg-core1.kg4.no
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 12.0-RELEASE r341666 GENERIC
  Panic String: page fault
  Dump Parity: 2847147341
  Bounds: 1
  Dump Status: good

Not sure why. Anyway, I updated (via freebsd-update) to FreeBSD 12.0-RELEASE-p1:

Code:

root@kg-core1# freebsd-version -ku
12.0-RELEASE
12.0-RELEASE-p1
root@kg-core1# uname -a
FreeBSD kg-core1.kg4.no 12.0-RELEASE FreeBSD 12.0-RELEASE r341666 GENERIC  amd64

And the machine has been stable so far (knock on wood).

SirDice · Dec 24, 2018

According to the information for the latest patch, the kernel wasn't updated, so the kernel is still the same.
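
The mismatch is visible in the `freebsd-version -ku` output quoted earlier: `-k` reports the installed kernel and `-u` the installed userland. A minimal sketch of that comparison follows; the helper name `versions_match` is made up for illustration.

```shell
# Hypothetical helper: succeed only if kernel and userland report the same version.
versions_match() {
    # $1 = kernel version string, $2 = userland version string
    [ "$1" = "$2" ]
}

# On a FreeBSD host the two strings would come from:
#   versions_match "$(freebsd-version -k)" "$(freebsd-version -u)"
# With the values posted in this thread:
if versions_match "12.0-RELEASE" "12.0-RELEASE-p1"; then
    echo "kernel and userland match"
else
    echo "kernel and userland differ"
fi
```

A difference here is expected when a patch level only touches userland, so it does not by itself indicate a failed update.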

tingo · Dec 24, 2018

Yeah, I know. And it core dumped again :-(

Code:

root@kg-core1# cat /var/crash/info.2
Dump header from device: /dev/ada0p3
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 2303528960
  Blocksize: 512
  Compression: none
  Dumptime: Mon Dec 24 17:01:50 2018
  Hostname: kg-core1.kg4.no
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 12.0-RELEASE r341666 GENERIC
  Panic String: page fault
  Dump Parity: 3023125325
  Bounds: 2
  Dump Status: good

SirDice · Dec 24, 2018

Have you checked the obvious things? Like memory or disk errors? Both can lead to unexplained crashes or panics.

tingo · Dec 24, 2018

Disks - easy, they get fsck'ed very often now.

memory - I haven't run a memtest on this machine recently (I want to use it, not test it) but it has been running 24/7 with FreeBSD 10.4-stable until I upgraded it to 12.0-release.

SirDice · Dec 24, 2018

tingo said:
Disks - easy, they get fsck'ed very often now.

This doesn't detect bad sectors at all, and it definitely does not detect bad sectors in your swap partition. Use sysutils/smartmontools for that.
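
To go beyond the overall health verdict, the drive's attribute table and self-test log are the usual places to look. A minimal sketch, assuming ATA disks that report the common attribute names below (names vary by vendor); the function name `sector_attrs` is hypothetical.

```shell
# Hypothetical filter: keep only the SMART attributes that usually indicate
# bad or remapped sectors. Reads `smartctl -A` output on stdin.
sector_attrs() {
    grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
}

# Intended use on the machine itself (as root; the device name is an example):
#   smartctl -A /dev/ada0 | sector_attrs    # non-zero raw values deserve a closer look
#   smartctl -t long /dev/ada0              # start an extended self-test
#   smartctl -l selftest /dev/ada0          # read the self-test log afterwards
```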

tingo · Dec 24, 2018

Aha - I forgot to mention that smartd is running and has not reported anything yet. A quick status:

Code:

root@kg-core1# smartctl -H /dev/ada0
smartctl 6.6 2017-11-05 r4594 [FreeBSD 12.0-RELEASE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

smartctl 6.6 2017-11-05 r4594 [FreeBSD 12.0-RELEASE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@kg-core1# smartctl -H /dev/ada2
smartctl 6.6 2017-11-05 r4594 [FreeBSD 12.0-RELEASE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

and FWIW (which might be very little) it didn't look like the machine was using any swap when it core dumped the last time (I was using it while it happened). It wasn't even using half the memory (this machine has 32 GB).

SirDice · Dec 24, 2018

At least it's something we can cross off the list now. As they say, once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth. So eliminating possible causes helps narrow things down. You can get some really weird crashes if you have bad sectors, especially when they happen to be in the swap partition. The same goes for memory errors, although I would expect those panics to be a bit more "random" (i.e. random processes that crash, not the same process every time).

ralphbsz · Dec 24, 2018

Fsck does not check the storage subsystem itself, as SirDice already mentioned, except to a small degree as a side effect. It checks that the metadata in the file system is logically consistent: not that it is correct, only that it is consistent with itself. It does a little bit of IO, so if there are serious IO problems, it might find them. This is not to say that fsck is useless, quite the contrary: for those file systems where becoming inconsistent is common (due to bugs, or due to deliberate design choices, like not guaranteeing consistency during an unclean shutdown), fsck is vital.

Smartd doesn't really check the storage subsystem either. It asks the disk drive what it thinks about its own health. That's very valuable, and I'm not knocking it. But it is an incomplete answer. It's a lot like asking a psychiatric patient how their infected toe-nail is feeling: they might give you a truthful answer, they might lie to you, or they might tell you random gibberish. So you should run smartd, but don't trust the results to tell you that everything is good.

To really check the storage subsystem, the best thing to do is to heavily exercise it, doing lots of IO. For example, imagine a loose SATA cable, which occasionally gives you IO errors, or worse, occasionally corrupts data. That's why scrubbing your file systems is so important. Unfortunately, debugging that is hard, since error handling in the IO stack tends to be messy.
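
One crude way to exercise the read path is to read every block of a device and watch for errors. This is a sketch only: it is read-only, it says nothing about whether the data is correct, and the function name `read_all` is made up. On ZFS, `zpool scrub` does the job properly with checksum verification.

```shell
# Hypothetical helper: read a whole device (or file) and discard the data.
# dd exits non-zero if the kernel reports a read error.
read_all() {
    dd if="$1" of=/dev/null bs=1048576 2>/dev/null
}

# Intended use (as root; the device name is an example):
#   read_all /dev/ada0 || echo "read errors on ada0, check dmesg"
```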

But that is all theoretical. Your crash seems to be somewhat consistent. It happens when handling a page fault in the kernel during a sync operation (see the callstack: from vfs_msync to a pfault trap). That points at either memory corruption (in kernel space!), or a software bug in the kernel. Are you using some strange software? Unusual device drivers? Any other suspicious output in dmesg?
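
Reading the backtrace that way can be sketched as a small filter that skips the panic and trap-handling frames and prints the first frame below them. The list of frames to skip is taken from the traces in this thread, and the function name `faulting_frame` is hypothetical.

```shell
# Hypothetical helper: print the first backtrace frame that is not part of
# the panic/trap machinery. Reads a "KDB: stack backtrace" on stdin.
faulting_frame() {
    awk '
        $1 ~ /^#[0-9]+$/ && $3 == "at" {
            name = $4
            sub(/\+.*$/, "", name)          # strip the +0x... offset
            if (name !~ /^(kdb_backtrace|vpanic|panic|trap_fatal|trap_pfault|trap|calltrap)$/) {
                print $4
                exit
            }
        }
    '
}

# Example with three frames from the first post:
printf '%s\n' \
    '#5 0xffffffff81073fee at trap+0x29e' \
    '#6 0xffffffff8104f1d5 at calltrap+0x8' \
    '#7 0xffffffff80c6e93e at mnt_vnode_next_active+0x29e' |
    faulting_frame
```

For the syncer panics posted above this prints `mnt_vnode_next_active+0x29e`, the function that was walking the mount's vnode list on behalf of `vfs_msync` when the fault hit.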

tingo · Dec 25, 2018

Nope. Standard FreeBSD 12.0-release install. All programs installed from packages, from the 'latest' repository. Examples:

Code:

firefox-64.0_3,1
root@kg-core1# pkg info thun*
thunderbird-60.4.0
root@kg-core1# pkg info libreo*
libreoffice-6.0.7_4

mg16373 · Dec 28, 2018

Today ...

Code:

(Uptime: 3:14PM  up 7 mins, 1 user, load averages: 0.05, 0.17, 0.11)

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x40000000410
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80b7ad1d
stack pointer           = 0x28:0xfffffe009a513830
frame pointer           = 0x28:0xfffffe009a5138a0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 33 (syncer)
trap number             = 12
panic: page fault
cpuid = 0
time = 1546005944

KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff810749c9 at trap_pfault+0x49
#5 0xffffffff81073fee at trap+0x29e
#6 0xffffffff8104f1d5 at calltrap+0x8
#7 0xffffffff80c6e93e at mnt_vnode_next_active+0x29e
#8 0xffffffff80c6d068 at vfs_msync+0x278
#9 0xffffffff80c7077e at sync_fsync+0xee
#10 0xffffffff811f991e at VOP_FSYNC_APV+0x7e
#11 0xffffffff80c6fee5 at sched_sync+0x415
#12 0xffffffff80b5bf33 at fork_exit+0x83
#13 0xffffffff810501be at fork_trampoline+0xe
Uptime: 6d0h59m16s
Dumping 4478 out of 32587 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%---<<BOOT>>---

mg16373 · Dec 28, 2018

Shit ... I should not have done this upgrade at this time because 11.2 works

Code:

root@:/var/crash # ls -la
total 3302417
drwxr-x---   2 root  wheel          18 Dec 28 15:08 .
drwxr-xr-x  27 root  wheel          27 Dec 28 15:07 ..
-rw-r--r--   1 root  wheel           2 Dec 28 15:07 bounds
-rw-r--r--   1 root  wheel         319 Dec 22 12:39 core.txt.0
-rw-r--r--   1 root  wheel         319 Dec 22 13:07 core.txt.1
-rw-r--r--   1 root  wheel         319 Dec 22 13:47 core.txt.2
-rw-r--r--   1 root  wheel         319 Dec 28 15:08 core.txt.3
-rw-------   1 root  wheel         402 Dec 22 12:38 info.0
-rw-------   1 root  wheel         402 Dec 22 13:06 info.1
-rw-------   1 root  wheel         402 Dec 22 13:47 info.2
-rw-------   1 root  wheel         402 Dec 28 15:07 info.3
lrwxr-xr-x   1 root  wheel           6 Dec 28 15:08 info.last -> info.3
-rw-r--r--   1 root  wheel           5 Jun 22  2018 minfree
-rw-------   1 root  wheel  5241716736 Dec 22 12:39 vmcore.0
-rw-------   1 root  wheel  1627086848 Dec 22 13:07 vmcore.1
-rw-------   1 root  wheel  1554776064 Dec 22 13:47 vmcore.2
-rw-------   1 root  wheel  4695748608 Dec 28 15:08 vmcore.3
lrwxr-xr-x   1 root  wheel           8 Dec 28 15:08 vmcore.last -> vmcore.3

Remington · Dec 28, 2018

mg16373 said:
Shit ... I should not have done this upgrade at this time because 11.2 works

The rule of thumb is to never do a major upgrade on a production server until the first minor release (xx.1) is out, since most bugs are fixed by then. It's nice to get ahead with bleeding-edge technology, but it comes with a price and headaches.

jem · Jan 5, 2019

Had my 12.0-RELEASE gateway system reboot itself unexpectedly 7 hours ago. First time I've noticed it happen. Not quite the same fault as OP's:

Code:

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address   = 0x410
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80b9661f
stack pointer           = 0x28:0xfffffe00254cf940
frame pointer           = 0x28:0xfffffe00254cf9e0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi4: clock (0))
trap number             = 12
panic: page fault
cpuid = 2
time = 1546641102
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff810749c9 at trap_pfault+0x49
#5 0xffffffff81073fee at trap+0x29e
#6 0xffffffff8104f1d5 at calltrap+0x8
#7 0xffffffff80e046f1 at nd6_llinfo_timer+0x4d1
#8 0xffffffff80bb554e at softclock_call_cc+0x12e
#9 0xffffffff80bb5a39 at softclock+0x79
#10 0xffffffff80b5ee17 at ithread_loop+0x1a7
#11 0xffffffff80b5bf33 at fork_exit+0x83
---<<BOOT>>---
Copyright (c) 1992-2018 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 12.0-RELEASE r341666 GENERIC amd64
FreeBSD clang version 6.0.1 (tags/RELEASE_601/final 335540) (based on LLVM 6.0.1)

It's the first time I've observed FreeBSD crashing on this host, which had been running 11.2-RELEASE (pfSense) for a long time before this.

drhowarddrfine · Jan 5, 2019

Have had no issues whatsoever with my server or my workstation so, no, I do not think this release is unstable.

davisr · Feb 5, 2019

For what it's worth, I am also experiencing a kernel panic from syncer (although I am using pfSense, which is based on 11.2).

Code:

FreeBSD 11.2-RELEASE-p4 #2 b00c407ba5d(RELENG_2_4_4): Mon Nov 26 11:41:48 EST 2018
...
CPU: QEMU Virtual CPU version 2.0.0 (2712.06-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x663  Family=0x6  Model=0x6  Stepping=3
  Features=0x78bfbfd<FPU,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
  Features2=0x80a02001<SSE3,CX16,x2APIC,POPCNT,HV>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x21<LAHF,ABM>
Hypervisor: Origin = "KVMKVMKVM"

I am able to induce the kernel panic by running it under qemu as follows:

Code:

qemu-system-x86_64 \
    -enable-kvm \
    -cpu kvm64 \
    -smp 1 \
    -m 512 \
    -net nic,model=virtio -net bridge,br=$bridge \
    -display none -serial stdio \
    -drive file=$isodisk,if=virtio,readonly \
    -drive file=$bootdisk,if=virtio

After installing the distribution to disk, just after the machine initiates a reboot, a fault occurs:

Code:

Feb  5 18:09:35  reboot: rebooted by root
Feb  5 18:09:35  syslogd: exiting on signal 15
Waiting (max 60 seconds) for system process `vnlru' to stop... done
Waiting (max 60 seconds) for system process `bufdaemon' to stop... done
Waiting (max 60 seconds) for system process `syncer' to stop...

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer     = 0x20:0xffffffff80bc7f78
stack pointer           = 0x28:0xfffffe001ccdf7d0
frame pointer           = 0x28:0xfffffe001ccdf800
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 24 (syncer)
[ thread pid 24 tid 100069 ]
Stopped at      g_vfs_strategy+0x38:    lock cmpxchgq   %r13,(%r15)

Later, I found out that omitting 'readonly' from my qemu disk parameters allowed syncer to finish its job without panicking. I am disappointed in this, as I like to keep all my install images with "440" permissions (which qemu won't open without 'readonly'), but I am glad to have found out why it was panicking for me.
