Kernel panic several times a day

Not sure what to do next, suggestions appreciated.

This is a home-rolled NAS VM running on ESXi 6.7.0 Update 3 (Build 15160138), with ECC RAM.
It used to be very stable and has been through 11.x -> 12.x -> 12.1p2 over its life.

Assume I've got Linux skills but not much BSD ;-)

Code:
root@ViStAr:/var/crash # uname -a
FreeBSD ViStAr 12.1-RELEASE-p2 FreeBSD 12.1-RELEASE-p2 GENERIC  amd64

root@ViStAr:/var/crash # cat info.0
Dump header from device: /dev/da0p2
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 773570560
  Blocksize: 512
  Compression: none
  Dumptime: Thu Feb 27 04:48:16 2020
  Hostname: ViStAr
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 12.1-RELEASE-p2 GENERIC
  Panic String: page fault
  Dump Parity: 962358042
  Bounds: 0
  Dump Status: good

root@ViStAr:/var/crash # ls -la vmcore*
-rw-------  1 root  wheel   773570560 27 Feb 04:49 vmcore.0
-rw-------  1 root  wheel   772816896 27 Feb 05:21 vmcore.1
-rw-------  1 root  wheel   745340928 27 Feb 05:31 vmcore.2
-rw-------  1 root  wheel   882434048 27 Feb 06:07 vmcore.3
-rw-------  1 root  wheel   896045056 27 Feb 06:49 vmcore.4
-rw-------  1 root  wheel   836997120 27 Feb 02:21 vmcore.5
-rw-------  1 root  wheel   737398784 27 Feb 02:23 vmcore.6
-rw-------  1 root  wheel  1729163264 27 Feb 03:06 vmcore.7
-rw-------  1 root  wheel   815341568 27 Feb 04:33 vmcore.8
-rw-------  1 root  wheel   740249600 27 Feb 04:38 vmcore.9
lrwxr-xr-x  1 root  wheel           8 27 Feb 06:49 vmcore.last -> vmcore.4

[SNIP]
kgdb

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
(No debugging symbols found in /boot/kernel/kernel)
0xffffffff80c01e9a in sched_switch ()
(kgdb) bt
#0  0xffffffff80c01e9a in sched_switch ()
#1  0xffffffff80bdbf62 in mi_switch ()
#2  0xffffffff80c2bb35 in sleepq_catch_signals ()
#3  0xffffffff80c2be24 in sleepq_timedwait_sig ()
#4  0xffffffff80bdb965 in _sleep ()
#5  0xffffffff80be7286 in kern_clock_nanosleep ()
#6  0xffffffff80be743f in sys_nanosleep ()
#7  0xffffffff810a8984 in amd64_syscall ()
#8  <signal handler called>
#9  0x000000080039593a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffec48
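A rough sketch of what I think should pull in matching debug symbols for a readable backtrace (kgdb here is from the devel/gdb package; the download URL layout below is from memory and may need adjusting, and the debug set must match the installed kernel exactly):

Code:
# fetch and extract the kernel debug symbols for this release
# (path layout assumed; older releases may have moved to ftp-archive.freebsd.org)
fetch https://download.freebsd.org/ftp/releases/amd64/amd64/12.1-RELEASE/kernel-dbg.txz
tar -C / -xf kernel-dbg.txz   # installs /usr/lib/debug/boot/kernel/kernel.debug

# kgdb picks the .debug file up automatically when given the kernel and a core
kgdb /boot/kernel/kernel /var/crash/vmcore.0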
 
Code:
$ sudo zpool scrub zroot
$ zpool status -v zroot
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:39 with 0 errors on Fri Feb 28 08:49:28 2020
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          da0p3     ONLINE       0     0     0

errors: No known data errors
$

I'll go run a memtest, and then assuming that passes, I'll go create a bug ticket.

Thanks all.
 
Most likely corruption in memory. The stack trace above is in a system call (sleep) and a kernel function (sched_switch) that are used all the time and are very unlikely to have bugs, particularly in a production release. The error message is clear: a pointer to an incorrect address, the number that begins with 0x7fff. At this point I would suspect the hardware (memory, or the virtualized hardware provided by ESXi) more than the file system or the OS.
 
Memtest has run for over 5 hours; pass 1 completed without errors. Checking the Supermicro IPMI, there are no BIOS/hardware/memory errors noted.

Checked the VMware forums; no mass screaming that I could find. I wonder why it's happening 'now' rather than at some point in the last few years...
Also, there are ~11 other running VMs and none of those are panicking, but this is the only FreeBSD one. *shrug*

Again, thanks for all your feedback.
 
adgjqety - I may have the same issue with TrueNAS. I have 8 VMs working away without issue in ESXi, and the FreeBSD-based TrueNAS panicking sometimes every few days with a similar page fault. All memtest and hardware tests are fine, the memory is ECC, no other VMs are affected, and there is nothing in any of the logs.

Did you ever make any progress?

Welcome your help.

Thanks
CC
 
The memory address error is just the backtrace stopping because it can't resolve any more frames; i.e. it's not an indicator of a problem. Also, the OP omitted a lot of information from that dump, so it's impossible to say what happened.
 
Memtest has run for over 5 hours; pass 1 completed without errors...
I had a similar frustration. I have two almost identical computers with Gigabyte Z77 Ivy Bridge boards, Intel i5-3475S CPUs, and DDR3 memory. One board (a "refurb" I got direct from China) would spontaneously reset whenever I ran poudriere under a high load. Yet it passed the memory test in every configuration: one DIMM at a time in each slot, two DIMMs in either bank, four DIMMs filling all the slots. Finally I tried to diagnose it with benchmarks/stress or benchmarks/stress-ng, and as soon as I ran a VM stress test I got the reset (a double bus fault, I suppose). It turns out the refurb board has a bit of a problem with memory timing when both banks are in use, which only shows up under real hard memory stress, and I had to drop back to using only two DIMMs in one bank on that board; problem solved.

I had used that computer with max memory for months under some pretty high loads without realizing that there was any problem. stress said there was a problem.

Try stress testing with the stress program. (If it passes that, then you've got a software problem.)

Reference: testing server with stress-ng
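For what it's worth, a rough invocation that hammers the memory/VM paths (flags as I remember them from the stress-ng man page; tune the worker count and duration for the box being tested):

Code:
# four memory-stressor workers using 75% of RAM, verifying what they wrote back,
# cycling through all the vm stress methods for 24 hours
stress-ng --vm 4 --vm-bytes 75% --vm-method all --verify -t 24h --metrics-brief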
 
The memory address error is just the backtrace stopping because it can't resolve any more frames; i.e. it's not an indicator of a problem. Also, the OP omitted a lot of information from that dump, so it's impossible to say what happened.
Is there anything more I can do to localise the issue? My full post with the panic is here, where I describe it in a lot more detail. Can I interpret this, or do anything to gather more information on the next panic? I'm not a FreeBSD expert by any means.

Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x10
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80a7af2a
stack pointer           = 0x28:0xfffffe00e60aea30
frame pointer           = 0x28:0xfffffe00e60aea80
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 60306 (smbd)
trap number             = 12
panic: page fault
cpuid = 0
time = 1625664646
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e60ae6f0
vpanic() at vpanic+0x17b/frame 0xfffffe00e60ae740
panic() at panic+0x43/frame 0xfffffe00e60ae7a0
trap_fatal() at trap_fatal+0x391/frame 0xfffffe00e60ae800
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00e60ae850
trap() at trap+0x286/frame 0xfffffe00e60ae960
calltrap() at calltrap+0x8/frame 0xfffffe00e60ae960
--- trap 0xc, rip = 0xffffffff80a7af2a, rsp = 0xfffffe00e60aea30, rbp = 0xfffffe00e60aea80 ---
knote_fdclose() at knote_fdclose+0x13a/frame 0xfffffe00e60aea80
closefp() at closefp+0x42/frame 0xfffffe00e60aeac0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe00e60aebf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00e60aebf0
--- syscall (6, FreeBSD ELF64, sys_close), rip = 0x80fd11c2a, rsp = 0x7fffffffd108, rbp = 0x7fffffffd120 ---
KDB: enter: panic

Many thanks
CC
 
I had a similar frustration. I have two almost identical computers with Gigabyte Z77 Ivy Bridge boards, Intel i5-3475S CPUs, and DDR3 memory. One board (a "refurb" I got direct from China) would spontaneously reset whenever I ran poudriere under a high load. Yet it passed the memory test in every configuration: one DIMM at a time in each slot, two DIMMs in either bank, four DIMMs filling all the slots. Finally I tried to diagnose it with benchmarks/stress or benchmarks/stress-ng, and as soon as I ran a VM stress test I got the reset (a double bus fault, I suppose). It turns out the refurb board has a bit of a problem with memory timing when both banks are in use, which only shows up under real hard memory stress, and I had to drop back to using only two DIMMs in one bank on that board; problem solved.

I had used that computer with max memory for months under some pretty high loads without realizing that there was any problem. stress said there was a problem.

Try stress testing with the stress program. (If it passes that, then you've got a software problem.)

Reference: testing server with stress-ng

Gary, this looks interesting - should I run it on the bare-metal hardware or in an ESXi VM?

I have run Prime95 for hours across multiple VMs as part of my bench testing and it did not show up any issues, but I'm happy to try this out too. I just wish I could localise it - it says hardware issue to me, but localising between motherboard/CPU/RAM is hard, especially when it's intermittent and only affects FreeBSD.

Thanks
CC
 
Now this is a bogus address:
Code:
fault virtual address   = 0x10
smbd was calling the close() syscall and the page fault happened in the kernel, in knote_fdclose(). If you had debug symbols/src you could get a better idea of what knote_fdclose+0x13a is (what it does there).

One of the possibilities could be that the ESXi environment is rubbing the bug the right way and it gets triggered.
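With a matching kernel.debug in place (see the kernel-dbg sketch earlier in the thread), resolving that offset is roughly the following; note that if TrueNAS ships its own kernel build, the symbols would have to come from their build rather than the stock FreeBSD sets:

Code:
kgdb /boot/kernel/kernel /var/crash/vmcore.last
(kgdb) info line *0xffffffff80a7af2a     # the faulting rip from the panic message
(kgdb) list *(knote_fdclose+0x13a)       # shows the source line, if src is also installed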
 
Evening
_martin How do you see that from the hex? Amazing. So you think it may not be a hardware fault after all? How do I enable further debugging with symbols/src? Like I said, I'm not a FreeBSD expert - anything I can read, or can you point me in the right direction?

garry I've set up a new FreeBSD VM from their premade FreeBSD 12 images, installed stress, and it's running on 4 CPUs for 24 hours; I'll see if this fails.

Really appreciate the help here.

Thanks
CC
 
You can't rule out either yet. But if you have other VMs running on that HW and only one VM has a problem, I'd start there. Even better if you can vMotion that VM to other HW and observe it there.
You can expect certain address ranges in userspace and kernelspace, and some bogus addresses are more obvious than others.

Have a look at the Kernel Debugging part of the Handbook. It'll give you an idea of how to set up the environment. But debugging an issue in the kernel can get complicated very fast.

Are you able to trigger the bug with reproducible steps? As this is a VM, for the sake of the test modify it to have only one CPU and try again.
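It's also worth making sure full dumps keep landing in /var/crash before the next panic; a minimal sketch of the usual knobs (variable names per dumpon(8)/savecore(8), assuming a swap device large enough to hold the dump):

Code:
# /etc/rc.conf
dumpdev="AUTO"          # dump to the first suitable swap device on panic
dumpdir="/var/crash"    # where savecore(8) writes vmcore.N / info.N at next boot
# crashinfo(8) should then run at boot as well and leave a readable core.txt.N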
 
CLimbingKid More often than not people abandon these posts. I am interested to see why these crashes occur.

I later noticed you are using TrueNAS, a platform that's not supported on these forums. Are you able to reproduce this behaviour on a vanilla FreeBSD 12 installation? I don't know what changes are made in TrueNAS (whether it's only a fancy UI on top of FreeBSD or whether there are custom patches applied to the kernel, etc.).
In the TrueNAS thread you mentioned you've seen the crash occur in two different programs. It could be that those are victims of a different bug. If you have a way to trigger the bug I could try it myself.
 
_martin et al.,

Update - I have been stress testing with stress-ng on a recently downloaded FreeBSD 12.2 VM image, which I believe is the base of the current TrueNAS version. All the stress tests, running for days at a time with large amounts of RAM and many vCPUs, have run flawlessly. However, last night TrueNAS restarted once more. This time, from what I can see, I had no crash log files, but info.last describes another page fault. I had previously reduced the VM to 8 GB of RAM and a single vCPU.

No other VMs were affected. Unfortunately stress-ng, running in another VM, had completed its 24-hour cycle an hour earlier, so it was not running at the time.

I'm getting more confident this is indeed a TrueNAS issue, and while I have posted over on their forums, there have been no replies, and to be honest I have limited skills to gather further information. Please, if anyone can help me collect more info ahead of the next crash, I would love to localise this further.

There are no steps to reproduce - it appears random, not related to any user interface actions, and not related to load; it has happened early in the morning and late at night, and it's not tied to other events or schedules. The platform has a 1500 VA smart UPS, so outside power influences are unlikely.

I have now set stress-ng to run continuously in the hope that it's running during the next crash - I think then I should be able to prove it's not a hardware issue. My only other option is to reinstall TrueNAS, and potentially rebuild the pool. Not looking forward to that, but clutching at straws.

Thanks
CC
 
Thanks for the update. From what you're describing it seems like a single-VM issue, most likely a non-HW-related problem. Without steps to reproduce it's hard to do anything remotely.
We are getting into unsupported territory, as TrueNAS is not supported here. But may I suggest you create another TrueNAS VM and test whether that crashes too? Preferably without any private data, so you could share the vmdk of it after a crash.
I briefly checked the TrueNAS web site; there's too much eye candy there for my taste. The git repos don't have the FreeBSD src, so maybe they are using vanilla FreeBSD. But I don't know..

You mentioned it's page faulting all the time. If different programs are #PF'ing in the kernel, it may be that they are victims of something else. For example, some sort of overflow in one structure could be spilling into other data (the victim's data), and when the victim tries to access it, it gets this bogus info.
Check the backtrace of each crash; look for the first function before the trap (it was knote_fdclose() in your example). Is it always the same? What are the #PF addresses?
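If the crashes are landing as textdump tarballs in /var/crash, something like this should pull out the trap frames and fault addresses in one go (file names assumed; adjust the glob to whatever savecore actually left behind):

Code:
for f in /var/crash/textdump.tar.*; do
  echo "== $f"
  # msgbuf.txt inside each tarball holds the panic message and backtrace
  tar -xOf "$f" msgbuf.txt | grep -A 1 -e 'fault virtual address' -e '--- trap'
done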
 
I had a similar frustration. I have two almost identical computers with Gigabyte Z77 Ivy Bridge boards, Intel i5-3475S CPUs, and DDR3 memory. One board (a "refurb" I got direct from China) would spontaneously reset whenever I ran poudriere under a high load. Yet it passed the memory test in every configuration: one DIMM at a time in each slot, two DIMMs in either bank, four DIMMs filling all the slots. Finally I tried to diagnose it with benchmarks/stress or benchmarks/stress-ng, and as soon as I ran a VM stress test I got the reset (a double bus fault, I suppose). It turns out the refurb board has a bit of a problem with memory timing when both banks are in use, which only shows up under real hard memory stress, and I had to drop back to using only two DIMMs in one bank on that board; problem solved.

I had used that computer with max memory for months under some pretty high loads without realizing that there was any problem. stress said there was a problem.

Try stress testing with the stress program. (If it passes that, then you've got a software problem.)

Reference: testing server with stress-ng
Was it perchance GA-Z77N-wifi?
 
Thanks for the update. From what you're describing it seems like a single-VM issue, most likely a non-HW-related problem. Without steps to reproduce it's hard to do anything remotely.
We are getting into unsupported territory, as TrueNAS is not supported here. But may I suggest you create another TrueNAS VM and test whether that crashes too? Preferably without any private data, so you could share the vmdk of it after a crash.
I briefly checked the TrueNAS web site; there's too much eye candy there for my taste. The git repos don't have the FreeBSD src, so maybe they are using vanilla FreeBSD. But I don't know..

You mentioned it's page faulting all the time. If different programs are #PF'ing in the kernel, it may be that they are victims of something else. For example, some sort of overflow in one structure could be spilling into other data (the victim's data), and when the victim tries to access it, it gets this bogus info.
Check the backtrace of each crash; look for the first function before the trap (it was knote_fdclose() in your example). Is it always the same? What are the #PF addresses?
Martin - I really appreciate your help here, but I'm pretty sure you have overestimated my Linux ability! :)

If I understood you correctly: I have looked at each textdump tarball and extracted the msgbuf.txt from each. You said to look at the function after the first trap (knote_fdclose() in that example), so I have done this for all the crash dumps I have so far, and get this:
Crash 0
--- trap 0xc, rip = 0xffffffff80a7ae8a, rsp = 0xfffffe00a005ea30, rbp = 0xfffffe00a005ea80 ---
knote_fdclose()
Crash 1
--- trap 0xc, rip = 0xffffffff80e45720, rsp = 0xfffffe000f767a20, rbp = 0xfffffe000f767a70 ---
zone_release() at zone_release+0x170/frame 0xfffffe000f767a70
Crash 2
--- trap 0xc, rip = 0xffffffff82b438b0, rsp = 0xfffffe00e6030e10, rbp = 0xfffffe00e6030e30 ---
fletcher_4_avx2_native() at fletcher_4_avx2_native+0x40/frame 0xfffffe00e6030e30
Crash 3
--- trap 0xc, rip = 0xffffffff8298dd90, rsp = 0xfffffe00e1cef740, rbp = 0xfffffe00e1cef770 ---
dsl_scan_io_queue_destroy() at dsl_scan_io_queue_destroy+0x70/frame 0xfffffe00e1cef770
Crash 4
--- trap 0xc, rip = 0xffffffff80a7ae8a, rsp = 0xfffffe00e2ac7a30, rbp = 0xfffffe00e2ac7a80 ---
knote_fdclose() at knote_fdclose+0x13a/frame 0xfffffe00e2ac7a80

From what I can see there isn't a single common function; apart from the first and last crashes, which match, each one faulted somewhere different. Not sure why I don't have a textdump from the most recent crash.

As you say, my next plan is to install TrueNAS into a fresh VM, maybe add a single disk for now, and run it in parallel before migrating the pools over once it seems stable. Initially TrueNAS was stable; it's only in recent months that this has all started. I realise this is not a TrueNAS forum, so I really appreciate your help with these panics.

I am starting to suspect TrueNAS, so I would love to locate the issue and post a report to them.

Thanks
CC
 
Yes, it seems the crash is always on something else. Do you have the virtual address it #PFs on, such as the 0x10 in your first example? Does TrueNAS provide a setup without ZFS, i.e. where you'd rely on UFS and/or geom providers?
As it was stable before, it's worth checking back on what changed in recent weeks.
 
Was it perchance GA-Z77N-wifi?
It was a GA-Z77M-D3H, the "refurb" I got direct from China. No wifi (ethernet everywhere!). I also have an ATX GA-Z77X-D3H purchased new, and it runs like a champ with all four memory slots occupied by the fastest-supported XMP memory.
 