Post the conditions under which your kernel crashes, and why.

I recently happened to involuntarily "reproduce" this crash on 13.1-p2 - https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263505
Admittedly, not anything one happens to do on a daily basis. I just happened to try adding a vlan to wlan0 and got a complete system freeze on the first try and a reboot on the second.

The handbook even devotes a full chapter to the topic of debug dumps (https://docs.freebsd.org/en/books/developers-handbook/kerneldebug/).

In my case, I didn't investigate much further since by accident I had seen that bug on bugzilla a couple of days before. I suppose crashinfo(8) could have helped troubleshoot, though, and might be a pointer for Alain De Vos?
 
I'm able to crash the kernel under the following conditions:
-Run Wayland
-Build ports simultaneously with poudriere
-Play a youtube video with falkon browser
Sometimes the kernel crashes ...
 
Perhaps more interesting is the Miscellaneous Beasts:
 
This one looks rather jolly.
[Attached image: 944914CB-AE7A-491D-8303-51AE178DD79F.jpeg]
 
Using a ggated device as a zpool vdev crashes my host under moderate load, but I haven't tried this config since 13.0.
 
Since this is starting to drift off topic, shouldn't we also add a "nice try, NSA, but start looking elsewhere for zero day entry points?" :)
 
I'm able to crash the kernel under the following conditions:
-Run Wayland
-Build ports simultaneously with poudriere
-Play a youtube video with falkon browser
Sometimes the kernel crashes ...
Thank you. I have many crashes, especially when building ports. No one has ever said before that building in parallel causes this. I will try with just one build at a time :)

For all those of you who have never seen a crash: Try some new hardware. My i7-12700K with 64 GB RAM on 4 sticks and an ASUS board. Lots of problems = a lot to learn and understand.
 
if (user == "alain" && isFullMoon()) crash(E_CRASH_TYPE::HARD);
:)
We had a customer who had cascading NFS mounts in his environment. Dealing with issues there was a nightmare, especially on prod, where you really didn't want to lose access to those shares. We had to call my colleague, who fixed it because for some reason "his commands" worked. We did the same thing he did, but it didn't work.
We joked that those commands and kernel modules must have a uid check. :)
 
For all those of you who have never seen a crash: Try some new hardware. My i7-12700K with 64 GB RAM on 4 sticks and an ASUS board. Lots of problems = a lot to learn and understand.
Within the context of the FreeBSD community known to me, I tend to be the guy with the "latest hardware" more often than not. In fact, I usually get a lot of comments along the lines of "just wait a year".

But nevertheless: Not a single FreeBSD kernel crash. Not once. Not on desktop hardware, not on server hardware, not on laptops...

That being said, the Alder Lake architecture does seem to need some work to make it work as intended on FreeBSD. However, looking at how different Alder Lake is from "everything we've seen before" (in recent years), I wouldn't exactly use that as an argument in this particular conversation.
 
I looked through my system administration logs, which have been kept online since 2012:
  • 20120602: Running smartmon on the Seagate disk causes the kernel to hang for one minute, but then comes back to life. Repeatable, ended up making sure that the Seagate disk is never touched by SMART.
  • 20121227: System spontaneously crashed at 2am, when "zpool scrub" starts on a disk connected by USB-3. Cause unknown. No disk or USB error messages on the console. Did not reboot, system remained down until we came back from Christmas vacation and I power cycled it.
  • 20130105: Crashed by pulling out a PCI card without shutting down the system first. Oops. Don't do that.
  • 20131207: I had been plugging in SD cards in mass production to read photos from old cameras; after a few hours of that, the system disk (!) caused a crash with error message "ata3: timeout waiting to issue command" and "ata3: error issuing READ_DMA48 command" and "(ada0:ata3:0:0:0): lost device". Without the boot/root disk ada0, the system stayed up (kernel running), but was completely unusable.
  • 20140203: System hung with "ata0: already running", at 3am. No crash messages, but dead as a doornail. Power-cycled in the morning.
  • 20140212: Around 1PM, machine crashed or hung. Would not reboot. Some disk making strange noises. Eventually, tracked it down to a Seagate disk that when plugged in causes absolutely nothing to work (not even the BIOS), while making noises like a grinder. Threw the disk in the trash and bought a Hitachi.
  • 20140223: Somewhere between 2am and 5am, machine crashed hard. Symptoms: No network, no reaction from the keyboard (not even scroll lock, caps lock, or ctrl-alt-del), disk light on solid. Rebooted, works now.
  • 20140319: System crashed, bottom of screen said "rebooting", then came back, worked fine. Unknown cause.
  • 20140331: System crashed around 2:40am. Managed to destroy the BIOS settings (!), which I had to manually restore. Now the printer doesn't work any more (it's connected via parallel port).
  • 20161013: System crashed this morning exactly at 7AM. No idea why; /var/log/messages says "Fatal trap 12: page fault while in kernel mode" and nothing else. Probably caused by backup trying to run.
  • 20170106: Inadvertent crash due to messing up power cords while replacing UPS battery.
  • 20170110: Running a user-mode program that uses POSIX aio...() calls by the thousand causes a kernel crash. Since that is an unrealistic use case, ignored it.
  • 20171215: I deleted /var/crash/vmcore.0 (was using too much disk space), but then the next reboot crashed immediately and created a new core file. The next reboot succeeded. Not reproducible.
  • 20190720: Computer became unresponsive; running services were still running, but no login possible, and no new processes could be started. Eventually ended up rebooting by cutting power. Afterwards, found that I had forgotten to set kern.kstack_pages=4 in loader.conf (which is needed for ZFS on low-memory machines), so probably this was a self-inflicted injury.
  • 20200316: Kernel crash, message on console "ada1 already running". After reboot, one of the disks is missing. Had to open the machine and reseat power cables. Found an enormous amount of dust in the machine.
To explain things: This machine used to have two Seagate Barracudas, which are crap and ticking timebombs. At night, starting at 1am, I run maintenance jobs, such as "zpool scrub", big backups, and log file moving, so the disks are busiest in the middle of the night. The system is protected against power outages with a UPS, so uptime is usually months.

What do I conclude from the above? Nearly all problems are caused by disk interfaces, which are perfectly capable of bringing the system down, both in hardware and in kernel space. Having built large disk systems, I know this is not inherently a question of bugs in the kernel, but typically of lower-level things (firmware in the HBAs, miscommunication between disk/HBA/kernel developers) that blow everything up. Another handful are user errors, including really dumb things like pulling the power cord when you didn't intend to. But buried in the above ~15 crashes in 10 years are at least two real kernel bugs that caused crashes under extreme workload. I think of all of those, only one (abuse of aio... calls) was even reproducible.

In addition, there were about 50-100 crashes of vitally important user space programs, such as my backup, my equipment monitoring, apache, and Berkeley DB. So I can gratuitously round as follows: Of the ~100 outages in ~10 years, 80% are caused by userspace programs, 18% by the disk IO system below the kernel or the wetware doing something dumb, and 2% by the kernel itself.
 
My (*-RELEASE) kernel doesn't crash.

The closest I've seen to a crash was a deadlock in the vfs which forced me to reboot the machine...
Deadlocks are worse than panics. At least with a panic the box reboots without having to go downstairs to punch the reset button. Panics also provide register dumps which are much easier to debug, because we're walking the stack through a backtrace. The source of a deadlock could have occurred seconds, minutes or even hours previously, and deadlocks (or, as we in the IBM mainframe world called them, deadly embraces) leave you with little or no information.

When I was an MVS systems programmer, a subsidiary of ours did development on a production mainframe; the company made that decision because they didn't want to spend millions on another mainframe. The developers had a small bug in their kernel code: they were writing software to support a hardware vendor's yet-to-be-announced tape robot. The IBM mainframe has no stack, so the standard was to dynamically allocate (GETMAIN, aka malloc()) a 72-byte save area to save registers. Just prior to return(), the function would restore the caller's registers and free (using FREEMAIN, aka free()) the 72-byte save area. Save areas form a linked list, giving you something like a stack without a stack (there is no such thing as a stack exploit on the IBM mainframe), and they are stored on what we in the UNIX world call the heap.

Well, with this bug their code freed 144 bytes, or two save areas' worth of memory, while in kernel state. The problem didn't bite us until the next morning when people started to log in. The kernel would GETMAIN (malloc()) memory that was in fact still in use but was considered free by the kernel's memory management, because 144 bytes had been freed instead of the intended 72. This caused a deadlock because the freed memory held linked lists used to manage page tables (DAT; google dynamic address translation). The affected control blocks contained a lock, which would be taken using a spin loop (a compare and swap instruction; we have the same kind of instructions on Intel and ARM).

To make a long story short, the CPU went into a tight loop (a compare and swap spin loop) waiting for a resource that no longer existed: it had been overwritten by something else, because memory management believed the memory was free, because it had been erroneously freed (144 bytes instead of 72) many (up to 8) hours before by the kernel modules the developers were testing.
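
A compare and swap spin loop of the kind described here can be sketched in C11. This is only an illustration under stated assumptions: `struct control_block`, `cb_try_lock`, `cb_lock_bounded` and `cb_unlock` are hypothetical names, not MVS or FreeBSD interfaces, and the loop is bounded so that the sketch terminates, whereas the loop in the story had no bound.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical control block guarded by a one-word lock (0 = free, 1 = held). */
struct control_block {
    atomic_int lock;
    int payload;
};

/* One compare-and-swap attempt: take the lock only if the word is exactly 0. */
bool cb_try_lock(struct control_block *cb)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&cb->lock, &expected, 1);
}

/*
 * Spin until the lock is taken or max_spins attempts have failed.
 * The bound is an assumption added for this sketch; an unbounded loop
 * would spin forever if the lock word never returns to 0.
 */
bool cb_lock_bounded(struct control_block *cb, unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++) {
        if (cb_try_lock(cb))
            return true;
    }
    return false;
}

/* Release the lock by storing 0 back into the lock word. */
void cb_unlock(struct control_block *cb)
{
    atomic_store(&cb->lock, 0);
}
```

If unrelated data overwrites the lock word (for example `atomic_store(&cb.lock, 0x5A5A)`), `cb_lock_bounded` fails no matter how long it spins, which is roughly the state the CPU was stuck in.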

In UNIX, all we need to pass to free() is an address. Both malloc(3) in userspace and malloc(9) in the kernel keep track of not just the address but also the length that was malloc()ed, so memory management does all the rest. On the mainframe, the programmer passes FREEMAIN not only the address of the memory to be freed but also the length. This is significantly more dangerous, because one little mistake like the one above can lead to a deadlock. The programmer must keep all this minutiae in mind while programming (in assembler), especially when doing work in the kernel.
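
The difference between the two release styles can be sketched with a toy allocator in C. Everything here is hypothetical and simplified: `toy_malloc`/`toy_free` stand in for an allocator that records the length itself, `toy_getmain`/`toy_freemain` stand in for one that trusts the caller's length, and `bytes_in_use` is a made-up accounting counter. None of this is how malloc(3), malloc(9) or the real GETMAIN/FREEMAIN are implemented.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header kept in front of each block; the union keeps the payload aligned. */
union block_header {
    size_t length;
    max_align_t align;
};

/* Bytes the toy allocators currently believe are in use. */
size_t bytes_in_use = 0;

/* UNIX style: the allocator records the length next to the block. */
void *toy_malloc(size_t length)
{
    union block_header *h = malloc(sizeof(*h) + length);
    if (h == NULL)
        return NULL;
    h->length = length;
    bytes_in_use += length;
    return h + 1;               /* caller sees only the payload */
}

/* Only the address is needed; the length comes from the header. */
void toy_free(void *p)
{
    if (p == NULL)
        return;
    union block_header *h = (union block_header *)p - 1;
    bytes_in_use -= h->length;
    free(h);
}

/* Caller-supplied-length style: nothing is recorded at allocation time. */
void *toy_getmain(size_t length)
{
    void *p = malloc(length);
    if (p != NULL)
        bytes_in_use += length;
    return p;
}

/* The accounting trusts claimed_length, right or wrong. */
void toy_freemain(void *p, size_t claimed_length)
{
    bytes_in_use -= claimed_length;
    free(p);
}
```

With two 72-byte areas obtained from `toy_getmain`, calling `toy_freemain(a, 144)` on the first one drops `bytes_in_use` to 0 while the second area is still live. That is the bookkeeping error from the story in miniature; `toy_free` cannot make that mistake because the caller never supplies a length.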

It took weeks of sifting through kernel dumps to find the root cause of this deadlock. Fortunately the subsidiary's code left some eye catchers when it wrote its data to memory. Eventually we simply had to ask. They looked through their code and, embarrassed, fixed the bug. But only after the company had lost a significant amount of money due to performance penalties.

Deadlocks are hard to diagnose, debug, and fix. I'd rather have a panic with a register dump telling me, "there's the problem, go fix it."

Ever since my ports bit was upgraded to include src, I run -CURRENT everywhere. Each of my machines has an alternate boot partition, a FreeBSD-CURRENT I can fall back to in case anything goes horribly wrong, plus an external USB boot disk for the extremely rare circumstance when nothing else will work. -CURRENT has been stable for me, and I've never needed the alternate boot partitions for recovery except when I've shot myself in the foot, which was my own fault. The last time my -CURRENT panicked was, again, my fault, playing around with some half-baked thing.

My uptimes are generally low since I installworld/installkernel quite regularly.

FreeBSD is simply rock solid and stable. Even -CURRENT.
 
Deadlocks are worse than panics.
Of course they are, I didn't qualify anything here. Just saying my (RELEASE) kernel never crashed, in many years ;) 11-CURRENT panicked a few times on me, newer -CURRENT versions indeed only when I messed up something myself.

At least with a panic the box reboots without having to go downstairs to punch the reset button.
Not for me, as I use GELI on my private server and need to type the passphrase on the serial console. But:

Panics also provide register dumps which are much easier to debug, because we're walking the stack through a backtrace. The source of a deadlock could have occurred seconds, minutes or even hours previously, and deadlocks (or, as we in the IBM mainframe world called them, deadly embraces) leave you with little or no information.
This is certainly true. I didn't even bother trying to find out what happened. Somehow, the vfs subsystem seemed to be in a deadlocked state, so disk I/O was dead, while network socket I/O still worked (which is useless when you need the disk for any command typed over SSH).

Fortunately, it only happened twice, while my server was running 13.0. It seems to be gone in 13.1 :cool:
 
Ignoring the crashes when writing device drivers, I have not seen real trouble since 7.x, when USB support was pretty sketchy, so unplugging a mounted stick would crash the kernel, and unmounting XFS drives would cause a crash.
 
The point is not whether FreeBSD is stable or not. It for sure is when it is. It is also about configuration and hardware, and how sensitive FreeBSD is to those things. Right now my machine has become much more stable, partly because of Alain De Vos mentioning Poudriere crashing the system when building multiple packages at a time. THANK YOU, sir. Since I set it to one at a time, no crashes. Before that it happened around every hour or two. I have also set it to build everything in RAM, so it is actually faster too.

The other part was configuring the BIOS on my system with 4 sticks of DDR4 RAM. Setting the speed manually did make the system stable. With XMP, or left unset (so auto 2133), it was unstable; setting it explicitly was needed. So I have now enabled the P-cores on my i7-12700K, and have been compiling and using the system for 12 hours without a crash. That is a record for this setup. I will update my post about Alder Lake (in the Install forum thread) if it is still running stable in a couple of days.
 
If poudriere is making the kernel crash, this sounds like someone needs to file a PR. But no one has, as far as I know, which makes me suspicious as to where the problem actually lies.

Fiddling with speed settings to make a system stable tells me the system isn't stable, which has nothing to do with FreeBSD. What were the settings beforehand? Were they changed and then the system started crashing? And now one is blaming FreeBSD's kernel?

Inquiring minds want to know. Or not. Fiddling with clock timing requires informed minds. When I designed motherboards, a lot of thought and testing was put into what would work reliably, and you didn't touch what I put in there. Any changes were just guesswork on the part of the user (not that you could, because it wasn't an option).
 