Zombie Kernel (reporting its own crash via network)

Unbelievable...
I have an xterm open, ssh'd into a remote system, running tail -f /var/log/messages.

This appeared in the xterm:
Code:
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] Fatal trap 12: page fault while in kernel mode
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] cpuid = 2; apic id = 02
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] fault virtual address = 0x0
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] fault code            = supervisor read instruction, page not present
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] instruction pointer   = 0x20:0x0
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] stack pointer         = 0x28:0xfffffe02e0ce98f8
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] frame pointer         = 0x28:0xfffffe02e0ce9910
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] code segment          = base 0x0, limit 0xfffff, type 0x1b
Oct  3 23:40:59 <kern.crit> edge kernel: [718015]
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] Fatal trap 12: page fault while in kernel mode
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] cpuid = 4; apic id = 04

So, it was sent to the kernel logging (718015 is the uptime), grabbed by syslog, written to /var/log/messages, grabbed by tail, handed to sshd, encrypted, written to the socket, packaged into mbuf(s), routed, checked through the firewall rules, handled by the netgraph bridge, given to the iface device driver, and sent to my desktop. All with a crashed kernel.
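
The user-visible pieces of that chain are just stock tooling. A minimal sketch, assuming the FreeBSD default syslog.conf (the catch-all selector is abbreviated here) and "edge" being the remote host from the log above:

Code:
# /etc/syslog.conf -- the default catch-all rule routes kernel messages
# (among others) into /var/log/messages:
#   *.notice;authpriv.none;kern.debug;...      /var/log/messages

# on the desktop: follow the remote log live over ssh
ssh edge tail -f /var/log/messages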

In the logfile there is actually a bit more; it goes on:

Code:
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] fault virtual address = 0x0
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] fault code            = supervisor read instruction, page not present
Oct  3 23:40:59 <kern.crit> edge kernel: [718015]                       = DPL 0, pres 1, long 1, def32 0, gran 1
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] instruction pointer   = 0x20:0x0
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] stack pointer         = 0x28:0xfffffe02e0d0c8f8
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] frame pointer         = 0x28:0xfffffe02e0d0c910
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] code segment          = base 0x0, limit 0xfffff, type 0x1b
Oct  3 23:40:59 <kern.crit> edge kernel: [718015]                       = DPL 0, pres 1, long 1, def32 0, gran 1
Oct  3 23:40:59 <kern.crit> edge kernel: [718015] processor eflags      = interrupt enabled, resume, IOPL = 0
Oct  4 00:27:44 <kern.info> edge syslogd: kernel boot file is /boot/kernel/kernel
Oct  4 00:27:44 <kern.crit> edge kernel: ---<<BOOT>>---

Problem was, I had accidentally overwritten /boot/kernel/kernel. Normally this should do no harm. When I noticed the problem, I hit ^C, deleted /boot/kernel and moved /boot/kernel.old back - with the result that at reboot both were gone, and it was a bit of a hassle to get the system running again.

I also got a crashdump, and it looks like a known problem - the iface refcounts can go wrong when moving interfaces into and out of vnet jails. I have already complained about it on occasion in the bugtracker, but nobody cares :(

Code:
[718015] current process                = 7957 (rtadvd)
[718015] trap number            = 12
[718015] panic: page fault
[718015] cpuid = 4
[718015] time = 1727991659
[718015] KDB: stack backtrace:
[718015] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe02e0d0c6c0
[718015] vpanic() at vpanic+0x152/frame 0xfffffe02e0d0c710
[718015] panic() at panic+0x43/frame 0xfffffe02e0d0c770
[718015] trap_fatal() at trap_fatal+0x389/frame 0xfffffe02e0d0c7d0
[718015] trap_pfault() at trap_pfault+0x46/frame 0xfffffe02e0d0c820
[718015] calltrap() at calltrap+0x8/frame 0xfffffe02e0d0c820
[718015] --- trap 0xc, rip = 0, rsp = 0xfffffe02e0d0c8f8, rbp = 0xfffffe02e0d0c910 ---
[718015] ??() at 0/frame 0xfffffe02e0d0c910
[718015] sysctl_iflist() at sysctl_iflist+0x173/frame 0xfffffe02e0d0cb00
[718015] sysctl_rtsock() at sysctl_rtsock+0x35d/frame 0xfffffe02e0d0cbe0
[718015] sysctl_root_handler_locked() at sysctl_root_handler_locked+0x97/frame 0xfffffe02e0d0cc30
[718015] sysctl_root() at sysctl_root+0x2b3/frame 0xfffffe02e0d0ccb0
[718015] userland_sysctl() at userland_sysctl+0x15e/frame 0xfffffe02e0d0cd50
[718015] sys___sysctl() at sys___sysctl+0x60/frame 0xfffffe02e0d0ce00
[718015] amd64_syscall() at amd64_syscall+0x106/frame 0xfffffe02e0d0cf30
[718015] fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe02e0d0cf30
[718015] --- syscall (202, FreeBSD ELF64, __sysctl), rip = 0x216847a072aa, rsp = 0x216845a7dd18, rbp = 0x216845a7dd50 ---
[718015] Uptime: 8d7h26m55s
[718015] Dumping 13722 out of 81790 MB:
 
Thanks jbo@


Please see my signature, "I provide screenshots to satisfy people who can not tolerate links.".


I have already complained about it on occasion in the bugtracker, but nobody cares :(

I would like to see the report(s). Thanks.
 
I think/assume it's Cath O'Deray's way of saying: "Can you please share the link to the PR(s)".
Well, I don't assume - I have issues to solve. Lots of them.

I thought the bugtracker has fine search capabilities. So would it really be too difficult to use them, put pmc@ into the search field, and scan for any comments? I would have to do exactly the same. Also, I consider it very unlikely that actual progress will come out of this, and I am not fond of discussing things that will not change from being discussed.
 
How much substantial information could be written, if we only had those bytes that are wasted on coloured pictures?

If only.

If ifs and ands were pots and pans, there'd be no work for tinkers' hands.

:) I was grahamperrin. Still grahamperrin elsewhere. I became a cathode ray tube only in The FreeBSD Forums.

/members/grahamperrin.35084/ had linkers' hands. That ship has sailed, and what preceded his departure is no longer open to discussion.

Bon voyage :-)

/members/cath-oderay.35084/ has no such hands. This disability is not a hindrance to quietly, methodically, stumpily, cathodically removing the perceived damage done by her predecessor's hands.
 
Thanks, I'm now (at least) on the cc list there.
Well, there are apparently a few corner cases with this mechanism. But my hopes are not high - because when things fail, the internal housekeeping could have been disturbed long before, and one won't see the real cause from a crash dump.
So that is rather a matter of logical verification - and when I mentioned such in other cases, it was already frowned upon twenty years ago. It seems not to be something people enjoy doing...

/members/cath-oderay.35084/ has no such hands. This disability is not a hindrance to quietly, methodically, stumpily, cathodically removing the perceived damage done by her predecessor's hands.
I'm sorry, I don't understand this.
I have, however, noticed the phenomenon of strange and sudden gender changes among renowned magickians (name the Wachowskis, name Genesis P-Orridge, and others), and as there seems to be a pattern, my agency has decided to create an X-file, to investigate further. But this is entirely outside the scope of this forum.
 
So, it was sent to the kernel logging (718015 is the uptime), grabbed by syslog, written to /var/log/messages, grabbed by tail, handed to sshd, encrypted, written to the socket, packaged into mbuf(s), routed, checked through the firewall rules, handled by the netgraph bridge, given to the iface device driver, and sent to my desktop. All with a crashed kernel.
Somehow I have a very hard time believing that that is what actually happened, sorry.
More believable story:
  • the kernel messages got written to the msgbuf (in RAM) before the (post-panic) reboot;
  • your system works in such a way that it does not clear RAM on a warm reboot;
  • syslogd picked up those messages after the reboot and wrote them to /var/log/messages;
  • only then did you connect and observe the logs.
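
A hedged way to check that theory on a FreeBSD box (both knobs exist in stock FreeBSD; whether pre-reboot contents survive depends on the RAM region validating when the buffer is reinitialized):

Code:
# show the full kernel message buffer; after a warm reboot this can
# still contain messages from before the reboot
dmesg -a

# the raw buffer is also exposed as a sysctl
sysctl -n kern.msgbuf | tail

# writing 1 here clears the buffer on a running system
sysctl kern.msgbuf_clear=1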
 
Somehow I have a very hard time believing that that is what actually happened, sorry.
My point exactly. I have never seen such a thing before.

More believable story:
  • the kernel messages got written to the msgbuf (in RAM) before the (post-panic) reboot;
  • your system works in such a way that it does not clear RAM on a warm reboot;
Not possible. It did power off: kern.poweroff_on_panic=1. [2]

  • syslogd picked up those messages after the reboot and wrote them to /var/log/messages;
  • only then did you connect and observe the logs.

Timeline
  1. I was at the desktop. I had noticed that some of my jails (on the backend) had obsolete libraries that weren't updated over one or two release cycles.[1] I had just rebuilt world for them and wanted to install it. Accidentally I installed to the host, and the kernel got overwritten. I moved the kernel back, then ran the installation to the jails. The jails should have stopped then - the first three did, then things halted and no xterm accepted any input. The xterm with tail -f /var/log/messages displayed the crash message.
  2. I go to the hall, switch the monitor on, and there is the message "automatic reboot in XXX seconds, press key to reboot"
  3. I press enter
  4. Another message appears "press key to reboot"
  5. I press enter again
  6. Only then(!) the machine powers off.[2]
  7. On boot the machine fails to load the kernel
  8. I switch to kernel.GENERIC (see the loader sketch below this timeline), do fsck, and find that kernel *and* kernel.old are gone (probably to lost+found).
  9. I search for the proper zpool where the src+obj is, decrypt and run installkernel
  10. reboot
  11. It doesn't start up, doesn't find the zfs pools[3]
  12. I reimport all of them and reboot
  13. orderly startup, finally
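
Regarding step 8: booting an alternate kernel needs no working /boot/kernel at all; it can be done from the loader prompt. A rough sketch, assuming a GENERIC kernel is still present under /boot/kernel.GENERIC (see loader(8) for the exact syntax):

Code:
# escape to the loader prompt from the boot menu, then:
OK unload
OK load /boot/kernel.GENERIC/kernel
OK boot
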
[1]
That turned out to be a flaw in META_MODE: if the target is configured not to install the development tools (aka WITHOUT_TOOLCHAIN), then META_MODE does not detect a new compiler and considers the libraries up to date, although every single object shows that it was built with clang-17.
/usr/src/UPDATING hints that, on a release upgrade (e.g. 13.3->13.4), one must do buildworld, buildkernel, installkernel, installworld, and then once more buildworld, buildkernel, installkernel, installworld - because only in the second cycle will the objects actually be built with the new compiler.
For the first cycle I do not use META_MODE, because it is technically impossible to recreate correct file timestamps from the commit log when switching branches. (In a different branch a commit can have happened earlier and still be different - and that change would go undetected.)
For the second cycle I supposed that META_MODE should be possible again - but in fact it does not recognize a changed compiler.
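
For context, a minimal sketch of the META_MODE setup under discussion; the knobs are the documented ones, and the comment spells out the flaw described above:

Code:
# /etc/src-env.conf -- enable META_MODE for buildworld
WITH_META_MODE=yes

# META_MODE records build dependencies via filemon(4)
kldload filemon

# The flaw: with WITHOUT_TOOLCHAIN=yes in the target's src.conf,
# a changed compiler alone does not invalidate the recorded .meta
# files, so the stale libraries are considered up to date.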

[2]
The machine must power off after a crashdump, because temperature cannot be contained without fan steering: there may be nobody present on location to enter a startup password, and without a working OS the thermal steering cannot run. The cores should protect themselves, but disks and other hardware will run away. (The thermal steering software monitors all temperatures, puts unused disks into spindown, limits the uncore frequency, throttles compute via rctl, suspends scrubs, and switches the fan arrays.)
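
For illustration, the individual actions are all stock FreeBSD commands; a hedged sketch (device names, jail name, and limits are placeholders, not the actual configuration):

Code:
# read a core temperature (coretemp(4)/amdtemp(4))
sysctl dev.cpu.0.temperature

# spin down an idle disk
camcontrol standby ada3

# cap the CPU frequency
sysctl dev.cpu.0.freq=1200

# throttle a compute jail via rctl(8)
rctl -a jail:buildjail:pcpu:deny=50

# pause a running scrub
zpool scrub -p tank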

[3]
All or nothing: when a zpool gets imported in single-user mode, only that zpool will be known on the next multiuser boot; the others are forgotten.
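
The usual mechanism behind this is the cache file: zpool.cache reflects the pools imported at that moment, so an import in single-user mode rewrites it with only that pool. A hedged sketch of the remedy (pool names are placeholders):

Code:
# in single-user: import everything that is visible
zpool import -a

# make sure each pool is recorded in the cache file read at boot
zpool set cachefile=/etc/zfs/zpool.cache tank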
 
PMc is this interesting behavior reproducible?
E.g., with sysctl debug.kdb.trap=1.

One thing that stands out is that there were two traps "at the same time" on different CPUs.
Perhaps, there is some regression where interrupts are not disabled and CPU cores are not parked, so the system keeps working (although, in a potentially messed up, unsafe state) while the thread that crashed is waiting for reboot confirmation.

BTW, you probably want to disable that confirmation prompt given the overheating problem.
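
For the record, a hedged sketch of how such a reproduction attempt could look on a scratch machine (debug.kdb.trap and debug.kdb.panic exist in stock FreeBSD; do not run this on production):

Code:
# make sure crash dumps land somewhere first (placeholder device)
dumpon /dev/ada0p3

# panic via a page fault in kernel mode, as suggested above
sysctl debug.kdb.trap=1

# or trigger a plain panic
sysctl debug.kdb.panic=1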
 
PMc is this interesting behavior reproducible?
E.g., with sysctl debug.kdb.trap=1.
I don't think so, and I have never seen it before. All the evidence shows now is that it is possible.

Problem is, this is my backend machine, and it runs all the infrastructure. Investigative kernel crashing is a bit disruptive there.
Trying it on a test system raises the question of which components/configurations are relevant - otherwise anybody could just try it on any metal.

One thing that stands out is that there were two traps "at the same time" on different CPUs.
Yepp, noticed that. But the second trap hindered the first one from telling us which process caused it.
The second one was rtadvd, which makes sense, because rtadvd feeds the ng_eiface(s) that connect the jail(s), and there are always complaints in that area, like so:
Code:
rtsold[3945]: <rtsock_input_ifannounce> interface nadmn1l removed
rtadvd[8072]: <rm_ifinfo_index>: ifinfo not found (idx=12)
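
Those nodes can be inspected with the stock netgraph tooling; a quick sketch (ngeth0 is the default eiface node name, adjust to the actual node):

Code:
# list all netgraph nodes; the jail uplinks show up as eiface nodes
ngctl list

# show one node and its hooks
ngctl show ngeth0: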

Perhaps, there is some regression where interrupts are not disabled and CPU cores are not parked, so the system keeps working (although, in a potentially messed up, unsafe state)
I understand. That might be a problem during development. It is not a problem here, because in production mode a crash shows that the system was already in some messed-up state anyway.

BTW, you probably want to disable that confirmation prompt given the overheating problem.
Yes, indeed. I thought I had achieved that, as on some occasions it did not reach that prompt and powered off right away.

Now, checking the code, it seems all is actually correct: after PANIC_REBOOT_WAIT_TIME expires (which is intended, because it should give me time to get to the hall and read the screen), if there were no keypress, it would continue and probably reach the poweroff.
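
A hedged sketch of the knobs involved, for anyone with the same concern (kern.poweroff_on_panic is quoted from above; kern.panic_reboot_wait_time may not exist on older versions, check sysctl -d on your system):

Code:
# /etc/sysctl.conf
# power off instead of rebooting after the dump
kern.poweroff_on_panic=1

# seconds to wait before the automatic reboot/poweroff
# (0 = don't wait, -1 = wait forever for a keypress)
kern.panic_reboot_wait_time=16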
 