Debugging a kernel panic

Greetings! I recently installed FreeBSD 11 on a server machine (Xeon E3-1275v6) with ZFS as the rootfs in a 3-way mirror configuration. On the first day of use I experienced two kernel panics which are not readily reproducible. I have the kernel crash dump and the output of the kgdb backtrace command, but I don't know how to interpret it. The first panic occurred when I tried printing the kernel routing table with netstat -r, while the other one occurred when I tried running dhclient on an inactive Ethernet port (that is, there was no cable plugged in). The motherboard has two Ethernet ports, both using the igb driver. dhclient sends broadcasts on the active one but gets no reply, so I just wanted to make sure I was using the correct port by running dhclient on the other one as well. It displayed "No link" and then the kernel panicked.

I have attached the output of kgdb; any help is appreciated.
 

Attachments

  • kgdb-log.txt
    5.4 KB
This is weird; it looks like it was ntpd(8) that caused it:
Code:
current process		= 790 (ntpd)

Besides the CPU what else can you tell us about the hardware?
 
Thanks for replying! The motherboard is Intel DBS1200SPLR and it's got two 16GB Kingston KVR21E15D8/16 DIMMs (it's ECC). Maybe I should mention that one of the DIMMs is actually Kingston KVR21E15D8/16I which is identical except that it's "certified" for Intel. The store just didn't have two of those, so I got one ../16, which was still advertised as suitable for this processor.

I doubt it's of relevance but the rootfs is on three Kingston 120G HyperX SSDs and the PSU is a SeaSonic G-360W 80Plus Gold.
 
If you look at that 'current process' does each panic(9) happen for random processes? If the crashes are seemingly random I'm more inclined to suspect some bad memory. But if the crashes are always caused by the same process with a similar backtrace the issue may be driver related.
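One way to compare the 'current process' across several panics is to tally that line from the saved crash summaries. The following is only a minimal sketch: it assumes the summaries are text files named core.txt.N in /var/crash (or in a directory passed as the first argument), each containing a line like the "current process" one quoted earlier. The function name summarize_panics is made up for illustration.

```shell
#!/bin/sh
# Tally which process was current at each recorded panic.
# Assumption: crash summaries are text files named core.txt.N in the
# directory given as $1 (default: /var/crash), and each contains a line
# such as "current process = 790 (ntpd)".
summarize_panics() {
    dir="${1:-/var/crash}"
    # Pull every "current process" line, keep only the name in
    # parentheses, then count how often each name occurs.
    grep -h 'current process' "$dir"/core.txt.* 2>/dev/null |
        sed -n 's/.*(\(.*\)).*/\1/p' |
        sort | uniq -c | sort -rn
}
```

Running summarize_panics with no argument looks in /var/crash. If one process name dominates the list, that points toward a software cause; a wide spread of unrelated names fits the bad-memory theory better.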
 
So after some experimentation I figured out how to reproduce it. In short, I found two ways to get it to panic every time:

1)
Code:
service netif stop
dhclient igb<ACTIVE IF>
Meanwhile, in another terminal:
Code:
service netif start
ntpd -q
If I don't start netif again, ntpd fails with "unable to bind to wildcard address ::".

2)
Code:
service netif stop
dhclient igb<ACTIVE IF>
Meanwhile, in another terminal:
Code:
netstat -r
In both cases it doesn't panic if dhclient is running on the inactive (unplugged) interface. And it doesn't happen unless I stop netif prior to running dhclient...

The error in all cases is the same: "Fatal trap 12: page fault while in kernel mode".
 
If you look at that 'current process' does each panic(9) happen for random processes? If the crashes are seemingly random I'm more inclined to suspect some bad memory. But if the crashes are always caused by the same process with a similar backtrace the issue may be driver related.
The backtrace here seems to have the same footprints as this freebsd-hackers@ post, starting at soclose+0x3c. There may be some underlying cause common to both that bug report and this one. Certainly, usermode code shouldn't cause references to NULL + somesmallnumber (in this case, 0x17). This looks like it is supposed to be a reference to offset 0x17 in some data structure. In most of 11.x (the original poster didn't specify the exact FreeBSD version, and the SVN tag would also help), uipc_socket.c:1046 is:
Code:
error = (*so->so_proto->pr_usrreqs->pru_disconnect)(so);
I'd suggest opening a PR in category base / kern with the information in the OP as well as a link to the freebsd-hackers@ thread and this reply. If you post the PR number here, people can follow it. But we really need one of the network stack developers to look at it.
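Since the exact FreeBSD version was not given, the version details could be collected for the PR with something like the sketch below. It uses freebsd-version(1) when it is available and uname(1) otherwise; the wrapper name collect_version_info is made up for illustration, and the output is meant to be pasted into the report as-is.

```shell
#!/bin/sh
# Print the version details worth including in a bug report.
# collect_version_info is a hypothetical helper name.
collect_version_info() {
    # freebsd-version -ku prints the installed kernel and userland
    # versions; skip it quietly on systems that do not have it.
    if command -v freebsd-version >/dev/null 2>&1; then
        freebsd-version -ku
    fi
    # uname -a includes the kernel build string of the running kernel.
    uname -a
}
```

For example, collect_version_info > version.txt saves the output to a file that can be attached next to the kgdb log.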
 
In short, I found two ways to get it to panic every time
That will be extremely helpful. It's so much easier to debug issues if you know of a way to reproduce the problem. As Terry_Kennedy noted, it's probably time to create a PR for it. Definitely mention how to reproduce the panic.
 