Solved FreeBSD 11.1 has begun to crash accidentally - how to debug?

Petr Fischer · Jan 9, 2018

Hello, FreeBSD 11.1 (somewhere after 11.1-RELEASE-p1) has begun to crash occasionally. I am on Xorg + i3 (often in Firefox) when Xorg suddenly crashes; I can see a few lines on the console for a fraction of a second, and then the laptop restarts.

How can I find out what happened? There is nothing interesting in /var/log/messages.
Is there any way to debug these crashes?

I am using the official stock kernel and the freebsd-update script for base updates.

Thanks very much!

HW: Toshiba Z30 laptop, Intel i5-4210U (Haswell), Intel HD integrated graphics.

Code:

$ freebsd-version -ku
11.1-RELEASE-p4
11.1-RELEASE-p6

ShelLuser · Jan 10, 2018

One option which I'm familiar with is to specify dumpdev in /etc/rc.conf (see also /etc/defaults/rc.conf as well as rc.conf(5)).

Depending on your filesystem this may or may not be usable.
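
A minimal sketch for reading back what dumpdev is currently set to, in case it helps to confirm the setting took effect. The helper name get_dumpdev is made up for this example and only looks at the one file it is given; on FreeBSD itself, sysrc -n dumpdev should answer the same question without any hand-rolled parsing.

```shell
# Hypothetical helper: print the dumpdev value from an rc.conf-style file.
# If the variable is assigned more than once, the last assignment is used,
# which matches how the shell would evaluate the file.
get_dumpdev() {
    rcfile="${1:-/etc/rc.conf}"
    [ -r "$rcfile" ] || return 1
    sed -n 's/^[[:space:]]*dumpdev=//p' "$rcfile" \
        | tail -n 1 \
        | sed 's/[[:space:]]*#.*$//' \
        | tr -d "\"'"
}

# Usage:
#   get_dumpdev                 # reads /etc/rc.conf
#   get_dumpdev /path/to/file   # reads another rc.conf-style file
```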

ralphbsz · Jan 10, 2018

How do you start X? If you do it manually from the console, try: "startx >& /home/myusername/x.log &", to save the messages from X that happen right before restart.
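
Note that ">&" is csh/tcsh syntax (bash accepts it too); in plain sh the equivalent is "> file 2>&1". A small sketch, assuming an sh-style shell, that also prefixes each line with a timestamp so the last messages before a crash can be matched against the time of the reboot. The function name stamp_lines and the log file name are only examples.

```shell
# Prefix every line read from stdin with the current date and time.
stamp_lines() {
    # The "|| [ -n ... ]" part keeps a final line that lacks a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s %s\n' "$(date '+%b %d %H:%M:%S')" "$line"
    done
}

# Usage (sh/bash syntax; "x.log" is just an example file name):
#   startx 2>&1 | stamp_lines >> "$HOME/x.log" &
```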

Petr Fischer · Jan 10, 2018

I enabled dumpdev and startx logging; I will see what they show next time it crashes. Thanks.

Snurg · Jan 10, 2018

Have you already considered thorough testing with memtest86?

Petr Fischer · Jan 10, 2018

@Snurg - I had the same suspicion about the RAM, so I took out the additional 8GB module I bought recently. With just the one original 8GB module it crashes too. I will try a live admin Linux distro on USB with memtest86 included, thanks.

SirDice · Jan 10, 2018

Petr Fischer said:
I had the same suspicion about the RAM, so I took out the additional 8GB module I bought recently. With just the one original 8GB module it crashes too.

Perhaps it's the old one that's broken? Have you tried swapping the old with the new one?

Petr Fischer · Jan 10, 2018

Ohhh OK, screwdriver again!

Petr Fischer · Jan 11, 2018

status: Memtest86 OK (overnight). Waiting for next crash.

SirDice · Jan 11, 2018

There's nothing worse than a crash that can't be reproduced easily. If you can reproduce a crash by issuing a certain command or getting the system into a certain state, it's typically a lot easier to solve. Random crashes are the worst because you don't know where to begin. Identifying the culprit is usually the hardest part; once you know where the issue is, it's generally easy to fix.

max21 · Jan 11, 2018

My FreeBSD-11 stable in vBox rebooted on me twice in one week for no reason at all. This was about a day after my latest SVN update. The second time it happened I restored from the oldest backup that I could find (under a month old, I think) and it has not happened again since. I'm keeping an eye on the CPU temp because I can hear the computer running these days. It needs the dust blown out of it. Strange coincidence. I wonder what patch level (p) this #0 r326742 is. Now I think I'll stick with it until FreeBSD-11.2 JIC.

Code:

(~) uname -a
FreeBSD web.devel.local 11.1-STABLE FreeBSD 11.1-STABLE #0 r326742: Sun Dec 10 19:07:58 UTC 2017     root@web.devel.local:/usr/obj/usr/src/sys/GENERIC  amd64
(~)

SirDice · Jan 11, 2018

max21 said:
I wonder what patch level (p) this #0 r326742 is.

-STABLE doesn't use the -p notation, it only keeps track of the revision (that's what the r number is). Revisions are constantly pushed so there's never a 'fixed' patch point. Security advisories typically mention the minimal revision number for -STABLE (you need to have revision number X or higher). The #0 is the number of times you've built the kernel after a make clean.
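
To make the two fields concrete, here is a small sketch that pulls the build counter and the revision out of a version string like the one in the uname output above. The function name parse_stable_version is invented for this example, and the pattern only assumes the "#N rNNNNNN:" layout seen in that output.

```shell
# Extract the build counter (#N) and the revision (rN) from a
# uname-style version string. Prints nothing if the pattern is absent.
parse_stable_version() {
    printf '%s\n' "$1" \
        | sed -n 's/.*#\([0-9][0-9]*\) r\([0-9][0-9]*\):.*/build=\1 revision=\2/p'
}

parse_stable_version "FreeBSD 11.1-STABLE #0 r326742: Sun Dec 10 19:07:58 UTC 2017"
# prints: build=0 revision=326742
```

A security advisory's minimum revision can then be compared against the revision= value.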

Petr Fischer · Jan 11, 2018

Fresh crash! Unsatisfying.
Nothing interesting in startx log.

I have:

Code:

dumpdev="AUTO"

in rc.conf. So the crash dump should go to the default swap device from /etc/fstab. The default directory for crash dumps is /var/crash (savecore copies the dump there after reboot).

But there is nothing in /var/crash (no crash dump).
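
A quick way to check for saved dumps after each reboot, as a sketch. The helper name list_crash_dumps is hypothetical, and it assumes savecore's usual vmcore.N and info.N file names. Separately, dumpon -l should show which device the kernel currently has configured for dumps, which helps tell "no dump device" apart from "dump not saved".

```shell
# Hypothetical helper: report what savecore left in the crash directory.
# Returns 0 if at least one vmcore.* or info.* file exists, 1 otherwise.
list_crash_dumps() {
    dir="${1:-/var/crash}"
    found=0
    for f in "$dir"/vmcore.* "$dir"/info.*; do
        [ -e "$f" ] || continue   # an unmatched glob stays literal; skip it
        ls -l "$f"
        found=1
    done
    if [ "$found" -eq 0 ]; then
        echo "no crash dumps in $dir"
        return 1
    fi
}

# Usage:
#   list_crash_dumps              # checks /var/crash
#   list_crash_dumps /some/dir    # checks another directory
```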

Petr Fischer · Jan 11, 2018

Next crash. Some news.

startx log (tail):

Code:

led 11 20:15:32 JavaScript error: resource://gre/modules/TelemetrySession.jsm, line 1698: NS_ERROR_NOT_AVAILABLE: Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsIMemoryReporterManager.residentUnique]
led 11 20:15:32 JavaScript error: resource://gre/modules/TelemetrySession.jsm, line 1698: NS_ERROR_NOT_AVAILABLE: Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsIMemoryReporterManager.residentUnique]
led 11 20:15:33 
led 11 20:15:33 (sakura:5492): Gdk-CRITICAL **: gdk_keymap_get_entries_for_keyval: assertion 'keyval != 0' failed
led 11 20:15:34 
led 11 20:15:34 (sakura:5492): Gdk-CRITICAL **: gdk_keymap_get_entries_for_keyval: assertion 'keyval != 0' failed
led 11 20:15:47 stty: stdin isn't a terminal
led 11 20:15:48 [0111/201548.519423:ERROR:stack_trace_posix.cc(602)] Not implemented reached in bool base::debug::(anonymous namespace)::SandboxSymbolizeHelper::CacheMemoryRegions()
led 11 20:15:48 [0111/201548.903085:ERROR:stack_trace_posix.cc(602)] Not implemented reached in bool base::debug::(anonymous namespace)::SandboxSymbolizeHelper::CacheMemoryRegions()
led 11 20:15:49 [0111/201549.504634:ERROR:stack_trace_posix.cc(602)] Not implemented reached in bool base::debug::(anonymous namespace)::SandboxSymbolizeHelper::CacheMemoryRegions()
led 11 20:15:49 [0111/201549.545147:ERROR:stack_trace_posix.cc(602)] Not implemented reached in bool base::debug::(anonymous namespace)::SandboxSymbolizeHelper::CacheMemoryRegions()
led 11 20:15:49 [0111/201549.810591:ERROR:stack_trace_posix.cc(602)] Not implemented reached in bool base::debug::(anonymous namespace)::SandboxSymbolizeHelper::CacheMemoryRegions()
led 11 20:15:50 [0111/201550.104376:ERROR:stack_trace_posix.cc(602)] Not implemented reached in bool base::debug::(anonymous namespace)::SandboxSymbolizeHelper::CacheMemoryRegions()
led 11 20:15:54 [0111/201554.536366:ERROR:stack_trace_posix.cc(602)] Not implemented reached in bool base::debug::(anonymous namespace)::SandboxSymbolizeHelper::CacheMemoryRegions()

I have GELI-encrypted swap. Is that a problem for saving crash dumps (/var/crash is still empty)? Thanks!

max21 · Jan 11, 2018

Heavy!

Just an idea... Maybe you can reinstall parts or all of Xorg, or whatever seems to be the problem, but clean it up first. GDK must have a bug in my case also, and the SVN update could have set it off. TelemetrySession could have something to do with MySQL or Firefox. I would reinstall a downgraded Firefox first to see what happens.

Code:

make config-recursive
make rmconfig-recursive
make clean-depends
make deinstall clean

Petr Fischer · Jan 11, 2018

Side note about the startx log: if I redirect the complete output of the startx script, there are messages not only from Xorg but also from other apps such as i3, i3-status, gtk, gdk, the sakura terminal, firefox, chrome...

I am using official packages (pkg) for everything from Xorg to all the apps, so there is no personal mess of custom libraries, different port options, etc.

tankist02 · Jan 11, 2018

BTW, what filesystem do you have? I remember a long time ago I had a computer with UFS which first crashed because of the video driver. Then the filesystem got corrupted so badly that it started crashing much more often.

Petr Fischer · Jan 11, 2018

tankist02 said:
BTW, what filesystem do you have? I remember a long time ago I had a computer with UFS which first crashed because of the video driver. Then the filesystem got corrupted so badly that it started crashing much more often.

ZFS on a GELI-encrypted GPT partition (SSD disk). Yesterday I ran zpool scrub - no errors (no bad checksums).

Now I'm trying the "scfb" Xorg driver instead of "intel" (HD Graphics). I will see.

max21 · Jan 12, 2018

Now that I see how debugging works I’m going to make room for dumpfs. Taking it from the top, you probably found this link already. It does seem to be Firefox related. ZFS, redirecting scripts are way over my head … my setup is simple so other than this I don’t have a clue. To me I think the update triggered things off causing me to take one step back by restoring previous. That's all the debugging I know how to do.

https://stackoverflow.com/questions...ailure-code-0x80040111-ns-error-not-available

ralphbsz · Jan 12, 2018

Clearly a bug in GDK. An assert always indicates a bug (at least in well-written software, if people use assert for other purposes, that would be weird). The question is: is that the only bug? Or is the bug triggered because something more fundamental is already wrong?

Find a place where GDK developers hang out, and ask there.

I have no idea what that javascript error means.

Petr Fischer · Jan 12, 2018

max21 + ralphbsz - Forget about app errors like these from GDK and web browsers - my laptop crashes completely. These client errors can not crash the computer/kernel.

I need a kernel crash dump (+ the knowledge of how to find something useful in it). But my /var/crash is empty. Is it due to the GELI-encrypted swap?

Petr Fischer · Jan 12, 2018

OK, this: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=124747

ralphbsz · Jan 12, 2018

Good. This hopefully gives you the ability to get crash dumps.

In theory, you are correct that app errors (like GDK) can't crash the whole OS. But using the same theory that software is perfect, the OS shouldn't crash in the first place, nor should these asserts even occur. So the GDK etc. problem might actually have crashed the OS. Or it might be another effect of the same root cause that also killed the OS. So I think it's also worth looking at. But the kernel crash dump is more likely to yield information.

Petr Fischer · Jan 12, 2018

Oh yes! The crash dump is here. The proper config is this:

/etc/fstab

Code:

/dev/ada0p4.eli     none    swap    sw,late     0   0

/etc/rc.conf

Code:

dumpdev="/dev/ada0p4"
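
The point of this config is that swap uses the .eli device while dumpdev names the underlying raw partition. A sketch of that mapping, assuming it generalizes beyond the one setup shown in this thread; the function names raw_provider and suggest_dumpdev are made up for illustration. A dump written to the raw partition presumably bypasses GELI, so kernel memory would land on disk unencrypted.

```shell
# Strip a trailing ".eli" to get the underlying (unencrypted) provider.
raw_provider() {
    printf '%s\n' "${1%.eli}"
}

# Print a candidate dumpdev= line for each swap entry in an fstab-style file.
# Comment lines are ignored; the third column must be exactly "swap".
suggest_dumpdev() {
    fstab="${1:-/etc/fstab}"
    awk '$1 !~ /^#/ && $3 == "swap" { print $1 }' "$fstab" |
    while IFS= read -r dev; do
        printf 'dumpdev="%s"\n' "$(raw_provider "$dev")"
    done
}

# Usage:
#   suggest_dumpdev             # reads /etc/fstab
#   raw_provider /dev/ada0p4.eli
```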

kgdb output:

Code:

$ sudo kgdb /boot/kernel/kernel /var/crash/vmcore.0

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...(no debugging symbols found)...
Attempt to extract a component of a value that is not a structure pointer.
Attempt to extract a component of a value that is not a structure pointer.
#0  0xffffffff80a6b98a in doadump ()
(kgdb)

No debug symbols, no useful output. As I wrote above, I am using the official stock kernel - I don't have a "kernel.debug" or anything like it. Any ideas, or a good tutorial for a kernel debugging noob? Thanks!
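
One thing worth checking, as a sketch rather than a confirmed fix: stock kernels normally ship their symbols separately, and on 11.x they are commonly found under /usr/lib/debug when the optional kernel-dbg distribution set is installed. The paths below are assumptions about typical locations, and find_kernel_debug is a hypothetical helper name.

```shell
# Hypothetical helper: look for a kernel symbol file in commonly used places.
# An optional first argument is prepended to each path (useful for testing
# or for inspecting a mounted image); by default the real root is searched.
find_kernel_debug() {
    root="${1:-}"
    for p in /usr/lib/debug/boot/kernel/kernel.debug \
             /boot/kernel/kernel.symbols; do
        if [ -f "$root$p" ]; then
            printf '%s\n' "$root$p"
            return 0
        fi
    done
    return 1
}

# Usage: pass the symbol file to kgdb instead of /boot/kernel/kernel
#   sym=$(find_kernel_debug) && sudo kgdb "$sym" /var/crash/vmcore.0
```

If a symbol file is found, typing bt at the (kgdb) prompt should then print a backtrace with function names.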

Snurg · Jan 12, 2018

Sorry for disturbing again. The laptop apparently crashes randomly, and this can have many causes besides RAM problems.
For example, laptops are very susceptible to BGA solder problems, aggravated by constant twisting and tilting. There are even resoldering companies that specialize in this kind of problem.

A Memtest run that shows no errors does not prove anything. It is only meaningful when it detects an error.
To find instabilities, there is a better method: make buildkernel + buildworld. This process stresses the whole hardware, and if there are any problems they will likely show up, because this test is far more sensitive to single bit flips than, for example, web browsing or memtests. On a good computer it should simply succeed. If it fails, especially if it fails in a different way on every run, then you know there is hardware instability. In that case I doubt it makes much sense to investigate which bit error caused one particular crash.

And I'd recommend backing up important data first (if not already done). A filesystem can easily get corrupted on a flaky computer.
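
The repeated-build test can be scripted so that each run's result and log are kept for comparison. A minimal sketch; the function name repeat_and_record, the log directory, and the -j4 value in the usage line are all just examples.

```shell
# Run a command N times, keep each run's output in its own log file,
# and print one status line per run. Returns 0 only if every run succeeded.
repeat_and_record() {
    runs="$1"
    logdir="$2"
    shift 2
    mkdir -p "$logdir" || return 1
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        if "$@" > "$logdir/run.$i.log" 2>&1; then
            echo "run $i: ok"
        else
            echo "run $i: FAILED (exit $?)"
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    [ "$failures" -eq 0 ]
}

# Usage (from /usr/src, as root):
#   repeat_and_record 3 /var/tmp/buildlogs make -j4 buildworld buildkernel
```

Failures that stop at a different place in each log point toward hardware; a failure that is identical every time points more toward a software or source problem.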
