Solved: Self-compiled openbox binary crashes on 12.2 but the pkg version works

Hello, I'm having a relatively small problem when building a fully configured FreeBSD 12.2 system with a graphical setup.
This has been a POS system based on FreeBSD 12.1 for some time without any problems. Everything is built from src and ports, so I'm not using any precompiled binaries.
This is a dual-monitor system that uses openbox as its window manager. After the upgrade to 12.2, openbox started crashing with signal 11 (core dump) whenever a new SDL 1.2 based program is started. The problem can be worked around by using the openbox binary from pkg instead of the one compiled from ports, or by using the compiled binary from my previous FreeBSD 12.1 based build tree, but for consistency I'd like to keep everything source-based and compiled in a single system tree.

The strange thing is that the 12.1 and 12.2 x11-wm/openbox port directories contain the same source files when extracted. The only difference is a change from revision 6 to 7 in the Makefile. Not sure what causes this crash, but I think it has to do with the change from llvm80 to llvm90 as default compiler. I tried compiling openbox on a 12.2 system with the setting DEFAULT_VERSIONS+=llvm=80 in /etc/make.conf, but the problem doesn't go away with it.

Has anybody an idea on what could cause this crash and how I compile a port in a FreeBSD 12.2 installation, but with all the build settings from 12.1, if this is possible?
 
Not sure what causes this crash
Verify all its dependencies too. Make sure they're all up to date as the wrong version of a dependency or any other problems with dependencies could also lead to crashes.

The fact that the package works more or less indicates a local problem, most likely with one or more dependencies. Packages are built from the same port you're using, with the same default compiler for each FreeBSD version. The biggest difference however is that packages are always built with a "clean" environment. Every dependency is built first (also in a clean environment) then installed prior to the build of this port.
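One way to act on this is to record the dependency versions on the working and the failing system and compare the two lists. A minimal sketch, assuming each list was saved with `pkg query '%dn-%dv' openbox` on the respective system; the helper name `diff_deps` and the file names in the usage note are made up for illustration:

```shell
# Compare two saved dependency lists (one "name-version" entry per line).
# diff_deps is a hypothetical helper; each input file is assumed to come from
#   pkg query '%dn-%dv' openbox
diff_deps() {
    old_sorted=$(mktemp) || return 2
    new_sorted=$(mktemp) || { rm -f "$old_sorted"; return 2; }
    sort "$1" > "$old_sorted"
    sort "$2" > "$new_sorted"
    # "<" lines exist only in the first list, ">" lines only in the second.
    status=0
    diff "$old_sorted" "$new_sorted" || status=$?
    rm -f "$old_sorted" "$new_sorted"
    return "$status"
}
```

Called as `diff_deps deps-12.1.txt deps-12.2.txt`, it prints nothing and returns 0 when both systems have identical dependency versions.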

but for consistency I'd like to keep everything source-based and compiled in a single system tree.
If you really want consistency then you should do your building with ports-mgmt/synth or ports-mgmt/poudriere. Both tools use a "clean" environment to build ports.
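As a rough sketch of what that looks like with ports-mgmt/poudriere: the jail name `122amd64` and the ports tree name `default` are arbitrary examples, and both helper functions are only illustrative wrappers around the usual poudriere subcommands.

```shell
# One-time setup: create a 12.2-RELEASE build jail and a ports tree.
# Jail and tree names are arbitrary examples; run as root on the build host.
setup_clean_builder() {
    jail_name=${1:-122amd64}
    ports_name=${2:-default}
    poudriere jail -c -j "$jail_name" -v 12.2-RELEASE || return
    poudriere ports -c -p "$ports_name"
}

# Build x11-wm/openbox; poudriere builds every dependency first,
# each one in a fresh clone of the jail.
build_openbox_clean() {
    jail_name=${1:-122amd64}
    ports_name=${2:-default}
    poudriere bulk -j "$jail_name" -p "$ports_name" x11-wm/openbox
}
```

After `setup_clean_builder` has succeeded once, only `build_openbox_clean` needs to be repeated for later rebuilds.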

I compile a port in a FreeBSD 12.2 installation, but with all the build settings from 12.1, if this is possible?
There is no difference in build settings between 12.1 and 12.2. Both versions used the exact same port.
 
openbox is an X application, so it pulls in a fair amount of dependencies. To at least get a first idea, you need to know where it crashes and/or on what. If you have a core dump, you can load it into gdb (gdb /path/to/binary /path/to/coredump) and look at the stack backtrace.
I like to examine the asm instructions just above and below the current instruction pointer to get an idea of what was happening. If you compile it with debug information (-g) you'll get a friendlier environment to debug in. But it's SDL, so that can be a problem (lots of stuff to dig through to get an idea of what was happening).
 

I actually don't like Openbox, but it does a good dual-monitor setup, where I can choose which program has to run on which display at which moment. This construction allows a "stretched desktop" to be used by 2 persons at the same time. There's only 1 X server and 1 mouse pointer, but left and right run different, independent programs.
The openbox code is many years old and I don't understand much of it. Debugging probably won't help me...
Currently I'm trying to build a clean FreeBSD 12.2 from the latest source with the latest ports tree. I just noticed that a ports tree fetched with portsnap is older than one that I manually downloaded from FreeBSD.org with ftp. The build broke on a minor version difference of png, a dependency of imlib2, that has existed for 12 days.

The next step is to verify the exact versions of all ports that openbox 3.6.1_7 depends on. Yesterday I noticed that I had a slightly different glib20 version by comparing the versions as shown on the ports webpage. This might be related...
As this POS system contains many configurations for a specific system (HP SFF 64-bit family) and runs fully from RAM, I still expect many problems. It still runs perfectly on FreeBSD 12.1, though. I'm trying this just to have a theoretically more stable, recent FreeBSD world.
 
I was not aiming at your preference of WM but rather that it's X and hence checking out why it crashed could be harder.
You asked if we know what could be the cause of the crash. I wanted to share where to start to get the idea of the initial crash.
 
I just noticed that a ports tree fetched with portsnap is older than one that I manually downloaded from FreeBSD.org with ftp.
Portsnap makes snapshots at regular intervals. If you want the absolute latest ports tree use subversion (soon to be replaced with git).
 
I wasn't aware of that. Is ftp://freebsd.org/pub/freebsd/development/tarballs/ports_current.tar.gz not the most recent either?
 
The ports tree is constantly in motion. So these are always snapshots in time. Ports tree is managed through version control software (subversion or git), that's the only way to always have every update/change available as soon as it's been committed.
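The ports tree has since moved to git, so tracking it through version control could look roughly like this. The repository URL is the official FreeBSD ports repository; the helper name `update_ports_tree` is made up for illustration, and the target directory must be empty or absent for the initial clone.

```shell
# Clone the ports tree on first use, fast-forward it on later runs.
# update_ports_tree is a hypothetical helper; /usr/ports is the default target.
update_ports_tree() {
    ports_dir=${1:-/usr/ports}
    if [ -d "$ports_dir/.git" ]; then
        # Existing checkout: pull everything committed since the last update.
        git -C "$ports_dir" pull --ff-only
    else
        git clone https://git.FreeBSD.org/ports.git "$ports_dir"
    fi
}
```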

 
Alright, I never do this kind of debugging, but let's try. I recreated the core dump and installed gdb. Now how do I see what happened?


What happens here: I start the system, which should start openbox after Xorg, but it instantly crashes, leaving X without a window manager. As the interface is now down, I log in remotely with ssh and start openbox manually. It accepts the situation and takes the 2 running SDL programs into the window manager. This works fine. My openbox keyboard shortcuts to open xterms also work. But when I start another SDL program, openbox crashes with the nonsense message "all you base are belong to us blabla" and dumps core. Note that this is no intrusion. This text exists in the openbox source and is just an unhandled error.
 
I'm assuming it's the openbox binary that crashed. You can do this: gdb `which openbox` /path/to/corefile
In gdb (my preference) do
Code:
set disassembly-flavor intel
bt
x/4i $pc
info registers
This gives us an idea of where it failed. The reason why could be complicated because of X, SDL and its dependencies.
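To capture the complete output of those commands in one go, gdb can also be run non-interactively. A small sketch using gdb's `-batch` and `-ex` options; the helper name `core_report` is made up for illustration:

```shell
# Run the gdb commands listed above in batch mode and print the result.
# core_report is a hypothetical helper: $1 is the binary, $2 the core file.
core_report() {
    gdb -batch \
        -ex 'set disassembly-flavor intel' \
        -ex 'bt' \
        -ex 'x/4i $pc' \
        -ex 'info registers' \
        "$1" "$2"
}
```

For example, `core_report "$(which openbox)" /path/to/corefile > gdb-report.txt` leaves everything in a single file that can be pasted in full.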
 

Results of:
Code:
bt:
#0  0x0000000800f69c2a in thr_kill () at /mfsbin/libc.so.7
#1  0x0000000800f68084 in raise () at /mfsbin/libc.so.7
#2  0x0000000800ede279 in abort () at /mfsbin/libc.so.7
#3  0x00000008002fa849 in  () at /usr/local/lib/libobt.so.2
#4  0x0000000800929b70 in  () at /lib/libthr.so.3
#5  0x000000080092913f in  () at /lib/libthr.so.3
#6  0x00007ffffffff003 in <signal handler called> ()
#7  0x000000000022625c in  ()
#8  0x0000000000226c7f in actions_run_acts ()
#9  0x00000000002427c5 in keyboard_event ()
#10 0x000000000023551b in  ()
#11 0x00000008002fbf12 in  () at /usr/local/lib/libobt.so.2
#12 0x0000000800bc089e in g_main_context_dispatch () at /usr/local/lib/libglib-2.0.so.0
#13 0x0000000800bc0c44 in  () at /usr/local/lib/libglib-2.0.so.0
#14 0x0000000800bc0f9a in g_main_loop_run () at /usr/local/lib/libglib-2.0.so.0
#15 0x000000000024aae2 in main ()

x/4i $pc:
=> 0x800f69c2a <thr_kill+10>:   jb     0x800f6e404
   0x800f69c30 <thr_kill+16>:   ret    
   0x800f69c31: int3   
   0x800f69c32: int3   

info registers:
=> 0x800f69c2a <thr_kill+10>:   jb     0x800f6e404
   0x800f69c30 <thr_kill+16>:   ret    
   0x800f69c31: int3   
   0x800f69c32: int3
 
You didn't paste the proper info registers. I'd say the main focus would be in frame 7 (right before signal handler is called), so:
Code:
f 7
disass $pc-0x30, $pc+0x30
info registers
This would just give an idea and location of the crash.

But to get some deeper information it would be really beneficial to compile it with debug (ports allow that) to see what's going on.
 
(System booted again, so the bt output changed a bit. Not sure if I'm still doing this right. My assembly knowledge stops at IBM 8086)

f 7
Code:
#7  0x000000000022625c in ?? ()
disass $pc-0x30, $pc+0x30
Code:
Dump of assembler code from 0x22622c to 0x22628c:
   0x000000000022622c:  sub    BYTE PTR [rax-0x77],cl
   0x000000000022622f:  repz mov r14,rdi
   0x0000000000226233:  mov    rax,QWORD PTR [rip+0x36546]        # 0x25c780 <__stack_chk_guard>
   0x000000000022623a:  mov    QWORD PTR [rbp-0x20],rax
   0x000000000022623e:  xorps  xmm0,xmm0
   0x0000000000226241:  movaps XMMWORD PTR [rbp-0x30],xmm0
   0x0000000000226245:  movaps XMMWORD PTR [rbp-0x40],xmm0
   0x0000000000226249:  cmp    DWORD PTR [rsi+0x30],0x0
   0x000000000022624d:  je     0x226274
   0x000000000022624f:  mov    r15d,DWORD PTR [rbx+0x2c]
   0x0000000000226253:  test   r15d,r15d
   0x0000000000226256:  js     0x2262dc
=> 0x000000000022625c:  movups xmm0,XMMWORD PTR [rbx+0x8]
   0x0000000000226260:  movups xmm1,XMMWORD PTR [rbx+0x18]
   0x0000000000226264:  movaps XMMWORD PTR [rbp-0x30],xmm1
   0x0000000000226268:  movaps XMMWORD PTR [rbp-0x40],xmm0
   0x000000000022626c:  cmp    DWORD PTR [r14],0x7
   0x0000000000226270:  jne    0x22629c
   0x0000000000226272:  jmp    0x2262c2
   0x0000000000226274:  mov    r15d,DWORD PTR [rip+0x362d5]        # 0x25c550 <screen_num_monitors>
   0x000000000022627b:  mov    edi,r15d
   0x000000000022627e:  call   0x253690 <screen_physical_area_monitor>
   0x0000000000226283:  mov    ecx,DWORD PTR [r14+0x8]
   0x0000000000226287:  sub    ecx,DWORD PTR [rax]
   0x0000000000226289:  mov    DWORD PTR [rbp-0x40],ecx
End of assembler dump.
info registers
Code:
rax            0x1                 1
rbx            0x5                 5
rcx            0x25c780            2475904
rdx            0x20                32
rsi            0x33                51
rdi            0x0                 0
rbp            0x7fffffffe500      0x7fffffffe500
rsp            0x7fffffffe4a0      0x7fffffffe4a0
r8             0x801a75400         34387481600
r9             0x10                16
r10            0x10                16
r11            0x0                 0
r12            0x801b0a490         34388092048
r13            0x1                 1
r14            0x801b7ef40         34388569920
r15            0x1                 1
rip            0x22625c            0x22625c
eflags         0x10202             [ IF RF ]
cs             0x43                67
ss             0x3b                59
ds             <unavailable>
es             <unavailable>
fs             <unavailable>
gs             <unavailable>
fs_base        <unavailable>
gs_base        <unavailable>
 
This does answer the easier part of the problem: where it crashed. The instruction at 0x22625c is trying to dereference a pointer in %rbx, which holds a bogus value (5, i.e. not a valid memory address in this context). Most likely it was attempting to access a structure member.
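The arithmetic behind that can be checked quickly. This is only a sketch that takes the register dump at face value; the 4096-byte page size is the amd64 base page size, and the first page is normally left unmapped so that near-NULL pointers fault.

```shell
# Effective address of "movups xmm0, XMMWORD PTR [rbx+0x8]" with rbx = 0x5.
rbx=0x5
offset=0x8
addr=$(( $rbx + $offset ))
printf 'faulting address: 0x%x\n' "$addr"

# An address inside the first 4096-byte page is normally not mapped,
# so the load ends in SIGSEGV (signal 11).
if [ "$addr" -lt 4096 ]; then
    echo "address lies inside the first page"
fi
```

A value this small looks more like an integer or enum that ended up where a structure pointer was expected than like a damaged pointer.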

Now, "why" is the harder part of this problem. As you do have a way of fixing it (using the binary version, or maybe using an updated port version), you are probably done.
To look deeper into the problem, compiling openbox with debug symbols (add WITH_DEBUG=yes to /etc/make.conf) is really a good idea.
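For a one-off debug build the knob can also be passed on the make command line instead of editing /etc/make.conf. A sketch, assuming the ports tree lives in /usr/ports unless PORTSDIR says otherwise; the helper name `build_openbox_debug` is made up for illustration:

```shell
# Rebuild and reinstall openbox from ports with debug symbols.
# build_openbox_debug is a hypothetical helper; run it as root.
build_openbox_debug() {
    port_dir=${PORTSDIR:-/usr/ports}/x11-wm/openbox
    # Start from a clean work directory so nothing from the old build is reused.
    make -C "$port_dir" clean || return
    make -C "$port_dir" WITH_DEBUG=yes build || return
    # Replace the installed port with the freshly built one.
    make -C "$port_dir" WITH_DEBUG=yes deinstall reinstall
}
```

Dependencies that are already installed are not rebuilt by this, so only the port itself gains symbols.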

From backtrace we know events happened in the order: keyboard_event()->actions_run_acts()->unknown() where crash occurred. It's a starting point for debugging.

Why did you mention SDL? Are you able to start openbox if no SDL application is running? Maybe the pkg version handles some sort of keyboard event that the compiled version doesn't recognize (very wild guess)?
While in wild-guess territory: sometimes a function keeps its name but its signature changes (more parameters, or parameters in a different order). A call compiled against the older library version could then go wrong.
The libc location suggests this may not be a standard FreeBSD installation?

Note:
The start of the disassembly may not necessarily make sense. We used $pc-0x30, which may have landed in the middle of an instruction. It doesn't matter; I just wanted to see a few instructions before and after to make more sense of what was going on.
 
This is a custom FreeBSD "live" system used in a store/business. It depends on the somewhat deprecated mfsroot and uzip constructs to get everything booted up without relying on a hard disk. It consists of 2 touch monitors, each running an SDL program that represents the basic interface. Additional programs like member management and print operations run on top of these existing windows. The only non-SDL program used is xterm, for administrative tasks. A lot of additional USB hardware is required: 2 touchscreen data wires, 2 drawer triggers, 2 NFC card readers, a printer and a webcam. Because of this, I know that a system upgrade will without any doubt introduce new problems.

While looking at this debugging as a noob, I have the idea it indeed has to do with glib20 and the way openbox handles window positioning on a dual head setup. This probably has to do with a version incompatibility, because the pkg-based openbox version on 12.2 works without problems, as well as the binary taken from the working 12.1 system.

Currently I'm still trying to build a working 12.2 system, but a whole compile takes a few hours. The introduction of llvm90 and Rust as dependencies of this system in FreeBSD 12.2 makes me think about getting another development system :)
 
While it's a stretch, you could run objdump -d openbox_from12.1 > dump-12.1, do the same for the ports version, and diff the two dumps. It makes sense to compare your crashing build against the working binary from 12.1. If the diff shows too many changes it's useless, but sometimes it can help. Installing the debug version really is the way to go if you want to dig deeper.
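Roughly, the comparison could be wrapped like this. It is only a sketch: the helper name `diff_disasm` and the binary names in the usage note are made up, and the output will be noisy because addresses differ between builds.

```shell
# Disassemble two binaries and show a unified diff of the listings.
# diff_disasm is a hypothetical helper: $1 is the old binary, $2 the new one.
diff_disasm() {
    old_dump=$(mktemp) || return 2
    new_dump=$(mktemp) || { rm -f "$old_dump"; return 2; }
    objdump -d "$1" > "$old_dump"
    objdump -d "$2" > "$new_dump"
    # diff exits with 1 when the listings differ; keep that as the result.
    status=0
    diff -u "$old_dump" "$new_dump" || status=$?
    rm -f "$old_dump" "$new_dump"
    return "$status"
}
```

Something like `diff_disasm openbox_from12.1 openbox_from12.2 | less` keeps the output browsable.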

You could even share your crashing version of openbox. Find the full name of your package with pkg info openbox\*, then run pkg create -o . <full_name>. This creates a package file you can then share.
 
The cause of this crash appears to be the compiler being llvm90 in FreeBSD 12.2
With:
Code:
cd /usr/ports/x11-wm/openbox
make CC=clang80 CXX=clang++80 CPP=clang-cpp80
a correct openbox binary is built.
 
I can't say I'd be able to identify the issue, but it did spark my interest. Would you be willing to build and share these two versions of openbox? Preferably with debug symbols (WITH_DEBUG=yes in /etc/make.conf). Build both versions and create the packages as I mentioned above.
I've heard of some issues between g++ and clang++ when it came to optimization, but I can't think of what the problem is here.
 
Going to bed now, otherwise I'll miss the Perseverance landing tomorrow, but I will do that later.
One thing I remember is a new clang warning in FreeBSD 12.2 related to snprintf() parameters in my own SDL programs. The openbox code contains a few g_snprintf() calls. This might have something to do with it.
 
...this could also mean that the previous LLVM-8.0 is incorrect in some subtle cases, and the openbox source code is buggy but that bug is revealed only by the new LLVM compiler, while the previous version got away with it. AFAIK there are no 100% correct compilers available... ;)
 
The cause of this crash appears to be the compiler being llvm90 in FreeBSD 12.2
Are you sure (just on the clang-version-that-ships-with-12.2 part)?

I've been looking at MySQL 5.7 needing a different version of clang on FreeBSD 13.0 which ships with clang 11 - and it looks like FreeBSD 12.2 has clang 10?
Code:
% uname -a
FreeBSD something.co.nz 12.2-RELEASE-p3 FreeBSD 12.2-RELEASE-p3 GENERIC  amd64
% clang -v
FreeBSD clang version 10.0.1 (git@github.com:llvm/llvm-project.git llvmorg-10.0.1-0-gef32c611aa2)
Target: x86_64-unknown-freebsd12.2
It's not directly related to your issue but trying to understand to make sure I'm looking at the right things.

Also: https://www.freebsd.org/releases/12.2R/relnotes/

The clang, llvm, lld, lldb, compiler-rt utilities and libc++ have been updated to version 10.0.1. [r363494]
 
On my system, the problem exists when openbox is built with llvm90 or llvm10. This is a complete buildtree that contains FreeBSD 12.2 and many ports. The default compiler is llvm10 but llvm90 is installed too as dependency for some ports.
I partially fixed it by installing llvm80 manually to compile openbox. It can now run different programs on different displays without crashing, but I still noticed a problem: this build of openbox no longer responds to window placement. Usually, on a dual-monitor setup, the keyboard shortcuts ctrl-alt-1 and ctrl-alt-2 send the active window to monitor 1 or 2. I used this in combination with the xdotool port to send programs to the correct monitor, but this doesn't work anymore. As I mentioned earlier, I suspect window placement handling and recent versions of glib20 to be the cause.

I think the problem basically is that the exact combinations of required compiler versions and port dependencies aren't fully available. You still have to find it out by trial and error. Also the naming is a bit inconsistent. llvm10 should be called llvm100 instead, just to keep things clear. This is a "wild" version numbering change in the portstree. We need different llvm versions but the port name somehow must be 6 chars long? Same with glib. Why is the port named glib20 while the version number starts with 2. and the portname in Makefile is just glib?

I'm struggling with this forum. It doesn't like X-style copy/paste, and the method for replying to someone is vague... This was meant to be a reply to richardtoohev2. Why is repeating the full comment, like on a mailing list, required for a reply?
 
Are you sure (just on the clang-version-that-ships-with-12.2 part)?
At this moment, I can't recreate the failing binary again because I deleted old buildtrees for more diskspace.
FreeBSD 12.2 now builds everything fine with llvm10. I think there was a dependency problem. I upgraded the host system with the latest 12.2 source and the portstree.
Also I found an old parameter for building ports inside some scripts: UNAME_r=12.1-RELEASE. I don't remember what it does or what required this but it's the wrong FreeBSD version, so I deleted all instances of that variable.
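For context: on FreeBSD, uname(1) reports the value of the UNAME_r environment variable instead of the running kernel's release when it is set, so anything that calls `uname -r` during a build would have seen 12.1-RELEASE. A small sketch for finding leftovers of that kind; the helper name `find_uname_overrides` is made up, and the directory to search is whatever holds the build scripts.

```shell
# List every line in a script directory that mentions the UNAME_r override.
# find_uname_overrides is a hypothetical helper; $1 is the directory to search.
find_uname_overrides() {
    # -r searches recursively, -n prefixes each match with its line number.
    grep -rn 'UNAME_r' "$1"
}
```

The function returns non-zero when no override is left, which makes it usable as a check before starting a long build.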

One strange thing I noticed is that on a different 12.2 system, llvm80 was a build dependency of the port mesa-libs, but that has now changed to the default (llvm10). Not sure how this is possible, but it seems the build dependencies of a port don't depend only on the port's source. There's also system-wide configuration involved.

What I have changed is the method to make SDL programs send a message to their own Xorg-window to be placed on the correct monitor.
Earlier I used the xdotool program to simulate a ctrl-1 or ctrl-2 keyboard action that openbox used to send the focused window to another monitor. For some reason that doesn't work anymore, even though openbox is the exact same version as I had on the 12.1 system. It now works again from inside C, also with xdotool but with a windowmove command instead of a simulated key sequence. When a program has to run on the 2nd display, 1024 pixels are added to the x-position so it goes to monitor 2. Small disadvantages are that both screens flash while this is done, and that I have to change this value whenever I change the display resolution.
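The hard-coded offset could be turned into a parameter along these lines. This is a variation on the approach described above rather than the actual code: it places the active window at the left edge of the chosen monitor instead of adding to its current position, assumes side-by-side monitors of equal width, and the helper name `move_to_monitor` is made up.

```shell
# Move the active window to monitor N (1 = left, 2 = right).
# move_to_monitor is a hypothetical helper; the monitor width defaults to
# 1024 pixels and can be passed as the second argument.
move_to_monitor() {
    monitor=$1
    width=${2:-1024}
    x=$(( ($monitor - 1) * $width ))
    # A literal "y" tells xdotool to leave the vertical position unchanged.
    xdotool getactivewindow windowmove "$x" y
}
```

With this, `move_to_monitor 2` covers the current setup, and a resolution change only needs a different second argument, for example `move_to_monitor 2 1280`.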
 