Compiling with GCC

So I am desperately trying to get some application to build. Well, I've already gotten it to build, but only with -O0. Building with anything higher than that makes the application instantly SIGSEGV(?) / SIGBUS on startup, so I thought switching to GCC (or really G++, as it's a C++ codebase) might work a little better, but it's the same sad result (-O > 0 = crash). Is there maybe something I need to take care of when switching the compiler to GCC? I've seen posts about having to set rpath and/or include paths to have GCC work correctly. Is this still accurate information?
 
Well, personally, I'd forget about gcc or clang or tcc, and resolve the segmentation fault. It sounds like otherwise it will be a sheer fluke if it works (and what other nasties are hiding in code that can't seem to handle optimisation?).
 
Is it your program, or are you only trying to make it run? Why don't you start it through a debugger (lldb(1) or gdb(1)) to see where the problem is? Remember to compile and link with the -g command-line option to include debug information in the binary.
 
Well, personally, I'd forget about gcc or clang or tcc, and resolve the segmentation fault. It sounds like otherwise it will be a sheer fluke if it works (and what other nasties are hiding in code that can't seem to handle optimisation?).

In general I fully agree with you, but fixing this is somewhat hopeless for me. It's a giant C++ codebase. Building takes literally hours, it's not crashing on anything obvious (it seems a random MOVQ instruction, where the pointer isn't easily traceable, triggers the SIGBUS), just loading a coredump with debug symbols takes 10 minutes or so (and uses about 7-8+ GB of RAM), actually single-stepping the program..., the build system is hot garbage, and on top of it all C++ isn't really my thing. As much as I'd love to, it's just not practical for me to debug this.

I also pretty much share your feelings on the quality of said code. It's just sadly too huge to really be replaced by something less flaky. Even the author himself somewhat admits to it, as he basically points to a set of GCC versions that should be used and declares everything else unsupported...

Is it your program, or are you only trying to make it run? Why don't you start it through a debugger (lldb(1) or gdb(1)) to see where the problem is? Remember to compile and link with the -g command-line option to include debug information in the binary.

Thanks, but the fact that the project is so damn huge, the build system being really, really "special", and my lack of more than very basic C++ make debugging pretty painful. I'd want to, but I just don't see it happening.

Edit: No, this is very much not my code. I guess that came across between the lines anyway, but I thought I might as well state it directly.

As sad as it is, if GCC does not magically fix this (an old port also used GCC, so I had hopes that might be the solution) I'll probably settle on -O0 or simply abandon it. I've spent the last 4-5 days fixing compile errors and running builds. After finally getting it to build (which admittedly wasn't that hard, even if some of the stuff the various compilers would choke on was somewhat cryptic, and changing a single line mostly would trigger a rebuild of like half the project...) I've tried pretty much every reasonable (and most unreasonable too) combination of flags (as far as the codebase allows it... there are some flags that cannot be removed without being majorly yelled at...) to see if it would be happy, but no.
 
Well, if you suspect movq then it's likely some stack corruption. If it's some GCC-specific code in it, does it require the GNU linker?
Also, you can compress the debug symbols, but I forget the exact option (maybe --compress-debug-sections?). It might let you attach the debugger if they're compressed. Core dump?
If you can run it, you can use gdb to attach to it.
(It all sounds vague, but obviously I have no idea about the code, so it's guesswork at best.)

Also you mention -rpath. In such a large code base is it even possible to find what library path it requires hard-coding?

As to SIGBUS, is this program of yours using mmap(2) by any chance? If it is, examine that closely as it will most likely be the cause.
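
For reference, the textbook way mmap(2) turns into a SIGBUS is touching a page of a mapping that has no backing left in the file (mapping past EOF, or the file being truncated behind the mapping's back). A tiny standalone sketch of that, with a made-up scratch file name, nothing taken from your code:

Code:
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
    // Hypothetical scratch file; the point is only that it is far shorter
    // than the region we map over it.
    int fd = open("tiny.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, "x", 1) != 1) { perror("write"); return 1; }   // file is 1 byte long

    long page = sysconf(_SC_PAGESIZE);
    // Map two pages; mmap itself succeeds even though only the first page
    // has any file data behind it.
    char *p = static_cast<char *>(mmap(nullptr, 2 * page,
                                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    std::printf("first byte: %c\n", p[0]);  // fine: the page holding EOF is zero-filled
    p[page] = 'y';                          // SIGBUS: this page has no backing store
    return 0;
}

If the crashing MOVQ is storing into (or loading from) a region like that second page, the instruction itself is irrelevant; it's the mapping behind it that went away or was never fully backed.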
 
Well, if you suspect movq then it's likely some stack corruption. If it's some GCC-specific code in it, does it require the GNU linker?

That's a good question. The uncompressed codebase is around 1.4 GB, so it's somewhat hard to tell what's actually in there and what isn't. I think I've read about successful Clang builds, but who knows what kind of settings were used. Maybe those people just used -O0 and called it a day. What's kind of interesting is that building with Clang required fewer patches than building with GCC, which also gave me the idea that there might be something wrong with my GCC setup.

Also, you can compress the debug symbols, but I forget the exact option (maybe --compress-debug-sections?). It might let you attach the debugger if they're compressed. Core dump?
If you can run it, you can use gdb to attach to it.
(It all sounds vague, but obviously I have no idea about the code, so it's guesswork at best.)

Neither have I. All I can say for sure is that it's some serious C++ monstrosity. I'll look into compressing the symbols and see if that makes a difference. My system only has 6 GB of RAM, so that might very well speed things up a bit. I have my doubts whether I will be able to make much sense of the code, though. From what I saw, it seems to be crashing somewhere in a JS engine(?) while trying to build a list of sorts (shuffling pointers around to set next/prev members), with one of the pointers being bad. I am not very optimistic I'll be able to trace this back far enough to figure out the underlying cause.

Also you mention -rpath. In such a large code base is it even possible to find what library path it requires hard-coding?

This wasn't really meant in relation to the project itself, but rather the way FreeBSD's GCC port is set up. From what I read, it seemed like it might need some help finding its own libstdc++, as it lives in a non-default location (/usr/local/lib/gccX) and could otherwise possibly pick up a version from base, which obviously would not really be compatible (looking into it, it seems there is no libstdc++ in base anyway, though).

Checking GCC/G++'s default include paths, it still seems it won't be using its own C++ headers by default. At least /usr/local/lib/gcc7/include/c++ is not among the locations listed (which suspiciously look identical to the C defaults, so listing C++ paths might not work the same way it does for C). How it would locate the system headers at /usr/include/c++/v1 is beyond me, though, and obviously it's finding something. I've just stuck a -isystem /usr/local/lib/gcc7/include/c++ into CXXFLAGS, and in about two hours I'll see if that did anything.
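
To take some of the guesswork out of it, I can build a throwaway snippet with the exact CXX and CXXFLAGS the project uses and have it report which standard library the headers actually belong to (just a sanity check, nothing project-specific; the g++7 invocation in the comment is only my guess at the port's binary name):

Code:
// Report which C++ standard library the headers in use come from.
// Hypothetical build line: g++7 $CXXFLAGS which_stdlib.cpp -o which_stdlib
// Including any standard header (here <cstdio>) pulls in the library's
// identification macros.
#include <cstdio>

int main()
{
#if defined(_LIBCPP_VERSION)
    std::printf("LLVM libc++ (%d), i.e. the /usr/include/c++/v1 headers\n",
                static_cast<int>(_LIBCPP_VERSION));
#elif defined(__GLIBCXX__)
    std::printf("GNU libstdc++, __GLIBCXX__ = %d\n",
                static_cast<int>(__GLIBCXX__));
#else
    std::printf("unknown C++ standard library\n");
#endif
    return 0;
}

That only covers the headers, of course; which runtime library actually gets loaded at startup is what the -rpath side is supposed to sort out.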

As to SIGBUS, is this program of yours using mmap(2) by any chance? If it is, examine that closely as it will most likely be the cause.

Likely. At this size it's kind of hard to tell, but I figure it has to be, as what else besides an access beyond some mapped region could cause the SIGBUS? MOVQ doesn't have any requirements concerning alignment. I think it's possible to still turn non-aligned accesses into errors on x86, but it doesn't seem very likely this kind of code would do that (it insists on -fno-exceptions, so I don't think it would bother with any kind of strict alignment checking).
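
For completeness, the opt-in alignment checking I mean would look roughly like the sketch below. Purely illustrative and almost certainly not something this codebase does; it assumes x86-64, GCC/Clang-style inline asm, building at -O0 so the loads stay as written, and a kernel that keeps CR0.AM set (otherwise nothing happens):

Code:
#include <cstdint>
#include <cstdio>

int main()
{
    alignas(8) unsigned char buf[16] = {};

    // Misaligned 8-byte load: with alignment checking off (the default) this
    // just works, which is why MOVQ on its own never explains a SIGBUS.
    std::uint64_t v = *reinterpret_cast<std::uint64_t *>(buf + 1);
    std::printf("misaligned load ok: %llu\n", static_cast<unsigned long long>(v));

    // Opt in: set EFLAGS.AC (bit 18). If CR0.AM is also set, the next
    // misaligned access raises #AC, which is delivered as SIGBUS.
    __asm__ volatile("pushfq\n\torq $0x40000, (%%rsp)\n\tpopfq" ::: "cc", "memory");

    v = *reinterpret_cast<std::uint64_t *>(buf + 1);   // expected to SIGBUS here
    std::printf("not normally reached: %llu\n", static_cast<unsigned long long>(v));
    return 0;
}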

The whole thing is pretty messed up. Its build system consists of a shell script that calls a ton of Python scripts but also somehow relies on at least 3 autoconf configurations, 2 of which I had to patch to even respect CC/CXX, because they would stubbornly try to detect the compilers themselves and always end up using Clang, even if the first configure would obey and list GCC as the compiler... There might very well be something fishy going on somewhere during the build. It's practically impossible to really follow the build log.
 
If it isn't getting to main yet, perhaps there is a large object / array outside of a function that is too large for stack memory?

How big is the resulting binary? I don't suppose stripping would help (-s or via strip command).

The whole thing is pretty messed up. Its build system consists of a shell script that calls a ton of Python scripts but also somehow relies on at least 3 autoconf configurations.

Build systems are my biggest weakness. Many are so bizarre. Does it show output? Perhaps grep for ^gcc, ^cc, etc and see if any weird arguments are being passed in for any compilation units.

would trigger a rebuild of like half the project

Ugh, how do some developers work like this! My OCD generally requires a compilation every 2-3 lines of code... If iteration speeds aren't typically less than 2 seconds, I die a little bit inside.
 
If it isn't getting to main yet, perhaps there is a large object / array outside of a function that is too large for stack memory?

I am pretty sure main is reached. There is even some output from GTK2 on the console about my theme (no, it's not the theme: I moved my gtkrc-2.0 file out of the way and it still crashed).

How big is the resulting binary? I don't suppose stripping would help (-s or via strip command).

The binary itself isn't too bad (~300 KB), but the real offender is a 137 MB (at -O0, stripped) shared library it uses.
 
Build systems are my biggest weakness. Many are so bizarre. Does it show output? Perhaps grep for ^gcc, ^cc, etc and see if any weird arguments are being passed in for any compilation units.

Yes, it does. Usually not a lot, but I've figured out that it takes a -v switch which at least shows the actual commands. I've been looking for a while and didn't see anything all that out of the ordinary (at least in relation to this project...). Mind you, most commands are about a third of my (graphical) terminal long, so there might very well be something I've missed.

Ugh, how do some developers work like this! My OCD generally requires a compilation every 2-3 lines of code... If iteration speeds aren't typically less than 2 seconds, I die a little bit inside.

I am not sure, but in this case it might actually have been needed. Some of the patches had to be made to headers that seemed very low-level, so maybe they really were included by almost half the project. Who knows? Doesn't change the fact that patching is super painful, though. Try something, wait 30-60 minutes, get the same error again because it wasn't the right way to fix it... meh.

By now I've given up on leaving anything I can do myself to the build system anyway and just delete everything to force a clean build every time I try something new. I really don't trust this thing not to include some stale objects here or there, so every attempt needs something like 2+ hours to build.
 
Hmm.

Are you able to 'ldd' the shared object? If that crashes when just loading to get the info, that eliminates many other potential issues.

Also, does Valgrind work on FreeBSD still? Perhaps it is worthwhile running it through that. There is an experimental stack memory checker too which sounds like it could be useful. If the port is in bad shape, perhaps run it through Valgrind on Linux, hopefully it will pick up the potential memory issue causing symptoms on FreeBSD.
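
On the load-time question above: a small dlopen(3) probe (the library name below is made up, pass the real path as an argument) separates "crashes while the object is being loaded and its ELF constructors run" from "crashes later in application code", which narrows the search space either way:

Code:
// Build: c++ -g probe.cpp -o probe   (add -ldl on systems where dlopen lives in libdl)
#include <cstdio>
#include <dlfcn.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "./libwhatever.so";  // hypothetical name
    void *handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    if (handle == nullptr) {
        std::fprintf(stderr, "dlopen(%s) failed: %s\n", path, dlerror());
        return 1;
    }
    std::puts("loaded, relocated and initialized fine; the problem is further in");
    dlclose(handle);
    return 0;
}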
 
Hmm.

Are you able to 'ldd' the shared object? If that crashes when just loading to get the info, that eliminates many other potential issues.

Also, does Valgrind work on FreeBSD still? Perhaps it is worthwhile running it through that. There is an experimental stack memory checker too which sounds like it could be useful.

Valgrind was broken for a spell but appears to be back.


Yes, Valgrind might at least be worth a try. I haven't really worked with it much yet, so I was a bit reluctant, but it's probably the most sensible thing to do in this situation.

If the port is in bad shape, perhaps run it through Valgrind on Linux, hopefully it will pick up the potential memory issue causing symptoms on FreeBSD.

There are official builds for Linux, so I figure it would "just work" (tm) there and not be of much use for pinning down the actual problem on FreeBSD.

Does the library have unit tests?

I guess not. There is a build switch called --enable-tests, but everything I can locate in the sources that might be testing-related seems rather random. I am trying it right now, though. Let's see what comes out of it. Mystery building...

Edit: That was a quick one. --enable-tests crashes the build system with a pretty generic Python backtrace and the message that the build configuration file is invalid... I am now building at least without --disable-tests (which is part of the recommended build configuration), but I have a feeling this is not going to do much of anything.

Side note: Building with GCC actually seems to be worse than with Clang (the rpath and -isystem switches don't seem to make a difference; the resulting binary doesn't even get to GTK and just crashes with zero output), so I've switched back to using Clang for now. Actually, scratch that. I might have done something that had a side effect on the GCC build, so I am giving it a second chance while trying to build the "tests".
 
O-M-G!! That second chance really paid off, it seems. There is still some possibility of this being down to the missing --disable-tests build option, but with a combination of -isystem, -Wl,-rpath and without the recommended --enable-jemalloc (which could also be a very good clue as to where to start hunting for the real cause; this frankenstein monster bundles its own version of jemalloc...) GCC 7.5.0 managed to produce a working -O2 build! I guess some of the GCC-specific stuff is likely superfluous, but I can figure that out later. Finally some progress! I really can't believe it. I'll do some more testing. Let's hope the build working is not just down to pure chance. If not, I guess FreeBSD might get some new (old) port soon :)

Edit: I got so excited about the build working that I totally forgot to look for the tests that might have been built without --disable-tests. Surprise, surprise... I don't see any kind of tests around the build directory...

Edit 2: Of course it would be nicer to be able to build with Clang, so investigating the SIGBUS is not completely off the table. I would just love to get a working result after trying for so long. I can (and likely will) still improve things later on.

Edit 3: I spoke a bit too soon. It's sadly not overly stable, so further investigation is needed anyway, but finally making at least SOME progress is very motivating at this point.
 
The last time I had problems like these, Clang's AddressSanitizer saved my bacon. Valgrind was still broken on FreeBSD at the time.

...also somehow relies on at least 3 autoconf configurations
...this frankenstein monster bundles its own version of jemalloc...
Argh. I suspect you've stumbled into a project that uses the Palemoon development model: let's bundle our own outdated, buggy versions of open-source projects in our code base.

Build systems are my biggest weakness. Many are so bizarre.
Tup is my favorite build system. Well, the Lua binding is, anyway. I did manage to get it working on FreeBSD. Now I just gotta write a port. It's on the todo list.
 
Small update: I've looked into the new crash, which happens way later but still happens rather consistently. It's actually a SIGSEGV this time. Sadly the backtrace is useless (libthr calls some signal handler, which goes on to call parent handlers until one of them calls the default handler and triggers a core dump), so I've installed Valgrind in hopes of getting some kind of a clue. That worked fine aside from a little bug which is already known (I got the fix from the FreeBSD bug tracker). Valgrind itself also works but chokes on kevent for being an unhandled syscall. It seems there is no option to ignore kevent either, but it directed me at some README about adding the missing wrapper. Not exactly my idea of fun, but I guess I don't have a choice. I just hope it will actually give me some useful information after that. Up until running into kevent it found nothing but a couple of supposedly (false positives, I guess) uninitialized variables in lib.c. Let's see.
 
Also, does Valgrind work on FreeBSD still? Perhaps it is worthwhile running it through that. There is an experimental stack memory checker too which sounds like it could be useful. If the port is in bad shape, perhaps run it through Valgrind on Linux, hopefully it will pick up the potential memory issue causing symptoms on FreeBSD.

There are two Valgrind ports: devel/valgrind, which is roughly version 3.10.1 and more or less nonfunctional, and devel/valgrind-devel, which is roughly version 3.16.1 and more or less fully functional. I'm maintaining valgrind-devel. ATM neither is available via pkg install; you will need to update your ports tree and build and install from there.

I would also strongly recommend using the sanitizers.
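
To show what I mean, here is the kind of thing AddressSanitizer catches on even a toy reproducer (made up for illustration, nothing to do with your project). For the real build it mostly comes down to getting -fsanitize=address into both CXXFLAGS and LDFLAGS and living with the extra memory use:

Code:
// Build: c++ -g -O1 -fsanitize=address asan_demo.cpp -o asan_demo
// Running it aborts at the bad access with a heap-buffer-overflow report,
// including the stack of the access and the stack of the allocation.
#include <cstdio>

int main(int argc, char **)
{
    int *p = new int[4];
    p[0] = argc;
    std::printf("%d\n", p[4]);    // one element past the end: ASan stops here
    delete[] p;
    return 0;
}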
 
Valgrind itself also works but chokes on kevent for being an unhandled syscall. It seems there is no option to ignore kevent either, but it directed me at some README about adding the missing wrapper. Not exactly my idea of fun, but I guess I don't have a choice. I just hope it will actually give me some useful information after that. Up until running into kevent it found nothing but a couple of supposedly (false positives, I guess) uninitialized variables in lib.c. Let's see.

Try the latest devel/valgrind-devel (see my previous post). It will save you huge amounts of grief trying to fix things yourself.
 
There are two Valgrind ports: devel/valgrind, which is roughly version 3.10.1 and more or less nonfunctional, and devel/valgrind-devel, which is roughly version 3.16.1 and more or less fully functional. I'm maintaining valgrind-devel. ATM neither is available via pkg install; you will need to update your ports tree and build and install from there.

I would also strongly recommend using the sanitizers.

Oh, that's pretty interesting. I had just mindlessly punched in cd /usr/ports/devel/valgrind and didn't even check whether there were other versions. I can report that, with the patch from the bug tracker (rather easy to locate and, from what I understand, already integrated in CURRENT?) and a couple of added syscalls, devel/valgrind seems to work, though.

The output was also helpful in narrowing down the error. Sadly it's not exactly obvious why the responsible part fails (it's an intentional crash triggered because some pretty complex code failed to initialize, and figuring out that reason will be pretty hard/painful, it seems). Right now I am back to checking whether some build setting might work around the problem, but it doesn't look very promising, so it's not unlikely I will soon have to fire up a debugger and step through that monstrosity in hopes of narrowing the cause down further...

Try the latest devel/valgrind-devel (see my previous post). It will save you huge amounts of grief trying to fix things yourself.

Thank you, but I've already "fixed" it. It wasn't all that bad anyway. Mostly just a bit boring (and somewhat questionable, since I have little idea what I actually did there).
 
Here is what I came up with to get the "unhandled syscalls" out of the way (it's probably not very useful / pretty bad, but I figured I might as well share it):

Code:
diff -ur work/stass-valgrind-freebsd-ce1acb28953f/coregrind/m_syswrap/syswrap-freebsd.c patched/stass-valgrind-freebsd-ce1acb28953f/coregrind/m_syswrap/syswrap-freebsd.c
--- coregrind/m_syswrap/syswrap-freebsd.c    2016-01-13 20:20:20.000000000 +0100
+++ coregrind/m_syswrap/syswrap-freebsd.c    2020-09-17 21:41:54.117179000 +0200
@@ -3670,6 +3670,64 @@
         POST_MEM_WRITE( ARG5, ARG4 );
}

+// ekvz
+PRE(sys_kevent_fbsd12)
+{
+    *flags |= SfMayBlock;
+    PRINT("sys_kevent_fbsd12 ( %ld, %#lx, %ld, %#lx, %ld, %#lx )\n", ARG1,ARG2,ARG3,ARG4,ARG5,ARG6);
+    PRE_REG_READ6(long, "kevent",
+                 int, fd, struct vki_kevent *, newev, int, num_newev,
+         struct vki_kevent *, ret_ev, int, num_retev,
+         struct timespec *, timeout);
+   if (ARG2 != 0 && ARG3 != 0)
+      PRE_MEM_READ( "kevent(changeevent)", ARG2, sizeof(struct vki_kevent)*ARG3 );
+   if (ARG4 != 0 && ARG5 != 0)
+      PRE_MEM_WRITE( "kevent(events)", ARG4, sizeof(struct vki_kevent)*ARG5);
+   if (ARG6 != 0)
+      PRE_MEM_READ( "kevent(timeout)",
+                    ARG6, sizeof(struct vki_timespec));
+}
+
+POST(sys_kevent_fbsd12)
+{
+   vg_assert(SUCCESS);
+   if (RES > 0) {
+      if (ARG4 != 0)
+         POST_MEM_WRITE( ARG4, sizeof(struct vki_kevent)*RES) ;
+   }
+}
+
+//ekvz
+PRE(sys_getrandom) {
+    *flags |= SfMayBlock;
+    PRINT("sys_getrandom ( %#lx, %ld, %#ld )",ARG1,ARG2,ARG3);
+    PRE_REG_READ3(ssize_t, "getrandom",
+                 void *, buf, size_t, buflen, unsigned int, flags);
+    PRE_MEM_WRITE("getrandom(buf)", ARG1, ARG2);
+}
+
+POST(sys_getrandom)
+{
+    vg_assert(SUCCESS);
+    if (RES != -1)
+        POST_MEM_WRITE( ARG1, ARG2 );
+}
+
+//ekvz
+PRE(sys_minherit) {
+    PRINT("sys_minherit ( %#lx, %ld, %#ld )",ARG1,ARG2,ARG3);
+    PRE_REG_READ3(int, "minherit",
+                 void *, addr, size_t, len, unsigned int, inherit);
+    PRE_MEM_WRITE("minherit(addr)", ARG1, ARG2);
+}
+
+POST(sys_minherit)
+{
+    vg_assert(SUCCESS);
+    if (RES == 0)
+        POST_MEM_WRITE( ARG1, ARG2 );
+}
+
#undef PRE
#undef POST

@@ -3986,7 +4044,8 @@

// BSDXY(__NR_ntp_gettime,        sys_ntp_gettime),        // 248
    // nosys                                   249
-// BSDXY(__NR_minherit,            sys_minherit),            // 250
+   // ekvz
+   BSDXY(__NR_minherit,            sys_minherit),            // 250
    BSDX_(__NR_rfork,            sys_rfork),            // 251

    GENXY(__NR_openbsd_poll,        sys_poll),            // 252
@@ -4309,6 +4368,10 @@
    BSDXY(__NR_shmctl,            sys_shmctl),            // 512

    BSDXY(__NR_pipe2,            sys_pipe2),            // 542
+
+   //ekvz
+   BSDXY(__NR_kevent_fbsd12,    sys_kevent_fbsd12),        // 560
+   BSDXY(__NR_getrandom,        sys_getrandom),            // 563

    BSDX_(__NR_fake_sigreturn,        sys_fake_sigreturn),        // 1000, fake sigreturn

diff -ur work/stass-valgrind-freebsd-ce1acb28953f/include/vki/vki-scnums-freebsd.h patched/stass-valgrind-freebsd-ce1acb28953f/include/vki/vki-scnums-freebsd.h
--- include/vki/vki-scnums-freebsd.h    2020-09-17 20:30:07.460426000 +0200
+++ include/vki/vki-scnums-freebsd.h    2020-09-17 21:38:07.349167000 +0200
@@ -417,6 +417,10 @@
#define    __NR_getdirentries64    554
#define    __NR_fstatfs64        556

+//ekvz
+#define __NR_kevent_fbsd12    560
+#define __NR_getrandom        563
+
#define __NR_fake_sigreturn    1000

#endif /* __VKI_UNISTD_FREEBSD_H */

That kevent definition should be checked, though. I've just copied the layout from the existing (older?) syscall without looking into it much further.
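
For context on why it needs checking: as far as I can tell, FreeBSD 12 grew struct kevent (64-bit data plus a new ext[4] member), so the sizes the fbsd12 wrapper marks as read/written should be based on the new layout rather than the legacy one my copy-paste came from. This is my understanding of the 12.x struct, to be verified against <sys/event.h>:

Code:
#include <cstdint>
#include <cstdio>

struct kevent12_sketch {          // sketch of the FreeBSD 12 layout
    std::uintptr_t  ident;        // identifier for this event
    short           filter;       // filter for event
    unsigned short  flags;        // action flags for kqueue
    unsigned int    fflags;       // filter flag value
    std::int64_t    data;         // filter data value (was intptr_t before 12)
    void           *udata;        // opaque user data identifier
    std::uint64_t   ext[4];       // extensions (new in 12)
};

int main()
{
    std::printf("sizeof = %zu (expect 64 on amd64, vs 32 for the legacy struct)\n",
                sizeof(kevent12_sketch));
    return 0;
}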
 
Oh, that's pretty interesting. I had just mindlessly punched in cd /usr/ports/devel/valgrind and didn't even check whether there were other versions. I can report that, with the patch from the bug tracker (rather easy to locate and, from what I understand, already integrated in CURRENT?) and a couple of added syscalls, devel/valgrind seems to work, though.

The output was also helpful in narrowing down the error. Sadly it's not exactly obvious why the responsible part fails (it's an intentional crash triggered because some pretty complex code failed to initialize, and figuring out that reason will be pretty hard/painful, it seems). Right now I am back to checking whether some build setting might work around the problem, but it doesn't look very promising, so it's not unlikely I will soon have to fire up a debugger and step through that monstrosity in hopes of narrowing the cause down further...


Thank you, but I've already "fixed" it. It wasn't all that bad anyway. Mostly just a bit boring (and somewhat questionable, since I have little idea what I actually did there).

It's your choice, but even with a few patched syscalls devel/valgrind is vastly inferior to devel/valgrind-devel.
 
It's your choice, but even with a few patched syscalls devel/valgrind is vastly inferior to devel/valgrind-devel.

It likely is, if the -devel version is so much newer, and of course I plan to use that one from now on. I am compiling right now (ZZzzzZzzzZzzz...), but when it's finished (and isn't magically fixed because of build settings, which I have little hope for) I'll probably rerun the checks. Old version + third-rate patch isn't exactly what I want to rely on if I don't have to.
 
That kevent definition should be checked, though. I've just copied the layout from the existing (older?) syscall without looking into it much further.

valgrind-devel should be up to the same level as the "official" Valgrind, with its own regression tests acceptably clean.
 