Solved [C] Debug multithreading issue that *seems* to be related to LibreSSL

zirias@

Developer
Context: I'm working on a web service here: https://github.com/Zirias/swad ... it uses my library https://github.com/Zirias/poser which offers a "reactor" design, and for performance reasons, I'm extending it to allow a "multi-reactor" (multiple threads, each with their own event loop) with an "acceptor-connector" scheme (the main thread accepts connections, some other thread creates the local connection object and manages it in its event loop). It has been quite an effort so far, and I know there are still some issues, but basic operation now seems fine, except that it crashes with LibreSSL, and I can't find a good way to debug this.

Copying a text I already posted on mastodon for further details:

On FreeBSD, with all ports built with LibreSSL, can I somehow use the clang thread sanitizer on a binary actually using LibreSSL and get sane output?

What I now observe debugging swad:
  • A version built with OpenSSL (from base) doesn't crash. At least I tried very hard, really stressing it with jmeter, to no avail. Built with LibreSSL, it does crash.
  • Less relevant: the OpenSSL version also performs slightly better, but needs almost twice the RAM
  • The thread sanitizer finds nothing to complain about when built with OpenSSL
  • It complains a lot with LibreSSL, but the reports look "fishy", e.g. it seems to intercept some OpenSSL API functions (like SHA384_Final)
  • It even complains when running with a single-thread event loop.
  • I use a single SSL_CTX per listening socket and create the per-connection SSL objects from it ... also with multithreading; according to a few sources, this should be supported and safe (see the sketch after this list).
  • I can't imagine doing that on a *single* thread could break with LibreSSL, I mean, this would make SSL_CTX pretty much pointless
  • I *could* imagine that sharing the SSL_CTX across multiple threads, each creating their SSL objects from it, *might* not be safe with LibreSSL, but I have no idea how to verify that as long as the thread sanitizer gives me "delusional" output
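For illustration, here's a minimal sketch of the pattern in question (hypothetical names, not the actual swad/poser code): one SSL_CTX per listening socket, and per-connection SSL objects created from it, possibly on different worker threads.
Code:
/* Sketch only: one shared SSL_CTX per listening socket, created once on the
 * main thread; per-connection SSL objects created from it, possibly on a
 * different worker thread. Error handling omitted. */
#include <openssl/ssl.h>

static SSL_CTX *listener_ctx;   /* shared by all connections of this listener */

void listener_init(const char *certfile, const char *keyfile)
{
    listener_ctx = SSL_CTX_new(TLS_server_method());
    SSL_CTX_use_certificate_chain_file(listener_ctx, certfile);
    SSL_CTX_use_PrivateKey_file(listener_ctx, keyfile, SSL_FILETYPE_PEM);
}

/* called on whatever service worker thread owns the accepted connection */
SSL *connection_tls_new(int fd)
{
    SSL *ssl = SSL_new(listener_ctx);   /* the part whose thread-safety is in question */
    SSL_set_fd(ssl, fd);
    SSL_set_accept_state(ssl);
    return ssl;
}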
 
Update: It finally also crashed once with OpenSSL ... of course only testing a build *without* thread sanitizer, after massively stressing it over and over again without crashing. So, there's still also another issue hiding. 😞

Nevertheless, with LibreSSL, I can reliably make it crash pretty quickly under heavy load, and the sanitizer output looks useless as described ... so the question remains 🤔
 
I'm afraid I'm going to be of no help. If you don't mind me asking you a question anyway, why didn't you use libtls? It's a new library, no? There is an OpenSSL back end for libtls, security/libretls.
 
Jose, ever since I first used the OpenSSL API, I kind of hated it because it's really stupid (not to say: broken) in many areas. So, I perfectly understand this question. The main reason I'm still sticking with it is its status as the "de facto standard". My goal is to create reasonably portable software with minimal dependencies ... and every "POSIXy" system out there will have OpenSSL (or an API-compatible library) installed. Besides, this current project also uses it for two other purposes: verifying "proof-of-work" challenges (the same used by Anubis) for "guest logins", and issuing and verifying signed JSON Web Tokens.

Now, after two days of hunting fata morganas (e.g. I meanwhile found that the clang memory sanitizer is also useless unless I create an instrumented build of OpenSSL), I'd like to ask a different question:

Can anyone either confirm or refute the assumption that sharing an SSL_CTX across multiple threads that create their SSL objects from it really is safe? I didn't find any official documentation about that, only statements on mailing lists, in GitHub issues etc. that claim this should work. I'm reluctant to try creating one SSL_CTX per thread without actual proof that this is the issue, because it would be a pretty massive redesign of my code ... but maybe I'll have to bite that bullet, just to know... :(
 
Just wow, I finally have a lead on a possible cause of the crash, and it has nothing to do with TLS at all. After days of almost going insane, I noticed another pattern: it only ever crashed when it was allowed to daemonize. The code doing THAT is battle-tested for years and executed before any threads are created at all, BUT: it comes with a different default logging configuration. Logging is done "async", well, faked by letting a job for the thread pool do the actual logging. And this thread job has a timeout. Turns out something has been wrong with handling the timer for that timeout ever since I added support for multiple service worker threads. 🤯

Well, finally something to analyze; I can now easily reproduce it by forcing it to always use "async logging". I wouldn't be surprised if that was the reason for the crashes all along. Seeing it happen much more often and quickly with LibreSSL is not a contradiction; who knows what exactly happens in the address space and how it affects LibreSSL, you know, "nasal daemons". 🙈

Anyways, the question would remain interesting (I mean, I really can't use the thread sanitizer as soon as LibreSSL is used, which is quite unfortunate) ... although maybe more as a theoretical matter for me now.
 
Just in case anyone is interested: it also wasn't the timer directly, but a broken state where the timer had a different "owning thread" than its own "expired" event. In a nutshell, creating and enqueueing a thread job (for running on the pool) was previously safe from any thread (including "pool threads"), but with my move to multiple "service worker threads" (those running an event loop on select, kqueue, etc.), a thread job for the pool now needs an "owner thread", so its completion is signalled on the correct thread (which was previously always the main thread). I completely missed that my async logging still assumed it could create a thread job from "anywhere". And here's the fix: https://github.com/Zirias/poser/commit/b1aa9784955db4712d695d217344cb65bbff6ed6
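To illustrate the general idea (hypothetical names, not poser's actual API or the exact fix): a pool job records which service worker thread "owns" it, so that its completion event is raised on that thread's event loop instead of on a hard-coded main thread; code that creates jobs from a pool thread records the wrong owner.
Code:
/* Sketch only, hypothetical names. */
#include <pthread.h>
#include <stdlib.h>

typedef struct ThreadJob
{
    void (*work)(void *arg);      /* executed on a pool thread */
    void (*finished)(void *arg);  /* must run on the owner's event loop */
    void *arg;
    pthread_t owner;              /* the service worker that enqueued the job */
} ThreadJob;

ThreadJob *ThreadJob_create(void (*work)(void *), void (*finished)(void *), void *arg)
{
    ThreadJob *self = malloc(sizeof *self);
    if (!self) return NULL;
    self->work = work;
    self->finished = finished;
    self->arg = arg;
    /* assumption baked in here: the caller is a service worker thread;
     * calling this from a pool thread (as the async logging did) makes
     * the completion fire on the wrong thread */
    self->owner = pthread_self();
    return self;
}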

Well, building a "high performance" web service from scratch in C is kind of hard ... 🙈

The fixed version is currently deployed in my prod environment, showing not a single crash after several "stress tests" with jmeter, and being triggered by scraper bots regularly ... plus manual login tests in the browser. So, I'm pretty confident that I found the real issue, and it really was just coincidence that using LibreSSL triggered crashes much more often.

As I said, would still be interesting to know whether it's possible using the thread sanitizer on something linking LibreSSL without getting tons of silly false positives.
 
As I said, would still be interesting to know whether it's possible using the thread sanitizer on something linking LibreSSL without getting tons of silly false positives.

  • Thank you for alerting me to the existence of the clang thread sanitizer!
  • Congrats on finding and fixing the root-cause of your original problem. :)
  • I took a quick peek at the OpenSSL sources -- it looks like they added specific support for thread sanitizer (tsan), so I'm not surprised your program runs "quieter" under the thread sanitizer when linked with openssl. OTOH, I could not find any evidence that LibreSSL spent any effort whatsoever relating to clang sanitizers (other than one tiny thing I found in their tests/ subdirectory that is probably not relevant here).
  • Guessing you are already aware, but just in case... It appears tsan can be configured to ignore certain functions or source files; e.g.: using __attribute__((no_sanitize("thread"))) on function declarations.
  • I think this means you could examine the noisy output of tsan when running with libressl, and try strategically placing a few "ignorelist" entries for tsan, to see if that quiets the spurious output enough for tsan to be useful with your code (see the sketch after this list). This is a shot in the dark and depends on whether the noise comes from just a few libressl functions. YMMV.. Obviously you should be careful about adding a tsan ignore for something that might actually be relevant to your code.
  • If LibreSSL has more reliance on unlocked atomic thread-safe variable updates vs OpenSSL, one could expect more tsan output, since those are likely to look like a genuine race problem.

  • Regarding OpenSSL locking, I've had discussions with a dev who is very experienced with OpenSSL. He says that generally most common OpenSSL calls are thread-safe; however, he had an OpenSSL core routine occasionally return NULL after he allocated an OpenSSL object in one thread and then used/deallocated it in another thread. Since then, his "strategy" has been to allocate, use, and deallocate any SSL* / SSL_CTX* / BIO* objects from the same thread as much as possible; at least he has not noticed any new problems in several years of 24x7 use since implementing this strategy. It's also possible (likely?) that newer OpenSSL versions don't have this particular problem.
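To make the "ignorelist" idea a bit more concrete, here is a rough sketch of the two mechanisms clang's tsan documents (the function name is just a placeholder; check the clang sanitizer docs for the exact syntax that applies to your version):
Code:
# compile-time ignorelist, passed as -fsanitize-ignorelist=tsan-ignore.txt
fun:SomeNoisyLibraryFunction
src:*/some-third-party-dir/*

# or runtime suppressions, passed as TSAN_OPTIONS=suppressions=tsan.supp
race:SomeNoisyLibraryFunction
called_from_lib:libcrypto.so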
 
Thanks marq for your interesting thoughts on this!

If LibreSSL has more reliance on unlocked atomic thread-safe variable updates vs OpenSSL, one could expect more tsan output, since those are likely to look like a genuine race problem.
I had to think a while about that one. A while ago, I started to add more and more alternative code paths (build-time configurable) using some lock-free algorithms based on atomics to replace mutexes and semaphores. This rendered valgrind's "helgrind" tool practically useless, and thinking about it, that's no surprise as there's nothing to intercept any more. And that's how I discovered the thread sanitizer at first. It could perfectly interpret my lock-free code. So, your remark looked like a contradiction at first to me. But then I wondered how tsan can do that at all, and the obvious answer is instrumentation at compile time. So, of course, a shared lib won't have the necessary instrumentation.
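A tiny illustration of that point (not from poser): when a translation unit is built with -fsanitize=thread, clang instruments both plain and C11 atomic accesses at compile time, so tsan "understands" lock-free code without having any pthread call to intercept; a prebuilt shared library never gets that instrumentation.
Code:
/* Build with: clang -g -fsanitize=thread race.c -o race
 * tsan should report a data race on 'plain', but not on 'counter'. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int counter;   /* instrumented as an atomic access: no report */
static int plain;            /* unsynchronized access: reported as a race */

static void *worker(void *arg)
{
    (void)arg;
    atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    ++plain;
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; ++i) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 2; ++i) pthread_join(t[i], NULL);
    printf("%d %d\n", atomic_load(&counter), plain);
    return 0;
}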

I took a quick peek at the OpenSSL sources -- it looks like they added specific support for thread sanitizer (tsan)
I'm not sure this is the whole story? I see for example the following (nonsense) report in a build with LibreSSL:
Code:
WARNING: ThreadSanitizer: heap-use-after-free (pid=56420)
  Write of size 8 at 0x720c00001f20 by thread T17:
    #0 SHA512_Final /usr/src/contrib/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_common_interceptors.inc:9099:1 (swad+0x2ace6c)
    #1 SHA384_Final /usr/src/contrib/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_common_interceptors.inc:9098:1 (swad+0x2ac7d6)
    #2 EVP_DigestFinal_ex <null> (libcrypto.so.53+0x160df4)
    #3 readConnection /home/felix/git/swad/poser/src/lib/core/connection.c:678:59 (swad+0x316c26)
    #4 PSC_Event_raise /home/felix/git/swad/poser/src/lib/core/event.c:163:6 (swad+0x321333)
    #5 processEvents /home/felix/git/swad/poser/src/lib/core/service.c:1277:3 (swad+0x337927)
    #6 serviceLoop /home/felix/git/swad/poser/src/lib/core/service.c:1917:6 (swad+0x335d81)
    #7 runsecondary /home/felix/git/swad/poser/src/lib/core/service.c:1486:19 (swad+0x3374b4)

  Previous write of size 8 at 0x720c00001f20 by thread T17:
    #0 free /usr/src/contrib/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:725:3 (swad+0x291a88)
    #1 <null> <null> (libssl.so.56+0x4b19d)
    #2 readConnection /home/felix/git/swad/poser/src/lib/core/connection.c:678:59 (swad+0x316c26)
    #3 PSC_Event_raise /home/felix/git/swad/poser/src/lib/core/event.c:163:6 (swad+0x321333)
    #4 processEvents /home/felix/git/swad/poser/src/lib/core/service.c:1277:3 (swad+0x337927)
    #5 serviceLoop /home/felix/git/swad/poser/src/lib/core/service.c:1917:6 (swad+0x335d81)
    #6 runsecondary /home/felix/git/swad/poser/src/lib/core/service.c:1486:19 (swad+0x3374b4)

So, looking at the uppermost stackframes, you see tsan intercepts and replaces OpenSSL API calls, even though LibreSSL is linked. :-/
 
I had to think a while about that one. A while ago, I started to add more and more alternative code paths (build-time configurable) using some lock-free algorithms based on atomics to replace mutexes and semaphores. This rendered valgrind's "helgrind" tool practically useless, and thinking about it, that's no surprise as there's nothing to intercept any more. And that's how I discovered the thread sanitizer at first. It could perfectly interpret my lock-free code. So, your remark looked like a contradiction at first to me. But then I wondered how tsan can do that at all, and the obvious answer is instrumentation at compile time. So, of course, a shared lib won't have the necessary instrumentation.

Sorry about my earlier post being confusing/ambiguous..

I'm making some assumptions about the way the thread sanitizer works, because -fsanitize=thread is a compile-time option. My current thoughts are:
  • Mixing code compiled with sanitize=thread with code not compiled with -fsanitize: It's unclear what behavior can be expected if a program compiled with -fsanitize=thread is linked against a library not compiled with sanitize support. I wish the clang documentation for the sanitizers made this point clearer! Maybe only functions compiled with sanitize support are checked for races, while any code that is compiled without sanitize support (e.g.: external libs) may be partially or quietly exempt -- I don't know.
  • Debug symbols: I believe the quality/accuracy of your backtrace would be better if dependent libraries (at least SSL) were built/installed with debug symbols (i.e.: probably would fix your backtrace "<null>"s).
  • Support for only one sanitizer at a time: It appears that the -fsanitize=x allows the code to be compiled for only one particular sanitizer. So it doesn't seem likely to me that any libraries would be "pre-compiled" to support any clang sanitizers, because it has to be compiled with the particular one you want at that moment.
  • Evidence suggests openssl was tested with various sanitizers, but not libressl: When I said that openssl had sanitizer "support", I should have qualified that by saying I saw evidence in the openssl sources in several places that they provided-for and tested it with at least sanitize=thread and sanitize=address enabled (this might explain why tsan's diagnostic output under openssl was quieter than libressl).
    I believe the base system's openssl (and the one from ports - security/openssl) is not compiled with any sanitize= option by default, so you probably won't be able to learn about any thread data race conditions within either SSL library, or between your code and either SSL library, unless you recompile/reinstall SSL lib(s) with -fsanitize=thread.
  • Have you run ldd on your executables to verify you're getting the exact SSL libs you expected? I'd suggest verifying that both of your executables, the one linked with openssl and the one linked with libressl, are "switching" all SSL libs correctly (i.e.: look at the paths for libssl and libcrypto, so you don't unintentionally mix/match libs between openssl & libressl); a quick check is sketched after this list.
    I'm assuming that for openssl, all of the SSL libs should come from /lib or /usr/lib, while for LibreSSL they will probably come from /usr/local/lib (at least if you installed libressl from ports). Shot in the dark. YMMV.
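A quick check along those lines might look like this (sketch only; assuming the binary is called swad and sits in the current directory):
Code:
ldd ./swad | grep -E 'libssl|libcrypto'
# expect /lib or /usr/lib paths for the base OpenSSL build,
# /usr/local/lib paths for the LibreSSL-from-ports build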
For your reference, here are my temporary modifications to /etc/make.conf before recompiling/reinstalling security/libressl from ports to enable debugging symbols and enable sanitize=thread support:
Code:
STRIP=
CFLAGS=-g -O -fsanitize=thread
(FWIW, the STRIP= line tells any port builds not to run the strip utility on the resulting libraries and/or executables, and the CFLAGS line overrides the system-defaults to enable the thread sanitizer, enable debugging symbols, and reduce the default level of optimization so backtraces should be more accurate)

If you do this, please don't forget to restore the original /etc/make.conf after you are done recompiling/installing libraries. :)

If you re-run tsan after recompiling libressl with debugging symbols and sanitize=thread support, I'd be very curious to hear the results. Hope this helps..
 
Hi marq, you didn't address the one "little detail" that confused me the most, which is the stackframes showing clang's thread sanitizer intercepted some OpenSSL API functions (namely SHA384_Final and SHA512_Final). I think this might be part of the issue, and it's unlikely using a debug build of LibreSSL would change anything about that.

Mostly agree with the rest of what you wrote, and I guess I'll give it a test with an instrumented LibreSSL debug build some day. It will have to wait a while until I have finally upgraded all my infrastructure again and completed some work on ports that's been waiting for too long. Just for completeness, the ports framework supports the WITH_DEBUG knob (which should do "the right thing" no matter which upstream build system is used, so no need for hacks like setting STRIP to empty; custom CFLAGS just adding the sanitizer should suffice), and also a nice WITH_DEBUG_PORTS helper.
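Something like this in /etc/make.conf might do it (a sketch based on my understanding of those knobs, not tested here; note the CFLAGS line still applies to every port built while it's present, so remove it afterwards):
Code:
WITH_DEBUG_PORTS=security/libressl
CFLAGS+=-fsanitize=thread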

Oh, of course I verified the dynamic linker links the library I expect. My build system makes it easy to pick an OpenSSL implementation, using pkg-config by default, but offering overrides, see https://github.com/Zirias/poser/blob/master/src/lib/core/core.mk#L183
 
...you didn't address the one "little detail" that confused me the most, which is the stackframes showing clang's thread sanitizer intercepted some OpenSSL API functions (namely SHA384_Final and SHA512_Final). I think this might be part of the issue, and it's unlikely using a debug build of LibreSSL would change anything about that.

Thank you for alerting me to the WITH_DEBUG and WITH_DEBUG_PORTS knobs.

Sorry I missed the "little detail".

Are the symbols SHA384_Final and SHA512_Final defined in your LibreSSL's libcrypto?

If so, then I don't think their presence in your backtrace can be used as a telltale that OpenSSL-specific code is still being somehow/mysteriously called by your LibreSSL-linked executable. FWIW, I confirmed those two symbols are defined in my LibreSSL's libcrypto (can be confirmed using nm on the .a or nm -D on the .so)

If you still have any open questions about this, please let me know..
 
Are the symbols SHA384_Final and SHA512_Final defined in your LibreSSL's libcrypto?
I think it's pretty obvious "these are not the symbols we're looking for" (*scnr*) by another close look at the stacktrace:

Code:
#0 SHA512_Final /usr/src/contrib/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_common_interceptors.inc:9099:1 (swad+0x2ace6c)
#1 SHA384_Final /usr/src/contrib/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_common_interceptors.inc:9098:1 (swad+0x2ac7d6)

Note the source location ;) ... we're somewhere inside llvm's "compiler runtime". We're being called from within LibreSSL's libcrypto of course:

Code:
#2 EVP_DigestFinal_ex <null> (libcrypto.so.53+0x160df4)

No source location here, because this wasn't a debug build of libcrypto.

But as I said, thanks for all your thoughts about the issue anyway; it certainly makes sense to try with an instrumented debugging build and see what happens. I'll give an update here once I get around to doing that.
 
I had to think a while about that one. A while ago, I started to add more and more alternative code paths (build-time configurable) using some lock-free algorithms based on atomics to replace mutexes and semaphores.
On FreeBSD, replacing a mutex by atomic integers is hardly an optimization though, as our pthread_mutex_lock(3) has been using atomic test-and-set to acquire an uncontested or quickly acquirable mutex, and only sleeping after the atomic acquisition has failed a number of times in sequence ... since FreeBSD 7 (current implementation here)

There's also pthread_spin_lock(3), which acquires a lock exclusively using atomic test-and-set, but it's considered inefficient: in cases where it can acquire a lock quickly, pthread_mutex_lock will acquire it just as quickly, and in cases where it can't, it will waste processor time and memory-interface locks while pthread_mutex_lock doesn't.
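The adaptive behaviour described above boils down to something like this (a rough sketch with C11 atomics, not FreeBSD's actual libthr code; the usleep stands in for the real kernel sleep primitive):
Code:
#include <stdatomic.h>
#include <unistd.h>

/* Try to take the lock with an atomic test-and-set; spin a bounded number
 * of times, then back off and sleep instead of burning CPU. */
static void adaptive_lock(atomic_flag *lock)
{
    int spins = 0;
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
    {
        if (++spins > 64)
            usleep(100);    /* stand-in for a real futex/_umtx_op wait */
    }
}

static void adaptive_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}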

The system stuff has also been scrutinized and hardened by decades of (ab)usage and audits, and will recognize "almost all" niche cases to either handle them or bail out with a failure code and errno set appropriately, while "lightweight implementations" usually are faster because they don't try detecting the niche cases (e.g. an atomic int ... being placed on a memory-mapped file ... that is on an NFS share).
 
On FreeBSD, replacing a mutex by atomic integers is hardly an optimization though, as our pthread_mutex_lock(3) has been using atomic test-and-set to acquire an uncontested or quickly acquirable mutex, and only sleeping after the atomic acquisition has failed a number of times in sequence ... since FreeBSD 7 (current implementation here)
It depends. Just replacing a mutex by your own custom locking using atomics (effectively a "spinlock") won't improve things in most cases and is even likely to perform worse when threads on the same CPU/core contend for the same lock, indeed. But that's not what I did. I went for actual lock-free algorithms instead, together with some logic defaulting the number of potentially competing threads to the number of available cores. Benchmarks show an improvement for that. In a nutshell, these algorithms don't naively wait on something but just start doing their job on the actual data structure's metadata and only retry when they detect an inconsistency, possibly even trying to fix it themselves (often called "assist" in the literature) instead of waiting for another thread to do it. A (well-known) lock-free queue I use, for example to execute something on a different event-handling thread, mainly works on atomic "tail" and "next" pointers, and if the tail node is found to already have a next pointer, the tail is adjusted by whatever thread finds that inconsistency.
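The enqueue path of such a queue looks roughly like this (a sketch in the style of the well-known Michael/Scott queue, simplified and without memory reclamation; not poser's actual implementation):
Code:
#include <stdatomic.h>
#include <stdlib.h>

typedef struct Node
{
    void *data;
    _Atomic(struct Node *) next;
} Node;

typedef struct Queue
{
    _Atomic(Node *) head;   /* dequeue side, not shown here */
    _Atomic(Node *) tail;
} Queue;

void enqueue(Queue *q, void *data)
{
    Node *node = malloc(sizeof *node);
    node->data = data;
    atomic_store(&node->next, NULL);

    for (;;)
    {
        Node *tail = atomic_load(&q->tail);
        Node *next = atomic_load(&tail->next);
        if (next == NULL)
        {
            /* tail really is the last node: try to link the new one */
            if (atomic_compare_exchange_weak(&tail->next, &next, node))
            {
                /* ok if this fails, another thread will "assist" below */
                atomic_compare_exchange_weak(&q->tail, &tail, node);
                return;
            }
        }
        else
        {
            /* tail is lagging behind: help move it forward, then retry */
            atomic_compare_exchange_weak(&q->tail, &tail, next);
        }
    }
}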

edit: This whole thing is designed to be portable, but I would just expect any sane OS these days to provide a good mutex implementation that tries a limited amount of spinning before issuing a costly syscall and context switch. The code paths with mutexes and similar are still there; they are chosen when the compiler doesn't have C11 atomics, when the target platform can't guarantee lock-free "atomic pointers", or when the user explicitly opts out of atomics at build time. As a side effect, I can easily compare performance 😉. As expected, there's no huge difference, but it's measurable ... depending on the scenario, around 10-20% difference in the maximum throughput the service can handle without "dropping" some requests.
 
For thread hazard detection you have 3 tools:

Thread Sanitizer. The main problem with this is that you will need to instrument your code and all libraries that are important to you.

Valgrind Helgrind and Valgrind DRD. No need for special builds (unless you are using OpenMP).

All of these tools are only as good as the thread mechanism encapsulation that you see. If you are using any custom synchronisation mechanisms you will need to add manual instrumentation to the code. Generally if the tools fail to see locks for any reason they will generate vast numbers of false positives.

Concerning spinlocks, it depends. If your code needs many short locks then spinlocks will be significantly faster than pthread locks. If your code has few locks but held for long periods then spinlocks are inefficient and pthread locks are better.
 
If you are using any custom synchronisation mechanisms you will need to add manual instrumentation to the code.
Not true in general, clang's tsan doesn't show any false positives in my current code, and it did show diagnostics when there were still errors, with messages naming "atomic loads" and similar, in places where the generated amd64 assembly most likely had some normal load instruction not needing any barrier. I mean, sure, this needs instrumentation to work, but tsan seems to "auto-instrument" here.

But anyways, the issue here is probably not (directly?) related to atomics; they were just the reason I first experimented with something other than helgrind. What remains mysterious to me is how and why tsan intercepts some OpenSSL API calls. Sure, the reason could be that their implementations are known to use some sort of custom synchronization that would cause trouble with "normal" (not instrumented) builds of OpenSSL, but that's just guessing without some docs or other sources explaining it. And even if I knew the "why", it remains unclear how to deal with that when trying to debug something that links LibreSSL instead...
 
So, looking at the uppermost stackframes, you see tsan intercepts and replaces OpenSSL API calls, even though LibreSSL is linked. :-/
I'm not a tsan developer but I'm fairly certain that it neither intercepts nor replaces any SSL calls. If you think otherwise then you are deluding yourself into believing what you want to believe. Stick to hard evidence. What is the output of ldd? Even that is not enough, you also need to run the binary under gdb and check all calls to dlopen.
 
Not true in general, clang's tsan doesn't show any false positives in my current code, and it did show diagnostics when there were still errors, with messages naming "atomic loads" and similar, in places where the generated amd64 assembly most likely had some normal load instruction not needing any barrier. I mean, sure, this needs instrumentation to work, but tsan seems to "auto-instrument" here
Are all your third party libraries instrumented?

For the tsan synchronisation support, their FAQ says

  • Q: What synchronization primitives are supported?
TSan supports pthread synchronization primitives, built-in compiler atomic operations (sync/atomic), C++ <atomic> operations are supported with llvm libc++ (not very throughly tested, though).
 
I'm not a tsan developer but I'm fairly certain that it neither intercepts nor replaces any SSL calls.
See my post above containing the stacktraces I get, which clearly show intercepted SSL calls. Do you have any other interpretation for these?
 
See my post above containing the stacktraces I get, which clearly show intercepted SSL calls. Do you have any other interpretation for these?
Well there is the blindingly obvious possibility that the tool is simply reporting what is really happening.

I would expect asan and memcheck to also report heap use after free errors. Have you tried them?

As always, one of the biggest problems with sanitizers is needing to instrument ALL of the code. Your missing call stack information indicates to me that you haven't instrumented all your libraries. There is a tsan option to ignore uninstrumented modules. Did you try that?

When it generates call stacks, tsan will be doing something like traversing the stack to get instruction addresses, querying what library or exe is mapped to that address and then using ELF and DWARF (if available) to get the function name, file name and line. You should be able to see with truss or ktrace when the SSL library is being loaded.

The main issue that I’m aware of with crypto libs is that they often use SIMD in a manner analogous to str* and mem* functions. Specifically that means they take advantage of the fact that memory will always be allocated in multiples of 8 or 16 bytes (depending on the OS). These optimisations can result in buffer overflow reads. That could be a reason for tsan using a non-SIMD replacement function.
 
Well there is the blindingly obvious possibility that the tool is simply reporting what is really happening.

Which is, this:

Code:
  Write of size 8 at 0x720c00001f20 by thread T17:
    #0 SHA512_Final /usr/src/contrib/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_common_interceptors.inc:9099:1 (swad+0x2ace6c)
    #1 SHA384_Final /usr/src/contrib/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_common_interceptors.inc:9098:1 (swad+0x2ac7d6)

There's a source location (from debugging symbols) given; it's in some interceptor code located in FreeBSD's src tree.

Are you trying to tell me the stacktrace output is wrong?

edit: just checked these source lines, they are:
Code:
SHA2_INTERCEPTORS(384, u64)
SHA2_INTERCEPTORS(512, u64)
from https://cgit.freebsd.org/src/plain/...nitizer_common_interceptors.inc?h=releng/14.1

I still think it's more than obvious these calls are intercepted, but I still don't know why and how to deal with that when trying to debug something linked with LibreSSL.
 
Maybe try a more recent compiler and tsan library?

commit d0b9c2c5647656738cda3fb670aa5d3b3a69d784
Author: David CARLIER <devnexen@gmail.com>
Date: Wed Oct 9 05:47:00 2024 +0100

[compiler-rt] Remove SHA2 interceptions for NetBSD/FreeBSD. (#110246)

To Fix #110215

Interceptors introduced with 18a7ebda99044473fdbce6376993714ff54e6690

See
 
Okay ... I never knew about "libmd", wow! This finally explains it (and upgrading the machine where I'm doing all this testing is already in my pipeline, so good to know that will fix the issue).
 