random program crashes, no coredumps, and error 94

I upgraded one of my nodes from 12.3 to stable/13

Then I started to get random program crashes from the backup tool sysutils/bareos-client, always after some 10-20 hours of runtime.
I hacked the backup tool to actually produce a coredump (they had the great idea to catch SIGSEGV & friends, for whatever crap might work on Linux, but not here), and set kern.sugid_coredump=1.

Now the crashes are either SIGSEGV or SIGBUS, and they come from arbitrary places in the code. Oh crap.

I looked into the images, and it appears that places in memory (e.g. memory addresses) are overwritten with the string "oSoC" (it seems to always be that string, but at different locations).

I had another look:
Code:
# strings bareos-fd.0.0.core
FreeBSD
bareos-fd
/usr/local/sbin/bareos-fd -u root -g wheel -v -c /usr/local/etc/bareos
FreeBSD
CoSo
CoSo
+oSo
FreeBSD

Now, is that as it should be, or something else? (not sure if it's a stable/13 issue or an issue with the program)

I tried to get some sane coredumps to compare, from 12.3 and stable/13, in the usual way (kill -SEGV). And I found I cannot! Instead, I get this:

Code:
kernel: pid 53977 (sleep), uid (0):  Path `/var/tmp/sleep.0.0.core' failed on initial open test, error = 94
kernel: pid 53977 (sleep), jid 0, uid 0: exited on signal 4

Doesn't work in either 12.3 or stable/13, doesn't work in single user, doesn't work in GENERIC kernel.

Error 94 says "Not permitted in capability mode", and this leads to capsicum(4), which doesn't give much of a clue.

Now I am wondering: why is the sleep command in single user mode running in capability mode?

I finally found a switch kern.capmode_coredump in sysctl that helps with the issue, but it doesn't seem to be documented anywhere (not even the commit log gives much insight).
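For completeness, the knob can also be read (or set, as root) programmatically via sysctlbyname(3) - a minimal sketch, nothing bareos-specific:
Code:
#include <sys/types.h>
#include <sys/sysctl.h>
#include <cstdio>
// Query kern.capmode_coredump; pass a new value to enable it (root only).
int main() {
    int val = 0;
    size_t len = sizeof(val);
    if (sysctlbyname("kern.capmode_coredump", &val, &len, nullptr, 0) == -1) {
        std::perror("sysctlbyname");
        return 1;
    }
    std::printf("kern.capmode_coredump = %d\n", val);
    // To enable: int on = 1;
    // sysctlbyname("kern.capmode_coredump", nullptr, nullptr, &on, sizeof(on));
    return 0;
}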

Finally, those oSoC (or CoSo) seem to not belong there, so this might be a rogue pointer spamming the memory space. Love it. :(

Posting this in case anybody else gets error 94.
 
It does smell like a random memory overwrite.

On a little-endian machine (I assume this is Intel?), the string CoSo read as an integer is 0x6f536f43, or about 1.9 billion, which doesn't seem plausible as an integer, and isn't even very likely half of a 64-bit pointer.
 
It does smell like a random memory overwrite.

On a little-endian machine (I assume this is Intel?), the string CoSo read as an integer is 0x6f536f43, or about 1.9 billion, which doesn't seem plausible as an integer, and isn't even very likely half of a 64-bit pointer.
No, it's certainly not an address:

Code:
# lldb -c /tmp/bareos-fd.0.0.core /usr/local/sbin/bareos-fd
(lldb) target create "/usr/local/sbin/bareos-fd" --core "/tmp/bareos-fd.0.0.core"
Core file '/tmp/bareos-fd.0.0.core' (x86_64) was loaded.

(lldb) bt
* thread #1, name = 'bareos-fd', stop reason = signal SIGSEGV
  * frame #0: 0x00000008002f66e2 libbareos.so.20`alist::get(this=0x0000000801188228, index=0) at alist.cc:132:10
    frame #1: 0x000000080029788a libbareosfind.so.20`AcceptFile(ff=0x000000080117e000) at find.cc:347:39

(lldb) var
(alist *) this = 0x0000000801188228
(int) index = 0

(lldb) print *this
(alist) $0 = {
  items = 0x0000000000000000
  num_items = 1867738957
  max_items = 0
  num_grow = 1
  cur_item = 0
  own_items = true
}

1867738957 = 0x6f536f4d = MoSo
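Just to make the byte order explicit, a standalone snippet (not from the bareos sources) that reinterprets those four ASCII bytes the way a little-endian CPU reads the int:
Code:
#include <cstdio>
#include <cstring>
#include <cstdint>
int main() {
    const char bytes[4] = {'M', 'o', 'S', 'o'};  // the ASCII junk found in the field
    int32_t as_int;
    std::memcpy(&as_int, bytes, sizeof(as_int)); // read as a little-endian 32-bit int
    std::printf("%d = 0x%08x\n", as_int, as_int);
    return 0;                                    // prints: 1867738957 = 0x6f536f4d
}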
 
Now the crashes are either SIGSEGV or SIGBUS, and they come from arbitrary places in the code. Oh crap.
The one time I had this problem, it was a stack overflow. A colleague had "fixed" a size problem by increasing the size of an array allocated on the stack by 10x. Worked fine on Solaris. Crashed and burned in random places on Windows.
 
The one time I had this problem, it was a stack overflow. A colleague had "fixed" a size problem by increasing the size of an array allocated on the stack by 10x. Worked fine on Solaris. Crashed and burned in random places on Windows.

It's not the first strange effect there. Two releases back, it happened to run for a few days and then, suddenly and silently, started to encrypt the backup data in a wrong way. It uses an X.509 certificate for encryption and signing, and it became obvious that one half of the public/private keypair had strangely changed, in the midst of operation.
I couldn't figure out what might be going on, and then the next release worked well again. But now I'm getting an idea that this might be related.

It seems to happen only under specific conditions - and I am doing rather unusual things also (like running a limited backup every 10 minutes as a kind of "undelete" feature) - but here I have no idea how this could be isolated. One would need to record every memory access, during hours of runtime...
 
Debugging random memory overwrites is very very hard. Personally, I know of two techniques that have worked for me. One is boring and brute force: Go through the code, and change it to make sure no pointer variable is used before being initialized (Resource Acquisition is Initialization, named after the Italian radio station), and to make sure any pointer variables are immediately set to NULL when going out of scope. The second one is more amusing, but less likely to succeed: add lots of honeypots to the memory. For example, add to every struct an extra word, which is the checksum of the struct. Update the checksum after every modification of the struct, and check it before every use. Or poison memory before use and after freeing with 0xdeadbeef, and check for those when using freshly allocated memory. The more honeypots you add, the more likely it is that you catch the offender in the act.
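For illustration only (hypothetical struct, not from any real codebase), the checksum honeypot and the poisoning could look roughly like this:
Code:
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
// Hypothetical example struct with a checksum "honeypot" appended.
struct Node {
    void*    items;
    int32_t  num_items;
    int32_t  max_items;
    uint32_t checksum;   // must always match the payload bytes above it
    uint32_t Compute() const {
        // Cheap fold over the payload; anything better than nothing will do.
        uint32_t sum = 0;
        const unsigned char* p = reinterpret_cast<const unsigned char*>(this);
        for (size_t i = 0; i < offsetof(Node, checksum); ++i)
            sum = (sum << 1) ^ p[i];
        return sum;
    }
    void Seal() { checksum = Compute(); }                  // call after every modification
    void Check() const { assert(checksum == Compute()); }  // call before every use
};
// Poison freed memory so stale reads stand out in a core dump.
void PoisonAndFree(void* p, size_t len) {
    std::memset(p, 0xDD, len);
    std::free(p);
}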
 
... or maybe not ...

The node in question, which I upgraded to stable/13, is called Moon. The backup client configuration was copied from another node, called Conn. The backup client configuration file did still carry the name Conn, until I recognized that and changed it to Moon.
At that same time, the bogus memory content changed from CoSo to MoSo.

Searching for something similar that the other two letters might come from put me on a track. This line defines two variables ...
... and it seems the bogus content is the first two bytes from each of them. But I am not a native speaker of C++. I don't really understand it.
 
The backup client configuration was copied from another node, called Conn. The backup client configuration file did still carry the name Conn, until I recognized that and changed it to Moon.
That's the kind of smoking gun that makes debugging these things more practical. At this point, you only need to look at the code that uses the names.

Alas ...
But I am not a native speaker of C++. I don't really understand it.
I looked a little bit. The code is "well written", in the sense of being clear and verbose; alas, there is a great lack of comments. And it uses all its own libraries; the places where the names are touched are in functions named Bvsnprintf() and Mmsg(). So to debug this, one has to first dig down a few levels to see what these functions do. And the problem seems to be in the area of string handling (in particular there are variable-length printf-style specifications there), which is a notorious source of memory bugs in C and C++. It would take me many hours to dig into this, and I'm not volunteering.

Suggestion: Find out who maintains this thing, and contact them directly. You have a lot of information now, it might be easy for them.
 
That's the kind of smoking gun that makes debugging these things more practical. At this point, you only need to look at the code that uses the names.

Alas ...

I looked a little bit. The code is "well written", in the sense of being clear and verbose; alas, there is a great lack of comments. And it uses all its own libraries; the places where the names are touched are in functions named Bvsnprintf() and Mmsg(). So to debug this, one has to first dig down a few levels to see what these functions do. And the problem seems to be in the area of string handling (in particular there are variable-length printf-style specifications there), which is a notorious source of memory bugs in C and C++. It would take me many hours to dig into this, and I'm not volunteering.
I understand that. ;)

The important task of this specific function is to print error messages to stderr, which is /dev/null. And it emits error messages, probably because my site can do IPv4+6, but this KVM provider does not give me an IPv6 address. So there is an error 47 for each backup job, before the other address then does bind().

What they are doing with these Mmsg() etc. is maintain their own malloc pools.
Then there is a comment somewhere in the code that they removed these pools because the OS can nowadays do it just as well. But apparently the pools aren't fully removed.

Anyway, these error 47 messages are appended one after the other in memory, and as there is a job run every 10 minutes, this gets lengthy. And apparently then something overruns. I don't yet know if these memory pools are broken by design, broken by technical debt, or were never supposed to cope with dozens of messages - anyway, I have now removed that elaborate task of printing those messages to /dev/null. Let's see what happens.
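Purely to illustrate the failure mode (this is not the bareos pool code, just the shape of the bug as I understand it): if messages are appended into a pooled buffer whose capacity is never re-checked, the writes eventually run past the allocation:
Code:
#include <cstdlib>
#include <cstring>
// Hypothetical, simplified "message pool": a buffer plus a fill offset.
struct MsgPool {
    char*  buf;
    size_t cap;
    size_t used;
};
// Buggy append: once used + len + 1 exceeds cap, every further message
// lands in foreign memory.
void AppendMsg(MsgPool* p, const char* msg) {
    size_t len = std::strlen(msg);
    std::memcpy(p->buf + p->used, msg, len + 1);
    p->used += len;
}
// A correct version grows (and checks) the pool first.
void AppendMsgChecked(MsgPool* p, const char* msg) {
    size_t len = std::strlen(msg);
    if (p->used + len + 1 > p->cap) {
        size_t newcap = p->cap ? p->cap * 2 : 256;
        while (newcap < p->used + len + 1) newcap *= 2;
        char* nb = static_cast<char*>(std::realloc(p->buf, newcap));
        if (nb == nullptr) return;   // realloc failure must be handled
        p->buf = nb;
        p->cap = newcap;
    }
    std::memcpy(p->buf + p->used, msg, len + 1);
    p->used += len;
}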

Suggestion: Find out who maintains this thing, and contact them directly. You have a lot of information now, it might be easy for them.

Yeah, I know who maintains the thing. There are two rivalling gangs, the bacula gang and the bareos gang, and one is said to have stolen the code from the other, and both of them are primarily interested in securing their territories and making money.
I already engaged in one dispute about the definition of the word "bug" - it ended up with paperwork that I should sign in order to be allowed to find bugs, but I would have needed an international industrial property attorney to even understand what I was supposed to sign there - so I basically told them to go to hell instead.

On the bright side, I got it all to compile with full debugging symbols (including the libs), so it should now be possible to run it interactively in the debugger, and watch what the code actually does...
 
Looking at the dump (using src from ports) you crashed on alist.cc:132:10:
Code:
return items[index];
in alist::get(). From the output you provided, items is set to NULL and index to 0. The segmentation fault is on items[0], i.e. address NULL.
It seems the node has stale data; while items is uninitialized, num_items is set to some arbitrary value.

Looking at the code there are several places where malloc() or realloc() is called but no check on its success is made. Some functions do check if items is NULL, some rely on num_items value (such as this get function). Using signed int for tracking size can cause some pain too.
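A defensive variant of such a get() would guard against both (sketch only, the real alist API differs):
Code:
// Bounds- and NULL-checked element access, returning nullptr on any doubt.
void* GetChecked(void** items, int num_items, int index) {
    if (items == nullptr)                  // never allocated, or allocation failed
        return nullptr;
    if (index < 0 || index >= num_items)   // reject stale or bogus counters
        return nullptr;
    return items[index];
}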
 
Looking at the dump (using src from ports) you crashed on alist.cc:132:10:
It crashes at lots of different places; this is only one example.
num_items is set to some arbitrary value.
It is an ASCII string. ;)
And it doesn't belong there.

Looking at the code there are several places where malloc() or realloc() is called but no check on its success is made.
Oh. Hm, why am I not surprised?

Some functions do check if items is NULL, some rely on num_items value (such as this get function).
Yes. Some things do it this way, some do it the other way, and many do it somehow redundantly, so that it usually works even when it is erroneous.

Using signed int for tracking size can cause some pain too.
Yeah. I once tried to create a "bug of the week" radio show. I think I had enough material for a few years.

But put it this way: Say you want a really versatile semi-enterprise-scale backup solution with an abundance of features. Say you don't want to pay anything for it. Say you want good coding quality.
Choose two; you can't have all three.
 
It is an ASCII string. ;)
And it doesn't belong there.
I'm not sure what you mean by this. num_items is defined as int. If you set it to 26739, or 0x6873, the memory where it's stored would show you "sh" if treated as a string. But that doesn't mean it's a string. There are plenty of numbers in the range of int that could be interpreted as a valid string. :)

items being NULL while get() was called on it would be my first choice to look at. If you have crashes at other places most likely they have some common denominator.

Yeah, if that's good enough for what you need I have no argument against it.
 
I'm not sure what you mean by this. num_items is defined as int.
Yes, it is defined as int, and it has value 1867738957 - which is an int, but which is a bogus value in the application's context.

Then, if you consider that an int is a 4-byte value (usually), and if you try and read these 4 bytes as a 4-byte string, you get 4 characters. And I think I know from where these 4 characters were copied to this place (erroneously, that is).

If you set it to 26739, or 0x6873, the memory where it's stored would show you "sh" if treated as a string. But that doesn't mean it's a string. There are plenty of numbers in the range of int that could be interpreted as a valid string.
Yes. But as I find this very same sequence of 4 characters at various places - here as an int, elsewhere as a memory address - and wherever it appears it results in a program crash, I think there is some relation.

But we will see - if I am right, there should be no further program crashes now.
 
This sounds like so much fun! Clearly, you have it as much under control as can be.

There are two rivalling gangs, the bacula gang and the bareos gang, and one is said to have stolen the code from the other, and both of them are primarily interested in securing their territories and making money.
They don't sing and dance while doing that? It would really add some flair. Think "West Side Story".
 
as I find this very same sequence of 4 characters at various places - here as an int, elsewhere as a memory address - and wherever it appears it results in a program crash, I think there is some relation.
Quite possibly, yes. You do see the bigger picture with the dump, after all.

Did you have these issues on 12 too or did they start to occur on 13 only?
 
Quite possibly, yes. You do see the bigger picture with the dump, after all.

Did you have these issues on 12 too or did they start to occur on 13 only?
Yes, 13 only.
I still have the machine's image for 12.3 - I might reinstall that and compare, but only after I know a little bit more about what actually happens.

For now I have verified that, indeed, because the server is dual-stack, we get an error at the first bind() - this is in the code where we connect to the server. Pmsg2() prints an error message to stdout (which is /dev/null when invoked from rc.d).
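The underlying pattern is presumably the usual getaddrinfo() walk over both address families; a generic sketch (not the bareos code, which reports the failure through Pmsg2()) of how the IPv6 attempt fails with error 47 (EAFNOSUPPORT) on this host before the IPv4 attempt succeeds:
Code:
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <netdb.h>
#include <sys/socket.h>
#include <unistd.h>
// Try every resolved address (IPv6 first on a dual-stack resolver) and
// keep going until one family works; failures are only worth a log line.
int ConnectToDirector(const char* host, const char* port) {
    struct addrinfo hints{}, *res = nullptr;
    hints.ai_family   = AF_UNSPEC;      // ask for both IPv6 and IPv4 results
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;
    int fd = -1;
    for (struct addrinfo* ai = res; ai != nullptr; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0) {
            // Here (or at bind/connect, depending on the setup) the IPv6
            // attempt yields errno 47; the loop simply moves on to IPv4.
            std::fprintf(stderr, "socket: %s\n", std::strerror(errno));
            continue;
        }
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;                       // connected
        std::fprintf(stderr, "connect: %s\n", std::strerror(errno));
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}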

When removing that Pmsg2(), no more crashes seem to appear - but with things happening only after 10-30 hours, this doesn't say much.
So I went the other way round, and repeated that Pmsg2() invocation a hundred times, recompiled, and installed. Now things crash after half an hour, and memory looks like this:

Code:
* thread #1, name = 'bareos-fd', stop reason = signal SIGBUS
  * frame #0: 0x00000008002f6700 libbareos.so.20`alist::destroy(this=0x00000008011892c8) at alist.cc:141:14
(lldb) var
(alist *) this = 0x00000008011892c8
(int) i = 0
(lldb) print *this
(alist) $0 = {
  items = 0x6f536f4d6f536f4d
  num_items = 1867738957
  max_items = 1867738957
  num_grow = 1867738957
  cur_item = 1867738957
  own_items = true
}

That rocks! :)
Also with strings:

Code:
# strings bareos-fd.0.0.core
ekly.l
h_MoSo
MoSo
oMoSo@
MoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSo
MoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoIoSoMoSoMoSo
MoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSoMoSo
ipsec
 
That smells like some sort of UaF bug.
Reading the bareos-client code is like a punishment. Those guys are not even sticking to the same indent.
But then, if the client is the same (assuming you are using the 1:1 same binary as on 12), what changed in 13 that is significant to this issue? That's a rhetorical question.
 
That smells like some sort of UaF bug.
Reading the bareos-client code is like a punishment.
You're welcome. I usually try to avoid that - fix things w/o reading the code...

But then, if the client is the same (assuming you are using the 1:1 same binary as on 12),
I did. Then it crashed, then I recompiled the port under 13, and that didn't change anything.

what changed in 13 that is significant to this issue? That's a rhetorical question.
Memory allocation, whatever it might be. It's enough for these bogus writes to go to some other place where they are not computationally used (e.g. when they instead overwrite the backup payload, nothing will crash - at least not until after a restore *eg*). And yes, this is a good question.
 
This is getting more entertaining. I found the offending line in the code; it is here:


Code:
fputs(buf, stdout);
fflush(stdout);

I didn't understand it, so I changed it to report the return code and errno:
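Roughly like this - a sketch of the instrumentation, not the literal patch:
Code:
#include <cerrno>
#include <cstdio>
#include <cstring>
// Wrap the two calls and report their return values and errno.
// (In the real setup stderr is also /dev/null, so log to a file instead.)
static void EmitMessage(const char* buf) {
    errno = 0;
    int r1 = std::fputs(buf, stdout);
    int e1 = errno;
    errno = 0;
    int r2 = std::fflush(stdout);
    int e2 = errno;
    std::fprintf(stderr, "fputs=%d errno=%d (%s); fflush=%d errno=%d (%s)\n",
                 r1, e1, std::strerror(e1), r2, e2, std::strerror(e2));
}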
The fflush() fails at the first try. The fputs() also starts failing after writing a total of some 4350 bytes.
These 4350 bytes are what I saw in memory with strings, in the core file or in /proc/PID/mem.

The errno is EBADF.

And apparently, when repeating these erroneous calls a couple of times, arbitrary memory gets overwritten.

stdout is present and points to /dev/null, which is writable:

Code:
[root@moon /tmp]# lsof -p 18050
COMMAND     PID USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
bareos-fd 18050 root    0r  VCHR               0,13      0t0   13 /dev/null
bareos-fd 18050 root    1r  VCHR               0,13      0t0   13 /dev/null
bareos-fd 18050 root    2r  VCHR               0,13      0t0   13 /dev/null
bareos-fd 18050 root    3u  IPv4 0xfffffe0086130870      0t0  TCP moon-
 
maybe stdout is already borked and points to something other than a FILE struct
Don't think so. Because when running the program in a terminal with the default stdout, the output appears normally and there are no errors. So it rather looks like libc does not consider /dev/null a file.
 
libc does not know about /dev/null; it knows about file descriptor 1, which may go anywhere.
EBADF says that stdout->_file is an invalid handle.
I wonder if stdout->_file is 1 or something else when it fails.
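Something like this at the failure point would show it (FreeBSD-specific, since _file is an implementation detail of struct __sFILE):
Code:
#include <stdio.h>
// Dump the descriptor stdio thinks it is writing to; if _file is no longer
// 1 (or has been clobbered by the overwrite), fputs() fails with EBADF.
static void DumpStdoutState(void) {
    fprintf(stderr, "stdout=%p _file=%d fileno=%d\n",
            (void*)stdout, stdout->_file, fileno(stdout));
}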
 
Are you able to set a breakpoint at that location and check &buf?
Can you dump the FILE* structure too? The fd is stored in FILE._file (stdio.h).
 