random program crashes, no coredumps, and error 94

PMc · Mar 5, 2022

_martin said:
Are you able to set breakpoint in that location and check &buf ?

Yes, it's fine.

_martin said:
Can you dump the FILE* structure too? The fd is defined in FILE.file (stdio.h).

How should I set a breakpoint via /dev/null?
To set breakpoints, I must run the program in foreground, and then stdout is /dev/tty and everything works fine.

PMc · Mar 5, 2022

covacat said:
libc does not know about /dev/null it knows about file descriptor 1 which may go anywhere

It goes to /dev/null, see lsof output above.

When running in foreground, it looks like this, and there are no errors:

Code:

 # lsof -p 24185
COMMAND     PID USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
bareos-fd 24185 root    0u  VCHR               0,91 0t922928   91 /dev/pts/1
bareos-fd 24185 root    1u  VCHR               0,91 0t922928   91 /dev/pts/1
bareos-fd 24185 root    2u  VCHR               0,91 0t922928   91 /dev/pts/1
bareos-fd 24185 root    3u  IPv4 0xfffffe008612d950      0t0  TCP moon

When I remove the fputs/fflush from the code, there are no failures.
When I step through the fputs/fflush, they execute code in libc and libthr.

EBADF says that stdout->_file is an invalid handle

And how might this result in libc taking the first two bytes of the output string and writing it to some bogus memory location?

_martin · Mar 5, 2022

I don't use lldb, I use gdb. I'm assuming they are similar enough. I asked to set breakpoint on the line of code you shared (fflush), not via "/dev/null".
You can do it in two ways. Either run the debugger, set the breakpoint and continue, or run the program, attach the debugger (gdb -p $pid), set the breakpoint and let it continue. I'd prefer the latter option.
If you have compiled it via ports you could enable debug mode (CFLAGS+=-g) and set the breakpoint on the line of code, or if you're debugging a binary without debug symbols you need to find which instruction is calling that fflush and set the breakpoint on that.

FILE* structure has fd assigned to it or -1 if none is used. It would be interesting to see if that FILE* structure has sane values. As covacat mentioned it may be that the FILE* structure is already corrupted, not causing the corruption (or in other words it's a victim of a bug, not a bug).
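
A portable way to look at the descriptor behind a stream, without poking at the FILE internals in a debugger, is fileno() plus fcntl(). The following is only a minimal sketch; fd_is_open() and report_stream_fd() are hypothetical helper names, not anything from bareos or libc.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: 1 if fd refers to an open descriptor, 0 otherwise. */
static int fd_is_open(int fd)
{
    /* F_GETFD fails (with EBADF) when fd is not an open descriptor. */
    return fcntl(fd, F_GETFD) != -1;
}

/* Hypothetical helper: print the descriptor behind a stream to stderr. */
static int report_stream_fd(FILE *fp, const char *name)
{
    int fd = fileno(fp);

    fprintf(stderr, "%s: fd=%d open=%d\n", name, fd,
            fd >= 0 && fd_is_open(fd));
    return fd;
}
```

Calling report_stream_fd(stdout, "stdout") just before the failing fputs/fflush would show whether the stream still points at a usable descriptor.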

FILE* (and hence printf, scanf & friends) does use buffers (allocated on the heap). It could be that this part of the code rubs the bug the right way and triggers it.

That EBADF could be that stdout is set to -1, i.e. not used. This is not a bug necessarily. Consider this example of code:

Code:

close(1);
..
..
fprintf(stdout, "hello world\n");

Technically there's nothing wrong with this code. But printing to stdout in the terminal will most likely (stdout can be redirected elsewhere, that's why I said most likely) end up in error EBADF. There may be logic in the code where 0,1,2 is either redirected to socket or closed completely. Hence the error.

But I keep asking myself: what has changed in FreeBSD that it's being triggered now?

PMc · Mar 5, 2022

_martin said:
I don't use lldb, I use gdb. I'm assuming they are similar enough. I asked to set breakpoint on the line of code you shared (fflush), not via "/dev/null".

Yes, but somehow you must "ask to set breakpoint" - usually by typing the respective debugger command into a terminal via stdin/out. And you cannot do this through /dev/null, and the error only happens when stdin/out is /dev/null.

_martin said:
But I keep asking myself -- what has changed in FreeBSD that it's being triggered now.

Just figured that one out. Hold on...

_martin · Mar 5, 2022

We don't understand each other. You run the bareos-client, or whatever that binary is, as usual. Then, from another terminal, you attach to that running process with gdb -p $pid, where $pid is the pid of that bareos client. Then you can set the breakpoint and continue to debug it "live".

PMc · Mar 5, 2022

covacat said:
EBADF says that stdout->_file is an invalid handle

Apparently it is an invalid handle only for fflush(), not for fputs(), because fputs() writes the first 4350 bytes successfully, and only fflush() fails from the beginning.

So I think this cannot be. But apparently it has something to do with buffering.
So I tried what happens when I do
fclose(stdout);
and then try to write something out -> that fails right away with EBADF.

And what happens when I do
close(1);
and then try to write something to stdout? Now there are differences:

Release  stdout     fputs()                                  fflush()
12.3     /dev/tty   fails EBADF                              works
12.3     /dev/null  works                                    fails EBADF
13       /dev/tty   fails EBADF                              fails EBADF
13       /dev/null  works for ~4300 bytes, then fails EBADF  always fails EBADF

PMc · Mar 5, 2022

_martin said:
We don't understand each other. You run the bareos-client or whatever that binary is as usual. Then, from other terminal, you attach to that running command with the gdb -p $pid where $pid is the pid of that bareos client. Then you can set the breakpoint. And you continue to debug it "live".

Sorry, didn't know one can do that.

PMc · Mar 5, 2022

It seems this problem is solved. (If you look carefully enough, you could see the nature of the problem already in the quotes above.)

bareos/core/src/lib/daemon.cc at f07e6e9180535dc2d5f27c7ee2bc69704cb38f62 · bareos/bareos (github.com)

Now ain't this gorgeous????

Here we close all the stdio handles, and then we open them again - and we open all of them as O_RDONLY. (And one can see this from the lsof output above: it shows stdout and stderr handles as 1r and 2r).

This is precisely why I love this code so much. It does lots of superfluous things, and it mostly does them wrong.

But then, there is also a weakness in libc. One probably should not modify the lower level close()/dup() filehandles while also using the upper level stdio functions. But then, if one tries to write onto the upper level while at the same time having the lower level set RDONLY, this probably should not result in gross memory corruption.
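
To make the O_RDONLY point concrete, here is a small sketch of how one might check the access mode of a descriptor and see what write(2) does with it. fd_is_writable() and write_errno() are hypothetical helper names for illustration only.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: 1 if fd has write access, 0 if not, -1 on error. */
static int fd_is_writable(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;

    int mode = flags & O_ACCMODE;
    return mode == O_WRONLY || mode == O_RDWR;
}

/* Hypothetical helper: errno from a one-byte write(2), or 0 on success. */
static int write_errno(int fd)
{
    errno = 0;
    return write(fd, "x", 1) == -1 ? errno : 0;
}
```

A descriptor opened with open("/dev/null", O_RDONLY) reports not writable and write(2) fails with EBADF; the same path opened with O_RDWR (or O_WRONLY for handles 1 and 2) accepts the write and discards it, which is presumably what a daemon wants.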

covacat · Mar 5, 2022

PMc said:
12.3 /dev/tty fputs() fails EBADF, fflush() works
12.3 /dev/null fputs() works, fflush() fails EBADF
13 /dev/tty fputs() fails EBADF, fflush() fails EBADF
13 /dev/null fputs() works for ~4300 bytes, then fails EBADF, fflush always fails EBADF

this shows a different initial buffering (see setvbuf)
if the FILE is line buffered or unbuffered fflush has nothing to do because the buffer is always empty (after a fputs)
if the FILE is fully buffered fputs may only write to memory and never touch the file descriptor
so the failure always occurs at _swrite

PMc · Mar 5, 2022

covacat said:
this shows a different initial buffering (see setvbuf)
if the FILE is line buffered or unbuffered fflush has nothing to do because the buffer is always empty (after a fputs)
if the FILE is fully buffered fputs may only write to memory and never touch the file descriptor
so the failure always occurs at _swrite

Maybe. I tried to somehow reproduce behaviour with setbuf, but wasn't successful. Maybe I didn't try hard enough; anyway, we can say for certain that there is a difference between 12.3 and stable/13 (as of 2 weeks ago, because I don't update base while hunting a bug - so this may be related to some development work also). But, honestly, this is rather my least concern.

What is of concern to me is, on one hand, the fantastic coding quality shown above, where I don't know what else is lingering there, ready to stab me in the back. And on the other hand there are these memory overwrites, which probably cannot be explained well by different buffering behaviour (and probably also not by ongoing development work, but then, one should check the commit logs).
And the third issue is the original question of error 94, i.e. what kind of things happen in relation to capsicum, and where that might be documented.

PMc · Mar 5, 2022

Please try reproducing:

Code:

#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>

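int main(void) {
    char buf[] = "12345678901234567890123456789012345678901234567890";
    int fd = open("/dev/null", O_RDONLY);  /* read-only on purpose */
    int i = 0;

    close(1);
    dup2(fd, 1);  /* stdout now refers to a read-only descriptor */
    close(fd);

    while(1) {
        fputs(buf, stdout);
        fflush(stdout);
        i++;
        fprintf(stderr, "%d\n", i);  /* iteration count, to see when it crashes */
    }
}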

Here it crashes after 135962 iterations. (stable/13 @ 22ba2970766 )

_martin · Mar 5, 2022

Damn, I had too many beers to concentrate now. 13.0-RELEASE (releng/13.0-n244733-ea31abc261f) ok, 14-current - crash after 44781.

Quick check in gdb:

Code:

root@fbsdc:(/tmp/forums)$ gdb test test.core
(gdb)
..
#0  0x000000082232d82d in memcpy () from /lib/libc.so.7
=> 0x000000082232d82d <memcpy+173>:    48 89 17    mov    QWORD PTR [rdi],rdx
(gdb)

(gdb) i r $rdi
rdi            0x824f07ffa         34979479546
(gdb)

(gdb) x/3i $pc
=> 0x82232d82d <memcpy+173>:    mov    QWORD PTR [rdi],rdx
   0x82232d830 <memcpy+176>:    mov    QWORD PTR [rdi+rcx*1-0x8],r8
   0x82232d835 <memcpy+181>:    ret
(gdb)

(gdb) i r $rdi
rdi            0x824f07ffa         34979479546
(gdb)

(gdb) ip
         0x824d08000        0x824f08000   0x200000        0x0  rw- ----
(gdb)

(gdb) bt
#0  0x000000082232d82d in memcpy () from /lib/libc.so.7
#1  0x00000008222ed03f in ?? () from /lib/libc.so.7
#2  0x00000008222eb5cd in fputs () from /lib/libc.so.7
#3  0x0000000000201b54 in main () at test.c:15
(gdb)

So the SIGSEGV is due to memcpy overstepping into unmapped memory: the store writes 8 bytes starting at 0x824f07ffa, and 0x824f07ffa + 8 = 0x824f08002, which is past the end of the mapping at 0x824f08000. We should check /usr/src/lib/libc and see what changed there since the 13.0 release (my 13.0 is working just fine).
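
The address arithmetic can be double-checked with a few lines of C. This is just a sketch; bytes_past_end() and store_crosses_end() are hypothetical names, and the values are the ones from the gdb session above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: how many bytes of a `width`-byte store at `addr`
 * land at or beyond `map_end` (the first address after the mapping).
 */
static uint64_t bytes_past_end(uint64_t addr, size_t width, uint64_t map_end)
{
    uint64_t last = addr + width;   /* one past the last byte stored */

    return last > map_end ? last - map_end : 0;
}

/* Hypothetical helper: 1 if the store leaves the mapping, 0 otherwise. */
static int store_crosses_end(uint64_t addr, size_t width, uint64_t map_end)
{
    return bytes_past_end(addr, width, map_end) > 0;
}
```

With addr = 0x824f07ffa, width = 8 and map_end = 0x824f08000, two bytes of the store fall outside the mapping.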

edit: btw fflush() is not needed.

PMc · Mar 5, 2022

_martin said:
Damn, I had too many beers to concentrate now.

Hey, didn't want to disturb Your weekend!

_martin said:
13.0-RELEASE (releng/13.0-n244733-ea31abc261f) ok, 14-current - crash after 44781.

Thanks, that's valuable (so it's not one of my local patches or such).

Your crash location is mostly the same as here.

I have a stable/13 at fa3cc60e6dc without crash, but that's a generic build for default CPU,
so I wasn't sure yet. But if this is not the CPUTYPE, and only libc, then there is now not
much delta:

fa3cc60e6dc Jan 8 10:22:08 2022
22ba2970766 Feb 10 16:11:22 2022

What do You think about these:
ec2db06d0db22ae11c1b5414446e3aecd71a93e3
afa9a1f5ec9974793a8744c55036ef5c4d08903d

_martin · Mar 5, 2022

It's ok, I do like these types of problems. I may not be able to help tonight much though.
Both seem to be about fflush; we can reproduce it without it (so I didn't check further).

I did spin up 13.0-RELEASE-p7 VM and I can't reproduce it there either.

PMc · Mar 6, 2022

_martin said:
It's ok, I do like these types of problems. I may not be able to help tonight much though.
Both seem to be about fflush; we can reproduce it without it (so I didn't check further).

I suppose the fputs() will internally call fflush() (or equivalent) when the buffer is full.

_martin said:
I did spin up 13.0-RELEASE-p7 VM and I can't reproduce it there either.

It came with afa9a1f5ec9974793a8744c55036ef5c4d08903d into stable/13.

covacat · Mar 6, 2022

this is causing it

__sflush(): on write error, if nothing was written, reset FILE state … · freebsd/freebsd-src@afa9a1f

…back PR: 76398 (cherry picked from commit 86a16ada1ea608408cec370171d9f59353e97c77)

github.com

i built a 13.0-R libc with that file replaced and it bombs when internally fflush is called
if you disable buffering with setvbuf it works
it always bombs at size of vbuf / size of string output, so when the buffer fills it bombs
a vbuf of 16k and the string of 50 causes it to bomb at 328
the explicit call to fflush is not needed (like _martin said)
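
the 328 can be checked with a bit of arithmetic. a sketch, assuming "16k" means 16384 bytes and that the internal flush happens on the first fputs whose data no longer fits into the buffer (what happens on an exact fit is implementation specific). first_overflowing_write() is a made-up name.

```c
#include <stddef.h>

/*
 * Hypothetical helper: the 1-based iteration at which the cumulative
 * output first exceeds the buffer, i.e. the smallest n with
 * n * len > bufsize. Returns 0 for an empty string (never overflows).
 */
static size_t first_overflowing_write(size_t bufsize, size_t len)
{
    return len == 0 ? 0 : bufsize / len + 1;
}
```

327 strings of 50 bytes are 16350 bytes and still fit; the 328th would need 16400 bytes, so first_overflowing_write(16384, 50) gives 328.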

covacat · Mar 6, 2022

i suppose that somehow the code in fvwrite.c (fputs) is not aware of this 'reset' and is causing the bomb

PMc · Mar 6, 2022

Folks,

seems we did it! We improved quality! Thank You all!

src - FreeBSD source tree

cgit.freebsd.org

_martin · Mar 6, 2022

Nice. That's why I like these kinds of threads here.

covacat · Mar 10, 2022

has anybody filed a PR ?
because this is/will be in 13.1

Erichans · Mar 10, 2022

I can't find a PR related to PMc's message #43. Going by the commit to -CURRENT (2022-03-06 15:29:51) in

src - FreeBSD source tree

cgit.freebsd.org

and comparing with:
stable/13/lib/libc/stdio/fvwrite.c L135-L142
stable/13/lib/libc/stdio/fvwrite.c L176-L182
releng/13.1/lib/libc/stdio/fvwrite.c L135-L142
releng/13.1/lib/libc/stdio/fvwrite.c L176-L182

At the moment it is not in 13-STABLE (as the precursor of 13.1-RELEASE) and not in the just-branched releng/13.1 (per commit 13.1: create releng/13.1 branch as of 2022-03-10 00:10:32).

covacat · Mar 10, 2022

the problem is introduced by the fflush commit in feb 1

PMc · Mar 10, 2022

PR is 76398 which is not updated, but not closed yet either. Anyway, this here is good enough for me:

Re: Program crashes on stable/13 (but not on 12.3)