C SIGSEGV after 1200 loops, system boundaries?

escape · Nov 5, 2015

Hi,

I am not sure whether this is a problem with system limits or a program error. A C program in test uses many malloc calls and pointers. I have run the program in a loop about 1200 times and it crashes with a core file. gdb -c ./test.core core does not give a good stack trace. The execution takes almost exactly 1 minute. The memory usage is low. The process reads a file at the same time.

Is there a system limit in FreeBSD that kills a looping process after a time period, or when a malloc count has been exceeded, or because some other system resource has run out?

How should I debug a program that loops this many times? Where can I read about such limits? I have tried, for example, sysctl(8) without finding any relevant limit.

br,
e.

obsigna · Nov 5, 2015

escape said:
A C program in test uses many malloc calls and pointers. I have run the program in a loop about 1200 times and it crashes with a core file. gdb -c ./test.core core does not give a good stack trace. The execution takes almost exactly 1 minute. The memory usage is low. The process reads a file at the same time.
Is there a system limit in FreeBSD that kills a looping process after a time period, or when a malloc count has been exceeded, or because some other system resource has run out?

No system limit. gdb should at least show the line number where the crash happened. Did you compile your program with -g0, or did you strip the debugging symbols before testing?

escape said:
How to debug the program with many loops?

My best guess is that your program crashes because at some point it starts to write beyond the end of a chunk of allocated memory.

Verify the size parameter in the calls to malloc(). It happens even to experienced programmers that they forget to increase the allocation size by 1 for the trailing nul-char '\0' of C strings. malloc() rounds small allocations up to a multiple of 16 bytes, so most of the time there is enough spare space automagically. For example, if you allocated 5 bytes, writing 6 would not crash the program, because the real allocation size is 16. However, sometimes it may happen (perhaps after 1200 iterations and 1 minute) that your program allocates exactly 16·n bytes and wants to write 16·n+1. That write lands outside the allocation, where it can corrupt neighbouring memory and crash the program.
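
A minimal sketch of the +1 rule for C strings. copy_string() is a hypothetical helper used only for illustration; it is not taken from the program discussed in this thread.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a heap copy of src, or NULL on failure. */
char *copy_string(const char *src)
{
    size_t len = strlen(src);     /* length WITHOUT the trailing '\0' */
    char *dst = malloc(len + 1);  /* +1 leaves room for the '\0' */

    if (dst == NULL)
        return NULL;
    memcpy(dst, src, len + 1);    /* copies the terminator as well */
    return dst;
}
```

With malloc(len) instead of malloc(len + 1), a 16-character string would write its terminator one byte past the end of the block, which is the 16·n case described above. The caller releases the copy with free().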

If this doesn't reveal the issue, then try to find a good IDE, with interactive debugging abilities, and step through your program verifying at each step, whether pointers are really pointing to the intended memory allocation, whether allocation sizes match the expectations, whether the execution sequence is really as you intended or perhaps not because of a bogus if-statement, whether the loop counters are counting correctly, ...

Juha Nurmela · Nov 5, 2015

alarm(3) or setitimer(2) could be used in your program. gdb can also do multiple steps, say 10000 at a time, and watch variables. Check help breakpoints.

Juha

escape · Nov 5, 2015

I will see whether this helps in my case. Thanks for helping.

Just a few seconds ago I found a reference in tuning(7) to the getrlimit(2) call and printed the allowed memory sizes.

Code:

Maximum size of data segment 536870912 bytes, maximum 536870912. ( sbrk(2) )
Maximum stack segment size 67108864 bytes, maximum 67108864.

And a maximum CPU time:

Code:

Maximum cpu time -1 seconds, maximum -1.
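
For reference, limits like the ones above can be queried with getrlimit(2) roughly as follows. This is only a sketch: print_limit() and its output format are assumptions, not the code that produced the lines above. The -1 shown for CPU time probably means that no limit is set.

```c
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Hypothetical helper: print the soft and hard value of one resource
 * limit. Returns 0 on success, -1 if getrlimit() fails. */
int print_limit(const char *label, int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }
    printf("%s: ", label);
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("soft unlimited, ");
    else
        printf("soft %ju, ", (uintmax_t)rl.rlim_cur);
    if (rl.rlim_max == RLIM_INFINITY)
        printf("hard unlimited\n");
    else
        printf("hard %ju\n", (uintmax_t)rl.rlim_max);
    return 0;
}
```

Calling print_limit("data segment", RLIMIT_DATA), print_limit("stack segment", RLIMIT_STACK) and print_limit("cpu time", RLIMIT_CPU) from main() covers the three values listed above.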

In answer to the question about what gdb prints:

Code:

[New Thread 28803080 (LWP 101408/cbi)]
(gdb) bt full
#0  cb_compare_strict (name1=Cannot access memory at address 0x7f9ff018
) at cb_compare.c:262
  err = Cannot access memory at address 0x7f9ff004
Current language:  auto; currently minimal

I have checked the /proc filesystem with cat /proc/<pid>/map, and the closest memory addresses are on these lines (procfs(5)):

Code:

0x7f9fe000 0x7f9ff000 0 0 0 --- 0 0 0x0 NCOW NNC none - NCH -1
0x7f9ff000 0x7fa1f000 26 0 0x88cc3444 rwx 1 0 0x3000 NCOW NNC default - CH 1000

The size of one character is four bytes. Just the first from 0x7f9ff004.

Could this be that the stack has run out? I can't find a good explanation of procfs(5).

It's also possible that I'm reading NULLs.

In answer to the question about allocation sizes: strings usually get the +1 for the trailing '\0'. Structs do not need this and are allocated with sizeof( struct <name>) only. That is the usual pattern, although of course an error can exist somewhere.

gdb can be attached to a process. This is nice, but it does not help with the problem right now. I also looked at the new lldb debugger. It looks similar to gdb.

The purpose was to test the limits, and a crash after 1200 iterations seems too early.

Juha Nurmela · Nov 5, 2015

Embarrassingly, I misread your post and thought you _wanted_ a way to interrupt the program after some time.

See malloc(3) and the junk mechanism. It often reveals bugs. In sh syntax, MALLOC_CONF=junk:true ./myprog

Juha

Crivens · Nov 5, 2015

You may also try to run your code through valgrind. It can point you to places where malloc() and free() are used incorrectly.

escape · Nov 8, 2015

Hi,

This helped a lot. It looks like all of my program has a fault:

Code:

==27992== LEAK SUMMARY:
==27992==  definitely lost: 300,758 bytes in 4,903 blocks
==27992==  indirectly lost: 471,117 bytes in 13,173 blocks
==27992==  possibly lost: 0 bytes in 0 blocks
==27992==  still reachable: 672 bytes in 2 blocks
==27992==  suppressed: 0 bytes in 0 blocks

That is about all of the allocated bytes, maybe. I have added free() calls in most places. I usually use malloc like this:

Code:

int  allocate_name(c_name **cbn, int namelen){
  if( cbn==NULL ){
      cbn = (void*) malloc( sizeof( c_name* ) ); // pointer size
      if( cbn==NULL ){ return -1; }
  }
  *cbn = (c_name*) malloc( sizeof(c_name) );
  if( *cbn==NULL){
       return -1;
  }
. . .

And pass the pointers like this:

Code:

c_name *newleaf  = NULL;
err = allocate_name( &newleaf, (**some).namelen );

Does Valgrind recognise this as lost? Is this bad coding style?

br,

e.

escape · Nov 8, 2015

Hi,

I forgot to say that I changed the program and I don't (yet) know why but the core file did not appear this time. The run with valgrind went through to the number 5000 and the crash did not occur at 1200 this time. The run with valgrind took about a day with my 2 core CPU and the program. I should of course test again without valgrind.

regards,

escape

Juha Nurmela · Nov 8, 2015

IMHO allocate_name() ought to

Code:

if (cbn == NULL)
    abort();

Then you get a nice core if the caller makes a mistake. As it stands now, an erring program is allowed to continue, with some lost memory. Better yet, just omit the check. When *cbn is assigned to, the program crashes on its own if cbn is NULL.

Juha

escape · Nov 13, 2015

In case someone is still interested: the following error appears many times.

Code:

==10518== 372,597 (88 direct, 372,509 indirect) bytes in 2 blocks are definitely lost in loss record 34 of 34
==10518==  at 0x402DFCC: malloc (in /usr/local/lib/valgrind/vgpreload_memcheck-x86-freebsd.so)
==10518==  by 0x8053718: allocate_name (buffer.c:101)
==10518==  by 0x8057D45: put_leaf...

Other errors are:

Code:

==10518== Conditional jump or move depends on uninitialised value(s)
==10518==  at 0x415CAA1: ??? (in /lib/libc.so.7)
==10518==  by 0x415A909: ??? (in /lib/libc.so.7)
==10518==  by 0x415DF1F: vfprintf (in /lib/libc.so.7)
==10518==  by 0x8065718: c_log (cb_log.c:50)
...

and

Code:

==10518==  Uninitialised value was created by a heap allocation
==10518==  at 0x402DFCC: malloc (in /usr/local/lib/valgrind/vgpreload_memcheck-x86-freebsd.so)
==10518==  by 0x805466D: allocate_empty_file (buffer.c:213)
...

Maybe the pointers held something before the malloc?
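
On that question: malloc() does not initialise the memory it returns, so any field that is read before being written holds whatever bytes were there before. That is what the "Uninitialised value was created by a heap allocation" report refers to. A sketch of one common remedy, using calloc() to get zeroed memory; struct record and new_record() are hypothetical names, since the structure behind allocate_empty_file() is not shown in this thread.

```c
#include <stdlib.h>

/* Hypothetical record type, for illustration only. */
struct record {
    int  id;
    char label[16];
};

/* Allocate one record with every byte set to zero.
 * Returns NULL on failure, like calloc() itself. */
struct record *new_record(void)
{
    return calloc(1, sizeof(struct record));
}
```

With malloc() in place of calloc(), printing label before storing a string in it would read uninitialised bytes and could trigger the vfprintf warning shown above.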

Uniballer · Nov 13, 2015

escape said:

Code:

int  allocate_name(c_name **cbn, int namelen){
  if( cbn==NULL ){
      cbn = (void*) malloc( sizeof( c_name* ) ); // pointer size
      if( cbn==NULL ){ return -1; }
  }
  *cbn = (c_name*) malloc( sizeof(c_name) );
  if( *cbn==NULL){
       return -1;
  }
. . .

If I am reading your code correctly, then the allocate_name() function as presented can't possibly do anything useful if the pointer cbn is NULL on entry. In this case you will malloc enough memory for a single pointer, and assign the address of that memory to cbn, which has local scope as a parameter. The caller of allocate_name() will never know the address of this pointer. Then you allocate another chunk of memory and write its address to the pointer you allocated before. Presumably, when you return, somebody is going to look in the spot where the caller got the value of cbn, which is apparently NULL, and use that as a pointer in a way that will fail.

I suggest you rethink what you are doing here. You should probably change the function calling convention and take more care about parameter scoping. The return convention used by malloc() (i.e. NULL if failure, or a pointer if success) avoids the problem you have created here where the status code (not -1, i.e. success) winds up out of sync with where you are looking for the pointer to the block of memory.

If I am wrong about this snippet of code then somebody please explain it more clearly.
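
A sketch of what the malloc()-style return convention described above could look like. The c_name fields (buf, namelen) and the free_name() helper are assumptions made for illustration; the real structure is not shown in this thread.

```c
#include <stdlib.h>

/* Placeholder type: the real c_name is not shown in the thread. */
typedef struct c_name {
    char *buf;
    int   namelen;
} c_name;

/* Return a new c_name, or NULL on failure (the malloc() convention).
 * The caller owns the result and releases it with free_name(). */
c_name *allocate_name(int namelen)
{
    c_name *cbn;

    if (namelen < 0)
        return NULL;
    cbn = malloc(sizeof *cbn);
    if (cbn == NULL)
        return NULL;
    cbn->buf = malloc((size_t)namelen + 1);  /* +1 for the '\0' */
    if (cbn->buf == NULL) {
        free(cbn);                           /* do not leak the outer block */
        return NULL;
    }
    cbn->buf[0] = '\0';
    cbn->namelen = namelen;
    return cbn;
}

/* Release everything allocate_name() allocated; NULL is accepted. */
void free_name(c_name *cbn)
{
    if (cbn == NULL)
        return;
    free(cbn->buf);
    free(cbn);
}
```

The caller then writes c_name *newleaf = allocate_name(len); and tests newleaf against NULL, so the status and the pointer can never get out of sync.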

escape · Nov 18, 2015

I just wanted to say thanks for this. I have been checking the code, and the cause really was the calling convention. It took maybe three or four days. I'm really happy now: the memory usage no longer grows. If someone wants to look at the code, I can send a link to the GitHub location by email, or post it in this thread at my next login if needed. The only remaining report is the mysterious calloc one below.

br, escape

Code:

==12742== 32 bytes in 1 blocks are still reachable in loss record 1 of 2
==12742==  at 0x402F4B9: calloc (in /usr/local/lib/valgrind/vgpreload_memcheck-x86-freebsd.so)
==12742==  by 0x4223CF6: ??? (in /lib/libthr.so.3)
==12742==  by 0x422829E: ??? (in /lib/libthr.so.3)
==12742==  by 0x42297A4: ??? (in /lib/libthr.so.3)
==12742==  by 0x422948C: ??? (in /lib/libthr.so.3)
==12742==  by 0x422CB11: ??? (in /lib/libthr.so.3)
==12742==  by 0x421DC98: ??? (in /lib/libthr.so.3)
==12742==  by 0x4002AAE: ??? (in /libexec/ld-elf.so.1)
==12742==  by 0x4000F3D: ??? (in /libexec/ld-elf.so.1)
==12742==
==12742== 640 bytes in 1 blocks are still reachable in loss record 2 of 2
==12742==  at 0x402F4B9: calloc (in /usr/local/lib/valgrind/vgpreload_memcheck-x86-freebsd.so)
==12742==  by 0x422828F: ??? (in /lib/libthr.so.3)
==12742==  by 0x42297A4: ??? (in /lib/libthr.so.3)
==12742==  by 0x422948C: ??? (in /lib/libthr.so.3)
==12742==  by 0x422CB11: ??? (in /lib/libthr.so.3)
==12742==  by 0x421DC98: ??? (in /lib/libthr.so.3)
==12742==  by 0x4002AAE: ??? (in /libexec/ld-elf.so.1)
==12742==  by 0x4000F3D: ??? (in /libexec/ld-elf.so.1)
==12742==
==12742== LEAK SUMMARY:
==12742==  definitely lost: 0 bytes in 0 blocks
==12742==  indirectly lost: 0 bytes in 0 blocks
==12742==  possibly lost: 0 bytes in 0 blocks
==12742==  still reachable: 672 bytes in 2 blocks
==12742==  suppressed: 0 bytes in 0 blocks
