Find a file with/without uppercase using C ?

Hello,

I use this code and it gives a segmentation fault when the number of files start to be 100 or 1000, or more.

Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <dirent.h>
#include <ctype.h>   // <- toupper on bsd and linux 
char *strtouppercase( char *str )
{
    char *ptr = malloc( strlen(str)+1 );
    int j ;
    int i;
    j = 0;
    for(i=0; str[i]!='\0'; i++)
    {
        ptr[j++]=  toupper( str[i] );
    }
    ptr[j]='\0';
        size_t siz = sizeof ptr ;
        char *r = malloc( sizeof ptr );
        return r ? memcpy(r, ptr, siz ) : NULL;
}





char searchitem[PATH_MAX];
void listdir(const char *name, int indent)
{
    int i; int j ;
    char str1[PATH_MAX];
    char str2[PATH_MAX];
    char str[PATH_MAX];
    char ptr[PATH_MAX];

    DIR *dir;
    struct dirent *entry;

    if (!(dir = opendir(name)))
        return;

    while ((entry = readdir(dir)) != NULL)
    {
        if ( entry->d_type == DT_DIR )
    {
            char path[PATH_MAX];

            if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
                continue;

            snprintf( path, sizeof(path), "%s/%s", name, entry->d_name);

            listdir( path, indent + 2);
        }
    else
    {
           if ( strstr( strtouppercase(  entry->d_name )  ,  strtouppercase( searchitem  ) ) != 0 )
           {
                  printf("%s/%s\n", name , entry->d_name );
           }
        }
    }
    closedir(dir);
}







int main( int argc, char *argv[])
{
     if ( argc == 2 )
     {
         strncpy( searchitem, argv[ 1 ], PATH_MAX );         
         listdir( ".", 0 ) ;
     }
     return 0;
}

what would be a possible improvement so that it works even for great number of files?

regards, and thank you
 
segmentation faults can be debugged. They are not an accident. Find the place where exactly the segmentation fault happens and why.
 
Now this is one memory hungry program.
While it does sound as homework I'll bite. Part of learning how to program is also learn how to debug. That's as important. Glancing at the code very briefly I see you're using recursion on listdir() wasting 20k of memory in that function. Most likely your segfault is on stack - you reached the limit. While doing ulimit -s unlimited would allow you to run it on more files you should really optimize the code and beware you can't run recursive functions indefinitely.
 
memory allocated in function 'strtouppercase' (for variables 'ptr' and 'r') is not freed
try this
Code:
#include <limits.h>  //for PATH_MAX
char *strtouppercase( char *str )
{
        int i,sz;
       // if(!str)return NULL;
        sz=strlen(str);
        for(i=0;i<sz;i++)str[i]=toupper(str[i]);
        return str;
}
 
Whole bunch of little comments. In the strtouppercase function, you malloc two copies of the string. _al already pointed out that you don't free either copy. But I think the second copy isn't necessary at all: Simply allocate one copy, fill it with characters, return it, done.

Don't call the argument to that function "str", that conveys no information: everyone knows it's a string, from the type char*. Call it "input" or "lower" or something that tells a story. And I really hate your variable names "str1" and "str2". You might as well call the variables Alice, Bob, Charlie ..., it is just as meaningless.

The search string is uppercased every time it is compared. That's waste of CPU time. Instead, do it once in main, and pass the uppercased version down to listdir(). The other thing I hate is the use of a global variable; those are bugs waiting to happen. Add the search string as an argument to listdir, and pass it recursively down.

Don't worry about recursion depth. Every time you recurse, you're only pushing a few dozen bytes onto the stack (three arguments, each 8 bytes long, plus perhaps a stack frame). With stack depths being at least 4KiB (in the kernel) and nearly unlimited (in user space), you'll be exhausting file name length limits before you run out of memory.

In your code, you always treat strings like pointers, using integers as an index, and iterating. That's not idiomatic C, which uses pointers:
Code:
char* strtouppercase(char* lower) {
  char* upper = malloc(strlen(lower)+1);
  char* p = upper;
  for (; *lower;) {
    *p++ = toupper(*lower++);
  }
  *p = '\0';
  return upper;
}
Note that I did not check for malloc returning NULL, because in the real world it never does. The initialization clause of the for loop is empty, as the input variable is already initialized. In the loop test, there is no need to compare the character *lower to zero, that's implicit in C's truth testing. I'm modifying the variable lower inside the function, but who cares, it won't be copied back.

The comparison function which looks for the search string is funny. For example, imagine the search string is "a/b", you will consider /home/user/alpha/beta to be a hit. Do you want that? Also, you check for the search string in both directories and file names. So if your search string is "a", then every single file underneath /home/user/alpha/... will be a hit. Do you want that?

Finally, your uppercasing doesn't do Unicode, which is much more complicated, because it needs a locale. And since file names live in a global name space (which exists independent of the locale of the user), that's inherently ambiguous. It turns out that making a case-blind file system interface is surprisingly complex.
 
This construction is weird,
Code:
char *r = malloc( sizeof ptr );
return r ? memcpy(r, ptr, siz ) : NULL;
Please explain.
I think it is an attempt to handle out of memory errors. But it has two problems. First, in the real world malloc doesn't return NULL. Second, if this function were to return NULL, it's caller would blow up. If you check for errors, you need to do it all the way up the call stack.
 
My code in post #5 is still not quite correct - I change the original string, so the strings in the upper case are displayed on the screen. Corrected version: we do not use the 'strtouppercase' function at all, but instead of 'strstr' we use 'strcasestr'.
Code:
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <dirent.h>
#include <ctype.h>   // <- toupper on bsd and linux 
#include <limits.h>  // for PATH_MAX
char searchitem[PATH_MAX];
void listdir(const char *name, int indent)
{
    int i; int j ;
    char str1[PATH_MAX];
    char str2[PATH_MAX];
    char str[PATH_MAX];
    char ptr[PATH_MAX];

    DIR *dir;
    struct dirent *entry;

    if (!(dir = opendir(name)))
        return;

    while ((entry = readdir(dir)) != NULL)
    {
        if ( entry->d_type == DT_DIR )
    {
            char path[PATH_MAX];

            if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
                continue;

            snprintf( path, sizeof(path), "%s/%s", name, entry->d_name);

            listdir( path, indent + 2);
        }
    else
    {
           if ( strcasestr(  entry->d_name  ,   searchitem  ) != 0 )
           {
                  printf("%s/%s\n", name , entry->d_name );
           }
        }
    }
    closedir(dir);
}

int main( int argc, char *argv[])
{
     if ( argc == 2 )
     {
         strncpy( searchitem, argv[ 1 ], PATH_MAX );         
         listdir( ".", 0 ) ;
     }
     return 0;
}
The variant suggested by ralphbsz can also be used, but you will need to free the memory allocated for the 'upper' variable in the 'strtouppercase' function.
 
Code:
      size_t siz = sizeof ptr ;
      char *r = malloc( sizeof ptr );
      return r ? memcpy(r, ptr, siz ) : NULL;
this is bad
sizeof ptr is always 8 or 4 (platform depending) so you return a pointer to the 1st 8 bytes without null termination so will bomb when no zeroes are found soon enough by strstr

also listdir has the kitchen sink on the stack
char str1[PATH_MAX];
char str2[PATH_MAX];
char str[PATH_MAX];
char ptr[PATH_MAX];
which are unused (the stack defaults to 500k on freebsd so a 10 levels dir will bomb (PATH_MAX is 1024 and you have char path[PATH_MAX] too)

LE: that will be 100 levels so no problem
 
First, in the real world malloc doesn't return NULL.
I once implemented malloc for real-mode DOS programs (not segmented), this implementation does return NULL :cool:. I bet it's not uncommon in the "embedded" domain, if malloc is available there at all. Of course, with modern virtual memory management, your statement is true. And I actually hate that design, I'd very much prefer a way to back out instead of being killed by OOM (or, worse yet, having some other unrelated process killed...)
Second, if this function were to return NULL, it's caller would blow up. If you check for errors, you need to do it all the way up the call stack.
In fact, thinking about what you can do on out of memory, most of the time, the answer is you can't possibly continue. This is a classic (and IMHO rare) case where some "exception" model makes sense, instead of explicit error checking along the whole stack. For a complex program, you might want to use some setjmp(3) "panic handler" that at least tries some cleanup that's still possible without additional RAM. For a simple program, just calling abort(3) (to back out quickly) could be enough.
 
Don't worry about recursion depth. Every time you recurse, you're only pushing a few dozen bytes onto the stack (three arguments, each 8 bytes long, plus perhaps a stack frame)
And that's the problem - stack frame for listdir in this function is big due to those 5 static arrays. Now default FreeBSD stack limits are set a bit higher than let's say default Debian but this is the problem even in general. ulimit can move that "issue" far away too as I mentioned.

If there's a segfault involved proper way of handling it is to debug it. Attach debugger, see where it crashes and go from there. My first guess is this is a problem but as covacat summarized they are handful of problems in that code.

Whole bunch of little comments.
As Crivens said at the beginning this seems like a homework. I personally don't want to solve homework for anybody but I was willing to give a hint to stir to the right direction upon quick code glaze. And others chipped in with their observations.
 
Don't worry about recursion depth.
I must say this keeps bothering me a bit because it's not true. Recursion functions do have their use but one has to be aware of the resources available to them. While the textboot example of computing factorial, drawing ruler or simple walk of the tree doesn't take much resources (usually) one does have to be aware of them.

In my comment I said 20k for a stack frame because I thought MAX_PATH is 4096 (ext2 standard), on FreeBSD it's 1024. so 5k for the stack. Easy to demonstrate the crash on this program:
Code:
#include <stdio.h>

void myfunc();

int main() {
    myfunc();
}

void myfunc() {
    char bigbuf[5*1024];
    static int counter = 0;

    counter++;
    printf("counter: %d\n", counter);

    myfunc();
}
Compile it with cc -g -o stack stack.c and fire it away:
Code:
./stack
..
counter: 104529
Segmentation fault (core dumped)

Let's see the size of the myfunc():
Code:
$ llvm-objdump -M intel -d stack

00000000002018d0 <myfunc>:
  2018d0: 55                               push    rbp
  2018d1: 48 89 e5                         mov    rbp, rsp
  2018d4: 48 81 ec 00 14 00 00             sub    rsp, 5120                                              ; <<< a bit more than 5k (push and call will add 16B to the whole mem needed to call function
  2018db: 8b 04 25 98 3b 20 00             mov    eax, dword ptr [2112408]

Let see this in gdb too: gdb -q stack ./stack.core
Code:
Reading symbols from ./stack...
[New LWP 100389]
Core was generated by `./stack'.
Program terminated with signal SIGSEGV, Segmentation fault.
Invalid permissions for mapped object.
#0  0x00000000002018ff in myfunc () at stack.c:14
14        printf("counter: %d\n", counter);
   0x00000000002018f3 <myfunc+35>:    48 bf 38 05 20 00 00 00 00 00    movabs rdi,0x200538
   0x00000000002018fd <myfunc+45>:    b0 00    mov    al,0x0
=> 0x00000000002018ff <myfunc+47>:    e8 bc 00 00 00    call   0x2019c0 <printf@plt>
How deep did we go in stack? bt is a bit painful to check as stack trace is large. Trimmed output:
Code:
(gdb) bt
#0  0x00000000002018ff in myfunc () at stack.c:14
..
..
#104529 0x0000000000201909 in myfunc () at stack.c:16
#104530 0x00000000002018c9 in main () at stack.c:6
(gdb)

gdb)  f 104530
#104530 0x00000000002018c9 in main () at stack.c:6
6        myfunc();
   0x00000000002018c4 <main+4>:    e8 07 00 00 00    call   0x2018d0 <myfunc>
(gdb) i r $rsp
rsp            0x7fffffffea80      0x7fffffffea80                            <<<<<<<<< stack at the beginning
(gdb) f 0
#0  0x00000000002018ff in myfunc () at stack.c:14
14        printf("counter: %d\n", counter);
   0x00000000002018f3 <myfunc+35>:    48 bf 38 05 20 00 00 00 00 00    movabs rdi,0x200538
   0x00000000002018fd <myfunc+45>:    b0 00    mov    al,0x0
=> 0x00000000002018ff <myfunc+47>:    e8 bc 00 00 00    call   0x2019c0 <printf@plt>
(gdb) i r $rsp
rsp            0x7fffdffffd60      0x7fffdffffd60                          <<<<<<<<<< stack at the time of crash
(gdb)
Frame 104530 is the start of the program, we have 0x7fffdffffd60. Frame 0 is the crash, we have 0x7fffffffea80. Stack is growing down on x86. They are 0x7fffffffea80 - 0x7fffdffffd60 = 0x1FFFED20 bytes apart. That is 536866080B or 524283KB. Default stack size on FreeBSD is 524288kB (ulimit -a, look for stack).
 
You can increase the stacksize limit but that won't help if you have infinite recursion.

Note that while MAXPATHLEN may be 1024, you can always create much deeper hierarchies. Try

count=0
dir=1234567890
while (($count < 1000)); do mkdir $dir; cd $dir; let count=count+1; done
du $dir
 
Let's calculate the maximum recursion depth for this problem. MAX_PATH is 4096 (the worst case). One needs one level of recursion for every directory, so the worst case are short directory names, like /a/b/c/d/a/b/c/d/... Given the slash, that's at most 2048 levels of recursion.

Per level, we need to push one stack frame. That contains the return address, arguments (2 or 3), return value, and local variables. The arguments should all be integers or pointers (passing strings directly as arguments is crazy, and unidiomatic C which likes to work with pointers). The listdir function needs few local variables. For fun let's say 3 local variables, which brings the total up to 8 things on the stack (1+3+3+3), which is 64 bytes. Multiply that by 2048 levels of recursion, and you need 128KiB. The default stack size limit on FreeBSD seems to be 64 MiB for 32-bit machines, 512 MiB for 64-bit machines. So stack size is absolutely no problem.

Now, what may be a problem is having large things (long strings, arrays, big structures) either as arguments or as local variables. But this code doesn't do that. The name argument to listdir is a string pointer, not a string. The same argument is the one used by martin above: Don't create large arrays as local variables. Instead, allocate them (with malloc or new), and only keep the pointer to the array as a local variable.

To zirias' point that it is sad that malloc doesn't return NULL any more, and instead the program crashes with a seg fault: It is the reality we live in. With modern memory management (swapping, memory is only backed when actually touched, not when allocated), it is really hard for the userspace to get up-to-date information. That means that traditional simple error handling doesn't work. But in reality, error handling for out-of-memory has been broken nearly all the time; the best option is to let the program crash (without even attempting cleanup), log the problem from outside, and restart it.
 
To honeybear : what OS are you using to test this on? Your code doesn't compile as you shared it as PATH_MAX is undefined. Did you omit something in your code?

For fun let's say 3 local variables, which brings the total up to 8 things on the stack (1+3+3+3), which is 64 bytes.
In my example you can clearly see stack was moving down 5k, the stack frame in that example _is 5k. On the OPs example it will be alike.

what may be a problem is having large things (long strings, arrays, big structures) either as arguments or as local variables. But this code doesn't do that.
Yes is does:
Code:
void listdir(const char *name, int indent)
..
..
    int i; int j ;
    char str1[PATH_MAX];          <-- 1k
    char str2[PATH_MAX];         <-- 1k
    char str[PATH_MAX];           <--- 1k
    char ptr[PATH_MAX];            <--- 1k

    DIR *dir;
    struct dirent *entry;

    if (!(dir = opendir(name)))
        return;

    while ((entry = readdir(dir)) != NULL)
    {
        if ( entry->d_type == DT_DIR )
    {
            char path[PATH_MAX];               <---- 1k
 
Yes is does:
Darn it, you're right. I completely missed that.

Tell the student that they get an "F" in writing idiomatic C. <- Joke, they're trying to learn.

The correct answer would be:
Code:
char* str1 = malloc(PATH_MAX+1);  // Or use a more accurate length, like strlen(...)+1 here
// Do something with str1
free(str1);
 
To zirias' point that it is sad that malloc doesn't return NULL any more, and instead the program crashes with a seg fault: It is the reality we live in.
Actually, if it's really OOM, it's not a segfault. And there's no guarantee which process will be the "victim" either...

With modern memory management (swapping, memory is only backed when actually touched, not when allocated), it is really hard for the userspace to get up-to-date information. That means that traditional simple error handling doesn't work.
Well, IMHO, it's a problem with semantics of the classic APIs. To my question in a similar discussion why a process would ever allocate much more memory than it would really use, there was one valid reason given: Allocating contiguous address space!

Well then, all it would take would be APIs that clearly distinguish between reserving address space and actually allocating memory for backing it. The latter could fail. I know, it won't happen cause you can't just introduce breaking changes there, but it's still a pity...

But in reality, error handling for out-of-memory has been broken nearly all the time; the best option is to let the program crash (without even attempting cleanup), log the problem from outside, and restart it.
I don't agree. Talking about the simple case of malloc(), the C standard defines it can fail (and, as mentioned above, I'm pretty sure it does, e.g. on some embedded platforms). So, it makes sense to handle that gracefully. Yes, that's a complex thing to do correctly, which is likely the cause why it was broken so often....

edit: but the idiomatic xmalloc() you often see most of the time just calls abort(3) on failure, so that's pretty close to what you're suggesting.
 
I don't agree. Talking about the simple case of malloc(), the C standard defines it can fail (and, as mentioned above, I'm pretty sure it does, e.g. on some embedded platforms). So, it makes sense to handle that gracefully. Yes, that's a complex thing to do correctly, which is likely the cause why it was broken so often....
Agree that on embedded, you have to seriously think about it. Because there "fail and restart" or "keep multiple copies running" is not an option. But malloc and friends (sbrk...) is only one aspect of the problem; there are lots of other resource constraints one has to deal with (stack space, window handles, open files, even disk space). In normal C userspace code, handling it correctly is, as you said, super hard. I've seen it attempted thoroughly once, and it took weeks of engineering work. And even then it was (by its nature) incomplete.
 
I wonder what tool I could use to detect such errors?

Watch this space.

Valgrind:
==1523== Memcheck, a memory error detector
==1523== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==1523== Using Valgrind-3.21.0.GIT and LibVEX; rerun with -h for copyright info
==1523== Command: ../forum/forum2/find_file_orig foo.c
==1523==
==1523== Invalid read of size 1
==1523== at 0x485730E: strstr (in /usr/local/libexec/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==1523== by 0x201F3F: listdir (find_file_orig.c:56)
==1523== by 0x201F05: listdir (find_file_orig.c:52)
==1523== by 0x201F05: listdir (find_file_orig.c:52)
==1523== by 0x201FDC: main (find_file_orig.c:76)
==1523== Address 0x5467338 is 0 bytes after a block of size 8 alloc'd
==1523== at 0x484CBC4: malloc (in /usr/local/libexec/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==1523== by 0x201D7B: strtouppercase (find_file_orig.c:22)
==1523== by 0x201F1A: listdir (find_file_orig.c:56)
==1523== by 0x201F05: listdir (find_file_orig.c:52)
==1523== by 0x201F05: listdir (find_file_orig.c:52)
==1523== by 0x201FDC: main (find_file_orig.c:76)

Address sanitizer:
aulf> ../forum/forum2/find_file_orig_asan foo.c
=================================================================
==1525==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000000056 at pc 0x00000028d0a7 bp 0x7fffffffa180 sp 0x7fffffff9948
READ of size 8 at 0x602000000056 thread T0
#0 0x28d0a6 in __asan_memcpy /usr/src/contrib/llvm-project/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:22:3
#1 0x2b7bd6 in strtouppercase /home/paulf/scratch/forum/forum2/find_file_orig.c:23:20
#2 0x2b7f88 in listdir /home/paulf/scratch/forum/forum2/find_file_orig.c:56:62
#3 0x2b7f32 in listdir /home/paulf/scratch/forum/forum2/find_file_orig.c:52:13
#4 0x2b7f32 in listdir /home/paulf/scratch/forum/forum2/find_file_orig.c:52:13
#5 0x2b80aa in main /home/paulf/scratch/forum/forum2/find_file_orig.c:76:10

0x602000000056 is located 0 bytes to the right of 6-byte region [0x602000000050,0x602000000056)
allocated by thread T0 here:
#0 0x28dc0d in malloc /usr/src/contrib/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:129:3
#1 0x2b7a23 in strtouppercase /home/paulf/scratch/forum/forum2/find_file_orig.c:12:17
#2 0x2b7f88 in listdir /home/paulf/scratch/forum/forum2/find_file_orig.c:56:62
#3 0x2b7f32 in listdir /home/paulf/scratch/forum/forum2/find_file_orig.c:52:13
#4 0x2b7f32 in listdir /home/paulf/scratch/forum/forum2/find_file_orig.c:52:13
#5 0x2b80aa in main /home/paulf/scratch/forum/forum2/find_file_orig.c:76:10
#6 0x2353cf in _start /usr/src/lib/csu/amd64/crt1_c.c:75:7
#7 0x8002dd007 (<unknown module>)

Either should tell you where your problem was.
 
Back
Top