C Recursive copy of specific files in C language in Unix? [open]

Spartrekus · May 1, 2018

Hello,

I would like to recursively copy specific files in C on Unix. I would like to copy all the *.txt, *.nfo, *.dat, *.tex, ... files (plain text) while keeping the directory structure (creating the directories on the target with mkdir).

Here is my current attempt, a basic example.

mkdir creates a directory
chdir changes the directory
ncp copies a file
and the recursion is done like this:

Code:

#include <stdio.h>
#if defined(__linux__)
#define MYOS 1
#elif defined(_WIN32)
#define MYOS 2
#elif defined(_WIN64)
#define MYOS 3
#elif defined(__unix__)
#define MYOS 4
#define PATH_MAX 2500
#else
#define MYOS 0
#endif


              #include <unistd.h>
              #include <sys/types.h>
              #include <dirent.h>
              #include <stdio.h>
              #include <string.h>

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>

              void listdir(const char *name, int indent)
              {
                  DIR *dir;
                  struct dirent *entry;

                  if (!(dir = opendir(name)))
                      return;

                  while ((entry = readdir(dir)) != NULL) {
                      if (entry->d_type == DT_DIR) {
                char path[1024];
                 if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
                      continue;
                          snprintf(path, sizeof(path), "%s/%s", name, entry->d_name);
                          printf("%*s[%s]\n", indent, "", entry->d_name);
                          listdir(path, indent + 2);
                      } else {
                          printf("%*s- %s\n", indent, "", entry->d_name);
                      }
                  }
                  closedir(dir);
              }

              int main(void) {
                  listdir(".", 0);
                  return 0;
              }

Code:

#include <stdio.h>
#include <stdlib.h>

int main( int argc, char *argv[])
{

   FILE *source, *target; int ch ; 
   source = fopen( argv[ 1 ], "r");
   if( source == NULL )
   {
      printf("Press any key to exit...\n");
      exit(EXIT_FAILURE);
   }
 
   target = fopen( argv[ 2 ] , "w");
   if( target == NULL )
   {
      fclose(source);
      printf("Press any key to exit...\n");
      exit(EXIT_FAILURE);
   }
 
   printf("Source: %s\n",  argv[ 1 ] );
   printf("Target: %s\n",  argv[ 2 ] );
   printf("Copying...\n");

   while( ( ch = fgetc(source) ) != EOF )
      fputc(ch, target);
 
   printf("File copied successfully.\n");
   fclose(source);
   fclose(target);
   return 0;
}

Please feel free to start a C coding discussion.
It is almost done using the code above... let's see how to do it easily. The simpler the better here.

Thank you and looking forward to reading your posts.

Bobi B. · May 1, 2018

Well,

Consider adding at least some error messages; see warn(3) and err(3) (not universally available), use perror(3) or fprintf(3) + strerror(3);
Error handling is missing here and there.
Copying a file byte by byte, even buffered, is far from optimal. Using larger buffers, 128KB or so, is much more effective. There are lots of things to consider here (not everything is a regular file or a directory, sparse files, size preallocation) and lots of things missing: preserving ownership, mode, timestamps.
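
The two points above (report errors properly, copy in large chunks) could look roughly like the sketch below. The function name copy_file_buffered and the constant COPY_BUF_SIZE are made up for illustration, the 128 KiB size is just the figure mentioned above, and the target mode 0644 is an assumption. It handles regular files only and does not preserve ownership, mode or timestamps.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define COPY_BUF_SIZE (128 * 1024)  /* assumed size, per "128KB or so" */

/* Hypothetical helper: copy src to dst in large chunks.
 * Returns 0 on success, -1 on error after printing what went wrong. */
int copy_file_buffered(const char *src, const char *dst)
{
    char *buf = malloc(COPY_BUF_SIZE);
    if (buf == NULL) {
        fprintf(stderr, "cannot allocate copy buffer: %s\n", strerror(errno));
        return -1;
    }

    int in = open(src, O_RDONLY);
    if (in < 0) {
        fprintf(stderr, "cannot open '%s' for reading: %s\n", src, strerror(errno));
        free(buf);
        return -1;
    }

    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        fprintf(stderr, "cannot open '%s' for writing: %s\n", dst, strerror(errno));
        close(in);
        free(buf);
        return -1;
    }

    int status = 0;
    ssize_t nread = 0;
    while (status == 0 && (nread = read(in, buf, COPY_BUF_SIZE)) > 0) {
        /* write() may write less than requested, so loop until the chunk is out */
        for (ssize_t done = 0; done < nread; ) {
            ssize_t nwritten = write(out, buf + done, (size_t)(nread - done));
            if (nwritten < 0) {
                fprintf(stderr, "write to '%s' failed: %s\n", dst, strerror(errno));
                status = -1;
                break;
            }
            done += nwritten;
        }
    }
    if (nread < 0) {
        fprintf(stderr, "read from '%s' failed: %s\n", src, strerror(errno));
        status = -1;
    }

    close(in);
    if (close(out) < 0 && status == 0) {
        fprintf(stderr, "closing '%s' failed: %s\n", dst, strerror(errno));
        status = -1;
    }
    free(buf);
    return status;
}
```

A caller would check the return value, for example: if (copy_file_buffered(argv[1], argv[2]) != 0) exit(EXIT_FAILURE);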

ralphbsz · May 1, 2018

There are so many things to say, in addition to what Bobi B already said.

Scope
First: Is this a production tool or a coding exercise? I'm going to assume that you are writing a production tool, or you are pretending to do so as an exercise. In that case, your statement "the simpler the better" is simply wrong. It needs to be feature-complete so people can actually use it. It needs to have some advantage compared to existing tools, such as rsync. It needs to be bullet-proof and debuggable enough so it will survive the vagaries of production.

Missing features
As Bobi said: If this program fails, it needs to give detailed information about what happened. Not "press any key", but "when attempting to write to file '/home/dir/foo.bar' had error ENOSPC after 123456 bytes, 789 bytes were not copied, continuing with other files". It has to be clear to the user what happened.

How do you handle errors when you can not create the new file, because something already exists there? Overwrite it? Warn? I don't know the answer, but you need to come up with a coherent story.
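
One way to make that decision explicit in code is to pass a policy to the function that opens the target. This is only a sketch: the names exists_policy and open_target are invented here, and a real tool would likely need more policies (ask, skip, keep a backup). It relies on the POSIX O_EXCL flag, which makes open() fail with EEXIST if the file is already there.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical policy for a target that already exists. */
enum exists_policy {
    EXISTS_FAIL,       /* refuse to touch an existing file */
    EXISTS_OVERWRITE   /* truncate and overwrite it */
};

/* Open the target for writing according to the policy.
 * Returns a file descriptor, or -1 with errno set. */
int open_target(const char *path, enum exists_policy policy)
{
    int flags = O_WRONLY | O_CREAT;
    flags |= (policy == EXISTS_OVERWRITE) ? O_TRUNC : O_EXCL;

    int fd = open(path, flags, 0644);  /* mode 0644 is an assumption */
    if (fd < 0) {
        int saved = errno;  /* fprintf may change errno */
        if (saved == EEXIST)
            fprintf(stderr, "'%s' already exists, not overwriting\n", path);
        else
            fprintf(stderr, "cannot create '%s': %s\n", path, strerror(saved));
        errno = saved;
    }
    return fd;
}
```

With EXISTS_FAIL the check and the creation happen in one system call, so there is no window in which another process could create the file between a separate existence test and the open.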

How do you recover after an error? Say copying one file failed. Do you want to continue with the others or not? What do your users really want? But one thing is for certain: If you fail, you can not leave half-done work behind. At least at the level of each file, you need to either succeed or fail, not something in between. The standard technique for this is: Do not copy to the actual file, but instead to a temporary file. Only when the copy succeeds do you rename the file to the eventual destination; the advantage is that rename on file systems is supposed to be atomic, and either completely fails or completely succeeds. Maybe you want to extend this logic to saying that if anything fails, the whole recursive tree copy is not just aborted but undone, and you remove everything you have copied. But what if you get a second error while removing? What if creating the copy already changed something else unrecoverably (like overwriting an existing file)?
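
A minimal sketch of the temporary-file-then-rename technique for a single file follows. The function name copy_file_atomic and the ".tmpXXXXXX" suffix are invented for this example. The temporary file is created next to the target so that rename() stays within one file system. Note that mkstemp() creates the file with mode 0600, so a real tool would still have to set the final permissions, and undoing a whole tree copy is not attempted here.

```c
#define _POSIX_C_SOURCE 200809L  /* for mkstemp() when compiling in strict ISO C mode */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: dst either appears complete or is not touched.
 * Returns 0 on success, -1 on failure. */
int copy_file_atomic(const char *src, const char *dst)
{
    char tmp[4096];
    int len = snprintf(tmp, sizeof tmp, "%s.tmpXXXXXX", dst);
    if (len < 0 || (size_t)len >= sizeof tmp) {
        fprintf(stderr, "target path too long: '%s'\n", dst);
        return -1;
    }

    int in = open(src, O_RDONLY);
    if (in < 0) {
        fprintf(stderr, "cannot open '%s': %s\n", src, strerror(errno));
        return -1;
    }

    int out = mkstemp(tmp);  /* replaces XXXXXX and creates the file */
    if (out < 0) {
        fprintf(stderr, "cannot create '%s': %s\n", tmp, strerror(errno));
        close(in);
        return -1;
    }

    char buf[64 * 1024];
    int status = 0;
    ssize_t nread = 0;
    while (status == 0 && (nread = read(in, buf, sizeof buf)) > 0) {
        for (ssize_t done = 0; done < nread; ) {
            ssize_t n = write(out, buf + done, (size_t)(nread - done));
            if (n < 0) {
                status = -1;
                break;
            }
            done += n;
        }
    }
    if (nread < 0)
        status = -1;
    if (status != 0)
        fprintf(stderr, "copying '%s' failed: %s\n", src, strerror(errno));

    close(in);
    if (close(out) < 0 && status == 0) {
        fprintf(stderr, "closing '%s' failed: %s\n", tmp, strerror(errno));
        status = -1;
    }

    /* Only a complete copy is moved into place. */
    if (status == 0 && rename(tmp, dst) < 0) {
        fprintf(stderr, "cannot rename '%s' to '%s': %s\n", tmp, dst, strerror(errno));
        status = -1;
    }
    if (status != 0)
        unlink(tmp);  /* do not leave half-done work behind */
    return status;
}
```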

You need an option to stop recursing on mount points.
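
A common way to detect a mount point during the walk is to compare the device number (st_dev) of each directory with that of the starting directory. A small sketch, with the helper name same_filesystem invented here:

```c
#define _POSIX_C_SOURCE 200809L  /* for lstat() when compiling in strict ISO C mode */
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if path lives on the device root_dev,
 * 0 if it is on a different device, -1 if lstat() fails. */
int same_filesystem(const char *path, dev_t root_dev)
{
    struct stat st;
    if (lstat(path, &st) != 0)
        return -1;
    return st.st_dev == root_dev;
}
```

The recursive function would call stat() once on the starting directory, remember its st_dev, and skip any subdirectory for which same_filesystem() returns 0 when the option is enabled.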

For real-world use, in addition to the other attributes Bobi already mentioned, you need to copy extended attributes and ACLs. At least most of the time; sometimes users might want to explicitly not copy them.

How about restarting the program after it has aborted (control C or error or crash)? Do you want to store intermediate state to make it restartable? What is the restart granularity? What do you do if the state of the source directory tree has changed by the time of restart?

Performance
Where is your performance bottleneck? That depends on file sizes, number of files per directory, whether source and target are on the same file system and on the same disk, and on the implementation of the underlying file system. Assuming a really good file system underneath, your goal has to be to keep the disk drive 100% busy. The way to do that is to run as sequentially as possible over files, while decoupling metadata operations (which are typically heavily cached by file systems) from data operations. If you are running this over a multi-disk, cluster or network file system, your performance will definitely be limited by your implementation; on a single-disk file system, your single-threaded implementation might actually be good enough.

If you have performance problems: the first step will probably be to multi-thread this program. At least one thread to perform directory walking operations, which then fires off worker threads to copy one file at a time. The correct number of threads will need investigating, and should have a sensible default. Whether to throw copy operations to another thread or do them in-line depends on many factors.

Bobi already mentioned copying the files in big chunks. Given how modern file systems are implemented, I would use gigantic chunks (at least 16MiB, use powers of two). There are lots of file systems out there that allocate data in chunks of 64 or 128MiB, and you really help them do their job if you copy large files in single large streams. I would also read as much of the file as possible into memory (at least a GiB) before starting to write. That way the prefetch and write-behind in the file system can do their jobs undisturbed.

Implementation
Why C? If it is a small exercise program, it could be much smaller, simpler, more reliable and more logical in a high-level language, where things like recursive directory walking are already built in. If your program gets really big (and if you add all the features we discussed above, you will have reimplemented rsync), then it is way too large for straight C. You should probably go to an object-oriented design paradigm, and talk about your class and object hierarchies before even starting implementation.

Coding
Will be discussed later; I have to run and do something else now.

ralphbsz · May 2, 2018

Coding issues:

OS specific defines

#if defined(__linux__)
#define MYOS 1
#elif defined(_WIN32)
#define MYOS 2
...

To begin with, you never actually use these flags. And actually, you should hardly ever need them: Why do you care which OS you are compiling on? If you are using portable interfaces (such as POSIX), you don't need to be OS-specific at all.

Your flags also don't do anything: you are just translating from one set of #defines (namely ones like "__linux__") to another set (namely the MYOS one). That really doesn't create new knowledge.

Now in practice, that doesn't actually work 100%. Even when using POSIX calls as much as humanly possible, you will still have OS dependencies. For example: eventually, you will need to copy extended attributes. But they are handled differently in each OS (to my knowledge never standardized in POSIX). So at the top you could do something like this: if Linux -> use getxattr/setxattr... calls, else if AIX -> use getea/setea, and so on. Like this you have actually created knowledge that is useful and specific to your program.
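
As a sketch of what such a block could look like: the macro COPY_XATTR_API and the function xattr_api_name are invented for this example, and it only records which family of calls to use rather than calling them. The predefined macros __linux__ and _AIX are the ones compilers set on those systems.

```c
#include <stdio.h>

/* Translate "which OS" into "which extended attribute interface",
 * which is the knowledge the copy code actually needs. */
#if defined(__linux__)
#  define COPY_XATTR_API "getxattr/setxattr"
#elif defined(_AIX)
#  define COPY_XATTR_API "getea/setea"
#else
#  define COPY_XATTR_API "none"   /* extended attributes are not copied */
#endif

/* Hypothetical helper so the choice can be reported, e.g. in verbose mode. */
const char *xattr_api_name(void)
{
    return COPY_XATTR_API;
}
```

In a real program each branch would include the matching header and define a small wrapper function with the same signature, so the rest of the code never has to test the OS again.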

Includes
You include the same thing multiple times. The indentation of includes is crazy. To prevent this, I like to define a fixed order of includes. For example: First OS includes in one block, then big libraries, then project-wide, then local, each in its own block. Within each block, have some logical sorting criteria, for example alphabetic. Comment each include with the reason why it is needed: "#include <dirent.h> // Needed to parse directory entries"

Indentation
Your indentation is broken all over. Fix it. Indentation has to be nearly 100% consistent (and exceptions need to be well argued). We don't need to discuss whether you indent each level by 2 or 4 spaces, either has pros and cons, but pick one and follow it. The way your code is indented makes it hard to read.

While we are at it: Never put multiple variable declarations on one line.

Commenting and spacing
You typically need on average one line of comment per line of source, explaining what is going on, and why you are doing it. At the top of each block (main program, file, function, block of code) there needs to be a comment explaining the "ToO" or Theory of Operations of what is coming: "This is the catching function. The purpose of this program is to catch and count all purple elephants. This function catches them using a net. Input data is an array of GPS locations where elephants have been spotted using the purple camera. Output is a zoo of elephants".

Remember: The real consumer of your source code is not the compiler who turns it into an executable program. Compiling a program is trivial. Enhancing, debugging and maintaining it is hard. You need to make your code readable, because most of your future work will be reading code, understanding it, and changing it.

Your spacing conventions are inconsistent. Pick one, and follow it consistently. There are lots of C and C++ style guides on the net, use one.

Your spacing convention is crazy. Why are there spaces before the argv: "fopen( argv[ 2 ] , "w")" ? Why is the opening paren right after while, but then why is there a space after it in "while( ( ch = fgetc(source) ) != EOF )" ? Here is how I would write it: "while ( (ch=fgetc(source)) != EOF)". I took the inner expression (assigning the result of fgetc into ch) and packed it tightly. Like that the reader can see that this is one coherent block. Then spaces to delimit this block from the outer test. This trick works well, but only for inner and outer blocks; if you have 3-deep nesting, it stops working. Similar problem here in your code: "printf("Source: %s\n", argv[ 1 ] );": Why are printf and source bound closely together, while argv and its array index are falling apart?

Variables
Personally, I like to follow the C++ model: Variables are declared as late as possible, namely when they are first used. Why? Because then they can be immediately initialized to a valid value. For example, you have "int ch" in your program. It is left uninitialized for a long time; some fool might use it, and get a nonsensical value (at least the compiler will warn you, but warnings are easy to ignore). Another good practice is to declare variables in the tightest possible enclosing block. For example, your ch remains in scope after the copy loop: someone might use it, without knowing that it is no longer valid. If the variable goes out of scope, such a use won't even compile, so the compiler prevents that mistake.
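
Applied to the copy loop, that could look like the sketch below (C99 or later). The function name copy_stream and its return convention are made up for the example.

```c
#include <stdio.h>

/* Hypothetical helper: copy everything from source to target.
 * Returns the number of bytes copied, or -1 on an I/O error. */
long copy_stream(FILE *source, FILE *target)
{
    long copied = 0;

    /* ch is declared where it is first needed, initialized at once,
     * and exists only inside the loop. */
    for (int ch = fgetc(source); ch != EOF; ch = fgetc(source)) {
        if (fputc(ch, target) == EOF)
            return -1;
        copied++;
    }
    /* Using ch here would be a compile error, not a silent bug. */

    return ferror(source) ? -1 : copied;
}
```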
