Simple Filecopy in C language?

Spartrekus · Jun 25, 2017

Hello,

Filecopy is fun, and there have been many variants over the years.

Let's collect some variants of a simple filecopy in C.

With <fcntl.h> or not, with fopen or not, ... there are many ways.

Here is a first attempt:

Code:

/*
 * Filename:    cat.c
 * Author:      Thomas van der Burgt <thomas@thvdburgt.nl>
 * Date:        24-MAR-2010
 *
 * The C Programming Language, second edition,
 * by Brian Kernighan and Dennis Ritchie
 *
 * Exercise 8-1, page 174
 *
 * Rewrite the program cat from Chapter 7 using read, write, open and
 * close instead of their standard library equivalents. Perform
 * experiments to determine the relative speeds of the two versions.
 */

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>  /* File Control Operations */
#include <unistd.h> /* Symbolic Constants */

/* cat:  concatenate files */
int main(int argc, char *argv[])
{
    int fd;
    void filecopy(int ifd, int ofd);
    char *prog = argv[0];   /* program name for errors */

    if (argc == 1) /* no args; copy standard input */
        filecopy(0, 1);
    else
        while (--argc > 0)
            if ((fd = open(*++argv, O_RDONLY, 0)) == -1) {
                fprintf(stderr, "%s: can't open %s\n", prog, *argv);
                exit(1);
            } else {
                filecopy(fd, 1);
                close(fd);
            }
    exit(EXIT_SUCCESS);
}

/* filecopy:  copy file ifd to file ofd */
void filecopy(int ifd, int ofd)
{
    char buf[BUFSIZ];
    ssize_t n;  /* read returns ssize_t, not int */

    while ((n = read(ifd, buf, sizeof buf)) > 0)  /* read from ifd */
        write(ofd, buf, (size_t)n);               /* write to ofd */
}

ralphbsz · Jun 25, 2017

Lots of interesting error handling you could and should add:

(a) What if argc is less than 1? That can't happen when called from a shell, but I think it can be done using execve(2), and it is possible if the same source is used in embedded systems. Either assert in that case (a sensible choice, since the program needs its name to print messages), or make it function correctly.

(b) When open fails, don't just print that it doesn't work, but print why: translate errno into a human-readable string using strerror(3).

(c) This one is really important: check for errors after every read and write call, and after close (yes, close can return errors!), and print clear error messages.

(d) Come up with a sensible policy for what to do when an error occurs. Maybe you want to delete the half-copied output file?
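To make points (b) and (c) concrete, here is a minimal sketch of a checked copy loop. The names filecopy_checked and write_all are made up for this example, and the policy (print a message, return -1, let the caller decide what to do) is only one of several reasonable choices.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* write_all (hypothetical helper): write exactly len bytes, retrying after
 * short writes and EINTR. Returns 0 on success, -1 on error (errno is set). */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted before anything was written */
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* filecopy_checked (hypothetical name): copy ifd to ofd and report why a
 * failure happened. Returns 0 on success, -1 on error. */
int filecopy_checked(int ifd, int ofd, const char *prog)
{
    char buf[BUFSIZ];
    ssize_t n;

    assert(prog != NULL);       /* needed for the error messages */
    for (;;) {
        n = read(ifd, buf, sizeof buf);
        if (n == 0)
            return 0;           /* end of file */
        if (n == -1) {
            if (errno == EINTR)
                continue;
            fprintf(stderr, "%s: read error: %s\n", prog, strerror(errno));
            return -1;
        }
        if (write_all(ofd, buf, (size_t)n) == -1) {
            fprintf(stderr, "%s: write error: %s\n", prog, strerror(errno));
            return -1;
        }
    }
}
```

The caller would still need to check the return value of close on the output descriptor, and decide under (d) whether to remove a partial output file when -1 comes back.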

Copying one byte at a time would be slow. Your code already copies a buffer at a time, but BUFSIZ is typically only a few KiB. Allocate a sensibly sized buffer and copy a whole buffer at a time; this requires special handling for the last (partial) buffer. Make sure the buffer size is a multiple of the file system block size. Unfortunately, I don't remember a system-independent way to get the block size. But since block sizes of modern file systems are probably always powers of 2, and since the largest block size I've heard of in production today is 64 MiB, you might as well go for a 128 MiB buffer and be done with it. Interesting error handling could be written here: if you have so little memory that you can't get a 128 MiB buffer, fall back to smaller ones. That is probably not worth the effort on modern systems, since I can't imagine a computer that doesn't have this much memory free.
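As a rough sketch of the buffer sizing and fallback idea: POSIX fstat(2) reports a preferred I/O block size in st_blksize, which may be close enough to what is wanted here, though it is not guaranteed to equal the file system's internal block size. The function names and the FALLBACK_BUFSIZE value below are assumptions for illustration only.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

#define FALLBACK_BUFSIZE 65536  /* assumed default when fstat gives no hint */

/* preferred_bufsize (hypothetical name): the I/O size fstat(2) suggests
 * for fd, or FALLBACK_BUFSIZE if that is unavailable. */
size_t preferred_bufsize(int fd)
{
    struct stat st;

    if (fstat(fd, &st) == 0 && st.st_blksize > 0)
        return (size_t)st.st_blksize;
    return FALLBACK_BUFSIZE;
}

/* alloc_copy_buffer (hypothetical name): try to allocate want bytes; on
 * failure halve the request until it succeeds or would drop below min.
 * Stores the size actually obtained in *got (0 if nothing was allocated). */
char *alloc_copy_buffer(size_t want, size_t min, size_t *got)
{
    assert(got != NULL && min > 0 && want >= min);
    for (size_t size = want; size >= min; size /= 2) {
        char *buf = malloc(size);
        if (buf != NULL) {
            *got = size;
            return buf;
        }
    }
    *got = 0;
    return NULL;
}
```

If want is the block size times a power of two and min is the block size itself, every size tried by the halving loop stays a multiple of the block size.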

Does the user want the file to REALLY be on disk when the program ends? If yes, add an fsync(2) call at the end, and handle errors correctly.
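A minimal sketch of that last step, assuming the output is an ordinary file descriptor. The name sync_and_close is made up for this example, and printing to stderr is just one possible way to report the error.

```c
#define _POSIX_C_SOURCE 200809L  /* ask for POSIX.1-2008 declarations */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* sync_and_close (hypothetical name): flush fd to stable storage, then
 * close it. Returns 0 only if both steps succeed; close is attempted in
 * every case so the descriptor is not leaked. */
int sync_and_close(int fd, const char *name)
{
    int rc = 0;

    assert(name != NULL);
    if (fsync(fd) == -1) {
        fprintf(stderr, "fsync %s: %s\n", name, strerror(errno));
        rc = -1;
    }
    if (close(fd) == -1) {
        fprintf(stderr, "close %s: %s\n", name, strerror(errno));
        rc = -1;
    }
    return rc;
}
```

Note that fsync on a descriptor such as a pipe or terminal can fail with EINVAL, so a real program might want to skip it when the output is not a regular file.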

What if the copy overwrites an existing file? Do you want to allow that or prohibit it? If you want to allow it, it is probably a bad idea to open the file and overwrite it in place, since then a failed copy will leave a damaged file (perhaps the worst possible case, a file that contains half the old and half the new content). In that case, it's better to create a temporary file in the same directory as the target, and after the writing to the temporary has successfully completed, atomically rename it to the target file. This has lots of complexity: how do you get a collision-free temporary file name? Also be warned: modern file systems can have file placement policies, which can depend on things like file names; the temporary file may end up physically on different disks than the intended target file, and the atomic rename might end up having to copy the file (internally in the file system). Consult the manuals for the file system you're using.
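Here is one way the temporary-file-plus-rename approach might look, using mkstemp(3) to get a collision-free name. This is only a sketch: replace_file is a made-up name, it takes the data from memory to keep the example short, and the "target.XXXXXX" naming scheme and 4096-byte path buffer are assumptions.

```c
#define _POSIX_C_SOURCE 200809L  /* ask for POSIX.1-2008 declarations */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* replace_file (hypothetical name): write len bytes of data to target by
 * way of a temporary file next to it, so a failure never leaves target
 * half-written. Returns 0 on success, -1 on error (temporary removed). */
int replace_file(const char *target, const char *data, size_t len)
{
    char tmp[4096];     /* assumed to be long enough for the path */
    int fd, saved;

    assert(target != NULL && data != NULL);
    if (snprintf(tmp, sizeof tmp, "%s.XXXXXX", target) >= (int)sizeof tmp) {
        errno = ENAMETOOLONG;
        return -1;
    }
    if ((fd = mkstemp(tmp)) == -1)  /* creates a uniquely named file */
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, data, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        data += n;
        len -= (size_t)n;
    }
    if (fsync(fd) == -1)
        goto fail;
    if (close(fd) == -1) {
        fd = -1;                    /* do not close it a second time */
        goto fail;
    }
    fd = -1;
    if (rename(tmp, target) == -1)  /* replaces target in one step */
        goto fail;
    return 0;

fail:
    saved = errno;                  /* keep the original error for the caller */
    if (fd != -1)
        close(fd);
    unlink(tmp);
    errno = saved;
    return -1;
}
```

mkstemp creates the file with restrictive permissions, so a real copy program would probably want to set the mode of the temporary file before the rename.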

Parallelism: With modern file systems, the actual read and write IO will be asynchronous anyhow (the file system will prefetch reads and write-behind writes), so adding parallelism yourself will probably give no performance gain and only add complexity. Most good modern file systems are optimized so sequential reads and writes are already maximally fast, so trying to do crazy tricks (like strided IO) will probably not be beneficial. For the same reason, with good modern file systems the read() and write() calls are the fastest thing you can do from user space; trying crazy things (aio, mmap, direct IO, ...) will probably not help. But what if someone is copying from something that doesn't get the benefit of buffering or of a modern file system? What if someone is using your program to read from a tape drive or write to a serial port? Then you need to design appropriate parallelism, and use appropriate buffer sizes.

What you wrote is cat, not file copy. File copy is a little harder from the argument parsing point of view. A simple file copy has exactly two arguments (input and output), but I could also see the value in having a program that concatenates multiple input files into one output file (which is morally equivalent to "cat a b c > d"). In your application, would it be useful to do multiple copies? For example "cp in1 out1 in2 out2 in3 out3" for copying 3 files simultaneously (parallel or serial)? Another question: Do you want to be able to copy from or to stdin or stdout? I could see value in a command "cp - outfile", but that ends up being the same as "cat > outfile" from the shell (but you don't always have the benefit of a shell). If you want to do that, you need to come up with a syntax for specifying that the "filenames" really refer to stdin/stdout.
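One common convention for the stdin/stdout question is to treat the name "-" specially. A small sketch, with made-up helper names open_input and open_output, and with the assumption that an existing output file should be truncated:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* open_input (hypothetical name): "-" means standard input; anything else
 * is opened read-only. Returns a descriptor, or -1 with errno set. */
int open_input(const char *name)
{
    assert(name != NULL);
    if (strcmp(name, "-") == 0)
        return STDIN_FILENO;
    return open(name, O_RDONLY);
}

/* open_output (hypothetical name): "-" means standard output; anything
 * else is created or truncated. Returns a descriptor, or -1 with errno set. */
int open_output(const char *name)
{
    assert(name != NULL);
    if (strcmp(name, "-") == 0)
        return STDOUT_FILENO;
    return open(name, O_WRONLY | O_CREAT | O_TRUNC, 0666);
}
```

With this convention a file that is literally named "-" can still be reached as "./-". The caller has to remember not to close descriptors 0 and 1 as if they were its own.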

How about the file attributes? Do you want to be able to copy ownership (user, group), permissions, and time stamps of the file? How about extended attributes (many modern file systems implement those, and they can often be complex and arbitrarily long)?
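For the basic attributes, a sketch along these lines might do. The name copy_attributes is made up, the st_atim and st_mtim fields assume a POSIX.1-2008 system, and treating EPERM from fchown as non-fatal is a policy assumption (unprivileged users usually cannot give files away). Extended attributes are system-specific and are not handled here.

```c
#define _POSIX_C_SOURCE 200809L  /* ask for POSIX.1-2008 declarations */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* copy_attributes (hypothetical name): copy owner, group, permission bits
 * and access/modification times from ifd to ofd.
 * Returns 0 on success, -1 on error. */
int copy_attributes(int ifd, int ofd)
{
    struct stat st;
    struct timespec times[2];

    assert(ifd != ofd);
    if (fstat(ifd, &st) == -1) {
        perror("fstat");
        return -1;
    }
    /* Change ownership first: changing it later can clear setuid bits. */
    if (fchown(ofd, st.st_uid, st.st_gid) == -1 && errno != EPERM) {
        perror("fchown");
        return -1;
    }
    if (fchmod(ofd, st.st_mode & 07777) == -1) {
        perror("fchmod");
        return -1;
    }
    times[0] = st.st_atim;      /* last access */
    times[1] = st.st_mtim;      /* last modification */
    if (futimens(ofd, times) == -1) {
        perror("futimens");
        return -1;
    }
    return 0;
}
```

This should be called after the data has been written, since writing to the output would update its modification time again.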

Very much extra credit: You know what sparse files are, right? To make the copy more efficient, detect sparseness when reading, and skip while writing. This is actually somewhere between very hard and impossible, and probably requires knowing about the internals of the file system you are using.
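A cheap approximation that needs no file system internals is to look for all-zero blocks while copying and seek over them on the output instead of writing them. This does not detect where the holes in the input really are (some systems offer lseek(2) with SEEK_HOLE and SEEK_DATA for that), and whether the output ends up sparse depends on the file system. The name sparse_copy and the 4096-byte block size below are assumptions for this sketch.

```c
#define _POSIX_C_SOURCE 200809L  /* ask for POSIX.1-2008 declarations */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define SPARSE_BLOCK 4096   /* assumed granularity for zero detection */

/* is_zero: true if the first len bytes of buf are all zero. */
static int is_zero(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}

/* sparse_copy (hypothetical name): copy ifd to the seekable file ofd,
 * seeking over all-zero blocks instead of writing them.
 * Returns 0 on success, -1 on error. */
int sparse_copy(int ifd, int ofd)
{
    char buf[SPARSE_BLOCK];
    ssize_t n;
    off_t end;

    assert(ifd != ofd);
    while ((n = read(ifd, buf, sizeof buf)) != 0) {
        if (n == -1) {
            if (errno == EINTR)
                continue;
            perror("read");
            return -1;
        }
        if (is_zero(buf, (size_t)n)) {
            if (lseek(ofd, n, SEEK_CUR) == (off_t)-1) {  /* leave a gap */
                perror("lseek");
                return -1;
            }
            continue;
        }
        for (size_t off = 0; off < (size_t)n; ) {
            ssize_t w = write(ofd, buf + off, (size_t)n - off);
            if (w == -1) {
                if (errno == EINTR)
                    continue;
                perror("write");
                return -1;
            }
            off += (size_t)w;
        }
    }
    /* Seeking alone does not extend a file, so fix the length explicitly
     * in case the input ended with zeros. */
    end = lseek(ofd, 0, SEEK_CUR);
    if (end == (off_t)-1 || ftruncate(ofd, end) == -1) {
        perror("ftruncate");
        return -1;
    }
    return 0;
}
```

The output must be a seekable regular file for this to work; on a pipe or terminal the lseek call fails and the function returns -1.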

I think this really is a lesson in software engineering: The coding is the easy part. The hard part is to understand what your requirements are (which of the above features are needed), what the cost is (some of the above would take an experienced person a day or a week), and making a cost/benefit tradeoff.
