script for detecting files in a directory but not in another

Hi all,
I have two very large directory trees (around 60 GB each): the first one, called src, has been copied and reorganized into the second one, dst. Because of the reorganization, a file under a given path in src may sit under a very different path in dst. Is there a quick way to see which files are only in src and not in dst?
I was thinking of writing a kind of directory scanner that stores each file's checksum and then performs a kind of left join on the two sets, but I hope there is a ready-made tool for this purpose. Please note that the files are binary (images and video).
Any idea is welcome.

Thanks
 
Re: script for detecting files in a directory but not in ano

I can't give you a better answer now but net/rsync uses exactly the kind of comparisons you're after when it determines if a file has to be copied over or not. Maybe it can be adapted to your needs with the right options.
 
Re: script for detecting files in a directory but not in ano

It's a little rough and ready but how about the following series of commands (tested in tcsh(1)):

Generate a list of files and associated SHA-1 hashes from all files in src directory and descendants with find(1) and sha1(1):
Code:
# find /path/to/src -type f -exec sha1 {} \; > src.list

Generate a list of SHA-1 hashes from all files in dst directory and descendants with find(1) and sha1(1):
Code:
# find /path/to/dst -type f -exec sha1 -q {} \; > dst.list

Print a list of lines in src.list where the hash is not found in dst.list with fgrep(1):
Code:
# fgrep -v -f dst.list src.list

The output could be formatted with sed(1) or similar if required.
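
For the sed(1) formatting step, something like the following might do. It is only a sketch and assumes each line has the default FreeBSD sha1(1) layout, "SHA1 (path) = hash"; the function name strip_sha1 is made up for the example.

```shell
#!/bin/sh
# strip_sha1 (hypothetical helper): reduce "SHA1 (path) = hash" lines
# to just the path. Assumes the default FreeBSD sha1(1) output format.
strip_sha1() {
    # The greedy \(.*\) keeps parentheses that are part of the file name.
    sed -e 's/^SHA1 (\(.*\)) = [0-9a-f]*$/\1/'
}

# Intended use, with the lists generated as above:
#   fgrep -v -f dst.list src.list | strip_sha1
```

Lines that do not match the assumed layout pass through unchanged, so unexpected output is still visible.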
 
Re: script for detecting files in a directory but not in ano

Are the files renamed too? Or are they just moved around?

find /path/to/src -type f -exec basename {} \; | sort | uniq > src.list
find /path/to/dst -type f -exec basename {} \; | sort | uniq > dst.list
diff src.list dst.list

Not really tested but I think you get the idea.
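
Since diff shows differences in both directions, comm(1) may be a better fit if only the names missing from dst are wanted. A rough sketch, assuming files were moved but not renamed; the function name names_only_in_src is invented for the example:

```shell
#!/bin/sh
# names_only_in_src SRC DST (hypothetical helper):
# print base names that occur under SRC but nowhere under DST.
names_only_in_src() {
    src_names=$(mktemp) || return 1
    dst_names=$(mktemp) || return 1
    # comm(1) needs both inputs sorted the same way, so force the C locale.
    find "$1" -type f -exec basename {} \; | LC_ALL=C sort -u > "$src_names"
    find "$2" -type f -exec basename {} \; | LC_ALL=C sort -u > "$dst_names"
    # -23 drops the "only in second file" and "in both" columns,
    # leaving the lines that appear only in the first file.
    LC_ALL=C comm -23 "$src_names" "$dst_names"
    rm -f "$src_names" "$dst_names"
}

# Intended use:
#   names_only_in_src /path/to/src /path/to/dst
```

Two different files that happen to share a base name are treated as the same file here, which is the main weakness compared with the hash-based approach.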
 
Re: script for detecting files in a directory but not in ano

So far this is the script I've produced, but quite frankly it is a dirty piece of code, so I'm thinking of rewriting it in Perl.
Thanks to everyone.

Code:
#!/bin/sh

# check arguments
if [ $# -lt 2 ]
then
    echo "Usage: $0 src_directory dst_directory"
    exit 1
else
    SRC_DIR=$1
    DST_DIR=$2

    # both arguments must be existing directories
    if [ ! -d "$SRC_DIR" ] || [ ! -d "$DST_DIR" ]
    then
        echo "Please specify only directories! [$SRC_DIR] [$DST_DIR]"
        exit 2
    fi
fi


# setup: per-process index files holding "hash  path" lines
SRC_DB="/tmp/src.$$"
DST_DB="/tmp/dst.$$"

touch "$SRC_DB" "$DST_DB"



echo "Indexing source directory [ $SRC_DIR ] => $SRC_DB"
# "+" passes many files to each md5sum call instead of forking once per file
find "$SRC_DIR" -type f -exec md5sum {} + > "$SRC_DB"

echo "Indexing target directory [ $DST_DIR ] => $DST_DB"
find "$DST_DIR" -type f -exec md5sum {} + > "$DST_DB"



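echo "Doing cross-lookup..."
# md5sum prints "<hash>  <path>"; read -r keeps backslashes in paths intact
while read -r hash file
do
    # anchor the hash at line start so it cannot match inside a path
    if ! grep -q "^$hash " "$DST_DB"
    then
        echo "Source file [ $file $hash ] is missing!"
    fi
done < "$SRC_DB"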

echo "All done!"
 
Re: script for detecting files in a directory but not in ano

That probably works, but it's not very efficient. It runs grep once for each source file, which can have quite an impact when the file list gets large. To speed things up you could write one line per file, with the hash first and the file name second. Then sort(1) both lists and run a diff(1) against the sorted lists. That should be significantly faster.
 
Re: script for detecting files in a directory but not in ano

SirDice said:
Then sort(1) both files and run a diff(1) against the sorted lists. That should be significantly faster.

Uhm... since file names could have changed, and directory names certainly have, I believe a diff(1) would produce too many false positives. Am I wrong?
 
Re: script for detecting files in a directory but not in ano

No, you're not wrong. But you could coax diff(1) into only looking at the first column. The hashes alone should be different enough.
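
As far as I know diff(1) has no option to compare a single column, so here is a sketch that swaps in awk(1) to get the same effect: it matches on the hash column only and needs no sorting. It assumes both lists hold md5sum-style lines ("hash  path"); the function name only_in_src is made up for the example.

```shell
#!/bin/sh
# only_in_src SRC_LIST DST_LIST (hypothetical helper):
# print the lines of SRC_LIST whose hash (first column) is absent
# from DST_LIST. Assumes "hash  path" lines as written by md5sum.
only_in_src() {
    awk '
        # First file (the dst list): remember every hash seen.
        FILENAME == ARGV[1] { seen[$1] = 1; next }
        # Second file (the src list): print lines with an unknown hash.
        !($1 in seen)
    ' "$2" "$1"
}

# Intended use, with the index files from the script above:
#   only_in_src "$SRC_DB" "$DST_DB"
```

This reads each list once and keeps the dst hashes in memory, so paths and file names play no part in the comparison and renamed or moved files do not show up as false positives.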
 
Re: script for detecting files in a directory but not in ano

I've used sysutils/duff for this basic purpose. It should be a bit more efficient than the other approaches, since it first checks that files are the same size before attempting a more detailed comparison. It looks for duplicate files, so the list it produces would be everything that is in both trees. Any file in src outside that list should be unique to the src directory.
 