Finding duplicate files

Hello,

Is there an easy way / tool to find duplicate files on the filesystem? I tried filedupe, but I'm unsure about the output.

I'm looking for something that checks via checksum/hash whether two files are the same and, if so, prints them out so I can put them on a to-delete list.

regards,
 
Conceptually, sorting the suspects by size, & then cksum(1)ming any identically sized files would probably be more efficient than checksumming every file & looking for duplicates.
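
A minimal sketch of that size-first idea, in plain sh. The function name `find_dups` and the report header are made up for illustration, and the sketch assumes file names contain no tabs or newlines. Sizes come from `wc -c` only because it behaves the same on BSD and GNU systems; matching CRCs are a strong hint rather than proof, so cmp(1) can confirm a pair before anything is deleted.

```shell
#!/bin/sh
# find_dups DIR (hypothetical name): report files under DIR that share
# both size and cksum. Only files whose size occurs more than once are
# checksummed. Assumes no tabs or newlines in file names.
find_dups() {
	dir=${1:-.}
	# pass 1: emit "size<TAB>path" for every regular file
	find "$dir" -type f -exec sh -c '
		for f do
			printf "%s\t%s\n" "$(wc -c < "$f" | tr -d " ")" "$f"
		done' sh {} + |
	sort -n |
	# pass 2: keep only paths whose size appears at least twice
	awk -F'\t' '
		NR > 1 && $1 == size {
			if (held != "") { print held; held = "" }
			print $2
			next
		}
		{ size = $1; held = $2 }' |
	# pass 3: checksum the survivors only
	while IFS= read -r f
	do
		cksum "$f"
	done |
	# C locale so lines with equal "CRC size" prefixes stay adjacent
	LC_ALL=C sort |
	# pass 4: print each group of lines sharing CRC and size
	awk '
		{
			key = $1 " " $2                      # CRC and size
			name = substr($0, length(key) + 2)   # rest of the line is the path
			if (key == prev) {
				if (!shown) {
					print "# identical size and checksum:"
					print first
					shown = 1
				}
				print name
			} else {
				prev = key; first = name; shown = 0
			}
		}'
}
```

Called as `find_dups /some/dir`, it prints one header line per group followed by the members of that group.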
 
fronclynne said:
Conceptually, sorting the suspects by size, & then cksum(1)ming any identically sized files would probably be more efficient than checksumming every file & looking for duplicates.

Hm, sorting by file size before checksumming is a good idea.
 
dup.sh

Here is a script I've been putting together as needs arose.
I still need to add the 'sort by size' step, so that cksum only runs on files whose sizes match. That'll definitely make it go faster.

It runs cksum on every file below the current directory, then prints the duplicates.


Code:
#!/bin/sh
#
#	Tue Apr 28 08:30:40 MDT 2009 - 0.9
#		cleaned up code, do checks if output files already exist
#		 no more need to manually disable parts of this script
#		output files are left intact for manual removal/reuse
#		added in 'maxdepth' to $findcmd to prevent directory
#		 traversal if not required - by default goes 999 dirs deep
#  Thu Feb 21 08:20:55 MST 2008 - 0.5
#		basics, more "options" coming soon
#
# script to recursively check for duplicate files in current directory
#
# scripts@pknet.net
# http://peterk.org/scripts/
#

PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/games:/usr/local/sbin:/usr/local/bin
export PATH

# user configurable variables
# files
cksumraw=~/.tmp.cksumraw
cksumsorted=~/.tmp.cksumsorted
cksumresults=~/.tmp.cksumresults
findcmd="/usr/bin/find ./ -type f -maxdepth 999 -print0"
# no inner quotes: they would be passed to date literally and show up in the output
datecmd='/bin/date +%Y.%m.%d.%H%M.%S'


# end user configurable variables

# check if output file exists, if it does, go on to next step
#  using the existing files

echo
# 'cksum' each file in directory
if [ ! -f $cksumraw ]
then
	echo "populating $cksumraw - `$datecmd`"
	$findcmd | /usr/bin/xargs -0 cksum >> $cksumraw
else
	echo "$cksumraw ALREADY exists, NOT repopulating"
	echo "  continuing with sorting it"
	echo
fi

if [ ! -f $cksumsorted ]
then
	echo "populating $cksumsorted - `$datecmd`"
	sort $cksumraw > $cksumsorted
else
	echo "$cksumsorted ALREADY exists, NOT repopulating"
	echo "  continuing with analyzing it"
	echo
fi

# now go through sorted list to check for dups
# cksum write: checksum CRC, total number of octets, the filename

if [ ! -f $cksumresults ]
then

	echo "populating $cksumresults - `$datecmd`"
	#init start of list
	startlist=0
	echo > $cksumresults
	while read -r crc octets filename
	do
		# first entry: nothing to compare against yet, just remember it
		if [ $startlist -eq 0 ]
		then
			startlist=1
			prevchksum=$crc
			prevoctets=$octets
			prevfile=$filename
			continue
		fi
		# same CRC and same size as the saved entry:
		# treat the file as a duplicate
		if [ "$prevchksum" = "$crc" ] && [ "$prevoctets" = "$octets" ]
		then
			# names are passed as arguments so a '%' in a file name
			# is not interpreted as a printf format
			printf 'file %s \n  has duplicate %s \n' "$prevfile" "$filename" | tee -a $cksumresults
		else
			prevchksum=$crc
			prevoctets=$octets
			prevfile=$filename
		fi
	done < $cksumsorted
	echo "done populating $cksumresults - `$datecmd`"
else
	echo "$cksumresults ALREADY exists, NOT repopulating"
	echo 
fi

echo 'temp files NOT deleted:'
echo "	$cksumraw"
echo "	$cksumsorted"
echo Output list of duplicates saved in $cksumresults
 
Check this script as well:

Code:
find . -type f -exec md5 {} \; | sort -t\=  -k 2 | awk -F" = " '
{ gsub(/^MD5\ \(|\)$/,"",$1) }
$2 != b { a=$1 ; f=0 }
$2 == b {
        f++
        if (f==1) {
                print "#Same files below#"
                print a
        }
        print $1
}
{ b=$2 }
'

It recursively scans the current directory and reports the files that have the same MD5 checksum.
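
Since the original question asked for a to-delete list, here is a small sketch that turns the report format printed above (a `#Same files below#` header followed by the members of each group) into such a list. The function name `to_delete_list` is made up for illustration, and keeping the first file of each group is an arbitrary choice, so the result should be reviewed before anything is removed.

```shell
#!/bin/sh
# to_delete_list (hypothetical name): read the duplicate report on stdin
# and print every file except the first one of each group.
to_delete_list() {
	awk '
		# header line starts a new group; the next line is the copy to keep
		/^#Same files below#$/ { keep = 1; next }
		# skip the first file of the group
		keep { keep = 0; next }
		# everything else is a removal candidate
		{ print }
	'
}
```

With the one-liner above saved as a script (say `finddups.sh`, a placeholder name), `sh finddups.sh | to_delete_list > to-delete.txt` would write the candidates to a file for review.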
 
Hey guys, thanks for the scripts, I'll check them out over the next few days.
@rbelk: yes, I've seen this tool too, along with others. I'm currently testing all the ones I've found in the ports tree to find the one that suits me best, or else I'll write or enhance a script.

regards
 