How Can I Delete Duplicate Files

wmichaelb · Sep 26, 2017

I'm running Ghost BSD 11.1, which is a FreeBSD distro, in my case running the MATE desktop. I have a Music folder with over 1900 mp3 files, the bulk of which have duplicates. Because they were accumulated from various machines, the filenames for identical songs differ; many times they have the same filename except for a two digit number followed by a space in the front, e.g., 13 songname.mp3. In other cases, the filename has the full song name, but no number, just songname.mp3. In yet other cases, the filename may or may not have the number, and will also show only a part of the entire song title, as in song.mp3. I would like to be able to eliminate all the duplicates and keep the album directories. I have rudimentary CLI skills, but I'm not afraid to try things. I have tried to research this, but not successfully. Does someone have a suggestion on a FISH script that would identify and eliminate all but one copy of a song file? Thanks so much in advance.

aragats · Sep 26, 2017

Here is a simple sequence of commands you can use in a terminal:

Code:

$ cd <music directory>
$ find . -name "*.mp3" -exec md5 -r {} \; > /tmp/songs.txt
$ sort -k 1,1 -u /tmp/songs.txt > /tmp/filtered.txt

So, you navigate to your music directory and find all music files. You may want to adjust the -name pattern, or use -iname in case some files have an MP3 extension. For each music file you calculate the md5 checksum and store the output in /tmp/songs.txt; identical files must have the same md5 sum. The -r switch makes md5 print the checksum first and the filename after it, so filenames containing spaces (like 13 songname.mp3) don't shift the field that holds the checksum. Then you sort that file by the 1st field (switch -k 1,1), i.e. by the actual checksum. The other switch, -u, makes the output keep only one of the lines that are identical in that field, so you'll have a list of unique music files in /tmp/filtered.txt.
If you want to see the difference, i.e. which lines have been filtered out, you can run a couple more commands:

Code:

$ sort -k 1,1 /tmp/songs.txt > /tmp/unfiltered.txt
$ diff /tmp/unfiltered.txt /tmp/filtered.txt

Now you'll have another list, made without the -u switch, to compare.
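
The lists above only report which files are unique; nothing gets deleted yet. As a rough sketch (not part of the commands posted here), the function below prints every copy except the first one in each group of byte-identical files, so the output can be reviewed before anything is removed. The names md5_of and list_redundant are made up for illustration, and the fallback to md5sum is an assumption added only so the sketch also runs on systems without the BSD md5 command. Paths containing tabs or newlines are not handled.

```sh
# Print the md5 checksum of one file (BSD md5 if present, else md5sum).
md5_of() {
    if command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"
    else
        md5sum < "$1" | cut -d ' ' -f 1
    fi
}

# list_redundant DIR: print every mp3 under DIR whose content was already
# seen in an earlier file. Paths are sorted first, so within each group of
# identical files the first path in sort order is the one left out of the
# list (i.e. the one that would be kept).
list_redundant() {
    find "$1" -type f -iname '*.mp3' | sort | while IFS= read -r f; do
        printf '%s\t%s\n' "$(md5_of "$f")" "$f"
    done | awk -F '\t' 'seen[$1]++ { print $2 }'
}
```

Something like list_redundant . > /tmp/redundant.txt writes the candidates to a file. Only after checking that file by eye would it make sense to feed the paths to rm, one line at a time.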

ralphbsz · Sep 26, 2017

There is a fly in the ointment: mp3 files are frequently modified without actually changing the music. For example, the MacOS iTunes player used to be in the habit of updating the ID3 tags whenever the file was being played (to show how often it was played, what rating the user had given it, and so on). Don't know whether it still does that. Or humans like to adjust the ID3 tags, so the music files sort nicely in their music players, or to add album artwork. All these minor changes will cause the md5 checksum that aragats proposed above to come out different.

Here is something I've been thinking about doing, but not implemented yet: For each .mp3 file, convert it to .wav format (fundamentally meaning strip the ID3 tags off), and then checksum just the .wav-format output. Then remove duplicates that contain the same "sound", meaning the same .wav file. Obviously, this is extra work, and takes more CPU and IO time. It also raises an interesting question: If there are two .mp3 files that contain the same music (have the same md5 checksum on their .wav-format output), which one to keep?
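
A cheaper variant of that idea can be sketched without decoding anything: skip a leading ID3v2 tag and a trailing 128-byte ID3v1 tag, and checksum only the bytes in between. This is a swapped-in technique, not the .wav conversion described above, so it only catches files whose encoded audio bytes are identical; other metadata formats (APE tags, for example) and re-encoded copies will still come out different. The names md5_of_stdin and audio_md5 are hypothetical, the md5sum fallback is an assumption for non-BSD systems, and the variables f, size, skip and len are plain globals.

```sh
# Hash whatever arrives on stdin (BSD md5 if present, else md5sum).
md5_of_stdin() {
    if command -v md5 >/dev/null 2>&1; then
        md5 -q
    else
        md5sum | cut -d ' ' -f 1
    fi
}

# audio_md5 FILE: md5 of FILE without a leading ID3v2 tag and without a
# trailing ID3v1 tag.
audio_md5() {
    f=$1
    size=$(( $(wc -c < "$f") ))
    skip=0
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4 size bytes
    # carrying 7 bits each; the size does not count the 10-byte header.
    set -- $(od -An -tu1 -N10 "$f")
    if [ $# -eq 10 ] && [ "$1 $2 $3" = "73 68 51" ]; then
        skip=$(( 10 + ($7 << 21) + ($8 << 14) + ($9 << 7) + ${10} ))
        # Flag bit 0x10 announces a 10-byte footer after the tag.
        if [ $(( $6 & 16 )) -ne 0 ]; then skip=$(( skip + 10 )); fi
    fi
    len=$(( size - skip ))
    # ID3v1 trailer: the last 128 bytes start with "TAG".
    if [ "$len" -ge 128 ]; then
        set -- $(tail -c 128 "$f" | od -An -tu1 -N3)
        if [ "$1 $2 $3" = "84 65 71" ]; then len=$(( len - 128 )); fi
    fi
    if [ "$len" -le 0 ]; then
        printf '' | md5_of_stdin
    else
        tail -c "+$(( skip + 1 ))" "$f" | head -c "$len" | md5_of_stdin
    fi
}
```

Two copies of a song that differ only in their ID3 tags should then give the same audio_md5 value, while md5 on the whole files would differ. The question of which copy to keep is left open here as well.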

Side remark: I do something very similar in my home-made backup program, except I don't use md5, but sha512. Takes fundamentally just the same time to calculate, and has lower birthday-paradox collision probability.

SirDice · Sep 26, 2017

wmichaelb said:
I'm running Ghost BSD 11.1

PC-BSD, FreeNAS, NAS4Free, and all other FreeBSD Derivatives

scottro · Sep 26, 2017

https://forums.freebsd.org/threads/61441/

So, a three page thread about systemd is relevant, but a question about scripting that would be identical on FreeBSD is considered taboo? This is a scripting question, not a GhostBSD question.

Just trying to clarify.

SirDice · Sep 26, 2017

scottro said:
So, a three page thread about systemd is relevant, but a question about scripting that would be identical on FreeBSD is considered taboo?

The systemd thread is general chit-chat about something technical that's not related to FreeBSD, which is why it's posted in "off-topic". This is a support question about a FreeBSD derivative.

Besides that, I didn't say it wasn't allowed. The FAQ is more meant to inform you that the solutions offered here are for FreeBSD and may have adverse effects on a heavily modified derivative. If it was something that's really not allowed I would have closed the thread immediately and pointed to rule #7.

scottro · Sep 26, 2017

Ok, fair enough. As I said, I was looking for clarification.

SirDice · Sep 26, 2017

scottro said:
As I said, I was looking for clarification.

Truthfully, some threads are definitely borderline. Nothing wrong with seeking some clarification.

aragats · Sep 26, 2017

ralphbsz, yes, those are very good points, thanks!
Also, the initial problem formulation is rather loose. You may also have the "same" composition published in different albums. For example, I have more than 30 Pink Floyd albums in mp3, issued within a span of 40 years. Of course, I want to keep all versions of this and that song.
wmichaelb mentioned the song names. If they all have meaningful names, it is possible to calculate correlation functions on those strings, although it's not trivial.
Another way is to use an online service which can identify the melody ;-)
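
For the filename comparison, something far cruder than a correlation function can at least produce a list of candidates to inspect by hand. The sketch below assumes the naming pattern from the first post (an optional leading track number, then the song name): it reduces each filename to a key and prints the files that share a key. The names name_key and list_same_names are hypothetical. Truncated titles such as song.mp3 are not matched, and different recordings with the same title are, so the output is a list to review, not a list to delete.

```sh
# name_key PATH: crude comparison key for a song file name. Drops the
# directory, the .mp3 extension and a leading track number followed by a
# separator (as in "13 songname.mp3"), then lower-cases the rest.
name_key() {
    basename "$1" |
        sed -e 's/\.[Mm][Pp]3$//' -e 's/^[0-9][0-9]*[ ._-][ ._-]*//' |
        tr '[:upper:]' '[:lower:]'
}

# list_same_names DIR: print "key<TAB>path" for every mp3 under DIR whose
# key is shared with at least one other file, sorted so that files with
# the same key end up next to each other.
list_same_names() {
    find "$1" -type f -iname '*.mp3' | while IFS= read -r f; do
        printf '%s\t%s\n' "$(name_key "$f")" "$f"
    done | sort | awk -F '\t' '
        { count[$1]++; line[NR] = $0; key[NR] = $1 }
        END { for (i = 1; i <= NR; i++) if (count[key[i]] > 1) print line[i] }'
}
```

Because the album directory is not part of the key, the same title on two different albums shows up as a pair, which is exactly the case where a human has to decide.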

tankist02 · Sep 26, 2017

I use sysutils/fdupes (http://www.freshports.org/sysutils/fdupes/) to find duplicate files.

aragats · Sep 26, 2017

Actually, sysutils/fdupes does use md5 to identify duplicates, but it optimizes the algorithm; in particular, for large files it calculates md5 on only a part of the file to save time and resources.

tankist02 · Sep 27, 2017

fdupes also does not run md5 on files of different sizes.
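
The same size check can be sketched in plain sh, for anyone who prefers to stay with the find and md5 approach. This is only an illustration of the idea, not how fdupes is implemented, and the name same_size_candidates is made up. Files whose size is unique cannot have a byte-identical duplicate, so only the remaining ones need a checksum.

```sh
# same_size_candidates DIR: print only those mp3 files under DIR whose
# size in bytes is shared with at least one other file.
same_size_candidates() {
    find "$1" -type f -iname '*.mp3' -exec sh -c '
        for f do
            # wc -c may pad its output with blanks; $(( )) strips them.
            printf "%s\t%s\n" "$(( $(wc -c < "$f") ))" "$f"
        done' sh {} + |
    awk -F '\t' '
        { count[$1]++; size[NR] = $1; path[NR] = $2 }
        END { for (i = 1; i <= NR; i++) if (count[size[i]] > 1) print path[i] }'
}
```

The printed paths can then be read line by line and passed to md5, so the expensive step runs only on files that could actually be duplicates.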

wmichaelb · Sep 27, 2017

So, I was able to eliminate the duplicate files and the multiple labels. Now, I would like to copy the directory of mp3 files back to my FAT32-formatted USB HD so that I can also use that on other machines. I can copy the directory and files back to the HD, but when I try to look at that directory, I get an error message stating "Sorry, could not display all the contents of "Music_Base": Error when getting information for file '/mnt/Music_Base/??�?�?��.6�?': Invalid argument". I can read the original directory on the USB HD just fine. I've tried using both cp and rsync, and got the same results. Can anyone suggest what else I might be doing wrong? Thanks again!

ralphbsz · Sep 28, 2017

What do you mean by "try to look at the directory"? Is this an ls command? If yes, what is your cwd? Is this some GUI-based file manager? Also, your error message has unreadable gibberish (probably mangled unicode characters) after '/mnt/Music_Base'. Seems something didn't survive cut and paste. Can you tell us what those really look like on your screen?
