Post 302449436 by Michael Stora on Monday 30th of August 2010 02:43:14 PM
Duplicate file remover using md5sum--good enough?

This is not a typical question. I have a fully working script but I'm interested in optimizing it.

I frequently back up photos and movies from my digital camera and cell phone to my home and work desktops, laptops, my wife's netbook, and my home NAS, and I often end up with multiple copies of the same files in folders of varying completeness. I previously wrote a very slow script that checked whether files of the same length were identical; it was also limited to a single directory.

I recently wrote this much faster script. It digs recursively through a directory tree (with find instead of ls), builds one record per file containing size, checksum, basename and full path name, sorts the records, and deletes all but the first whenever size and checksum are identical. The script uses pipes to avoid arrays and executes quite fast. Within each group of identical files, the alphabetically first basename gets kept.

I use a comma to separate the size-and-checksum part (the part tested for equality) from the basename, and a backslash before the path name, so I can pass everything down the pipe as one line and split it afterwards. There is a dash between size and checksum that was useful for debugging and that I don't think impacts speed. The path ends up being part of the sort key, but that is harmless since it comes after the basename. The parameter substitutions are robust against a comma occurring in the file name or path. A backslash (God forbid!) is not handled safely: read without -r consumes backslashes, and ${line##*\\} cuts at the last backslash in the line.
Code:
#! /bin/bash

#Delete duplicate files starting at $1 recursive

dir=${1:-.}                                                                 #defaults to current directory '.'

find "$dir" | { while read path; do
    name=${path##*/}                                                        #basename
    if [ -f "$path" ]; then                                                 #if regular file
        sum=$(md5sum "$path")
        echo `stat -c %s "$path"`'-'${sum%%' '*}','"$name"'\\'"$path"       #length-md5sum,basename\path
    else continue                                                           #skip if not regular file
    fi
done } | sort | {                                                           #sort files
test=''
while read line; do
    front=${line%%,*}                                                       #size-md5sum
    back=${line##*\\}                                                       #full file name with path
    if [ "$front" = "$test" ]; then                                         #same size-md5sum as previous file?
        echo 'deleting duplicate file '"$back"; rm "$back"                  #if so, delete it.
    fi
    test="$front"
done }
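
To make the record format concrete, here is a minimal sketch of the two parameter expansions applied to made-up records. The hash value, sizes and paths are invented for illustration and do not come from a real run. The second record shows what happens when the file name itself contains a backslash.

```shell
#!/bin/bash
# Hypothetical records in the script's format: size-md5sum,basename\path
# (the hash below is a placeholder, not a real checksum)
line='2048-0123456789abcdef0123456789abcdef,IMG,001.jpg\./photos/IMG,001.jpg'

front=${line%%,*}      # drop the longest suffix starting at a comma -> size-md5sum
back=${line##*\\}      # drop the longest prefix ending in a backslash -> path

printf 'front=%s\nback=%s\n' "$front" "$back"

# Same expansions on a record whose file name contains a backslash.
line2='10-0123456789abcdef0123456789abcdef,a\b.txt\./dir/a\b.txt'
back2=${line2##*\\}    # cuts at the LAST backslash, so only part of the path survives

printf 'back2=%s\n' "$back2"
```

The comma case comes out intact because the size-md5sum prefix never contains a comma. The backslash case does not, since the longest-prefix match runs through the backslash inside the path.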

Since it works so well, I'd like to use it as a general tool. Is md5sum good enough? Does including the size (a legacy optimization from my first script) really add any benefit, either in collision resistance or in the execution speed of the sort command? Any ideas on how to make this script faster or more robust?
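
One possible way to use size for speed rather than for collision resistance: a file whose size is unique in the tree cannot be a duplicate, so it never needs to be checksummed. The sketch below lists only files that share their size with at least one other file. It assumes GNU find (for -printf) and file names without newlines; size_candidates is a hypothetical helper name, not part of the script above.

```shell
#!/bin/bash
# Sketch: print only the files whose size is shared with at least one other file.
# Assumes GNU find (-printf) and file names that contain no newlines.
size_candidates() {
    find "${1:-.}" -type f -printf '%s\t%p\n' |
    awk -F'\t' '
        {
            size = $1
            path = substr($0, length($1) + 2)    # everything after the first tab
            count[size]++
            paths[size] = paths[size] path "\n"
        }
        END {
            for (s in count)
                if (count[s] > 1)
                    printf "%s", paths[s]        # candidates only, grouped by size
        }'
}

# Only the candidates would then need a checksum, for example:
# size_candidates ~/photos | while IFS= read -r f; do md5sum "$f"; done
```

How much this saves depends on the data: in a tree where most sizes are unique, most files are never read at all.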

Edit: sha1sum is about 13% slower than md5sum but probably eliminates collision concerns.
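
Another way to sidestep collision concerns, whichever checksum is used, is to confirm that two files really are byte-for-byte identical before deleting one of them. MD5 collisions can be constructed deliberately, though an accidental one among camera files would be extraordinarily unlikely. The following is a minimal sketch; delete_if_identical is a hypothetical helper, not part of the script above.

```shell
#!/bin/bash
# Sketch: delete "dup" only if it is byte-for-byte identical to "keep".
# delete_if_identical is a hypothetical helper name used for illustration.
delete_if_identical() {
    local keep=$1 dup=$2

    # Never delete when both names refer to the same file (same path or hard link).
    [ "$keep" -ef "$dup" ] && return 0

    if cmp -s -- "$keep" "$dup"; then       # cmp -s is silent; status 0 means identical
        echo "deleting duplicate file $dup"
        rm -- "$dup"
    else
        echo "checksum matched but contents differ, keeping $dup" >&2
    fi
}
```

In the second loop this would mean remembering the path of the first file in each group and calling delete_if_identical with that path and "$back" instead of calling rm directly. The extra read costs time only for files that are already suspected duplicates.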

Mike

Last edited by Michael Stora; 08-30-2010 at 04:44 PM..
 
