Duplicate filename algorithm


 
# 1  
Old 03-20-2010

Over the years I've created a bit of a mess in my directories with duplicate files. I've used fdupes to remove exact duplicates, but there are still files that are almost identical, which fdupes doesn't look for.

These have the same (or very similar) filenames. So I have tried to create a script to look for them and list them like fdupes does (sets of duplicates separated by a blank line).
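For example, the output I'm after would look something like this (made-up paths):

Code:
./music/song.mp3
./backup/old/song.mp3

./docs/notes.txt
./archive/notes.txt

What I have so far is this very inelegant script.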

Code:
#!/bin/sh

filepathlist="filepathlist.txt"
filepathlistcomp="filepathlistcomp.txt"
duplicatefilenamelist="duplicatefilenamelist.txt"

echo "" > "$filenamelist"
echo "" > "$filepathlistcomp"
find > "$filepathlist"

while read path ;do
 filename=`basename "$path"`
 dupes=0
 while read pathcomp ;do
  filenamecomp=`basename "$pathcomp"`
  if [ "$filename" = "$filecomp" ];then
   if [ $dupes -gt 0 ];then
    echo "$filename" >> "$duplicatefilenamelist"
   fi
   dupes=1
  else
   echo "$path" >> "$filepathlistcomp" 
  fi
 done < "$filepathlist"
 
 echo "" >> "$duplicatefilenamelist"
 "$filepathlist" < "$filepathlistcomp"
done < "$filepathlist"

I'm sure there is a better way of doing this. Would this script even work, since I'm trying to change the file in the loop that's reading it? My main concern is efficiency in the algorithm. I tried to remove duplicates already accounted for by removing them from the list as it progresses, but I have a feeling this will actually make it less efficient because of the added file operations. Any ideas on how best to approach this problem?
# 2  
Old 03-20-2010
I'm searching for a similar tool.
I found a simple way to print the duplicate file names:
Code:
#!/bin/bash
FILES=/dev/shm/filelist    # temp list kept on the ramdisk
# keep only the basenames that occur more than once
find -type f | awk -F'/' '{print $NF}' | sort | uniq -d > $FILES
while read F
do
	find -type f -name "$F"    # print every path carrying that duplicated name
	echo    # blank line separates the sets
done < $FILES
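A single-pass variant of the same idea, untested: awk groups the full paths by basename in memory and prints only the names seen more than once, so find runs just once (assumes filenames don't contain newlines):

Code:
find -type f | awk -F'/' '
{ paths[$NF] = paths[$NF] $0 "\n"; count[$NF]++ }            # group full paths by basename
END { for (n in count) if (count[n] > 1) print paths[n] }'   # each group ends with a blank line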



---------- Post updated at 13:05 ---------- Previous update was at 12:04 ----------

A complete script, which could be optimized:
Code:
#!/bin/bash
# Usage: find-dup [Path [Name]]
if [ -z "$1" ]
then
    read -p "Path to scan: " DIR    # Ask for a base path if not given as argument
else
    DIR="$1"
    if [ -z "$2" ]
    then
        read -p "File Names: " NAME    # Ask for a file pattern if not given as argument
        [ -n "$NAME" ] && NAME="-name $NAME"
    fi
fi
cd "$DIR" || exit 1    # quoted in case the path contains spaces
LIST=/dev/shm/filelist    # to store the temp filelists (ramdisk)
find -type f ${NAME:+-name "$NAME"} | awk -F'/' '{print $NF}' | sort | uniq -d > $LIST-1    # adds -name <pattern> only when one was given
while read F
do
    find -type f -name "$F" > $LIST-2
    i=0
    unset FILE
    while read L    # Creates an array with duplicate files
    do    ((i++)); FILE[$i]="$L"
    done < $LIST-2
    FILE[0]="Do not delete"
    OPT=""
    for ((i=0; i<${#FILE[@]}; i++))    # Displays the files with numbers for deletion
    do    OPT+=$i; echo -e "$i. ${FILE[$i]}"
    done
    K1=""
    until [[ $K1 = [$OPT] ]]    # loop until one of the listed digits is pressed
    do    read -s -n1 K1 <&1    # stdin is $LIST-1 inside this loop, so read the keypress via fd 1 (the terminal)
    done
    if (($K1))
    then
        read -s -n1 -p "Confirm deletion of ${FILE[$K1]} (Y/N): " K2 <&1
        [[ $K2 = [yY] ]] && { echo; rm -v "${FILE[$K1]}"; } || echo "No deletion"
    else
        echo "No deletion"
    fi
    echo
done < $LIST-1
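An example run (illustrative path and pattern; the pattern is quoted so the calling shell passes it through to find):

Code:
./find-dup ~/Documents "*.txt"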


Last edited by frans; 03-20-2010 at 08:10 AM..
# 3  
Old 03-20-2010
You can use the non-standard Perl module File::Find::Duplicates
if you need to compare the content:

Code:
perl -MFile::Find::Duplicates -e'
    @dupes = find_duplicate_files("dir1", "dir2");
    printf "Files %s (of size %d) hash to %s\n", 
      (join "," , @{$_->files}), $_->size, $_->md5
        for @dupes'
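The module isn't in the standard Perl distribution; if it's missing, it can usually be installed from CPAN:

Code:
cpan File::Find::Duplicates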

# 4  
Old 03-22-2010
Thank you for the suggestions, frans and radoulov.

I'm not familiar with Perl; can you please elaborate on what that Perl script does? It looks like it compares two directories looking for duplicate files rather than duplicate filenames, is that correct?

I have now created two scripts that find duplicate filenames, but they are so slow that I really need to optimise the algorithm.

In both methods I create a complete file list of the directory with full paths. My only problem is how time-consuming the scripts are. Both methods work, but which is the most time-efficient for long lists?

Method 1

Go through the path list one entry at a time, looking for matching filenames further down the list.

Paths with matching filenames are removed from the list so that the next filename has fewer entries to compare against.

Method 2
In addition to the path list (list 1), create a list of duplicate filenames using uniq (list 2). Filter the path list through grep with these duplicate filenames to get a smaller path list (list 1).
Go through each duplicate filename (in list 2), looking for the matching paths in the path list (list 1).

Remove matching paths so that the next duplicate filename has fewer entries to compare against.


The questions are:
1) Is the added file operation required to remove previously matched paths worth it?
2) Which algorithm is better in terms of speed: method 1, method 2, or some other way?
3) I'd like to add a progress bar, but I do not want it on stdout since that would interfere with the actual output of duplicates. How do I do this? Should I use stderr? (Something like the sketch below is what I mean.)
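Untested sketch of the stderr idea: the counter is rewritten in place on stderr, so stdout remains a clean duplicate list ($filedupeslist and $filepathlist are the temp files used by the scripts below):

Code:
total=`wc -l < "$filedupeslist"`
count=0
while read filedupe ;do
 count=$((count + 1))
 printf '\rprocessing %d/%d' "$count" "$total" >&2   # progress on stderr only
 # the per-filename matching work would print its results on stdout here
done < "$filedupeslist"
printf '\n' >&2   # terminate the progress line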

The scripts

The scripts for both methods are below, and they both work, but directories with many, many files (I tested with 25,000) take considerable time; I'd really like to speed the scripts up.

If you want to test either one, you can create a simple test text file with example paths to duplicate files, then use

./scriptname.sh -f List_of_file_paths.txt

If you want to actually look for duplicate filenames in a directory, just run the script and it will look for duplicates in the current working directory. For another directory use

./scriptname.sh directory

Method 1
Code:
#!/bin/sh

# Filenames in shared memory directory
filepathlist=/dev/shm/filelist
filepathlistcomp=/dev/shm/filelistcomp

# Usage help printed
usage="$0 [-f list_file] [Directory]"

# Option processing
usefilelist=false    # default when -f is not given
while test $# -gt 0 ; do
 case "$1" in
  -f) usefilelist=true; filelist="$2"; shift 2 ;;
  --help) echo "$usage"; exit 1 ;;
  --*) break ;;
  -*) echo "$usage"; exit 1 ;;
  *)  break ;;
 esac
done

# store search directory given as command line argument
# (after option processing the remaining argument is $1)
if [ ! -z "$1" ]; then
 finddir="$1"
else
 finddir="."
fi


if $usefilelist ;then
 cp "$filelist" "$filepathlist"
else
 find "$finddir" -type f > "$filepathlist"
fi

echo -n "" > "$filepathlistcomp"

while true ;do
 read path < "$filepathlist"
 filename=`basename "$path"`
 printfirst=true
 if [ "$path" = "" ];then
  exit
 fi
 while read pathcomp ;do
  if [ "$path" != "$pathcomp" ];then
   filenamecomp=`basename "$pathcomp"`
   if [ "$filename" = "$filenamecomp" ];then
     if [ $printfirst = true ];then
       echo "" #new line for new set
       echo "$path"
       printfirst=false
     fi
     echo "$pathcomp"    
   else
     echo "$pathcomp" >> "$filepathlistcomp"
   fi
  fi
 done < "$filepathlist"
 cp "$filepathlistcomp" "$filepathlist"

 echo -n "" > "$filepathlistcomp"
done

Method 2
Code:
#!/bin/sh

# Filenames in shared memory directory
filepathlist=/dev/shm/filelist
filepathlistcomp=/dev/shm/filelistcomp
filedupeslist=/dev/shm/filedupeslist

# Usage help printed
usage="$0 [-f list_file] [Directory]"

# Option processing
usefilelist=false    # default when -f is not given
while test $# -gt 0 ; do
 case "$1" in
  -f) usefilelist=true; filelist="$2"; shift 2 ;;
  --help) echo "$usage"; exit 1 ;;
  --*) break ;;
  -*) echo "$usage"; exit 1 ;;
  *)  break ;;
 esac
done

# store search directory given as command line argument
# (after option processing the remaining argument is $1)
if [ ! -z "$1" ]; then
 finddir="$1"
else
 finddir="."
fi

if $usefilelist ;then
 cp "$filelist" "$filepathlist"
else
 find "$finddir" -type f > "$filepathlist"
fi

cat "$filepathlist" | awk -F'/' '{print $NF}' | sort | uniq -d > "$filedupeslist"
grep -f "$filedupeslist" "$filepathlist" > "$filepathlistcomp"   # note: -F (fixed strings) would match the names literally and run faster

while read filedupe ;do
 
 echo -n "" > "$filepathlist"

 while read path ;do
  if [ "$path" = "" ];then
   break
  fi
  filename=`basename "$path"`
  if [ "$filename" = "$filedupe" ];then
   echo "$path"
  else
   echo "$path" >> "$filepathlist"
  fi
 done < "$filepathlistcomp"
 
 cp "$filepathlist" "$filepathlistcomp"
 echo ""

done < "$filedupeslist"


Last edited by cue; 03-22-2010 at 05:10 AM..
# 5  
Old 03-22-2010
If you want to look for file duplication by content, you can take a look at this tool of mine: finddup | Get finddup at SourceForge.net, which is in Perl.
# 6  
Old 03-22-2010
Quote:
Originally Posted by thegeek
If you want to look for file duplication by content, you can take a look at this tool of mine: finddup | Get finddup at SourceForge.net, which is in Perl.
Thanks for creating that, thegeek, but is that not a content comparison? Can I ask how it differs from fdupes? The thing with fdupes is that it does a byte-for-byte content comparison. I used it to remove duplicate files (i.e. files that are exactly the same). However, files that differed only slightly it would not list as "duplicates", and rightly so. For example, my filing system is in such a mess that I have multiple versions of the same file in different directories, where I might have added something to the newer one. The files are probably 90% the same, but they were not exact duplicates, so fdupes would not list them. I do not know of any tools (or a way) to list files that are almost the same; the only crude idea I have for two files is sketched below. Can this be done in finddup? If so, that would be great.
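Something like this, untested, where the two file arguments are placeholders:

Code:
#!/bin/sh
# Untested idea, not what finddup does: a rough line-level similarity
# measure for two text files, based on how many lines diff flags.
f1="$1"; f2="$2"
differing=`diff "$f1" "$f2" | grep -c '^[<>]'`   # lines marked added/removed/changed
total=`cat "$f1" "$f2" | wc -l`                  # combined line count of both files
echo "$f1 vs $f2: $differing differing lines out of $total total lines"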

This is why I'm comparing their filenames instead, since I assume I probably didn't rename the files.

I've now solved the efficiency problem too, if anybody is interested. The extra file operations were not worth it, and the "grep -f" line was extremely taxing. So I moved the grep inside the loop, which also avoided the extra iterations of the loop. The script before took hours to go through 25,000 files; this one takes less than 5 minutes. Forgive the unnecessary use of cat; file redirection gave me some trouble for some reason.

Code:
#!/bin/sh

# Filenames in shared memory directory
filepathlist=/dev/shm/filelist4
filepathlistcomp=/dev/shm/filelistcomp4
filedupeslist=/dev/shm/filedupeslist4

# Usage help printed
usage="$0 [-f list_file] [Directory]"

# Option processing
usefilelist=false    # default when -f is not given
while test $# -gt 0 ; do
 case "$1" in
  -f) usefilelist=true; filelist="$2"; shift 2 ;;
  --help) echo "$usage"; exit 1 ;;
  --*) break ;;
  -*) echo "$usage"; exit 1 ;;
  *)  break ;;
 esac
done

# store search directory given as command line argument
# (after option processing the remaining argument is $1)
if [ ! -z "$1" ]; then
 finddir="$1"
else
 finddir="."
fi

if $usefilelist ;then
 cp "$filelist" "$filepathlist"
else
 find "$finddir" -type f > "$filepathlist"
fi

cat "$filepathlist" | awk -F'/' '{print $NF}' | sort | uniq -d > "$filedupeslist"

while read filedupe ;do
 grep "$filedupe" "$filepathlist" > "$filepathlistcomp"
 
 while read path ;do
  if [ "$path" = "" ];then
   break
  fi
  filename=`basename "$path"`
  if [ "$filename" = "$filedupe" ];then
   echo "$path"
  fi
 done < "$filepathlistcomp"

 echo ""

done < "$filedupeslist"


Last edited by cue; 03-22-2010 at 09:11 AM..
# 7  
Old 03-22-2010
Yes, as already stated, the previous Perl solutions compare the content of the files.
Could you try this Perl code and compare its performance with your shell script?


Code:
perl -MFile::Find -e'
  $d = shift || die "$0 dir\n";
  find { 
    wanted => sub {
      -f and push @{$u{$_}}, $File::Find::name;
      }
    }, $d;
  @{$u{$_}} > 1 and printf "found %s in: \n\n%s\n\n", 
    $_, join $/, @{$u{$_}} for keys %u;    
  ' <dirname>
