Duplicate filename algorithm


 
# 1  
Old 03-20-2010

Over the years I've created a bit of a mess in my directories with duplicate files. I've used fdupes to remove exact duplicates, but there are still files that are almost identical, which fdupes doesn't look for.

These have the same (or very similar) filenames. So I have tried to create a script to look for them and list them like fdupes does (sets of duplicates separated by a blank line).
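For example, the output I'm after would look something like this (made-up paths):

Code:
./music/song.mp3
./backup/old/song.mp3

./docs/notes.txt
./archive/notes.txt

What I have so far is this very inelegant script.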

Code:
#!/bin/sh

filepathlist="filepathlist.txt"
filepathlistcomp="filepathlistcomp.txt"
duplicatefilenamelist="duplicatefilenamelist.txt"

echo "" > "$filenamelist"
echo "" > "$filepathlistcomp"
find > "$filepathlist"

while read path ;do
 filename=`basename "$path"`
 dupes=0
 while read pathcomp ;do
  filenamecomp=`basename "$pathcomp"`
  if [ "$filename" = "$filecomp" ];then
   if [ $dupes -gt 0 ];then
    echo "$filename" >> "$duplicatefilenamelist"
   fi
   dupes=1
  else
   echo "$path" >> "$filepathlistcomp" 
  fi
 done < "$filepathlist"
 
 echo "" >> "$duplicatefilenamelist"
 "$filepathlist" < "$filepathlistcomp"
done < "$filepathlist"

I'm sure there is a better way of doing this. Would this script even work, since I'm trying to change the file in the loop that's reading it? My main concern is efficiency in the algorithm. I tried to remove duplicates already accounted for by removing them from the list as it progresses, but I have a feeling this will actually make it less efficient because of the added file operations. Any ideas on how best to approach this problem?
# 2  
Old 03-20-2010
I'm searching for a similar tool.
I found a simple way to print the duplicate file names:
Code:
#!/bin/bash
FILES=/dev/shm/filelist    # temp list kept on the ramdisk
# keep only the basenames that occur more than once
find -type f | awk -F'/' '{print $NF}' | sort | uniq -d > $FILES
while read F
do
	find -type f -name "$F"    # print every path carrying that duplicated name
	echo    # blank line separates the sets
done < $FILES
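A single-pass variant of the same idea, untested: awk groups the full paths by basename in memory and prints only the names seen more than once, so find runs just once (assumes filenames don't contain newlines):

Code:
find -type f | awk -F'/' '
{ paths[$NF] = paths[$NF] $0 "\n"; count[$NF]++ }            # group full paths by basename
END { for (n in count) if (count[n] > 1) print paths[n] }'   # each group ends with a blank line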



---------- Post updated at 13:05 ---------- Previous update was at 12:04 ----------

A complete script, which could be optimized:
Code:
#!/bin/bash
# Usage: find-dup [Path [Name]]
if [ -z "$1" ]
then
    read -p "Path to scan: " DIR    # Ask for a base path if not given as argument
else
    DIR="$1"
    if [ -z "$2" ]
    then
        read -p "File Names: " NAME    # Ask for a file pattern if not given as argument
        [ -n "$NAME" ] && NAME="-name $NAME"
    fi
fi
cd "$DIR" || exit 1    # quoted in case the path contains spaces
LIST=/dev/shm/filelist    # to store the temp filelists (ramdisk)
find -type f ${NAME:+-name "$NAME"} | awk -F'/' '{print $NF}' | sort | uniq -d > $LIST-1    # adds -name <pattern> only when one was given
while read F
do
    find -type f -name "$F" > $LIST-2
    i=0
    unset FILE
    while read L    # Creates an array with duplicate files
    do    ((i++)); FILE[$i]="$L"
    done < $LIST-2
    FILE[0]="Do not delete"
    OPT=""
    for ((i=0; i<${#FILE[@]}; i++))    # Displays the files with numbers for deletion
    do    OPT+=$i; echo -e "$i. ${FILE[$i]}"
    done
    K1=""
    until [[ $K1 = [$OPT] ]]    # loop until one of the listed digits is pressed
    do    read -s -n1 K1 <&1    # stdin is $LIST-1 inside this loop, so read the keypress via fd 1 (the terminal)
    done
    if (($K1))
    then
        read -s -n1 -p "Confirm deletion of ${FILE[$K1]} (Y/N): " K2 <&1
        [[ $K2 = [yY] ]] && { echo; rm -v "${FILE[$K1]}"; } || echo "No deletion"
    else
        echo "No deletion"
    fi
    echo
done < $LIST-1
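An example run (illustrative path and pattern; the pattern is quoted so the calling shell passes it through to find):

Code:
./find-dup ~/Documents "*.txt"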


Last edited by frans; 03-20-2010 at 08:10 AM..
# 3  
Old 03-20-2010
You can use the non-standard Perl module File::Find::Duplicates
if you need to compare the content:

Code:
perl -MFile::Find::Duplicates -e'
    @dupes = find_duplicate_files("dir1", "dir2");
    printf "Files %s (of size %d) hash to %s\n", 
      (join "," , @{$_->files}), $_->size, $_->md5
        for @dupes'
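The module isn't in the standard Perl distribution; if it's missing, it can usually be installed from CPAN:

Code:
cpan File::Find::Duplicates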

# 4  
Old 03-22-2010
Thank you for the suggestions, frans and radoulov.

I'm not familiar with Perl; can you please elaborate on what that Perl script does? It looks like it compares two directories looking for duplicate files rather than duplicate filenames, is that correct?

I have now created two scripts that find duplicate filenames, but they are so slow that I really need to optimise the algorithm.

In both methods I create a complete file list of the directory with full paths. My only problem is how time-consuming the scripts are. Both methods work, but which is the most time-efficient for long lists?

Method 1

Go through the path list one entry at a time, looking for matching filenames further down the list.

Paths with matching filenames are removed from the list so that the next filename has fewer entries to compare against.

Method 2
In addition to the path list (list 1), create a list of duplicate filenames using uniq (list 2). Filter the path list through grep with these duplicate filenames to get a smaller path list (list 1).
Go through each duplicate filename (in list 2), looking for the matching paths in the path list (list 1).

Remove matching paths so that the next duplicate filename has fewer entries to compare against.


The questions are:
1) Is the added file operation required to remove previously matched paths worth it?
2) Which algorithm is better in terms of speed: method 1, method 2, or some other way?
3) I'd like to add a progress bar, but I do not want it on stdout since that would interfere with the actual output of duplicates. How do I do this? Should I use stderr? (Something like the sketch below is what I mean.)
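Untested sketch of the stderr idea: the counter is rewritten in place on stderr, so stdout remains a clean duplicate list ($filedupeslist and $filepathlist are the temp files used by the scripts below):

Code:
total=`wc -l < "$filedupeslist"`
count=0
while read filedupe ;do
 count=$((count + 1))
 printf '\rprocessing %d/%d' "$count" "$total" >&2   # progress on stderr only
 # the per-filename matching work would print its results on stdout here
done < "$filedupeslist"
printf '\n' >&2   # terminate the progress line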

The scripts

The scripts for both methods are below, and they both work, but directories with many, many files (I tested with 25,000) take considerable time; I'd really like to speed the scripts up.

If you want to test either one, you can create a simple test text file with example paths to duplicate files, then use

./scriptname.sh -f List_of_file_paths.txt

If you want to actually look for duplicate filenames in a directory, just run the script and it will look for duplicates in the current working directory. For another directory use

./scriptname.sh directory

Method 1
Code:
#!/bin/sh

# Filenames in shared memory directory
filepathlist=/dev/shm/filelist
filepathlistcomp=/dev/shm/filelistcomp

# Usage help printed
usage="$0 [-f list_file] [Directory]"

# Option processing
usefilelist=false    # default when -f is not given
while test $# -gt 0 ; do
 case "$1" in
  -f) usefilelist=true; filelist="$2"; shift 2 ;;
  --help) echo "$usage"; exit 1 ;;
  --*) break ;;
  -*) echo "$usage"; exit 1 ;;
  *)  break ;;
 esac
done

# store search directory given as command line argument
# (after option processing the remaining argument is $1)
if [ ! -z "$1" ]; then
 finddir="$1"
else
 finddir="."
fi


if $usefilelist ;then
 cp "$filelist" "$filepathlist"
else
 find "$finddir" -type f > "$filepathlist"
fi

echo -n "" > "$filepathlistcomp"

while true ;do
 read path < "$filepathlist"
 filename=`basename "$path"`
 printfirst=true
 if [ "$path" = "" ];then
  exit
 fi
 while read pathcomp ;do
  if [ "$path" != "$pathcomp" ];then
   filenamecomp=`basename "$pathcomp"`
   if [ "$filename" = "$filenamecomp" ];then
     if [ $printfirst = true ];then
       echo "" #new line for new set
       echo "$path"
       printfirst=false
     fi
     echo "$pathcomp"    
   else
     echo "$pathcomp" >> "$filepathlistcomp"
   fi
  fi
 done < "$filepathlist"
 cp "$filepathlistcomp" "$filepathlist"

 echo -n "" > "$filepathlistcomp"
done

Method 2
Code:
#!/bin/sh

# Filenames in shared memory directory
filepathlist=/dev/shm/filelist
filepathlistcomp=/dev/shm/filelistcomp
filedupeslist=/dev/shm/filedupeslist

# Usage help printed
usage="$0 [-f list_file] [Directory]"

# Option processing
usefilelist=false    # default when -f is not given
while test $# -gt 0 ; do
 case "$1" in
  -f) usefilelist=true; filelist="$2"; shift 2 ;;
  --help) echo "$usage"; exit 1 ;;
  --*) break ;;
  -*) echo "$usage"; exit 1 ;;
  *)  break ;;
 esac
done

# store search directory given as command line argument
# (after option processing the remaining argument is $1)
if [ ! -z "$1" ]; then
 finddir="$1"
else
 finddir="."
fi

if $usefilelist ;then
 cp "$filelist" "$filepathlist"
else
 find "$finddir" -type f > "$filepathlist"
fi

cat "$filepathlist" | awk -F'/' '{print $NF}' | sort | uniq -d > "$filedupeslist"
grep -f "$filedupeslist" "$filepathlist" > "$filepathlistcomp"   # note: -F (fixed strings) would match the names literally and run faster

while read filedupe ;do
 
 echo -n "" > "$filepathlist"

 while read path ;do
  if [ "$path" = "" ];then
   break
  fi
  filename=`basename "$path"`
  if [ "$filename" = "$filedupe" ];then
   echo "$path"
  else
   echo "$path" >> "$filepathlist"
  fi
 done < "$filepathlistcomp"
 
 cp "$filepathlist" "$filepathlistcomp"
 echo ""

done < "$filedupeslist"


Last edited by cue; 03-22-2010 at 05:10 AM..
# 5  
Old 03-22-2010
If you want to look for file duplication by content, you can take a look at this tool of mine: finddup | Get finddup at SourceForge.net, which is in Perl.
# 6  
Old 03-22-2010
Quote:
Originally Posted by thegeek
If you want to look for file duplication by content, you can take a look at this tool of mine: finddup | Get finddup at SourceForge.net, which is in Perl.
Thanks for creating that, thegeek, but is that not a content comparison? Can I ask how it differs from fdupes? The thing with fdupes is that it does a byte-for-byte content comparison. I used it to remove duplicate files (i.e. files that are exactly the same). However, files that differed only slightly it would not list as "duplicates", and rightly so. For example, my filing system is in such a mess that I have multiple versions of the same file in different directories, where I might have added something to the newer one. The files are probably 90% the same, but they were not exact duplicates, so fdupes would not list them. I do not know of any tools (or a way) to list files that are almost the same; the only crude idea I have for two files is sketched below. Can this be done in finddup? If so, that would be great.
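Something like this, untested, where the two file arguments are placeholders:

Code:
#!/bin/sh
# Untested idea, not what finddup does: a rough line-level similarity
# measure for two text files, based on how many lines diff flags.
f1="$1"; f2="$2"
differing=`diff "$f1" "$f2" | grep -c '^[<>]'`   # lines marked added/removed/changed
total=`cat "$f1" "$f2" | wc -l`                  # combined line count of both files
echo "$f1 vs $f2: $differing differing lines out of $total total lines"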

This is why I'm comparing their filenames instead, since I assume I probably didn't rename the files.

I've now solved the efficiency problem too, if anybody is interested. The extra file operations were not worth it, and the "grep -f" line was extremely taxing. So I moved the grep inside the loop, which also avoided the extra iterations of the loop. The script before took hours to go through 25,000 files; this one takes less than 5 minutes. Forgive the unnecessary use of cat; file redirection gave me some trouble for some reason.

Code:
#!/bin/sh

# Filenames in shared memory directory
filepathlist=/dev/shm/filelist4
filepathlistcomp=/dev/shm/filelistcomp4
filedupeslist=/dev/shm/filedupeslist4

# Usage help printed
usage="$0 [-f list_file] [Directory]"

# Option processing
usefilelist=false    # default when -f is not given
while test $# -gt 0 ; do
 case "$1" in
  -f) usefilelist=true; filelist="$2"; shift 2 ;;
  --help) echo "$usage"; exit 1 ;;
  --*) break ;;
  -*) echo "$usage"; exit 1 ;;
  *)  break ;;
 esac
done

# store search directory given as command line argument
# (after option processing the remaining argument is $1)
if [ ! -z "$1" ]; then
 finddir="$1"
else
 finddir="."
fi

if $usefilelist ;then
 cp "$filelist" "$filepathlist"
else
 find "$finddir" -type f > "$filepathlist"
fi

cat "$filepathlist" | awk -F'/' '{print $NF}' | sort | uniq -d > "$filedupeslist"

while read filedupe ;do
 grep "$filedupe" "$filepathlist" > "$filepathlistcomp"
 
 while read path ;do
  if [ "$path" = "" ];then
   break
  fi
  filename=`basename "$path"`
  if [ "$filename" = "$filedupe" ];then
   echo "$path"
  fi
 done < "$filepathlistcomp"

 echo ""

done < "$filedupeslist"


Last edited by cue; 03-22-2010 at 09:11 AM..
# 7  
Old 03-22-2010
Yes, as already stated, the previous Perl solutions compare the content of the files.
Could you try this Perl code and compare its performance with your shell script?


Code:
perl -MFile::Find -e'
  $d = shift || die "$0 dir\n";
  find { 
    wanted => sub {
      -f and push @{$u{$_}}, $File::Find::name;
      }
    }, $d;
  @{$u{$_}} > 1 and printf "found %s in: \n\n%s\n\n", 
    $_, join $/, @{$u{$_}} for keys %u;    
  ' <dirname>
