Search compare and determine duplicate files


 
# 8  
Old 04-04-2011
@CHUBLER_XL:

Hello, the script above correctly finds duplicates according to their byte size, right? I would also like to add a condition that compares filenames residing in different subdirectories.

output:
dir1/dir2/dir3/linux-ebook.pdf and /dir1/linux-ebook.pdf is identical in filename
or
dir1/dir2/C++_programming.pdf and /dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/C++_programming.pdf is identical filename

When I try the script and change the byte size of linux-ebook.pdf or other files, the comparison fails.
# 9  
Old 04-04-2011
This should do it

Code:
if [ $# -ne 1 ] || [ ! -d $1 ]
then
    echo "usage: $0 <directory>"
    exit 1
fi
SHOWSAME=1
find $1 -type f -ls | awk '
  $8 > 0 {
     gsub("\\\\ ", SUBSEP); F=$12; gsub(SUBSEP, " ", F); # Deal with space(s) in filename
     if($8 in sizes) {
         sizes[$8]=sizes[$8] SUBSEP F;
         dup[$8]++
     } else sizes[$8]=F
     bn=F;
     sub(".*/", "", bn);
     if (bn in basenames) {
         basenames[bn]=basenames[bn] SUBSEP F;
         dupname[bn]++
     } else basenames[bn]=F;
  }
  END {for(i in dup) print sizes[i]; print "-NOSAME-" ; for(i in dupname) print basenames[i]; }' | while read
do
  # SUBSEP (34 Octal) between each filename that has same size
  # Change IFS to Load Array F with a group of 2 (or more) files
  OIFS="$IFS"
  IFS=$(printf \\034)
  F=( $REPLY )
  IFS="$OIFS"
  [ "${F[0]}" = "-NOSAME-" -a ${#F[@]} -eq 1 ] && SHOWSAME=0
  i=0
  while [ $i -lt ${#F[@]} ]
  do
     let j=i+1
     while [ $j -lt ${#F[@]} ]
     do
        if cmp -s "${F[i]}" "${F[j]}"
        then
           [ $SHOWSAME -eq 1 ] && echo "\"${F[i]}\"" and "\"${F[j]}\"" are identical
        else
           [ "$(basename "${F[i]}")" = "$(basename "${F[j]}")" ] &&
               echo "\"${F[i]}\"" and "\"${F[j]}\"" have same filename but are different
        fi
        let j=j+1
     done
     let i=i+1
  done
done
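
The name-matching part of the script can also be tried on its own. This is only a minimal sketch, not the script above: it assumes the paths arrive one per line on standard input and contain no newlines or tabs, and the function name `same_names` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of the script above): read paths, one per
# line, and print every group of paths that share a basename, tab-separated,
# one group per output line.
same_names() {
    awk '
    {
        bn = $0
        sub(".*/", "", bn)                  # strip everything up to the last slash
        if (bn in names) {
            names[bn] = names[bn] "\t" $0   # append to the existing group
            dup[bn]++                       # basename seen more than once
        } else
            names[bn] = $0
    }
    END { for (bn in dup) print names[bn] }'
}

# Sample paths based on the request in post #8; notes.txt is an extra
# made-up entry with a unique name, so it is not reported.
printf '%s\n' \
    dir1/dir2/dir3/linux-ebook.pdf \
    dir1/linux-ebook.pdf \
    dir1/notes.txt | same_names
```

With the sample input this prints the two linux-ebook.pdf paths on one line. Each group could then be passed to `cmp -s` pair by pair, as the loop in the script does.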

# 10  
Old 04-05-2011
Quote:
Originally Posted by Chubler_XL
This should do it


Hi

I tested the new script and it only outputs this line, many times over:

"" and "" have same filename but are different
"" and "" have same filename but are different
"" and "" have same filename but are different
"" and "" have same filename but are different
and so on
...........
...........
...........

Thanks for the effort.
# 11  
Old 04-05-2011
Sorry, I tested it on a system that has a space in the group name, so my field numbers were off.

The new version takes the size from the field four places before the last one, `$(NF-4)`, so it should be much more robust. During testing I also found that some systems don't have SUBSEP as \034, so it is safer to use the actual octal value instead of SUBSEP for the output strings:

Code:
#!/bin/bash
if [ $# -ne 1 ] || [ ! -d "$1" ]
then
    echo "usage: $0 <directory>"
    exit 1
fi
SHOWSAME=1
find "$1" -type f -ls | awk '
  {
     gsub("\\\\ ", SUBSEP); F=$NF; gsub(SUBSEP, " ", F); # Deal with space(s) in filename
     $1=$1
     SZ=$(NF-4)
     if(SZ > 0 && SZ in sizes) {
         sizes[SZ]=sizes[SZ] "\034" F;
         dup[SZ]++
     } else sizes[SZ]=F
     bn=F;
     sub(".*/", "", bn);
     if (bn in basenames) {
         basenames[bn]=basenames[bn] "\034" F;
         dupname[bn]++
     } else basenames[bn]=F;
  }
  END {for(i in dup) print sizes[i]; print "-NOSAME-" ; for(i in dupname) print basenames[i]; }' | while read -r
do
  # Octal 034 separates the filenames in each group (same size, or same name after -NOSAME-)
  # Change IFS to load array F with a group of 2 (or more) files
  OIFS="$IFS"
  IFS=$'\034'
  F=( $REPLY )
  IFS="$OIFS"
  [ "${F[0]}" = "-NOSAME-" -a ${#F[@]} -eq 1 ] && SHOWSAME=0
  i=0
  while [ $i -lt ${#F[@]} ]
  do
     let j=i+1
     while [ $j -lt ${#F[@]} ]
     do
        if cmp -s "${F[i]}" "${F[j]}"
        then
           [ $SHOWSAME -eq 1 ] && echo "\"${F[i]}\"" and "\"${F[j]}\"" are identical
        else
           [ "$(basename "${F[i]}")" = "$(basename "${F[j]}")" ] &&
               echo "\"${F[i]}\"" and "\"${F[j]}\"" have same filename but are different
        fi
        let j=j+1
     done
     let i=i+1
  done
done
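
To see why counting fields from the end helps, here is a small sketch. The two sample lines are made up rather than copied from a real system, but they follow the usual `find -ls` column layout; the second one has a group name containing a space. The function name `size_fields` is hypothetical, and the sketch assumes no spaces in the filename itself.

```shell
#!/bin/bash
# Hypothetical helper: for each `find -ls`-style line on stdin, print the
# fixed field $8 next to the field counted from the end, $(NF-4).
size_fields() {
    awk '{ print $8, $(NF-4) }'
}

# Made-up sample lines: the same file listed with a plain group name and
# with a group name that contains a space.
printf '%s\n' \
    '131 8 -rw-r--r-- 1 alice users 4096 Apr 4 2011 ./dir1/linux-ebook.pdf' \
    '131 8 -rw-r--r-- 1 alice Domain Users 4096 Apr 4 2011 ./dir1/linux-ebook.pdf' |
    size_fields
```

The first line prints `Apr 4096` and the second prints `4096 4096`: `$8` is only the size when the group name has a space in it, while `$(NF-4)` is the size in both cases.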


Last edited by Chubler_XL; 04-05-2011 at 07:59 PM..
# 12  
Old 04-07-2011
@Chubler_XL


Thanks for the script, it's working now for both conditions.

@Danmero

Thanks also. I find the fdupes package useful for finding and deleting duplicate files from the command line.