Find duplicate files by file size


 
# 1  
Old 04-01-2011
Find duplicate files by file size

Hi!

I want to find duplicate files (criteria: file size) in my download folder.

I tried it like this:
Code:
find /Users/frodo/Downloads \! -type d -exec du {} \; | sort > /Users/frodo/Desktop/duplicates_1.txt;
cut -f 1 /Users/frodo/Desktop/duplicates_1.txt | uniq -d | grep -hif - /Users/frodo/Desktop/duplicates_1.txt > /Users/frodo/Desktop/duplicates_2.txt;

But this doesn't work. Can anybody tell me what's wrong or provide another/better solution? Thanks!
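
For reference, a tightened sketch of the same size-based idea (untested; it matches each size as a whole field instead of grepping for it as a substring, so a size of 12 no longer also matches 512 or 120):
Code:
find /Users/frodo/Downloads \! -type d -exec du {} + | sort -n > /Users/frodo/Desktop/duplicates_1.txt
# keep only the lines whose size (field 1) occurs more than once
awk 'NR==FNR {count[$1]++; next} count[$1] > 1' \
    /Users/frodo/Desktop/duplicates_1.txt /Users/frodo/Desktop/duplicates_1.txt \
    > /Users/frodo/Desktop/duplicates_2.txt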

Dirk
# 2  
Old 04-01-2011
How about cksum? That is far easier to use. It gives a file size, or you can use the checksum itself, either way.
This code assumes your cksum implementation prints:
Code:
cksum filename
checksum  filesize filename

Code:
cksum /path/to/files/* |
  awk ' { if( $2 in arr) 
            {print "duplicates ", $3, arr[$2], "duplicate filesize = ", $2} 
              else 
            {arr[$2]=$3} }'
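
If you would rather key on the checksum itself, so that two different files which merely happen to share a size are not flagged, a variant along these lines should work (untested sketch, same assumptions about the cksum output and about file names without spaces):
Code:
cksum /path/to/files/* |
  awk ' { key = $1 "-" $2                  # checksum plus size as the key
          if( key in arr)
            {print "duplicates ", $3, arr[key], "checksum/size = ", key}
              else
            {arr[key]=$3} }'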

# 3  
Old 04-01-2011
Hi!
Quote:
Originally Posted by jim mcnamara
how about cksum
Well, cksum is too slow; there can be files larger than 2 GB. And I also want to scan all subdirectories. The sum of the file sizes of all duplicated files is not important.

Dirk
# 4  
Old 04-01-2011
You probably don't want to use du for this either. du returns the space allocated for a file, not the size of a file.
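
To get the actual size in bytes instead, something like this should work (untested sketch; note that the stat format flags differ between GNU and BSD/OS X):
Code:
# GNU coreutils (Linux): size in bytes, then the file name
find /Users/frodo/Downloads -type f -exec stat -c '%s %n' {} +

# BSD / Mac OS X stat uses a different format syntax
find /Users/frodo/Downloads -type f -exec stat -f '%z %N' {} +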

Regards,
Alister
# 5  
Old 04-01-2011
Code:
#!/bin/sh

# We find all files in path, feed them into ls -l with xargs,
# and sort them on the size column.
# We can't depend on ls' own sorting when using xargs, since with
# enough files the list gets split across several ls invocations.
# Then we read the lines in order, and check for duplicate sizes.
find /path/to/dir -type f -print0 | xargs --null ls -l | sort -k 5,6 |
while read PERMS LINKS USER GROUP SIZE M D Y FILE
do
        # Skip symbolic links
        [ -h "$FILE" ] && continue

        if [ "$SIZE" -eq "$LASTSIZE" ]
        then
                echo "$FILE same size as $LASTFILE"
        else
                LASTSIZE="$SIZE"        ;       LASTFILE="$FILE"
        fi
# Stray errors (e.g. the very first size comparison, unreadable files) go to /dev/null.
done 2> /dev/null

---------- Post updated at 05:35 PM ---------- Previous update was at 04:43 PM ----------

Here's an improved version that checks checksums. It can churn through about 4 gigs of random files in 7 seconds, uncached, on my not-so-great system.

The trick is that it only compares checksums among files of the same size, and it first does a quick checksum of just their first 512 bytes to filter out files that are obviously different. Maybe the first 16K or the first 256K would be better.
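
If a larger prefilter turns out to help, bumping the block size in the dd call below should be all that's needed (assuming your dd accepts the k size suffix), e.g.:
Code:
# hash the first 16K instead of the first 512 bytes
SUM=$(dd bs=16k count=1 < "$FILE" 2> /dev/null | md5sum)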

Code:
#!/bin/bash

TMP=$(mktemp)

# Given a list of files of the same size, "$TMP",
# it will check which ones have the same checksums.
function checkgroup
{
        local FILE
        local LASTSUM
        local LASTFILE

        [ -s "$TMP" ] || return

        # Check first 512 bytes of files.
        # If that differs, who cares about the rest?
        while read FILE
        do
                SUM=$(dd count=1 < "$FILE" 2> /dev/null | md5sum)
                read SUM G <<<"$SUM"    # md5sum prints "<sum>  -"; keep only the sum
                echo "$SUM $FILE"
        done < "$TMP" | sort | while read SUM FILE
        do
                if [ "$LASTSUM" != "$SUM" ]
                then
                        LASTSUM="$SUM"
                        LASTFILE="$FILE"
                        UNPRINTED=1
                        continue
                fi

                [ -z "$UNPRINTED" ] || echo "$LASTFILE"
                UNPRINTED=""
                echo "$FILE"
        done | xargs -d '\n' md5sum | sort |
        while read SUM FILE
        do
                if [ "$SUM" != "$LASTSUM" ]
                then
                        LASTSUM="$SUM"
                        LASTFILE="$FILE"
                else
                        echo "$FILE == $LASTFILE"
                fi
        done
}

# Find all files, feed them through ls, sort them on size.
# Can't depend on ls' own sorting when there are too many files,
# since ls could end up being run more than once.
# Once we have the output, loop through looking for files
# the same size and make a list to feed into checkgroup.
LASTSIZE=-1
find ~/public_html -type f -print0 | xargs --null ls -l | sort -k 5,6 |
while read PERMS LINKS USER GROUP SIZE M D Y FILE
do
        # Skip symbolic links
        [ -h "$FILE" ] && continue

        if [ "$SIZE" -eq "$LASTSIZE" ]
        then
                [ -s "$TMP" ] || echo "$LASTFILE" > "$TMP"
                echo "$FILE" >> "$TMP"
        else
                checkgroup "$LASTSIZE"
                LASTSIZE="$SIZE"        ;       LASTFILE="$FILE"
                :>"$TMP"
        fi
done

checkgroup

rm -f "$TMP"

# 6  
Old 04-01-2011
Hi Dirk Einecke,

Another option:

Byte precision is used to get a more exact size comparison.

Code:
#!/bin/bash
find . -type f -print0 | (
    while IFS= read -r -d "" FILE ; do FILES=("${FILES[@]}" "$FILE") ; done

    ls -la "${FILES[@]}" | awk '{$1=$2=$3=$4=$6=$7="";print}' > /Users/frodo/Desktop/Listed_Files.txt
    ls -la "${FILES[@]}" | awk '{print $5}' | sort -k1,1nr | uniq -d > /Users/frodo/Desktop/Repeated_Sizes.txt

)

awk 'BEGIN{print "Size (bytes)  Files"}FNR==NR{a[$1];next} $1 in a' \
    /Users/frodo/Desktop/Repeated_Sizes.txt /Users/frodo/Desktop/Listed_Files.txt \
    > /Users/frodo/Desktop/Duplicates_Files.txt

rm /Users/frodo/Desktop/Listed_Files.txt
rm /Users/frodo/Desktop/Repeated_Sizes.txt
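
A rough temp-file-free variant of the same size-grouping idea, in case it is useful (untested sketch; it sorts on the size column and prints each group of equally sized files as whole ls lines):
Code:
find . -type f -exec ls -la {} + | sort -k5,5n |
awk '{ if ($5 == prev_size) { if (!printed) print prev_line; print; printed = 1 } else printed = 0
       prev_size = $5; prev_line = $0 }'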


Hope it helps

Regards
# 7  
Old 04-02-2011
Why don't you try this?
Go to your Downloads dir and run this.
Code:
ls -l | awk '$1 !~ /^d/ { if (size[$5] != "") print; size[$5] = $NF }'
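
Note that this only looks at the files directly in the Downloads directory; to cover subdirectories as well, a rough variant of the same idea (untested sketch) could be:
Code:
find . -type f -exec ls -l {} + |
awk '{ if (size[$5] != "") print; size[$5] = $NF }'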
