Find duplicate files by file size


 
# 1  
Old 04-01-2011
Find duplicate files by file size

Hi!

I want to find duplicate files (criteria: file size) in my download folder.

I tried it like this:
Code:
find /Users/frodo/Downloads \! -type d -exec du {} \; | sort > /Users/frodo/Desktop/duplicates_1.txt;
cut -f 1 /Users/frodo/Desktop/duplicates_1.txt | uniq -d | grep -hif - /Users/frodo/Desktop/duplicates_1.txt > /Users/frodo/Desktop/duplicates_2.txt;

But this doesn't work. Can anybody tell me what's wrong or provide another/better solution? Thanks!
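
For reference, a tightened sketch of the same size-based idea (untested; it matches each size as a whole field instead of grepping for it as a substring, so a size of 12 no longer also matches 512 or 120):
Code:
find /Users/frodo/Downloads \! -type d -exec du {} + | sort -n > /Users/frodo/Desktop/duplicates_1.txt
# keep only the lines whose size (field 1) occurs more than once
awk 'NR==FNR {count[$1]++; next} count[$1] > 1' \
    /Users/frodo/Desktop/duplicates_1.txt /Users/frodo/Desktop/duplicates_1.txt \
    > /Users/frodo/Desktop/duplicates_2.txt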

Dirk
# 2  
Old 04-01-2011
How about cksum? That is far easier to use. It gives a file size, or you can use the checksum itself, either way.
This code assumes your cksum implementation prints:
Code:
cksum filename
checksum  filesize filename

Code:
cksum /path/to/files/* |
  awk ' { if( $2 in arr) 
            {print "duplicates ", $3, arr[$2], "duplicate filesize = ", $2} 
              else 
            {arr[$2]=$3} }'
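
If you would rather key on the checksum itself, so that two different files which merely happen to share a size are not flagged, a variant along these lines should work (untested sketch, same assumptions about the cksum output and about file names without spaces):
Code:
cksum /path/to/files/* |
  awk ' { key = $1 "-" $2                  # checksum plus size as the key
          if( key in arr)
            {print "duplicates ", $3, arr[key], "checksum/size = ", key}
              else
            {arr[key]=$3} }'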

# 3  
Old 04-01-2011
Hi!
Quote:
Originally Posted by jim mcnamara
how about cksum
Well, cksum is too slow; there can be files larger than 2 GB. And I also want to scan all subdirectories. The sum of the file sizes of all duplicated files is not important.

Dirk
# 4  
Old 04-01-2011
You probably don't want to use du for this either. du returns the space allocated for a file, not the size of a file.
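
To get the actual size in bytes instead, something like this should work (untested sketch; note that the stat format flags differ between GNU and BSD/OS X):
Code:
# GNU coreutils (Linux): size in bytes, then the file name
find /Users/frodo/Downloads -type f -exec stat -c '%s %n' {} +

# BSD / Mac OS X stat uses a different format syntax
find /Users/frodo/Downloads -type f -exec stat -f '%z %N' {} +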

Regards,
Alister
# 5  
Old 04-01-2011
Code:
#!/bin/sh

# We find all files in path, feed them into ls -l with xargs,
# and sort them on the size column.
# We can't depend on ls' own sorting when using xargs, since with
# enough files the list gets split across several ls invocations.
# Then we read the lines in order, and check for duplicate sizes.
find /path/to/dir -type f -print0 | xargs --null ls -l | sort -k 5,6 |
while read PERMS LINKS USER GROUP SIZE M D Y FILE
do
        # Skip symbolic links
        [ -h "$FILE" ] && continue

        if [ "$SIZE" -eq "$LASTSIZE" ]
        then
                echo "$FILE same size as $LASTFILE"
        else
                LASTSIZE="$SIZE"        ;       LASTFILE="$FILE"
        fi
# Stray errors (e.g. the very first size comparison, unreadable files) go to /dev/null.
done 2> /dev/null

---------- Post updated at 05:35 PM ---------- Previous update was at 04:43 PM ----------

Here's an improved version that checks checksums. It can churn through about 4 gigs of random files in 7 seconds, uncached, on my not-so-great system.

The trick is that it only compares checksums among files of the same size, and it first does a quick checksum of just their first 512 bytes to filter out files that are obviously different. Maybe the first 16K or the first 256K would be better.
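
If a larger prefilter turns out to help, bumping the block size in the dd call below should be all that's needed (assuming your dd accepts the k size suffix), e.g.:
Code:
# hash the first 16K instead of the first 512 bytes
SUM=$(dd bs=16k count=1 < "$FILE" 2> /dev/null | md5sum)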

Code:
#!/bin/bash

TMP=$(mktemp)

# Given a list of files of the same size, "$TMP",
# it will check which ones have the same checksums.
function checkgroup
{
        local FILE
        local LASTSUM
        local LASTFILE

        [ -s "$TMP" ] || return

        # Check first 512 bytes of files.
        # If that differs, who cares about the rest?
        while read FILE
        do
                SUM=$(dd count=1 < "$FILE" 2> /dev/null | md5sum)
                read SUM G <<<"$SUM"    # md5sum prints "<sum>  -"; keep only the sum
                echo "$SUM $FILE"
        done < "$TMP" | sort | while read SUM FILE
        do
                if [ "$LASTSUM" != "$SUM" ]
                then
                        LASTSUM="$SUM"
                        LASTFILE="$FILE"
                        UNPRINTED=1
                        continue
                fi

                [ -z "$UNPRINTED" ] || echo "$LASTFILE"
                UNPRINTED=""
                echo "$FILE"
        done | xargs -d '\n' md5sum | sort |
        while read SUM FILE
        do
                if [ "$SUM" != "$LASTSUM" ]
                then
                        LASTSUM="$SUM"
                        LASTFILE="$FILE"
                else
                        echo "$FILE == $LASTFILE"
                fi
        done
}

# Find all files, feed them through ls, sort them on size.
# Can't depend on ls' own sorting when there are too many files,
# since ls could end up being run more than once.
# Once we have the output, loop through looking for files
# the same size and make a list to feed into checkgroup.
LASTSIZE=-1
find ~/public_html -type f -print0 | xargs --null ls -l | sort -k 5,6 |
while read PERMS LINKS USER GROUP SIZE M D Y FILE
do
        # Skip symbolic links
        [ -h "$FILE" ] && continue

        if [ "$SIZE" -eq "$LASTSIZE" ]
        then
                [ -s "$TMP" ] || echo "$LASTFILE" > "$TMP"
                echo "$FILE" >> "$TMP"
        else
                checkgroup "$LASTSIZE"
                LASTSIZE="$SIZE"        ;       LASTFILE="$FILE"
                :>"$TMP"
        fi
done

checkgroup

rm -f "$TMP"

# 6  
Old 04-01-2011
Hi Dirk Einecke,

Another option:

Byte precision is used to get a more exact size comparison.

Code:
#!/bin/bash
find . -type f -print0 | (
    while IFS= read -r -d "" FILE ; do FILES=("${FILES[@]}" "$FILE") ; done

    ls -la "${FILES[@]}" | awk '{$1=$2=$3=$4=$6=$7="";print}' > /Users/frodo/Desktop/Listed_Files.txt
    ls -la "${FILES[@]}" | awk '{print $5}' | sort -k1,1nr | uniq -d > /Users/frodo/Desktop/Repeated_Sizes.txt

)

awk 'BEGIN{print "Size (bytes)  Files"}FNR==NR{a[$1];next} $1 in a' \
    /Users/frodo/Desktop/Repeated_Sizes.txt /Users/frodo/Desktop/Listed_Files.txt \
    > /Users/frodo/Desktop/Duplicates_Files.txt

rm /Users/frodo/Desktop/Listed_Files.txt
rm /Users/frodo/Desktop/Repeated_Sizes.txt
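
A rough temp-file-free variant of the same size-grouping idea, in case it is useful (untested sketch; it sorts on the size column and prints each group of equally sized files as whole ls lines):
Code:
find . -type f -exec ls -la {} + | sort -k5,5n |
awk '{ if ($5 == prev_size) { if (!printed) print prev_line; print; printed = 1 } else printed = 0
       prev_size = $5; prev_line = $0 }'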


Hope it helps

Regards
# 7  
Old 04-02-2011
Why don't you try this?
Go to your Downloads dir and run this.
Code:
ls -l | awk '$1 !~ /^d/ { if (size[$5] != "") print; size[$5] = $NF }'
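
Note that this only looks at the files directly in the Downloads directory; to cover subdirectories as well, a rough variant of the same idea (untested sketch) could be:
Code:
find . -type f -exec ls -l {} + |
awk '{ if (size[$5] != "") print; size[$5] = $NF }'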
