Find duplicate files by file size
Post 302510187 by Corona688 on Friday 1st of April 2011 07:35:03 PM
Code:
#!/bin/sh

# Find all regular files under the path, feed them into ls -l with xargs,
# and sort them numerically on the size column (field 5).
# We can't depend on ls' own sort when using xargs, since a long enough
# file list gets split across several ls calls.
# Then we read the lines in order and check for duplicate sizes.
# find will spew errors when it can't access a file, so its stderr
# goes to /dev/null. xargs -r (GNU) skips ls when find prints nothing.
find /path/to/dir -type f -print0 2> /dev/null | xargs -r --null ls -l | sort -n -k 5,5 |
while read -r PERMS LINKS OWNER GROUP SIZE M D Y FILE
do
        # Skip symbolic links
        [ -h "$FILE" ] && continue

        # String comparison, so the empty LASTSIZE on the first pass is not an error
        if [ "$SIZE" = "$LASTSIZE" ]
        then
                echo "$FILE same size as $LASTFILE"
        else
                LASTSIZE="$SIZE"        ;       LASTFILE="$FILE"
        fi
done
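
Parsing ls -l depends on the date taking exactly three columns, which is not true in every locale. If you have GNU find, you can ask it for the size directly. This is only a rough sketch, not a drop-in replacement: it assumes GNU find (-printf is not POSIX) and file names without embedded newlines, and same_size is a name I made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes GNU find (-printf) and no newlines in file names.
# same_size is a hypothetical helper name.
same_size() {
        # Print "size<TAB>path" for each regular file, sort numerically
        # on the size, then report each file whose size matches the
        # size on the previous line.
        find "$1" -type f -printf '%s\t%p\n' 2> /dev/null | sort -n -k 1,1 |
        awk '
                {
                        size = $1 + 0
                        # Everything after the first tab is the path,
                        # so spaces and tabs in names survive.
                        path = substr($0, index($0, "\t") + 1)
                }
                NR > 1 && size == last { print path " same size as " lastfile; next }
                { last = size; lastfile = path }
        '
}
```

Call it as same_size /path/to/dir. The output has the same "X same size as Y" shape as the loop above.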

---------- Post updated at 05:35 PM ---------- Previous update was at 04:43 PM ----------

Here's an improved version that checks checksums. It can churn through about 4 gigs of random files in 7 seconds, uncached, on my not-so-great system.

The trick is that it only compares checksums between files of the same size, and first does a quick checksum on their first 512 bytes to filter out files that are obviously different. Checking the first 16K or the first 256K instead might work better.

Code:
#!/bin/bash

TMP=$(mktemp)

# Given a list of same-size files, one per line in "$TMP",
# print which ones have the same checksums.
function checkgroup
{
        local FILE
        local LASTSUM
        local LASTFILE

        [ -s "$TMP" ] || return

        # Check first 512 bytes of files.
        # If that differs, who cares about the rest?
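        while read -r FILE
        do
                # md5sum prints "<hash>  -" for stdin; keep the hash, drop the "-".
                # dd's transfer statistics go to stderr, so silence them.
                SUM=$(dd count=1 < "$FILE" 2> /dev/null | md5sum)
                read -r SUM G <<<"$SUM"
                echo "$SUM $FILE"
        done < "$TMP" | sort | while read -r SUM FILE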
        do
                if [ "$LASTSUM" != "$SUM" ]
                then
                        LASTSUM="$SUM"
                        LASTFILE="$FILE"
                        UNPRINTED=1
                        continue
                fi

                [ -z "$UNPRINTED" ] || echo "$LASTFILE"
                UNPRINTED=""
                echo "$FILE"
        done | xargs -r -d '\n' md5sum | sort |
        while read SUM FILE
        do
                if [ "$SUM" != "$LASTSUM" ]
                then
                        LASTSUM="$SUM"
                        LASTFILE="$FILE"
                else
                        echo "$FILE == $LASTFILE"
                fi
        done
}

# Find all files, feed them through ls, sort them numerically on size.
# Can't depend on ls' own sorting when there are too many files,
# since ls could be run more than once.
# Once we have the output, loop through looking for files of
# the same size and make a list to feed into checkgroup.
find ~/public_html -type f -print0 | xargs -r --null ls -l | sort -n -k 5,5 |
while read -r PERMS LINKS OWNER GROUP SIZE M D Y FILE
do
        # Skip symbolic links
        [ -h "$FILE" ] && continue

        # String comparison: LASTSIZE is empty on the first pass
        if [ "$SIZE" = "$LASTSIZE" ]
        then
                [ -s "$TMP" ] || echo "$LASTFILE" > "$TMP"
                echo "$FILE" >> "$TMP"
        else
                checkgroup "$LASTSIZE"
                LASTSIZE="$SIZE"        ;       LASTFILE="$FILE"
                :>"$TMP"
        fi
done

checkgroup

rm -f "$TMP"
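
If you want to try a bigger prefix than 512 bytes for the quick check, the partial checksum could be pulled out into a helper something like this. It's a sketch under assumptions: quicksum is a hypothetical name, not part of the script above, and it assumes head -c and md5sum are available (GNU coreutils or similar).

```shell
#!/bin/sh
# Sketch only: quicksum is a hypothetical helper.
# Usage: quicksum FILE [BYTES]
# Prints the MD5 of the first BYTES bytes of FILE (default 512).
quicksum() {
        # md5sum prints "<hash>  -" for stdin; cut keeps just the hash.
        head -c "${2:-512}" < "$1" | md5sum | cut -d ' ' -f 1
}
```

For example, quicksum "$FILE" 16384 would checksum the first 16K. A bigger prefix means more reading per file but fewer full checksums for files that only share a common header.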

 
