List duplicate files based on Name and size


 
# 1  
Old 01-01-2014

Hello,

I have a huge directory (with millions of files) and need to find duplicates based on BOTH file name and file size.

I know
Code:
fdupes

but it calculates MD5 checksums, which is very time-consuming and takes forever since I have millions of files.

Can anyone please suggest a script or tool to find duplicates based just on file name and file size? It would be nice to be able to filter by a minimum file size as well.

Thanks

# 2  
Old 01-01-2014
Proceed in two steps:
1. Log the size and the filename in a tempfile (removing the path from the filename).
2. Sort it and pick out the duplicates.
Code:
find /huge_dir -type f -printf "%s %p\n" | sed 's:/.*/::' >/tmp/mytmp
sort /tmp/mytmp | uniq -d

Note that for processing that many objects, it would be advisable to use a database instead.

Or, without the sed, letting find print just the filename with %f:

Code:
find /huge_dir -type f -printf "%s %f\n" >/tmp/mytmp
sort /tmp/mytmp | uniq -d


Last edited by ctsgnb; 01-01-2014 at 05:32 PM.. Reason: Remove sed clause (Thx Rudi)
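If every member of each duplicate group should be listed rather than one representative per group, GNU uniq's -D ("print all duplicate lines") can replace -d in the pipeline above; a minimal sketch along the same lines:

Code:
find /huge_dir -type f -printf "%s %f\n" | sort | uniq -D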
# 3  
Old 01-01-2014
I guess duplicate filenames means files in different directories? Do you need the full path of the dupes? Then - if your versions of find and uniq allow for it - use printf "%h %f %s\n" and uniq -d --skip-fields=1.
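Spelled out, that suggestion could look something like this; just a sketch, assuming GNU find/uniq and paths without embedded blanks (otherwise the whitespace-based field skipping breaks):

Code:
find /huge_dir -type f -printf "%h %f %s\n" | sort -k2 | uniq -d --skip-fields=1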
# 4  
Old 01-01-2014
Ha! Yup! ... I missed the %h and %f ...
# 5  
Old 01-02-2014
I tried the below, but it doesn't show correct results.

Code:
find . -type f -printf "%s %f\n" |sort |uniq -d -f 2

Here, I was just trying to get a list of duplicates based on file size only.

Code:
# ls -l
total 104696
-rwx------+ 1 Admin None 24867520 Jan  1 21:08 Anand-My_Career_1-SDVL.7z
-rwx------+ 1 Admin None 28732186 Jan  1 21:09 Anand-My_Career_2-SDVL.7z
-rwx------+ 1 Admin None 24867520 Jan  1 21:08 Anand-My-Career-1-SDVL.7z
-rwx------+ 1 Admin None 28732186 Jan  1 21:08 Anand-My-Career-2-SDVL.7z

# find . -type f -printf "%s %f\n" |sort |uniq -d

# find . -type f -printf "%s %f\n" |sort |uniq -d -f 2
24867520 Anand-My_Career_1-SDVL.7z

#

It is supposed to display 2 duplicates but shows only one.
What mistake am I making here?
# 6  
Old 01-02-2014
You could give something like this a try:

Code:
find /huge_dir -type f -printf "%s %f %h\n" >/tmp/mytmp

Then, to display those having the same size:

Code:
sort /tmp/mytmp | awk '{z=y;y=$1;w=x;x=$0;v=u;u=$2}(z==y){print w RS x}'

Then, to display those having the same name:

Code:
sort -k 2,2 /tmp/mytmp | awk '{z=y;y=$1;w=x;x=$0;v=u;u=$2}(v==u){print w RS x}'
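Both one-liners print pairs of adjacent lines that agree on the size or on the name, respectively. To require BOTH the same size and the same name, as asked in the first post, the check can be done in a single pass over the same tempfile; a sketch, assuming the "%s %f %h" format above and filenames without embedded blanks:

Code:
# identical "size filename" prefixes end up adjacent after the sort;
# awk then prints every member of any group that has more than one entry
sort /tmp/mytmp | awk '
    { key = $1 FS $2 }
    key == prev { if (!shown) print prevline; print; shown = 1; next }
    { prev = key; prevline = $0; shown = 0 }'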

# 7  
Old 01-02-2014
Try
Code:
find . -printf "%h\t%f\t%s\n" | sort -k2 | uniq -Df1

if your tools allow for those options...
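If the minimum-size filter mentioned in the first post is wanted as well, it can go into the find itself; a sketch assuming GNU find, where +10M (roughly "larger than 10 MiB") is only a placeholder threshold:

Code:
# only files above the size threshold reach the sort/uniq stage
find . -type f -size +10M -printf "%h\t%f\t%s\n" | sort -k2 | uniq -D -f1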