Remove Duplicate Filenames in 2 very large directories
# 1  
Old 09-24-2009

Hello Gurus,

O/S RHEL4
I have a requirement to compare two Linux directories for duplicate filenames and remove the duplicates. These directories are close to 2 TB each. I have tried running:

Code:
Prompt>diff -r data1/ data2/

I have tried this as well:

Code:
jason@jason-desktop:~$ cat script.sh
#!/bin/bash

for files in $(diff -r data1/ data2/ | awk -F":" '{print $2}'); do
    echo "$files"
done
jason@jason-desktop:~$



I wanted to capture the output of the above command in a variable and use it for deletion. This approach does not work, and the machine's load goes too high for production. I have also considered an rsync with the delete flag, but I am unsure whether it will compare both directories correctly.

Can someone please point me in the right direction as to which commands or approaches will best accomplish this task?

I have also tried to google this on unix.com as well as the web.

Your support and assistance is greatly appreciated.

Jaysunn

Last edited by jaysunn; 09-24-2009 at 11:56 AM. Reason: Added O/S
# 2  
Old 09-24-2009
Do your solutions fail because they produce the wrong result, or because they put too much load on the production system?

I would test any candidate solution on a small pair of hand-built test directories first, so you can see whether it is actually doing what you expect.

If it is a load issue, try using the 'nice' command to lower the priority of your process.
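For example, a low-priority wrapper might look like the sketch below; `ionice` (from util-linux, Linux-only) can demote disk I/O as well, which usually matters more than CPU for a directory comparison:

```shell
#!/bin/bash
# Sketch: run a long command at the lowest CPU priority so it yields to
# production load.
run_gently() {
    nice -n 19 "$@"          # 19 = lowest CPU priority
    # or, where ionice is available:
    # ionice -c 3 nice -n 19 "$@"   # idle I/O class as well
}
```

Usage would then be something like `run_gently diff -r data1/ data2/ > /tmp/diff.out`.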

rsync would probably work as well, but I would test it thoroughly on sample data to make sure it does what you want.
# 3  
Old 09-24-2009
Your example of "diff -r" actually compares the contents of each file rather than the filename.

How many files are there in each directory tree?

Please expand and explain what constitutes a "duplicate filename". Is it a file in the same relative position in the tree as a file with the same name, or something more complex?

Please explain when a "duplicate filename" is found, which one (if any) you prefer to keep.
# 4  
Old 09-24-2009
Thanks for your reply.

Quote:
How many files are there in each directory tree?
I have never tried a wc -l because it takes so long. I would estimate around 2 million files in each partition, none larger than 2 MB.

The directory structure is 2 separate partitions that reside on a serial attached storage system.

The files are all *.mp3 or *.flv files. We are running out of space on this system and I have confirmed that there are duplicate files e.g.

Code:
/data1/586950.mp3
/data2/586950.mp3

Every filename is seven digits followed by either the .mp3 or .flv extension. I would like a script to look at each partition and, if it finds a copy of itself, remove it from /data1 partition freeing up space on /data2.

I hope I explained my scenario well enough.

Thanks Again,

Jaysunn
# 5  
Old 09-24-2009
Quote:
remove it from /data1 partition freeing up space on /data2
The above sentence does not make sense to me.

Also, is there a directory hierarchy or is there just /data1 and /data2 with no subdirectories?
# 6  
Old 09-24-2009
Wow,
I realized from your questions that I really did not provide much detail. Thanks for attempting to decipher.

Once the script identifies that a duplicate file resides on the /data1 partition, I would like it to run rm to remove the file from /data2, freeing up space on that partition.


Quote:
Also, is there a directory hierarchy or is there just /data1 and /data2 with no subdirectories?
Yes, there is a hierarchy involved. Here is a snippet of it. Each partition has 4-to-6-letter subdirectories that are mirrored on the other partition, and files within that structure could be the same.

Code:
/data1/wcnn/*.mp3
/data1/wxxr/*.mp3
/data1/trrn/*.mp3

/data2/wcnn/*.mp3
/data2/wxxr/*.mp3
/data2/trrn/*.mp3

So the same mp3 file may exist under the same station-abbreviation directory on /data1 and /data2. I only need that file on one partition.

If I can provide any output commands please let me know.

Jaysunn
# 7  
Old 10-20-2009
I checked back on your posts and found this one.
Suggestion: use fdupes(1) to find the duplicate files.
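If fdupes is not available, the duplicate names can also be found with standard tools. A minimal sketch, assuming the layout described in post #6 (partition roots passed as arguments; it compares names only, so spot-check a few pairs with cmp before deleting, and swap echo for rm only once the dry-run output looks right):

```shell
#!/bin/bash
# Sketch: list each file's path relative to its partition root, then keep only
# the names common to both trees. comm -12 prints lines present in both inputs.
find_dupes() {
    local d1=$1 d2=$2
    comm -12 \
        <(cd "$d1" && find . -type f \( -name '*.mp3' -o -name '*.flv' \) | sort) \
        <(cd "$d2" && find . -type f \( -name '*.mp3' -o -name '*.flv' \) | sort)
}

# Dry run: show what would be removed from the second partition.
if [ -d "${1:-}" ] && [ -d "${2:-}" ]; then
    find_dupes "$1" "$2" | while IFS= read -r rel; do
        echo "would remove: $2/${rel#./}"    # replace echo with rm -- to delete
    done
fi
```

Running the sorts once each and letting comm do the matching avoids the per-file stat storm of a naive loop, which should keep the load manageable even with ~2 million files per side.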