Delete duplicate files from one of two directory structures


 
#1, 10-15-2009

Hello everyone,

I have been struggling to clean up a backup mess I created when I manually duplicated a directory structure and then kept working in both copies.
The two trees have since diverged significantly and now contain on the order of 15,000 files, most of which are duplicates.
I am trying to merge those dirs and have had a look at FSlint, Meld, diff, and fdupes.
While all of those are good tools, none of them does quite what I need, so I am looking for a way to reduce the manual work to a minimum by deleting duplicates from the second directory structure. I will then sort/merge the remaining files by hand.
The closest I have come is fslint's findup (Linux.com :: Tidy up your filesystem with FSlint), which returns a list of duplicate files separated by empty lines.
However, since a group of duplicates may lie entirely within a single one of the two trees, I cannot simply delete every listed file under the second tree; I only want to delete a file from it when a copy also exists under the first tree.
I'm afraid I can't make myself entirely clear, so here's an example:

dir1/path/dup1
dir2/somepath/dup1 <-- delete
dir2/path/to/dup1 <-- delete

One approach might be to first remove the duplicates within dir2 itself and afterwards compare what remains against dir1.
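That two-pass idea can be collapsed into a single checksum pass. A minimal sketch, assuming GNU coreutils, the hypothetical top-level directories dir1 and dir2 from the example, and file names without embedded newlines:

```shell
#!/bin/sh
# Sketch: delete files under dir2 whose content also exists under dir1.
# Duplicate groups living entirely inside dir2 are left alone, because
# their checksums never appear in the dir1 list.

# 1. Collect the checksums of every file under dir1.
find dir1 -type f -exec md5sum {} + | awk '{print $1}' | sort -u > dir1.sums

# 2. Delete each dir2 file whose checksum appears in that list.
find dir2 -type f | while IFS= read -r f; do
    sum=$(md5sum "$f" | awk '{print $1}')
    if grep -qxF "$sum" dir1.sums; then
        echo "duplicate of a dir1 file: $f"
        rm -- "$f"        # replace with: echo "rm $f"  for a dry run
    fi
done
```

Running it with the `rm` line swapped for the `echo` variant first shows what would be deleted without touching anything.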

Any help appreciated!
 
rdfind(1)							      rdfind								 rdfind(1)

NAME
       rdfind - finds duplicate files

SYNOPSIS
       rdfind [ options ] directory1 | file1 [ directory2 | file2 ] ...

DESCRIPTION
       rdfind finds duplicate files across and/or within several directories. It calculates checksums only if necessary. rdfind runs in
       O(N log(N)) time, with N being the number of files.

       If two (or more) equal files are found, the program decides which of them is the original and the rest are considered duplicates.
       This is done by ranking the files against each other and deciding which has the highest rank. See section RANKING for details.

       If you need better control over the ranking than given, you can use some preprocessor which sorts the file names in the desired
       order and then run the program using xargs. See the examples below for how to use find and xargs in conjunction with rdfind.

       To include files or directories that have names starting with -, use rdfind ./- to not confuse them with options.

RANKING
       Given two or more equal files, the one with the highest rank is selected to be the original and the rest are duplicates. The rules
       of ranking are given below, where the rules are executed from start until an original has been found. Given two files A and B which
       have equal content, the ranking is as follows:

       If A was found while scanning an input argument earlier than B, A is higher ranked.

       If A was found at a depth lower than B, A is higher ranked (A is closer to the root).

       If A was found earlier than B, A is higher ranked.

       The last rule is needed when two files are found in the same directory (obviously not given in separate arguments, otherwise the
       first rule applies) and gives the same order between the files as the operating system delivers them while listing the directory.
       This is operating-system-specific behaviour.

OPTIONS
       Searching options etc:

       -ignoreempty true|false
              Ignore empty files (the default).

       -followsymlinks true|false
              Follow symlinks. Default is false.

       -removeidentinode true|false
              Removes items found which have identical inode and device ID. Default is true.

       -checksum md5|sha1
              What type of checksum to be used: md5 or sha1. Default is md5.

       Action options:

       -makesymlinks true|false
              Replace duplicate files with symbolic links.

       -makehardlinks true|false
              Replace duplicate files with hard links.

       -makeresultsfile true|false
              Make a results file results.txt in the current directory (the default).

       -outputname name
              Make the results file name "name" instead of the default results.txt.

       -deleteduplicates true|false
              Delete (unlink) duplicate files.

       General options:

       -sleep Xms
              Sleeps X milliseconds between reading each file, to reduce load. Default is 0 (no sleep). Note that only a few values are
              supported at present: 0, 1-5, 10, 25, 50, 100 milliseconds.

       -n, -dryrun
              Displays what would have been done; doesn't actually delete or link anything.

       -h, -help, --help
              Displays a brief help message.

       -v, -version, --version
              Displays the version number.

EXAMPLES
       Search for duplicate files in the home directory and a backup directory:

              rdfind ~ /mnt/backup

       Delete duplicates in a backup directory:

              rdfind -deleteduplicates true /mnt/backup

       Search for duplicate files in directories called foo:

              find . -type d -name foo -print0 | xargs -0 rdfind

FILES
       results.txt (the default name; it can be changed with the option -outputname, see above). The results file results.txt will contain
       one row per duplicate file found, along with a header row explaining the columns. A text describes why the file is considered a
       duplicate:

       DUPTYPE_UNKNOWN
              Some internal error.

       DUPTYPE_FIRST_OCCURRENCE
              The file that is considered to be the original.

       DUPTYPE_WITHIN_SAME_TREE
              Files in the same tree (found while processing the same input argument as the original).

       DUPTYPE_OUTSIDE_TREE
              The file was found while processing a different input argument than the original.

ENVIRONMENT
DIAGNOSTICS
EXIT VALUES
       0 on success, nonzero otherwise.

BUGS/FEATURES
       When specifying the same directory twice, rdfind keeps the first encountered as the most important (the original) and the rest as
       duplicates. This might not be what you want.

       The symlink option creates absolute links. This might not be what you want. To create relative links instead, you may use the
       symlinks (2) command, which is able to convert absolute links to relative links.

       Older versions unfortunately contained a misspelling of the word occurrence. This is corrected (since 1.3), which might affect user
       scripts parsing the output file written by rdfind.

       There are lots of enhancements left to do. Please contribute!

SECURITY CONSIDERATIONS
       Avoid manipulating the directories while rdfind is reading; rdfind is quite brittle in that case. Especially when deleting or making
       links, rdfind can be subject to a symlink attack. Use with care!

AUTHOR
       Paul Dreik 2006, reachable at rdfind@pauldreik.se
       Rdfind can be found at http://rdfind.pauldreik.se/

       Do you find rdfind useful? Drop me a line! It is always fun to hear from people who actually use it and what data collections they
       run it on.

THANKS
       Several persons have helped with suggestions and improvements: Niels Moller, Carl Payne and Salvatore Ansani. Thanks also to everyone
       who tested the program and sent me feedback.

VERSION
       1.3.1 (release date 2012-05-07) svn id: $Id: rdfind.1 766 2012-05-07 17:26:17Z pauls $

COPYRIGHT
       This program is distributed under GPLv2 or later, at your option.

SEE ALSO
       md5sum(1), find(1), symlinks(2)

May 2012                                    1.3.1                                    rdfind(1)
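For the merge problem in the thread above, rdfind's ranking rule (earlier arguments outrank later ones) can do most of the work: list dir1 first so its copies become the originals. A hedged sketch with the thread's hypothetical paths dir1 and dir2; note that rdfind will also delete duplicates that exist only *within* either tree, including within dir1, which goes beyond deleting only cross-tree copies:

```shell
# Dry run first: nothing is deleted, results.txt shows what would happen.
rdfind -dryrun true dir1 dir2

# dir1 is listed first, so its files are ranked as the originals; duplicates
# under dir2 (and any within-tree duplicates in either directory) are removed.
rdfind -deleteduplicates true dir1 dir2
```

If within-tree duplicates must be preserved, the checksum-list script earlier in the thread is the safer route.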