Remove duplicate files based on text string?


 
# 1  
Old 07-07-2009

Hi

I have been struggling with a script for removing duplicate messages from a shared mailbox.
I would like to search for duplicate messages based on the “Message-ID” string within the message files.

I have managed to find the duplicate “Message-ID” strings and (if I wanted to) delete the files in which they were found.
My problem is how to preserve one copy of each message.

My script so far:

--------------------
#!/bin/tcsh
set dir=/my/maildir

foreach file (`grep -h "Message-ID: <" $dir/* | uniq -d |xargs -i \grep -l "{}" $dir/*`)

rm -f "$file"

end

--------------------

Any ideas?

Thanks // Tomas

---------- Post updated at 06:02 PM ---------- Previous update was at 10:18 AM ----------

FYI, solved:
-------------------
#!/bin/tcsh
set maildir=/my/maildir
foreach dupstring ("`grep -m 1 -h -R "^Message-ID:" $maildir/ | sort | uniq -d`")
grep -l -R "$dupstring" $maildir/ |sed 1d |xargs -i \rm -f "{}"
end
-------------------

// Tomas
# 2  
Old 08-28-2009
I need the exact same solution to the same problem. But I get an error when I run your script:

Code:
#!/bin/tcsh
set maildir=/my/maildir
foreach dupstring ("`grep -m 1 -h -R "^Message-ID:" $maildir/ | sort | uniq -d`")
grep -l -R "$dupstring" $maildir/ |sed 1d |xargs -i \rm -f "{}"
end

I did set the mail archive directory correctly, but that is not the issue here.

Code:
Scripts ]$ ./remove_dupes.sh 
./remove_dupes.sh: line 4: syntax error near unexpected token `"('

Not sure if this is because I use the bash shell as opposed to tcsh. Or is this a nested double quote issue? I have tried fixing, but my syntax skills are still developing. Help appreciated.
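For what it's worth, that error message comes from bash's parser rejecting tcsh's `foreach` keyword, which suggests the script was run by bash rather than tcsh. A bash translation of the same logic might look like this (a sketch, not tested on your archive; it assumes the mail file names contain no spaces, and the `-F` flag is added so the header line is matched literally rather than as a regex):

```shell
#!/bin/bash
# bash translation of the tcsh script: for each Message-ID line that
# occurs in more than one file, keep the first file found and delete the rest.
remove_dupes() {
    maildir=$1
    grep -m 1 -h -R '^Message-ID:' "$maildir"/ | sort | uniq -d |
    while IFS= read -r dupstring; do
        # List every file containing this header line, skip the first,
        # and delete the remainder.
        grep -l -R -F "$dupstring" "$maildir"/ | sed 1d | xargs rm -f
    done
}

# Example (hypothetical path):
# remove_dupes /my/maildir
```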

---------- Post updated 08-28-09 at 02:10 AM ---------- Previous update was 08-27-09 at 12:54 PM ----------

Code:
blake [ ~/scratch ]$ xargs -i
xargs: illegal option -- i

Code:
#!/bin/tcsh
set maildir=/Users/blake/Library/Mail/Mailboxes/Archive.mbox/Messages
foreach dupstring ("`grep -m 1 -h -R ^Message-ID: $maildir/ | sort | uniq -d`")
grep -l -R "$dupstring" $maildir/ | sed 1d | xargs \rm -f
end

I removed the -i from the xargs command (and the now-unused "{}") and tested it on a directory with a few sample emails with duplicate Message-IDs, and it worked. So I applied the script to my backed-up archive of 76,000+ emails with tons of duplicates and started it last night. It's still running, and it may keep running for days: as constructed, it compares the "Message-ID:" string of each of the 76,000 emails against all 76,000 others. That's nearly 5.8 billion comparisons!

Looking at my data I see that in the vast majority of cases (in every case I found), the emails with duplicate Message-IDs are literally right next to each other. Here is a small section of the directory:

Code:
-rw-r--r--  1 blake  staff     76634 Jan 30  2008 101576.emlx **
-rw-r--r--  1 blake  staff     76627 Jan 30  2008 101577.emlx **
-rw-r--r--  1 blake  staff     12083 Jan 30  2008 101587.emlx
-rw-r--r--  1 blake  staff    104673 Jan 30  2008 101588.emlx
-rw-r--r--  1 blake  staff     67374 Jan 30  2008 101597.emlx **
-rw-r--r--  1 blake  staff     67374 Jan 30  2008 101598.emlx **

** = duplicate Message-IDs

How do I modify this script so that it only compares a Message-ID in one file to the Message-ID in the next file (or next n files)? That should dramatically speed up this process. Thanks.
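Rather than comparing neighbors, a single pass over the directory avoids the all-against-all search entirely: remember each Message-ID the first time it appears and delete any later file that repeats one. A sketch in plain sh (the `.emlx` extension is taken from the listing above; the path is hypothetical):

```shell
#!/bin/sh
# One pass over the directory: keep the first file carrying each
# Message-ID, delete every subsequent file that repeats one.
dedupe_by_message_id() {
    dir=$1
    seen=$(mktemp)                            # scratch list of IDs seen so far
    for f in "$dir"/*.emlx; do
        [ -f "$f" ] || continue               # guard against an unmatched glob
        id=$(grep -m 1 '^Message-ID:' "$f")
        [ -z "$id" ] && continue              # no header: leave the file alone
        if grep -Fqx "$id" "$seen"; then
            rm -f "$f"                        # duplicate: remove this copy
        else
            printf '%s\n' "$id" >>"$seen"     # first copy: keep it, remember ID
        fi
    done
    rm -f "$seen"
}

# Example (hypothetical path):
# dedupe_by_message_id /Users/blake/Library/Mail/Mailboxes/Archive.mbox/Messages
```

Each file is read once for its header, so the work grows with the archive size rather than its square, regardless of whether duplicates sit next to each other.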

Last edited by sitney; 08-27-2009 at 03:06 PM..