remove all duplicate lines from all files in one folder


 
# 1  
Old 05-30-2009

Hi,

Is it possible to remove all duplicate lines from all txt files in a specific folder?

This is too hard for me; maybe someone could help.

Let's say we have some number of text files, anywhere from 1 up to a maximum of 50, and each text file contains lines of text.

I want all lines of all the text files taken together to be unique, but the non-duplicate lines must remain in the txt file where they are.

It does not matter in which txt file the duplicate lines are deleted, but one occurrence has to stay in at least one txt file... An even better solution would delete the duplicate occurrences first in text file 1, then in 2, then in 3, so that the deleted lines are spread across all the txt files.


Example with 4 text files (the number can vary, up to 50); we also do not know how many lines each file contains.

txt1:
aaaaaaa
bbbbbbb
ccccccc

txt2:
aaaaaaa
ccccccc
ddddddd

txt3:
ccccccc
ddddddd
eeeeeee

txt4:
ggggggg
hhhhhhh
kkkkkkkk

A result could, for example, look like this:

txt1:
aaaaaaa
bbbbbbb
ccccccc

txt2:
ddddddd

txt3:
eeeeeee

txt4:
ggggggg
hhhhhhh
kkkkkkkk

A perfect result (if possible) would look like this:

txt1:
aaaaaaa
bbbbbbb

txt2:
ccccccc
ddddddd

txt3:
eeeeeee

txt4:
ggggggg
hhhhhhh
kkkkkkkk
# 2  
Old 05-30-2009
Try using commands like sort and uniq to get rid of the duplicate lines. Refer to the man pages and give it a try; if you can't get it to work, come back and we will help you.
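
For example, a per-file pass might look like this (a minimal sketch; it removes duplicates only within each file, not across files, so it does not fully solve the problem):
Code:
for f in *.txt; do
  sort -u "$f" > "$f.tmp" && mv "$f.tmp" "$f"   # keep unique lines within this one file
done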

-vidya
# 3  
Old 05-30-2009
I am not sure why you would need that...

Why don't you just combine all the files, use sort to get the unique lines, and then finally split the result back into separate files?

Code:
cat *.txt | sort -u > newfile

Code:
man split
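
Putting the two together might look like this (a sketch; the 100-line count and the part_ prefix are arbitrary assumptions, and note that the original file boundaries are lost):
Code:
cat *.txt | sort -u > newfile    # one file holding every unique line
split -l 100 newfile part_      # cut it back into files of 100 lines each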

-Devaraj Takhellambam
# 4  
Old 05-30-2009
@dev, the requirement is different. The OP still needs to keep his files.
# 5  
Old 05-30-2009
This is a tough one.

It could be done with awk, but to simplify awk's work I believe it would be a good idea to first combine all the files into one file in such a way that all the original information is retained:
Code:
txt1 aaaaaaa
txt1 bbbbbbb
txt1 ccccccc
txt2 aaaaaaa
txt2 ccccccc
txt2 ddddddd
txt3 ccccccc
txt3 ddddddd
txt3 eeeeeee
txt4 ggggggg
txt4 hhhhhhh
txt4 kkkkkkkk

This way you have the original filename in the first column. The file can be sorted on the second column, and then you can apply an awk program that appends each field $2 as a line to a file named after field $1, but only if field $2 did not appear on the previous input line.
The delete operations would automatically be spread over the filenames.

Now you are wondering how to combine the files, sort the result, and process it:
Code:
awk '{print FILENAME,$0}' * | sort -k 2 > bigfile
awk 'someprogram' bigfile

Of course, you don't need an intermediate file:
Code:
awk '{print FILENAME,$0}' * | sort -k 2 | awk 'someprogram'

someprogram:
Code:
$2 != p { print $2 > $1 }   # new data value: write it to the file named in $1
                            # (awk's > truncates each file on first open, then appends)
{ p = $2 }                  # remember the value to compare with the next line

Perhaps better (to avoid the risk of exceeding the system limit on the number of open files in one process):
Code:
awk '{print FILENAME,$0}' * | sort -k 2 | awk '$2!=p;{p=$2}' | sort -k 1 | awk '$1!=p && p {close(p)} {print $2 > $1; p=$1}'

This could be schematized as:
muxer | sort | filter | sort | splitter
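
Putting the whole pipeline together might look like this (a sketch: the out/ directory is my own addition, so the splitter writes fresh files instead of appending to the originals; like the post's own code it assumes the lines contain no spaces, and each output file ends up sorted rather than in its original order):
Code:
mkdir -p out
awk '{print FILENAME, $0}' txt* |   # muxer: tag each line with its filename
sort -k 2 |                         # group identical data lines together
awk '$2 != p; {p = $2}' |           # filter: keep only the first of each group
sort -k 1,1 |                       # regroup the survivors by filename
awk '{ f = "out/" $1               # splitter: route each line to its file
       if (f != p && p) close(p)   # close the previous file as we go
       print $2 > f; p = f }'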

# 6  
Old 05-30-2009
Code:
awk '{s=FILENAME}!a[$0]++ && FILENAME == s{print $0>FILENAME }' txt*


-Devaraj Takhellambam
# 7  
Old 05-30-2009
Quote:
Originally Posted by devtakh
Code:
awk '{s=FILENAME}!a[$0]++ && FILENAME == s{print $0>FILENAME }' txt*

-Devaraj Takhellambam
FILENAME == s is always true, since it is evaluated immediately after s=FILENAME.

>FILENAME writes to the same file that awk is currently reading; I believe this is not a good idea. Also, to append to a file you need to use >>.

The code can be reworked as:
Code:
mkdir tmp
awk '!a[$0]++ {print $0 >> ("tmp/" FILENAME)}' txt*
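
The de-duplicated copies then land under tmp/; to replace the originals afterwards (my addition, not part of the original post):
Code:
mv tmp/txt* . && rmdir tmp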
