remove all duplicate lines from all files in one folder


 
# 1  
Old 05-30-2009

Hi,

Is it possible to remove all duplicate lines from all txt files in a specific folder?

This is too hard for me; maybe someone could help.

Let's say we have some number of text files, anywhere from 1 up to a maximum of 50, and each text file contains lines of text.

I want all lines of all the text files taken together to be unique, but the non-duplicate lines must remain in the txt file where they are.

It does not matter in which txt file the duplicate lines are deleted, but one occurrence has to stay in at least one txt file... An even better solution would delete the duplicate occurrences first in text file 1, then in 2, then in 3, so that the deleted lines are spread across all the txt files.


Example with 4 text files (the number can vary, up to 50); we also do not know how many lines each file contains.

txt1:
aaaaaaa
bbbbbbb
ccccccc

txt2:
aaaaaaa
ccccccc
ddddddd

txt3:
ccccccc
ddddddd
eeeeeee

txt4:
ggggggg
hhhhhhh
kkkkkkkk

A result could, for example, look like this:

txt1:
aaaaaaa
bbbbbbb
ccccccc

txt2:
ddddddd

txt3:
eeeeeee

txt4:
ggggggg
hhhhhhh
kkkkkkkk

A perfect result (if possible) would look like this:

txt1:
aaaaaaa
bbbbbbb

txt2:
ccccccc
ddddddd

txt3:
eeeeeee

txt4:
ggggggg
hhhhhhh
kkkkkkkk
# 2  
Old 05-30-2009
Try using commands like sort and uniq to get rid of the duplicate lines. Refer to the man pages and give it a try; if you can't get it to work, come back and we will help you.
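
For example, a per-file pass might look like this (a minimal sketch; it removes duplicates only within each file, not across files, so it does not fully solve the problem):
Code:
for f in *.txt; do
  sort -u "$f" > "$f.tmp" && mv "$f.tmp" "$f"   # keep unique lines within this one file
done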

-vidya
# 3  
Old 05-30-2009
I am not sure why you would need that...

Why don't you just combine all the files, use sort to get the unique lines, and then finally split the result back into separate files?

Code:
cat *.txt | sort -u > newfile

Code:
man split
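
Putting the two together might look like this (a sketch; the 100-line count and the part_ prefix are arbitrary assumptions, and note that the original file boundaries are lost):
Code:
cat *.txt | sort -u > newfile    # one file holding every unique line
split -l 100 newfile part_      # cut it back into files of 100 lines each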

-Devaraj Takhellambam
# 4  
Old 05-30-2009
@dev, the requirement is different. The OP still needs to keep his files.
# 5  
Old 05-30-2009
This is a tough one.

It could be done with awk, but to simplify awk's work I believe it would be a good idea to first combine all the files into one file in such a way that all the original information is retained:
Code:
txt1 aaaaaaa
txt1 bbbbbbb
txt1 ccccccc
txt2 aaaaaaa
txt2 ccccccc
txt2 ddddddd
txt3 ccccccc
txt3 ddddddd
txt3 eeeeeee
txt4 ggggggg
txt4 hhhhhhh
txt4 kkkkkkkk

This way you have the original filename in the first column. The file can be sorted on the second column, and then you can apply an awk program that appends each field $2 as a line to a file named after field $1, but only if field $2 did not appear on the previous input line.
The delete operations would automatically be spread over the filenames.

Now you are wondering how to combine the files, sort the result, and process it:
Code:
awk '{print FILENAME,$0}' * | sort -k 2 > bigfile
awk 'someprogram' bigfile

Of course, you don't need an intermediate file:
Code:
awk '{print FILENAME,$0}' * | sort -k 2 | awk 'someprogram'

someprogram:
Code:
$2 != p { print $2 > $1 }   # new data value: write it to the file named in $1
                            # (awk's > truncates each file on first open, then appends)
{ p = $2 }                  # remember the value to compare with the next line

Perhaps better (to avoid the risk of exceeding the system limit on the number of open files in one process):
Code:
awk '{print FILENAME,$0}' * | sort -k 2 | awk '$2!=p;{p=$2}' | sort -k 1 | awk '$1!=p && p {close(p)} {print $2 > $1; p=$1}'

This could be schematized as:
muxer | sort | filter | sort | splitter
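
Putting the whole pipeline together might look like this (a sketch: the out/ directory is my own addition, so the splitter writes fresh files instead of appending to the originals; like the post's own code it assumes the lines contain no spaces, and each output file ends up sorted rather than in its original order):
Code:
mkdir -p out
awk '{print FILENAME, $0}' txt* |   # muxer: tag each line with its filename
sort -k 2 |                         # group identical data lines together
awk '$2 != p; {p = $2}' |           # filter: keep only the first of each group
sort -k 1,1 |                       # regroup the survivors by filename
awk '{ f = "out/" $1               # splitter: route each line to its file
       if (f != p && p) close(p)   # close the previous file as we go
       print $2 > f; p = f }'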

# 6  
Old 05-30-2009
Code:
awk '{s=FILENAME}!a[$0]++ && FILENAME == s{print $0>FILENAME }' txt*


-Devaraj Takhellambam
# 7  
Old 05-30-2009
Quote:
Originally Posted by devtakh
Code:
awk '{s=FILENAME}!a[$0]++ && FILENAME == s{print $0>FILENAME }' txt*

-Devaraj Takhellambam
FILENAME == s is always true, since it is evaluated immediately after s=FILENAME.

>FILENAME writes to the same file that awk is currently reading; I believe this is not a good idea. Also, to append to a file you need to use >>.

The code can be reworked as:
Code:
mkdir tmp
awk '!a[$0]++ {print $0 >> ("tmp/" FILENAME)}' txt*
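
The de-duplicated copies then land under tmp/; to replace the originals afterwards (my addition, not part of the original post):
Code:
mv tmp/txt* . && rmdir tmp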
