Remove lines containing 2 or more duplicate strings


 
# 1  
Old 01-18-2016

My text file has several thousand lines, and some of them contain the same string/word more than once. I would like to remove those lines entirely.

E.g.:
Code:
One and a Two
Unix.com is the Best
This as a Line Line
Example duplicate sentence with the word duplicate

Output:
Code:
One and a Two
Unix.com is the Best

The letter case doesn't matter.

Many thanks as always for your help.
# 2  
Old 01-18-2016
From what length on does a "string" count as a string? Is "a" a string?

---------- Post updated at 11:40 ---------- Previous update was at 11:40 ----------

Does case matter?

---------- Post updated at 11:43 ---------- Previous update was at 11:40 ----------

As a starting point:
Code:
awk '{for (i=1; i<=NF; i++) for (j=i+1; j<=NF; j++) if ($i == $j) next}1' file
One and a Two
Unix.com is the Best
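
The one-liner above can be read as follows. This is only an expanded restatement with comments, not a different method: every field is compared with every later field, and `next` drops the line at the first exact match. The function name `drop_exact_dups` is made up for illustration.

```shell
# Hypothetical helper: drop lines in which any word occurs twice (exact match).
# Reads the files given as arguments, or standard input.
drop_exact_dups() {
  awk '{
    for (i = 1; i < NF; i++)          # each field ...
      for (j = i + 1; j <= NF; j++)   # ... against every later field
        if ($i == $j) next            # case-sensitive match: skip this line
    print                             # reached only when no pair matched
  }' "$@"
}

# Sample data from the question
printf '%s\n' \
  'One and a Two' \
  'Unix.com is the Best' \
  'This as a Line Line' \
  'Example duplicate sentence with the word duplicate' | drop_exact_dups
```

With the sample data this should print only the first two lines, as in the output shown above.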

# 3  
Old 01-18-2016
Hello martinsmith,

Could you please try this and let me know if it helps. I am ignoring case here, so the same word will match whether it is written in capital or small letters.
Let's say the following is the Input_file:
Code:
One and a Two
Unix.com is the Best
This as a Line Line
Example duplicate sentence with the word DUPLICATE
UNIX is very good GOOD

The code is as follows.
Code:
awk 'BEGIN{IGNORECASE = 1} {for(i=1;i<=NF;i++){for(j=1;j<=NF;j++){if($j==$i){A[$i]++;}};if(A[$i]>1){for(i in A){delete A[i];next}}};print;for(i in A){delete A[i]}}'  Input_file

Output will be as follows.
Code:
One and a Two
Unix.com is the Best

Thanks,
R. Singh

Last edited by RavinderSingh13; 01-18-2016 at 06:51 AM.. Reason: Added a comment for more clarification about solution now.
# 4  
Old 01-18-2016
Quote:
Originally Posted by RudiC
From what length is a "string" a string? Is "a" a string?


The identical words were between 3 and 12 characters long.

I tried your solution and it works like a charm. Thank you, Rudi.

---------- Post updated at 07:58 PM ---------- Previous update was at 07:53 PM ----------

Thanks, R. Singh. It worked, but it seems to have removed some extra lines. I believe Rudi's solution matched the words exactly, since some words were spelled similarly but were different.

Anyway, many thanks as usual. Cheers.
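
A possible explanation for the extra removals, offered as a guess from reading the code rather than a confirmed diagnosis: in the IGNORECASE version, the cleanup loop `for(i in A){delete A[i];next}` runs `next` after deleting a single element, so the remaining counts from a rejected line appear to stay in `A` and can push a word on a later line over the threshold. The sketch below is a variant that empties the array at the start of every line. It uses `tolower()` instead of `IGNORECASE` so that it does not depend on GNU awk; the function name `drop_dups_ci` and the test lines are made up for illustration.

```shell
# Hypothetical helper: drop lines in which any word occurs more than once,
# ignoring case. Reads the files given as arguments, or standard input.
drop_dups_ci() {
  awk '{
    split("", count)                  # empty the array for every new line
    for (i = 1; i <= NF; i++)
      if (++count[tolower($i)] > 1)   # second occurrence of a word
        next                          # reject the line
    print
  }' "$@"
}

# "foo qux" has no repeated word and should survive, even though the
# line before it is rejected.
printf '%s\n' 'foo qux bar bar' 'foo qux' 'UNIX is very good GOOD' | drop_dups_ci
```

With this input only `foo qux` should be printed.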
# 5  
Old 01-18-2016
Adapting RudiC's suggestion for case independence and minimum string length:
Code:
awk '{for (i=1; i<NF; i++) for (j=i+1; j<=NF; j++) if ((tolower($i) == tolower($j)) && length($i)>=3) next}1' file


--
Note: IGNORECASE is GNU awk only.
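
To illustrate the effect of the length test, here is a small sketch of the same command wrapped in a shell function. The name `drop_dups_min3` and the sample lines are assumptions for illustration only. Repeated words shorter than three characters are ignored, while longer ones are matched regardless of case.

```shell
# Hypothetical wrapper around the command above.
drop_dups_min3() {
  awk '{
    for (i = 1; i < NF; i++)
      for (j = i + 1; j <= NF; j++)
        if (tolower($i) == tolower($j) && length($i) >= 3)
          next                     # repeated word of 3+ characters: skip line
  } 1' "$@"
}

printf '%s\n' \
  'a cat and a dog' \
  'to be or not to be' \
  'This as a Line LINE' | drop_dups_min3
```

The first two lines repeat only one- and two-letter words and should be kept; the third repeats a four-letter word in different case and should be dropped.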

Last edited by Scrutinizer; 01-18-2016 at 07:21 AM..
# 6  
Old 01-18-2016
Hello martinsmith,

Here is one more solution, this time without nested loops, which may help you. Let's say the following is the Input_file.
Code:
One and a Two
Unix.com is the Best
This as a Line Line
Example duplicate sentence with the word DUPLICATE
UNIX is very good GOOD

Then the code is as follows.
Code:
awk '{for(i=1;i<=NF;i++){A[tolower($i)]};if(NF == length(A)){print};delete A}'  Input_file

Output will be as follows.
Code:
One and a Two
Unix.com is the Best

EDIT: The solution above should work fine. The same logic can also be written as two separate action blocks, one for the loop and one for the test and cleanup:
Code:
awk '{for(i=1;i<=NF;i++){A[tolower($i)]}};{if(NF == length(A)){print};delete A}'  Input_file

Here the first block only fills the array. The second block runs once per line, after the loop has finished, and does the comparison and the cleanup.

Thanks,
R. Singh

Last edited by RavinderSingh13; 01-18-2016 at 09:42 AM.. Reason: Added a very tiny changed solution, because of if condition have put it out of for loop now.
# 7  
Old 01-18-2016
Interesting solution. In short it is
Code:
awk '{for(i=1; i<=NF; i++) A[tolower($i)]} (NF==length(A)); {delete A}'

But it works only with a recent GNU awk.
Other awk versions report "fatal: attempt to use array `A' in a scalar context" or "syntax error", or do not display anything.
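
For awk versions that reject `length(A)` on an array or `delete A` on a whole array, the same idea can be sketched by counting the distinct words by hand and emptying the array with `split("", A)`. This is a sketch assuming a POSIX-style awk, not one of the solutions posted in the thread; the function name `drop_dups_portable` is made up.

```shell
# Hypothetical helper: print only lines whose words are all distinct
# (ignoring case), without length(array) or "delete array".
drop_dups_portable() {
  awk '{
    split("", A)                  # clear A without "delete A"
    n = 0                         # number of distinct words on this line
    for (i = 1; i <= NF; i++) {
      w = tolower($i)
      if (!(w in A)) { A[w] = 1; n++ }
    }
  }
  n == NF' "$@"
}

printf '%s\n' \
  'One and a Two' \
  'Unix.com is the Best' \
  'This as a Line Line' \
  'Example duplicate sentence with the word DUPLICATE' \
  'UNIX is very good GOOD' | drop_dups_portable
```

A line is printed only when the number of distinct lowercased words equals `NF`, which is the same test as `NF==length(A)` above.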