Remove lines containing 2 or more duplicate strings

01-18-2016

My text file has several thousand lines, and some of those lines contain duplicate strings/words. I would like to remove every line that contains a duplicated word.

E.g.:

Code:

One and a Two
Unix.com is the Best
This as a Line Line
Example duplicate sentence with the word duplicate

Output:

Code:

One and a Two
Unix.com is the Best

The letter case doesn't matter.

Many thanks, as always, for your help.

martinsmith

01-18-2016

From what length is a "string" a string? Is "a" a string?

Does case matter?

As a starting point:

Code:

awk '{for (i=1; i<=NF; i++) for (j=i+1; j<=NF; j++) if ($i == $j) next}1' file
One and a Two
Unix.com is the Best
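
Spelled out over several lines, the one-liner above compares every field with each later field on the same line and skips the line at the first exact match. The sketch below is the same logic with comments; the wrapper name drop_repeated_words is only an illustration and not part of the original post.

```shell
# Hypothetical wrapper; the awk body mirrors the one-liner above.
drop_repeated_words() {
  awk '{
    # Compare each field with every later field on the same line.
    for (i = 1; i < NF; i++)
      for (j = i + 1; j <= NF; j++)
        if ($i == $j) next   # exact, case-sensitive match: skip this line
    print                    # no pair matched, so keep the line
  }' "$@"
}

# Reads standard input when no file name is given.
printf '%s\n' 'One and a Two' 'This as a Line Line' | drop_repeated_words
# prints: One and a Two
```

Because the comparison is a plain ==, "Line" and "line" would count as different words here.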

RudiC

01-18-2016

Hello martinsmith,

Could you please try this and let me know if it helps? I am ignoring case here, so it will match identical words whether they are in capital or small letters.
Let's say the following is the Input_file:

Code:

One and a Two
Unix.com is the Best
This as a Line Line
Example duplicate sentence with the word DUPLICATE
UNIX is very good GOOD

The following is the code for it.

Code:

awk 'BEGIN{IGNORECASE = 1} {for(i=1;i<=NF;i++){for(j=1;j<=NF;j++){if($j==$i){A[$i]++;}};if(A[$i]>1){for(i in A){delete A[i];next}}};print;for(i in A){delete A[i]}}'  Input_file

Output will be as follows.

Code:

One and a Two
Unix.com is the Best

Thanks,
R. Singh

RavinderSingh13

01-18-2016

Quote:

Originally Posted by RudiC

Code:

awk '{for (i=1; i<=NF; i++) for (j=i+1; j<=NF; j++) if ($i == $j) next}1' file
One and a Two
Unix.com is the Best

The identical words were between 3 and 12 characters long.

I tried your solution and it works like a charm. Thank you, Rudi.

Quote:

Originally Posted by RavinderSingh13

Thanks, R. Singh. It worked, but it seems to have removed some extra lines as well. I believe Rudi's solution matched the words exactly, since some of my words have similar but different spellings.

Anyway, many thanks as usual. Cheers.
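
A possible explanation for the extra removals, offered as an assumption rather than something confirmed in this thread: in RavinderSingh13's first one-liner, the `next` inside `for(i in A){delete A[i];next}` runs after only one element has been deleted, so counts left in A can carry over to later lines. Below is a minimal sketch of the same counting idea that resets the array for every line. The function name is hypothetical, and tolower() stands in for IGNORECASE so the sketch does not depend on GNU awk.

```shell
# Hypothetical helper: drop lines in which any word occurs more than once,
# ignoring case, with the per-line counts cleared before the next line.
drop_dup_word_lines() {
  awk '{
    dup = 0
    # Count each lowercased word; stop at the first repeat.
    for (i = 1; i <= NF && !dup; i++)
      if (++count[tolower($i)] > 1) dup = 1
    split("", count)   # portable way to empty the whole array
    if (!dup) print
  }' "$@"
}

printf '%s\n' 'This as a Line Line' 'This one is fine' | drop_dup_word_lines
# prints: This one is fine
```

The second sample line shares the word "This" with the rejected first line; it is still printed because no counts survive from one line to the next.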

martinsmith

01-18-2016

Adapting RudiC's suggestion for case independence and minimum string length:

Code:

awk '{for (i=1; i<NF; i++) for (j=i+1; j<=NF; j++) if ((tolower($i) == tolower($j)) && length($i)>=3) next}1' file

--
Note: IGNORECASE is GNU awk only.
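
If the minimum length needs to vary, the same adaptation can take it as a parameter. This is only a sketch: the function name and the awk variable min are assumptions made for illustration, and the value is passed in with the standard -v option.

```shell
# Hypothetical helper: first argument is the minimum word length to check,
# any remaining arguments are input files (standard input if none).
drop_dup_words_min() {
  min=$1
  shift
  awk -v min="$min" '{
    for (i = 1; i < NF; i++) {
      if (length($i) < min) continue   # word too short to count as a string
      a = tolower($i)
      for (j = i + 1; j <= NF; j++)
        if (a == tolower($j)) next     # case-insensitive repeat: skip line
    }
    print
  }' "$@"
}

printf '%s\n' 'a b a c' 'UNIX is very good GOOD' 'One and a Two' | drop_dup_words_min 3
# prints:
# a b a c
# One and a Two
```

Checking only the length of the first word is enough, because two words that are equal after tolower() always have the same length.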

Scrutinizer

01-18-2016

Hello martinsmith,

Here is one more solution, this time without the nested loops, which may help you. Let's say the following is the Input_file.

Code:

One and a Two
Unix.com is the Best
This as a Line Line
Example duplicate sentence with the word DUPLICATE
UNIX is very good GOOD

Then the following is the code.

Code:

awk '{for(i=1;i<=NF;i++){A[tolower($i)]};if(NF == length(A)){print};delete A}'  Input_file

Output will be as follows.

Code:

One and a Two
Unix.com is the Best

EDIT: The same logic can also be written with the loop and the test in separate action blocks. The result is identical, because the if in the version above already runs only once the loop has finished.

Code:

awk '{for(i=1;i<=NF;i++){A[tolower($i)]}};{if(NF == length(A)){print};delete A}' Input_file

Here the first block closes right after the loop, and the second block evaluates the condition once the loop is complete.

Thanks,
R. Singh

RavinderSingh13

01-18-2016

Interesting solution. In short, it is:

Code:

awk '{for(i=1; i<=NF; i++) A[tolower($i)]} (NF==length(A)); {delete A}'

But it works only with a recent GNU awk.
Other awk versions say "fatal: attempt to use array `A' in a scalar context" or "syntax error", or print nothing.
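
For awk versions without length(A) on arrays or a whole-array delete, the distinct-word count can be kept by hand. The following is a sketch under that assumption, not something posted in this thread; the function name and the array name seen are illustrative.

```shell
# Hypothetical helper: print a line only if every word on it is distinct
# (ignoring case), without length(array) or a whole-array delete.
drop_dup_words_portable() {
  awk '{
    n = 0
    for (i = 1; i <= NF; i++) {
      w = tolower($i)
      if (!(w in seen)) { seen[w] = 1; n++ }   # count each new word once
    }
    if (n == NF) print   # as many distinct words as fields: no repeats
    split("", seen)      # empty the array before the next line
  }' "$@"
}

printf '%s\n' 'One and a Two' 'UNIX is very good GOOD' 'Unix.com is the Best' | drop_dup_words_portable
# prints:
# One and a Two
# Unix.com is the Best
```

split("", seen) is used to clear the array because it is understood by older awk implementations as well.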

MadeInGermany
