Remove lines containing 2 or more duplicate strings

01-19-2016


Thank you everyone! Lots of awesome solutions for this problem. Very much appreciated!

martinsmith


01-19-2016


Quote:

Originally Posted by MadeInGermany

Aia, your solution does not work. First, there is a g too many. Second, the \b is not replicated to the \1. But even if I improve it like
[...]it won't print the following line

Code:

Unix.unix should be printed

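It is working as designed. "Unix.unix should be printed" should NOT be printed: \w+ matches Unix and unix as two separate words, and with /i they count as duplicates.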

---------- Post updated at 11:19 AM ---------- Previous update was at 07:38 AM ----------

Quote:

Originally Posted by MadeInGermany

[...] First, there is a g too many.[...]

Please refer to the perldoc to see what \g1 does.

Aia


01-19-2016


OK, one more expert's post.
\g1 was introduced in Perl 5.10 and behaves like \1 (I had tested with Perl 5.8 only, my bad).
The perl solution treats Unix.unix as two words, while the awk solution treats it as one word.
Regarding my \b comment, only my version prints both of these lines:

Code:

No duplicat sentence with the word duplicate
No duplicate sentence with the word duplicat

(Now I have tested with perl 5.8 and 5.18)

Last edited by MadeInGermany; 01-19-2016 at 03:48 PM..

MadeInGermany


01-19-2016


Quote:

Originally Posted by MadeInGermany

[...]
The perl solution treats Unix.unix as two words while the awk solution treats it as one word.[...]

Could that be a bug or an oversight in the awk suggestion? Maybe it is enough for what the OP intended; however, a word is normally not defined only as a run of characters separated by spaces.
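To make the two word definitions concrete, here is a minimal sketch. The awk command is a hypothetical whitespace-based check written for this illustration, not the awk solution posted earlier in the thread; the perl command is the revised one-liner from this thread.

```shell
line='Unix.unix should be printed'

# Whitespace-delimited words (hypothetical awk check, not the thread's own script):
# "Unix.unix" is a single field, so no duplicate is seen and the line is kept.
awk_out=$(printf '%s\n' "$line" |
  awk '{ split("", seen); for (i = 1; i <= NF; i++) if (seen[tolower($i)]++) next; print }')

# \w+ words: "Unix" and "unix" are separate words, so the line is dropped.
perl_out=$(printf '%s\n' "$line" | perl -ne 'print unless /\b(\w+)\b.*\b\g1\b/i')

printf 'awk:  [%s]\nperl: [%s]\n' "$awk_out" "$perl_out"
```

Which behaviour is right depends on whether punctuation should separate words in the OP's data.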

Aia


01-19-2016


Seeing all these elaborate awk solutions, I wonder if sed wouldn't be easier:

Code:

sed '/\([^ ]*\) \1/d' file

It is little known that back references ("\1") can be used not only in the replacement string but also in the search regexp.

Btw.: "word" here is something surrounded by whitespace, not a certain number of characters. It is easy to put such a further restriction in if it is indeed needed.

I hope this helps.

bakunin


01-19-2016


@Bakunin, that would only work with adjacent words and would also match partial patterns:

Code:

$ echo foo foobar | sed '/\([^ ]*\) \1/d'
$

And because [^ ]* can also match zero characters, the pattern matches any single space, so every line containing a space is deleted:

Code:

$ echo abc def ghi | sed '/\([^ ]*\) \1/d'
$
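One way to address both objections while staying with sed is sketched below. It assumes words are separated by single spaces (not tabs) and that the comparison is case-sensitive; the function name dedup is made up for this example. The line is padded with a space on each side so that every word is space-bounded, the word must be at least one character long, and an optional ".* " allows the duplicate to be non-adjacent.

```shell
# Delete lines in which a whole space-delimited word occurs twice, adjacent or not.
# h saves the line, s pads it with spaces so every word is space-bounded,
# d drops the line on a match, and g restores the unpadded line otherwise.
dedup() {
  sed -e 'h' -e 's/.*/ & /' \
      -e '/ \([^ ][^ ]*\) \(.* \)\{0,1\}\1 /d' \
      -e 'g'
}

kept=$(printf '%s\n' 'foo foobar' 'abc def ghi' 'foo bar foo' 'foo foo' | dedup)
printf '%s\n' "$kept"
```

With this input, only "foo foobar" and "abc def ghi" should survive; the two lines that repeat foo as a whole word are deleted.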


Scrutinizer


01-19-2016


Quote:

Originally Posted by MadeInGermany

[...]
Regarding my \b comment, only my version prints both

Code:

No duplicat sentence with the word duplicate
No duplicate sentence with the word duplicat

(Now I have tested with perl 5.8 and 5.18)

Yes, the boundary metacharacter \b is an anchor, and I did not stop to think that it is not saved as part of the group match.
Changing perl -ne 'print unless /(\b\w+\b).*\g1/i' to perl -ne 'print unless /\b(\w+)\b.*\b\g1\b/i' would have been a more appropriate suggestion. If your Perl version does not support \g{}, there are other bugs to consider as well.

Last edited by Aia; 01-19-2016 at 04:25 PM..

Aia

