Delete duplicate lines... with a twist!


 
# 1  
Old 11-22-2011

Hi, I'm sorry, I'm no coder, so I came here counting on your free time and goodwill to beg for some spoon-fed good code. I'll try to be quick and concise!

I've got a file with 50k lines like this:
Code:
"Heh, heh. Those darn ninjas. They're _____."*wacky
The "canebrake", "timber" & "pygmy" are types of what?*rattlesnakes
Science : The second space shuttle was named ------*challenger

The problem is that somewhere (anywhere) in the file a similar line may appear (usually not exactly the same one), and it needs to be recognized as a duplicate and deleted!

My example of what could be found, and should be recognized (and deleted) as a duplicate:
Code:
the 'canebrake', 'timber' & 'pygmy' are types of what*rattleSNAKES
SCIENCE::: the;second;space;shuttle;was;named ??????*challenger

So I guess the algorithm should basically do this:

1. From each line, read only the letters [a-z], [A-Z] and digits [0-9], and disregard any spacing, special characters or punctuation.

2. Compare it with every other line (normalized the same way) and, if the same sequence of letters and digits is found (ignoring spacing, case, special chars...), delete one of the two lines (it doesn't matter which one).
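
To illustrate with my own example above: both "canebrake" lines should boil down to the same string of letters, so they'd count as duplicates:
Code:
The "canebrake", "timber" & "pygmy" are types of what?*rattlesnakes
the 'canebrake', 'timber' & 'pygmy' are types of what*rattleSNAKES
-> both reduce to: thecanebraketimberpygmyaretypesofwhatrattlesnakes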


The scripting language doesn't matter... perl, python, ruby, vi, awk, sed... anything goes =) (I'm using an Arch Linux box)

Much appreciated!
# 2  
Old 11-22-2011
Code:
awk '{s=tolower($0);gsub("[^a-z]","",s);x[s]=$0} END {for(i in x) print x[i]}' file
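
The same logic written out as a script, for readability:
Code:
awk '{
    s = tolower($0)          # compare case-insensitively
    gsub("[^a-z]", "", s)    # keep letters only - spacing, punctuation and digits are all stripped
    x[s] = $0                # remember one original line per key (the last one seen wins)
}
END {
    for (i in x) print x[i]  # print one surviving line per key, in no particular order
}' file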

# 3  
Old 11-22-2011
Thanks, it worked.

But one small observation: I had some 200 lines in the file that differ only by numbers, and this code (incorrectly) counts them as duplicates.
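
For example (made-up lines, but that's the pattern), two different questions end up with the same key once the digits are gone, so one of them disappears:
Code:
$ printf '%s\n' 'Algebra : 2+2 = ?*4' 'Algebra : 3+3 = ?*6' | awk '{s=tolower($0);gsub("[^a-z]","",s);x[s]=$0} END {for(i in x) print x[i]}'
Algebra : 3+3 = ?*6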
# 4  
Old 11-23-2011
Quote:
Originally Posted by shadowww
Thanks, it worked.

But one small observation: I had some 200 lines in the file that differ only by numbers, and this code (incorrectly) counts them as duplicates.
Not sure what you mean... can you post a sample of what that input file looks like...
# 5  
Old 11-23-2011
Code:
$
$ cat f42
"Heh, heh. Those darn ninjas. They're _____."*wacky
The "canebrake", "timber" & "pygmy" are types of what?*rattlesnakes
Science : The second space shuttle was named ------*challenger
the quick brown 123 fox jumps over the lazy ?@! dog
the 456 quick brown fox jumps over the ~*%# lazy dog
123 the quick brown @%#$!^ fox jumps over the lazy ~()& dog
$
$
$
$ perl -lne '$h = $_;                      # remember the original line
             s/[^\w]|_//g; tr/A-Z/a-z/;    # keep only letters and digits, lowercased
             s/(.)(?=.*?\1)//g;            # drop any character that occurs again later (one of each is left)
             $_ = join "", sort split "";  # key = the sorted set of distinct characters
             print $h if not defined $x{$_}; $x{$_}++
            ' f42
"Heh, heh. Those darn ninjas. They're _____."*wacky
The "canebrake", "timber" & "pygmy" are types of what?*rattlesnakes
Science : The second space shuttle was named ------*challenger
the quick brown 123 fox jumps over the lazy ?@! dog
the 456 quick brown fox jumps over the ~*%# lazy dog
$
$
$

tyler_durden
# 6  
Old 11-23-2011
Quote:
Originally Posted by shamrock
Not sure what you mean... can you post a sample of what that input file looks like...
Sure, it is a 5 MB compilation of trivia questions, one question per row, with * as the separator between question and answer (the file will be used by an IRC trivia bot). The aim is to automatically weed out as many duplicate questions as possible. There is a sample in my first post, but here is a bigger chunk of the file: www.pastebin.com/u1a1ZGHr which also shows the entries that get selected as duplicates and deleted by your code - these are the ones starting with "Algebra : "


Thanks, tyler_durden, I'll try this perl code in a moment.

edit:
tyler_durden's perl code shrunk questions from 55983 lines to 20915
shamrock's awk code shrunk questions from 55983 lines to 40724

I have yet to compare in detail (manually? :<) but I think the perl code ate too many 'duplicates'. I can't believe it's more than half, but I don't know yet - I may be wrong, I have to confirm.

Last edited by shadowww; 11-23-2011 at 01:31 PM..
# 7  
Old 11-23-2011
Quote:
Originally Posted by shadowww
Sure, it is a 5 MB compilation of trivia questions, one question per row, with * as the separator between question and answer (the file will be used by an IRC trivia bot). The aim is to automatically weed out as many duplicate questions as possible. There is a sample in my first post, but here is a bigger chunk of the file: www.pastebin.com/u1a1ZGHr which also shows the entries that get selected as duplicates and deleted by your code - these are the ones starting with "Algebra : "
Is * the only non-alphanumeric character in the input file? That would make it easy... but is that really the case, as your original post had others... so if you define it clearly, a better awk solution can be given...
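
If the digits are significant but spacing, punctuation and case should still be ignored, the obvious tweak - untested - is simply to keep digits in the key as well:
Code:
awk '{s=tolower($0);gsub("[^a-z0-9]","",s);x[s]=$0} END {for(i in x) print x[i]}' file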