Help with removing duplicate entries with awk or Perl


 
# 15  
Old 10-29-2012
Save the code below as a.awk:
Code:
!($2 in a){printf("%s %s ",$1,$2);a[$2]}    # cols 1-2, keyed on col 2
!($3 in b){printf("%s ",$3);b[$3]}          # col 3
!($5 in c){printf("%s %s ",$4,$5);c[$5]}    # cols 4-5, keyed on col 5
!($6 in d){printf("%s ",$6);d[$6]}          # col 6
!($7 in e){printf("%s ",$7);e[$7]}          # col 7
!($9 in f){printf("%s %s ",$8,$9);f[$9]}    # cols 8-9, keyed on col 9
!($10 in g){printf("%s %s",$10,$11);g[$10]} # cols 10-11, keyed on col 10
{printf("\n")}                              # end each output line

Run the awk script with:

Code:
awk -f a.awk input.txt

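A minimal sketch of the pattern a.awk relies on, using a hypothetical 3-column input: each rule prints its field group only the first time the key column's value has been seen (referencing the array element marks it as seen).

```shell
# "seen array" pattern: print cols 1-2 only on the first occurrence of $2
printf 'a 1 x\na 1 y\nb 2 x\n' |
awk '!($2 in seen){printf("%s %s\n",$1,$2); seen[$2]}'
# prints:
# a 1
# b 2
```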
# 16  
Old 10-29-2012
Code:
$
$ cat input
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131908        11132010        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131908        11132010        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130992        11131108        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130992        11131108        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11128311        11128447        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11128311        11128447        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130630        11130711        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130630        11130711        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130729        11130979        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130729        11130979        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131263        11131553        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131263        11131553        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131587        11131709        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131587        11131709        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11132034        11132488        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11132034        11132488        chr1    11131583        11131618        1
$
$
$ perl -F"\t" -lane 'for ($i=0; $i<=$#F; $i++) {
                       if (not defined $tokens{$i.":".$F[$i]}) {push @x, $F[$i]}
                       else {push @x, ""}
                     }
                     if (join("",@x) ne "") {
                       for ($i=0; $i<=$#x; $i++) { $line .= sprintf ("%-10s", $x[$i]) }
                       print $line;
                     }
                     for ($i=0; $i<=$#F; $i++) { $tokens{$i.":".$F[$i]}++ };
                     $line = ""; @x = ();
                    ' input
chr1      11127067  11132181  89        chr1      11128023  11128311  chr1      11130990  11131025  5
                                                                                11131583  11131618  1
                                                  11131908  11132010
                                                  11130992  11131108
                                                  11128311  11128447
                                                  11130630  11130711
                                                  11130729  11130979
                                                  11131263  11131553
                                                  11131587  11131709
                                                  11132034  11132488
$
$
$

tyler_durden
# 17  
Old 10-30-2012
Sorry, it doesn't work.
# 18  
Old 10-30-2012
Try:

Code:
awk '{for(i=1;i<=NF;i++){if(X[$i,i]++){$i=""}}}1' OFS="\t" file

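What this does: `X[$i,i]++` is zero (false) the first time value `$i` is seen in column `i`, so the field survives; on any later occurrence the field is set to the empty string, and the trailing `1` prints every (possibly rebuilt) line. A minimal sketch on a hypothetical two-line input:

```shell
# the repeated value in column 1 is blanked on its second occurrence
printf 'chr1\t100\nchr1\t200\n' |
awk '{for(i=1;i<=NF;i++){if(X[$i,i]++){$i=""}}}1' OFS="\t"
# output: "chr1<TAB>100" then "<TAB>200"
```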
# 19  
Old 10-30-2012
It doesn't work... I have attached the result as output.txt. Kindly have a look.
# 20  
Old 10-30-2012
Quote:
Originally Posted by Amit Pande
Doesn't work...I have attached the result as output.txt. Kindly have a look.
Your expected output doesn't match what you describe.

Please look:

Code:
chr1    11127067    11132181    89    chr1    11128023    11128311    chr1    11130990    11131025    5
                        chr1    11131908    11132010    chr1    11131583    11131618    1
                        chr1    11130992    11131108    
                        chr1    11128311    11128447

Code:
duplicate lines in column 1,2,3,5,6,7,8,9,10 should be removed while those that are not duplicate lines should be retained.

1) From columns 6 and 7, only 4 lines are printed in your expected output (you can see there are a few more).
2) The chr1 marked in red is also a duplicate (why is it printed?).
3) And if you don't want to consider column 4, it should be present for all the lines, right?

Assuming you don't want to consider column 4 for duplicates:

Code:
$ cat file
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131908        11132010        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131908        11132010        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130992        11131108        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130992        11131108        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11128311        11128447        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11128311        11128447        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130630        11130711        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130630        11130711        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130729        11130979        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130729        11130979        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131263        11131553        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131263        11131553        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131587        11131709        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131587        11131709        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11132034        11132488        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11132034        11132488        chr1    11131583        11131618        1

$ awk '{for(i=1;i<=NF;i++){if((X[$i,i]++) && i!=4){$i=""}}}1' OFS="\t" file
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11130990        11131025        5
                        89                                      11131583        11131618        1
                        89              11131908        11132010
                        89
                        89              11130992        11131108
                        89
                        89              11128311        11128447
                        89
                        89              11130630        11130711
                        89
                        89              11130729        11130979
                        89
                        89              11131263        11131553
                        89
                        89              11131587        11131709
                        89
                        89              11132034        11132488
                        89

And considering all the columns..

Code:
$ awk '{for(i=1;i<=NF;i++){if(X[$i,i]++){$i=""}}}1' OFS="\t" file
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11130990        11131025        5
                                                                11131583        11131618        1
                                        11131908        11132010

                                        11130992        11131108

                                        11128311        11128447

                                        11130630        11130711

                                        11130729        11130979

                                        11131263        11131553

                                        11131587        11131709

                                        11132034        11132488

Hope this helps you.

pamu
# 21  
Old 10-30-2012
Thanks a lot... this one works, but unfortunately the first column is not reported for each entry. I have attached the output; kindly go through it.

Last edited by Amit Pande; 03-19-2013 at 02:30 PM..