Help with removing duplicate entries with awk or Perl


 
# 15  
Old 10-29-2012
Save the code below as a.awk:
Code:
!($2 in a){printf("%s %s ",$1,$2);a[$2]}    # cols 1-2, keyed on col 2
!($3 in b){printf("%s ",$3);b[$3]}          # col 3
!($5 in c){printf("%s %s ",$4,$5);c[$5]}    # cols 4-5, keyed on col 5
!($6 in d){printf("%s ",$6);d[$6]}          # col 6
!($7 in e){printf("%s ",$7);e[$7]}          # col 7
!($9 in f){printf("%s %s ",$8,$9);f[$9]}    # cols 8-9, keyed on col 9
!($10 in g){printf("%s %s",$10,$11);g[$10]} # cols 10-11, keyed on col 10
{printf("\n")}                              # end each output line

Run the awk script with:

Code:
awk -f a.awk input.txt

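A minimal sketch of the pattern a.awk relies on, using a hypothetical 3-column input: each rule prints its field group only the first time the key column's value has been seen (referencing the array element marks it as seen).

```shell
# "seen array" pattern: print cols 1-2 only on the first occurrence of $2
printf 'a 1 x\na 1 y\nb 2 x\n' |
awk '!($2 in seen){printf("%s %s\n",$1,$2); seen[$2]}'
# prints:
# a 1
# b 2
```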
# 16  
Old 10-29-2012
Code:
$
$ cat input
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131908        11132010        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131908        11132010        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130992        11131108        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130992        11131108        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11128311        11128447        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11128311        11128447        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130630        11130711        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130630        11130711        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130729        11130979        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130729        11130979        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131263        11131553        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131263        11131553        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131587        11131709        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131587        11131709        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11132034        11132488        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11132034        11132488        chr1    11131583        11131618        1
$
$
$ perl -F"\t" -lane 'for ($i=0; $i<=$#F; $i++) {
                       if (not defined $tokens{$i.":".$F[$i]}) {push @x, $F[$i]}
                       else {push @x, ""}
                     }
                     if (join("",@x) ne "") {
                       for ($i=0; $i<=$#x; $i++) { $line .= sprintf ("%-10s", $x[$i]) }
                       print $line;
                     }
                     for ($i=0; $i<=$#F; $i++) { $tokens{$i.":".$F[$i]}++ };
                     $line = ""; @x = ();
                    ' input
chr1      11127067  11132181  89        chr1      11128023  11128311  chr1      11130990  11131025  5
                                                                                11131583  11131618  1
                                                  11131908  11132010
                                                  11130992  11131108
                                                  11128311  11128447
                                                  11130630  11130711
                                                  11130729  11130979
                                                  11131263  11131553
                                                  11131587  11131709
                                                  11132034  11132488
$
$
$

tyler_durden
# 17  
Old 10-30-2012
Sorry, it doesn't work.
# 18  
Old 10-30-2012
Try:

Code:
awk '{for(i=1;i<=NF;i++){if(X[$i,i]++){$i=""}}}1' OFS="\t" file

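What this does: `X[$i,i]++` is zero (false) the first time value `$i` is seen in column `i`, so the field survives; on any later occurrence the field is set to the empty string, and the trailing `1` prints every (possibly rebuilt) line. A minimal sketch on a hypothetical two-line input:

```shell
# the repeated value in column 1 is blanked on its second occurrence
printf 'chr1\t100\nchr1\t200\n' |
awk '{for(i=1;i<=NF;i++){if(X[$i,i]++){$i=""}}}1' OFS="\t"
# output: "chr1<TAB>100" then "<TAB>200"
```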
# 19  
Old 10-30-2012
It doesn't work... I have attached the result as output.txt. Kindly have a look.
# 20  
Old 10-30-2012
Quote:
Originally Posted by Amit Pande
Doesn't work...I have attached the result as output.txt. Kindly have a look.
Your expected output doesn't match what you describe.

Please look:

Code:
chr1    11127067    11132181    89    chr1    11128023    11128311    chr1    11130990    11131025    5
                        chr1    11131908    11132010    chr1    11131583    11131618    1
                        chr1    11130992    11131108    
                        chr1    11128311    11128447

Code:
duplicate lines in column 1,2,3,5,6,7,8,9,10 should be removed while those that are not duplicate lines should be retained.

1) From columns 6 and 7, only 4 lines are printed in your expected output (you can see there are a few more).
2) The chr1 marked in red is also a duplicate (why is it printed?).
3) And if you don't want to consider column 4, it should be present for all the lines, right?

Assuming you don't want to consider column 4 for duplicates:

Code:
$ cat file
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131908        11132010        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131908        11132010        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130992        11131108        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130992        11131108        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11128311        11128447        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11128311        11128447        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130630        11130711        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130630        11130711        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130729        11130979        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130729        11130979        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131263        11131553        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131263        11131553        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131587        11131709        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131587        11131709        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11132034        11132488        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11132034        11132488        chr1    11131583        11131618        1

$ awk '{for(i=1;i<=NF;i++){if((X[$i,i]++) && i!=4){$i=""}}}1' OFS="\t" file
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11130990        11131025        5
                        89                                      11131583        11131618        1
                        89              11131908        11132010
                        89
                        89              11130992        11131108
                        89
                        89              11128311        11128447
                        89
                        89              11130630        11130711
                        89
                        89              11130729        11130979
                        89
                        89              11131263        11131553
                        89
                        89              11131587        11131709
                        89
                        89              11132034        11132488
                        89

And considering all the columns..

Code:
$ awk '{for(i=1;i<=NF;i++){if(X[$i,i]++){$i=""}}}1' OFS="\t" file
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11130990        11131025        5
                                                                11131583        11131618        1
                                        11131908        11132010

                                        11130992        11131108

                                        11128311        11128447

                                        11130630        11130711

                                        11130729        11130979

                                        11131263        11131553

                                        11131587        11131709

                                        11132034        11132488

Hope this helps you.

pamu
# 21  
Old 10-30-2012
Thanks a lot... this one works, but unfortunately the first column is not reported for each entry. I have attached the output; kindly go through it.

Last edited by Amit Pande; 03-19-2013 at 02:30 PM..