remove duplicate lines based on two columns and judging from a third one


 
# 1  
Old 09-14-2011

Hello all,

I have an input file with four columns and a lot of lines, like this:

Quote:
2GOX03.output:Apol-Pol 10.64 -.79 (B)ALA3TRP
1R6Q20.output:Char-Pol 13.40 -.78 (B)ASP14SER
3SGB19.output:Char-Pol 13.40 -.58 (A)GLU177ATYR
2GOX13.output:Char-Pol 10.40 -.55 (B)ARG65GLN
2GOX14.output:Apol-Pol 10.40 -.55 (B)ALA3TRP
...
...
For example, line 1 and line 5 match because the first 4 characters of the first column match and the fourth column matches too. I want to keep the line that has the lowest number in the third column, so I discard line 5. Is there a way to do this with awk for every possible match? Note that the file might contain more than two matching lines.

Thanks
# 2  
Old 09-14-2011
Code:
2GOX03.output:Apol-Pol     10.64        -.79        (B)ALA3TRP
1R6Q20.output:Char-Pol     13.40        -.78        (B)ASP14SER
3SGB19.output:Char-Pol     13.40        -.58        (A)GLU177ATYR
2GOX13.output:Char-Pol     10.40        -.55        (B)ARG65GLN
2GOX14.output:Apol-Pol     10.40        -.55        (B)ALA3TRP

Why should the 5th line be deleted if you want to keep the lower number in that column?
Are -.55 and -.78 negative numbers?

Last edited by sk1418; 09-14-2011 at 10:59 AM..
# 3  
Old 09-14-2011
Try this beautiful code:
Code:
perl -ane '/^.{4}/;if (!defined $m{$&}{$F[3]}||$m{$&}{$F[3]}>$F[2]){$m{$&}{$F[3]}=$F[2];$a{$&}{$F[3]}=$_}END{for $i (keys %a) {for $j (keys %{$a{$i}}){print $a{$i}{$j}}}}' file

# 4  
Old 09-14-2011
awk:
Code:
kent$  awk '{f=substr($1,1,4)$4;if((f in a && $4<b[f])||(! (f in a))){ a[f] = $0;b[f]=$3; } } END{for (x in a)print a[x]}' yourFile

Note: in your example the 5th line is kept, because 55 is lower than 79.
# 5  
Old 09-14-2011
No, it is a negative number; it just does not have a zero after the minus sign. So it is not the lower number.

Bartus, thanks a lot. I will try it.

Cheers
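The point about the missing leading zero can be checked directly. This is only a small sketch; the helper name `lower_of` is made up for illustration and does not appear in the thread. It adds 0 to each value so that awk is certain to compare them as numbers.

```shell
# lower_of: hypothetical helper, prints the lower of two numbers.
# Adding 0 forces awk to compare the values numerically, not as strings.
lower_of() {
  awk -v a="$1" -v b="$2" 'BEGIN { print ((a + 0 < b + 0) ? a : b) }'
}

lower_of -.79 -.55   # prints -.79
```

So for the sample data, -.79 is the lower value and line 1 is the one to keep.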
# 6  
Old 09-14-2011
For negative numbers in column 3:

Code:
kent$  awk '{f=substr($1,1,4)$4;if((f in a && $3<b[f])||(! (f in a))){ a[f] = $0;b[f]=$3; } } END{for (x in a)print a[x]}' yourFile
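
The same idea can be written out as a shell function. This is a minimal sketch that assumes whitespace-separated columns as in the sample; the function name `dedupe_lowest` and the array names are illustrative and not from the thread. Compared with the one-liner, it forces a numeric comparison with `$3 + 0` and prints the surviving lines in first-seen order instead of the unspecified order of `for (x in a)`.

```shell
# dedupe_lowest: hypothetical helper name, not from the thread.
# Reads the named files (or standard input) and keeps, for each key,
# the line with the lowest value in column 3.
# Usage: dedupe_lowest yourFile
dedupe_lowest() {
  awk '
    {
      # Key: first 4 characters of column 1, plus column 4.
      key = substr($1, 1, 4) SUBSEP $4
      if (!(key in best)) {
        order[++n] = key        # remember the order keys first appear in
        best[key] = $0
        min[key] = $3 + 0       # "+ 0" forces a numeric value
      } else if ($3 + 0 < min[key]) {
        best[key] = $0          # strictly lower value replaces the kept line
        min[key] = $3 + 0
      }
    }
    END {
      for (i = 1; i <= n; i++) print best[order[i]]
    }
  ' "$@"
}
```

On the five sample lines this keeps lines 1 to 4 and drops line 5, since -.79 is lower than -.55. On a tie, the line seen first is kept.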

 