help! script to select line with greatest value 2 between columns


 
# 1  
Old 10-20-2011
help! script to select line with greatest value 2 between columns

Hi,
I’m trying to do something I haven’t done before and I’m struggling with how to even create the command or script.
I have the following space-delimited file:

Code:
gene    accession    chr     chr_st    begin    end       
NN1        NC_024540    chr3    -    14000    14020    
NN1        NC_024543    chr3    -    14050    14060    
ATG        NC_01        chr12    +    12000    12100    
ATG        NC_02        chr12    +    12100    12300

This is an example file.
I want to modify the file so that, for each gene, only the form with the greatest length is written to a new file.
So first I want to calculate the length of each form of each gene, row by row, where length = end value - begin value:

Code:
gene    accession    chr     chr_st    begin    end        length 
NN1        NC_024540    chr3    -    14000    14020    20
NN1        NC_024543    chr3    -    14050    14060    10
ATG        NC_01        chr12    +    12000    12100    100
ATG        NC_02        chr12    +    12100    12300    200

I want to be able to create a file with only the genes with greatest length which would look like this:
Code:
gene    accession    chr      chr_st    begin    end         length
NN1        NC_024540    chr3    -        14000    14020    20
ATG        NC_02        chr12    +        12100    12300    200

Any help is really appreciated. I have 30,000 genes with different forms like this!
Thanks!

Last edited by zxmaus; 10-21-2011 at 04:41 AM.. Reason: added code tags
# 2  
Old 10-20-2011
Code:
nawk 'NR>1{print $0,($NF-$(NF-1))}' yourfile

Code:
nawk 'NR<2{next}{c=($NF-$(NF-1))}!($1 in A)||(c>m[$1]&&($1 in A)){m[$1]=c;A[$1]=$0 FS m[$1]}END{for(i in A) print A[i]}' yourfile


The NR>1 and NR<2{next} parts are only there to skip the heading line 'gene accession chr chr_st begin end'. Feel free to remove them if your files have no heading line.
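If the heading line should be kept rather than skipped, the first command can be adapted. The following is only a sketch, not the poster's command: it assumes plain `awk` behaves like `nawk` here, and the file names `genes.txt` and `genes_len.txt` are made up for the example. The sample rows are copied from post #1.

```shell
# Sample input copied from post #1 (file name is hypothetical)
cat > genes.txt <<'EOF'
gene accession chr chr_st begin end
NN1 NC_024540 chr3 - 14000 14020
NN1 NC_024543 chr3 - 14050 14060
ATG NC_01 chr12 + 12000 12100
ATG NC_02 chr12 + 12100 12300
EOF

awk '
NR == 1 { print $0, "length"; next }   # header: append a column name
{ print $0, $NF - $(NF-1) }            # data rows: append end - begin
' genes.txt > genes_len.txt

cat genes_len.txt
```

With this input the last row should come out as `ATG NC_02 chr12 + 12100 12300 200`, matching the intermediate table asked for in post #1.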

Last edited by ctsgnb; 10-20-2011 at 03:04 PM..
# 3  
Old 10-20-2011
clarification please?

Thanks! I got the file, but could you please explain what these commands are doing? I just used the second command to generate a new file. Am I doing it correctly that way? Thanks so much!
# 4  
Old 10-21-2011
Code:
 
nawk 'NR>1{print $0,($NF-$(NF-1))}' yourfile

NR > 1 ---> skips the first line (the header line)
print $0 ---> $0 represents the entire line (e.g. NN1 NC_024540 chr3 - 14000 14020)
NF ---> the number of fields on the line (6 in your case)
$NF ---> value of the last field (14020)
$(NF-1) ---> value of the 5th field (14000)
$NF-$(NF-1) ---> end minus begin, i.e. the length (14020 - 14000 = 20)
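The second command from post #2 is not explained in the thread. Below is a commented, spread-out sketch of the same logic. Assumptions: plain `awk` stands in for `nawk`, the file names are made up, the redundant `&&($1 in A)` test is dropped, and `sort` is added only to make the output order predictable (awk's `for (i in A)` order is unspecified).

```shell
# Sample input copied from post #1 (file name is hypothetical)
cat > genes.txt <<'EOF'
gene accession chr chr_st begin end
NN1 NC_024540 chr3 - 14000 14020
NN1 NC_024543 chr3 - 14050 14060
ATG NC_01 chr12 + 12000 12100
ATG NC_02 chr12 + 12100 12300
EOF

awk '
NR < 2 { next }                 # skip the header line
{ c = $NF - $(NF-1) }           # c = end - begin for the current row
!($1 in A) || c > m[$1] {       # gene not seen yet, or longer than the best so far
    m[$1] = c                   # remember the best length for this gene
    A[$1] = $0 FS c             # remember the whole line with the length appended
}
END { for (i in A) print A[i] } # print one stored line per gene
' genes.txt | sort > longest.txt

cat longest.txt
```

For the sample this should print `ATG NC_02 chr12 + 12100 12300 200` and `NN1 NC_024540 chr3 - 14000 14020 20`. Because `A` is indexed by the gene name in `$1`, only one line per gene can be stored, and the comparison is a strict `>`, so on a tie the first row seen is kept.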
# 5  
Old 10-21-2011
not working

I'm still getting a file with duplicate genes.

When I enter this code:
Code:
nawk 'NR<2{next}{c=($NF-$(NF-1))}!($1 in A)||(c>m[$1]&&($1 in A)){m[$1]=c;A[$1]=$0 FS m[$1]}END{for(i in A) print A[i]}' yourfile

for a file like this:
Code:
gene   accession    chr   chr_st   begin   end   length
NN1    NC_024540  chr3    -       14000 14020  20
NN1    NC_024543  chr3    -       14050 14060  10
ATG    NC_01        chr12  +       12000 12100  100
ATG    NC_02        chr12  +       12100 12300  200

I end up with genes written to the file as duplicates, when I just want the longest form of each gene written to one file.
I'm really grateful for your help.

Last edited by Franklin52; 10-21-2011 at 09:21 AM.. Reason: Please use code tags, thank you
# 6  
Old 10-21-2011
... it works for me:
Code:
$ cat tst.txt
gene accession chr chr_st begin end length
NN1 NC_024540 chr3 - 14000 14020
NN1 NC_024543 chr3 - 14010 14060
NN1 NC_024543 chr3 - 14050 14060
ATG NC_01 chr12 + 12000 12100
ATG NC_02 chr12 + 12100 12900
ATG NC_02 chr12 + 12100 12300

Code:
$ nawk 'NR<2{next}{c=($NF-$(NF-1))}(!($1 in A))||(c>m[$1]&&($1 in A)){m[$1]=c;A[$1]=$0 FS m[$1]}END{for(i in A) print A[i]}' tst.txt
NN1 NC_024543 chr3 - 14010 14060 50
ATG NC_02 chr12 + 12100 12900 800

Please upload your entire input file, and also provide the output you get as well as the command you ran (make sure you didn't forget any parentheses).
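One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: the file shown in post #5 already contains a length column. In that case `$NF` is the length and `$(NF-1)` is the end value, so `$NF-$(NF-1)` no longer computes end minus begin. A sketch that uses fixed field numbers instead (begin in field 5, end in field 6, as in the sample) is shown below. The file names and the `sort` step are assumptions for the example, and `awk` stands in for `nawk`.

```shell
# Sample input copied from post #5, which already has a length column
cat > post5.txt <<'EOF'
gene accession chr chr_st begin end length
NN1 NC_024540 chr3 - 14000 14020 20
NN1 NC_024543 chr3 - 14050 14060 10
ATG NC_01 chr12 + 12000 12100 100
ATG NC_02 chr12 + 12100 12300 200
EOF

awk '
NR == 1 { next }                    # skip the header line
{ c = $6 - $5 }                     # length from fixed columns: end - begin
!($1 in best) || c > max[$1] {      # first row for this gene, or a longer one
    max[$1] = c
    best[$1] = $1 FS $2 FS $3 FS $4 FS $5 FS $6 FS c   # rebuild the row with one length column
}
END { for (g in best) print best[g] }
' post5.txt | sort > fixed.txt

cat fixed.txt
```

With the post #5 sample this should select `NC_02` for ATG and `NC_024540` for NN1. Running the original `$NF-$(NF-1)` command on the same file would compare negative numbers (length minus end) and pick `NC_01` for ATG instead.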
# 7  
Old 10-21-2011
still have duplicates

Thanks.
It works, but I realized that some genes are in the output file more than once because some isoforms happen to have the same length.
So now I would have to take that file and create a new one with only one line per gene.
I used the code below on the original file.

Code:
nawk 'NR<2{next}{c=($NF-$(NF-1))}(!($1 in A))||(c>m[$1]&&($1 in A)){m[$1]=c;A[$1]=$0 FS m[$1]}END{for(i in A) print A[i]}' tst.txt

Output file:
test.txt

gene accession chr chr_st begin end length
TNFRSF18 NM_004195 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148902 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148901 chr1 - 1138887 1142089 3202
MIB2 NM_080875 chr1 + 1550794 1565990 15196
MIB2 NM_001170688 chr1 + 1550794 1565990 15196
MIB2 NM_001170687 chr1 + 1550794 1565990 15196
MIB2 NM_001170686 chr1 + 1550794 1565990 15196
CDK11A NM_024011 chr1 - 1634169 1655791 21622
CDK11A NM_033529 chr1 - 1634169 1655791 21622
WASH7P NR_024540 chr1 - 14361 29370 15009
FAM138F NR_026820 chr1 - 34610 36081 1471
FAM138A NR_026818 chr1 - 34610 36081 1471


So from this final file, how can I make a file that has only one line per gene?
I would want output.txt to look like this:

Desired final file:
gene accession chr chr_st begin end length
TNFRSF18 NM_004195 chr1 - 1138887 1142089 3202
MIB2 NM_080875 chr1 + 1550794 1565990 15196
CDK11A NM_024011 chr1 - 1634169 1655791 21622
WASH7P NR_024540 chr1 - 14361 29370 15009
FAM138F NR_026820 chr1 - 34610 36081 1471
FAM138A NR_026818 chr1 - 34610 36081 1471

I hope this is clearer.
Thanks!
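The thread ends without an answer to this last question. One possible approach, sketched here under assumptions rather than taken from any reply, is to keep the header plus the first line seen for each gene name. The array name `seen` is arbitrary, `awk` stands in for `nawk`, and the input is the output listing from this post saved as `test.txt`.

```shell
# Input copied from the output listing in post #7
cat > test.txt <<'EOF'
gene accession chr chr_st begin end length
TNFRSF18 NM_004195 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148902 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148901 chr1 - 1138887 1142089 3202
MIB2 NM_080875 chr1 + 1550794 1565990 15196
MIB2 NM_001170688 chr1 + 1550794 1565990 15196
MIB2 NM_001170687 chr1 + 1550794 1565990 15196
MIB2 NM_001170686 chr1 + 1550794 1565990 15196
CDK11A NM_024011 chr1 - 1634169 1655791 21622
CDK11A NM_033529 chr1 - 1634169 1655791 21622
WASH7P NR_024540 chr1 - 14361 29370 15009
FAM138F NR_026820 chr1 - 34610 36081 1471
FAM138A NR_026818 chr1 - 34610 36081 1471
EOF

# Keep line 1 (the header) and, after that, only the first line for each gene name.
# seen[$1]++ is 0 (false) the first time a gene appears, so !seen[$1]++ is true once.
awk 'NR == 1 || !seen[$1]++' test.txt > output.txt

cat output.txt
```

This keeps the input order and should reproduce the desired final file above: the header followed by six rows, one per gene. Which isoform survives among equal lengths is simply whichever comes first in the file.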

Last edited by wolf_blue; 10-21-2011 at 04:27 PM..
 