help! script to select line with greatest value 2 between columns

10-20-2011


Hi,
I’m trying to do something I haven’t done before and I’m struggling with how to even create the command or script.
I have the following space-delimited file:

Code:

gene    accession    chr     chr_st    begin    end       
NN1        NC_024540    chr3    -    14000    14020    
NN1        NC_024543    chr3    -    14050    14060    
ATG        NC_01        chr12    +    12000    12100    
ATG        NC_02        chr12    +    12100    12300

This is an example file.
I want to modify the file so that only the longest form of each gene is written to a new file.
So first I want to calculate the length of each form of each gene, row by row, where length = end value - begin value:

Code:

gene    accession    chr     chr_st    begin    end        length 
NN1        NC_024540    chr3    -    14000    14020    20
NN1        NC_024543    chr3    -    14050    14060    10
ATG        NC_01        chr12    +    12000    12100    100
ATG        NC_02        chr12    +    12100    12300    200

I want to be able to create a file with only the genes with greatest length which would look like this:

Code:

gene    accession    chr      chr_st    begin    end         length
NN1        NC_024540    chr3    -        14000    14020    20
ATG        NC_02        chr12    +        12100    12300    200

Any help is really appreciated. I have 30,000 genes with different forms like this!
Thanks!

Last edited by zxmaus; 10-21-2011 at 04:41 AM.. Reason: added code tags

wolf_blue


10-20-2011


Code:

nawk 'NR>1{print $0,($NF-$(NF-1))}' yourfile

Code:

nawk 'NR<2{next}{c=($NF-$(NF-1))}!($1 in A)||(c>m[$1]&&($1 in A)){m[$1]=c;A[$1]=$0 FS m[$1]}END{for(i in A) print A[i]}' yourfile

NR>1 and NR<2{next} are only there to skip the heading line 'gene accession chr chr_st begin end'. Feel free to remove them if your files have no heading line.
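For readability, here is a sketch of the second one-liner spread over several lines with comments. It is meant to behave the same way, under a few assumptions: plain `awk` is used in place of `nawk`, the array names `best` and `max` are renamed from `A` and `m`, the sample rows are the ones from the first post, and the final `sort` is only there because `for (g in best)` returns keys in no particular order.

```shell
# Sample input from the first post, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
gene accession chr chr_st begin end
NN1 NC_024540 chr3 - 14000 14020
NN1 NC_024543 chr3 - 14050 14060
ATG NC_01 chr12 + 12000 12100
ATG NC_02 chr12 + 12100 12300
EOF

result=$(awk '
NR < 2 { next }                       # skip the header line
{ len = $NF - $(NF-1) }               # length = end - begin
!($1 in best) || len > max[$1] {      # first row for this gene, or a strictly longer one
    max[$1]  = len                    # remember the longest length so far
    best[$1] = $0 FS len              # remember the row, with the length appended
}
END { for (g in best) print best[g] } # one row per gene, order unspecified
' "$tmp" | LC_ALL=C sort)

printf '%s\n' "$result"
rm -f "$tmp"
```

Because the comparison is a strict `>`, when two forms of a gene have the same length the first one read is the one that is kept.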

Last edited by ctsgnb; 10-20-2011 at 03:04 PM..

ctsgnb


10-20-2011


clarification please?

Thanks! I got the file, but could you please explain what these commands are doing? I only used the second command to generate a new file. Am I doing it correctly that way? Thanks so much!

wolf_blue


10-21-2011


Code:

 
nawk 'NR>1{print $0,($NF-$(NF-1))}' yourfile

NR > 1 ---> skips the first line (the header line)
print $0 ---> $0 is the entire line (e.g. NN1 NC_024540 chr3 - 14000 14020)
NF ---> the number of fields on the line (6 in your case)
$NF ---> the value of the last field (14020)
$(NF-1) ---> the value of the 5th field (14000)
So $NF-$(NF-1) is end minus begin, i.e. the length, and it is printed after the original line.
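To see those pieces working together, here is a small self-contained run on the sample rows from the first post. It is only a sketch: it assumes plain `awk` behaves like `nawk` for this command (it should on most systems), and the temporary file and the `result` variable are just scaffolding for the demonstration.

```shell
# Sample input from the first post, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
gene accession chr chr_st begin end
NN1 NC_024540 chr3 - 14000 14020
NN1 NC_024543 chr3 - 14050 14060
ATG NC_01 chr12 + 12000 12100
ATG NC_02 chr12 + 12100 12300
EOF

# Skip the header, then print each row followed by end - begin.
result=$(awk 'NR > 1 { print $0, $NF - $(NF-1) }' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

This should print the four data rows with 20, 10, 100 and 200 appended as a seventh column.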

itkamaraj


10-21-2011


not working

I'm still getting a file with duplicate genes.

When I run this command:

Code:

nawk 'NR<2{next}{c=($NF-$(NF-1))}!($1 in A)||(c>m[$1]&&($1 in A)){m[$1]=c;A[$1]=$0 FS m[$1]}END{for(i in A) print A[i]}' yourfile

for a file like this:

Code:

gene   accession    chr   chr_st   begin   end   length
NN1    NC_024540  chr3    -       14000 14020  20
NN1    NC_024543  chr3    -       14050 14060  10
ATG    NC_01        chr12  +       12000 12100  100
ATG    NC_02        chr12  +       12100 12300  200

I end up with genes written to the file more than once, when I just want the longest form of each gene written to one file.
I'm really grateful for your help.

Last edited by Franklin52; 10-21-2011 at 09:21 AM.. Reason: Please use code tags, thank you

wolf_blue


10-21-2011


It works for me:

Code:

$ cat tst.txt
gene accession chr chr_st begin end length
NN1 NC_024540 chr3 - 14000 14020
NN1 NC_024543 chr3 - 14010 14060
NN1 NC_024543 chr3 - 14050 14060
ATG NC_01 chr12 + 12000 12100
ATG NC_02 chr12 + 12100 12900
ATG NC_02 chr12 + 12100 12300

Code:

$ nawk 'NR<2{next}{c=($NF-$(NF-1))}(!($1 in A))||(c>m[$1]&&($1 in A)){m[$1]=c;A[$1]=$0 FS m[$1]}END{for(i in A) print A[i]}' tst.txt
NN1 NC_024543 chr3 - 14010 14060 50
ATG NC_02 chr12 + 12100 12900 800

Please upload your entire input file, and also provide the output you get as well as the exact command you ran (make sure you didn't forget any parentheses).

ctsgnb


10-21-2011


still have duplicates

Thanks.
It works but I realized that there are genes that are in the output file more than once because some isoforms happen to have the same length.
So now I would have to take that file and create a new one with only one gene per line.
I used the code below on the original file.

Code:

nawk 'NR<2{next}{c=($NF-$(NF-1))}(!($1 in A))||(c>m[$1]&&($1 in A)){m[$1]=c;A[$1]=$0 FS m[$1]}END{for(i in A) print A[i]}' tst.txt

Output file:
test.txt

gene accession chr chr_st begin end length
TNFRSF18 NM_004195 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148902 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148901 chr1 - 1138887 1142089 3202
MIB2 NM_080875 chr1 + 1550794 1565990 15196
MIB2 NM_001170688 chr1 + 1550794 1565990 15196
MIB2 NM_001170687 chr1 + 1550794 1565990 15196
MIB2 NM_001170686 chr1 + 1550794 1565990 15196
CDK11A NM_024011 chr1 - 1634169 1655791 21622
CDK11A NM_033529 chr1 - 1634169 1655791 21622
WASH7P NR_024540 chr1 - 14361 29370 15009
FAM138F NR_026820 chr1 - 34610 36081 1471
FAM138A NR_026818 chr1 - 34610 36081 1471

So from this final file, how can I make a file that has only one line per gene?
I would want the output file to be modified as:

Desired final file:
gene accession chr chr_st begin end length
TNFRSF18 NM_004195 chr1 - 1138887 1142089 3202
MIB2 NM_080875 chr1 + 1550794 1565990 15196
CDK11A NM_024011 chr1 - 1634169 1655791 21622
WASH7P NR_024540 chr1 - 14361 29370 15009
FAM138F NR_026820 chr1 - 34610 36081 1471
FAM138A NR_026818 chr1 - 34610 36081 1471

I hope this is clearer.
Thanks!
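
For the leftover duplicates, one possible approach (a sketch, not something that was posted in this thread) is to keep only the first line seen for each gene name in column 1. The array name `seen` is arbitrary, plain `awk` is assumed in place of `nawk`, and the sample rows are the ones from the output file above.

```shell
# The output file shown above, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
gene accession chr chr_st begin end length
TNFRSF18 NM_004195 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148902 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148901 chr1 - 1138887 1142089 3202
MIB2 NM_080875 chr1 + 1550794 1565990 15196
MIB2 NM_001170688 chr1 + 1550794 1565990 15196
MIB2 NM_001170687 chr1 + 1550794 1565990 15196
MIB2 NM_001170686 chr1 + 1550794 1565990 15196
CDK11A NM_024011 chr1 - 1634169 1655791 21622
CDK11A NM_033529 chr1 - 1634169 1655791 21622
WASH7P NR_024540 chr1 - 14361 29370 15009
FAM138F NR_026820 chr1 - 34610 36081 1471
FAM138A NR_026818 chr1 - 34610 36081 1471
EOF

# Keep the header (NR == 1) and the first row for each value of column 1.
# seen[$1]++ is 0 (false) the first time a gene appears, so !seen[$1]++ is true only once.
result=$(awk 'NR == 1 || !seen[$1]++' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

This keeps the input order and picks whichever isoform comes first in the file, so FAM138F and FAM138A both survive because they are different gene names. If a particular accession should win a tie, the file would need to be sorted accordingly first.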

Last edited by wolf_blue; 10-21-2011 at 04:27 PM..

wolf_blue

