Script to search and extract the gene sub-location from gff file. | Unix Linux Forums | Shell Programming and Scripting


#1  reena2305, 06-27-2011

Hi, my problem is that I have two files. File no. 1 is a gff text file (say gi1) that has gene information like:

********************

Code:
     gene            39389788..39395643
                     /gene="RPSA"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /db_xref="GeneID:3921"
                     /db_xref="HGNC:6502"
                     /db_xref="HPRD:01038"
                     /db_xref="MIM:150370"
     mRNA            join(39389788..39389839,39390696..39390861,
                     39391681..39391799,39393855..39394100,39394750..39394878,
                     39394997..39395162,39395375..39395643)
                     /gene="RPSA"
                     /product="ribosomal protein SA, transcript variant 1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /transcript_id="NM_002295.4"
                     /db_xref="GI:70609879"
                     /db_xref="GeneID:3921"
                     /db_xref="HGNC:6502"
                     /db_xref="HPRD:01038"
                     /db_xref="MIM:150370"
     mRNA            join(39390696..39390861,39391681..39391799,
                     39393855..39394100,39394750..39394878,39394997..39395162,
                     39395375..39395643)
                     /gene="RPSA"
                     /product="ribosomal protein SA, transcript variant 2"
                     /exception="unclassified transcription discrepancy"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /transcript_id="NM_001012321.1"
                     /db_xref="GI:59859884"
                     /db_xref="GeneID:3921"
                     /db_xref="HGNC:6502"
                     /db_xref="HPRD:01038"
                     /db_xref="MIM:150370"
     CDS             join(39390729..39390861,39391681..39391799,
                     39393855..39394100,39394750..39394878,39394997..39395162,
                     39395375..39395469)
                     /gene="RPSA"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /codon_start=1
                     /product="40S ribosomal protein SA"
                     /protein_id="NP_001012321.1"
                     /db_xref="GI:59859885"
                     /db_xref="CCDS:CCDS2686.1"
                     /db_xref="GeneID:3921"
                     /db_xref="HGNC:6502"
                     /db_xref="HPRD:01038"
                     /db_xref="MIM:150370"
     CDS             join(39390729..39390861,39391681..39391799,
                     39393855..39394100,39394750..39394878,39394997..39395162,
                     39395375..39395469)
                     /gene="RPSA"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /codon_start=1
                     /product="40S ribosomal protein SA"
                     /protein_id="NP_002286.2"
                     /db_xref="GI:9845502"
                     /db_xref="CCDS:CCDS2686.1"
                     /db_xref="GeneID:3921"
                     /db_xref="HGNC:6502"
                     /db_xref="HPRD:01038"
                     /db_xref="MIM:150370"
     gene            39391466..39391614
                     /gene="SNORA6"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /db_xref="GeneID:574040"
                     /db_xref="HGNC:32591"
     ncRNA           39391466..39391614
                     /gene="SNORA6"
                     /ncRNA_class="snoRNA"
                     /product="small nucleolar RNA, H/ACA box 6"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /transcript_id="NR_002325.1"
                     /db_xref="GI:68510025"
                     /db_xref="GeneID:574040"
                     /db_xref="HGNC:32591"
     gene            39394155..39394308
                     /gene="SNORA62"
                     /note="Derived by automated computational analysis using...

*****************************************
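The location column above comes in two shapes: a plain `start..end` range and a `join(...)` list that can wrap over several lines. A minimal Python sketch of turning either shape into a list of (start, end) pairs follows. The function name `parse_location` is hypothetical, and the sketch assumes only the forms shown in the sample (it does not handle single-base positions, and it ignores any `complement(...)` wrapper).

```python
import re

def parse_location(text):
    """Return [(start, end), ...] for 'a..b' or 'join(a..b,c..d,...)'.

    Line breaks and spaces inside a wrapped join(...) are simply skipped,
    because the regex only picks out the 'number..number' pairs.
    """
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", text)]

print(parse_location("39389788..39395643"))
print(parse_location("join(39390696..39390861,39391681..39391799,\n"
                     "     39393855..39394100)"))
```

The excerpt looks like a GenBank-style feature table rather than tab-delimited GFF, which is why the sketch works on the location text instead of splitting columns.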

File no. 2 is a mapped text file like:

*********************************

Code:
 Gene_input_file: f3

sno_input_file: chr3


319 found_in_gene 52698648..52707224 at 52704105 and_count: 5457
68 found_in_gene 52698648..52707224 at 52705463 and_count: 6815
82 found_in_gene 52698648..52707224 at 52701967 and_count: 3319
124 found_in_gene 39793218..40244467 at 40222682 and_count: 429464
202 found_in_gene 9443305..10558922 at 10110734 and_count: 667429
228 found_in_gene 46262602..46896241 at 46629723 and_count: 367121
..and so on.

**************************************
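Each data line of file 2 has a fixed shape: an ID, the gene range, a position after `at`, and a count. A minimal Python sketch of reading those lines, assuming that exact layout, is shown here. The names `parse_hits` and `LINE_RE` are hypothetical, and header lines such as `Gene_input_file: f3` are skipped because they do not match the pattern.

```python
import re

# ID, "found_in_gene", start..end, "at", position, "and_count:", count
LINE_RE = re.compile(
    r"^\s*(\d+)\s+found_in_gene\s+(\d+)\.\.(\d+)\s+at\s+(\d+)\s+and_count:\s*(\d+)"
)

def parse_hits(lines):
    """Return one dict per matching line; everything else is ignored."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            hit_id, start, end, pos, count = map(int, m.groups())
            hits.append({"id": hit_id, "gene": (start, end), "pos": pos, "count": count})
    return hits

sample = [
    "Gene_input_file: f3",
    "sno_input_file: chr3",
    "319 found_in_gene 52698648..52707224 at 52704105 and_count: 5457",
    "68 found_in_gene 52698648..52707224 at 52705463 and_count: 6815",
    "..and so on.",
]
for hit in parse_hits(sample):
    print(hit)
```

In every sample line, `and_count` equals the position minus the gene start (for example 52704105 - 52698648 = 5457), so it appears to be an offset into the gene.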

So I need to extract the region from file 2, say 52698648..52707224 for ID 319, which begins at position 52704105 in the gff file, and then search for it in file 1 to find the sub-location within this gene, i.e. whether it's in CDS, mRNA, etc. If it's not found, the output should be:


Code:
'319 not found Intron'

else, if its found, output should be

Code:
'319 found_in mRNA.'


Please help me with a shell script or Perl (or both). I am new to the Linux world.
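The found / not-found decision described above can be sketched in Python as a containment check of the `at` position against each feature's segments. This is only a sketch under assumptions: the function name `classify`, the test positions, and the IDs 1 to 3 are hypothetical. Listing every matching feature type (for example `mRNA,CDS`) is one possible choice, since the post does not say which to report when several match.

```python
def classify(hit_id, pos, features):
    """features: list of (feature_type, [(start, end), ...]) for one gene.

    Coordinates are treated as inclusive on both ends.
    """
    found = []
    for ftype, segments in features:
        inside = any(start <= pos <= end for start, end in segments)
        if inside and ftype not in found:
            found.append(ftype)
    if found:
        return f"{hit_id} found_in {','.join(found)}"
    return f"{hit_id} not found Intron"

# Segments copied from the RPSA record in the sample (variant 1 mRNA and its CDS).
rpsa = [
    ("mRNA", [(39389788, 39389839), (39390696, 39390861), (39391681, 39391799),
              (39393855, 39394100), (39394750, 39394878), (39394997, 39395162),
              (39395375, 39395643)]),
    ("CDS", [(39390729, 39390861), (39391681, 39391799), (39393855, 39394100),
             (39394750, 39394878), (39394997, 39395162), (39395375, 39395469)]),
]
print(classify(1, 39390700, rpsa))  # inside an mRNA exon, before the CDS start
print(classify(2, 39390000, rpsa))  # between the first two exons
print(classify(3, 39390800, rpsa))  # inside both mRNA and CDS
```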

Last edited by pludi; 06-27-2011 at 02:55 PM..
#2  panyam (Forum Advisor), 06-27-2011
Neither your statements nor your sample data explain the problem fully.

Please use code tags when you post the sample data.
#3  reena2305, 06-28-2011
@panyam

Sorry, this was my first post, so I wasn't sure how to do it.

Regarding problem:


Code:
gene            39389788..39395643

This is the position of one particular gene in the whole genome. The gene is made up of CDS, mRNA, introns, etc., and that information is listed right below it, like:


Code:
mRNA            join(39389788..39389839,39390696
 CDS             join(39390729..39390861,39391681..39..

and so on, until the information for the next gene begins, like (say gene2):

Code:
gene            39391466..39391614

So I have a file with these 'gene' locations, and I need to extract the sub-location, i.e. whether it's in CDS, mRNA, or Intron (in case no match is found).
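The grouping described here (every feature line belongs to the most recent `gene` line until the next `gene` line appears) can be sketched in Python as follows. The names `read_genes` and `FEATURE_RE` are hypothetical. The sketch assumes feature lines look like `key   location`, that qualifier lines start with `/`, and that a wrapped `join(...)` continues until its parentheses balance.

```python
import re

# A feature line: a key, whitespace, then a location that starts with digits
# (optionally wrapped in join( / complement( ). Qualifier lines start with '/'.
FEATURE_RE = re.compile(r"^\s*(\w+)\s+((?:join\(|complement\()*<?\d+\.\.>?\d+\S*)\s*$")
SEGMENT_RE = re.compile(r"<?(\d+)\.\.>?(\d+)")

def read_genes(lines):
    """Return [{'span': (start, end), 'features': [(key, segments), ...]}, ...]."""
    genes = []
    it = iter(lines)
    for line in it:
        m = FEATURE_RE.match(line)
        if not m:
            continue  # qualifier or wrapped note text
        key, loc = m.groups()
        # Pull in continuation lines of a wrapped join(...)
        while loc.count("(") > loc.count(")"):
            nxt = next(it, None)
            if nxt is None:
                break
            loc += nxt.strip()
        segments = [(int(a), int(b)) for a, b in SEGMENT_RE.findall(loc)]
        if not segments:
            continue
        if key == "gene":
            genes.append({"span": segments[0], "features": []})
        elif genes:
            genes[-1]["features"].append((key, segments))
    return genes

sample = """\
     gene            39389788..39395643
                     /gene="RPSA"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
     mRNA            join(39389788..39389839,39390696..39390861,
                     39391681..39391799,39393855..39394100,39394750..39394878,
                     39394997..39395162,39395375..39395643)
                     /gene="RPSA"
     gene            39391466..39391614
                     /gene="SNORA6"
     ncRNA           39391466..39391614
""".splitlines()

for gene in read_genes(sample):
    print(gene["span"], [key for key, _ in gene["features"]])
```

Note that the wrapped note line `gene prediction method: BestRefseq."` also starts with the word `gene`, so the pattern requires a numeric location after the key to avoid treating it as a new record.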

The location of the gene (that we need to find) is in a separate file:


Code:
Gene_input_file: f3
sno_input_file: chr3
319 found_in_gene 52698648..52707224 at 52704105 and_count: 5457
68 found_in_gene 52698648..52707224 at 52705463 and_count: 6815
82 found_in_gene 52698648..52707224 at 52701967 and_count: 3319
124 found_in_gene 39793218..40244467 at 40222682 and_count: 429464
202 found_in_gene 9443305..10558922 at 10110734 and_count: 667429
228 found_in_gene 46262602..46896241 at 46629723 and_count: 367121
..and so on.

I have to read it line by line, extract the gene position, and then search for it in the main gene info (gff) file, like:


Code:
52698648..52707224 (of file2) match it in file1 and print its sub-location.

Note: '..' denotes FROM position 52698648 TO 52707224.
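Matching a range from file 2 against the gene lines of file 1 can be sketched in Python as an exact lookup on the (FROM, TO) pair. The names `span_from_text` and `index_by_span` are hypothetical, the `name` field is added only for illustration, and the sketch assumes the two files use identical coordinates for the same gene.

```python
def span_from_text(text):
    """'52698648..52707224' -> (52698648, 52707224)."""
    start, end = text.split("..")
    return int(start), int(end)

def index_by_span(genes):
    """Map each gene's (start, end) tuple to its record for exact-span lookup."""
    return {gene["span"]: gene for gene in genes}

# Two gene spans copied from the file 1 excerpt.
genes = [
    {"span": (39389788, 39395643), "name": "RPSA"},
    {"span": (39391466, 39391614), "name": "SNORA6"},
]
index = index_by_span(genes)

print(index.get(span_from_text("39391466..39391614")))
# This range from file 2 is not part of the short excerpt, so the lookup returns None.
print(index.get(span_from_text("52698648..52707224")))
```

A dictionary keeps each lookup constant-time, which matters when file 2 has many lines and file 1 covers a whole chromosome.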

Last edited by reena2305; 06-28-2011 at 12:32 AM..