Script to search and extract the gene sub-location from gff file.


 
# 1  
Old 06-27-2011

Hi, I have two files. File 1 is a gff text file (say gi1) that holds gene information like this:

********************
Code:
     gene            39389788..39395643
                     /gene="RPSA"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /db_xref="GeneID:3921"
                     /db_xref="HGNC:6502"
                     /db_xref="HPRD:01038"
                     /db_xref="MIM:150370"
     mRNA            join(39389788..39389839,39390696..39390861,
                     39391681..39391799,39393855..39394100,39394750..39394878,
                     39394997..39395162,39395375..39395643)
                     /gene="RPSA"
                     /product="ribosomal protein SA, transcript variant 1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /transcript_id="NM_002295.4"
                     /db_xref="GI:70609879"
                     /db_xref="GeneID:3921"
                     /db_xref="HGNC:6502"
                     /db_xref="HPRD:01038"
                     /db_xref="MIM:150370"
     mRNA            join(39390696..39390861,39391681..39391799,
                     39393855..39394100,39394750..39394878,39394997..39395162,
                     39395375..39395643)
                     /gene="RPSA"
                     /product="ribosomal protein SA, transcript variant 2"
                     /exception="unclassified transcription discrepancy"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /transcript_id="NM_001012321.1"
                     /db_xref="GI:59859884"
                     /db_xref="GeneID:3921"
                     /db_xref="HGNC:6502"
                     /db_xref="HPRD:01038"
                     /db_xref="MIM:150370"
     CDS             join(39390729..39390861,39391681..39391799,
                     39393855..39394100,39394750..39394878,39394997..39395162,
                     39395375..39395469)
                     /gene="RPSA"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /codon_start=1
                     /product="40S ribosomal protein SA"
                     /protein_id="NP_001012321.1"
                     /db_xref="GI:59859885"
                     /db_xref="CCDS:CCDS2686.1"
                     /db_xref="GeneID:3921"
                     /db_xref="HGNC:6502"
                     /db_xref="HPRD:01038"
                     /db_xref="MIM:150370"
     CDS             join(39390729..39390861,39391681..39391799,
                     39393855..39394100,39394750..39394878,39394997..39395162,
                     39395375..39395469)
                     /gene="RPSA"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /codon_start=1
                     /product="40S ribosomal protein SA"
                     /protein_id="NP_002286.2"
                     /db_xref="GI:9845502"
                     /db_xref="CCDS:CCDS2686.1"
                     /db_xref="GeneID:3921"
                     /db_xref="HGNC:6502"
                     /db_xref="HPRD:01038"
                     /db_xref="MIM:150370"
     gene            39391466..39391614
                     /gene="SNORA6"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /db_xref="GeneID:574040"
                     /db_xref="HGNC:32591"
     ncRNA           39391466..39391614
                     /gene="SNORA6"
                     /ncRNA_class="snoRNA"
                     /product="small nucleolar RNA, H/ACA box 6"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
                     /transcript_id="NR_002325.1"
                     /db_xref="GI:68510025"
                     /db_xref="GeneID:574040"
                     /db_xref="HGNC:32591"
     gene            39394155..39394308
                     /gene="SNORA62"
                     /note="Derived by automated computational analysis using...

*****************************************
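The listing above looks like a GenBank-style feature table (a feature key, a location that may wrap over several lines, then `/qualifier` lines) rather than tab-delimited GFF. A minimal Python sketch for reading it into a list of features could look like the following. The names `parse_features`, `FEATURE`, `CONTINUATION` and `SPAN` are made up for illustration, and the sketch assumes that every location is built from `start..end` pairs, as in the sample; single-base locations would be ignored.

```python
import re

# A feature line: a key (gene, mRNA, CDS, ...) followed by a location.
FEATURE = re.compile(r"^\s*(\w+)\s+((?:join\(|complement\(|\d).*)$")
# A wrapped location line: only digits, dots, commas and parentheses.
CONTINUATION = re.compile(r"^\s*[\d.,()]+\s*$")
# One start..end pair inside a location.
SPAN = re.compile(r"(\d+)\.\.(\d+)")


def parse_features(lines):
    """Return a list of (feature_type, [(start, end), ...]) in file order."""
    raw = []          # [feature_type, location_text] pairs
    open_loc = False  # True while a location may continue on the next line
    for line in lines:
        m = FEATURE.match(line)
        if m:
            raw.append([m.group(1), m.group(2).strip()])
            open_loc = True
        elif open_loc and CONTINUATION.match(line):
            raw[-1][1] += line.strip()
        else:
            open_loc = False  # qualifier line such as /gene="RPSA"
    return [
        (ftype, [(int(a), int(b)) for a, b in SPAN.findall(loc)])
        for ftype, loc in raw
    ]


# A few lines copied from the sample in the post.
SAMPLE = """\
     gene            39389788..39395643
                     /gene="RPSA"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq."
     mRNA            join(39389788..39389839,39390696..39390861,
                     39391681..39391799,39393855..39394100,39394750..39394878,
                     39394997..39395162,39395375..39395643)
                     /gene="RPSA"
""".splitlines()

for ftype, spans in parse_features(SAMPLE):
    print(ftype, spans)
```

Matching on the shape of the location, not on indentation, keeps the sketch tolerant of pasted text whose leading spaces have shifted. It also stops the wrapped note line that begins with the word "gene" from being mistaken for a feature.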

File 2 is a mapped text file like this:

*********************************
Code:
 Gene_input_file: f3

sno_input_file: chr3


319 found_in_gene 52698648..52707224 at 52704105 and_count: 5457
68 found_in_gene 52698648..52707224 at 52705463 and_count: 6815
82 found_in_gene 52698648..52707224 at 52701967 and_count: 3319
124 found_in_gene 39793218..40244467 at 40222682 and_count: 429464
202 found_in_gene 9443305..10558922 at 10110734 and_count: 667429
228 found_in_gene 46262602..46896241 at 46629723 and_count: 367121
..and so on.

**************************************
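Each data line of file 2 carries an ID, a gene range, a position and a count. A short Python sketch for pulling those fields out is shown below. The names `parse_hits` and `HIT` and the dictionary keys are hypothetical, and the sketch assumes the line layout is exactly the one in the sample. In every sample row shown, `and_count` equals the position minus the gene start, which the last loop makes visible.

```python
import re

# id, gene start..end, hit position, and the trailing count.
HIT = re.compile(
    r"^\s*(\S+)\s+found_in_gene\s+(\d+)\.\.(\d+)\s+at\s+(\d+)\s+and_count:\s*(\d+)"
)


def parse_hits(lines):
    """Return one dict per 'found_in_gene' line; other lines are skipped."""
    hits = []
    for line in lines:
        m = HIT.match(line)
        if m is None:
            continue  # header lines, blank lines, anything unrecognised
        hit_id = m.group(1)
        start, end, pos, count = (int(g) for g in m.groups()[1:])
        hits.append({"id": hit_id, "gene": (start, end), "pos": pos, "count": count})
    return hits


# Header and two data lines copied from the sample in the post.
SAMPLE = """\
 Gene_input_file: f3

sno_input_file: chr3

319 found_in_gene 52698648..52707224 at 52704105 and_count: 5457
124 found_in_gene 39793218..40244467 at 40222682 and_count: 429464
""".splitlines()

for hit in parse_hits(SAMPLE):
    offset = hit["pos"] - hit["gene"][0]  # distance from the gene start
    print(hit["id"], hit["gene"], hit["pos"], offset, hit["count"])
```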

So, I need to extract the region from file 2, say 52698648..52707224 for ID 319, which begins at position 52704105 in the gff file. Then I need to search file 1 for the sub-location within this gene, i.e. whether it is in a CDS, an mRNA, etc. If it is not found, the output should be:

Code:
'319 not found Intron'

Otherwise, if it is found, the output should be:

Code:
'319 found_in mRNA'
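One way to read the requirement is as an interval test: a position counts as "found" when it lies inside any `start..end` segment of a sub-feature of its gene, and as an intron otherwise. The Python sketch below works under that reading, which is an assumption, as is the choice to report the first matching feature in file order. The function names `classify` and `report`, the IDs `x1` to `x3` and the three test positions are made up; the spans are the RPSA mRNA (variant 1) and CDS segments from file 1, because the genes listed in file 2 do not appear in the excerpt shown.

```python
def classify(pos, features):
    """Return the type of the first feature with a segment containing pos, or None."""
    for ftype, spans in features:
        if any(start <= pos <= end for start, end in spans):
            return ftype
    return None


def report(hit_id, pos, features):
    """Format the result line in the style requested in the post."""
    ftype = classify(pos, features)
    if ftype is None:
        return f"{hit_id} not found Intron"
    return f"{hit_id} found_in {ftype}"


# Sub-features of the RPSA gene, copied from the file 1 sample.
RPSA = [
    ("mRNA", [(39389788, 39389839), (39390696, 39390861), (39391681, 39391799),
              (39393855, 39394100), (39394750, 39394878), (39394997, 39395162),
              (39395375, 39395643)]),
    ("CDS", [(39390729, 39390861), (39391681, 39391799), (39393855, 39394100),
             (39394750, 39394878), (39394997, 39395162), (39395375, 39395469)]),
]

# Hypothetical IDs and positions, chosen only to exercise each branch.
print(report("x1", 39390700, RPSA))  # inside an mRNA exon, before the CDS starts
print(report("x2", 39390000, RPSA))  # between the first two exons
print(report("x3", 39390800, RPSA))  # inside both mRNA and CDS; mRNA is listed first
```

If the most specific feature is wanted instead (CDS before mRNA), the feature list can be reordered before calling `classify`, or the function can be changed to collect every matching type.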


Please help me with a shell script or a Perl script (or both). I am new to the Linux world.

Last edited by pludi; 06-27-2011 at 03:55 PM..
# 2  
Old 06-27-2011
Neither your statements nor your sample data explain the problem fully.

Please use code tags when you post the sample data.
# 3  
Old 06-28-2011
@panyam

Sorry, this was my first post, so I didn't know the conventions.

Regarding the problem:

Code:
gene            39389788..39395643

This is the position of one particular gene in the whole genome. The gene is made up of CDS, mRNA, introns, etc., and that information is listed right below it, like:

Code:
mRNA            join(39389788..39389839,39390696
 CDS             join(39390729..39390861,39391681..39..

and so on, until the information for the next gene begins, like (say gene2):
Code:
gene            39391466..39391614

So I have a file with these 'gene' locations, and now I need to extract the sub-location, i.e. whether it is in a CDS, an mRNA or an intron (when no match is found).
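The rule described here, that everything listed after a `gene` line belongs to that gene until the next `gene` line, can be sketched in Python as a single pass that keys each block by the gene's `(start, end)` range. The function name `group_by_gene` is hypothetical, and the input is assumed to be a list of `(feature_type, spans)` tuples already read from file 1; the values below are taken from the sample, shortened to the first two segments of each feature.

```python
def group_by_gene(features):
    """Map each gene's (start, end) to the sub-features listed below it."""
    genes = {}
    current = None  # range of the gene block being filled
    for ftype, spans in features:
        if ftype == "gene" and spans:
            current = spans[0]
            genes.setdefault(current, [])
        elif current is not None:
            genes[current].append((ftype, spans))
    return genes


# Features in file order, shortened from the file 1 sample.
FEATURES = [
    ("gene", [(39389788, 39395643)]),
    ("mRNA", [(39389788, 39389839), (39390696, 39390861)]),
    ("CDS", [(39390729, 39390861), (39391681, 39391799)]),
    ("gene", [(39391466, 39391614)]),
    ("ncRNA", [(39391466, 39391614)]),
]

genes = group_by_gene(FEATURES)
for gene_range, subfeatures in genes.items():
    print(gene_range, [ftype for ftype, _ in subfeatures])

# A gene range from file 2 is then a plain dictionary lookup.
print(genes.get((52698648, 52707224)))  # None: this gene is not in the excerpt
```

Grouping by file order, not by coordinates, matters for this data: SNORA6 (39391466..39391614) lies inside the RPSA range, so a purely positional grouping would attach its ncRNA to the wrong gene.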

The locations of the genes that we need to find are in a separate file:

Code:
Gene_input_file: f3
sno_input_file: chr3
319 found_in_gene  52698648..52707224 at 52704105 and_count: 5457
68 found_in_gene  52698648..52707224 at 52705463 and_count: 6815
82 found_in_gene  52698648..52707224 at 52701967 and_count: 3319
124 found_in_gene  39793218..40244467 at 40222682 and_count: 429464
202 found_in_gene  9443305..10558922 at 10110734 and_count: 667429
228 found_in_gene  46262602..46896241 at 46629723 and_count: 367121
..and so on.

I have to read it line by line, extract the gene position, then search for it in the main gene info (gff) file, like:

Code:
52698648..52707224 (of file2) match it in file1 and print its sub-location.

Note: '..' denotes FROM position 52698648 TO 52707224.

Last edited by reena2305; 06-28-2011 at 01:32 AM..