GenBank Perl help...


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting GenBank Perl help...
# 1  
Old 05-24-2009
GenBank Perl help...

Hey guys,

I'm doing some Perl scripting for genomic data out of GenBank files...I have to extract the name of the plant, the file name, the number of bases, and all of the genes including their starting and ending positions...for example, with this GenBank file,


LOCUS NC_010093 153819 bp DNA circular PLN 07-MAY-2009
DEFINITION Acorus americanus chloroplast, complete genome.
ACCESSION NC_010093
VERSION NC_010093.1 GI:161622288
DBLINK Project:27981
KEYWORDS .
SOURCE chloroplast Acorus americanus
ORGANISM Acorus americanus
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; Liliopsida; Acoraceae; Acorus.
REFERENCE 1 (bases 1 to 153819)
AUTHORS Peery,R.M., Chumley,T.W., Kuehl,J.V., Boore,J.L. and Raubeson,L.A.
TITLE The complete chloroplast genome of Acorus americanus
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 153819)
CONSRTM NCBI Genome Project
TITLE Direct Submission
JOURNAL Submitted (03-DEC-2007) National Center for Biotechnology
Information, NIH, Bethesda, MD 20894, USA
REFERENCE 3 (bases 1 to 153819)
AUTHORS Peery,R.M., Chumley,T.W., Kuehl,J.V., Boore,J.L. and Raubeson,L.A.
TITLE Direct Submission
JOURNAL Submitted (09-NOV-2007) Department of Biological Sciences, Central
Washington University, 400 E University Way, Ellensburg, WA
98926-7537, USA
COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. The reference sequence was derived from EU273602.
COMPLETENESS: full length.
FEATURES Location/Qualifiers
source 1..153819
/organism="Acorus americanus"
/organelle="plastid:chloroplast"
/mol_type="genomic DNA"
/db_xref="taxon:263995"
gene complement(join(96591..97384,69098..69211))
/gene="rps12"
/locus_tag="AcamCp045"
/trans_splicing
/db_xref="GeneID:5777700"
CDS complement(join(96591..96616,97153..97384,69098..69211))
/gene="rps12"
/locus_tag="AcamCp045"
/trans_splicing
/note="trans-splices to 5' rps12 exon within the LSC"
/codon_start=1
/transl_table=11
/product="ribosomal protein S12"
/protein_id="YP_001586161.1"
/db_xref="GI:161622289"
/db_xref="GeneID:5777700"
/translation="MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKK
PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTL
DAVGVKDRQQGRSKYGVKKPK"
misc_feature 1..83496
/note="large single copy region (LSC)"
gene complement(136..1197)
/gene="psbA"
/locus_tag="AcamCp001"
/db_xref="GeneID:5777757"

From this I would need to extract the NC number next to Locus for filename, the name next to organism, and the number in front of bp for bases. Then for every time there's a new gene, return what is next to gene=, and then the 2 numbers inside the complement parentheses for the start and end of that gene sequence.

Also, is there a way to make so if I put in the NC number I want, it spits back out the rest of the info...the website this is from is Nucleotide - Acorus americanus chloroplast, complete genome, so if I put in the number I wanted next to the NC, then it would give me the information I want as specified above for that specific NC number....?

I've just been having trouble getting the info extracted out of a GenBank file...any help in the coding for this would be great...thanks!
# 2  
Old 05-25-2009
Here's one way of doing it:

Code:
$ 
$ perl -ne 'if (/^LOCUS/) {
>   split; printf("Filename = %s\nBase = %s\n",$_[1],$_[2])
> } elsif (/organism/) {                                   
>   s/^\s*\/o/O/; s/=/ = /; print                 
> } elsif (/gene/) {                                       
>   print                                                  
> }' genbank.txt                                           
Filename = NC_010093                                       
Base = 153819                                              
Organism = "Acorus americanus"                             
     gene            complement(join(96591..97384,69098..69211))
                     /gene="rps12"                              
                     /gene="rps12"                              
     gene            complement(136..1197)                      
                     /gene="psbA"                               
                     /gene="psbA"                               
     gene            complement(1388..3959)                     
                     /gene="trnK"                               
                     /gene="trnK"                               
                     /gene="trnK"                               
                     /gene="trnK"                               
...
... (trimmed a lot of lines here for succinctness)
...
     gene            complement(153492..153566)
                     /gene="trnH"
                     /gene="trnH"
     gene            153733..153819
                     /gene="rps19"
$

You hadn't specified the output format, so I took the liberty to specify one of my own.

Quote:
...Then for every time there's a new gene, ... the 2 numbers inside the complement parentheses for the start and end of that gene sequence...
Note that not all of the lines starting with "gene" have a "complement" parenthesis. Some of the variations are:

Code:
151670..151951             <= number1..number2 format
complement(join(96591..97384,69098..69211))  <= "complement" followed by "join" and 2 number ranges
join(complement(69098..69211),139932..140725) <= "join" followed by "complement" and 2 number ranges

What do you want to display in such cases ?

Quote:
... is there a way to make so if I put in the NC number I want, it spits back out the rest of the info...
This would be useful if the file had more than one NC numbers. But the file from the website you specified, had data for only one NC number.

HTH,
tyler_durden
# 3  
Old 05-25-2009
Hi tyler_durden,

Well, what's going on is I have over 100 GenBank files, each is it's own NC_ number...(this NC number is just a way to catalogue them). I want to be able to give that NC_ number as a command-line possibly and have it go to the internet, replace that website I posted with whatever NC_ number was given, go into that GenBank file, and return something like:

Name: ___________

FileName: ___________ (NC_######)

Bases: ________

Genes: rps (####, #####) - for example, this would be for the rps gene, and then a new line, and all the rest of the genes are given....if the word complement is present before it, then it should print a negative before the numbers in parentheses: rps -(####, #####)


If it's impossible to use the NC_ number as a command-line, it's okay for me to do it manually, just so I can have a program that can do the output of the name, filename, bases, and genes for me. It would save me loads and loads of time in my research, rather than going through and doing each one by hand. Thanks!

-akreibich07
# 4  
Old 05-25-2009
Quote:
Originally Posted by akreibich07
...
Name: ___________
(1) Name of what ?
(2) What's the value of this field for your example above ?

Quote:
Genes: rps (####, #####) - for example, this would be for the rps gene, and then a new line, and all the rest of the genes are given....if the word complement is present before it, then it should print a negative before the numbers in parentheses: rps -(####, #####)
This is not clear. You haven't addressed the occurrence of 2 different number ranges. Also, you haven't told what to do if the word "join" is present.

What should be the output for each of the following cases ? (Type in the required output for each case.)

Code:
(3)  gene 151670..151951
(4)  gene complement(join(96591..97384,69098..69211))
(5)  gene join(complement(69098..69211),139932..140725)
(6)  gene complement(153492..153566)

Quote:
If it's impossible to use the NC_ number as a command-line, it's okay for me to do it manually, just so I can have a program that can do the output of the name, filename, bases, and genes for me. It would save me loads and loads of time in my research, rather than going through and doing each one by hand. Thanks!
Given an NC_ number, is it possible to know the URL of the download text file (not the NCBI website) ?

For example, when you specify the NC_ number "NC_010093" - the website is "http://www.ncbi.nlm.nih.gov/nuccore/NC_010093". This is trivial. But this would not allow you to download the text file that has the data you need and that is embedded in that website.

The download URL for the text file for your NC_ number "NC_010093" is this -
"http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&db=nuccore&val=161622288&dopt=genbank&sendto=on&log$=seqview&extrafeat=976&m axplex=1"

Does this URL ring a bell ? Do you know how to generate this URL, given the NC_ number ? If you do, then you can use LWP::Simple perl module to download the text file to your computer and then process it the way you want.

HTH,
tyler_durden
# 5  
Old 05-25-2009
(1) Name: _________ -- something like "Acorus Americanus"
(2) for situation (3) gene (151670..151951)
(4) gene -(96591..97384, 69098..69211)
(5) gene -(69098..69211), (139932..140725)
(6) gene -(153492..153566)

I'll check on the URL idea you presented...the most important thing to me right now is just getting a script that can be given a whole GenBank file and can give the output I specified. Even if it can only do it for the specific NC_ number I used as an example, if that one works, I can try to find ways to generalize it. Thanks again.

-akreibich07
# 6  
Old 05-26-2009
Quote:
Originally Posted by akreibich07
...the most important thing to me right now is just getting a script that can be given a whole GenBank file and can give the output I specified....
Ok, so what I did was that I went to the hyperlink: "http://www.ncbi.nlm.nih.gov/nuccore/NC_010093" and clicked "Download" -> "GenBank" and saved the resultant "sequences.gb" text file in my computer.

Assuming that you have figured out a way to download one "sequences.gb" for a given NC_ number, (either manually or via some other tool), here's what you can do:

Code:
$ 
$ head -5 sequences.gb
LOCUS       NC_010093             153819 bp    DNA     circular PLN 07-MAY-2009
DEFINITION  Acorus americanus chloroplast, complete genome.                    
ACCESSION   NC_010093                                                          
VERSION     NC_010093.1  GI:161622288                                          
DBLINK      Project:27981                                                      
$                                                                              
$ tail -5 sequences.gb                                                         
   153661 gattttggaa gaaaattcca attttcaaat acaaaaactt taaactttag gtaaaaaaag    
   153721 gaggaatcag ctgtgacacg ttcattaaaa aaaaatcctt ttgtggccaa tcatttattg    
   153781 gaaaaaattg agaagctcaa catgaaagag gagaaagaa                           
//                                                                             

$ 
$ cat process_genbank.pl
#!/usr/bin/perl -w      
my $element;            
my $in = 0;             
$gbfile = "sequences.gb";
open(GB, $gbfile) or die "Can't open $gbfile: $!";
while (<GB>) {                                    
  chomp($line = $_);                              
  if (/^locus/i) {                                
    @x = split /\s+/,$line;                       
    $fname = $x[1];                               
    $base = $x[2];                                
  } elsif (/ organism /i) {                       
    $line =~ s/\s*ORGANISM\s*//g;                 
    $oname = $line;
  } elsif (/ gene /) {
    @x = split/\s+/,$line;
    if ($x[2] =~ /^\d+..\d+$/) {
      $x[2] = "($x[2])";
    } elsif ($x[2] =~ /^complement\(\d+..\d+\)$/) {
      $x[2] =~ s/complement/-/;
    } elsif ($x[2] =~ /^complement\(join\((\d+..\d+),(\d+..\d+)\)\)$/) {
      $x[2] = "-($1,$2)";
    } elsif ($x[2] =~ /^join\(complement\((\d+..\d+)\),(\d+..\d+)\)$/) {
      $x[2] = "-($1),($2)";
    }
    $element = $x[2];
    $in = 1;
  } elsif (/^\s+\/gene="(.*)"/ and $in == 1) {
    $element = $1.":".$element;
    push @genearray, $element;
    $element = "";
    $in = 0;
  }
}
close(GB) or die "Can't close $gbfile: $!";

# Now start printing the information we gathered
print "Name: $oname\n";
print "FileName: $fname\n";
print "Bases: $base\n";
print "Genes:";
foreach $item (@genearray) {
  @x = split/:/, $item;
  printf("\t%-10s  %s\n",$x[0],$x[1]);
}

$
$

The "head" and "tail" commands were executed so that you know which text file I am talking about.

The execution of the perl program is shown below:

Code:
$ 
$ perl process_genbank.pl
Name: Acorus americanus  
FileName: NC_010093      
Bases: 153819            
Genes:  rps12       -(96591..97384,69098..69211)
        psbA        -(136..1197)                
        trnK        -(1388..3959)               
        matK        -(1680..3215)               
        rps16       -(4654..5737)               
        trnQ        -(6563..6634)               
        psbK        (6985..7170)                
        psbI        (7607..7717)                
        trnS        -(7834..7921)               
        trnG        (8847..9678)                
        trnR        (9773..9844)                
        atpA        -(9940..11463)              
        atpF        -(11531..12803)             
        atpH        -(13300..13545)             
        atpI        -(14474..15217)             
        rps2        -(15451..16161)             
        rpoC2       -(16344..20504)             
        rpoC1       -(20670..23478)             
        rpoB        -(23505..26756)             
        trnC        (27763..27833)              
        petN        (28715..28804)              
        psbM        -(29784..29888)             
        trnD        -(30412..30485)             
        trnY        -(30823..30906)             
        trnE        -(30971..31043)             
        trnT        (31561..31632)              
        psbD        (32999..34060)              
        psbC        (34044..35429)              
        trnS        -(35649..35741)             
        psbZ        (36122..36310)              
        trnG        (36587..36657)              
        trnfM       -(36798..36871)             
        rps14       -(37033..37335)             
        psaB        -(37473..39677)             
        psaA        -(39703..41955)             
        ycf3        -(42574..44553)             
        trnS        (45383..45469)              
        rps4        -(45769..46374)             
        trnT        -(46725..46797)             
        trnL        (47541..48151)              
        trnF        (48522..48594)              
        ndhJ        -(48980..49456)             
        ndhK        -(49532..50251)             
        ndhC        -(50305..50667)             
        trnV        -(52580..53260)             
        trnM        (53438..53509)              
        atpE        -(53676..54080)             
        atpB        -(54077..55573)             
        rbcL        (56391..57833)              
        psaI        (59093..59203)              
        ycf4        (59579..60133)              
        cemA        (60675..61364)              
        petA        (61554..62516)              
        psbJ        -(63395..63517)             
        psbL        -(63652..63768)             
        psbF        -(63791..63910)             
        psbE        -(63920..64171)             
        petL        (65441..65536)              
        petG        (65699..65812)              
        trnW        -(65926..65999)             
        trnP        -(66165..66238)             
        psaJ        (66596..66724)              
        rpl33       (67088..67288)              
        rps18       (67428..67733)              
        rpl20       -(68002..68355)             
        rps12       -(69098..69211),(139932..140725)
        clpP        -(69426..71438)                 
        psbB        (71880..73406)                  
        psbT        (73582..73689)                  
        psbN        -(73749..73880)                 
        psbH        (73993..74214)                  
        petB        (74334..75765)                  
        petD        (75956..77164)                  
        rpoA        -(77332..78345)                 
        rps11       -(78414..78830)                 
        rpl36       -(78951..79064)                 
        infA        -(79178..79411)                 
        rps8        -(79543..79941)                 
        rpl14       -(80114..80482)                 
        rpl16       -(80624..81969)                 
        rps3        -(82130..82786)                 
        rpl22       -(82872..83255)                 
        rps19       -(83305..83583)                 
        trnH        (83750..83824)                  
        rpl2        -(83856..85346)                 
        rpl23       -(85365..85646)                 
        trnI        -(85813..85886)                 
        ycf2        (86098..92328)                  
        trnL-CAA    -(92877..92957)                 
        ndhB        -(93521..95753)                 
        rps7        -(96071..96538)                 
        trnV        (99215..99286)                  
        rrn16       (99514..101004)                 
        trnI        (101318..102331)                
        trnA        (102400..103271)                
        rrn23       (103424..106235)                
        rrn4.5      (106334..106436)                
        rrn5        (106654..106774)                
        trnR        (107036..107109)                
        trnN        -(107607..107678)               
        ycf1        (108281..109521)
        ndhF        -(109702..111921)
        rpl32       (112773..112946)
        trnL        (113810..113889)
        ccsA        (113971..114936)
        ndhD        -(115163..116665)
        psaC        -(116780..117025)
        ndhE        -(117270..117575)
        ndhG        -(117668..118201)
        ndhI        -(118543..119082)
        ndhA        -(119160..121362)
        ndhH        -(121364..122545)
        rps15       -(122651..122923)
        ycf1        -(123294..129035)
        trnN        (129638..129709)
        trnR        -(130207..130280)
        rrn5        -(130542..130662)
        rrn4.5      -(130880..130982)
        rrn23       -(131081..133892)
        trnA        -(134045..134916)
        trnI        -(134985..135998)
        rrn16       -(136312..137802)
        trnV        -(138030..138101)
        rps12       (139932..140725)
        rps7        (140778..141245)
        ndhB        (141563..143795)
        trnL        (144359..144439)
        ycf2        -(144988..151218)
        trnI        (151430..151503)
        rpl23       (151670..151951)
        rpl2        (151970..153460)
        trnH        -(153492..153566)
        rps19       (153733..153819)
$
$

Hope that helps,
tyler_durden
This User Gave Thanks to durden_tyler For This Post:
# 7  
Old 05-26-2009
Hi again tyler_durden,

This code is working great for all the files I have tried it on so far. Thank you...it helps more than you know. I had a couple quick questions:

(1) This again probably has something to do with the URL idea you talked about a few replies ago, but I do most of my text editing on Emacs. I have another program that I already made (much more simple), which just goes onto the main NCBI page and returns only the NC_ file numbers that I want and puts them into another file. Is it possible to do something on Emacs where I put in the file I want to run (perl process_genbank.pl) and then after it put in the specific NC_ file that I want it run on (almost as a command-line). And if so, would there be specific changes to the code I would have to make?

(2) If I needed at some point in time the output to be more specific, so each line after genes: said something like

rps12 CDS exon -(####..####), intron -(####..####), exon(etc)
rps16 tRNA (etc)

If you look on that NCBI file, the left margin has gene for each new gene and then under it says either CDS if it's coding a protein, or tRNA, or I believe there might be a few rRNA's in other files, and then has each exon and intron for that particular gene before going onto the next gene. Is it possible to have that output like I said above (one gene per line, just double spaces in between things like the gene name, the CDS/tRNA/rRNA, etc)?

Again, thanks a million for the code you already submitted on here. It's already been a big help with some of my research. Smilie

-akreibich07

-----Post Update-----

Just a quick note...

I don't think I was quite clear in my last message, so I wanted to clarify.

Next to each instance of gene, there is a sequence of numbers that signifies the entire sequence of bases that that gene is contained in. When you go down to the exon, intron, exon subheadings below each gene, each has its own sequence of numbers which is always inside the range of the full sequence of numbers given by the sequence next to gene.

In the input that I just stated I was possibly looking for, I didn't want the full sequence next to gene, just the individual subsequences next to each exon, intron, exon, etc... In an over-simplified example, if the sequence next to gene was (1..117), then I wouldn't want that, but rather something like... exon (1..53), intron (54..98), exon (99..117). It will always (at least I'm banking on it) work out that if the numbers for each exon, intron sequence were concatenated, it would end up being the full sequence from the gene heading.

-akreibich07

-----Post Update-----

If there is no exon, intron, exon subheadings below gene, this indicates that the full sequence is just one full exon with no interrupting introns. In cases like this, the original output that you gave will be good, just with an added CDS/tRNA/rRNA in between the gene name and the range.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Programming

PERL: In a perl-scripttTrying to execute another perl-script that SETS SOME VARIABLES !

I have reviewed many examples on-line about running another process (either PERL or shell command or a program), but do not find any usefull for my needs way. (Reviewed and not useful the system(), 'back ticks', exec() and open()) I would like to run another PERL-script from first one, not... (1 Reply)
Discussion started by: alex_5161
1 Replies

2. Programming

Perl: restrict perl from automaticaly creating a hash branches on check

My issue is that the perl script (as I have done it so far) created empty branches when I try to check some branches on existence. I am using multydimentional hashes: found it as the best way for information that I need to handle. Saing multidimentional I means hash of hashes ... So, I have ... (2 Replies)
Discussion started by: alex_5161
2 Replies

3. UNIX for Advanced & Expert Users

perl and HP-UX : instmodsh in combination with software depot : update inventory for installed Perl

we create a HP-UX software depot with a new perl-modul. after installation of the software depot, the perl module i can't find with instmodsh in the inventory for installed Perl modules. - i have learned of using instmodsh command : i find out what modules are already installed on my system. ... (0 Replies)
Discussion started by: bora99
0 Replies

4. Shell Programming and Scripting

HELP on Perl array / sorting - trying to convert Korn Shell Script to Perl

Hi all, Not sure if this should be in the programming forum, but I believe it will get more response under the Shell Programming and Scripting FORUM. Am trying to write a customized df script in Perl and need some help with regards to using arrays and file handlers. At the moment am... (3 Replies)
Discussion started by: newbie_01
3 Replies

5. Shell Programming and Scripting

Hidden Characters in Regular Expression Matching Perl - Perl Newbie

I am completely new to perl programming. My father is helping me learn said programming language. However, I am stuck on one of the assignments he has given me, and I can't find very much help with it via google, either because I have a tiny attention span, or because I can be very very dense. ... (4 Replies)
Discussion started by: kittyluva2
4 Replies

6. Shell Programming and Scripting

Perl :How to print the o/p of a Perl script on console and redirecting same in log file @ same time.

How can i print the output of a perl script on a unix console and redirect the same in a log file under same directory simultaneously ? Like in Shell script, we use tee, is there anything in Perl or any other option ? (2 Replies)
Discussion started by: butterfly20
2 Replies

7. Shell Programming and Scripting

To retrieve gene name & function thorugh Genbank id (gi|9910297)

Hi , I have list of genbank id's and ref number in this format. gi|9910297|ref|NM_019974.1| I want to retrive the gene name and fuction for each genbank list. I have around 1300 gi numbers in my excel sheet. So anybody can help me to retrive the information from NCBI through perl script... (0 Replies)
Discussion started by: shibujohn82
0 Replies

8. Shell Programming and Scripting

Passing date formats in Perl: i.e. Jul/10/2007 -> 20070710 (yyyymmdd) - Perl

Hi , This script working for fine if pass script-name.sh Jul/10/2007 ,I want to pass 20070710(yyyymmdd) .Please any help it should be appereciated. use Time::Local; my $d = $ARGV; my $t = $ARGV; my $m = ""; @d = split /\//, $d; @t = split /:/, $t; if ( $d eq "Jan" ) { $m = 0 }... (7 Replies)
Discussion started by: akil
7 Replies

9. Shell Programming and Scripting

Extracting DNA sequences from GenBank files using Perl

Hi all, Using Perl, I need to extract DNA bases from a GenBank file for a given plant species. A sample GenBank file is here... Nucleotide This is saved on my computer as NC_001666.gb. I also have a file that is saved on my computer as NC_001666.txt. This text file has a list of all... (5 Replies)
Discussion started by: akreibich07
5 Replies

10. Shell Programming and Scripting

[Perl] Accessing array elements within a sed command in Perl script

I am trying to use a script to replace the header of each file, whose filename are stored within the array $test, using the sed command within a Perl script as follows: $count = 0; while ( $count < $#test ) { `sed -e 's/BIOGRF 321/BIOGRF 332/g' ${test} > 0`; `cat 0 >... (2 Replies)
Discussion started by: userix
2 Replies
Login or Register to Ask a Question