GenBank Perl help...

05-24-2009

Registered User

10, 0

Join Date: May 2009

Last Activity: 25 June 2009, 1:24 PM EDT

Posts: 10

Thanks Given: 0

Thanked 0 Times in 0 Posts

GenBank Perl help...

Hey guys,

I'm doing some Perl scripting for genomic data out of GenBank files...I have to extract the name of the plant, the file name, the number of bases, and all of the genes including their starting and ending positions...for example, with this GenBank file,

LOCUS NC_010093 153819 bp DNA circular PLN 07-MAY-2009
DEFINITION Acorus americanus chloroplast, complete genome.
ACCESSION NC_010093
VERSION NC_010093.1 GI:161622288
DBLINK Project:27981
KEYWORDS .
SOURCE chloroplast Acorus americanus
ORGANISM Acorus americanus
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; Liliopsida; Acoraceae; Acorus.
REFERENCE 1 (bases 1 to 153819)
AUTHORS Peery,R.M., Chumley,T.W., Kuehl,J.V., Boore,J.L. and Raubeson,L.A.
TITLE The complete chloroplast genome of Acorus americanus
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 153819)
CONSRTM NCBI Genome Project
TITLE Direct Submission
JOURNAL Submitted (03-DEC-2007) National Center for Biotechnology
Information, NIH, Bethesda, MD 20894, USA
REFERENCE 3 (bases 1 to 153819)
AUTHORS Peery,R.M., Chumley,T.W., Kuehl,J.V., Boore,J.L. and Raubeson,L.A.
TITLE Direct Submission
JOURNAL Submitted (09-NOV-2007) Department of Biological Sciences, Central
Washington University, 400 E University Way, Ellensburg, WA
98926-7537, USA
COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. The reference sequence was derived from EU273602.
COMPLETENESS: full length.
FEATURES Location/Qualifiers
source 1..153819
/organism="Acorus americanus"
/organelle="plastid:chloroplast"
/mol_type="genomic DNA"
/db_xref="taxon:263995"
gene complement(join(96591..97384,69098..69211))
/gene="rps12"
/locus_tag="AcamCp045"
/trans_splicing
/db_xref="GeneID:5777700"
CDS complement(join(96591..96616,97153..97384,69098..69211))
/gene="rps12"
/locus_tag="AcamCp045"
/trans_splicing
/note="trans-splices to 5' rps12 exon within the LSC"
/codon_start=1
/transl_table=11
/product="ribosomal protein S12"
/protein_id="YP_001586161.1"
/db_xref="GI:161622289"
/db_xref="GeneID:5777700"
/translation="MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKK
PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTL
DAVGVKDRQQGRSKYGVKKPK"
misc_feature 1..83496
/note="large single copy region (LSC)"
gene complement(136..1197)
/gene="psbA"
/locus_tag="AcamCp001"
/db_xref="GeneID:5777757"

From this I would need to extract the NC number next to Locus for filename, the name next to organism, and the number in front of bp for bases. Then for every time there's a new gene, return what is next to gene=, and then the 2 numbers inside the complement parentheses for the start and end of that gene sequence.

Also, is there a way to make so if I put in the NC number I want, it spits back out the rest of the info...the website this is from is Nucleotide - Acorus americanus chloroplast, complete genome, so if I put in the number I wanted next to the NC, then it would give me the information I want as specified above for that specific NC number....?

I've just been having trouble getting the info extracted out of a GenBank file...any help in the coding for this would be great...thanks!

akreibich07

View Public Profile for akreibich07

Find all posts by akreibich07

05-25-2009

Registered User

2,100, 402

Join Date: Apr 2009

Last Activity: 11 February 2020, 10:24 AM EST

Posts: 2,100

Thanks Given: 26

Thanked 402 Times in 360 Posts

Here's one way of doing it:

Code:

$ 
$ perl -ne 'if (/^LOCUS/) {
>   split; printf("Filename = %s\nBase = %s\n",$_[1],$_[2])
> } elsif (/organism/) {                                   
>   s/^\s*\/o/O/; s/=/ = /; print                 
> } elsif (/gene/) {                                       
>   print                                                  
> }' genbank.txt                                           
Filename = NC_010093                                       
Base = 153819                                              
Organism = "Acorus americanus"                             
     gene            complement(join(96591..97384,69098..69211))
                     /gene="rps12"                              
                     /gene="rps12"                              
     gene            complement(136..1197)                      
                     /gene="psbA"                               
                     /gene="psbA"                               
     gene            complement(1388..3959)                     
                     /gene="trnK"                               
                     /gene="trnK"                               
                     /gene="trnK"                               
                     /gene="trnK"                               
...
... (trimmed a lot of lines here for succinctness)
...
     gene            complement(153492..153566)
                     /gene="trnH"
                     /gene="trnH"
     gene            153733..153819
                     /gene="rps19"
$

You hadn't specified the output format, so I took the liberty to specify one of my own.

Quote:

...Then for every time there's a new gene, ... the 2 numbers inside the complement parentheses for the start and end of that gene sequence...

Note that not all of the lines starting with "gene" have a "complement" parenthesis. Some of the variations are:

Code:

151670..151951             <= number1..number2 format
complement(join(96591..97384,69098..69211))  <= "complement" followed by "join" and 2 number ranges
join(complement(69098..69211),139932..140725) <= "join" followed by "complement" and 2 number ranges

What do you want to display in such cases ?

Quote:

... is there a way to make so if I put in the NC number I want, it spits back out the rest of the info...

This would be useful if the file had more than one NC numbers. But the file from the website you specified, had data for only one NC number.

HTH,
tyler_durden

durden_tyler

View Public Profile for durden_tyler

Find all posts by durden_tyler

05-25-2009

Registered User

10, 0

Join Date: May 2009

Last Activity: 25 June 2009, 1:24 PM EDT

Posts: 10

Thanks Given: 0

Thanked 0 Times in 0 Posts

Hi tyler_durden,

Well, what's going on is I have over 100 GenBank files, each is it's own NC_ number...(this NC number is just a way to catalogue them). I want to be able to give that NC_ number as a command-line possibly and have it go to the internet, replace that website I posted with whatever NC_ number was given, go into that GenBank file, and return something like:

Name: ___________

FileName: ___________ (NC_######)

Bases: ________

Genes: rps (####, #####) - for example, this would be for the rps gene, and then a new line, and all the rest of the genes are given....if the word complement is present before it, then it should print a negative before the numbers in parentheses: rps -(####, #####)

If it's impossible to use the NC_ number as a command-line, it's okay for me to do it manually, just so I can have a program that can do the output of the name, filename, bases, and genes for me. It would save me loads and loads of time in my research, rather than going through and doing each one by hand. Thanks!

-akreibich07

akreibich07

View Public Profile for akreibich07

Find all posts by akreibich07

05-25-2009

Registered User

2,100, 402

Join Date: Apr 2009

Last Activity: 11 February 2020, 10:24 AM EST

Posts: 2,100

Thanks Given: 26

Thanked 402 Times in 360 Posts

Quote:

Originally Posted by akreibich07

...
Name: ___________

(1) Name of what ?
(2) What's the value of this field for your example above ?

Quote:

Genes: rps (####, #####) - for example, this would be for the rps gene, and then a new line, and all the rest of the genes are given....if the word complement is present before it, then it should print a negative before the numbers in parentheses: rps -(####, #####)

This is not clear. You haven't addressed the occurrence of 2 different number ranges. Also, you haven't told what to do if the word "join" is present.

What should be the output for each of the following cases ? (Type in the required output for each case.)

Code:

(3)  gene 151670..151951
(4)  gene complement(join(96591..97384,69098..69211))
(5)  gene join(complement(69098..69211),139932..140725)
(6)  gene complement(153492..153566)

Quote:

If it's impossible to use the NC_ number as a command-line, it's okay for me to do it manually, just so I can have a program that can do the output of the name, filename, bases, and genes for me. It would save me loads and loads of time in my research, rather than going through and doing each one by hand. Thanks!

Given an NC_ number, is it possible to know the URL of the download text file (not the NCBI website) ?

For example, when you specify the NC_ number "NC_010093" - the website is "http://www.ncbi.nlm.nih.gov/nuccore/NC_010093". This is trivial. But this would not allow you to download the text file that has the data you need and that is embedded in that website.

The download URL for the text file for your NC_ number "NC_010093" is this -
"http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&db=nuccore&val=161622288&dopt=genbank&sendto=on&log$=seqview&extrafeat=976&m axplex=1"

Does this URL ring a bell ? Do you know how to generate this URL, given the NC_ number ? If you do, then you can use LWP::Simple perl module to download the text file to your computer and then process it the way you want.

HTH,
tyler_durden

durden_tyler

View Public Profile for durden_tyler

Find all posts by durden_tyler

05-25-2009

Registered User

10, 0

Join Date: May 2009

Last Activity: 25 June 2009, 1:24 PM EDT

Posts: 10

Thanks Given: 0

Thanked 0 Times in 0 Posts

(1) Name: _________ -- something like "Acorus Americanus"
(2) for situation (3) gene (151670..151951)
(4) gene -(96591..97384, 69098..69211)
(5) gene -(69098..69211), (139932..140725)
(6) gene -(153492..153566)

I'll check on the URL idea you presented...the most important thing to me right now is just getting a script that can be given a whole GenBank file and can give the output I specified. Even if it can only do it for the specific NC_ number I used as an example, if that one works, I can try to find ways to generalize it. Thanks again.

-akreibich07

akreibich07

View Public Profile for akreibich07

Find all posts by akreibich07

05-26-2009

Registered User

2,100, 402

Join Date: Apr 2009

Last Activity: 11 February 2020, 10:24 AM EST

Posts: 2,100

Thanks Given: 26

Thanked 402 Times in 360 Posts

Quote:

Originally Posted by akreibich07

...the most important thing to me right now is just getting a script that can be given a whole GenBank file and can give the output I specified....

Ok, so what I did was that I went to the hyperlink: "http://www.ncbi.nlm.nih.gov/nuccore/NC_010093" and clicked "Download" -> "GenBank" and saved the resultant "sequences.gb" text file in my computer.

Assuming that you have figured out a way to download one "sequences.gb" for a given NC_ number, (either manually or via some other tool), here's what you can do:

Code:

$ 
$ head -5 sequences.gb
LOCUS       NC_010093             153819 bp    DNA     circular PLN 07-MAY-2009
DEFINITION  Acorus americanus chloroplast, complete genome.                    
ACCESSION   NC_010093                                                          
VERSION     NC_010093.1  GI:161622288                                          
DBLINK      Project:27981                                                      
$                                                                              
$ tail -5 sequences.gb                                                         
   153661 gattttggaa gaaaattcca attttcaaat acaaaaactt taaactttag gtaaaaaaag    
   153721 gaggaatcag ctgtgacacg ttcattaaaa aaaaatcctt ttgtggccaa tcatttattg    
   153781 gaaaaaattg agaagctcaa catgaaagag gagaaagaa                           
//                                                                             

$ 
$ cat process_genbank.pl
#!/usr/bin/perl -w      
my $element;            
my $in = 0;             
$gbfile = "sequences.gb";
open(GB, $gbfile) or die "Can't open $gbfile: $!";
while (<GB>) {                                    
  chomp($line = $_);                              
  if (/^locus/i) {                                
    @x = split /\s+/,$line;                       
    $fname = $x[1];                               
    $base = $x[2];                                
  } elsif (/ organism /i) {                       
    $line =~ s/\s*ORGANISM\s*//g;                 
    $oname = $line;
  } elsif (/ gene /) {
    @x = split/\s+/,$line;
    if ($x[2] =~ /^\d+..\d+$/) {
      $x[2] = "($x[2])";
    } elsif ($x[2] =~ /^complement\(\d+..\d+\)$/) {
      $x[2] =~ s/complement/-/;
    } elsif ($x[2] =~ /^complement\(join\((\d+..\d+),(\d+..\d+)\)\)$/) {
      $x[2] = "-($1,$2)";
    } elsif ($x[2] =~ /^join\(complement\((\d+..\d+)\),(\d+..\d+)\)$/) {
      $x[2] = "-($1),($2)";
    }
    $element = $x[2];
    $in = 1;
  } elsif (/^\s+\/gene="(.*)"/ and $in == 1) {
    $element = $1.":".$element;
    push @genearray, $element;
    $element = "";
    $in = 0;
  }
}
close(GB) or die "Can't close $gbfile: $!";

# Now start printing the information we gathered
print "Name: $oname\n";
print "FileName: $fname\n";
print "Bases: $base\n";
print "Genes:";
foreach $item (@genearray) {
  @x = split/:/, $item;
  printf("\t%-10s  %s\n",$x[0],$x[1]);
}

$
$

The "head" and "tail" commands were executed so that you know which text file I am talking about.

The execution of the perl program is shown below:

Code:

$ 
$ perl process_genbank.pl
Name: Acorus americanus  
FileName: NC_010093      
Bases: 153819            
Genes:  rps12       -(96591..97384,69098..69211)
        psbA        -(136..1197)                
        trnK        -(1388..3959)               
        matK        -(1680..3215)               
        rps16       -(4654..5737)               
        trnQ        -(6563..6634)               
        psbK        (6985..7170)                
        psbI        (7607..7717)                
        trnS        -(7834..7921)               
        trnG        (8847..9678)                
        trnR        (9773..9844)                
        atpA        -(9940..11463)              
        atpF        -(11531..12803)             
        atpH        -(13300..13545)             
        atpI        -(14474..15217)             
        rps2        -(15451..16161)             
        rpoC2       -(16344..20504)             
        rpoC1       -(20670..23478)             
        rpoB        -(23505..26756)             
        trnC        (27763..27833)              
        petN        (28715..28804)              
        psbM        -(29784..29888)             
        trnD        -(30412..30485)             
        trnY        -(30823..30906)             
        trnE        -(30971..31043)             
        trnT        (31561..31632)              
        psbD        (32999..34060)              
        psbC        (34044..35429)              
        trnS        -(35649..35741)             
        psbZ        (36122..36310)              
        trnG        (36587..36657)              
        trnfM       -(36798..36871)             
        rps14       -(37033..37335)             
        psaB        -(37473..39677)             
        psaA        -(39703..41955)             
        ycf3        -(42574..44553)             
        trnS        (45383..45469)              
        rps4        -(45769..46374)             
        trnT        -(46725..46797)             
        trnL        (47541..48151)              
        trnF        (48522..48594)              
        ndhJ        -(48980..49456)             
        ndhK        -(49532..50251)             
        ndhC        -(50305..50667)             
        trnV        -(52580..53260)             
        trnM        (53438..53509)              
        atpE        -(53676..54080)             
        atpB        -(54077..55573)             
        rbcL        (56391..57833)              
        psaI        (59093..59203)              
        ycf4        (59579..60133)              
        cemA        (60675..61364)              
        petA        (61554..62516)              
        psbJ        -(63395..63517)             
        psbL        -(63652..63768)             
        psbF        -(63791..63910)             
        psbE        -(63920..64171)             
        petL        (65441..65536)              
        petG        (65699..65812)              
        trnW        -(65926..65999)             
        trnP        -(66165..66238)             
        psaJ        (66596..66724)              
        rpl33       (67088..67288)              
        rps18       (67428..67733)              
        rpl20       -(68002..68355)             
        rps12       -(69098..69211),(139932..140725)
        clpP        -(69426..71438)                 
        psbB        (71880..73406)                  
        psbT        (73582..73689)                  
        psbN        -(73749..73880)                 
        psbH        (73993..74214)                  
        petB        (74334..75765)                  
        petD        (75956..77164)                  
        rpoA        -(77332..78345)                 
        rps11       -(78414..78830)                 
        rpl36       -(78951..79064)                 
        infA        -(79178..79411)                 
        rps8        -(79543..79941)                 
        rpl14       -(80114..80482)                 
        rpl16       -(80624..81969)                 
        rps3        -(82130..82786)                 
        rpl22       -(82872..83255)                 
        rps19       -(83305..83583)                 
        trnH        (83750..83824)                  
        rpl2        -(83856..85346)                 
        rpl23       -(85365..85646)                 
        trnI        -(85813..85886)                 
        ycf2        (86098..92328)                  
        trnL-CAA    -(92877..92957)                 
        ndhB        -(93521..95753)                 
        rps7        -(96071..96538)                 
        trnV        (99215..99286)                  
        rrn16       (99514..101004)                 
        trnI        (101318..102331)                
        trnA        (102400..103271)                
        rrn23       (103424..106235)                
        rrn4.5      (106334..106436)                
        rrn5        (106654..106774)                
        trnR        (107036..107109)                
        trnN        -(107607..107678)               
        ycf1        (108281..109521)
        ndhF        -(109702..111921)
        rpl32       (112773..112946)
        trnL        (113810..113889)
        ccsA        (113971..114936)
        ndhD        -(115163..116665)
        psaC        -(116780..117025)
        ndhE        -(117270..117575)
        ndhG        -(117668..118201)
        ndhI        -(118543..119082)
        ndhA        -(119160..121362)
        ndhH        -(121364..122545)
        rps15       -(122651..122923)
        ycf1        -(123294..129035)
        trnN        (129638..129709)
        trnR        -(130207..130280)
        rrn5        -(130542..130662)
        rrn4.5      -(130880..130982)
        rrn23       -(131081..133892)
        trnA        -(134045..134916)
        trnI        -(134985..135998)
        rrn16       -(136312..137802)
        trnV        -(138030..138101)
        rps12       (139932..140725)
        rps7        (140778..141245)
        ndhB        (141563..143795)
        trnL        (144359..144439)
        ycf2        -(144988..151218)
        trnI        (151430..151503)
        rpl23       (151670..151951)
        rpl2        (151970..153460)
        trnH        -(153492..153566)
        rps19       (153733..153819)
$
$

Hope that helps,
tyler_durden

This User Gave Thanks to durden_tyler For This Post:

durden_tyler

View Public Profile for durden_tyler

Find all posts by durden_tyler

05-26-2009

Registered User

10, 0

Join Date: May 2009

Last Activity: 25 June 2009, 1:24 PM EDT

Posts: 10

Thanks Given: 0

Thanked 0 Times in 0 Posts

Hi again tyler_durden,

This code is working great for all the files I have tried it on so far. Thank you...it helps more than you know. I had a couple quick questions:

(1) This again probably has something to do with the URL idea you talked about a few replies ago, but I do most of my text editing on Emacs. I have another program that I already made (much more simple), which just goes onto the main NCBI page and returns only the NC_ file numbers that I want and puts them into another file. Is it possible to do something on Emacs where I put in the file I want to run (perl process_genbank.pl) and then after it put in the specific NC_ file that I want it run on (almost as a command-line). And if so, would there be specific changes to the code I would have to make?

(2) If I needed at some point in time the output to be more specific, so each line after genes: said something like

rps12 CDS exon -(####..####), intron -(####..####), exon(etc)
rps16 tRNA (etc)

If you look on that NCBI file, the left margin has gene for each new gene and then under it says either CDS if it's coding a protein, or tRNA, or I believe there might be a few rRNA's in other files, and then has each exon and intron for that particular gene before going onto the next gene. Is it possible to have that output like I said above (one gene per line, just double spaces in between things like the gene name, the CDS/tRNA/rRNA, etc)?

Again, thanks a million for the code you already submitted on here. It's already been a big help with some of my research.

-akreibich07

-----Post Update-----

Just a quick note...

I don't think I was quite clear in my last message, so I wanted to clarify.

Next to each instance of gene, there is a sequence of numbers that signifies the entire sequence of bases that that gene is contained in. When you go down to the exon, intron, exon subheadings below each gene, each has its own sequence of numbers which is always inside the range of the full sequence of numbers given by the sequence next to gene.

In the input that I just stated I was possibly looking for, I didn't want the full sequence next to gene, just the individual subsequences next to each exon, intron, exon, etc... In an over-simplified example, if the sequence next to gene was (1..117), then I wouldn't want that, but rather something like... exon (1..53), intron (54..98), exon (99..117). It will always (at least I'm banking on it) work out that if the numbers for each exon, intron sequence were concatenated, it would end up being the full sequence from the gene heading.

-akreibich07

-----Post Update-----

If there is no exon, intron, exon subheadings below gene, this indicates that the full sequence is just one full exon with no interrupting introns. In cases like this, the original output that you gave will be good, just with an added CDS/tRNA/rRNA in between the gene name and the range.

akreibich07

View Public Profile for akreibich07

Find all posts by akreibich07

Shell Programming and Scripting

GenBank Perl help...

10 More Discussions You Might Find Interesting

1. Programming

PERL: In a perl-scripttTrying to execute another perl-script that SETS SOME VARIABLES !

Discussion started by: alex_5161

2. Programming

Perl: restrict perl from automaticaly creating a hash branches on check

Discussion started by: alex_5161

3. UNIX for Advanced & Expert Users

perl and HP-UX : instmodsh in combination with software depot : update inventory for installed Perl

Discussion started by: bora99

4. Shell Programming and Scripting

HELP on Perl array / sorting - trying to convert Korn Shell Script to Perl

Discussion started by: newbie_01

5. Shell Programming and Scripting

Hidden Characters in Regular Expression Matching Perl - Perl Newbie

Discussion started by: kittyluva2

6. Shell Programming and Scripting

Perl :How to print the o/p of a Perl script on console and redirecting same in log file @ same time.

Discussion started by: butterfly20

7. Shell Programming and Scripting

To retrieve gene name & function thorugh Genbank id (gi|9910297)

Discussion started by: shibujohn82

8. Shell Programming and Scripting

Passing date formats in Perl: i.e. Jul/10/2007 -> 20070710 (yyyymmdd) - Perl

Discussion started by: akil

9. Shell Programming and Scripting

Extracting DNA sequences from GenBank files using Perl

Discussion started by: akreibich07

10. Shell Programming and Scripting

[Perl] Accessing array elements within a sed command in Perl script

Discussion started by: userix