The UNIX and Linux Forums  

Go Back   The UNIX and Linux Forums > Top Forums > Shell Programming and Scripting
.
google unix.com



Shell Programming and Scripting Post questions about KSH, CSH, SH, BASH, PERL, PHP, SED, AWK and OTHER shell scripts and shell scripting languages here.

More UNIX and Linux Forum Topics You Might Find Helpful
Thread Thread Starter Forum Replies Last Post
perl -write values in a file to @array in perl meghana Shell Programming and Scripting 27 06-07-2009 06:05 PM
How to Turn perl one-liners into full perl script? EDALBNUG UNIX for Dummies Questions & Answers 1 02-04-2009 10:49 AM
[Perl] Accessing array elements within a sed command in Perl script userix Shell Programming and Scripting 2 10-03-2008 01:05 PM
[PERL] Running unix commands within Perl Scripts userix Shell Programming and Scripting 1 05-28-2008 07:06 PM
Perl: Run perl script in the current process vino Shell Programming and Scripting 10 12-09-2005 10:45 AM

Closed Thread
English Japanese Spanish French German Portuguese Italian Dutch Swedish Russian Norwegian Hungarian Hebrew Danish Bulgarian Greek Powered by Powered by Google
 
LinkBack Thread Tools Search this Thread Rate Thread Display Modes
  #1 (permalink)  
Old 05-24-2009
akreibich07 akreibich07 is offline
Registered User
  
 

Join Date: May 2009
Posts: 10
GenBank Perl help...

Hey guys,

I'm doing some Perl scripting for genomic data out of GenBank files...I have to extract the name of the plant, the file name, the number of bases, and all of the genes including their starting and ending positions...for example, with this GenBank file,


LOCUS NC_010093 153819 bp DNA circular PLN 07-MAY-2009
DEFINITION Acorus americanus chloroplast, complete genome.
ACCESSION NC_010093
VERSION NC_010093.1 GI:161622288
DBLINK Project:27981
KEYWORDS .
SOURCE chloroplast Acorus americanus
ORGANISM Acorus americanus
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; Liliopsida; Acoraceae; Acorus.
REFERENCE 1 (bases 1 to 153819)
AUTHORS Peery,R.M., Chumley,T.W., Kuehl,J.V., Boore,J.L. and Raubeson,L.A.
TITLE The complete chloroplast genome of Acorus americanus
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 153819)
CONSRTM NCBI Genome Project
TITLE Direct Submission
JOURNAL Submitted (03-DEC-2007) National Center for Biotechnology
Information, NIH, Bethesda, MD 20894, USA
REFERENCE 3 (bases 1 to 153819)
AUTHORS Peery,R.M., Chumley,T.W., Kuehl,J.V., Boore,J.L. and Raubeson,L.A.
TITLE Direct Submission
JOURNAL Submitted (09-NOV-2007) Department of Biological Sciences, Central
Washington University, 400 E University Way, Ellensburg, WA
98926-7537, USA
COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. The reference sequence was derived from EU273602.
COMPLETENESS: full length.
FEATURES Location/Qualifiers
source 1..153819
/organism="Acorus americanus"
/organelle="plastid:chloroplast"
/mol_type="genomic DNA"
/db_xref="taxon:263995"
gene complement(join(96591..97384,69098..69211))
/gene="rps12"
/locus_tag="AcamCp045"
/trans_splicing
/db_xref="GeneID:5777700"
CDS complement(join(96591..96616,97153..97384,69098..69211))
/gene="rps12"
/locus_tag="AcamCp045"
/trans_splicing
/note="trans-splices to 5' rps12 exon within the LSC"
/codon_start=1
/transl_table=11
/product="ribosomal protein S12"
/protein_id="YP_001586161.1"
/db_xref="GI:161622289"
/db_xref="GeneID:5777700"
/translation="MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKK
PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTL
DAVGVKDRQQGRSKYGVKKPK"
misc_feature 1..83496
/note="large single copy region (LSC)"
gene complement(136..1197)
/gene="psbA"
/locus_tag="AcamCp001"
/db_xref="GeneID:5777757"

From this I would need to extract the NC number next to Locus for filename, the name next to organism, and the number in front of bp for bases. Then for every time there's a new gene, return what is next to gene=, and then the 2 numbers inside the complement parentheses for the start and end of that gene sequence.

Also, is there a way to make so if I put in the NC number I want, it spits back out the rest of the info...the website this is from is Nucleotide - Acorus americanus chloroplast, complete genome, so if I put in the number I wanted next to the NC, then it would give me the information I want as specified above for that specific NC number....?

I've just been having trouble getting the info extracted out of a GenBank file...any help in the coding for this would be great...thanks!
  #2 (permalink)  
Old 05-25-2009
durden_tyler's Avatar
durden_tyler durden_tyler is offline Forum Advisor  
Registered User
  
 

Join Date: Apr 2009
Posts: 551
Here's one way of doing it:


Code:
$ 
$ perl -ne 'if (/^LOCUS/) {
>   split; printf("Filename = %s\nBase = %s\n",$_[1],$_[2])
> } elsif (/organism/) {                                   
>   s/^\s*\/o/O/; s/=/ = /; print                 
> } elsif (/gene/) {                                       
>   print                                                  
> }' genbank.txt                                           
Filename = NC_010093                                       
Base = 153819                                              
Organism = "Acorus americanus"                             
     gene            complement(join(96591..97384,69098..69211))
                     /gene="rps12"                              
                     /gene="rps12"                              
     gene            complement(136..1197)                      
                     /gene="psbA"                               
                     /gene="psbA"                               
     gene            complement(1388..3959)                     
                     /gene="trnK"                               
                     /gene="trnK"                               
                     /gene="trnK"                               
                     /gene="trnK"                               
...
... (trimmed a lot of lines here for succinctness)
...
     gene            complement(153492..153566)
                     /gene="trnH"
                     /gene="trnH"
     gene            153733..153819
                     /gene="rps19"
$

You hadn't specified the output format, so I took the liberty to specify one of my own.

Quote:
...Then for every time there's a new gene, ... the 2 numbers inside the complement parentheses for the start and end of that gene sequence...
Note that not all of the lines starting with "gene" have a "complement" parenthesis. Some of the variations are:


Code:
151670..151951             <= number1..number2 format
complement(join(96591..97384,69098..69211))  <= "complement" followed by "join" and 2 number ranges
join(complement(69098..69211),139932..140725) <= "join" followed by "complement" and 2 number ranges

What do you want to display in such cases ?

Quote:
... is there a way to make so if I put in the NC number I want, it spits back out the rest of the info...
This would be useful if the file had more than one NC numbers. But the file from the website you specified, had data for only one NC number.

HTH,
tyler_durden
Bits Awarded / Charged to durden_tyler for this Post
Date User Comment Amount
05-25-2009 Neo N/A 50,000
  #3 (permalink)  
Old 05-25-2009
akreibich07 akreibich07 is offline
Registered User
  
 

Join Date: May 2009
Posts: 10
Hi tyler_durden,

Well, what's going on is I have over 100 GenBank files, each is it's own NC_ number...(this NC number is just a way to catalogue them). I want to be able to give that NC_ number as a command-line possibly and have it go to the internet, replace that website I posted with whatever NC_ number was given, go into that GenBank file, and return something like:

Name: ___________

FileName: ___________ (NC_######)

Bases: ________

Genes: rps (####, #####) - for example, this would be for the rps gene, and then a new line, and all the rest of the genes are given....if the word complement is present before it, then it should print a negative before the numbers in parentheses: rps -(####, #####)


If it's impossible to use the NC_ number as a command-line, it's okay for me to do it manually, just so I can have a program that can do the output of the name, filename, bases, and genes for me. It would save me loads and loads of time in my research, rather than going through and doing each one by hand. Thanks!

-akreibich07
  #4 (permalink)  
Old 05-25-2009
durden_tyler's Avatar
durden_tyler durden_tyler is offline Forum Advisor  
Registered User
  
 

Join Date: Apr 2009
Posts: 551
Quote:
Originally Posted by akreibich07 View Post
...
Name: ___________
(1) Name of what ?
(2) What's the value of this field for your example above ?

Quote:
Genes: rps (####, #####) - for example, this would be for the rps gene, and then a new line, and all the rest of the genes are given....if the word complement is present before it, then it should print a negative before the numbers in parentheses: rps -(####, #####)
This is not clear. You haven't addressed the occurrence of 2 different number ranges. Also, you haven't told what to do if the word "join" is present.

What should be the output for each of the following cases ? (Type in the required output for each case.)


Code:
(3)  gene 151670..151951
(4)  gene complement(join(96591..97384,69098..69211))
(5)  gene join(complement(69098..69211),139932..140725)
(6)  gene complement(153492..153566)

Quote:
If it's impossible to use the NC_ number as a command-line, it's okay for me to do it manually, just so I can have a program that can do the output of the name, filename, bases, and genes for me. It would save me loads and loads of time in my research, rather than going through and doing each one by hand. Thanks!
Given an NC_ number, is it possible to know the URL of the download text file (not the NCBI website) ?

For example, when you specify the NC_ number "NC_010093" - the website is "http://www.ncbi.nlm.nih.gov/nuccore/NC_010093". This is trivial. But this would not allow you to download the text file that has the data you need and that is embedded in that website.

The download URL for the text file for your NC_ number "NC_010093" is this -
"http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&db=nuccore&val=161622288&dopt=genbank&sendto=on&log$=seqview&extrafeat=976&m axplex=1"

Does this URL ring a bell ? Do you know how to generate this URL, given the NC_ number ? If you do, then you can use LWP::Simple perl module to download the text file to your computer and then process it the way you want.

HTH,
tyler_durden
Bits Awarded / Charged to durden_tyler for this Post
Date User Comment Amount
05-25-2009 Neo N/A 30,000
  #5 (permalink)  
Old 05-25-2009
akreibich07 akreibich07 is offline
Registered User
  
 

Join Date: May 2009
Posts: 10
(1) Name: _________ -- something like "Acorus Americanus"
(2) for situation (3) gene (151670..151951)
(4) gene -(96591..97384, 69098..69211)
(5) gene -(69098..69211), (139932..140725)
(6) gene -(153492..153566)

I'll check on the URL idea you presented...the most important thing to me right now is just getting a script that can be given a whole GenBank file and can give the output I specified. Even if it can only do it for the specific NC_ number I used as an example, if that one works, I can try to find ways to generalize it. Thanks again.

-akreibich07
  #6 (permalink)  
Old 05-26-2009
durden_tyler's Avatar
durden_tyler durden_tyler is offline Forum Advisor  
Registered User
  
 

Join Date: Apr 2009
Posts: 551
Quote:
Originally Posted by akreibich07 View Post
...the most important thing to me right now is just getting a script that can be given a whole GenBank file and can give the output I specified....
Ok, so what I did was that I went to the hyperlink: "http://www.ncbi.nlm.nih.gov/nuccore/NC_010093" and clicked "Download" -> "GenBank" and saved the resultant "sequences.gb" text file in my computer.

Assuming that you have figured out a way to download one "sequences.gb" for a given NC_ number, (either manually or via some other tool), here's what you can do:


Code:
$ 
$ head -5 sequences.gb
LOCUS       NC_010093             153819 bp    DNA     circular PLN 07-MAY-2009
DEFINITION  Acorus americanus chloroplast, complete genome.                    
ACCESSION   NC_010093                                                          
VERSION     NC_010093.1  GI:161622288                                          
DBLINK      Project:27981                                                      
$                                                                              
$ tail -5 sequences.gb                                                         
   153661 gattttggaa gaaaattcca attttcaaat acaaaaactt taaactttag gtaaaaaaag    
   153721 gaggaatcag ctgtgacacg ttcattaaaa aaaaatcctt ttgtggccaa tcatttattg    
   153781 gaaaaaattg agaagctcaa catgaaagag gagaaagaa                           
//                                                                             

$ 
$ cat process_genbank.pl
#!/usr/bin/perl -w      
my $element;            
my $in = 0;             
$gbfile = "sequences.gb";
open(GB, $gbfile) or die "Can't open $gbfile: $!";
while (<GB>) {                                    
  chomp($line = $_);                              
  if (/^locus/i) {                                
    @x = split /\s+/,$line;                       
    $fname = $x[1];                               
    $base = $x[2];                                
  } elsif (/ organism /i) {                       
    $line =~ s/\s*ORGANISM\s*//g;                 
    $oname = $line;
  } elsif (/ gene /) {
    @x = split/\s+/,$line;
    if ($x[2] =~ /^\d+..\d+$/) {
      $x[2] = "($x[2])";
    } elsif ($x[2] =~ /^complement\(\d+..\d+\)$/) {
      $x[2] =~ s/complement/-/;
    } elsif ($x[2] =~ /^complement\(join\((\d+..\d+),(\d+..\d+)\)\)$/) {
      $x[2] = "-($1,$2)";
    } elsif ($x[2] =~ /^join\(complement\((\d+..\d+)\),(\d+..\d+)\)$/) {
      $x[2] = "-($1),($2)";
    }
    $element = $x[2];
    $in = 1;
  } elsif (/^\s+\/gene="(.*)"/ and $in == 1) {
    $element = $1.":".$element;
    push @genearray, $element;
    $element = "";
    $in = 0;
  }
}
close(GB) or die "Can't close $gbfile: $!";

# Now start printing the information we gathered
print "Name: $oname\n";
print "FileName: $fname\n";
print "Bases: $base\n";
print "Genes:";
foreach $item (@genearray) {
  @x = split/:/, $item;
  printf("\t%-10s  %s\n",$x[0],$x[1]);
}

$
$

The "head" and "tail" commands were executed so that you know which text file I am talking about.

The execution of the perl program is shown below:


Code:
$ 
$ perl process_genbank.pl
Name: Acorus americanus  
FileName: NC_010093      
Bases: 153819            
Genes:  rps12       -(96591..97384,69098..69211)
        psbA        -(136..1197)                
        trnK        -(1388..3959)               
        matK        -(1680..3215)               
        rps16       -(4654..5737)               
        trnQ        -(6563..6634)               
        psbK        (6985..7170)                
        psbI        (7607..7717)                
        trnS        -(7834..7921)               
        trnG        (8847..9678)                
        trnR        (9773..9844)                
        atpA        -(9940..11463)              
        atpF        -(11531..12803)             
        atpH        -(13300..13545)             
        atpI        -(14474..15217)             
        rps2        -(15451..16161)             
        rpoC2       -(16344..20504)             
        rpoC1       -(20670..23478)             
        rpoB        -(23505..26756)             
        trnC        (27763..27833)              
        petN        (28715..28804)              
        psbM        -(29784..29888)             
        trnD        -(30412..30485)             
        trnY        -(30823..30906)             
        trnE        -(30971..31043)             
        trnT        (31561..31632)              
        psbD        (32999..34060)              
        psbC        (34044..35429)              
        trnS        -(35649..35741)             
        psbZ        (36122..36310)              
        trnG        (36587..36657)              
        trnfM       -(36798..36871)             
        rps14       -(37033..37335)             
        psaB        -(37473..39677)             
        psaA        -(39703..41955)             
        ycf3        -(42574..44553)             
        trnS        (45383..45469)              
        rps4        -(45769..46374)             
        trnT        -(46725..46797)             
        trnL        (47541..48151)              
        trnF        (48522..48594)              
        ndhJ        -(48980..49456)             
        ndhK        -(49532..50251)             
        ndhC        -(50305..50667)             
        trnV        -(52580..53260)             
        trnM        (53438..53509)              
        atpE        -(53676..54080)             
        atpB        -(54077..55573)             
        rbcL        (56391..57833)              
        psaI        (59093..59203)              
        ycf4        (59579..60133)              
        cemA        (60675..61364)              
        petA        (61554..62516)              
        psbJ        -(63395..63517)             
        psbL        -(63652..63768)             
        psbF        -(63791..63910)             
        psbE        -(63920..64171)             
        petL        (65441..65536)              
        petG        (65699..65812)              
        trnW        -(65926..65999)             
        trnP        -(66165..66238)             
        psaJ        (66596..66724)              
        rpl33       (67088..67288)              
        rps18       (67428..67733)              
        rpl20       -(68002..68355)             
        rps12       -(69098..69211),(139932..140725)
        clpP        -(69426..71438)                 
        psbB        (71880..73406)                  
        psbT        (73582..73689)                  
        psbN        -(73749..73880)                 
        psbH        (73993..74214)                  
        petB        (74334..75765)                  
        petD        (75956..77164)                  
        rpoA        -(77332..78345)                 
        rps11       -(78414..78830)                 
        rpl36       -(78951..79064)                 
        infA        -(79178..79411)                 
        rps8        -(79543..79941)                 
        rpl14       -(80114..80482)                 
        rpl16       -(80624..81969)                 
        rps3        -(82130..82786)                 
        rpl22       -(82872..83255)                 
        rps19       -(83305..83583)                 
        trnH        (83750..83824)                  
        rpl2        -(83856..85346)                 
        rpl23       -(85365..85646)                 
        trnI        -(85813..85886)                 
        ycf2        (86098..92328)                  
        trnL-CAA    -(92877..92957)                 
        ndhB        -(93521..95753)                 
        rps7        -(96071..96538)                 
        trnV        (99215..99286)                  
        rrn16       (99514..101004)                 
        trnI        (101318..102331)                
        trnA        (102400..103271)                
        rrn23       (103424..106235)                
        rrn4.5      (106334..106436)                
        rrn5        (106654..106774)                
        trnR        (107036..107109)                
        trnN        -(107607..107678)               
        ycf1        (108281..109521)
        ndhF        -(109702..111921)
        rpl32       (112773..112946)
        trnL        (113810..113889)
        ccsA        (113971..114936)
        ndhD        -(115163..116665)
        psaC        -(116780..117025)
        ndhE        -(117270..117575)
        ndhG        -(117668..118201)
        ndhI        -(118543..119082)
        ndhA        -(119160..121362)
        ndhH        -(121364..122545)
        rps15       -(122651..122923)
        ycf1        -(123294..129035)
        trnN        (129638..129709)
        trnR        -(130207..130280)
        rrn5        -(130542..130662)
        rrn4.5      -(130880..130982)
        rrn23       -(131081..133892)
        trnA        -(134045..134916)
        trnI        -(134985..135998)
        rrn16       -(136312..137802)
        trnV        -(138030..138101)
        rps12       (139932..140725)
        rps7        (140778..141245)
        ndhB        (141563..143795)
        trnL        (144359..144439)
        ycf2        -(144988..151218)
        trnI        (151430..151503)
        rpl23       (151670..151951)
        rpl2        (151970..153460)
        trnH        -(153492..153566)
        rps19       (153733..153819)
$
$

Hope that helps,
tyler_durden
Closed Thread

Bookmarks

Tags
extraction, file, perl, script

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes Rate This Thread
Rate This Thread:

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On




All times are GMT -4. The time now is 11:20 PM.


Powered by: vBulletin, Copyright ©2000 - 2006, Jelsoft Enterprises Limited. Language Translations Powered by .
vBCredits v1.4 Copyright ©2007 - 2008, PixelFX Studios
The UNIX and Linux Forums Content Copyright ©1993-2009. All Rights Reserved.Ad Management by RedTyger

Content Relevant URLs by vBSEO 3.2.0