GenBank Perl help...


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting GenBank Perl help...
# 8  
Old 05-27-2009
Quote:
Originally Posted by akreibich07
...
Is it possible to do something on Emacs where I put in the file I want to run (perl process_genbank.pl) and then after it put in the specific NC_ file that I want it run on (almost as a command-line)...
Sorry, I have never used Emacs.
In vim, however, if you type this:

Code:
: r ! <shell_command>

at the editor prompt ":", then it will execute the shell_command and paste the output in your vim file at the cursor position.
You may want to check the Emacs documentation for the equivalent set of keystrokes.

Quote:
...
(2) If I needed at some point in time the output to be more specific, so each line after genes: said something like

rps12 CDS exon -(####..####), intron -(####..####), exon(etc)
rps16 tRNA (etc)

If you look on that NCBI file, the left margin has gene for each new gene and then under it says either CDS if it's coding a protein, or tRNA, or I believe there might be a few rRNA's in other files, and then has each exon and intron for that particular gene before going onto the next gene. Is it possible to have that output like I said above (one gene per line, just double spaces in between things like the gene name, the CDS/tRNA/rRNA, etc)?
...
This is with references to "sequences.gb" from ""http://www.ncbi.nlm.nih.gov/nuccore/NC_010093"

(1) The gene name was picked from the line right below / gene / e.g. for this excerpt:

Code:
$ perl -ne 'print $.,"\t",$_ if 35..56' sequences.gb
35           gene            complement(join(96591..97384,69098..69211))
36                           /gene="rps12"
37                           /locus_tag="AcamCp045"
38                           /trans_splicing
39                           /db_xref="GeneID:5777700"
40           CDS             complement(join(96591..96616,97153..97384,69098..69211))
41                           /gene="rps12"
42                           /locus_tag="AcamCp045"
43                           /trans_splicing
44                           /note="trans-splices to 5' rps12 exon within the LSC"
45                           /codon_start=1
46                           /transl_table=11
47                           /product="ribosomal protein S12"
48                           /protein_id="YP_001586161.1"
49                           /db_xref="GI:161622289"
50                           /db_xref="GeneID:5777700"
51                           /translation="MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKK
52                           PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTL
53                           DAVGVKDRQQGRSKYGVKKPK"
54           misc_feature    1..83496
55                           /note="large single copy region (LSC)"
56           gene            complement(136..1197)
$

The name in bold red was picked for gene name; the name in bold blue was *NOT*.

(2) In cases where none of CDS, tRNA or rRNA was found for a gene, the value "DEFAULT" is being displayed e.g. for this excerpt:

Code:
$
$ perl -ne 'print $.,"\t",$_ if 1784..1793' sequences.gb
1784         gene            108281..109521
1785                         /gene="ycf1"
1786                         /locus_tag="AcamCp068"
1787                         /note="hypothetical protein RF1; 5' truncated fragment
1788                         within IRb"
1789                         /pseudo
1790                         /db_xref="GeneID:5777792"
1791         misc_feature    109522..127794
1792                         /note="small single copy region (SSC)"
1793         gene            complement(109702..111921)
$
$

you'd see "DEFAULT" as the gene type. You can set it to NULL or something else.

The updated script is as follows:

Code:
$ 
$ cat process_genbank2.pl
#!/usr/bin/perl -w       

my $element = "";             # an element of the gene array
# An element of the gene array will be of the form:         
#                                                           
# GENE_NAME~GENE_SEQUENCE~GENE_TYPE~EXIN_STRING             
#                                                           
# GENE_NAME = the value after "/gene=..." right below / gene /
# GENE_SEQUENCE = the formatted gene sequence to the right of / gene /
# GENE_TYPE = CDS|tRNA|rRNA, or "" if none was found                  
# EXIN_STRING = ", EI FMTSEQ, EI FMTSEQ..." where EI = exon|intron    
#               and FMTSEQ = formatted sequence of that EI            
#                                                                     
my $gbfile = "sequences.gb";  # file that we are processing           
my $genename = "";            # main gene name; right below / gene /  
my $geneseq = "";             # main gene sequence                    
my $exin = "";                # string of exons and introns           
my $genetype = "";            # the gene type - CDS|tRNA|rRNA         

## subroutines
sub format_sequence {
  my $seq = "@_";    
  if ($seq =~ /^\d+..\d+$/) {
    $seq = "($seq)";         
  } elsif ($seq =~ /^complement\(\d+..\d+\)$/) {
    $seq =~ s/complement/-/;                    
  } elsif ($seq =~ /^complement\(join\((\d+..\d+),(\d+..\d+)\)\)$/) {
    $seq = "-($1,$2)";                                               
  } elsif ($seq =~ /^join\(complement\((\d+..\d+)\),(\d+..\d+)\)$/) {
    $seq = "-($1),($2)";                                             
  }                                                                  
  return $seq;                                                       
}                                                                    

## main section
open(GB, $gbfile) or die "Can't open $gbfile: $!";
while (<GB>) {                                    
  chomp($line = $_);                              
  next if ($line =~ /^\s*$/ or $line =~ /\/\//);  
  if (/^locus/i) {                                
    @x = split /\s+/,$line;                       
    $fname = $x[1];                               
    $base = $x[2];                                
  } elsif (/ organism /i) {                       
    $line =~ s/\s*ORGANISM\s*//g;                 
    $oname = $line;                               
  } elsif (/^ORIGIN| gene /) {                    
    if ($element ne "" or /^ORIGIN/) {            
      # push all data we've gathered of the previous gene into the array
      $element = "$element~$exin";                                      
      push @genearray, $element;                                        
      $element = "";                                                    
      $exin = "";                                                       
      $genename = "";                                                   
      $genetype = "";                                                   
    }                                                                   
    if (/ gene /){                                                      
      @x = split/\s+/,$line;                                              
      $geneseq = format_sequence($x[2]);                                  
      $element = $geneseq;                                                
    }                                                                   
  } elsif ($genename eq "" and /^\s+\/gene="(.*)"/) {                   
    $genename = $1;                                                     
    $element = "$genename~$element";                                    
  } elsif (/ (CDS|tRNA|rRNA) /) {                                       
    $genetype = $1;                                                     
    $element = "$element~$genetype";                                    
  } elsif (defined substr($line,5,1) &&                                 
           substr($line,5,1) ne " " &&                                  
           / (exon|intron) /) {
      @x = split/\s+/,$line;
      $x[2] = format_sequence($x[2]);
      $exin = "$exin, $1 $x[2]";
  }
}
close(GB) or die "Can't close $gbfile: $!";

# Now start printing the information we gathered
print "Name: $oname\n";
print "FileName: $fname\n";
print "Bases: $base\n";
print "Genes:";
foreach $item (@genearray) {
  @x = split/~/, $item;
  $genetype = "DEFAULT"; # set genetype to "DEFAULT"
  $geneseq = $x[1];      # set the default geneseq to the main gene sequence
  # now start from the right side of the array
  if (defined $x[3] and $x[3] ne ""){ # this is exin information
    $geneseq = $x[3];
    substr($geneseq,0,2) = "";
    # set genetype if $x[2] looks good
    if (defined $x[2] and $x[2] !~ /^, /) {
      $genetype = $x[2];
    }
  } elsif (defined $x[2] and $x[2] =~ /^, /) { # no CDS|tRNA|rRNA found, but exin found
    $geneseq = $x[2];
    substr($geneseq,0,2) = "";
  } elsif (defined $x[2] and $x[2] !~ /^, /) { # no exin found but CDS|tRNA|rRNA found
    $genetype = $x[2];
  }
  printf("\t%-10s  %-10s %s\n",$x[0],$genetype,$geneseq);
}

$
$

And the execution on the "sequences.gb" mentioned above is as follows:

Code:
$ 
$ perl process_genbank2.pl
Name: Acorus americanus   
FileName: NC_010093       
Bases: 153819             
Genes:  rps12       CDS        -(96591..97384,69098..69211)
        psbA        CDS        -(136..1197)                
        trnK        tRNA       exon -(1388..1422), intron -(1423..3922)
        matK        CDS        exon -(3923..3959)                      
        rps16       CDS        exon -(4654..4874), intron -(4875..5697), exon -(5698..5737)
        trnQ        tRNA       -(6563..6634)                                               
        psbK        CDS        (6985..7170)                                                
        psbI        CDS        (7607..7717)                                                
        trnS        tRNA       -(7834..7921)                                               
        trnG        tRNA       exon (8847..8869), intron (8870..9630), exon (9631..9678)   
        trnR        tRNA       (9773..9844)                                                
        atpA        CDS        -(9940..11463)                                              
        atpF        CDS        exon -(11531..11940), intron -(11941..12658), exon -(12659..12803)
        atpH        CDS        -(13300..13545)                                                   
        atpI        CDS        -(14474..15217)                                                   
        rps2        CDS        -(15451..16161)                                                   
        rpoC2       CDS        -(16344..20504)                                                   
        rpoC1       CDS        exon -(20670..22286), intron -(22287..23046), exon -(23047..23478)
        rpoB        CDS        -(23505..26756)                                                   
        trnC        tRNA       (27763..27833)                                                    
        petN        CDS        (28715..28804)                                                    
        psbM        CDS        -(29784..29888)                                                   
        trnD        tRNA       -(30412..30485)                                                   
        trnY        tRNA       -(30823..30906)                                                   
        trnE        tRNA       -(30971..31043)                                                   
        trnT        tRNA       (31561..31632)                                                    
        psbD        CDS        (32999..34060)                                                    
        psbC        CDS        (34044..35429)                                                    
        trnS        tRNA       -(35649..35741)                                                   
        psbZ        CDS        (36122..36310)                                                    
        trnG        tRNA       (36587..36657)                                                    
        trnfM       tRNA       -(36798..36871)                                                   
        rps14       CDS        -(37033..37335)                                                   
        psaB        CDS        -(37473..39677)                                                   
        psaA        CDS        -(39703..41955)                                                   
        ycf3        CDS        exon -(42574..42726), intron -(42727..43454), exon -(43455..43684), intron -(43685..44429), exon -(44430..44553)                                                                                                                                 
        trnS        tRNA       (45383..45469)                                                                                           
        rps4        CDS        -(45769..46374)                                                                                          
        trnT        tRNA       -(46725..46797)                                                                                          
        trnL        tRNA       exon (47541..47575), intron (47576..48101), exon (48102..48151)                                          
        trnF        tRNA       (48522..48594)                                                                                           
        ndhJ        CDS        -(48980..49456)                                                                                          
        ndhK        CDS        -(49532..50251)                                                                                          
        ndhC        CDS        -(50305..50667)                                                                                          
        trnV        tRNA       exon -(52580..52614), intron -(52615..53222), exon -(53223..53260)                                       
        trnM        tRNA       (53438..53509)                                                                                           
        atpE        CDS        -(53676..54080)                                                                                          
        atpB        CDS        -(54077..55573)                                                                                          
        rbcL        CDS        (56391..57833)                                                                                           
        psaI        CDS        (59093..59203)                                                                                           
        ycf4        CDS        (59579..60133)                                                                                           
        cemA        CDS        (60675..61364)                                                                                           
        petA        CDS        (61554..62516)                                                                                           
        psbJ        CDS        -(63395..63517)                                                                                          
        psbL        CDS        -(63652..63768)                                                                                          
        psbF        CDS        -(63791..63910)                                                                                          
        psbE        CDS        -(63920..64171)                                                                                          
        petL        CDS        (65441..65536)                                                                                           
        petG        CDS        (65699..65812)                                                                                           
        trnW        tRNA       -(65926..65999)                                                                                          
        trnP        tRNA       -(66165..66238)                                                                                          
        psaJ        CDS        (66596..66724)                                                                                           
        rpl33       CDS        (67088..67288)                                                                                           
        rps18       CDS        (67428..67733)                                                                                           
        rpl20       CDS        -(68002..68355)                                                                                          
        rps12       CDS        exon -(69098..69211)                                                                                     
        clpP        CDS        exon -(69426..69671), intron -(69672..70313), exon -(70314..70605), intron -(70606..71367), exon -(71368..71438)                                                                                                                                 
        psbB        CDS        (71880..73406)                                                                                           
        psbT        CDS        (73582..73689)                                                                                           
        psbN        CDS        -(73749..73880)                                                                                          
        psbH        CDS        (73993..74214)                                                                                           
        petB        CDS        exon (74334..74339), intron (74340..75123), exon (75124..75765)                                          
        petD        CDS        exon (75956..75963), intron (75964..76689), exon (76690..77164)                                          
        rpoA        CDS        -(77332..78345)                                                                                          
        rps11       CDS        -(78414..78830)                                                                                          
        rpl36       CDS        -(78951..79064)                                                                                          
        infA        CDS        -(79178..79411)                                                                                          
        rps8        CDS        -(79543..79941)                                                                                          
        rpl14       CDS        -(80114..80482)                                                                                          
        rpl16       CDS        exon -(80624..81022), intron -(81023..81960), exon -(81961..81969)                                       
        rps3        CDS        -(82130..82786)                                                                                          
        rpl22       CDS        -(82872..83255)                                                                                          
        rps19       CDS        -(83305..83583)                                                                                          
        trnH        tRNA       (83750..83824)                                                                                           
        rpl2        CDS        exon -(83856..84286), intron -(84287..84955), exon -(84956..85346)                                       
        rpl23       CDS        -(85365..85646)                                                                                          
        trnI        tRNA       -(85813..85886)                                                                                          
        ycf2        CDS        (86098..92328)                                                                                           
        trnL-CAA    tRNA       -(92877..92957)                                                                                          
        ndhB        CDS        exon -(93521..94276), intron -(94277..94976), exon -(94977..95753)                                       
        rps7        CDS        exon -(96591..96616), intron -(96617..97152), exon -(97153..97384)                                       
        trnV        tRNA       (99215..99286)                                                                                           
        rrn16       rRNA       (99514..101004)                                                                                          
        trnI        tRNA       exon (101318..101354), intron (101355..102296), exon (102297..102331)                                    
        trnA        tRNA       exon (102400..102437), intron (102438..103236), exon (103237..103271)                                    
        rrn23       rRNA       (103424..106235)                                                                                         
        rrn4.5      rRNA       (106334..106436)                                                                                         
        rrn5        rRNA       (106654..106774)                                                                                         
        trnR        tRNA       (107036..107109)                                                                                         
        trnN        tRNA       -(107607..107678)                                                                                        
        ycf1        DEFAULT    (108281..109521)
        ndhF        CDS        -(109702..111921)
        rpl32       CDS        (112773..112946)
        trnL        tRNA       (113810..113889)
        ccsA        CDS        (113971..114936)
        ndhD        CDS        -(115163..116665)
        psaC        CDS        -(116780..117025)
        ndhE        CDS        -(117270..117575)
        ndhG        CDS        -(117668..118201)
        ndhI        CDS        -(118543..119082)
        ndhA        CDS        exon -(119160..119698), intron -(119699..120803), exon -(120804..121362)
        ndhH        CDS        -(121364..122545)
        rps15       CDS        -(122651..122923)
        ycf1        CDS        -(123294..129035)
        trnN        tRNA       (129638..129709)
        trnR        tRNA       -(130207..130280)
        rrn5        rRNA       -(130542..130662)
        rrn4.5      rRNA       -(130880..130982)
        rrn23       rRNA       -(131081..133892)
        trnA        tRNA       exon -(134045..134079), intron -(134080..134878), exon -(134879..134916)
        trnI        tRNA       exon -(134985..135019), intron -(135020..135961), exon -(135962..135998)
        rrn16       rRNA       -(136312..137802)
        trnV        tRNA       -(138030..138101)
        rps12       DEFAULT    exon (139932..140163), intron (140164..140699), exon (140700..140725)
        rps7        CDS        (140778..141245)
        ndhB        CDS        exon (141563..142339), intron (142340..143039), exon (143040..143795)
        trnL        tRNA       (144359..144439)
        ycf2        CDS        -(144988..151218)
        trnI        tRNA       (151430..151503)
        rpl23       CDS        (151670..151951)
        rpl2        CDS        exon (151970..152360), intron (152361..153029), exon (153030..153460)
        trnH        tRNA       -(153492..153566)
        rps19       DEFAULT    (153733..153819)
$
$

Hope that helps,
tyler_durden

_________________________________________________________________________
"And the eighth and final rule: if this is your first time at Fight Club, you have to fight."
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Programming

PERL: In a perl-scripttTrying to execute another perl-script that SETS SOME VARIABLES !

I have reviewed many examples on-line about running another process (either PERL or shell command or a program), but do not find any usefull for my needs way. (Reviewed and not useful the system(), 'back ticks', exec() and open()) I would like to run another PERL-script from first one, not... (1 Reply)
Discussion started by: alex_5161
1 Replies

2. Programming

Perl: restrict perl from automaticaly creating a hash branches on check

My issue is that the perl script (as I have done it so far) created empty branches when I try to check some branches on existence. I am using multydimentional hashes: found it as the best way for information that I need to handle. Saing multidimentional I means hash of hashes ... So, I have ... (2 Replies)
Discussion started by: alex_5161
2 Replies

3. UNIX for Advanced & Expert Users

perl and HP-UX : instmodsh in combination with software depot : update inventory for installed Perl

we create a HP-UX software depot with a new perl-modul. after installation of the software depot, the perl module i can't find with instmodsh in the inventory for installed Perl modules. - i have learned of using instmodsh command : i find out what modules are already installed on my system. ... (0 Replies)
Discussion started by: bora99
0 Replies

4. Shell Programming and Scripting

HELP on Perl array / sorting - trying to convert Korn Shell Script to Perl

Hi all, Not sure if this should be in the programming forum, but I believe it will get more response under the Shell Programming and Scripting FORUM. Am trying to write a customized df script in Perl and need some help with regards to using arrays and file handlers. At the moment am... (3 Replies)
Discussion started by: newbie_01
3 Replies

5. Shell Programming and Scripting

Hidden Characters in Regular Expression Matching Perl - Perl Newbie

I am completely new to perl programming. My father is helping me learn said programming language. However, I am stuck on one of the assignments he has given me, and I can't find very much help with it via google, either because I have a tiny attention span, or because I can be very very dense. ... (4 Replies)
Discussion started by: kittyluva2
4 Replies

6. Shell Programming and Scripting

Perl :How to print the o/p of a Perl script on console and redirecting same in log file @ same time.

How can i print the output of a perl script on a unix console and redirect the same in a log file under same directory simultaneously ? Like in Shell script, we use tee, is there anything in Perl or any other option ? (2 Replies)
Discussion started by: butterfly20
2 Replies

7. Shell Programming and Scripting

To retrieve gene name & function thorugh Genbank id (gi|9910297)

Hi , I have list of genbank id's and ref number in this format. gi|9910297|ref|NM_019974.1| I want to retrive the gene name and fuction for each genbank list. I have around 1300 gi numbers in my excel sheet. So anybody can help me to retrive the information from NCBI through perl script... (0 Replies)
Discussion started by: shibujohn82
0 Replies

8. Shell Programming and Scripting

Passing date formats in Perl: i.e. Jul/10/2007 -> 20070710 (yyyymmdd) - Perl

Hi , This script working for fine if pass script-name.sh Jul/10/2007 ,I want to pass 20070710(yyyymmdd) .Please any help it should be appereciated. use Time::Local; my $d = $ARGV; my $t = $ARGV; my $m = ""; @d = split /\//, $d; @t = split /:/, $t; if ( $d eq "Jan" ) { $m = 0 }... (7 Replies)
Discussion started by: akil
7 Replies

9. Shell Programming and Scripting

Extracting DNA sequences from GenBank files using Perl

Hi all, Using Perl, I need to extract DNA bases from a GenBank file for a given plant species. A sample GenBank file is here... Nucleotide This is saved on my computer as NC_001666.gb. I also have a file that is saved on my computer as NC_001666.txt. This text file has a list of all... (5 Replies)
Discussion started by: akreibich07
5 Replies

10. Shell Programming and Scripting

[Perl] Accessing array elements within a sed command in Perl script

I am trying to use a script to replace the header of each file, whose filename are stored within the array $test, using the sed command within a Perl script as follows: $count = 0; while ( $count < $#test ) { `sed -e 's/BIOGRF 321/BIOGRF 332/g' ${test} > 0`; `cat 0 >... (2 Replies)
Discussion started by: userix
2 Replies
Login or Register to Ask a Question