GenBank Perl help...

05-27-2009

Registered User

2,100, 402

Join Date: Apr 2009

Last Activity: 11 February 2020, 10:24 AM EST

Posts: 2,100

Thanks Given: 26

Thanked 402 Times in 360 Posts

Quote:

Originally Posted by akreibich07

...
Is it possible to do something on Emacs where I put in the file I want to run (perl process_genbank.pl) and then after it put in the specific NC_ file that I want it run on (almost as a command-line)...

Sorry, I have never used Emacs.
In vim, however, if you type this:

Code:

: r ! <shell_command>

at the editor prompt ":", then it will execute the shell_command and paste the output in your vim file at the cursor position.
You may want to check the Emacs documentation for the equivalent set of keystrokes.

Quote:

...
(2) If I needed at some point in time the output to be more specific, so each line after genes: said something like

rps12 CDS exon -(####..####), intron -(####..####), exon(etc)
rps16 tRNA (etc)

If you look on that NCBI file, the left margin has gene for each new gene and then under it says either CDS if it's coding a protein, or tRNA, or I believe there might be a few rRNA's in other files, and then has each exon and intron for that particular gene before going onto the next gene. Is it possible to have that output like I said above (one gene per line, just double spaces in between things like the gene name, the CDS/tRNA/rRNA, etc)?
...

This is with references to "sequences.gb" from ""http://www.ncbi.nlm.nih.gov/nuccore/NC_010093"

(1) The gene name was picked from the line right below / gene / e.g. for this excerpt:

Code:

$ perl -ne 'print $.,"\t",$_ if 35..56' sequences.gb
35           gene            complement(join(96591..97384,69098..69211))
36                           /gene="rps12"
37                           /locus_tag="AcamCp045"
38                           /trans_splicing
39                           /db_xref="GeneID:5777700"
40           CDS             complement(join(96591..96616,97153..97384,69098..69211))
41                           /gene="rps12"
42                           /locus_tag="AcamCp045"
43                           /trans_splicing
44                           /note="trans-splices to 5' rps12 exon within the LSC"
45                           /codon_start=1
46                           /transl_table=11
47                           /product="ribosomal protein S12"
48                           /protein_id="YP_001586161.1"
49                           /db_xref="GI:161622289"
50                           /db_xref="GeneID:5777700"
51                           /translation="MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKK
52                           PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTL
53                           DAVGVKDRQQGRSKYGVKKPK"
54           misc_feature    1..83496
55                           /note="large single copy region (LSC)"
56           gene            complement(136..1197)
$

The name in bold red was picked for gene name; the name in bold blue was *NOT*.

(2) In cases where none of CDS, tRNA or rRNA was found for a gene, the value "DEFAULT" is being displayed e.g. for this excerpt:

Code:

$
$ perl -ne 'print $.,"\t",$_ if 1784..1793' sequences.gb
1784         gene            108281..109521
1785                         /gene="ycf1"
1786                         /locus_tag="AcamCp068"
1787                         /note="hypothetical protein RF1; 5' truncated fragment
1788                         within IRb"
1789                         /pseudo
1790                         /db_xref="GeneID:5777792"
1791         misc_feature    109522..127794
1792                         /note="small single copy region (SSC)"
1793         gene            complement(109702..111921)
$
$

you'd see "DEFAULT" as the gene type. You can set it to NULL or something else.

The updated script is as follows:

Code:

$ 
$ cat process_genbank2.pl
#!/usr/bin/perl -w       

my $element = "";             # an element of the gene array
# An element of the gene array will be of the form:         
#                                                           
# GENE_NAME~GENE_SEQUENCE~GENE_TYPE~EXIN_STRING             
#                                                           
# GENE_NAME = the value after "/gene=..." right below / gene /
# GENE_SEQUENCE = the formatted gene sequence to the right of / gene /
# GENE_TYPE = CDS|tRNA|rRNA, or "" if none was found                  
# EXIN_STRING = ", EI FMTSEQ, EI FMTSEQ..." where EI = exon|intron    
#               and FMTSEQ = formatted sequence of that EI            
#                                                                     
my $gbfile = "sequences.gb";  # file that we are processing           
my $genename = "";            # main gene name; right below / gene /  
my $geneseq = "";             # main gene sequence                    
my $exin = "";                # string of exons and introns           
my $genetype = "";            # the gene type - CDS|tRNA|rRNA         

## subroutines
sub format_sequence {
  my $seq = "@_";    
  if ($seq =~ /^\d+..\d+$/) {
    $seq = "($seq)";         
  } elsif ($seq =~ /^complement\(\d+..\d+\)$/) {
    $seq =~ s/complement/-/;                    
  } elsif ($seq =~ /^complement\(join\((\d+..\d+),(\d+..\d+)\)\)$/) {
    $seq = "-($1,$2)";                                               
  } elsif ($seq =~ /^join\(complement\((\d+..\d+)\),(\d+..\d+)\)$/) {
    $seq = "-($1),($2)";                                             
  }                                                                  
  return $seq;                                                       
}                                                                    

## main section
open(GB, $gbfile) or die "Can't open $gbfile: $!";
while (<GB>) {                                    
  chomp($line = $_);                              
  next if ($line =~ /^\s*$/ or $line =~ /\/\//);  
  if (/^locus/i) {                                
    @x = split /\s+/,$line;                       
    $fname = $x[1];                               
    $base = $x[2];                                
  } elsif (/ organism /i) {                       
    $line =~ s/\s*ORGANISM\s*//g;                 
    $oname = $line;                               
  } elsif (/^ORIGIN| gene /) {                    
    if ($element ne "" or /^ORIGIN/) {            
      # push all data we've gathered of the previous gene into the array
      $element = "$element~$exin";                                      
      push @genearray, $element;                                        
      $element = "";                                                    
      $exin = "";                                                       
      $genename = "";                                                   
      $genetype = "";                                                   
    }                                                                   
    if (/ gene /){                                                      
      @x = split/\s+/,$line;                                              
      $geneseq = format_sequence($x[2]);                                  
      $element = $geneseq;                                                
    }                                                                   
  } elsif ($genename eq "" and /^\s+\/gene="(.*)"/) {                   
    $genename = $1;                                                     
    $element = "$genename~$element";                                    
  } elsif (/ (CDS|tRNA|rRNA) /) {                                       
    $genetype = $1;                                                     
    $element = "$element~$genetype";                                    
  } elsif (defined substr($line,5,1) &&                                 
           substr($line,5,1) ne " " &&                                  
           / (exon|intron) /) {
      @x = split/\s+/,$line;
      $x[2] = format_sequence($x[2]);
      $exin = "$exin, $1 $x[2]";
  }
}
close(GB) or die "Can't close $gbfile: $!";

# Now start printing the information we gathered
print "Name: $oname\n";
print "FileName: $fname\n";
print "Bases: $base\n";
print "Genes:";
foreach $item (@genearray) {
  @x = split/~/, $item;
  $genetype = "DEFAULT"; # set genetype to "DEFAULT"
  $geneseq = $x[1];      # set the default geneseq to the main gene sequence
  # now start from the right side of the array
  if (defined $x[3] and $x[3] ne ""){ # this is exin information
    $geneseq = $x[3];
    substr($geneseq,0,2) = "";
    # set genetype if $x[2] looks good
    if (defined $x[2] and $x[2] !~ /^, /) {
      $genetype = $x[2];
    }
  } elsif (defined $x[2] and $x[2] =~ /^, /) { # no CDS|tRNA|rRNA found, but exin found
    $geneseq = $x[2];
    substr($geneseq,0,2) = "";
  } elsif (defined $x[2] and $x[2] !~ /^, /) { # no exin found but CDS|tRNA|rRNA found
    $genetype = $x[2];
  }
  printf("\t%-10s  %-10s %s\n",$x[0],$genetype,$geneseq);
}

$
$

And the execution on the "sequences.gb" mentioned above is as follows:

Code:

$ 
$ perl process_genbank2.pl
Name: Acorus americanus   
FileName: NC_010093       
Bases: 153819             
Genes:  rps12       CDS        -(96591..97384,69098..69211)
        psbA        CDS        -(136..1197)                
        trnK        tRNA       exon -(1388..1422), intron -(1423..3922)
        matK        CDS        exon -(3923..3959)                      
        rps16       CDS        exon -(4654..4874), intron -(4875..5697), exon -(5698..5737)
        trnQ        tRNA       -(6563..6634)                                               
        psbK        CDS        (6985..7170)                                                
        psbI        CDS        (7607..7717)                                                
        trnS        tRNA       -(7834..7921)                                               
        trnG        tRNA       exon (8847..8869), intron (8870..9630), exon (9631..9678)   
        trnR        tRNA       (9773..9844)                                                
        atpA        CDS        -(9940..11463)                                              
        atpF        CDS        exon -(11531..11940), intron -(11941..12658), exon -(12659..12803)
        atpH        CDS        -(13300..13545)                                                   
        atpI        CDS        -(14474..15217)                                                   
        rps2        CDS        -(15451..16161)                                                   
        rpoC2       CDS        -(16344..20504)                                                   
        rpoC1       CDS        exon -(20670..22286), intron -(22287..23046), exon -(23047..23478)
        rpoB        CDS        -(23505..26756)                                                   
        trnC        tRNA       (27763..27833)                                                    
        petN        CDS        (28715..28804)                                                    
        psbM        CDS        -(29784..29888)                                                   
        trnD        tRNA       -(30412..30485)                                                   
        trnY        tRNA       -(30823..30906)                                                   
        trnE        tRNA       -(30971..31043)                                                   
        trnT        tRNA       (31561..31632)                                                    
        psbD        CDS        (32999..34060)                                                    
        psbC        CDS        (34044..35429)                                                    
        trnS        tRNA       -(35649..35741)                                                   
        psbZ        CDS        (36122..36310)                                                    
        trnG        tRNA       (36587..36657)                                                    
        trnfM       tRNA       -(36798..36871)                                                   
        rps14       CDS        -(37033..37335)                                                   
        psaB        CDS        -(37473..39677)                                                   
        psaA        CDS        -(39703..41955)                                                   
        ycf3        CDS        exon -(42574..42726), intron -(42727..43454), exon -(43455..43684), intron -(43685..44429), exon -(44430..44553)                                                                                                                                 
        trnS        tRNA       (45383..45469)                                                                                           
        rps4        CDS        -(45769..46374)                                                                                          
        trnT        tRNA       -(46725..46797)                                                                                          
        trnL        tRNA       exon (47541..47575), intron (47576..48101), exon (48102..48151)                                          
        trnF        tRNA       (48522..48594)                                                                                           
        ndhJ        CDS        -(48980..49456)                                                                                          
        ndhK        CDS        -(49532..50251)                                                                                          
        ndhC        CDS        -(50305..50667)                                                                                          
        trnV        tRNA       exon -(52580..52614), intron -(52615..53222), exon -(53223..53260)                                       
        trnM        tRNA       (53438..53509)                                                                                           
        atpE        CDS        -(53676..54080)                                                                                          
        atpB        CDS        -(54077..55573)                                                                                          
        rbcL        CDS        (56391..57833)                                                                                           
        psaI        CDS        (59093..59203)                                                                                           
        ycf4        CDS        (59579..60133)                                                                                           
        cemA        CDS        (60675..61364)                                                                                           
        petA        CDS        (61554..62516)                                                                                           
        psbJ        CDS        -(63395..63517)                                                                                          
        psbL        CDS        -(63652..63768)                                                                                          
        psbF        CDS        -(63791..63910)                                                                                          
        psbE        CDS        -(63920..64171)                                                                                          
        petL        CDS        (65441..65536)                                                                                           
        petG        CDS        (65699..65812)                                                                                           
        trnW        tRNA       -(65926..65999)                                                                                          
        trnP        tRNA       -(66165..66238)                                                                                          
        psaJ        CDS        (66596..66724)                                                                                           
        rpl33       CDS        (67088..67288)                                                                                           
        rps18       CDS        (67428..67733)                                                                                           
        rpl20       CDS        -(68002..68355)                                                                                          
        rps12       CDS        exon -(69098..69211)                                                                                     
        clpP        CDS        exon -(69426..69671), intron -(69672..70313), exon -(70314..70605), intron -(70606..71367), exon -(71368..71438)                                                                                                                                 
        psbB        CDS        (71880..73406)                                                                                           
        psbT        CDS        (73582..73689)                                                                                           
        psbN        CDS        -(73749..73880)                                                                                          
        psbH        CDS        (73993..74214)                                                                                           
        petB        CDS        exon (74334..74339), intron (74340..75123), exon (75124..75765)                                          
        petD        CDS        exon (75956..75963), intron (75964..76689), exon (76690..77164)                                          
        rpoA        CDS        -(77332..78345)                                                                                          
        rps11       CDS        -(78414..78830)                                                                                          
        rpl36       CDS        -(78951..79064)                                                                                          
        infA        CDS        -(79178..79411)                                                                                          
        rps8        CDS        -(79543..79941)                                                                                          
        rpl14       CDS        -(80114..80482)                                                                                          
        rpl16       CDS        exon -(80624..81022), intron -(81023..81960), exon -(81961..81969)                                       
        rps3        CDS        -(82130..82786)                                                                                          
        rpl22       CDS        -(82872..83255)                                                                                          
        rps19       CDS        -(83305..83583)                                                                                          
        trnH        tRNA       (83750..83824)                                                                                           
        rpl2        CDS        exon -(83856..84286), intron -(84287..84955), exon -(84956..85346)                                       
        rpl23       CDS        -(85365..85646)                                                                                          
        trnI        tRNA       -(85813..85886)                                                                                          
        ycf2        CDS        (86098..92328)                                                                                           
        trnL-CAA    tRNA       -(92877..92957)                                                                                          
        ndhB        CDS        exon -(93521..94276), intron -(94277..94976), exon -(94977..95753)                                       
        rps7        CDS        exon -(96591..96616), intron -(96617..97152), exon -(97153..97384)                                       
        trnV        tRNA       (99215..99286)                                                                                           
        rrn16       rRNA       (99514..101004)                                                                                          
        trnI        tRNA       exon (101318..101354), intron (101355..102296), exon (102297..102331)                                    
        trnA        tRNA       exon (102400..102437), intron (102438..103236), exon (103237..103271)                                    
        rrn23       rRNA       (103424..106235)                                                                                         
        rrn4.5      rRNA       (106334..106436)                                                                                         
        rrn5        rRNA       (106654..106774)                                                                                         
        trnR        tRNA       (107036..107109)                                                                                         
        trnN        tRNA       -(107607..107678)                                                                                        
        ycf1        DEFAULT    (108281..109521)
        ndhF        CDS        -(109702..111921)
        rpl32       CDS        (112773..112946)
        trnL        tRNA       (113810..113889)
        ccsA        CDS        (113971..114936)
        ndhD        CDS        -(115163..116665)
        psaC        CDS        -(116780..117025)
        ndhE        CDS        -(117270..117575)
        ndhG        CDS        -(117668..118201)
        ndhI        CDS        -(118543..119082)
        ndhA        CDS        exon -(119160..119698), intron -(119699..120803), exon -(120804..121362)
        ndhH        CDS        -(121364..122545)
        rps15       CDS        -(122651..122923)
        ycf1        CDS        -(123294..129035)
        trnN        tRNA       (129638..129709)
        trnR        tRNA       -(130207..130280)
        rrn5        rRNA       -(130542..130662)
        rrn4.5      rRNA       -(130880..130982)
        rrn23       rRNA       -(131081..133892)
        trnA        tRNA       exon -(134045..134079), intron -(134080..134878), exon -(134879..134916)
        trnI        tRNA       exon -(134985..135019), intron -(135020..135961), exon -(135962..135998)
        rrn16       rRNA       -(136312..137802)
        trnV        tRNA       -(138030..138101)
        rps12       DEFAULT    exon (139932..140163), intron (140164..140699), exon (140700..140725)
        rps7        CDS        (140778..141245)
        ndhB        CDS        exon (141563..142339), intron (142340..143039), exon (143040..143795)
        trnL        tRNA       (144359..144439)
        ycf2        CDS        -(144988..151218)
        trnI        tRNA       (151430..151503)
        rpl23       CDS        (151670..151951)
        rpl2        CDS        exon (151970..152360), intron (152361..153029), exon (153030..153460)
        trnH        tRNA       -(153492..153566)
        rps19       DEFAULT    (153733..153819)
$
$

Hope that helps,
tyler_durden

_________________________________________________________________________
"And the eighth and final rule: if this is your first time at Fight Club, you have to fight."

durden_tyler

View Public Profile for durden_tyler

Find all posts by durden_tyler

Shell Programming and Scripting

GenBank Perl help...

10 More Discussions You Might Find Interesting

1. Programming

PERL: In a perl-scripttTrying to execute another perl-script that SETS SOME VARIABLES !

Discussion started by: alex_5161

2. Programming

Perl: restrict perl from automaticaly creating a hash branches on check

Discussion started by: alex_5161

3. UNIX for Advanced & Expert Users

perl and HP-UX : instmodsh in combination with software depot : update inventory for installed Perl

Discussion started by: bora99

4. Shell Programming and Scripting

HELP on Perl array / sorting - trying to convert Korn Shell Script to Perl

Discussion started by: newbie_01

5. Shell Programming and Scripting

Hidden Characters in Regular Expression Matching Perl - Perl Newbie

Discussion started by: kittyluva2

6. Shell Programming and Scripting

Perl :How to print the o/p of a Perl script on console and redirecting same in log file @ same time.

Discussion started by: butterfly20

7. Shell Programming and Scripting

To retrieve gene name & function thorugh Genbank id (gi|9910297)

Discussion started by: shibujohn82

8. Shell Programming and Scripting

Passing date formats in Perl: i.e. Jul/10/2007 -> 20070710 (yyyymmdd) - Perl

Discussion started by: akil

9. Shell Programming and Scripting

Extracting DNA sequences from GenBank files using Perl

Discussion started by: akreibich07

10. Shell Programming and Scripting

[Perl] Accessing array elements within a sed command in Perl script

Discussion started by: userix