...
Is it possible to do something on Emacs where I put in the file I want to run (perl process_genbank.pl) and then after it put in the specific NC_ file that I want it run on (almost as a command-line)...
Sorry, I have never used Emacs.
In vim, however, if you type this:
Code:
: r ! <shell_command>
at the editor prompt ":", then it will execute the shell_command and paste the output in your vim file at the cursor position.
You may want to check the Emacs documentation for the equivalent set of keystrokes.
Quote:
...
(2) If I needed at some point in time the output to be more specific, so each line after genes: said something like
If you look on that NCBI file, the left margin has gene for each new gene and then under it says either CDS if it's coding a protein, or tRNA, or I believe there might be a few rRNA's in other files, and then has each exon and intron for that particular gene before going onto the next gene. Is it possible to have that output like I said above (one gene per line, just double spaces in between things like the gene name, the CDS/tRNA/rRNA, etc)?
...
This is with references to "sequences.gb" from ""http://www.ncbi.nlm.nih.gov/nuccore/NC_010093"
(1) The gene name was picked from the line right below / gene / e.g. for this excerpt:
Code:
$ perl -ne 'print $.,"\t",$_ if 35..56' sequences.gb
35 gene complement(join(96591..97384,69098..69211))
36 /gene="rps12"
37 /locus_tag="AcamCp045"
38 /trans_splicing
39 /db_xref="GeneID:5777700"
40 CDS complement(join(96591..96616,97153..97384,69098..69211))
41 /gene="rps12"
42 /locus_tag="AcamCp045"
43 /trans_splicing
44 /note="trans-splices to 5' rps12 exon within the LSC"
45 /codon_start=1
46 /transl_table=11
47 /product="ribosomal protein S12"
48 /protein_id="YP_001586161.1"
49 /db_xref="GI:161622289"
50 /db_xref="GeneID:5777700"
51 /translation="MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKK
52 PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTL
53 DAVGVKDRQQGRSKYGVKKPK"
54 misc_feature 1..83496
55 /note="large single copy region (LSC)"
56 gene complement(136..1197)
$
The name in bold red was picked for gene name; the name in bold blue was *NOT*.
(2) In cases where none of CDS, tRNA or rRNA was found for a gene, the value "DEFAULT" is being displayed e.g. for this excerpt:
Code:
$
$ perl -ne 'print $.,"\t",$_ if 1784..1793' sequences.gb
1784 gene 108281..109521
1785 /gene="ycf1"
1786 /locus_tag="AcamCp068"
1787 /note="hypothetical protein RF1; 5' truncated fragment
1788 within IRb"
1789 /pseudo
1790 /db_xref="GeneID:5777792"
1791 misc_feature 109522..127794
1792 /note="small single copy region (SSC)"
1793 gene complement(109702..111921)
$
$
you'd see "DEFAULT" as the gene type. You can set it to NULL or something else.
The updated script is as follows:
Code:
$
$ cat process_genbank2.pl
#!/usr/bin/perl -w
my $element = ""; # an element of the gene array
# An element of the gene array will be of the form:
#
# GENE_NAME~GENE_SEQUENCE~GENE_TYPE~EXIN_STRING
#
# GENE_NAME = the value after "/gene=..." right below / gene /
# GENE_SEQUENCE = the formatted gene sequence to the right of / gene /
# GENE_TYPE = CDS|tRNA|rRNA, or "" if none was found
# EXIN_STRING = ", EI FMTSEQ, EI FMTSEQ..." where EI = exon|intron
# and FMTSEQ = formatted sequence of that EI
#
my $gbfile = "sequences.gb"; # file that we are processing
my $genename = ""; # main gene name; right below / gene /
my $geneseq = ""; # main gene sequence
my $exin = ""; # string of exons and introns
my $genetype = ""; # the gene type - CDS|tRNA|rRNA
## subroutines
sub format_sequence {
my $seq = "@_";
if ($seq =~ /^\d+..\d+$/) {
$seq = "($seq)";
} elsif ($seq =~ /^complement\(\d+..\d+\)$/) {
$seq =~ s/complement/-/;
} elsif ($seq =~ /^complement\(join\((\d+..\d+),(\d+..\d+)\)\)$/) {
$seq = "-($1,$2)";
} elsif ($seq =~ /^join\(complement\((\d+..\d+)\),(\d+..\d+)\)$/) {
$seq = "-($1),($2)";
}
return $seq;
}
## main section
open(GB, $gbfile) or die "Can't open $gbfile: $!";
while (<GB>) {
chomp($line = $_);
next if ($line =~ /^\s*$/ or $line =~ /\/\//);
if (/^locus/i) {
@x = split /\s+/,$line;
$fname = $x[1];
$base = $x[2];
} elsif (/ organism /i) {
$line =~ s/\s*ORGANISM\s*//g;
$oname = $line;
} elsif (/^ORIGIN| gene /) {
if ($element ne "" or /^ORIGIN/) {
# push all data we've gathered of the previous gene into the array
$element = "$element~$exin";
push @genearray, $element;
$element = "";
$exin = "";
$genename = "";
$genetype = "";
}
if (/ gene /){
@x = split/\s+/,$line;
$geneseq = format_sequence($x[2]);
$element = $geneseq;
}
} elsif ($genename eq "" and /^\s+\/gene="(.*)"/) {
$genename = $1;
$element = "$genename~$element";
} elsif (/ (CDS|tRNA|rRNA) /) {
$genetype = $1;
$element = "$element~$genetype";
} elsif (defined substr($line,5,1) &&
substr($line,5,1) ne " " &&
/ (exon|intron) /) {
@x = split/\s+/,$line;
$x[2] = format_sequence($x[2]);
$exin = "$exin, $1 $x[2]";
}
}
close(GB) or die "Can't close $gbfile: $!";
# Now start printing the information we gathered
print "Name: $oname\n";
print "FileName: $fname\n";
print "Bases: $base\n";
print "Genes:";
foreach $item (@genearray) {
@x = split/~/, $item;
$genetype = "DEFAULT"; # set genetype to "DEFAULT"
$geneseq = $x[1]; # set the default geneseq to the main gene sequence
# now start from the right side of the array
if (defined $x[3] and $x[3] ne ""){ # this is exin information
$geneseq = $x[3];
substr($geneseq,0,2) = "";
# set genetype if $x[2] looks good
if (defined $x[2] and $x[2] !~ /^, /) {
$genetype = $x[2];
}
} elsif (defined $x[2] and $x[2] =~ /^, /) { # no CDS|tRNA|rRNA found, but exin found
$geneseq = $x[2];
substr($geneseq,0,2) = "";
} elsif (defined $x[2] and $x[2] !~ /^, /) { # no exin found but CDS|tRNA|rRNA found
$genetype = $x[2];
}
printf("\t%-10s %-10s %s\n",$x[0],$genetype,$geneseq);
}
$
$
And the execution on the "sequences.gb" mentioned above is as follows:
_________________________________________________________________________
"And the eighth and final rule: if this is your first time at Fight Club, you have to fight."
I have reviewed many examples on-line about running another process (either PERL or shell command or a program), but do not find any usefull for my needs way. (Reviewed and not useful the system(), 'back ticks', exec() and open())
I would like to run another PERL-script from first one, not... (1 Reply)
My issue is that the perl script (as I have done it so far) created empty branches when I try to check some branches on existence.
I am using multydimentional hashes: found it as the best way for information that I need to handle. Saing multidimentional I means hash of hashes ... So, I have
... (2 Replies)
we create a HP-UX software depot with a new perl-modul. after installation of the software depot, the perl module
i can't find with instmodsh in the inventory for installed Perl modules.
- i have learned of using instmodsh command : i find out what modules are already installed on my system.
... (0 Replies)
Hi all,
Not sure if this should be in the programming forum, but I believe it will get more response under the Shell Programming and Scripting FORUM.
Am trying to write a customized df script in Perl and need some help with regards to using arrays and file handlers.
At the moment am... (3 Replies)
I am completely new to perl programming. My father is helping me learn said programming language. However, I am stuck on one of the assignments he has given me, and I can't find very much help with it via google, either because I have a tiny attention span, or because I can be very very dense.
... (4 Replies)
How can i print the output of a perl script on a unix console and redirect the same in a log file under same directory simultaneously ?
Like in Shell script, we use tee, is there anything in Perl or any other option ? (2 Replies)
Hi ,
I have list of genbank id's and ref number in this format.
gi|9910297|ref|NM_019974.1|
I want to retrive the gene name and fuction for each genbank list. I have around 1300 gi numbers in my excel sheet.
So anybody can help me to retrive the information from NCBI through perl script... (0 Replies)
Hi ,
This script working for fine if pass script-name.sh Jul/10/2007 ,I want to pass 20070710(yyyymmdd) .Please any help it should be appereciated.
use Time::Local;
my $d = $ARGV;
my $t = $ARGV;
my $m = "";
@d = split /\//, $d;
@t = split /:/, $t;
if ( $d eq "Jan" ) { $m = 0 }... (7 Replies)
Hi all,
Using Perl, I need to extract DNA bases from a GenBank file for a given plant species. A sample GenBank file is here...
Nucleotide
This is saved on my computer as NC_001666.gb. I also have a file that is saved on my computer as NC_001666.txt. This text file has a list of all... (5 Replies)
I am trying to use a script to replace the header of each file, whose filename are stored within the array $test, using the sed command within a Perl script as follows:
$count = 0;
while ( $count < $#test )
{
`sed -e 's/BIOGRF 321/BIOGRF 332/g' ${test} > 0`;
`cat 0 >... (2 Replies)