![]() |
Hello and Welcome from United States to the UNIX and Linux Forums! Thank You for Visiting and Joining Our Global Community.
|
|
google unix.com
|
|||||||
| Forums | Register | Forum Rules | Links | Albums | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| Shell Programming and Scripting Post questions about KSH, CSH, SH, BASH, PERL, PHP, SED, AWK and OTHER shell scripts and shell scripting languages here. |
More UNIX and Linux Forum Topics You Might Find Helpful
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| perl -write values in a file to @array in perl | meghana | Shell Programming and Scripting | 27 | 06-07-2009 05:05 PM |
| How to Turn perl one-liners into full perl script? | EDALBNUG | UNIX for Dummies Questions & Answers | 1 | 02-04-2009 10:49 AM |
| [Perl] Accessing array elements within a sed command in Perl script | userix | Shell Programming and Scripting | 2 | 10-03-2008 12:05 PM |
| [PERL] Running unix commands within Perl Scripts | userix | Shell Programming and Scripting | 1 | 05-28-2008 06:06 PM |
| Perl: Run perl script in the current process | vino | Shell Programming and Scripting | 10 | 12-09-2005 10:45 AM |
![]() |
|
|
LinkBack | Thread Tools | Search this Thread | Rate Thread | Display Modes |
|
||||
|
GenBank Perl help...
Hey guys,
I'm doing some Perl scripting for genomic data out of GenBank files...I have to extract the name of the plant, the file name, the number of bases, and all of the genes including their starting and ending positions...for example, with this GenBank file, LOCUS NC_010093 153819 bp DNA circular PLN 07-MAY-2009 DEFINITION Acorus americanus chloroplast, complete genome. ACCESSION NC_010093 VERSION NC_010093.1 GI:161622288 DBLINK Project:27981 KEYWORDS . SOURCE chloroplast Acorus americanus ORGANISM Acorus americanus Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; Liliopsida; Acoraceae; Acorus. REFERENCE 1 (bases 1 to 153819) AUTHORS Peery,R.M., Chumley,T.W., Kuehl,J.V., Boore,J.L. and Raubeson,L.A. TITLE The complete chloroplast genome of Acorus americanus JOURNAL Unpublished REFERENCE 2 (bases 1 to 153819) CONSRTM NCBI Genome Project TITLE Direct Submission JOURNAL Submitted (03-DEC-2007) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA REFERENCE 3 (bases 1 to 153819) AUTHORS Peery,R.M., Chumley,T.W., Kuehl,J.V., Boore,J.L. and Raubeson,L.A. TITLE Direct Submission JOURNAL Submitted (09-NOV-2007) Department of Biological Sciences, Central Washington University, 400 E University Way, Ellensburg, WA 98926-7537, USA COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review. The reference sequence was derived from EU273602. COMPLETENESS: full length. FEATURES Location/Qualifiers source 1..153819 /organism="Acorus americanus" /organelle="plastid:chloroplast" /mol_type="genomic DNA" /db_xref="taxon:263995" gene complement(join(96591..97384,69098..69211)) /gene="rps12" /locus_tag="AcamCp045" /trans_splicing /db_xref="GeneID:5777700" CDS complement(join(96591..96616,97153..97384,69098..69211)) /gene="rps12" /locus_tag="AcamCp045" /trans_splicing /note="trans-splices to 5' rps12 exon within the LSC" /codon_start=1 /transl_table=11 /product="ribosomal protein S12" /protein_id="YP_001586161.1" /db_xref="GI:161622289" /db_xref="GeneID:5777700" /translation="MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKK PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTL DAVGVKDRQQGRSKYGVKKPK" misc_feature 1..83496 /note="large single copy region (LSC)" gene complement(136..1197) /gene="psbA" /locus_tag="AcamCp001" /db_xref="GeneID:5777757" From this I would need to extract the NC number next to Locus for filename, the name next to organism, and the number in front of bp for bases. Then for every time there's a new gene, return what is next to gene=, and then the 2 numbers inside the complement parentheses for the start and end of that gene sequence. Also, is there a way to make so if I put in the NC number I want, it spits back out the rest of the info...the website this is from is Nucleotide - Acorus americanus chloroplast, complete genome, so if I put in the number I wanted next to the NC, then it would give me the information I want as specified above for that specific NC number....? I've just been having trouble getting the info extracted out of a GenBank file...any help in the coding for this would be great...thanks! |
| Bits Awarded / Charged to durden_tyler for this Post | |||
| Date | User | Comment | Amount |
| 05-25-2009 | Neo | N/A | 50,000 |
|
||||
|
Hi tyler_durden,
Well, what's going on is I have over 100 GenBank files, each is it's own NC_ number...(this NC number is just a way to catalogue them). I want to be able to give that NC_ number as a command-line possibly and have it go to the internet, replace that website I posted with whatever NC_ number was given, go into that GenBank file, and return something like: Name: ___________ FileName: ___________ (NC_######) Bases: ________ Genes: rps (####, #####) - for example, this would be for the rps gene, and then a new line, and all the rest of the genes are given....if the word complement is present before it, then it should print a negative before the numbers in parentheses: rps -(####, #####) If it's impossible to use the NC_ number as a command-line, it's okay for me to do it manually, just so I can have a program that can do the output of the name, filename, bases, and genes for me. It would save me loads and loads of time in my research, rather than going through and doing each one by hand. Thanks! -akreibich07 |
|
|||||
|
(1) Name of what ?
(2) What's the value of this field for your example above ? Quote:
What should be the output for each of the following cases ? (Type in the required output for each case.) Code:
(3) gene 151670..151951 (4) gene complement(join(96591..97384,69098..69211)) (5) gene join(complement(69098..69211),139932..140725) (6) gene complement(153492..153566) Quote:
For example, when you specify the NC_ number "NC_010093" - the website is "http://www.ncbi.nlm.nih.gov/nuccore/NC_010093". This is trivial. But this would not allow you to download the text file that has the data you need and that is embedded in that website. The download URL for the text file for your NC_ number "NC_010093" is this - "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&db=nuccore&val=161622288&dopt=genbank&sendto=on&log$=seqview&extrafeat=976&m axplex=1" Does this URL ring a bell ? Do you know how to generate this URL, given the NC_ number ? If you do, then you can use LWP::Simple perl module to download the text file to your computer and then process it the way you want. HTH, tyler_durden |
| Bits Awarded / Charged to durden_tyler for this Post | |||
| Date | User | Comment | Amount |
| 05-25-2009 | Neo | N/A | 30,000 |
|
||||
|
(1) Name: _________ -- something like "Acorus Americanus"
(2) for situation (3) gene (151670..151951) (4) gene -(96591..97384, 69098..69211) (5) gene -(69098..69211), (139932..140725) (6) gene -(153492..153566) I'll check on the URL idea you presented...the most important thing to me right now is just getting a script that can be given a whole GenBank file and can give the output I specified. Even if it can only do it for the specific NC_ number I used as an example, if that one works, I can try to find ways to generalize it. Thanks again. -akreibich07 |
|
|||||
|
Quote:
Assuming that you have figured out a way to download one "sequences.gb" for a given NC_ number, (either manually or via some other tool), here's what you can do: Code:
$
$ head -5 sequences.gb
LOCUS NC_010093 153819 bp DNA circular PLN 07-MAY-2009
DEFINITION Acorus americanus chloroplast, complete genome.
ACCESSION NC_010093
VERSION NC_010093.1 GI:161622288
DBLINK Project:27981
$
$ tail -5 sequences.gb
153661 gattttggaa gaaaattcca attttcaaat acaaaaactt taaactttag gtaaaaaaag
153721 gaggaatcag ctgtgacacg ttcattaaaa aaaaatcctt ttgtggccaa tcatttattg
153781 gaaaaaattg agaagctcaa catgaaagag gagaaagaa
//
$
$ cat process_genbank.pl
#!/usr/bin/perl -w
my $element;
my $in = 0;
$gbfile = "sequences.gb";
open(GB, $gbfile) or die "Can't open $gbfile: $!";
while (<GB>) {
chomp($line = $_);
if (/^locus/i) {
@x = split /\s+/,$line;
$fname = $x[1];
$base = $x[2];
} elsif (/ organism /i) {
$line =~ s/\s*ORGANISM\s*//g;
$oname = $line;
} elsif (/ gene /) {
@x = split/\s+/,$line;
if ($x[2] =~ /^\d+..\d+$/) {
$x[2] = "($x[2])";
} elsif ($x[2] =~ /^complement\(\d+..\d+\)$/) {
$x[2] =~ s/complement/-/;
} elsif ($x[2] =~ /^complement\(join\((\d+..\d+),(\d+..\d+)\)\)$/) {
$x[2] = "-($1,$2)";
} elsif ($x[2] =~ /^join\(complement\((\d+..\d+)\),(\d+..\d+)\)$/) {
$x[2] = "-($1),($2)";
}
$element = $x[2];
$in = 1;
} elsif (/^\s+\/gene="(.*)"/ and $in == 1) {
$element = $1.":".$element;
push @genearray, $element;
$element = "";
$in = 0;
}
}
close(GB) or die "Can't close $gbfile: $!";
# Now start printing the information we gathered
print "Name: $oname\n";
print "FileName: $fname\n";
print "Bases: $base\n";
print "Genes:";
foreach $item (@genearray) {
@x = split/:/, $item;
printf("\t%-10s %s\n",$x[0],$x[1]);
}
$
$
The execution of the perl program is shown below: Code:
$
$ perl process_genbank.pl
Name: Acorus americanus
FileName: NC_010093
Bases: 153819
Genes: rps12 -(96591..97384,69098..69211)
psbA -(136..1197)
trnK -(1388..3959)
matK -(1680..3215)
rps16 -(4654..5737)
trnQ -(6563..6634)
psbK (6985..7170)
psbI (7607..7717)
trnS -(7834..7921)
trnG (8847..9678)
trnR (9773..9844)
atpA -(9940..11463)
atpF -(11531..12803)
atpH -(13300..13545)
atpI -(14474..15217)
rps2 -(15451..16161)
rpoC2 -(16344..20504)
rpoC1 -(20670..23478)
rpoB -(23505..26756)
trnC (27763..27833)
petN (28715..28804)
psbM -(29784..29888)
trnD -(30412..30485)
trnY -(30823..30906)
trnE -(30971..31043)
trnT (31561..31632)
psbD (32999..34060)
psbC (34044..35429)
trnS -(35649..35741)
psbZ (36122..36310)
trnG (36587..36657)
trnfM -(36798..36871)
rps14 -(37033..37335)
psaB -(37473..39677)
psaA -(39703..41955)
ycf3 -(42574..44553)
trnS (45383..45469)
rps4 -(45769..46374)
trnT -(46725..46797)
trnL (47541..48151)
trnF (48522..48594)
ndhJ -(48980..49456)
ndhK -(49532..50251)
ndhC -(50305..50667)
trnV -(52580..53260)
trnM (53438..53509)
atpE -(53676..54080)
atpB -(54077..55573)
rbcL (56391..57833)
psaI (59093..59203)
ycf4 (59579..60133)
cemA (60675..61364)
petA (61554..62516)
psbJ -(63395..63517)
psbL -(63652..63768)
psbF -(63791..63910)
psbE -(63920..64171)
petL (65441..65536)
petG (65699..65812)
trnW -(65926..65999)
trnP -(66165..66238)
psaJ (66596..66724)
rpl33 (67088..67288)
rps18 (67428..67733)
rpl20 -(68002..68355)
rps12 -(69098..69211),(139932..140725)
clpP -(69426..71438)
psbB (71880..73406)
psbT (73582..73689)
psbN -(73749..73880)
psbH (73993..74214)
petB (74334..75765)
petD (75956..77164)
rpoA -(77332..78345)
rps11 -(78414..78830)
rpl36 -(78951..79064)
infA -(79178..79411)
rps8 -(79543..79941)
rpl14 -(80114..80482)
rpl16 -(80624..81969)
rps3 -(82130..82786)
rpl22 -(82872..83255)
rps19 -(83305..83583)
trnH (83750..83824)
rpl2 -(83856..85346)
rpl23 -(85365..85646)
trnI -(85813..85886)
ycf2 (86098..92328)
trnL-CAA -(92877..92957)
ndhB -(93521..95753)
rps7 -(96071..96538)
trnV (99215..99286)
rrn16 (99514..101004)
trnI (101318..102331)
trnA (102400..103271)
rrn23 (103424..106235)
rrn4.5 (106334..106436)
rrn5 (106654..106774)
trnR (107036..107109)
trnN -(107607..107678)
ycf1 (108281..109521)
ndhF -(109702..111921)
rpl32 (112773..112946)
trnL (113810..113889)
ccsA (113971..114936)
ndhD -(115163..116665)
psaC -(116780..117025)
ndhE -(117270..117575)
ndhG -(117668..118201)
ndhI -(118543..119082)
ndhA -(119160..121362)
ndhH -(121364..122545)
rps15 -(122651..122923)
ycf1 -(123294..129035)
trnN (129638..129709)
trnR -(130207..130280)
rrn5 -(130542..130662)
rrn4.5 -(130880..130982)
rrn23 -(131081..133892)
trnA -(134045..134916)
trnI -(134985..135998)
rrn16 -(136312..137802)
trnV -(138030..138101)
rps12 (139932..140725)
rps7 (140778..141245)
ndhB (141563..143795)
trnL (144359..144439)
ycf2 -(144988..151218)
trnI (151430..151503)
rpl23 (151670..151951)
rpl2 (151970..153460)
trnH -(153492..153566)
rps19 (153733..153819)
$
$
tyler_durden |
|
||||
|
Hi again tyler_durden,
This code is working great for all the files I have tried it on so far. Thank you...it helps more than you know. I had a couple quick questions: (1) This again probably has something to do with the URL idea you talked about a few replies ago, but I do most of my text editing on Emacs. I have another program that I already made (much more simple), which just goes onto the main NCBI page and returns only the NC_ file numbers that I want and puts them into another file. Is it possible to do something on Emacs where I put in the file I want to run (perl process_genbank.pl) and then after it put in the specific NC_ file that I want it run on (almost as a command-line). And if so, would there be specific changes to the code I would have to make? (2) If I needed at some point in time the output to be more specific, so each line after genes: said something like rps12 CDS exon -(####..####), intron -(####..####), exon(etc) rps16 tRNA (etc) If you look on that NCBI file, the left margin has gene for each new gene and then under it says either CDS if it's coding a protein, or tRNA, or I believe there might be a few rRNA's in other files, and then has each exon and intron for that particular gene before going onto the next gene. Is it possible to have that output like I said above (one gene per line, just double spaces in between things like the gene name, the CDS/tRNA/rRNA, etc)? Again, thanks a million for the code you already submitted on here. It's already been a big help with some of my research. ![]() -akreibich07 -----Post Update----- Just a quick note... I don't think I was quite clear in my last message, so I wanted to clarify. Next to each instance of gene, there is a sequence of numbers that signifies the entire sequence of bases that that gene is contained in. When you go down to the exon, intron, exon subheadings below each gene, each has its own sequence of numbers which is always inside the range of the full sequence of numbers given by the sequence next to gene. In the input that I just stated I was possibly looking for, I didn't want the full sequence next to gene, just the individual subsequences next to each exon, intron, exon, etc... In an over-simplified example, if the sequence next to gene was (1..117), then I wouldn't want that, but rather something like... exon (1..53), intron (54..98), exon (99..117). It will always (at least I'm banking on it) work out that if the numbers for each exon, intron sequence were concatenated, it would end up being the full sequence from the gene heading. -akreibich07 -----Post Update----- If there is no exon, intron, exon subheadings below gene, this indicates that the full sequence is just one full exon with no interrupting introns. In cases like this, the original output that you gave will be good, just with an added CDS/tRNA/rRNA in between the gene name and the range. |
| Sponsored Links | ||
|
|
![]() |
| Bookmarks |
| Tags |
| extraction, file, perl, script |
| Thread Tools | Search this Thread |
| Display Modes | Rate This Thread |
|
|