![]() |
|
|
google unix.com
|
|||||||
| Forums | Register | Forum Rules | Links | Albums | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| Shell Programming and Scripting Post questions about KSH, CSH, SH, BASH, PERL, PHP, SED, AWK and OTHER shell scripts and shell scripting languages here. |
More UNIX and Linux Forum Topics You Might Find Helpful
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| perl -write values in a file to @array in perl | meghana | Shell Programming and Scripting | 27 | 06-07-2009 06:05 PM |
| How to Turn perl one-liners into full perl script? | EDALBNUG | UNIX for Dummies Questions & Answers | 1 | 02-04-2009 10:49 AM |
| [Perl] Accessing array elements within a sed command in Perl script | userix | Shell Programming and Scripting | 2 | 10-03-2008 01:05 PM |
| [PERL] Running unix commands within Perl Scripts | userix | Shell Programming and Scripting | 1 | 05-28-2008 07:06 PM |
| Perl: Run perl script in the current process | vino | Shell Programming and Scripting | 10 | 12-09-2005 10:45 AM |
![]() |
|
|
LinkBack | Thread Tools | Search this Thread | Rate Thread | Display Modes |
|
|
|
||||
|
GenBank Perl help...
Hey guys,
I'm doing some Perl scripting for genomic data out of GenBank files...I have to extract the name of the plant, the file name, the number of bases, and all of the genes including their starting and ending positions...for example, with this GenBank file, LOCUS NC_010093 153819 bp DNA circular PLN 07-MAY-2009 DEFINITION Acorus americanus chloroplast, complete genome. ACCESSION NC_010093 VERSION NC_010093.1 GI:161622288 DBLINK Project:27981 KEYWORDS . SOURCE chloroplast Acorus americanus ORGANISM Acorus americanus Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; Liliopsida; Acoraceae; Acorus. REFERENCE 1 (bases 1 to 153819) AUTHORS Peery,R.M., Chumley,T.W., Kuehl,J.V., Boore,J.L. and Raubeson,L.A. TITLE The complete chloroplast genome of Acorus americanus JOURNAL Unpublished REFERENCE 2 (bases 1 to 153819) CONSRTM NCBI Genome Project TITLE Direct Submission JOURNAL Submitted (03-DEC-2007) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA REFERENCE 3 (bases 1 to 153819) AUTHORS Peery,R.M., Chumley,T.W., Kuehl,J.V., Boore,J.L. and Raubeson,L.A. TITLE Direct Submission JOURNAL Submitted (09-NOV-2007) Department of Biological Sciences, Central Washington University, 400 E University Way, Ellensburg, WA 98926-7537, USA COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review. The reference sequence was derived from EU273602. COMPLETENESS: full length. FEATURES Location/Qualifiers source 1..153819 /organism="Acorus americanus" /organelle="plastid:chloroplast" /mol_type="genomic DNA" /db_xref="taxon:263995" gene complement(join(96591..97384,69098..69211)) /gene="rps12" /locus_tag="AcamCp045" /trans_splicing /db_xref="GeneID:5777700" CDS complement(join(96591..96616,97153..97384,69098..69211)) /gene="rps12" /locus_tag="AcamCp045" /trans_splicing /note="trans-splices to 5' rps12 exon within the LSC" /codon_start=1 /transl_table=11 /product="ribosomal protein S12" /protein_id="YP_001586161.1" /db_xref="GI:161622289" /db_xref="GeneID:5777700" /translation="MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKK PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTL DAVGVKDRQQGRSKYGVKKPK" misc_feature 1..83496 /note="large single copy region (LSC)" gene complement(136..1197) /gene="psbA" /locus_tag="AcamCp001" /db_xref="GeneID:5777757" From this I would need to extract the NC number next to Locus for filename, the name next to organism, and the number in front of bp for bases. Then for every time there's a new gene, return what is next to gene=, and then the 2 numbers inside the complement parentheses for the start and end of that gene sequence. Also, is there a way to make so if I put in the NC number I want, it spits back out the rest of the info...the website this is from is Nucleotide - Acorus americanus chloroplast, complete genome, so if I put in the number I wanted next to the NC, then it would give me the information I want as specified above for that specific NC number....? I've just been having trouble getting the info extracted out of a GenBank file...any help in the coding for this would be great...thanks! |
| Bits Awarded / Charged to durden_tyler for this Post | |||
| Date | User | Comment | Amount |
| 05-25-2009 | Neo | N/A | 50,000 |
|
||||
|
Hi tyler_durden,
Well, what's going on is I have over 100 GenBank files, each is it's own NC_ number...(this NC number is just a way to catalogue them). I want to be able to give that NC_ number as a command-line possibly and have it go to the internet, replace that website I posted with whatever NC_ number was given, go into that GenBank file, and return something like: Name: ___________ FileName: ___________ (NC_######) Bases: ________ Genes: rps (####, #####) - for example, this would be for the rps gene, and then a new line, and all the rest of the genes are given....if the word complement is present before it, then it should print a negative before the numbers in parentheses: rps -(####, #####) If it's impossible to use the NC_ number as a command-line, it's okay for me to do it manually, just so I can have a program that can do the output of the name, filename, bases, and genes for me. It would save me loads and loads of time in my research, rather than going through and doing each one by hand. Thanks! -akreibich07 |
|
|||||
|
(1) Name of what ? (2) What's the value of this field for your example above ? Quote:
What should be the output for each of the following cases ? (Type in the required output for each case.) Code:
(3) gene 151670..151951 (4) gene complement(join(96591..97384,69098..69211)) (5) gene join(complement(69098..69211),139932..140725) (6) gene complement(153492..153566) Quote:
For example, when you specify the NC_ number "NC_010093" - the website is "http://www.ncbi.nlm.nih.gov/nuccore/NC_010093". This is trivial. But this would not allow you to download the text file that has the data you need and that is embedded in that website. The download URL for the text file for your NC_ number "NC_010093" is this - "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&db=nuccore&val=161622288&dopt=genbank&sendto=on&log$=seqview&extrafeat=976&m axplex=1" Does this URL ring a bell ? Do you know how to generate this URL, given the NC_ number ? If you do, then you can use LWP::Simple perl module to download the text file to your computer and then process it the way you want. HTH, tyler_durden |
| Bits Awarded / Charged to durden_tyler for this Post | |||
| Date | User | Comment | Amount |
| 05-25-2009 | Neo | N/A | 30,000 |
|
||||
|
(1) Name: _________ -- something like "Acorus Americanus"
(2) for situation (3) gene (151670..151951) (4) gene -(96591..97384, 69098..69211) (5) gene -(69098..69211), (139932..140725) (6) gene -(153492..153566) I'll check on the URL idea you presented...the most important thing to me right now is just getting a script that can be given a whole GenBank file and can give the output I specified. Even if it can only do it for the specific NC_ number I used as an example, if that one works, I can try to find ways to generalize it. Thanks again. -akreibich07 |
|
|||||
|
Quote:
Assuming that you have figured out a way to download one "sequences.gb" for a given NC_ number, (either manually or via some other tool), here's what you can do: Code:
$
$ head -5 sequences.gb
LOCUS NC_010093 153819 bp DNA circular PLN 07-MAY-2009
DEFINITION Acorus americanus chloroplast, complete genome.
ACCESSION NC_010093
VERSION NC_010093.1 GI:161622288
DBLINK Project:27981
$
$ tail -5 sequences.gb
153661 gattttggaa gaaaattcca attttcaaat acaaaaactt taaactttag gtaaaaaaag
153721 gaggaatcag ctgtgacacg ttcattaaaa aaaaatcctt ttgtggccaa tcatttattg
153781 gaaaaaattg agaagctcaa catgaaagag gagaaagaa
//
$
$ cat process_genbank.pl
#!/usr/bin/perl -w
my $element;
my $in = 0;
$gbfile = "sequences.gb";
open(GB, $gbfile) or die "Can't open $gbfile: $!";
while (<GB>) {
chomp($line = $_);
if (/^locus/i) {
@x = split /\s+/,$line;
$fname = $x[1];
$base = $x[2];
} elsif (/ organism /i) {
$line =~ s/\s*ORGANISM\s*//g;
$oname = $line;
} elsif (/ gene /) {
@x = split/\s+/,$line;
if ($x[2] =~ /^\d+..\d+$/) {
$x[2] = "($x[2])";
} elsif ($x[2] =~ /^complement\(\d+..\d+\)$/) {
$x[2] =~ s/complement/-/;
} elsif ($x[2] =~ /^complement\(join\((\d+..\d+),(\d+..\d+)\)\)$/) {
$x[2] = "-($1,$2)";
} elsif ($x[2] =~ /^join\(complement\((\d+..\d+)\),(\d+..\d+)\)$/) {
$x[2] = "-($1),($2)";
}
$element = $x[2];
$in = 1;
} elsif (/^\s+\/gene="(.*)"/ and $in == 1) {
$element = $1.":".$element;
push @genearray, $element;
$element = "";
$in = 0;
}
}
close(GB) or die "Can't close $gbfile: $!";
# Now start printing the information we gathered
print "Name: $oname\n";
print "FileName: $fname\n";
print "Bases: $base\n";
print "Genes:";
foreach $item (@genearray) {
@x = split/:/, $item;
printf("\t%-10s %s\n",$x[0],$x[1]);
}
$
$
The "head" and "tail" commands were executed so that you know which text file I am talking about. The execution of the perl program is shown below: Code:
$
$ perl process_genbank.pl
Name: Acorus americanus
FileName: NC_010093
Bases: 153819
Genes: rps12 -(96591..97384,69098..69211)
psbA -(136..1197)
trnK -(1388..3959)
matK -(1680..3215)
rps16 -(4654..5737)
trnQ -(6563..6634)
psbK (6985..7170)
psbI (7607..7717)
trnS -(7834..7921)
trnG (8847..9678)
trnR (9773..9844)
atpA -(9940..11463)
atpF -(11531..12803)
atpH -(13300..13545)
atpI -(14474..15217)
rps2 -(15451..16161)
rpoC2 -(16344..20504)
rpoC1 -(20670..23478)
rpoB -(23505..26756)
trnC (27763..27833)
petN (28715..28804)
psbM -(29784..29888)
trnD -(30412..30485)
trnY -(30823..30906)
trnE -(30971..31043)
trnT (31561..31632)
psbD (32999..34060)
psbC (34044..35429)
trnS -(35649..35741)
psbZ (36122..36310)
trnG (36587..36657)
trnfM -(36798..36871)
rps14 -(37033..37335)
psaB -(37473..39677)
psaA -(39703..41955)
ycf3 -(42574..44553)
trnS (45383..45469)
rps4 -(45769..46374)
trnT -(46725..46797)
trnL (47541..48151)
trnF (48522..48594)
ndhJ -(48980..49456)
ndhK -(49532..50251)
ndhC -(50305..50667)
trnV -(52580..53260)
trnM (53438..53509)
atpE -(53676..54080)
atpB -(54077..55573)
rbcL (56391..57833)
psaI (59093..59203)
ycf4 (59579..60133)
cemA (60675..61364)
petA (61554..62516)
psbJ -(63395..63517)
psbL -(63652..63768)
psbF -(63791..63910)
psbE -(63920..64171)
petL (65441..65536)
petG (65699..65812)
trnW -(65926..65999)
trnP -(66165..66238)
psaJ (66596..66724)
rpl33 (67088..67288)
rps18 (67428..67733)
rpl20 -(68002..68355)
rps12 -(69098..69211),(139932..140725)
clpP -(69426..71438)
psbB (71880..73406)
psbT (73582..73689)
psbN -(73749..73880)
psbH (73993..74214)
petB (74334..75765)
petD (75956..77164)
rpoA -(77332..78345)
rps11 -(78414..78830)
rpl36 -(78951..79064)
infA -(79178..79411)
rps8 -(79543..79941)
rpl14 -(80114..80482)
rpl16 -(80624..81969)
rps3 -(82130..82786)
rpl22 -(82872..83255)
rps19 -(83305..83583)
trnH (83750..83824)
rpl2 -(83856..85346)
rpl23 -(85365..85646)
trnI -(85813..85886)
ycf2 (86098..92328)
trnL-CAA -(92877..92957)
ndhB -(93521..95753)
rps7 -(96071..96538)
trnV (99215..99286)
rrn16 (99514..101004)
trnI (101318..102331)
trnA (102400..103271)
rrn23 (103424..106235)
rrn4.5 (106334..106436)
rrn5 (106654..106774)
trnR (107036..107109)
trnN -(107607..107678)
ycf1 (108281..109521)
ndhF -(109702..111921)
rpl32 (112773..112946)
trnL (113810..113889)
ccsA (113971..114936)
ndhD -(115163..116665)
psaC -(116780..117025)
ndhE -(117270..117575)
ndhG -(117668..118201)
ndhI -(118543..119082)
ndhA -(119160..121362)
ndhH -(121364..122545)
rps15 -(122651..122923)
ycf1 -(123294..129035)
trnN (129638..129709)
trnR -(130207..130280)
rrn5 -(130542..130662)
rrn4.5 -(130880..130982)
rrn23 -(131081..133892)
trnA -(134045..134916)
trnI -(134985..135998)
rrn16 -(136312..137802)
trnV -(138030..138101)
rps12 (139932..140725)
rps7 (140778..141245)
ndhB (141563..143795)
trnL (144359..144439)
ycf2 -(144988..151218)
trnI (151430..151503)
rpl23 (151670..151951)
rpl2 (151970..153460)
trnH -(153492..153566)
rps19 (153733..153819)
$
$
Hope that helps, tyler_durden |
![]() |
| Bookmarks |
| Tags |
| extraction, file, perl, script |
| Thread Tools | Search this Thread |
| Display Modes | Rate This Thread |
|
|