Hey guys,
I'm writing a Perl script to pull genomic data out of GenBank files. I need to extract the name of the plant, the file name, the number of bases, and every gene along with its start and end positions. For example, take this GenBank file:
LOCUS NC_010093 153819 bp DNA circular PLN 07-MAY-2009
DEFINITION Acorus americanus chloroplast, complete genome.
ACCESSION NC_010093
VERSION NC_010093.1 GI:161622288
DBLINK Project:27981
KEYWORDS .
SOURCE chloroplast Acorus americanus
ORGANISM Acorus americanus
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; Liliopsida; Acoraceae; Acorus.
REFERENCE 1 (bases 1 to 153819)
AUTHORS Peery,R.M., Chumley,T.W., Kuehl,J.V., Boore,J.L. and Raubeson,L.A.
TITLE The complete chloroplast genome of Acorus americanus
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 153819)
CONSRTM NCBI Genome Project
TITLE Direct Submission
JOURNAL Submitted (03-DEC-2007) National Center for Biotechnology
Information, NIH, Bethesda, MD 20894, USA
REFERENCE 3 (bases 1 to 153819)
AUTHORS Peery,R.M., Chumley,T.W., Kuehl,J.V., Boore,J.L. and Raubeson,L.A.
TITLE Direct Submission
JOURNAL Submitted (09-NOV-2007) Department of Biological Sciences, Central
Washington University, 400 E University Way, Ellensburg, WA
98926-7537, USA
COMMENT PROVISIONAL
REFSEQ: This record has not yet been subject to final
NCBI review. The reference sequence was derived from
EU273602.
COMPLETENESS: full length.
FEATURES Location/Qualifiers
source 1..153819
/organism="Acorus americanus"
/organelle="plastid:chloroplast"
/mol_type="genomic DNA"
/db_xref="taxon:263995"
gene complement(join(96591..97384,69098..69211))
/gene="rps12"
/locus_tag="AcamCp045"
/trans_splicing
/db_xref="GeneID:5777700"
CDS complement(join(96591..96616,97153..97384,69098..69211))
/gene="rps12"
/locus_tag="AcamCp045"
/trans_splicing
/note="trans-splices to 5' rps12 exon within the LSC"
/codon_start=1
/transl_table=11
/product="ribosomal protein S12"
/protein_id="YP_001586161.1"
/db_xref="GI:161622289"
/db_xref="GeneID:5777700"
/translation="MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKK
PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTL
DAVGVKDRQQGRSKYGVKKPK"
misc_feature 1..83496
/note="large single copy region (LSC)"
gene complement(136..1197)
/gene="psbA"
/locus_tag="AcamCp001"
/db_xref="GeneID:5777757"
From this I need to extract the NC number next to LOCUS (for the file name), the name next to ORGANISM, and the number in front of bp (for the base count). Then, for every gene feature, I need the value next to /gene= and the two numbers inside the complement parentheses, which are the start and end of that gene's sequence.
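A minimal sketch of that extraction in Perl, not a definitive parser. The sub name parse_genbank and the hash keys are made up for illustration. It assumes each gene location fits on one line, and for a join(...) location it simply takes the smallest and largest coordinates, which is a simplification for trans-spliced genes like rps12.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from any library): pull the accession, base count,
# organism, and gene coordinates out of the text of one GenBank record.
sub parse_genbank {
    my ($text) = @_;
    my %rec = (genes => []);

    # LOCUS line: the accession comes first, then the length in front of "bp".
    if ($text =~ /^LOCUS\s+(\S+)\s+(\d+)\s+bp/m) {
        @rec{qw(accession bases)} = ($1, $2);
    }

    # ORGANISM: the name may be on the same line or on the next one,
    # because \s+ also matches a newline.
    if ($text =~ /^\s*ORGANISM\s+([^\n]+)/m) {
        ($rec{organism} = $1) =~ s/\s+$//;
    }

    my $gene;    # the gene feature currently being read, if any
    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*gene\s+(\S+)/) {
            # New gene feature: collect every number in the location string.
            # Assumption: smallest number = start, largest number = end.
            my $location = $1;
            my @pos = sort { $a <=> $b } $location =~ /(\d+)/g;
            $gene = { start => $pos[0], end => $pos[-1] };
        }
        elsif ($gene && $line =~ m{^\s*/gene="([^"]+)"}) {
            # First /gene= qualifier after a gene feature names that gene.
            # The /gene= lines under CDS features are skipped because
            # $gene has been cleared by then.
            $gene->{name} = $1;
            push @{ $rec{genes} }, $gene;
            undef $gene;
        }
    }
    return \%rec;
}

# Cut-down copy of the record above, just enough to exercise the parser.
my $sample = <<'END';
LOCUS NC_010093 153819 bp DNA circular PLN 07-MAY-2009
ORGANISM Acorus americanus
gene complement(join(96591..97384,69098..69211))
/gene="rps12"
CDS complement(join(96591..96616,97153..97384,69098..69211))
/gene="rps12"
gene complement(136..1197)
/gene="psbA"
END

my $rec = parse_genbank($sample);
print "File name: $rec->{accession}\n";
print "Organism:  $rec->{organism}\n";
print "Bases:     $rec->{bases}\n";
print "$_->{name}\t$_->{start}\t$_->{end}\n" for @{ $rec->{genes} };
```

To run it on a real file, the whole file can be read into one string and passed to parse_genbank in place of $sample.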
Also, is there a way to make it so that if I enter the NC number I want, the script returns the rest of the info? The file comes from the NCBI Nucleotide page "Acorus americanus chloroplast, complete genome", so the idea is that entering a given NC number would return the information specified above for that specific record.
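One possible way to do the lookup is NCBI's E-utilities efetch service, which can return a record as plain GenBank text. This is only a sketch under assumptions: the sub names efetch_url and fetch_genbank are made up for illustration, and HTTP::Tiny (a core module) needs IO::Socket::SSL installed to make https requests.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: build the efetch URL for one accession number.
# db=nuccore is the Nucleotide database; rettype=gb with retmode=text
# asks for the GenBank flat file as plain text.
sub efetch_url {
    my ($accession) = @_;
    return 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
         . "?db=nuccore&id=$accession&rettype=gb&retmode=text";
}

# Hypothetical helper: download the record and return its text.
sub fetch_genbank {
    my ($accession) = @_;
    my $res = HTTP::Tiny->new(timeout => 30)->get(efetch_url($accession));
    die "Fetch failed: $res->{status} $res->{reason}\n" unless $res->{success};
    return $res->{content};
}

# Only go to the network when an accession is given on the command line.
if (@ARGV) {
    print fetch_genbank($ARGV[0]);
}
```

Run with an accession as the only argument, for example NC_010093, it prints the record, and that text could then be fed to whatever does the parsing.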
I've been having trouble getting this information out of a GenBank file. Any help with the code would be great. Thanks!