Extracting DNA sequences from GenBank files using Perl


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Extracting DNA sequences from GenBank files using Perl
# 1  
Old 06-23-2009
Extracting DNA sequences from GenBank files using Perl

Hi all,

Using Perl, I need to extract DNA bases from a GenBank file for a given plant species. A sample GenBank file is here...

Nucleotide

This is saved on my computer as NC_001666.gb. I also have a file that is saved on my computer as NC_001666.txt. This text file has a list of all the genes and their positions in the species corresponding to NC_001666 (corn). Here is a sample of how the text file is formatted...

rbcL (56874..58304)
atpB -(54618..56114)
atpE -(54208..54621)
trnM (54020..54092)
trnV -(53158..53834)

For example, if in my command prompt I give input of the program name, the species number that I want, and the specific gene from that species whose DNA sequence I want:

perl nucleotide_bases.pl NC_001666 trnM

The program would go into NC_001666.txt, find trnM, see that it has a range from 54020 to 54092 and is on the positive strand(no negative sign). The program then goes into NC_001666.gb, goes to the long list of DNA bases at the bottom and starts at position 54020 and returns all base letters through 54092 (inclusively). So for this specific trnM, the output would be:

gcctacttaactcagtggttagagtattgctttcatacggcgggagtcattggttcaaatccaatagtaggta

If a gene has a negative next to the position range (meaning it's on the negative strand of DNA), the output should be reversed, starting from the higher position, going to the lower. Also, when a negative is there, in that output, all A's should be switched to T's, and all G's to C's and vice versa.

Also, if a gene appears more than once in a text file, give an error message that it appears more than once, and end the program.

If I could get a Perl script to return this information for any species (NC_number) I want, and any gene from that species that I want, it would be a great help in the research I am conducting. Thank you all for your time, and any help on how to write this script would be appreciated.

-akreibich07
# 2  
Old 06-23-2009
What you want is "Beginning Perl for Bioinformatics" you can purchase on amazon.com. Also look into BioPerl
# 3  
Old 06-23-2009
Quote:
Originally Posted by KevinADC
What you want is "Beginning Perl for Bioinformatics" you can purchase on amazon.com. Also look into BioPerl
Excellent thanks!

I thought this article on the BioPERL wiki was great, How Perl Saved the Human Genome Project.
# 4  
Old 06-24-2009
Quote:
Originally Posted by Neo
Excellent thanks!

I thought this article on the BioPERL wiki was great, How Perl Saved the Human Genome Project.

Thanks, but apparently the OP did not like my suggestions, he has recently posted the same question on more perl forums without replying to this thread. Oh well, some people just want the code. Smilie
# 5  
Old 06-24-2009
I'll try to help with some code (one of those days):
Code:
($root,$gene) = (@ARGV);
open(TXT,$root.".txt") || die "Cannot open $root.txt";
open(GB,  $root.".gb") || die "Cannot open $root.gb";

$found=0;
while (<TXT>) { 
    ($start,$stop) = m/^$gene \((\d+)..(\d+)\)/io || next;
    $found = 1; last;
}
die "Did not find gene $gene in $root.txt" unless $found;

$found = 0;
while (<DB>) { 
  chop; 
  @x=split;
  if ($x[0] >= $start) { 
      # start-logic here;
  }
}
while (<DB>) { 
  chop; 
  @x=split;
  if ($x[0] < $stop) { 
      # stop-logic here
  }
}
# print logic here

# 6  
Old 06-24-2009
Thank you for the start to the code. I should be able to take it from here.

And to KevinADC...I apologize for the annoyances of asking for so much code. I am enrolled to take a more advanced Perl class that starts this next fall, but I was just attempting to jump-start my research as much as possible. I've been looking for a colleague to hire who's not busy with other work, but it's been hard (not much of this campus knows Perl). I work at UW-Madison, where Java and C++ are the languages of choice. Again, my apologies, and thank you for your patience.

-akreibich07
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Convert a DNA sequence into Amino Acid

I am trying to write a bash script that would be able to read DNA sequences (each line in the file is a sequence) from a file, where sequences are separated by an empty line. I am then to find the amino acid that these DNA sequences encode per codon (each group of three literals.) For example, if I... (3 Replies)
Discussion started by: faizlo
3 Replies

2. Shell Programming and Scripting

Extraction of sequences from files

hey!!! I have 2 files file1 is as ids.txt and is >gi|546473186|gb|AWWX01630222.1| >gi|546473233|gb|AWWX01630175.1| >gi|546473323|gb|AWWX01630097.1| >gi|546474044|gb|AWWX01629456.1| >gi|546474165|gb|AWWX01629352.1| file2 is sequences.fasta and is like >gi|546473233|gb|AWWX01630175.1|... (9 Replies)
Discussion started by: harpreetmanku04
9 Replies

3. Shell Programming and Scripting

Shell script for changing the accession number of DNA sequences in a FASTA file

Hi, I am having a file of dna sequences in fasta format which look like this: >admin_1_45 atatagcaga >admin_1_46 atatagcagaatatatat with many such thousands of sequences in a single file. I want to the replace the accession Id "admin_1_45" similarly in following sequences to... (5 Replies)
Discussion started by: margarita
5 Replies

4. Shell Programming and Scripting

Randomly selecting sequences and generating specific output files

I have two files containing hundreds of different sequences with the same Identifiers (ID-001, ID-002, etc.,), something like this: Infile1: ID-001 ATGGGAGCGGGGGCGTCTGCCTTGAGGGGAGAGAAGCTAGATACA ID-002 ATGGGAGCGGGGGCGTCTGTTTTGAGGGGAGAGAAGCTAGATACA ID-003... (18 Replies)
Discussion started by: Xterra
18 Replies

5. Shell Programming and Scripting

Tricky task with DNA sequences.

I am trying to reverse and complement my DNA sequences. The file format is FASTA, something like this: Now, to reverse the sequence, I should start reading from right to left. At the same should be complemented. Thus, "A" should be read as "T"; "C" should be read as "G"; "T" should be converted... (8 Replies)
Discussion started by: Xterra
8 Replies

6. Shell Programming and Scripting

Extracting column value from perl

Hello Kindly help me to find out the first column from first line of a flat file in perl I/P 9869912|20110830|00000000000013009|130|09|10/15/2010 12:36:22|W860944|N|00 9869912|20110830|00000000000013013|130|13|10/15/2010 12:36:22|W860944|N|00... (5 Replies)
Discussion started by: Pratik4891
5 Replies

7. UNIX for Advanced & Expert Users

Extracting files with multiple links-perl

i want to write a perl script that gets/displays all those files having multiple links (in current directory) (4 Replies)
Discussion started by: guptesanket
4 Replies

8. Shell Programming and Scripting

To retrieve gene name & function thorugh Genbank id (gi|9910297)

Hi , I have list of genbank id's and ref number in this format. gi|9910297|ref|NM_019974.1| I want to retrive the gene name and fuction for each genbank list. I have around 1300 gi numbers in my excel sheet. So anybody can help me to retrive the information from NCBI through perl script... (0 Replies)
Discussion started by: shibujohn82
0 Replies

9. Shell Programming and Scripting

GenBank Perl help...

Hey guys, I'm doing some Perl scripting for genomic data out of GenBank files...I have to extract the name of the plant, the file name, the number of bases, and all of the genes including their starting and ending positions...for example, with this GenBank file, LOCUS NC_010093 ... (7 Replies)
Discussion started by: akreibich07
7 Replies

10. Shell Programming and Scripting

Perl - extracting data from .csv files

PROJECT: Extracting data from an employee timesheet. The timesheets are done in excel (for user ease) and then converted to .csv files that look like this (see color code key below): ,,,,,,,,,,,,,,,,,,, 9/14/2003,<-- Week Ending,,,,,,,,,,,,,,,,,, Craig Brennan,,,,,,,,,,,,,,,,,,,... (3 Replies)
Discussion started by: kregh99
3 Replies
Login or Register to Ask a Question