fast sequence extraction


 
Thread Tools Search this Thread
Top Forums UNIX for Dummies Questions & Answers fast sequence extraction
# 1  
Old 07-02-2012
fast sequence extraction

Hi everyone,

I have a large text file containing DNA sequences in fasta format as follows:
Code:
>someseq
GAACTTGAGATCCGGGGAGCAGTGGATCTC
CACCAGCGGCCAGAACTGGTGCACCTCCAG
GCCAGCCTCGTCCTGCGTGTC
>another seq 
GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT
GACATTTTCATTACTACCATTTTGGAGTACA
>seq3450
TTTTCCTGTTCACTGCTGCTTTTCTATAGACAGCA
GCAGCAAGCAGTAAGAGAAAGTA
etc.

In a separate (tab/space delimited) file, I have indexes as follows:
Code:
someseq   5   10
another seq   1   12
seq3450   3   10
etc.

(above column 1 is sequence name, column 2 sequence start position and column 3 sequence end position)
I want to extract sequences from file 1 based on the indexes on file 2. For example, 'someseq 5 10' will extract characters 5-10 from 'someseq' of file 1.
Example output is:
Code:
>someseq 5 10
TTGAGA
>another seq
GGCATTTTTGTG
>seq3450   3   10
TTCCTGTT

any solution is greatly appreciated.

Moderator's Comments:
Mod Comment Please use code tags, thanks!

Last edited by zaxxon; 07-02-2012 at 05:32 AM.. Reason: code tags
# 2  
Old 07-02-2012
Code:
 
while read a b c; do nawk -v pattern="$a" -v start="$b" -v end="$c" '$0~pattern{getline;print substr($0,start,end);exit}' largefile.txt; done < index.txt

# 3  
Old 07-02-2012
Thanks. I don't have 'nawk' in my MAC-OSX. So replacing 'nawk' with 'awk' and with your code and the data files above I get the following output, which appears incorrect:
TTGAGATCCG
G
TTCCTGTTCA
# 4  
Old 07-02-2012
While the rest of your problem description is to the point, a few points remain to be clarified:

Quote:
Originally Posted by Fahmida
I have a large text file containing DNA sequences in fasta format as follows:
Code:
>someseq
GAACTTGAGATCCGGGGAGCAGTGGATCTC
CACCAGCGGCCAGAACTGGTGCACCTCCAG
GCCAGCCTCGTCCTGCGTGTC
>another seq 
GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT
GACATTTTCATTACTACCATTTTGGAGTACA
>seq3450
TTTTCCTGTTCACTGCTGCTTTTCTATAGACAGCA
GCAGCAAGCAGTAAGAGAAAGTA
etc.

Does the file really look like that (with the line breaks) or have you just broken the long lines for better readability? Because it will change a possible solution it is important how you answer this question.

Which shell are you using? What you need is, by and large, a substring-function and some shells have such a function built in, others haven't. Therefore a solution in, for instance, ksh93 (which has a substring-function) will be a lot easier and a lot faster than, say, a solution in Bourne shell or ksh88 (both of which lack such a device).

As it might happen that one of the tools used has some special feature in one OS and doesn't so in the other you might as well tell us which OS you are using.

In short: ask questions the smart way, please! How this is done you can read here in detail.

I hope this helps.

bakunin
This User Gave Thanks to bakunin For This Post:
# 5  
Old 07-02-2012
Try...
Code:
$ head file[12]
==> file1 <==
>someseq
GAACTTGAGATCCGGGGAGCAGTGGATCTC
CACCAGCGGCCAGAACTGGTGCACCTCCAG
GCCAGCCTCGTCCTGCGTGTC
>another seq
GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT
GACATTTTCATTACTACCATTTTGGAGTACA
>seq3450
TTTTCCTGTTCACTGCTGCTTTTCTATAGACAGCA
GCAGCAAGCAGTAAGAGAAAGTA

==> file2 <==
someseq 5       10
another seq     1       12
seq3450 3       10

$ awk 'NR==FNR{if($0~/^>/){i=substr($0,2);getline};a[i]=a[i] $0;next}{print ">" $1 ORS substr(a[$1], $2, $3-$2+1)}' file1 FS=\\t file2
>someseq
TTGAGA
>another seq
GGCATTTTTGTG
>seq3450
TTCCTGTT

$

 
Login or Register to Ask a Question

Previous Thread | Next Thread

7 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Extraction of upstream and downstream regions from long sequence file

Hello, here I am posting my query again with modified data input files. see my query is : i have two input files file1 and file2. file1 is smalldata.fasta >gi|546671471|gb|AWWX01449637.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig449636, whole genome shotgun sequence... (20 Replies)
Discussion started by: harpreetmanku04
20 Replies

2. Shell Programming and Scripting

Sequence extraction

i want to extract specific region of interest from big file. i have only start position, end position and seq id, see my query is: I have file1 is this >GL3482.1 GAACTTGAGATCCGGGGA GCAGTGGATCTCCACCAG CGGCCAGAACTGGTGCAC CTCCAGGCCAGCCTCGTC CTGCGTGTC >GL3550.1... (14 Replies)
Discussion started by: harpreetmanku04
14 Replies

3. Shell Programming and Scripting

Help me in this script fast

i have log files that represent names, times and countries, each name come once in country but may in diff times i need at end each name visited which country and its USA | Tony | 12:25:22:431 Italy | Tony | 09:33:11:212 **** Italy| John | 08:22:12:349 France | Adam | 14:22:42:981... (2 Replies)
Discussion started by: teefa
2 Replies

4. Shell Programming and Scripting

find common entries and match the number with long sequence and cut that sequence in output

Hi all, I have a file like this ID 3BP5L_HUMAN Reviewed; 393 AA. AC Q7L8J4; Q96FI5; Q9BQH8; Q9C0E3; DT 05-FEB-2008, integrated into UniProtKB/Swiss-Prot. DT 05-JUL-2004, sequence version 1. DT 05-SEP-2012, entry version 71. FT COILED 59 140 ... (1 Reply)
Discussion started by: manigrover
1 Replies

5. Solaris

How do you ufsrestore the fast way?

hi, on my sol9 box i create my backup using the below command: /usr/sbin/ufsdump 0uf /dev/rmt/0n /u1 /usr/sbin/ufsdump 0uf /dev/rmt/0n /u2 /usr/sbin/ufsdump 0uf /dev/rmt/0n /u3 /usr/sbin/ufsdump 0uf /dev/rmt/0n /u4 now on the new sol10 box, to restore i use this commands: cd /u1... (3 Replies)
Discussion started by: pinoy43v3r
3 Replies

6. Solaris

what is that 1 in the instruction!~ (please help fast)

Hi all, make_lofs /.cdrom/<something>/<something> 1 what does this instruction mean? Note:both the "something" are obviously different . I would like to know what that 1 means, the rest of the instruction is clear!! Thanks (6 Replies)
Discussion started by: wrapster
6 Replies

7. UNIX for Advanced & Expert Users

Need help fast

I am trying to reset the IP address on a Unix HP box here in my office and I am stuck in this EM100 mode and cant issue any commands. Any help would be great. By the way I no zero about unix. Thanks (0 Replies)
Discussion started by: zx6ninja
0 Replies
Login or Register to Ask a Question