fast sequence extraction

07-02-2012

Hi everyone,

I have a large text file containing DNA sequences in FASTA format as follows:

Code:

>someseq
GAACTTGAGATCCGGGGAGCAGTGGATCTC
CACCAGCGGCCAGAACTGGTGCACCTCCAG
GCCAGCCTCGTCCTGCGTGTC
>another seq 
GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT
GACATTTTCATTACTACCATTTTGGAGTACA
>seq3450
TTTTCCTGTTCACTGCTGCTTTTCTATAGACAGCA
GCAGCAAGCAGTAAGAGAAAGTA
etc.

In a separate (tab/space delimited) file, I have indexes as follows:

Code:

someseq   5   10
another seq   1   12
seq3450   3   10
etc.

(Column 1 is the sequence name, column 2 the start position, and column 3 the end position.)
I want to extract sequences from file 1 based on the indexes in file 2. For example, 'someseq 5 10' should extract characters 5-10 of 'someseq' from file 1.
Example output:

Code:

>someseq 5 10
TTGAGA
>another seq 1 12
GGCATTTTTGTG
>seq3450 3 10
TTCCTGTT

Any solution is greatly appreciated.

Moderator's Comments:

Please use code tags, thanks!

Last edited by zaxxon; 07-02-2012 at 05:32 AM. Reason: code tags

Fahmida

07-02-2012

Code:

 
while read a b c; do nawk -v pattern="$a" -v start="$b" -v end="$c" '$0~pattern{getline;print substr($0,start,end);exit}' largefile.txt; done < index.txt

itkamaraj

View Public Profile for itkamaraj

Find all posts by itkamaraj

07-02-2012

Thanks. I don't have 'nawk' on my Mac OS X system, so I replaced 'nawk' with 'awk'. With your code and the data files above I get the following output, which appears incorrect:
TTGAGATCCG
G
TTCCTGTTCA

Fahmida
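
A note on why the output above looks wrong: awk's substr(s, m, n) takes a length as its third argument, not an end position, so substr($0, 5, 10) returns ten characters starting at position 5. In addition, the name 'another seq' contains a space, so 'read a b c' puts 'seq' into b and the rest of the line into c, and getline reads only the first line after the header, so a range running past that line would be cut short. The following is only a minimal sketch of the length arithmetic, using the first line of 'someseq'; the variable names are chosen for illustration.

```shell
line=GAACTTGAGATCCGGGGAGCAGTGGATCTC
start=5
end=10

# Third argument used as a length: 10 characters starting at position 5.
wrong=$(awk -v s="$line" -v a="$start" -v b="$end" 'BEGIN { print substr(s, a, b) }')

# End position converted to a length: end - start + 1.
right=$(awk -v s="$line" -v a="$start" -v b="$end" 'BEGIN { print substr(s, a, b - a + 1) }')

echo "$wrong"   # TTGAGATCCG
echo "$right"   # TTGAGA
```

The first result matches the first line of the reported output; the second is the six characters the question asks for.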

07-02-2012

While the rest of your problem description is to the point, a few points remain to be clarified:

Quote:

Originally Posted by Fahmida

I have a large text file containing DNA sequences in fasta format as follows:

Code:

>someseq
GAACTTGAGATCCGGGGAGCAGTGGATCTC
CACCAGCGGCCAGAACTGGTGCACCTCCAG
GCCAGCCTCGTCCTGCGTGTC
>another seq 
GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT
GACATTTTCATTACTACCATTTTGGAGTACA
>seq3450
TTTTCCTGTTCACTGCTGCTTTTCTATAGACAGCA
GCAGCAAGCAGTAAGAGAAAGTA
etc.

Does the file really look like that, with the line breaks, or did you just break the long lines for readability? Your answer matters because it changes which solutions are possible.
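
To illustrate why the line breaks matter: if a sequence is wrapped over several lines, positions have to be counted on the joined sequence, so the lines must be concatenated before any substring is taken. A minimal awk sketch that prints each record as a header followed by its sequence on one line might look like this (linearize_fasta is a made-up helper name, not a standard tool, and the sketch assumes headers start with '>'):

```shell
# Hypothetical helper: linearize_fasta FILE
# Prints each FASTA header, then its sequence joined onto a single line.
linearize_fasta() {
    awk '
        /^>/ {                      # a header line starts a new record
            if (NR > 1) print ""    # end the previous sequence line
            print
            next
        }
        { printf "%s", $0 }         # append sequence text without a newline
        END { if (NR > 0) print "" }
    ' "$1"
}
```

If the file already has one sequence per line, this step changes nothing and can be skipped.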

Which shell are you using? What you need is, by and large, a substring function. Some shells have one built in and others do not, so a solution in, for instance, ksh93 (which has a substring function) will be a lot easier and a lot faster than a solution in Bourne shell or ksh88 (both of which lack one).
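
As an illustration of such a built-in substring function: ksh93 supports the expansion ${var:offset:length}, and bash and zsh provide it as well. The offset is zero-based, so a 1-based start position has to be shifted by one. A minimal sketch, assuming one of those shells and a sequence already held in a variable (the variable names are only examples):

```shell
seq=GAACTTGAGATCCGGGGAGCAGTGGATCTC
start=5
end=10

# Offsets are zero-based, so position 5 is offset 4; the length is end - start + 1.
sub=${seq:start-1:end-start+1}
echo "$sub"   # TTGAGA
```

A plain Bourne shell does not understand this expansion, which is why the choice of shell matters here.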

Since a tool may have a special feature on one OS that it lacks on another, please also tell us which OS you are using.

In short: ask questions the smart way, please! You can read here in detail how this is done.

I hope this helps.

bakunin

07-02-2012

Try...

Code:

$ head file[12]
==> file1 <==
>someseq
GAACTTGAGATCCGGGGAGCAGTGGATCTC
CACCAGCGGCCAGAACTGGTGCACCTCCAG
GCCAGCCTCGTCCTGCGTGTC
>another seq
GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT
GACATTTTCATTACTACCATTTTGGAGTACA
>seq3450
TTTTCCTGTTCACTGCTGCTTTTCTATAGACAGCA
GCAGCAAGCAGTAAGAGAAAGTA

==> file2 <==
someseq 5       10
another seq     1       12
seq3450 3       10

$ awk 'NR==FNR{if($0~/^>/){i=substr($0,2);getline};a[i]=a[i] $0;next}{print ">" $1 ORS substr(a[$1], $2, $3-$2+1)}' file1 FS=\\t file2
>someseq
TTGAGA
>another seq
GGCATTTTTGTG
>seq3450
TTCCTGTT

$

Ygor
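
For readers who want to adapt the one-liner above, here is the same idea written out as a commented sketch. It is not Ygor's code. It makes some extra assumptions: the index file may be space- or tab-delimited, so the last two fields are taken as start and end and everything before them as the name; trailing whitespace on FASTA headers is trimmed; and the start and end positions are echoed in the output header, as in the first line of the example output in the question. The function name extract_ranges is made up for illustration.

```shell
# Hypothetical helper: extract_ranges FASTA_FILE INDEX_FILE
extract_ranges() {
    awk '
        # First file (FASTA): collect each sequence into seq[name],
        # joining wrapped lines.
        NR == FNR {
            if (/^>/) {
                name = substr($0, 2)
                sub(/[ \t]+$/, "", name)     # drop trailing whitespace
            } else if (name != "") {
                seq[name] = seq[name] $0
            }
            next
        }
        # Second file (index): the last two fields are start and end,
        # the rest of the line is the sequence name (it may contain spaces).
        NF >= 3 {
            start = $(NF - 1)
            end = $NF
            name = $0
            sub(/[ \t]+[0-9]+[ \t]+[0-9]+[ \t]*$/, "", name)
            sub(/^[ \t]+/, "", name)
            print ">" name " " start " " end
            # substr takes a length, so convert the end position.
            print substr(seq[name], start, end - start + 1)
        }
    ' "$1" "$2"
}
```

Called as 'extract_ranges file1 file2', it reads the whole FASTA file into memory once and then answers every index line from the array, which avoids rescanning the large file for each range. For very large inputs the memory use could become a concern.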
