Go Back   The UNIX and Linux Forums > Top Forums > UNIX for Dummies Questions & Answers


UNIX for Dummies Questions & Answers If you're not sure where to post a UNIX or Linux question, post it here. All UNIX and Linux newbies welcome !!

Closed Thread    
 
Thread Tools Search this Thread Display Modes
    #1  
Old 07-02-2012
Registered User
 
Join Date: Jul 2010
Posts: 30
Thanks: 4
Thanked 0 Times in 0 Posts
fast sequence extraction

Hi everyone,

I have a large text file containing DNA sequences in fasta format as follows:

Code:
>someseq
GAACTTGAGATCCGGGGAGCAGTGGATCTC
CACCAGCGGCCAGAACTGGTGCACCTCCAG
GCCAGCCTCGTCCTGCGTGTC
>another seq 
GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT
GACATTTTCATTACTACCATTTTGGAGTACA
>seq3450
TTTTCCTGTTCACTGCTGCTTTTCTATAGACAGCA
GCAGCAAGCAGTAAGAGAAAGTA
etc.

In a separate (tab/space delimited) file, I have indexes as follows:

Code:
someseq   5   10
another seq   1   12
seq3450   3   10
etc.

(above column 1 is sequence name, column 2 sequence start position and column 3 sequence end position)
I want to extract sequences from file 1 based on the indexes on file 2. For example, 'someseq 5 10' will extract characters 5-10 from 'someseq' of file 1.
Example output is:

Code:
>someseq 5 10
TTGAGA
>another seq
GGCATTTTTGTG
>seq3450   3   10
TTCCTGTT

any solution is greatly appreciated.

Moderator's Comments:
Please use code tags, thanks!

Last edited by zaxxon; 07-02-2012 at 04:32 AM.. Reason: code tags
Sponsored Links
    #2  
Old 07-02-2012
itkamaraj's Avatar
^Kamaraj^
 
Join Date: Apr 2010
Posts: 3,025
Thanks: 33
Thanked 647 Times in 625 Posts

Code:
 
while read a b c; do nawk -v pattern="$a" -v start="$b" -v end="$c" '$0~pattern{getline;print substr($0,start,end);exit}' largefile.txt; done < index.txt

Sponsored Links
    #3  
Old 07-02-2012
Registered User
 
Join Date: Jul 2010
Posts: 30
Thanks: 4
Thanked 0 Times in 0 Posts
Thanks. I don't have 'nawk' in my MAC-OSX. So replacing 'nawk' with 'awk' and with your code and the data files above I get the following output, which appears incorrect:
TTGAGATCCG
G
TTCCTGTTCA
    #4  
Old 07-02-2012
bakunin bakunin is offline Forum Staff  
Bughunter Extraordinaire
 
Join Date: May 2005
Location: In the leftmost byte of /dev/kmem
Posts: 3,297
Thanks: 27
Thanked 453 Times in 353 Posts
While the rest of your problem description is to the point, a few points remain to be clarified:

Quote:
Originally Posted by Fahmida View Post
I have a large text file containing DNA sequences in fasta format as follows:

Code:
>someseq
GAACTTGAGATCCGGGGAGCAGTGGATCTC
CACCAGCGGCCAGAACTGGTGCACCTCCAG
GCCAGCCTCGTCCTGCGTGTC
>another seq 
GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT
GACATTTTCATTACTACCATTTTGGAGTACA
>seq3450
TTTTCCTGTTCACTGCTGCTTTTCTATAGACAGCA
GCAGCAAGCAGTAAGAGAAAGTA
etc.

Does the file really look like that (with the line breaks) or have you just broken the long lines for better readability? Because it will change a possible solution it is important how you answer this question.

Which shell are you using? What you need is, by and large, a substring-function and some shells have such a function built in, others haven't. Therefore a solution in, for instance, ksh93 (which has a substring-function) will be a lot easier and a lot faster than, say, a solution in Bourne shell or ksh88 (both of which lack such a device).

As it might happen that one of the tools used has some special feature in one OS and doesn't so in the other you might as well tell us which OS you are using.

In short: ask questions the smart way, please! How this is done you can read here in detail.

I hope this helps.

bakunin
The Following User Says Thank You to bakunin For This Useful Post:
elixir_sinari (07-02-2012)
Sponsored Links
    #5  
Old 07-02-2012
Ygor's Avatar
Ygor Ygor is offline Forum Staff  
Moderator
 
Join Date: Oct 2003
Location: 54.23, -4.53
Posts: 1,792
Thanks: 1
Thanked 101 Times in 91 Posts
Try...
Code:
$ head file[12]
==> file1 <==
>someseq
GAACTTGAGATCCGGGGAGCAGTGGATCTC
CACCAGCGGCCAGAACTGGTGCACCTCCAG
GCCAGCCTCGTCCTGCGTGTC
>another seq
GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT
GACATTTTCATTACTACCATTTTGGAGTACA
>seq3450
TTTTCCTGTTCACTGCTGCTTTTCTATAGACAGCA
GCAGCAAGCAGTAAGAGAAAGTA

==> file2 <==
someseq 5       10
another seq     1       12
seq3450 3       10

$ awk 'NR==FNR{if($0~/^>/){i=substr($0,2);getline};a[i]=a[i] $0;next}{print ">" $1 ORS substr(a[$1], $2, $3-$2+1)}' file1 FS=\\t file2
>someseq
TTGAGA
>another seq
GGCATTTTTGTG
>seq3450
TTCCTGTT

$

Sponsored Links
Closed Thread

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

More UNIX and Linux Forum Topics You Might Find Helpful
Thread Thread Starter Forum Replies Last Post
Too Fast For Me Linux Bot Cartoons for Geeks 0 05-16-2011 05:00 PM
Fast way to cut fields madhunk Shell Programming and Scripting 0 07-13-2006 11:45 AM
Need help fast zx6ninja UNIX for Advanced & Expert Users 0 06-15-2004 09:32 PM



All times are GMT -4. The time now is 09:43 PM.