|
|||||||
| Forums | Search Forums | Register | Forum Rules | Man Pages | Albums | FAQ | Members | Calendar | Search | Today's Posts | Mark Forums Read |
| UNIX for Dummies Questions & Answers If you're not sure where to post a UNIX or Linux question, post it here. All UNIX and Linux newbies welcome !! |
|
|
|
Thread Tools | Search this Thread | Display Modes |
|
#1
|
||||
|
||||
|
fast sequence extraction
Hi everyone, I have a large text file containing DNA sequences in fasta format as follows: Code:
>someseq GAACTTGAGATCCGGGGAGCAGTGGATCTC CACCAGCGGCCAGAACTGGTGCACCTCCAG GCCAGCCTCGTCCTGCGTGTC >another seq GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT GACATTTTCATTACTACCATTTTGGAGTACA >seq3450 TTTTCCTGTTCACTGCTGCTTTTCTATAGACAGCA GCAGCAAGCAGTAAGAGAAAGTA etc. In a separate (tab/space delimited) file, I have indexes as follows: Code:
someseq 5 10 another seq 1 12 seq3450 3 10 etc. (above column 1 is sequence name, column 2 sequence start position and column 3 sequence end position) I want to extract sequences from file 1 based on the indexes on file 2. For example, 'someseq 5 10' will extract characters 5-10 from 'someseq' of file 1. Example output is: Code:
>someseq 5 10 TTGAGA >another seq GGCATTTTTGTG >seq3450 3 10 TTCCTGTT any solution is greatly appreciated.
Last edited by zaxxon; 07-02-2012 at 04:32 AM.. Reason: code tags |
| Sponsored Links | ||
|
|
#2
|
||||
|
||||
|
Code:
while read a b c; do nawk -v pattern="$a" -v start="$b" -v end="$c" '$0~pattern{getline;print substr($0,start,end);exit}' largefile.txt; done < index.txt |
| Sponsored Links | ||
|
|
#3
|
|||
|
|||
|
Thanks. I don't have 'nawk' in my MAC-OSX. So replacing 'nawk' with 'awk' and with your code and the data files above I get the following output, which appears incorrect:
TTGAGATCCG G TTCCTGTTCA |
|
#4
|
|||
|
|||
|
While the rest of your problem description is to the point, a few points remain to be clarified:
Quote:
Which shell are you using? What you need is, by and large, a substring-function and some shells have such a function built in, others haven't. Therefore a solution in, for instance, ksh93 (which has a substring-function) will be a lot easier and a lot faster than, say, a solution in Bourne shell or ksh88 (both of which lack such a device). As it might happen that one of the tools used has some special feature in one OS and doesn't so in the other you might as well tell us which OS you are using. In short: ask questions the smart way, please! How this is done you can read here in detail. I hope this helps. bakunin |
| The Following User Says Thank You to bakunin For This Useful Post: | ||
elixir_sinari (07-02-2012) | ||
| Sponsored Links | |
|
|
#5
|
||||
|
||||
|
Try... Code:
$ head file[12]
==> file1 <==
>someseq
GAACTTGAGATCCGGGGAGCAGTGGATCTC
CACCAGCGGCCAGAACTGGTGCACCTCCAG
GCCAGCCTCGTCCTGCGTGTC
>another seq
GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT
GACATTTTCATTACTACCATTTTGGAGTACA
>seq3450
TTTTCCTGTTCACTGCTGCTTTTCTATAGACAGCA
GCAGCAAGCAGTAAGAGAAAGTA
==> file2 <==
someseq 5 10
another seq 1 12
seq3450 3 10
$ awk 'NR==FNR{if($0~/^>/){i=substr($0,2);getline};a[i]=a[i] $0;next}{print ">" $1 ORS substr(a[$1], $2, $3-$2+1)}' file1 FS=\\t file2
>someseq
TTGAGA
>another seq
GGCATTTTTGTG
>seq3450
TTCCTGTT
$ |
| Sponsored Links | ||
|
![]() |
| Thread Tools | Search this Thread |
| Display Modes | |
More UNIX and Linux Forum Topics You Might Find Helpful
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Too Fast For Me | Linux Bot | Cartoons for Geeks | 0 | 05-16-2011 05:00 PM |
| Fast way to cut fields | madhunk | Shell Programming and Scripting | 0 | 07-13-2006 11:45 AM |
| Need help fast | zx6ninja | UNIX for Advanced & Expert Users | 0 | 06-15-2004 09:32 PM |
|
|