I have a large text file containing DNA sequences in fasta format as follows:
In a separate (tab/space delimited) file, I have indexes as follows:
(Above, column 1 is the sequence name, column 2 the sequence start position, and column 3 the sequence end position.)
I want to extract sequences from file 1 based on the indexes in file 2. For example, 'someseq 5 10' should extract characters 5-10 from 'someseq' in file 1.
Example output is:
Any solution is greatly appreciated.
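The extraction described above could be sketched in Python roughly as follows. This is only a minimal sketch, not the solution posted in the thread: the function names `read_fasta` and `extract` and the sample data are made up for illustration, and it assumes that positions are 1-based and inclusive and that the sequence name is the first word after '>'.

```python
def read_fasta(lines):
    """Return {name: sequence}, joining wrapped sequence lines into one string."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the name is the first whitespace-delimited word after '>'.
            header = line[1:].split()
            name = header[0] if header else ""
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def extract(seqs, index_lines):
    """Yield (name, start, end, subsequence) for each 'name start end' row."""
    for row in index_lines:
        fields = row.split()  # handles both tabs and spaces
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        name, start, end = fields[0], int(fields[1]), int(fields[2])
        if name not in seqs:
            continue  # unknown sequence name
        # Assumption: positions are 1-based and inclusive.
        yield name, start, end, seqs[name][start - 1:end]


# Hypothetical sample data, not the poster's files.
fasta = [">someseq", "ACGTAC", "GTACGT"]
index = ["someseq 5 10"]

for name, start, end, sub in extract(read_fasta(fasta), index):
    print(name, start, end, sub)  # someseq 5 10 ACGTAC
```

For real files, the two lists can be replaced by open file objects, since both functions only iterate over lines.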
Thanks. I don't have 'nawk' on my Mac OS X machine, so I replaced 'nawk' with 'awk'. With your code and the data files above, I get the following output, which appears incorrect:
TTGAGATCCG
G
TTCCTGTTCA
While the rest of your problem description is to the point, a few points remain to be clarified:
Quote:
Originally Posted by Fahmida
I have a large text file containing DNA sequences in fasta format as follows:
Does the file really look like that (with the line breaks) or have you just broken the long lines for better readability? Your answer matters, because it changes what a workable solution looks like.
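To illustrate why the line breaks matter, here is a small Python sketch. The two lines are borrowed from the GL3482.1 record quoted further down this page, the range 15-22 is a made-up example, and positions are assumed to be 1-based and inclusive. A range that crosses a line break is cut short unless the wrapped lines are joined first.

```python
# Two wrapped lines of one sequence (taken from the GL3482.1 snippet on this page).
wrapped = ["GAACTTGAGATCCGGGGA", "GCAGTGGATCTCCACCAG"]

# Hypothetical range, 1-based and inclusive, that crosses the line break.
start, end = 15, 22

# Slicing only the first line silently truncates the result.
first_line_only = wrapped[0][start - 1:end]

# Joining the lines first gives the full range.
joined = "".join(wrapped)[start - 1:end]

print(first_line_only)  # GGGA
print(joined)           # GGGAGCAG
```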
Which shell are you using? What you need is, by and large, a substring function, and some shells have such a function built in while others don't. A solution in, for instance, ksh93 (which has a substring function) will therefore be a lot easier and a lot faster than, say, a solution in Bourne shell or ksh88 (both of which lack one).
Since a tool may have a special feature on one OS that it lacks on another, please also tell us which OS you are using.
In short: ask questions the smart way, please! You can read here in detail how this is done.
Hello, I am posting my query again with modified input data files.
My query is:
I have two input files, file1 and file2.
file1 is smalldata.fasta
>gi|546671471|gb|AWWX01449637.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig449636, whole genome shotgun sequence... (20 Replies)
I want to extract a specific region of interest from a big file. I have only the start position, end position, and seq ID. My query is:
My file1 is this:
>GL3482.1
GAACTTGAGATCCGGGGA
GCAGTGGATCTCCACCAG
CGGCCAGAACTGGTGCAC
CTCCAGGCCAGCCTCGTC
CTGCGTGTC
>GL3550.1... (14 Replies)
I have log files that record names, times, and countries.
Each name appears once per country, but possibly at different times.
In the end I need, for each name, which country was visited and its
USA | Tony | 12:25:22:431
Italy | Tony | 09:33:11:212 ****
Italy| John | 08:22:12:349
France | Adam | 14:22:42:981... (2 Replies)
Hi all,
I have a file like this
ID 3BP5L_HUMAN Reviewed; 393 AA.
AC Q7L8J4; Q96FI5; Q9BQH8; Q9C0E3;
DT 05-FEB-2008, integrated into UniProtKB/Swiss-Prot.
DT 05-JUL-2004, sequence version 1.
DT 05-SEP-2012, entry version 71.
FT COILED 59 140 ... (1 Reply)
Hi, on my sol9 box I create my backup using the commands below:
/usr/sbin/ufsdump 0uf /dev/rmt/0n /u1
/usr/sbin/ufsdump 0uf /dev/rmt/0n /u2
/usr/sbin/ufsdump 0uf /dev/rmt/0n /u3
/usr/sbin/ufsdump 0uf /dev/rmt/0n /u4
Now, on the new sol10 box, I use these commands to restore:
cd /u1... (3 Replies)
Hi all,
make_lofs /.cdrom/<something>/<something> 1
What does this instruction mean?
Note: the two "something" parts are obviously different.
I would like to know what that 1 means; the rest of the instruction is clear!
Thanks (6 Replies)
I am trying to reset the IP address on a Unix HP box here in my office, and I am stuck in this EM100 mode and can't issue any commands. Any help would be great. By the way, I know nothing about Unix. Thanks (0 Replies)