Sequence extraction


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Sequence extraction
# 1  
Old 08-05-2015
Sequence extraction

i want to extract specific region of interest from big file. i have only start position, end position and seq id, see my query is:

I have file1 is this

Code:
>GL3482.1
GAACTTGAGATCCGGGGA
GCAGTGGATCTCCACCAG
CGGCCAGAACTGGTGCAC
CTCCAGGCCAGCCTCGTC
CTGCGTGTC
>GL3550.1
GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT
GACATTTTCATTACTACCATTTTGGAGTACA
>GL3472.1
TTTTCCTGTTCACTGCTGCTTTTCTATAGACAGCA
GCAGCAAGCAGTAAGAGAAAGTA


file2 is this :

Code:
seq id     start    end
GL3482.1   323100   323743
GL3550.1   41911    40888
GL3472.1   274408   272617

and i want result like this:

Code:
>GL3482.1 
TTGAGA
>GL3550.1 
GGCATTTTTGTG
>GL3472.1
TTCCTGTT

when i run this command :

Code:
$ awk 'NR==FNR{if($0~/^>/){i=substr($0,2);getline};a[i]=a[i] $0;next}{print ">" $1 ORS substr(a[$1], $2, $3-$2+1)}' file1 FS=\\t file2

i got no output or output file. why so?
Moderator's Comments:
Mod Comment Please use CODE tags as required by the forum rules you agreed to when you joined this forum.


---------- Post updated at 01:28 AM ---------- Previous update was at 01:21 AM ----------

i am using Ubuntu OS

Last edited by Scrutinizer; 08-05-2015 at 04:07 AM.. Reason: Add CODE tags, remove extraneous closing COLOR tag. Removed spurious space and a newline from first sample file
# 2  
Old 08-05-2015
Maybe because FS is set to '\t' and there appear to be no TABs in your second sample file, so the fields will not match. And you are setting the index to the entire record, save the first character. which is probably not what you want (you probably meant to use $1 here, but that would not be a sure way to do it either, because in the FASTA format the identifier is allowed to contain spaces). And the FASTA file is word wrapped, so you need to take out the newlines and not use getline to get only the second line ....

The best way to do that is is to use ">" as a record separator and use "\n" as the field separator. By setting OFS as the empty string, and assigning a value to one of the fields, all newlines will be replaced by empty strings, so this will effectively remove the word wrap. And the sequence will become one continuous string, which will make it suitable for substring selection.

Using your file order, we would get something like this.
Code:
awk 'NR==FNR{i=$1; $1=x; A[i]=$0; next} $1 in A{print ">" $1 ORS substr(A[$1], $2, $3-$2+1)}' RS=\> FS='\n' OFS= file1 FS=" " RS="\n" file2


If we read the files the other way around, then it becomes more memory efficient:
Code:
awk 'NR==FNR{S[$1]=$2; E[$1]=$3; next} $1 in S{i=$1; $1=x; print RS i FS substr($0,S[i],E[i]-S[i]+1)}' file2 RS=\> FS='\n' OFS= file1


Last edited by Scrutinizer; 08-05-2015 at 04:52 AM..
# 3  
Old 08-05-2015
With your two sample input files (with the combined lengths of the lines in each group that do not start with a > being less than 100 characters), I don't see how you would expect any output when the substring you are trying to extract from those strings starts more than 40,000 characters into that string, and in two of the three cases has an ending position in the string that comes before the starting position (thereby requesting a substring that has negative length).

In addition to those problems, as Scrutinizer said, your script specifies that the input field separator for file2 is a tab character, but there are no tab characters in the data you showed us. Therefore, you are requesting a substring of 1 character starting at position 0 (when arrays of characters in awk start at position 1).

Note also that although you might be able to create an array element in awk or gawk on Ubuntu that is more than 323,000 characters long; on most UNIX systems and BSD-based systems, awk won't let you read a line, write a single output string, or create a variable whose value is much more that LINE_MAX bytes long (on most systems LINE_MAX is 2,048).
This User Gave Thanks to Don Cragun For This Post:
# 4  
Old 08-05-2015
let me clear my query again :

i have 2 files, file1 is 1.fasta sequences file which are as follow:

Code:
>gi|547177824|gb|AWWX01000001.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig0, whole genome shotgun sequence
ACATAATCCCCGAGGCAAAATAAGTCTCTAATGAACTTGACCCTATGAGTGTCAAGGTGAGGGAGTCCTA
AGAGACTGACAAGGGCTGTGAAGGCTACAAGGGAAGAAGACAACAGGATCAGCTGGAAAAGGCTTAAAAT
TGCAGCCCTTGATTCTTCCACTGTGCCCTGGGGCCATGAATCGCTACAGCCTCACTGAAGGAATCTGAGG
TAGCATCTCAGAGCTCCCATGCCAAGACCATGGGGAACAAATTTGAGTTGGATGTGGCAGCATGGCCCAG
AGGTCATGAAATAACCAGCACAAACTTTGTTAGTGGATGACAACTGGCCCTTTTAAGTGATCACCTTAAT
AGACTATCATTCAGGGACTAAAAGAGAAAACTTAGTACCCCAAAAATTGTACCAAAGGACAGAAATTCAC
CTCAGAACTTTCCAGCAAAG
>gi|547177823|gb|AWWX01000002.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig1, whole genome shotgun sequence
CAATTTGATTCCAGCTTGTGTTCATCCATCTGGGAATTTTCATGATGTACTGTGCATATAACTTAAATAA
TCAGGGTGACAATATGCAGCCTTAATGTACTTGTGTCCCAATTTTGAACCTGTCTGTTGTTCCGTGTCCA
CTTCTAACTGTTGCTTCTTGACCTGTATACCAGTTTTTCAGAAGGCAGGTAAGGTGCTCTGGTATTCCCA
TCTCTTTAATAATTTTGCAGTTTCCTGTGATACACACAGACAAAGGCTTTTACATAGTCAAACAAGTAGA
AGTTTTTCTGGAGTTCTCTTGCTTTGCCTA
>gi|547177822|gb|AWWX01000003.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig2, whole genome shotgun sequence
GGTCCAAGAGGCTCCTTCTTTTGTCCAGAAAGAGTTCCTGTGAACAGGGCTCAGTGGCTGACTGATGGTT
AGTATCATCTGGAGAAGGGAAAGGGCTTTTGTAGGCAAATTTAGCATTGCCAAGCCAGCAGAGTCTACCT
GGCAAGTCTTTCAAAGTTCAGAATTTTTCCTAAAGGCAAGGAAAACATGGACATAAGGGACCTCGCTGGA
ATCCTAAGCGCAAATCTTCAGCAATGGACAGTCCCATCTGAAAAGGAAATGATGAATGCTCATAAGGGGC
TTCGTGTGTCACTGAATTCCCTTATGGACTTTTCCTGGCTGTGGCATCCCTCATCTGTGATCTTGCTGGG
CATAATGTGGTTCAGTTCTCGCCCCAGGGGCCGTTGTGGCTTCCCTGAGTATTTTTTTAAGAGATTATTT
ATGTGTTTAATTTTGGTTGCACTGGCTCTTCATTGCTCCATGTGTGTTCTCTAGTTTCGAGGAGCAGGGG
CTA
>gi|547177821|gb|AWWX01000004.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig3, whole genome shotgun sequence
CTATATAAATAAATCTATTTCTTGCCTAACACTTTGCCTGTTGCTGAATTCCTTCTGTGCCGAGACACAA
ACAATGTGAGCCACAGTACATCCAGGTACCAGATAAGTGATTCTAAATAAAAGACCATGGTTCAAGACTC
AAACTGGATTTTGGGGAGGGGGGGAGGTTGGAGTCCTGGCCATGTGGATTTTAGTCCAAACCTGATGTGA
TAGGTTTCAGGTAGAGATCATCTCCATGCCACCACATTGGAAATTTTATACTATGGTCTGTTATACTGGT
TCA
>gi|547177820|gb|AWWX01000005.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig4, whole genome shotgun sequence
TCTCAAGCAGACTTTACCTATAAAAAGGGCAATGTGTCCTGGAACTGCAAGCTGAGCTTAGGAACAGAGA
ACTGCAAAAACTATTGGCATGAACAACTCTGTCCAAACATCTACATTAGGAACCTTGACTGACCTCTAGC
ATGGTTTCTAGCAGCAACCTAAGGCCACGTTCTAGGACAACTCAGCTACCCCTGAGTTCCTGTCTAGAAA
ATTTCAAGGCTACCAAAGGAATCTGCTCCAGCCAACATCTGACATAAGCCCCTCATCTTCCTTTACTTAG
AGTGTCTATTTAAAACAAAGACCAAAAAAAAAAAAAACAAAAACAAAAAACCCTCACGATTACAAGAAAA
GTGTGACGGAAATAAACTAGAAATTGATAACAAAAAGACAGCAGGAAAATTAAAGATTATTTGCAGACTA
AGCAGAAAACATCTAAA
>gi|547177819|gb|AWWX01000006.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig5, whole genome shotgun sequence
ATCAGCAAGCTCAACCTCGGGCAAACGATTGTCCGAGTGATGATGGGGGCTGCCCAAGTCCGGTCTCCTC
ATGGTGCTCCTGGGGGTCATCTTCATGAATGGTAACCACGCCACCGAGGAGGAGGTCTGGGAATTCCTGA
GTATGTTGGGGATCTATGCTGGGAGGAGGCACTGGATCTTTGGGGAGCCCAGAAGGCTCATCACCAAAGA
TCTGGTGCAGAAGGAGTACCTGAACTACCGCCGGGTGCCCAATAGTGATCCTCCGCGCTACCAGTTGCTG
TGGGGCCCGAGAGCTTGTGCTGAGACCAGTAAGATGAAGGTACTGGAGGTTCTAGCCAAGTTCCACGGTA
GGGTCCCTAGTTCCTTCCCAGACCTATATGACGAGGCTCTGAGAGATCAGGCGGAGACAGCAGGGCGGAG
AGGTGTGGCCAGGGCTCCCGACCATGGCTGAGGCCAGTGCCC
>gi|547177818|gb|AWWX01000007.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig6, whole genome shotgun sequence
ATGAAGAAATGAAGAGAAAAAAAATGAGCACAAATACCTCTCATGAACATAGAAGCAAAATCCCCAACCA
AATATAAGAAAAGAAACCTAAAAATCCATAGAAGGAATTACATACTATATCTAAGTAGGATGCATTTCAG
ATGTGCAACATTTGAAAATCAGCAATCATAAATCATAATAACAATAGTCTAAGGAAGATAATCACAAGAT
GAGATCAACAGATGAAGGGAAAAATCATCTGATAAATCCAACACCCATTCATCATAAAAATTCTACAGCA
AACTAGATACAGACAGGCATTTCCTCAGCTTCATATAAAACATACTAAAAAGAAATACAGGTAAAACATA
CTTCATAGT
>gi|547177817|gb|AWWX01000008.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig7, whole genome shotgun sequence
AACTTTGGTGCCCCGTGGTCCTGGCATGGGGCCTCGGAGATGCTGCCTCAGATTCCTTCAGTGAAGCAGT
GGAGATTCATGATCCCAGGGCACAGTGCAAGAATCCAGGGCTCAAGTTTAAAGCATTTTTGCAGCTCATC
CTGTTGTCTGCTTCCCTTGCAGGCTGCACATCTATGCCAGTCTCTTAAGTACTCCCCCACCTAGGTACAC
ATAGGGTCAAGTTCAGTGGAGACTTACTGTCCATCAGGGATTATTTAATTTTATGTTTTCACTTTTGTTA
ATAGTTTTCTTCATTCCTTCAATAAGATTTCCGTGGTTACTTCTGGGTACAAAAA
>gi|547177816|gb|AWWX01000009.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig8, whole genome shotgun sequence
CCTTCTTTTTCCTGAGACTCTCAGGAAATTCTCCTTGCCTCCATGAAGACTGTATACATTTTTCTAAAGT
TTCTTTGTAGGTAAATTTTCTCTATGGTTGCTTTTGTTTGTGGTATATTTTCTGTTACTTGGTGTAAGTA
ACTGTTGATTCTTCCAGACCAACTGGATGATTTTGCTGCCATGATCTTTTTATTATTCCAGATTTTTGAA
GTTTCTTACATGTTTGAATATTTCCTGTCCTGTTTTCTAAGATTTAATTCGAGAATCAAATTGTCATTGT
GATCTTTTGCTTTTCATTACTTCTGACTTTTATTCATTTGTCATTGTTGTATATAACTCTAGTGGCTATA
TGTACTT

file2 is result.ods file which is as :

Code:
gi|546687122|gb|AWWX01446731.1|	       13172	13194
gi|546693672|gb|AWWX01441057.1|	       6859	        6837
gi|546698969|gb|AWWX01436431.1|	       18753	18775
gi|546703077|gb|AWWX01432778.1|   	4132	        4154
gi|546670495|gb|AWWX01450063.1|	        4111	        4133
gi|546689695|gb|AWWX01444610.1|	       14602	14580
gi|546691352|gb|AWWX01443112.1|	       10073	10051
gi|546880329|gb|AWWX01275531.1|	       1158	        1136
gi|546670216|gb|AWWX01450333.1|	       10633	10655
gi|546678257|gb|AWWX01448205.1|	        1112	        1134
gi|546693672|gb|AWWX01441057.1|	        6832	        6854
gi|546704475|gb|AWWX01431394.1|	        18135	18113
gi|546699347|gb|AWWX01436084.1|	        26840	26862
gi|546702960|gb|AWWX01432895.1|	        13515	13493
gi|546833971|gb|AWWX01313367.1|	        615	        593
gi|546860287|gb|AWWX01291803.1|	         2188	2210
gi|546689115|gb|AWWX01445179.1|	        10761	10739
gi|546701370|gb|AWWX01434384.1|	        2616	        2638
gi|546694075|gb|AWWX01440674.1|	        9568	        9546
gi|546701082|gb|AWWX01434635.1|	        8423	        8445
gi|547071923|gb|AWWX01098172.1|	        135	        157
gi|546705086|gb|AWWX01430793.1|	        3181	        3203
gi|546704429|gb|AWWX01431440.1|	       1352	        1330
gi|546709146|gb|AWWX01426952.1|	       52962	52984

and we want a awk script which creates an output file which contain sequences extracted from these coordinates along with id with > symbol .
for example:

Code:
>gi|546709146|gb|AWWX01426952.1|
acctgctgcatgcgtgcgtggcgtgcaaaatgcagtcaaggcaggtcagtccatgcatgacgtgcaatgcattgcatggcgtgcaaaatgcaggcgtggcgtgcaaaatgcagtcaaaaaattgccgtgcaatgggcc

output file should be like this.

I hope now it is clear to you.

Last edited by Don Cragun; 08-05-2015 at 06:57 AM.. Reason: Add CODE and ICODE tags.
# 5  
Old 08-05-2015
The comments by Don Cragun still apply: most of the sequences are shorter than the coordinates that you have in your file2, and there are coordinates that result in negative string lengths.
On top, none of the samples in file2 match any of those in file1.

How about posting some meaningful samples?
# 6  
Old 08-05-2015
Please explain your example; it is not at all clear to me.

You say you want the output:
Code:
>gi|546709146|gb|AWWX01426952.1|
acctgctgcatgcgtgcgtggcgtgcaaaatgcagtcaaggcaggtcagtccatgcatgacgtgcaatgcattgcatggcgtgcaaaatgcaggcgtggcgtgcaaaatgcagtcaaaaaattgccgtgcaatgggcc

which seems to have been selected by the line:
Code:
gi|546709146|gb|AWWX01426952.1|	       52962	52984

from the file result.ods which I thought meant that you wanted characters 52962 through 52984 from the data following a line starting with:
Code:
>gi|546709146|gb|AWWX01426952.1|

in the file 1.fasta. But, that string does not appear in 1.fasta and, even if it did, the data from characters 52962 through 52984 would be 23 characters long; not the 138 characters you have said you want to have output.

What am I misunderstanding???
# 7  
Old 08-05-2015
may be that entries would be at the bottom of the file actually sir, i just copied and posted very few entries from file1 and file2 may be matched would be skipped. i cant paste my whole data file here i have 4565005 sequence entries in my data.

and about negative value, please provide me the script for like 300 250 = 50
no problem if there is 250-300 =-50 i will correct it and will then apply script.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Extraction of upstream and downstream regions from long sequence file

Hello, here I am posting my query again with modified data input files. see my query is : i have two input files file1 and file2. file1 is smalldata.fasta >gi|546671471|gb|AWWX01449637.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig449636, whole genome shotgun sequence... (20 Replies)
Discussion started by: harpreetmanku04
20 Replies

2. Shell Programming and Scripting

String Extraction

I am trying to extract a time from the below string in perl but not able to get the time properly I just want to extract the time from the above line I am using the below syntax x=~ /(.*) (\d+)\:(\d+)\:(\d+),(.*)\.com/ $time = $2 . ':' . $3 . ':' . $4; print $time Can... (1 Reply)
Discussion started by: karan8810
1 Replies

3. Shell Programming and Scripting

find common entries and match the number with long sequence and cut that sequence in output

Hi all, I have a file like this ID 3BP5L_HUMAN Reviewed; 393 AA. AC Q7L8J4; Q96FI5; Q9BQH8; Q9C0E3; DT 05-FEB-2008, integrated into UniProtKB/Swiss-Prot. DT 05-JUL-2004, sequence version 1. DT 05-SEP-2012, entry version 71. FT COILED 59 140 ... (1 Reply)
Discussion started by: manigrover
1 Replies

4. UNIX for Dummies Questions & Answers

fast sequence extraction

Hi everyone, I have a large text file containing DNA sequences in fasta format as follows: >someseq GAACTTGAGATCCGGGGAGCAGTGGATCTC CACCAGCGGCCAGAACTGGTGCACCTCCAG GCCAGCCTCGTCCTGCGTGTC >another seq GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT GACATTTTCATTACTACCATTTTGGAGTACA >seq3450... (4 Replies)
Discussion started by: Fahmida
4 Replies

5. Shell Programming and Scripting

extraction

I have following input @xxxxxx@ I want to extract what's between @....@ that is : xxxx using SED command (6 Replies)
Discussion started by: xerox
6 Replies

6. Programming

extraction from a path

Hi, Can you help me on this two problems? how can i get : from input: /ect/exp/hom/bin ==> output: exp and from input: aex1234 =====>output: ex thanks, (1 Reply)
Discussion started by: yeclota
1 Replies

7. Shell Programming and Scripting

Regex extraction

Hello, I need your help to extract text from following: ./sherg_fyd_rur:blkabl="R23.21_BL2008_0122_1" ./serge_a75:rlwual="/main/r23.21=26-Mar-2008.05:00:20UTC@R11.31_BL2008_0325" ./serge_a75:blkabl="R23.21_BL2008_0325" ./sherg_proto_npiv:bkguals="R23.21_BL2008_0302 I80_11.31_LR" I... (11 Replies)
Discussion started by: abdurrouf
11 Replies

8. Shell Programming and Scripting

extraction of last but one char

I need to extract the character before the last "|" in the following lines, which are 'N' and 'U'. The last "|" shouldn't be extracted. Also the no.s of "|" may vary in a line, but I need only the character before the last one. ... (5 Replies)
Discussion started by: hidnana
5 Replies

9. Shell Programming and Scripting

AWK extraction

Hi all, I have a data file from which i would like to extract only certain fields, which are not adjacent to each other. Following is the format of data file (data.txt) that i have, which has about 6 fields delimited by "|" HARRIS|23|IT|PROGRAMMER|CHICAGO|EMP JOHN|35|IT|JAVA|NY|CON... (2 Replies)
Discussion started by: harris2107
2 Replies

10. Shell Programming and Scripting

Help with tar extraction!

I have this tar file which has files of (.ksh, .ini &.sql) and their hard and soft links. Later when the original files and their directories are deleted (or rather lost as in a system crash), I have this tar file as the only source to restore all of them. In such a case when I do, tar... (4 Replies)
Discussion started by: manthasirisha
4 Replies
Login or Register to Ask a Question