Sequence extraction

08-05-2015

Registered User

35, 0

Join Date: Aug 2015

Last Activity: 13 November 2015, 4:12 AM EST

Posts: 35

Thanks Given: 0

Thanked 0 Times in 0 Posts

Sequence extraction

i want to extract specific region of interest from big file. i have only start position, end position and seq id, see my query is:

I have file1 is this

Code:

>GL3482.1
GAACTTGAGATCCGGGGA
GCAGTGGATCTCCACCAG
CGGCCAGAACTGGTGCAC
CTCCAGGCCAGCCTCGTC
CTGCGTGTC
>GL3550.1
GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT
GACATTTTCATTACTACCATTTTGGAGTACA
>GL3472.1
TTTTCCTGTTCACTGCTGCTTTTCTATAGACAGCA
GCAGCAAGCAGTAAGAGAAAGTA

file2 is this :

Code:

seq id     start    end
GL3482.1   323100   323743
GL3550.1   41911    40888
GL3472.1   274408   272617

and i want result like this:

Code:

>GL3482.1 
TTGAGA
>GL3550.1 
GGCATTTTTGTG
>GL3472.1
TTCCTGTT

when i run this command :

Code:

$ awk 'NR==FNR{if($0~/^>/){i=substr($0,2);getline};a[i]=a[i] $0;next}{print ">" $1 ORS substr(a[$1], $2, $3-$2+1)}' file1 FS=\\t file2

i got no output or output file. why so?

Moderator's Comments:

Please use CODE tags as required by the forum rules you agreed to when you joined this forum.

---------- Post updated at 01:28 AM ---------- Previous update was at 01:21 AM ----------

i am using Ubuntu OS

Last edited by Scrutinizer; 08-05-2015 at 04:07 AM.. Reason: Add CODE tags, remove extraneous closing COLOR tag. Removed spurious space and a newline from first sample file

harpreetmanku04

View Public Profile for harpreetmanku04

Find all posts by harpreetmanku04

08-05-2015

Moderator

12,296, 3,792

Join Date: Nov 2008

Last Activity: 1 January 2021, 1:47 AM EST

Location: Amsterdam

Posts: 12,296

Thanks Given: 679

Thanked 3,792 Times in 3,282 Posts

Maybe because FS is set to '\t' and there appear to be no TABs in your second sample file, so the fields will not match. And you are setting the index to the entire record, save the first character. which is probably not what you want (you probably meant to use $1 here, but that would not be a sure way to do it either, because in the FASTA format the identifier is allowed to contain spaces). And the FASTA file is word wrapped, so you need to take out the newlines and not use getline to get only the second line ....

The best way to do that is is to use ">" as a record separator and use "\n" as the field separator. By setting OFS as the empty string, and assigning a value to one of the fields, all newlines will be replaced by empty strings, so this will effectively remove the word wrap. And the sequence will become one continuous string, which will make it suitable for substring selection.

Using your file order, we would get something like this.

Code:

awk 'NR==FNR{i=$1; $1=x; A[i]=$0; next} $1 in A{print ">" $1 ORS substr(A[$1], $2, $3-$2+1)}' RS=\> FS='\n' OFS= file1 FS=" " RS="\n" file2

If we read the files the other way around, then it becomes more memory efficient:

Code:

awk 'NR==FNR{S[$1]=$2; E[$1]=$3; next} $1 in S{i=$1; $1=x; print RS i FS substr($0,S[i],E[i]-S[i]+1)}' file2 RS=\> FS='\n' OFS= file1

Last edited by Scrutinizer; 08-05-2015 at 04:52 AM..

Scrutinizer

View Public Profile for Scrutinizer

Find all posts by Scrutinizer

08-05-2015

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

With your two sample input files (with the combined lengths of the lines in each group that do not start with a > being less than 100 characters), I don't see how you would expect any output when the substring you are trying to extract from those strings starts more than 40,000 characters into that string, and in two of the three cases has an ending position in the string that comes before the starting position (thereby requesting a substring that has negative length).

In addition to those problems, as Scrutinizer said, your script specifies that the input field separator for file2 is a tab character, but there are no tab characters in the data you showed us. Therefore, you are requesting a substring of 1 character starting at position 0 (when arrays of characters in awk start at position 1).

Note also that although you might be able to create an array element in awk or gawk on Ubuntu that is more than 323,000 characters long; on most UNIX systems and BSD-based systems, awk won't let you read a line, write a single output string, or create a variable whose value is much more that LINE_MAX bytes long (on most systems LINE_MAX is 2,048).

This User Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

08-05-2015

Registered User

35, 0

Join Date: Aug 2015

Last Activity: 13 November 2015, 4:12 AM EST

Posts: 35

Thanks Given: 0

Thanked 0 Times in 0 Posts

let me clear my query again :

i have 2 files, file1 is 1.fasta sequences file which are as follow:

Code:

>gi|547177824|gb|AWWX01000001.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig0, whole genome shotgun sequence
ACATAATCCCCGAGGCAAAATAAGTCTCTAATGAACTTGACCCTATGAGTGTCAAGGTGAGGGAGTCCTA
AGAGACTGACAAGGGCTGTGAAGGCTACAAGGGAAGAAGACAACAGGATCAGCTGGAAAAGGCTTAAAAT
TGCAGCCCTTGATTCTTCCACTGTGCCCTGGGGCCATGAATCGCTACAGCCTCACTGAAGGAATCTGAGG
TAGCATCTCAGAGCTCCCATGCCAAGACCATGGGGAACAAATTTGAGTTGGATGTGGCAGCATGGCCCAG
AGGTCATGAAATAACCAGCACAAACTTTGTTAGTGGATGACAACTGGCCCTTTTAAGTGATCACCTTAAT
AGACTATCATTCAGGGACTAAAAGAGAAAACTTAGTACCCCAAAAATTGTACCAAAGGACAGAAATTCAC
CTCAGAACTTTCCAGCAAAG
>gi|547177823|gb|AWWX01000002.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig1, whole genome shotgun sequence
CAATTTGATTCCAGCTTGTGTTCATCCATCTGGGAATTTTCATGATGTACTGTGCATATAACTTAAATAA
TCAGGGTGACAATATGCAGCCTTAATGTACTTGTGTCCCAATTTTGAACCTGTCTGTTGTTCCGTGTCCA
CTTCTAACTGTTGCTTCTTGACCTGTATACCAGTTTTTCAGAAGGCAGGTAAGGTGCTCTGGTATTCCCA
TCTCTTTAATAATTTTGCAGTTTCCTGTGATACACACAGACAAAGGCTTTTACATAGTCAAACAAGTAGA
AGTTTTTCTGGAGTTCTCTTGCTTTGCCTA
>gi|547177822|gb|AWWX01000003.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig2, whole genome shotgun sequence
GGTCCAAGAGGCTCCTTCTTTTGTCCAGAAAGAGTTCCTGTGAACAGGGCTCAGTGGCTGACTGATGGTT
AGTATCATCTGGAGAAGGGAAAGGGCTTTTGTAGGCAAATTTAGCATTGCCAAGCCAGCAGAGTCTACCT
GGCAAGTCTTTCAAAGTTCAGAATTTTTCCTAAAGGCAAGGAAAACATGGACATAAGGGACCTCGCTGGA
ATCCTAAGCGCAAATCTTCAGCAATGGACAGTCCCATCTGAAAAGGAAATGATGAATGCTCATAAGGGGC
TTCGTGTGTCACTGAATTCCCTTATGGACTTTTCCTGGCTGTGGCATCCCTCATCTGTGATCTTGCTGGG
CATAATGTGGTTCAGTTCTCGCCCCAGGGGCCGTTGTGGCTTCCCTGAGTATTTTTTTAAGAGATTATTT
ATGTGTTTAATTTTGGTTGCACTGGCTCTTCATTGCTCCATGTGTGTTCTCTAGTTTCGAGGAGCAGGGG
CTA
>gi|547177821|gb|AWWX01000004.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig3, whole genome shotgun sequence
CTATATAAATAAATCTATTTCTTGCCTAACACTTTGCCTGTTGCTGAATTCCTTCTGTGCCGAGACACAA
ACAATGTGAGCCACAGTACATCCAGGTACCAGATAAGTGATTCTAAATAAAAGACCATGGTTCAAGACTC
AAACTGGATTTTGGGGAGGGGGGGAGGTTGGAGTCCTGGCCATGTGGATTTTAGTCCAAACCTGATGTGA
TAGGTTTCAGGTAGAGATCATCTCCATGCCACCACATTGGAAATTTTATACTATGGTCTGTTATACTGGT
TCA
>gi|547177820|gb|AWWX01000005.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig4, whole genome shotgun sequence
TCTCAAGCAGACTTTACCTATAAAAAGGGCAATGTGTCCTGGAACTGCAAGCTGAGCTTAGGAACAGAGA
ACTGCAAAAACTATTGGCATGAACAACTCTGTCCAAACATCTACATTAGGAACCTTGACTGACCTCTAGC
ATGGTTTCTAGCAGCAACCTAAGGCCACGTTCTAGGACAACTCAGCTACCCCTGAGTTCCTGTCTAGAAA
ATTTCAAGGCTACCAAAGGAATCTGCTCCAGCCAACATCTGACATAAGCCCCTCATCTTCCTTTACTTAG
AGTGTCTATTTAAAACAAAGACCAAAAAAAAAAAAAACAAAAACAAAAAACCCTCACGATTACAAGAAAA
GTGTGACGGAAATAAACTAGAAATTGATAACAAAAAGACAGCAGGAAAATTAAAGATTATTTGCAGACTA
AGCAGAAAACATCTAAA
>gi|547177819|gb|AWWX01000006.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig5, whole genome shotgun sequence
ATCAGCAAGCTCAACCTCGGGCAAACGATTGTCCGAGTGATGATGGGGGCTGCCCAAGTCCGGTCTCCTC
ATGGTGCTCCTGGGGGTCATCTTCATGAATGGTAACCACGCCACCGAGGAGGAGGTCTGGGAATTCCTGA
GTATGTTGGGGATCTATGCTGGGAGGAGGCACTGGATCTTTGGGGAGCCCAGAAGGCTCATCACCAAAGA
TCTGGTGCAGAAGGAGTACCTGAACTACCGCCGGGTGCCCAATAGTGATCCTCCGCGCTACCAGTTGCTG
TGGGGCCCGAGAGCTTGTGCTGAGACCAGTAAGATGAAGGTACTGGAGGTTCTAGCCAAGTTCCACGGTA
GGGTCCCTAGTTCCTTCCCAGACCTATATGACGAGGCTCTGAGAGATCAGGCGGAGACAGCAGGGCGGAG
AGGTGTGGCCAGGGCTCCCGACCATGGCTGAGGCCAGTGCCC
>gi|547177818|gb|AWWX01000007.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig6, whole genome shotgun sequence
ATGAAGAAATGAAGAGAAAAAAAATGAGCACAAATACCTCTCATGAACATAGAAGCAAAATCCCCAACCA
AATATAAGAAAAGAAACCTAAAAATCCATAGAAGGAATTACATACTATATCTAAGTAGGATGCATTTCAG
ATGTGCAACATTTGAAAATCAGCAATCATAAATCATAATAACAATAGTCTAAGGAAGATAATCACAAGAT
GAGATCAACAGATGAAGGGAAAAATCATCTGATAAATCCAACACCCATTCATCATAAAAATTCTACAGCA
AACTAGATACAGACAGGCATTTCCTCAGCTTCATATAAAACATACTAAAAAGAAATACAGGTAAAACATA
CTTCATAGT
>gi|547177817|gb|AWWX01000008.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig7, whole genome shotgun sequence
AACTTTGGTGCCCCGTGGTCCTGGCATGGGGCCTCGGAGATGCTGCCTCAGATTCCTTCAGTGAAGCAGT
GGAGATTCATGATCCCAGGGCACAGTGCAAGAATCCAGGGCTCAAGTTTAAAGCATTTTTGCAGCTCATC
CTGTTGTCTGCTTCCCTTGCAGGCTGCACATCTATGCCAGTCTCTTAAGTACTCCCCCACCTAGGTACAC
ATAGGGTCAAGTTCAGTGGAGACTTACTGTCCATCAGGGATTATTTAATTTTATGTTTTCACTTTTGTTA
ATAGTTTTCTTCATTCCTTCAATAAGATTTCCGTGGTTACTTCTGGGTACAAAAA
>gi|547177816|gb|AWWX01000009.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig8, whole genome shotgun sequence
CCTTCTTTTTCCTGAGACTCTCAGGAAATTCTCCTTGCCTCCATGAAGACTGTATACATTTTTCTAAAGT
TTCTTTGTAGGTAAATTTTCTCTATGGTTGCTTTTGTTTGTGGTATATTTTCTGTTACTTGGTGTAAGTA
ACTGTTGATTCTTCCAGACCAACTGGATGATTTTGCTGCCATGATCTTTTTATTATTCCAGATTTTTGAA
GTTTCTTACATGTTTGAATATTTCCTGTCCTGTTTTCTAAGATTTAATTCGAGAATCAAATTGTCATTGT
GATCTTTTGCTTTTCATTACTTCTGACTTTTATTCATTTGTCATTGTTGTATATAACTCTAGTGGCTATA
TGTACTT

file2 is result.ods file which is as :

Code:

gi|546687122|gb|AWWX01446731.1|	       13172	13194
gi|546693672|gb|AWWX01441057.1|	       6859	        6837
gi|546698969|gb|AWWX01436431.1|	       18753	18775
gi|546703077|gb|AWWX01432778.1|   	4132	        4154
gi|546670495|gb|AWWX01450063.1|	        4111	        4133
gi|546689695|gb|AWWX01444610.1|	       14602	14580
gi|546691352|gb|AWWX01443112.1|	       10073	10051
gi|546880329|gb|AWWX01275531.1|	       1158	        1136
gi|546670216|gb|AWWX01450333.1|	       10633	10655
gi|546678257|gb|AWWX01448205.1|	        1112	        1134
gi|546693672|gb|AWWX01441057.1|	        6832	        6854
gi|546704475|gb|AWWX01431394.1|	        18135	18113
gi|546699347|gb|AWWX01436084.1|	        26840	26862
gi|546702960|gb|AWWX01432895.1|	        13515	13493
gi|546833971|gb|AWWX01313367.1|	        615	        593
gi|546860287|gb|AWWX01291803.1|	         2188	2210
gi|546689115|gb|AWWX01445179.1|	        10761	10739
gi|546701370|gb|AWWX01434384.1|	        2616	        2638
gi|546694075|gb|AWWX01440674.1|	        9568	        9546
gi|546701082|gb|AWWX01434635.1|	        8423	        8445
gi|547071923|gb|AWWX01098172.1|	        135	        157
gi|546705086|gb|AWWX01430793.1|	        3181	        3203
gi|546704429|gb|AWWX01431440.1|	       1352	        1330
gi|546709146|gb|AWWX01426952.1|	       52962	52984

and we want a awk script which creates an output file which contain sequences extracted from these coordinates along with id with > symbol .
for example:

Code:

>gi|546709146|gb|AWWX01426952.1|
acctgctgcatgcgtgcgtggcgtgcaaaatgcagtcaaggcaggtcagtccatgcatgacgtgcaatgcattgcatggcgtgcaaaatgcaggcgtggcgtgcaaaatgcagtcaaaaaattgccgtgcaatgggcc

output file should be like this.

I hope now it is clear to you.

Last edited by Don Cragun; 08-05-2015 at 06:57 AM.. Reason: Add CODE and ICODE tags.

harpreetmanku04

View Public Profile for harpreetmanku04

Find all posts by harpreetmanku04

08-05-2015

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

The comments by Don Cragun still apply: most of the sequences are shorter than the coordinates that you have in your file2, and there are coordinates that result in negative string lengths.
On top, none of the samples in file2 match any of those in file1.

How about posting some meaningful samples?

RudiC

View Public Profile for RudiC

Find all posts by RudiC

08-05-2015

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Please explain your example; it is not at all clear to me.

You say you want the output:

Code:

>gi|546709146|gb|AWWX01426952.1|
acctgctgcatgcgtgcgtggcgtgcaaaatgcagtcaaggcaggtcagtccatgcatgacgtgcaatgcattgcatggcgtgcaaaatgcaggcgtggcgtgcaaaatgcagtcaaaaaattgccgtgcaatgggcc

which seems to have been selected by the line:

Code:

gi|546709146|gb|AWWX01426952.1|	       52962	52984

from the file result.ods which I thought meant that you wanted characters 52962 through 52984 from the data following a line starting with:

Code:

>gi|546709146|gb|AWWX01426952.1|

in the file 1.fasta. But, that string does not appear in 1.fasta and, even if it did, the data from characters 52962 through 52984 would be 23 characters long; not the 138 characters you have said you want to have output.

What am I misunderstanding???

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

08-05-2015

Registered User

35, 0

Join Date: Aug 2015

Last Activity: 13 November 2015, 4:12 AM EST

Posts: 35

Thanks Given: 0

Thanked 0 Times in 0 Posts

may be that entries would be at the bottom of the file actually sir, i just copied and posted very few entries from file1 and file2 may be matched would be skipped. i cant paste my whole data file here i have 4565005 sequence entries in my data.

and about negative value, please provide me the script for like 300 250 = 50
no problem if there is 250-300 =-50 i will correct it and will then apply script.

harpreetmanku04

View Public Profile for harpreetmanku04

Find all posts by harpreetmanku04

Shell Programming and Scripting

Sequence extraction

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Extraction of upstream and downstream regions from long sequence file

Discussion started by: harpreetmanku04

2. Shell Programming and Scripting

String Extraction

Discussion started by: karan8810

3. Shell Programming and Scripting

find common entries and match the number with long sequence and cut that sequence in output

Discussion started by: manigrover

4. UNIX for Dummies Questions & Answers

fast sequence extraction

Discussion started by: Fahmida

5. Shell Programming and Scripting

extraction

Discussion started by: xerox

6. Programming

extraction from a path

Discussion started by: yeclota

7. Shell Programming and Scripting

Regex extraction

Discussion started by: abdurrouf

8. Shell Programming and Scripting

extraction of last but one char

Discussion started by: hidnana

9. Shell Programming and Scripting

AWK extraction

Discussion started by: harris2107

10. Shell Programming and Scripting

Help with tar extraction!

Discussion started by: manthasirisha