Extraction of upstream and downstream regions from long sequence file


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Extraction of upstream and downstream regions from long sequence file
# 8  
Old 08-07-2015
but i want new result file in the same folder where my other files are present file1 and file2 in the terminal it is showing nothing and also not creating any new file.

kindly help me out this is a major step of my research and am stuck here from 3 weeks Smilie
# 9  
Old 08-07-2015
I guess you are stuck because your specifications are a) unclear, and b) moving.

This is what I get from your samples above (lines chopped at 178 chars):
Code:
sequence id     extracted region small  extracted region big upstream and downstream
>gi|546671471|gb|AWWX01449637.1|        CACAAGACCACCAGGGAAGT    TGGGGTTGTACTGGGTCTTGGTTACAGGATCTTTAGTTGCAGCATGTGGGATCTAGATCCCTGTCCAGGGCCCTGAGTATGGGGAGCTCAGAGTCTTAGCCACAAGACCACCAG
>gi|546671514|gb|AWWX01449617.1|        ACACATACACATGCACACACAAC CCCCCGTAGTGGGGGTAGGTTGCTCTGTCAAGACCAAGGGCCAATTATTTTCTTACCATGAAAACCAAGAAGAAGGTGACTACAGGTGATTCAACCTCTAACACATACACATGC
>gi|546669842|gb|AWWX01450698.1|        GCCCAGCCCAGCCCAGCCCA    GCCGAGTTCAGCTCAGCTCAGCCCAGCAAAATTCAGCCCAGCTCAGCCCAGCAAAGCTCAGCCCAGCTCAGCCCAGCTCACCCAAGCTCAGCTCAGCTCAGCCCAGCCCAGCCC
>gi|546669842|gb|AWWX01450698.1|        GCCCAGCCCAGCCCAGCCCA    CAGCTCACCCACTCTGCCCAGCTCAGCCCAGCAAAGCTCAGCCAAGCTCAGCTCAGCTCAACAAAGCCCAGCTCAGCTCAGCCCAGGTCAACCCAACTAAGCCCAGCCCAGCCC
>gi|546669842|gb|AWWX01450698.1|        GCCCAGCCCAGCCCAGCCCA    CCCAGCTCAGCTCAGCCTAACCCAGCTCAGCCCAGCTCACCCACTCCGCCCAGCTCCGCCCAGCTCAGCCCAGCTCACCCACTCCGCCCAGCTCCGCTCAGCCCAGCCCAGCCC
>gi|546669842|gb|AWWX01450698.1|        GCCCAGCCCAGCCCAGCCCA    GCTCAGCCCAGCAAAGCTCAGCCAAGCTCAGCCCAGCTCAACAAAGCCCAGCTAAGCTCAGCCCAGGTAAACCCAACTAAGCCCAGCTCAGCTCAGCTCAGCCCAGCCCAGCCC
>gi|546669842|gb|AWWX01450698.1|        GCCCAGCCCAGCCCAGCCCA    GCCCAGCAAAGCTCAGCCAAGCTCAGCCCAGCTCAACAAAGCCCAGCTAAGCTCAGCCCAGGTAAACCCAACTAAGCCCAGCTCAGCTCAGCTCAGCCCAGCCCAGCCCAGCCC
>gi|546669977|gb|AWWX01450566.1|        AAAAAGTTTGTTTGGGTTTTT   TAAAGAATATGTATTACAAGGTTACTCCTAACTGTGAGAATCATTAAGCCTTTTTTTTCTATGAGATAATGTGGATGGTCGCCTATGTATGGGGTTGGCCAAAAAGTTTGTTTG
>gi|546669925|gb|AWWX01450616.1|        CACCTTGATCTTGGACTTCTAGCC        CCAGAGAAAAAAGAAGAGAAAAAAAATCACTTGGGGACATAGCAAGAAGGTGGCCATCCTCAAACCAAGGAGAGAAGCCAGAAGAAACCAAACTTTCCAACACCTT

Use full paths for your input files, and redirect the output to taste.
# 10  
Old 08-07-2015
yes Rudic sir i want exactly this format but why it is chopping off character which i have marked as *

Code:
TGGGGTTGTACTGGGTCTTGGTTACAGGATCTTTAGTTGCAGCATGTGGGATCTAGATCCCTGTCCAGGGCCCTGAGTATGGGGAGCTCAGAGTCTTAGCCACAAGACCACCAG*************************************************************************************

actually i want next characters also upto 230 but it is just showing 100+ chatacters

can you please provide me the script which you are using. Ishall be thankfull to you

Last edited by Scrutinizer; 08-07-2015 at 07:45 AM..
# 11  
Old 08-07-2015
Post#9 is the result of the script in #7, lines chopped at 178 chars to keep the post reasonably small. The real result has the desired -100 to +100 chars.
# 12  
Old 08-07-2015
yes it seems it is working. Thanku very very much Sir indeed!!!!!!! Smilie Smilie
# 13  
Old 08-07-2015
To get the output format requested in post #6 (no > at the start of the output lines, line breaks at 70 characters in the last field, and double quotes around the last field, you could also try:
Code:
#!/bin/ksh
results_file="${1:-result.ods}"
fasta_file="${2:-smalldata.fasta}"
awk '
BEGIN {	printf("%s\t%s\t%s\n", "sequence id", "extracted region small",
		"extracted region big upstream and downstream")
}
FNR == NR {
	# Process 1st input file.
	if(NR > 1) {
		# Skip header.
		beg[$1, ++si[$1]] = $2
		len[$1, si[$1]] = $3 - $2 + 1
		ubeg[$1, si[$1]] = ($2 > 100) ? $2 - 100 : 1
		ulen[$1, si[$1]] = $3 - ubeg[$1, si[$1]] + 101
	}
	next
}
/^>/ {	if(f) {	# Process previous entry and clear accumulated data.
		psi()
		b = 0
		d = ""
	}
	# Grab sequence ID from this line in the 2nd file.
	if(!(f = substr($1, 2)) in si)
		# This sequence ID is not in our list of those to be processed.
		f = ""
	next
}
f {	# We have a line in the 2nd file associated with a sequence ID to be
	# processed, gather data:
	d = d $0
}
function psi(	i, j) {
	# Produce output for each requested region of the sequence ID specified
	# by f.
	for(i = 1; i <= si[f]; i++) {
		printf("%s\t%s\t\"", f, substr(d, beg[f,i], len[f,i]))
		spot = ubeg[f, i]
		left = ulen[f, i]
		for(left = ulen[f, i]; left > 0; left -= 70) {
			printf("%s%s\n",
				substr(d, spot, (left > 70) ? 70 : left),
				(left > 70) ? "" : "\"")
			spot += 70
		}
	}
}
END {	if(f)	# Process last entry.
		psi()
}' "$results_file" "$fasta_file"

which, with the data provided in post #1 in this thread produces the output:
Code:
equence id	extracted region small	extracted region big upstream and downstream
gi|546671471|gb|AWWX01449637.1|	CACAAGACCACCAGGGAAGT	"TGGGGTTGTACTGGGTCTTGGTTACAGGATCTTTAGTTGCAGCATGTGGGATCTAGATCCCTGTCCAGGG
CCCTGAGTATGGGGAGCTCAGAGTCTTAGCCACAAGACCACCAGGGAAGTTTCCAGTTACACGATCATTT
TAGTTAGATAAATATTTTGTGTTTACATTATTACTGTATCAGTGATATTCACACTGAATTATACAATGTG
ATTTTTACAC"
gi|546671514|gb|AWWX01449617.1|	ACACATACACATGCACACACAAC	"CCCCCGTAGTGGGGGTAGGTTGCTCTGTCAAGACCAAGGGCCAATTATTTTCTTACCATGAAAACCAAGA
AGAAGGTGACTACAGGTGATTCAACCTCTAACACATACACATGCACACACAACGTGGACACTCAGAGAGT
TGAGTTAAAGCATAACTATTTTACCTCCAAATTACTGCTAATGCTGAAAAGTACAGGTATTTATCTAATG
TGTTTCAGGGTCA"
gi|546669842|gb|AWWX01450698.1|	GCCCAGCCCAGCCCAGCCCA	"GCCGAGTTCAGCTCAGCTCAGCCCAGCAAAATTCAGCCCAGCTCAGCCCAGCAAAGCTCAGCCCAGCTCA
GCCCAGCTCACCCAAGCTCAGCTCAGCTCAGCCCAGCCCAGCCCAGCCCAGCTCACCCACTCTGCCCAGC
TCAGCCCAGCAAAGCTCAGCCAAGCTCAGCTCAGCTCAACAAAGCCCAGCTCAGCTCAGCCCAGGTCAAC
CCAACTAAGC"
gi|546669842|gb|AWWX01450698.1|	GCCCAGCCCAGCCCAGCCCA	"CAGCTCACCCACTCTGCCCAGCTCAGCCCAGCAAAGCTCAGCCAAGCTCAGCTCAGCTCAACAAAGCCCA
GCTCAGCTCAGCCCAGGTCAACCCAACTAAGCCCAGCCCAGCCCAGCCCAGCTCACTCATGCCACCCTGC
TCAGGCCAGCTCAACCCAGCTCAGGCCAGCTCAGCCCAGCTCAACCCAGCCCAGCCCAGCTCACCCACTC
TGCCCAGCTC"
gi|546669842|gb|AWWX01450698.1|	GCCCAGCCCAGCCCAGCCCA	"CCCAGCTCAGCTCAGCCTAACCCAGCTCAGCCCAGCTCACCCACTCCGCCCAGCTCCGCCCAGCTCAGCC
CAGCTCACCCACTCCGCCCAGCTCCGCTCAGCCCAGCCCAGCCCAGCCCAGCTCCGCTTAGCCCAGCCCA
GCCCAACCCAGCTCACCCACTCTGCCCAGCTCAGGGCAGCTCAACCCAGCTCAGGCCAGCTCAACCCAGC
CCAGCCCAGC"
gi|546669842|gb|AWWX01450698.1|	GCCCAGCCCAGCCCAGCCCA	"GCTCAGCCCAGCAAAGCTCAGCCAAGCTCAGCCCAGCTCAACAAAGCCCAGCTAAGCTCAGCCCAGGTAA
ACCCAACTAAGCCCAGCTCAGCTCAGCTCAGCCCAGCCCAGCCCAGCCCAGCCCAGCTCACTCATGCCAC
CCTGCTCAGGCCAGCTCAACCCTGCTCAGGCCAGCTCAACCCAGCTCAGGCCAGCTCAGCCCAGCTCAAC
CCAGCCCAGC"
gi|546669842|gb|AWWX01450698.1|	GCCCAGCCCAGCCCAGCCCA	"GCCCAGCAAAGCTCAGCCAAGCTCAGCCCAGCTCAACAAAGCCCAGCTAAGCTCAGCCCAGGTAAACCCA
ACTAAGCCCAGCTCAGCTCAGCTCAGCCCAGCCCAGCCCAGCCCAGCCCAGCTCACTCATGCCACCCTGC
TCAGGCCAGCTCAACCCTGCTCAGGCCAGCTCAACCCAGCTCAGGCCAGCTCAGCCCAGCTCAACCCAGC
CCAGCTCACC"
gi|546669977|gb|AWWX01450566.1|	AAAAAGTTTGTTTGGGTTTTT	"TAAAGAATATGTATTACAAGGTTACTCCTAACTGTGAGAATCATTAAGCCTTTTTTTTCTATGAGATAAT
GTGGATGGTCGCCTATGTATGGGGTTGGCCAAAAAGTTTGTTTGGGTTTTTCCACATGCTGGTATAGAAA
ACTTGAATACACTTTTTGGCCAACCCAGTAAGGGCTTTGCCTCATCTCTGTCTAGCCAAATTGCCACCTT
CCCTGCTAAGC"
gi|546669925|gb|AWWX01450616.1|	CACCTTGATCTTGGACTTCTAGCC	"CCAGAGAAAAAAGAAGAGAAAAAAAATCACTTGGGGACATAGCAAGAAGGTGGCCATCCTCAAACCAAGG
AGAGAAGCCAGAAGAAACCAAACTTTCCAACACCTTGATCTTGGACTTCTAGCCTCCAGAACTGTGAGAA
AATAAATTTCTGTAGAGTCACCCAGTCTGTGGTATTTTGTTATGGCAGACCTAGCAGACTGATATGCTCC
TTAAGGCAAGATGT"

I also have a version that will work with versions of awk that can't handle "long" strings, but it splits the last field on boundaries based on the uncombined input lines instead of the hard coded 70 character maximum segments used by the code above.

Note also that the above code will work even if the requested region starts before the 101st character. In that case it will truncate the leading context to start at character position 1. All of the code provided so far obviously truncates the trailing context if less than 100 characters of trailing context are present in the input.

I don't understand how this output is useful when there are multiple outputs for a single sequence ID and nothing in the output identifies what range from the input sequence is included in the output; but this is what you said you wanted...
# 14  
Old 08-08-2015
cragun sir, how to run this command from terminal ? i mean i need to save it with .awk extension or to paste these lines directly into the terminal? i did do the later but it shows infinite loop.

---------- Post updated at 12:31 AM ---------- Previous update was at 12:29 AM ----------

your output format and output is excellent output as like rudic sir in post 7 and 9 but i also want to run this too sucessfully. Smilie
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Sequence extraction

i want to extract specific region of interest from big file. i have only start position, end position and seq id, see my query is: I have file1 is this >GL3482.1 GAACTTGAGATCCGGGGA GCAGTGGATCTCCACCAG CGGCCAGAACTGGTGCAC CTCCAGGCCAGCCTCGTC CTGCGTGTC >GL3550.1... (14 Replies)
Discussion started by: harpreetmanku04
14 Replies

2. Shell Programming and Scripting

Parsing and masking regions from a single fasta file with subsequence

HI, I have a Complete genome fasta file and I have list of sub sequence regions in the format as : 4353..5633 6795..9354 1034..14456 I want a script which can mask these region in a single complete genome fasta file with the alphabet N kindly help (2 Replies)
Discussion started by: margarita
2 Replies

3. IP Networking

Newbie BIND DNS question: resolving upstream hosts?

Old skool UNIX and Linux geek here, but newbie to the world of DNS and bind. I've recently been tasked with replacing our DNS infrastructure, currently on Windows, with a RHEL based solution. And I assume that means using bind, which I've not used before. Here's my question: Suppose our company... (3 Replies)
Discussion started by: lupin..the..3rd
3 Replies

4. Shell Programming and Scripting

Obtain the names of the flanking regions

Hi I have 2 files; usually the end position in the file1 is the start position in the file2 and the end position in file2 will be the start position in file1 (flanks) file1 Id start end aaa1 0 3000070 aaa1 3095270 3095341 aaa1 3100822 3100894 aaa1 ... (1 Reply)
Discussion started by: anurupa777
1 Replies

5. Shell Programming and Scripting

FILE_ID extraction from file name and save it in CSV file after looping through each folders

FILE_ID extraction from file name and save it in CSV file after looping through each folders My files are located in UNIX Server, i want to extract file_id and file_name from each file .and save it in a CSV file. How do I do that? I have folders in unix environment, directory structure is... (15 Replies)
Discussion started by: princetd001
15 Replies

6. Shell Programming and Scripting

find common entries and match the number with long sequence and cut that sequence in output

Hi all, I have a file like this ID 3BP5L_HUMAN Reviewed; 393 AA. AC Q7L8J4; Q96FI5; Q9BQH8; Q9C0E3; DT 05-FEB-2008, integrated into UniProtKB/Swiss-Prot. DT 05-JUL-2004, sequence version 1. DT 05-SEP-2012, entry version 71. FT COILED 59 140 ... (1 Reply)
Discussion started by: manigrover
1 Replies

7. UNIX for Dummies Questions & Answers

extract regions of file based on start and end position

Hi, I have a file1 of many long sequences, each preceded by a unique header line. file2 is 3-columns list: headers name, start position, end position. I'd like to extract the sequence region of file1 specified in file2. Based on a post elsewhere, I found the code: awk... (2 Replies)
Discussion started by: pathunkathunk
2 Replies

8. UNIX for Dummies Questions & Answers

fast sequence extraction

Hi everyone, I have a large text file containing DNA sequences in fasta format as follows: >someseq GAACTTGAGATCCGGGGAGCAGTGGATCTC CACCAGCGGCCAGAACTGGTGCACCTCCAG GCCAGCCTCGTCCTGCGTGTC >another seq GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT GACATTTTCATTACTACCATTTTGGAGTACA >seq3450... (4 Replies)
Discussion started by: Fahmida
4 Replies

9. Shell Programming and Scripting

awk: union regions

Hi all, I have difficulty to solve the followign problem. mydata: StartPoint EndPoint 22 55 2222 2230 33 66 44 58 222 240 11 25 22 60 33 45 The union of above... (2 Replies)
Discussion started by: phoeberunner
2 Replies

10. Programming

selecting rows with specific IDs for downstream analysis

Hi, I'm working hard on SQL and I came across a hurdle I'm hoping you can help me out with. I have two tables table1 headers: chrom start end name score strand 11 9720685 9720721 U0 0 + 21 9721043 9721079 U0 0 - 1 9721093 9721129 U0 0 + 20 ... (2 Replies)
Discussion started by: labrazil
2 Replies
Login or Register to Ask a Question