extract regions of file based on start and end position


 
Thread Tools Search this Thread
Top Forums UNIX for Dummies Questions & Answers extract regions of file based on start and end position
# 1  
Old 09-01-2012
extract regions of file based on start and end position

Hi, I have a file1 of many long sequences, each preceded by a unique header line. file2 is 3-columns list: headers name, start position, end position. I'd like to extract the sequence region of file1 specified in file2.

Based on a post elsewhere, I found the code:
Code:
awk 'NR==FNR{if($0~/^>/){i=substr($0,2);getline};a[i]=a[i] $0;next}{print ">" $1 ORS substr(a[$1], $2, $3-$2+1)}' file1 FS=\\t file2

But with the files I have, regions are extracted from only a subset of the specified sequences.

file1 (my real file is much longer, >47000 lines, and each sequence is much longer too):
Code:
>GL349621.1TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAA
>GL3422.1
TTACCGGCACTTCATTCACTTTTAAATACCCGTGTTTCGATTTTGTTGTTTTAGCCCCGCGTAAAGCAAA
>GL3423.1
104-151:aphid_project aman$ head -10 Ap_assembly2_scaffolds_corrected.fasta 
>GL3421.1
TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAA
>GL3422.1
TTACCGGCACTTCATTCACTTTTAAATACCCGTGTTTCGATTTTGTTGTTTTAGCCCCGCGTAAAGCAAA[/QUOTE]file2: 
Quote:
GL3550.1 61498 52100 GL3465.1 293074 297989 GL3567.1 234944 236074 GL3482.1 323100 323743 GL3550.1 41911 40888 GL3472.1 274408 272617 GL3550.1 75587 73909 GL3465.1 869051 867919 GL3472.1 294812 289561 GL3567.1 238848 236284
Result after using awk command above: [QUOTE]>GL350250.1 >GL3465.1 ATGGTCTCAAAGCCGTTTGTTTTCGTGTTTGTGCTCATGTCCGTGGTCGGTGTGTCGTTCTCCGTGACCGAAGAAAACGATGGTACAAAGGTCGTGGACAAGGAAGAGGACCATCATCAGAACATCCAAGAGGAACTCAAGAAATTTTTGTCGACGTTAGAGAAAATCGACATAGACCAGATATTGAACAACCACAGGTTGATGTCGAACAACGTCAAATGTTTTCTGAACGAAGGACCGTGTACGGCACAACTGAGGGAAATGAAAAGTAAGAGTGTGTGCGACACGTAATATTACTGATTGTACGATGACTTCGCTAAATACTATTTGGTTGAAATATATTATTATGCCTAAATTACTGTAAATCATGTATTTAATAATTTACTAGTACCTATAATATATTATTTTGATAAATTTATATTTCATGTTTAGTAGTTATTTAGTACAACGTACGAGTATTAATAATTTGAATAAATTATGTAAAGCCATTTTGTAAGCATAGCCAATACATATTTCAATATATTGACATTTTCGTCATCAATCATTATAGTTTGCTGTAAAAATAATATTATTTAAATTCGACAATTTGTATTGACAAGATAATCAAAACACCAAATATTTTAGTAAAATTTAATAAATCAAGTTATTATAATATGCCAATTTAGTAACATTTTAAAAAGAATAATACCAAAAGGTCATCGCGTACTCCTCGAATCAGCCACTACTTAGTACTTTCTACGTAGTACTTACAAGCAAAAACACTTATAAAGTTATAAATTATATATCTATTTAAAGAAAAGCCAATTCCAAATTGTTTGTTTTTTAAACCCAGCAACATCATAATAAAAACCTACATTTTAAATATAGTTGAGAACAGAGTAGTTCTGACTAAGGAGAAAAAATATAATAATAACAACCAAAAGGTTCTATTTGAAGAACAAAGACGATCTTGGTTCTAAACGGAATTTTTTCGAAACTATTATTTTCTTTTTATTTTAAGTTAATATTTTCTAAGAACAAAACACAAATGTCAAAGTTTAATGCAAAACAACACTTTGATAATTTTTAGATTCAACAAACCGTTCTATATGAAAGAAGATATTACAGTATGAAGTTTAAAATGTTT >GL3567.1 ATGGCTCATCTTAACTTATTTGTTGTTCTGGTCGCGTCGCTTGTGTGCTTCACGTTGGCCGAAGAAAAATACACAACTAAATTCGACAACTTCGACGTGGACAAGGTGTTGAACAACGACAGAATTTTAACCAGTTACATCAAGTGTTTGCTGGACCAGGGAAATTGCACCAATGAAGGCCGAGAATTGAAGAGTAAGTCATATAAGAACAAACGTATATTATATTTTAGTATTTCAATTTTCAAGTATGTTATACAATTAAAATTATAATGACATTGATTTAACTATCGGTACATTTTTAAAAATTGTAATATCTATATACTTCTGGGGTATACTATAATCTATACTCTATAACAACTTCTATATATATATATATAATATATATTATATAAGTATCAAAAAATTACCAATATAAAAAAAAAAACCATAACATATATATTTAATATTGTACCTACTTTATAAAATGCTTTAAGTTCGTCAGCATAATATAATTTTATTTATCTGACAATCGTACTAAATTATCATCTTGAGTACATATCGATATAACAATAATACATACCGTATTATAATGTTATACTTAACCAAGTTACCGTAAACGATAAATAAGAACAACACTAACAATTATAATTTTATTTGTATTCTTAAGATTACGGTTTTCACTTTCACATTAAACATAATATTATATCGACAAGACCTTGCTTACTCAGTCAGACTCACTTTCACCGGGTGTTGTAATTATTTTTCATCTGAGCCGAATTACCGTGTATAGTATTTAGAAAGAAACAAAGTCGATACGTAAAATGTGTGGTTAATCCTTAAATTTTTCCTTGCTATATTAAAATGTTTATTAATAATTGTTTTTAAACTTCTTTTTCACAATAATGTACATGTGTTATAATTATATTTTCTTGTCTGATTAAATTTGCAGAAGTTTTACCAGACGCATTGAAAACTGACTGCAGCAAATGTACAGCTGTACAAAAAGACAGATCAGAAAAAGTAATCAAGTTTTTGATCAAAAACCGTTCTAAAGACTTTGATAATTTGACCGCCAAATACGACCCATCAGGCGAATACAAAAAGAAGATCGAAAAATTCGACGCGGAAAGGGCTGCGGCTGCTAAACATTAA >GL3482.1 ATGGTGCATCTTAACTTATTTGTTGTTGTGGTCGCGTCACTGGTGTGCTTCACATTGGCTCAAGAAAAGTACTCAACTAAATACGAAAACTTTGACGAGGACAAGGTGTTAAACAACGACAGCCTTTTAACCAGCTACATCAATTGTTTGCTCGACGAGGGAAATTGTACCGAGGAAGGCCAAGCGTTGAAAAGTAAATCACAATAACAAATGTATTATATTTTAGTATTTTAACTTTCAAGTATATTATACAATTAAAATCATAATGACATTGATTTAACTATCGGTACATTTTTAAAAAAATGGATATCTAATATCTATACTATACTTTTGGGGTATACTAGCTATAGTCTAAACTCTAAAACAACTACTAGGTACATTATATGTATCAAATAAATATCAACAAACTTCCATTTGGCATAGGTGTAATAATTATATTAATTTCATATCTAATTAAATGTACAGGAATTTTACCAGACGCATTAAAAACTAACTGTGGAAAATGTACGGATGCACAAAAATTGAAAATAGAAAAAATAATGAAATTTTTGATCAAGAACCGTTCTATTGACTTTGATCGTTTGACCGCCAAATACGACCCATCGGGAGAATACAAGAAAAAGCTCGAAAAATTCTCCGCTTAA >GL3550.1 >GL3472.1 >GL3550.1 >GL3465.1 >GL3472.1 >GL3567.1

So the specified region is only extracted for 3 out of 10 queries. I have checked and all headers that appear in file2 are also represented in file1. The sequences are long enough to contain all of the beginning and end points. Any ideas on what's going wrong?
# 2  
Old 09-01-2012
OK. There are at least three problems here.

First: Look at the first entry in you sample file1:
Code:
>GL349621.1TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAA

There is no space between>GL3496211and the start of data for that sequence. So, you won't have an element in the array a forGL349621.1. Your code will instead produce an entry in the a array with subscript
Code:
GL349621.1TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAA

which will not be matched by any entry in file2.

Second: Let's look at file2:
Code:
GL3550.1 61498 52100
GL3465.1 293074 297989
GL3567.1 234944 236074
GL3482.1 323100 323743
GL3550.1 41911 40888
GL3472.1 274408 272617
GL3550.1 75587 73909
GL3465.1 869051 867919
GL3472.1 294812 289561
GL3567.1 238848 236284

Note that the end points marked in red come before the corresponding start points for those ranges. This will result in a call to substr(a[i], start, length)with a value for length that is a negative number. This will result in no output for that substring.

Third: The standards say that awk's print and printf output statements are only required to work when the output produced is no more than {LINE_MAX} bytes long. The value of {LINE_MAX} is allowed to be as small as 2048. The line marked in blue in your sample file2 would require the print statement to output a line that is about 4927 bytes long. (I note that the sample output you provided produced output that is 1148 bytes long for this entry.)

So you have to clean up file1 so that the keys are as you described. Then you need to either clean up file2 so that columns 2 and 3 always specify an end point that comes after the start point OR you need to modify your awk script to fix the arguments passed to substr() to specify the smaller value as the start point and calculate the length to be a positive value.

Then you need to see how much output your awk's print command can produce in a single call and modify your script to split the output into chunks small enough to print (probably using printf instead of print).

Note also that you are reading all of file1 into memory. With some entries in file2 that are at least 869051 bytes long (not including the key on the first line of the entry), you may also be run into memory allocation limits for user processes. You don't seem that have this problem, but others planning to use a solution like this for their own scripts should know to watch for a diagnostic message indicating that they have hit a limit like this.
This User Gave Thanks to Don Cragun For This Post:
# 3  
Old 09-01-2012
Thanks for the input Don Cragun. The missing newline was a problem in the copy-and-paste that I didn't catch. You pointed out what is the obvious issue, which I somehow didn't notice--starts that come after ends. That's what I get for assuming I had a coding issue! Anyway, I have looked more into the data from which this file was generated, and it's clear I need to become more familiar in that area before coming back to this.
 
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Splitting week start date and end date based on custom period start dates

Below are my custom period start and end dates based on a calender, these dates are placed in a file, for each period i need to split into three weeks for each period row, example is given below. Could you please help out to achieve solution through shell script.. File content: ... (2 Replies)
Discussion started by: nani2019
2 Replies

2. Shell Programming and Scripting

Retrieving sequences corresponding to start and end position

Hi all, I have a fasta file of a reference sequnce, I will like to retrieve sequences corresponding to a list of start and end position in another file >my_ref_seq GCCCTATAAGGGCAGAAGCTTGTCCTTCTTGTGCCAGTTATGACGTTTGTCCTAACTGCACATCTGGTAG... (4 Replies)
Discussion started by: Ibk
4 Replies

3. Shell Programming and Scripting

Extract Big and continuous regions

Hi all, I have a file like this I want to extract only those regions which are big and continous chr1 3280000 3440000 chr1 3440000 3920000 chr1 3600000 3920000 # region coming within the 3440000 3920000. so i don't want it to be printed in output chr1 3920000 4800000 chr1 ... (2 Replies)
Discussion started by: amrutha_sastry
2 Replies

4. Shell Programming and Scripting

How to extract start/end times from log file to CSV file?

Hi, I have a log file (log.txt) that which contains lines of date/time. I need to create a script to extract a CSV file (out.csv) that gets all the sequential times (with only 1 minute difference) together by stating the start time and end time of this period. Sample log file (log.txt) ... (7 Replies)
Discussion started by: Mr.Zizo
7 Replies

5. Shell Programming and Scripting

Generate Codes based on start and End values of numbers in a column

Hello All, Could you please help with this. This is what I have: 506234.222 2 506234.222 2 506234.222 2 506234.222 2 508212.200 2 508212.200 2 333456.111 2 333456.111 2 333456.111 2 333456.111 2 But this is what I want: 506234.222 1 506234.222 2 506234.222 2 506234.222 3 (5 Replies)
Discussion started by: canimba
5 Replies

6. Shell Programming and Scripting

Extract data based on position

The file has record length 200. And i have 100 search strings which are ten digits of character from 1 to 10 characters all of them are unique, they need to searched in a file. Please help me to pull the records based on position (say from 1-10). test data 1FAHP2DW0BG115206RASHEED ... (6 Replies)
Discussion started by: zooby
6 Replies

7. UNIX for Dummies Questions & Answers

To Extract words from File based on Position

Hi Guys, While I was writing one shell script , I just got struck at this point. I need to extract words from a file at some specified position and do some comparison operation and need to replace the extracted word with another word. Eg : I like Orange very much. I need to replace... (19 Replies)
Discussion started by: kuttu123
19 Replies

8. UNIX for Dummies Questions & Answers

Extract a specific number from an XML file based on the start and end tags

Hello People, I have the following contents in an XML file ........... ........... .......... ........... <Details = "Sample Details"> <Name>Bob</Name> <Age>34</Age> <Address>CA</Address> <ContactNumber>1234</ContactNumber> </Details> ........... ............. .............. (4 Replies)
Discussion started by: sushant172
4 Replies

9. UNIX for Dummies Questions & Answers

find if a position is between a given start and end position

Hi, I am a newbie in unix programming so maybe this is a simple question. I would like to know how can I make a script that outputs only the values that are not between any given start and end positions Example file1: 2 30 40 80 82 100 file2: ID1 1 ID2 35 ID3 80 ID4 81 ID6... (9 Replies)
Discussion started by: fadista
9 Replies

10. Shell Programming and Scripting

extract a particular start and end pattern from a line

hi In the foll example the whole text in a single line.... i want to extract text from IPTel to RTCPBase.h. want to use this acrooss the whole file Updated: IPTel\platform\core\include\RTCPBase.h \main\MWS2051_Sablime_Int\1... (7 Replies)
Discussion started by: manish205
7 Replies
Login or Register to Ask a Question