extract regions of file based on start and end position
Hi, I have a file, file1, containing many long sequences, each preceded by a unique header line. file2 is a three-column list: header name, start position, end position. I'd like to extract from file1 the sequence regions specified in file2.
Based on a post elsewhere, I found the code:
But with the files I have, regions are extracted from only a subset of the specified sequences.
file1 (my real file is much longer, >47000 lines, and each sequence is much longer too):
So the specified region is only extracted for 3 out of 10 queries. I have checked and all headers that appear in file2 are also represented in file1. The sequences are long enough to contain all of the beginning and end points. Any ideas on what's going wrong?
First: Look at the first entry in your sample file1:
There is no space between >GL349621.1 and the start of data for that sequence. So, you won't have an element in the array a for GL349621.1. Your code will instead produce an entry in the a array with subscript
which will not be matched by any entry in file2.
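A quick way to check for this kind of mismatch is to list every header named in file2 that never shows up as a key in file1. This is only a sketch: the function name missing_headers is made up for illustration, and it assumes the key is the first word of each ">" line with the ">" stripped.

```shell
# Hypothetical helper: missing_headers FASTA RANGES
# Prints each header from column 1 of RANGES that has no ">header" line in FASTA.
missing_headers() {
    awk '
        # First file: remember the first word of every header line, minus ">".
        NR == FNR { if (/^>/) seen[substr($1, 2)] = 1; next }
        # Second file: report keys that were never seen.
        !($1 in seen) { print $1 }
    ' "$1" "$2"
}
```

If a header line has sequence data run onto it, the stored key is the whole run-together word, so the real header gets reported as missing.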
Second: Let's look at file2:
Note that the end points marked in red come before the corresponding start points for those ranges. This will result in a call to substr(a[i], start, length) with a value for length that is a negative number. This will result in no output for that substring.
Third: The standards say that awk's print and printf output statements are only required to work when the output produced is no more than {LINE_MAX} bytes long. The value of {LINE_MAX} is allowed to be as small as 2048. The line marked in blue in your sample file2 would require the print statement to output a line that is about 4927 bytes long. (I note that the sample output you provided produced output that is 1148 bytes long for this entry.)
So you have to clean up file1 so that the keys are as you described. Then you need to either clean up file2 so that columns 2 and 3 always specify an end point that comes after the start point OR you need to modify your awk script to fix the arguments passed to substr() to specify the smaller value as the start point and calculate the length to be a positive value.
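The second option (fixing the substr() arguments in the script) might look something like the sketch below. It is not the original script; the function name extract_regions, the array name seq and the ">header:start-end" output format are assumptions made for illustration. It also assumes positions are 1-based and inclusive.

```shell
# Hypothetical helper: extract_regions FASTA RANGES
# FASTA:  ">header" lines, each followed by one or more sequence lines.
# RANGES: three whitespace-separated columns: header, start, end.
extract_regions() {
    awk '
        NR == FNR {                         # first file: load the sequences
            if (/^>/) { key = substr($1, 2) }
            else      { seq[key] = seq[key] $0 }
            next
        }
        {                                   # second file: one range per line
            s = $2 + 0; e = $3 + 0
            if (s > e) { t = s; s = e; e = t }   # swap reversed end points
            if (!($1 in seq)) {
                print "no sequence for " $1 > "/dev/stderr"
                next
            }
            print ">" $1 ":" s "-" e
            print substr(seq[$1], s, e - s + 1)  # length is now never negative
        }
    ' "$1" "$2"
}
```

Called as extract_regions file1 file2. Whether silently swapping the end points is the right thing to do depends on what a reversed range means in your data, so check that before relying on it.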
Then you need to see how much output your awk's print command can produce in a single call and modify your script to split the output into chunks small enough to print (probably using printf instead of print).
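One way to split the output is to loop over the string with substr() and printf a fixed-width piece at a time. A minimal sketch, where the function name wrap_lines and the width argument are assumptions (pick a width comfortably below your system's LINE_MAX):

```shell
# Hypothetical helper: wrap_lines WIDTH
# Reads lines on standard input and writes each one back out in pieces of
# at most WIDTH characters, one piece per output line.
wrap_lines() {
    awk -v width="$1" '
        {
            # Each printf call emits at most width characters plus a newline.
            for (i = 1; i <= length($0); i += width)
                printf "%s\n", substr($0, i, width)
        }
    '
}
```

For example, printf 'ABCDEFGHIJ\n' | wrap_lines 4 writes ABCD, EFGH and IJ on separate lines. Note that this adds line breaks inside each extracted region; if you need the region on one line, the pieces have to be joined again by whatever reads the output.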
Note also that you are reading all of file1 into memory. With some entries in file1 that are at least 869051 bytes long (not including the key on the first line of the entry), you may also run into memory allocation limits for user processes. You don't seem to have this problem, but others planning to use a solution like this for their own scripts should know to watch for a diagnostic message indicating that they have hit a limit like this.
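If memory does become a problem, one alternative is to turn the script around: read the (small) list of ranges into memory and stream the sequence file, holding only one sequence at a time. This is a sketch of that idea, not a drop-in replacement. The function name extract_streaming and the output format are assumptions, it expects start to be no larger than end, and the output comes out in file1 order rather than file2 order.

```shell
# Hypothetical helper: extract_streaming RANGES FASTA
# Note the argument order: the ranges file is read first.
extract_streaming() {
    awk '
        # Print every requested range for the sequence currently held in cur.
        function emit(   k) {
            if (!(key in n)) return
            for (k = 1; k <= n[key]; k++) {
                print ">" key ":" s[key, k] "-" e[key, k]
                print substr(cur, s[key, k], e[key, k] - s[key, k] + 1)
            }
        }
        # First file: remember all ranges, grouped by header.
        NR == FNR { n[$1]++; s[$1, n[$1]] = $2; e[$1, n[$1]] = $3; next }
        # Second file: a new header finishes the previous sequence.
        /^>/ { emit(); key = substr($1, 2); cur = ""; next }
        # Only accumulate data for sequences that are actually wanted.
        { if (key in n) cur = cur $0 }
        END { emit() }
    ' "$1" "$2"
}
```

Called as extract_streaming file2 file1. Sequences that no range asks for are skipped without being stored.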
Thanks for the input Don Cragun. The missing newline was a problem in the copy-and-paste that I didn't catch. You pointed out what is the obvious issue, which I somehow didn't notice--starts that come after ends. That's what I get for assuming I had a coding issue! Anyway, I have looked more into the data from which this file was generated, and it's clear I need to become more familiar in that area before coming back to this.