extract regions of file based on start and end position

09-01-2012

Registered User

57, 0

Join Date: Mar 2012

Last Activity: 4 June 2019, 4:18 AM EDT

Posts: 57

Thanks Given: 23

Thanked 0 Times in 0 Posts

extract regions of file based on start and end position

Hi, I have a file1 of many long sequences, each preceded by a unique header line. file2 is 3-columns list: headers name, start position, end position. I'd like to extract the sequence region of file1 specified in file2.

Based on a post elsewhere, I found the code:

Code:

awk 'NR==FNR{if($0~/^>/){i=substr($0,2);getline};a[i]=a[i] $0;next}{print ">" $1 ORS substr(a[$1], $2, $3-$2+1)}' file1 FS=\\t file2

But with the files I have, regions are extracted from only a subset of the specified sequences.

file1 (my real file is much longer, >47000 lines, and each sequence is much longer too):

Code:

>GL349621.1TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAA
>GL3422.1
TTACCGGCACTTCATTCACTTTTAAATACCCGTGTTTCGATTTTGTTGTTTTAGCCCCGCGTAAAGCAAA
>GL3423.1
104-151:aphid_project aman$ head -10 Ap_assembly2_scaffolds_corrected.fasta 
>GL3421.1
TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAA
>GL3422.1
TTACCGGCACTTCATTCACTTTTAAATACCCGTGTTTCGATTTTGTTGTTTTAGCCCCGCGTAAAGCAAA[/QUOTE]file2: 

Quote:

    
    GL3550.1    61498    52100
GL3465.1    293074    297989
GL3567.1    234944    236074
GL3482.1    323100    323743
GL3550.1    41911    40888
GL3472.1    274408    272617
GL3550.1    75587    73909
GL3465.1    869051    867919
GL3472.1    294812    289561
GL3567.1    238848    236284
    


Result after using awk command above:
[QUOTE]>GL350250.1

>GL3465.1
ATGGTCTCAAAGCCGTTTGTTTTCGTGTTTGTGCTCATGTCCGTGGTCGGTGTGTCGTTCTCCGTGACCGAAGAAAACGATGGTACAAAGGTCGTGGACAAGGAAGAGGACCATCATCAGAACATCCAAGAGGAACTCAAGAAATTTTTGTCGACGTTAGAGAAAATCGACATAGACCAGATATTGAACAACCACAGGTTGATGTCGAACAACGTCAAATGTTTTCTGAACGAAGGACCGTGTACGGCACAACTGAGGGAAATGAAAAGTAAGAGTGTGTGCGACACGTAATATTACTGATTGTACGATGACTTCGCTAAATACTATTTGGTTGAAATATATTATTATGCCTAAATTACTGTAAATCATGTATTTAATAATTTACTAGTACCTATAATATATTATTTTGATAAATTTATATTTCATGTTTAGTAGTTATTTAGTACAACGTACGAGTATTAATAATTTGAATAAATTATGTAAAGCCATTTTGTAAGCATAGCCAATACATATTTCAATATATTGACATTTTCGTCATCAATCATTATAGTTTGCTGTAAAAATAATATTATTTAAATTCGACAATTTGTATTGACAAGATAATCAAAACACCAAATATTTTAGTAAAATTTAATAAATCAAGTTATTATAATATGCCAATTTAGTAACATTTTAAAAAGAATAATACCAAAAGGTCATCGCGTACTCCTCGAATCAGCCACTACTTAGTACTTTCTACGTAGTACTTACAAGCAAAAACACTTATAAAGTTATAAATTATATATCTATTTAAAGAAAAGCCAATTCCAAATTGTTTGTTTTTTAAACCCAGCAACATCATAATAAAAACCTACATTTTAAATATAGTTGAGAACAGAGTAGTTCTGACTAAGGAGAAAAAATATAATAATAACAACCAAAAGGTTCTATTTGAAGAACAAAGACGATCTTGGTTCTAAACGGAATTTTTTCGAAACTATTATTTTCTTTTTATTTTAAGTTAATATTTTCTAAGAACAAAACACAAATGTCAAAGTTTAATGCAAAACAACACTTTGATAATTTTTAGATTCAACAAACCGTTCTATATGAAAGAAGATATTACAGTATGAAGTTTAAAATGTTT
>GL3567.1
ATGGCTCATCTTAACTTATTTGTTGTTCTGGTCGCGTCGCTTGTGTGCTTCACGTTGGCCGAAGAAAAATACACAACTAAATTCGACAACTTCGACGTGGACAAGGTGTTGAACAACGACAGAATTTTAACCAGTTACATCAAGTGTTTGCTGGACCAGGGAAATTGCACCAATGAAGGCCGAGAATTGAAGAGTAAGTCATATAAGAACAAACGTATATTATATTTTAGTATTTCAATTTTCAAGTATGTTATACAATTAAAATTATAATGACATTGATTTAACTATCGGTACATTTTTAAAAATTGTAATATCTATATACTTCTGGGGTATACTATAATCTATACTCTATAACAACTTCTATATATATATATATAATATATATTATATAAGTATCAAAAAATTACCAATATAAAAAAAAAAACCATAACATATATATTTAATATTGTACCTACTTTATAAAATGCTTTAAGTTCGTCAGCATAATATAATTTTATTTATCTGACAATCGTACTAAATTATCATCTTGAGTACATATCGATATAACAATAATACATACCGTATTATAATGTTATACTTAACCAAGTTACCGTAAACGATAAATAAGAACAACACTAACAATTATAATTTTATTTGTATTCTTAAGATTACGGTTTTCACTTTCACATTAAACATAATATTATATCGACAAGACCTTGCTTACTCAGTCAGACTCACTTTCACCGGGTGTTGTAATTATTTTTCATCTGAGCCGAATTACCGTGTATAGTATTTAGAAAGAAACAAAGTCGATACGTAAAATGTGTGGTTAATCCTTAAATTTTTCCTTGCTATATTAAAATGTTTATTAATAATTGTTTTTAAACTTCTTTTTCACAATAATGTACATGTGTTATAATTATATTTTCTTGTCTGATTAAATTTGCAGAAGTTTTACCAGACGCATTGAAAACTGACTGCAGCAAATGTACAGCTGTACAAAAAGACAGATCAGAAAAAGTAATCAAGTTTTTGATCAAAAACCGTTCTAAAGACTTTGATAATTTGACCGCCAAATACGACCCATCAGGCGAATACAAAAAGAAGATCGAAAAATTCGACGCGGAAAGGGCTGCGGCTGCTAAACATTAA
>GL3482.1
ATGGTGCATCTTAACTTATTTGTTGTTGTGGTCGCGTCACTGGTGTGCTTCACATTGGCTCAAGAAAAGTACTCAACTAAATACGAAAACTTTGACGAGGACAAGGTGTTAAACAACGACAGCCTTTTAACCAGCTACATCAATTGTTTGCTCGACGAGGGAAATTGTACCGAGGAAGGCCAAGCGTTGAAAAGTAAATCACAATAACAAATGTATTATATTTTAGTATTTTAACTTTCAAGTATATTATACAATTAAAATCATAATGACATTGATTTAACTATCGGTACATTTTTAAAAAAATGGATATCTAATATCTATACTATACTTTTGGGGTATACTAGCTATAGTCTAAACTCTAAAACAACTACTAGGTACATTATATGTATCAAATAAATATCAACAAACTTCCATTTGGCATAGGTGTAATAATTATATTAATTTCATATCTAATTAAATGTACAGGAATTTTACCAGACGCATTAAAAACTAACTGTGGAAAATGTACGGATGCACAAAAATTGAAAATAGAAAAAATAATGAAATTTTTGATCAAGAACCGTTCTATTGACTTTGATCGTTTGACCGCCAAATACGACCCATCGGGAGAATACAAGAAAAAGCTCGAAAAATTCTCCGCTTAA
>GL3550.1

>GL3472.1

>GL3550.1

>GL3465.1

>GL3472.1

>GL3567.1

So the specified region is only extracted for 3 out of 10 queries. I have checked and all headers that appear in file2 are also represented in file1. The sequences are long enough to contain all of the beginning and end points. Any ideas on what's going wrong?

pathunkathunk

View Public Profile for pathunkathunk

Find all posts by pathunkathunk

09-01-2012

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

OK. There are at least three problems here.

First: Look at the first entry in you sample file1:

Code:

>GL349621.1TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAA

There is no space between>GL3496211and the start of data for that sequence. So, you won't have an element in the array a forGL349621.1. Your code will instead produce an entry in the a array with subscript

Code:

GL349621.1TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAA

which will not be matched by any entry in file2.

Second: Let's look at file2:

Code:

GL3550.1 61498 52100
GL3465.1 293074 297989
GL3567.1 234944 236074
GL3482.1 323100 323743
GL3550.1 41911 40888
GL3472.1 274408 272617
GL3550.1 75587 73909
GL3465.1 869051 867919
GL3472.1 294812 289561
GL3567.1 238848 236284

Note that the end points marked in red come before the corresponding start points for those ranges. This will result in a call to substr(a[i], start, length)with a value for length that is a negative number. This will result in no output for that substring.

Third: The standards say that awk's print and printf output statements are only required to work when the output produced is no more than {LINE_MAX} bytes long. The value of {LINE_MAX} is allowed to be as small as 2048. The line marked in blue in your sample file2 would require the print statement to output a line that is about 4927 bytes long. (I note that the sample output you provided produced output that is 1148 bytes long for this entry.)

So you have to clean up file1 so that the keys are as you described. Then you need to either clean up file2 so that columns 2 and 3 always specify an end point that comes after the start point OR you need to modify your awk script to fix the arguments passed to substr() to specify the smaller value as the start point and calculate the length to be a positive value.

Then you need to see how much output your awk's print command can produce in a single call and modify your script to split the output into chunks small enough to print (probably using printf instead of print).

Note also that you are reading all of file1 into memory. With some entries in file2 that are at least 869051 bytes long (not including the key on the first line of the entry), you may also be run into memory allocation limits for user processes. You don't seem that have this problem, but others planning to use a solution like this for their own scripts should know to watch for a diagnostic message indicating that they have hit a limit like this.

This User Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

09-01-2012

Registered User

57, 0

Join Date: Mar 2012

Last Activity: 4 June 2019, 4:18 AM EDT

Posts: 57

Thanks Given: 23

Thanked 0 Times in 0 Posts

Thanks for the input Don Cragun. The missing newline was a problem in the copy-and-paste that I didn't catch. You pointed out what is the obvious issue, which I somehow didn't notice--starts that come after ends. That's what I get for assuming I had a coding issue! Anyway, I have looked more into the data from which this file was generated, and it's clear I need to become more familiar in that area before coming back to this.

pathunkathunk

View Public Profile for pathunkathunk

Find all posts by pathunkathunk

UNIX for Dummies Questions & Answers

extract regions of file based on start and end position

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Splitting week start date and end date based on custom period start dates

Discussion started by: nani2019

2. Shell Programming and Scripting

Retrieving sequences corresponding to start and end position

Discussion started by: Ibk

3. Shell Programming and Scripting

Extract Big and continuous regions

Discussion started by: amrutha_sastry

4. Shell Programming and Scripting

How to extract start/end times from log file to CSV file?

Discussion started by: Mr.Zizo

5. Shell Programming and Scripting

Generate Codes based on start and End values of numbers in a column

Discussion started by: canimba

6. Shell Programming and Scripting

Extract data based on position

Discussion started by: zooby

7. UNIX for Dummies Questions & Answers

To Extract words from File based on Position

Discussion started by: kuttu123

8. UNIX for Dummies Questions & Answers

Extract a specific number from an XML file based on the start and end tags

Discussion started by: sushant172

9. UNIX for Dummies Questions & Answers

find if a position is between a given start and end position

Discussion started by: fadista

10. Shell Programming and Scripting

extract a particular start and end pattern from a line

Discussion started by: manish205