"extract regions of file based on start and end position"

Post #302694999 by pathunkathunk on Saturday 1st of September 2012 12:45:47 AM

extract regions of file based on start and end position

Hi, I have a file1 of many long sequences, each preceded by a unique header line. file2 is 3-columns list: headers name, start position, end position. I'd like to extract the sequence region of file1 specified in file2.

Based on a post elsewhere, I found the code:
Code:
awk 'NR==FNR{if($0~/^>/){i=substr($0,2);getline};a[i]=a[i] $0;next}{print ">" $1 ORS substr(a[$1], $2, $3-$2+1)}' file1 FS=\\t file2

But with the files I have, regions are extracted from only a subset of the specified sequences.

file1 (my real file is much longer, >47000 lines, and each sequence is much longer too):
Code:
>GL349621.1TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAA
>GL3422.1
TTACCGGCACTTCATTCACTTTTAAATACCCGTGTTTCGATTTTGTTGTTTTAGCCCCGCGTAAAGCAAA
>GL3423.1
104-151:aphid_project aman$ head -10 Ap_assembly2_scaffolds_corrected.fasta 
>GL3421.1
TTTACAATTGCTATTGTAACAATATATCAGGAGCCTTGTATTAAATTTTCACGCATTTTTACCAAACAAA
>GL3422.1
TTACCGGCACTTCATTCACTTTTAAATACCCGTGTTTCGATTTTGTTGTTTTAGCCCCGCGTAAAGCAAA[/QUOTE]file2: 
Quote:
GL3550.1 61498 52100 GL3465.1 293074 297989 GL3567.1 234944 236074 GL3482.1 323100 323743 GL3550.1 41911 40888 GL3472.1 274408 272617 GL3550.1 75587 73909 GL3465.1 869051 867919 GL3472.1 294812 289561 GL3567.1 238848 236284
Result after using awk command above: [QUOTE]>GL350250.1 >GL3465.1 ATGGTCTCAAAGCCGTTTGTTTTCGTGTTTGTGCTCATGTCCGTGGTCGGTGTGTCGTTCTCCGTGACCGAAGAAAACGATGGTACAAAGGTCGTGGACAAGGAAGAGGACCATCATCAGAACATCCAAGAGGAACTCAAGAAATTTTTGTCGACGTTAGAGAAAATCGACATAGACCAGATATTGAACAACCACAGGTTGATGTCGAACAACGTCAAATGTTTTCTGAACGAAGGACCGTGTACGGCACAACTGAGGGAAATGAAAAGTAAGAGTGTGTGCGACACGTAATATTACTGATTGTACGATGACTTCGCTAAATACTATTTGGTTGAAATATATTATTATGCCTAAATTACTGTAAATCATGTATTTAATAATTTACTAGTACCTATAATATATTATTTTGATAAATTTATATTTCATGTTTAGTAGTTATTTAGTACAACGTACGAGTATTAATAATTTGAATAAATTATGTAAAGCCATTTTGTAAGCATAGCCAATACATATTTCAATATATTGACATTTTCGTCATCAATCATTATAGTTTGCTGTAAAAATAATATTATTTAAATTCGACAATTTGTATTGACAAGATAATCAAAACACCAAATATTTTAGTAAAATTTAATAAATCAAGTTATTATAATATGCCAATTTAGTAACATTTTAAAAAGAATAATACCAAAAGGTCATCGCGTACTCCTCGAATCAGCCACTACTTAGTACTTTCTACGTAGTACTTACAAGCAAAAACACTTATAAAGTTATAAATTATATATCTATTTAAAGAAAAGCCAATTCCAAATTGTTTGTTTTTTAAACCCAGCAACATCATAATAAAAACCTACATTTTAAATATAGTTGAGAACAGAGTAGTTCTGACTAAGGAGAAAAAATATAATAATAACAACCAAAAGGTTCTATTTGAAGAACAAAGACGATCTTGGTTCTAAACGGAATTTTTTCGAAACTATTATTTTCTTTTTATTTTAAGTTAATATTTTCTAAGAACAAAACACAAATGTCAAAGTTTAATGCAAAACAACACTTTGATAATTTTTAGATTCAACAAACCGTTCTATATGAAAGAAGATATTACAGTATGAAGTTTAAAATGTTT >GL3567.1 ATGGCTCATCTTAACTTATTTGTTGTTCTGGTCGCGTCGCTTGTGTGCTTCACGTTGGCCGAAGAAAAATACACAACTAAATTCGACAACTTCGACGTGGACAAGGTGTTGAACAACGACAGAATTTTAACCAGTTACATCAAGTGTTTGCTGGACCAGGGAAATTGCACCAATGAAGGCCGAGAATTGAAGAGTAAGTCATATAAGAACAAACGTATATTATATTTTAGTATTTCAATTTTCAAGTATGTTATACAATTAAAATTATAATGACATTGATTTAACTATCGGTACATTTTTAAAAATTGTAATATCTATATACTTCTGGGGTATACTATAATCTATACTCTATAACAACTTCTATATATATATATATAATATATATTATATAAGTATCAAAAAATTACCAATATAAAAAAAAAAACCATAACATATATATTTAATATTGTACCTACTTTATAAAATGCTTTAAGTTCGTCAGCATAATATAATTTTATTTATCTGACAATCGTACTAAATTATCATCTTGAGTACATATCGATATAACAATAATACATACCGTATTATAATGTTATACTTAACCAAGTTACCGTAAACGATAAATAAGAACAACACTAACAATTATAATTTTATTTGTATTCTTAAGATTACGGTTTTCACTTTCACATTAAACATAATATTATATCGACAAGACCTTGCTTACTCAGTCAGACTCACTTTCACCGGGTGTTGTAATTATTTTTCATCTGAGCCGAATTACCGTGTATAGTATTTAGAAAGAAACAAAGTCGATACGTAAAATGTGTGGTTAATCCTTAAATTTTTCCTTGCTATATTAAAATGTTTATTAATAATTGTTTTTAAACTTCTTTTTCACAATAATGTACATGTGTTATAATTATATTTTCTTGTCTGATTAAATTTGCAGAAGTTTTACCAGACGCATTGAAAACTGACTGCAGCAAATGTACAGCTGTACAAAAAGACAGATCAGAAAAAGTAATCAAGTTTTTGATCAAAAACCGTTCTAAAGACTTTGATAATTTGACCGCCAAATACGACCCATCAGGCGAATACAAAAAGAAGATCGAAAAATTCGACGCGGAAAGGGCTGCGGCTGCTAAACATTAA >GL3482.1 ATGGTGCATCTTAACTTATTTGTTGTTGTGGTCGCGTCACTGGTGTGCTTCACATTGGCTCAAGAAAAGTACTCAACTAAATACGAAAACTTTGACGAGGACAAGGTGTTAAACAACGACAGCCTTTTAACCAGCTACATCAATTGTTTGCTCGACGAGGGAAATTGTACCGAGGAAGGCCAAGCGTTGAAAAGTAAATCACAATAACAAATGTATTATATTTTAGTATTTTAACTTTCAAGTATATTATACAATTAAAATCATAATGACATTGATTTAACTATCGGTACATTTTTAAAAAAATGGATATCTAATATCTATACTATACTTTTGGGGTATACTAGCTATAGTCTAAACTCTAAAACAACTACTAGGTACATTATATGTATCAAATAAATATCAACAAACTTCCATTTGGCATAGGTGTAATAATTATATTAATTTCATATCTAATTAAATGTACAGGAATTTTACCAGACGCATTAAAAACTAACTGTGGAAAATGTACGGATGCACAAAAATTGAAAATAGAAAAAATAATGAAATTTTTGATCAAGAACCGTTCTATTGACTTTGATCGTTTGACCGCCAAATACGACCCATCGGGAGAATACAAGAAAAAGCTCGAAAAATTCTCCGCTTAA >GL3550.1 >GL3472.1 >GL3550.1 >GL3465.1 >GL3472.1 >GL3567.1

So the specified region is only extracted for 3 out of 10 queries. I have checked and all headers that appear in file2 are also represented in file1. The sequences are long enough to contain all of the beginning and end points. Any ideas on what's going wrong?
 
Test Your Knowledge in Computers #329
Difficulty: Easy
The NVidia named the GTX 1080 because it can only render at a 1920x1080 screen resolution.
True or False?

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

extract a particular start and end pattern from a line

hi In the foll example the whole text in a single line.... i want to extract text from IPTel to RTCPBase.h. want to use this acrooss the whole file Updated: IPTel\platform\core\include\RTCPBase.h \main\MWS2051_Sablime_Int\1... (7 Replies)
Discussion started by: manish205
7 Replies

2. UNIX for Dummies Questions & Answers

find if a position is between a given start and end position

Hi, I am a newbie in unix programming so maybe this is a simple question. I would like to know how can I make a script that outputs only the values that are not between any given start and end positions Example file1: 2 30 40 80 82 100 file2: ID1 1 ID2 35 ID3 80 ID4 81 ID6... (9 Replies)
Discussion started by: fadista
9 Replies

3. Shell Programming and Scripting

sed: Find start of pattern and extract text to end of line, including the pattern

This is my first post, please be nice. I have tried to google and read different tutorials. The task at hand is: Input file input.txt (example) abc123defhij-E-1234jslo 456ujs-W-abXjklp From this file the task is to grep the -E- and -W- strings that are unique and write a new file... (5 Replies)
Discussion started by: TestTomas
5 Replies

4. UNIX for Dummies Questions & Answers

Extract a specific number from an XML file based on the start and end tags

Hello People, I have the following contents in an XML file ........... ........... .......... ........... <Details = "Sample Details"> <Name>Bob</Name> <Age>34</Age> <Address>CA</Address> <ContactNumber>1234</ContactNumber> </Details> ........... ............. .............. (4 Replies)
Discussion started by: sushant172
4 Replies

5. UNIX for Dummies Questions & Answers

To Extract words from File based on Position

Hi Guys, While I was writing one shell script , I just got struck at this point. I need to extract words from a file at some specified position and do some comparison operation and need to replace the extracted word with another word. Eg : I like Orange very much. I need to replace... (19 Replies)
Discussion started by: kuttu123
19 Replies

6. Shell Programming and Scripting

Extract data based on position

The file has record length 200. And i have 100 search strings which are ten digits of character from 1 to 10 characters all of them are unique, they need to searched in a file. Please help me to pull the records based on position (say from 1-10). test data 1FAHP2DW0BG115206RASHEED ... (6 Replies)
Discussion started by: zooby
6 Replies

7. Shell Programming and Scripting

Generate Codes based on start and End values of numbers in a column

Hello All, Could you please help with this. This is what I have: 506234.222 2 506234.222 2 506234.222 2 506234.222 2 508212.200 2 508212.200 2 333456.111 2 333456.111 2 333456.111 2 333456.111 2 But this is what I want: 506234.222 1 506234.222 2 506234.222 2 506234.222 3 (5 Replies)
Discussion started by: canimba
5 Replies

8. Shell Programming and Scripting

How to extract start/end times from log file to CSV file?

Hi, I have a log file (log.txt) that which contains lines of date/time. I need to create a script to extract a CSV file (out.csv) that gets all the sequential times (with only 1 minute difference) together by stating the start time and end time of this period. Sample log file (log.txt) ... (7 Replies)
Discussion started by: Mr.Zizo
7 Replies

9. Shell Programming and Scripting

Extract Big and continuous regions

Hi all, I have a file like this I want to extract only those regions which are big and continous chr1 3280000 3440000 chr1 3440000 3920000 chr1 3600000 3920000 # region coming within the 3440000 3920000. so i don't want it to be printed in output chr1 3920000 4800000 chr1 ... (2 Replies)
Discussion started by: amrutha_sastry
2 Replies

10. Shell Programming and Scripting

Retrieving sequences corresponding to start and end position

Hi all, I have a fasta file of a reference sequnce, I will like to retrieve sequences corresponding to a list of start and end position in another file >my_ref_seq GCCCTATAAGGGCAGAAGCTTGTCCTTCTTGTGCCAGTTATGACGTTTGTCCTAACTGCACATCTGGTAG... (4 Replies)
Discussion started by: Ibk
4 Replies

Featured Tech Videos

All times are GMT -4. The time now is 02:39 AM.
Unix & Linux Forums Content Copyright 1993-2019. All Rights Reserved.
Privacy Policy