read regex from ID file, print regex and line below from source file | Unix Linux Forums | UNIX for Dummies Questions & Answers

#1  10-08-2012  pathunkathunk

I have a file of protein sequences with headers (my source file). Based on a list of IDs (which are included in some of the headers), I'd like to print out only the specified sequences, with only the ID as header.

In other words, I'd like to search source.txt for the terms in IDs.txt, and print the ID as well as the sequence. Ideally the process would continue even if an ID is not found in the source file. All headers in source.txt are of similar format.

source.txt
Quote:
>m.49518 g.49518 ORF g.49518 m.49518 type:internal len:169 (-) comp100001_c0_seq1:3-509(-)
FHPPVSDSCKRCDMYKNQIKIAPENEKIQLNADHELHLRKAESARNGMNNDVELCKTDPNKVTVIAFDLMKTLSTPSLSVGVAYYKRQLSTYNLGIHNLT TNDAYMYVWNESMASRGPQEIGSCLLHFIKNYVHTEQLIMYSDQCGGQNRNIKMALICNFVVGSNDYLP
>m.54555 g.54555 ORF g.54555 m.54555 type:internal len:137 (+) comp1001102_c0_seq1:3-416(+)
YGDLDDSALDAEGPAGPVYRFSRRKSDTKSDDNSQSNGEGVMMMINGELVKVEQLKREEIINCTCGYTEEDGLMIQCDLCLCWQHGHCNGIEREKDVPEK YICYICSHPYRQRPSRKYIHDQDWIKEGKLVSLTKRK
>m.54557 g.54557 ORF g.54557 m.54557 type:internal len:113 (+) comp1002314_c0_seq1:2-343(+)
SIKARQIYDSRGNPTVEVDLVTENGLFRAAVPSGASTGVHEALELRDNDKSMYHGKSVFKAVDNINSIIAPELLKANIEVTEQAEIDNFLLKLDGTPNKS KLGANAILGVSLA
IDs.txt
Quote:
comp100001_c0_seq1
comp1002314_c0_seq1
desired output:
Quote:
>comp100001_c0_seq1
FHPPVSDSCKRCDMYKNQIKIAPENEKIQLNADHELHLRKAESARNGMNNDVELCKTDPNKVTVIAFDLMKTLSTPSLSVGVAYYKRQLSTYNLGIHNLT TNDAYMYVWNESMASRGPQEIGSCLLHFIKNYVHTEQLIMYSDQCGGQNRNIKMALICNFVVGSNDYLP
>comp1002314_c0_seq1
SIKARQIYDSRGNPTVEVDLVTENGLFRAAVPSGASTGVHEALELRDNDKSMYHGKSVFKAVDNINSIIAPELLKANIEVTEQAEIDNFLLKLDGTPNKS KLGANAILGVSLA
I am able to pull out the sequences based on the ID one-by-one, but this is slow and doesn't give me the header.

Code:
awk '/comp51893_c0_seq1/ { getline; print $0 }' source.txt

I also tried extracting the entire header and the sequence by modifying a script I had for a sequence file with a different header type, but again it works one ID at a time, and it only prints the header.

Code:
awk '{lines[NR] = $0} /comp47911_c0_seq1/ {print lines [NR]; print lines [NR+1]}' source.txt

As is probably clear, I'm still pretty low on the learning curve. Any help would be really appreciated!
#2  10-08-2012  jim mcnamara (Forum Staff)

Code:
awk ' FILENAME=="ID.txt" {arr[$0]++}
        FILENAME=="source.txt"
        {for(i in arr) {if (i ~ $0)
                             {print ">", i; getline; print $0; getline; print $0  }
                          }
        } ' ID.txt source.txt  > newfile

Try that for starters.
#3  10-08-2012  pathunkathunk
Jim, thanks for taking a look.

Using the code you provided, I get the following in the terminal:
Quote:
awk: illegal primary in regular expression >m.54555 g.54555 ORF g.54555 m.54555 type:internal len:137 (+) comp1001102_c0_seq1:3-416(+) at ) comp1001102_c0_seq1:3-416(+)
input record number 3, file source.txt
source line number 3
cat newfile returns:
Quote:
> comp100001_c0_seq1
comp1002314_c0_seq1
>m.49518 g.49518 ORF g.49518 m.49518 type:internal len:169 (-) comp100001_c0_seq1:3-509(-)
FHPPVSDSCKRCDMYKNQIKIAPENEKIQLNADHELHLRKAESARNGMNNDVELCKTDPNKVTVIAFDLMKTLSTPSLSVGVAYYKRQLSTYNLGIHNLT TNDAYMYVWNESMASRGPQEIGSCLLHFIKNYVHTEQLIMYSDQCGGQNRNIKMALICNFVVGSNDYLP
>m.54555 g.54555 ORF g.54555 m.54555 type:internal len:137 (+) comp1001102_c0_seq1:3-416(+)
Just to verify, here are the input files:
Quote:
$ cat source.txt
>m.49518 g.49518 ORF g.49518 m.49518 type:internal len:169 (-) comp100001_c0_seq1:3-509(-)
FHPPVSDSCKRCDMYKNQIKIAPENEKIQLNADHELHLRKAESARNGMNNDVELCKTDPNKVTVIAFDLMKTLSTPSLSVGVAYYKRQLSTYNLGIHNLT TNDAYMYVWNESMASRGPQEIGSCLLHFIKNYVHTEQLIMYSDQCGGQNRNIKMALICNFVVGSNDYLP
>m.54555 g.54555 ORF g.54555 m.54555 type:internal len:137 (+) comp1001102_c0_seq1:3-416(+)
YGDLDDSALDAEGPAGPVYRFSRRKSDTKSDDNSQSNGEGVMMMINGELVKVEQLKREEIINCTCGYTEEDGLMIQCDLCLCWQHGHCNGIEREKDVPEK YICYICSHPYRQRPSRKYIHDQDWIKEGKLVSLTKRK
>m.54557 g.54557 ORF g.54557 m.54557 type:internal len:113 (+) comp1002314_c0_seq1:2-343(+)
SIKARQIYDSRGNPTVEVDLVTENGLFRAAVPSGASTGVHEALELRDNDKSMYHGKSVFKAVDNINSIIAPELLKANIEVTEQAEIDNFLLKLDGTPNKS KLGANAILGVSLA
$ cat ID.txt
comp100001_c0_seq1
comp1002314_c0_seq1
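
A note on the output above, as a hedged reading of what awk appears to be doing with the code from post #2. Two things seem to be in play. First, `FILENAME=="source.txt"` sits alone on its line, so awk treats it as a pattern with the default action (print the record), and the `{ ... }` block on the following line has no pattern and runs for every record of both files. That would explain why an ID line and unmodified header and sequence lines show up in newfile. Second, in `i ~ $0` the header line is on the right-hand side of `~`, so it is compiled as a regular expression, and this awk cannot parse the `(+)` in the header, hence "illegal primary". The sketch below reproduces both effects on made-up input; the function names and sample strings are illustrative only.

```shell
# 1) A pattern alone on a line gets the default action (print the record).
#    The { ... } block on the next line then has no pattern and runs for every record.
demo_newline() {
    printf 'a\nb\n' | awk 'NR == 1
    { print "block:", $0 }'
}

# 2) The right-hand side of ~ is the regular expression; the left-hand side is the text.
demo_match() {
    awk 'BEGIN {
        hdr = ">m.1 (+) comp100001_c0_seq1:3-509(-)"   # made-up header in the thread format
        id  = "comp100001_c0_seq1"
        print ((hdr ~ id) ? "found" : "not found")     # text on the left, pattern on the right
    }'
}
```

Calling `demo_newline` prints `a`, `block: a` and `block: b`, showing that the block is not tied to the pattern above it. Calling `demo_match` prints `found`.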
#4  10-08-2012  scottaazz
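Jim's approach is correct (and a cool way of performing the task), but you need to swap the operands of the match operator in the if statement: the line being searched ($0) goes on the left of ~ and the ID used as the pattern goes on the right.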


Code:
awk 'FILENAME=="ID.txt" {arr[$0]++}
FILENAME=="source.txt" { for(i in arr) {if ($0 ~ i) {print ">", i; getline; print $0; getline; print $0  }   } }' ID.txt source.txt
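
For anyone reusing the command above, a few caveats seem worth noting: `print ">", i` puts a space after `>` (the comma inserts the output field separator), the second `getline; print $0` also prints the line after the sequence (normally the next header), and a regex match means `comp100001_c0_seq1` would also match an ID such as `comp100001_c0_seq10`. The following is a minimal sketch of an alternative, not the method from this thread. It assumes every header ends with a field of the form `ID:start-end(strand)`, as in the sample source.txt, and the function name `extract_by_id` is made up for illustration.

```shell
# extract_by_id IDS_FILE SOURCE_FILE
# Prints ">ID" followed by the sequence line(s) for each record whose ID is listed
# in IDS_FILE (one ID per line). IDs that never occur in SOURCE_FILE are skipped.
# Assumes IDS_FILE is not empty (NR == FNR would otherwise also match SOURCE_FILE).
extract_by_id() {
    awk '
        NR == FNR { if ($0 != "") ids[$0] = 1; next }  # first file: remember each ID
        /^>/ {                                         # header line in the source file
            keep = 0
            id = $NF                                   # last field, e.g. comp100001_c0_seq1:3-509(-)
            sub(/:.*/, "", id)                         # drop ":start-end(strand)"
            if (id in ids) { print ">" id; keep = 1 }  # exact lookup, no regex involved
            next
        }
        keep                                           # sequence line(s) of a wanted record
    ' "$1" "$2"
}
```

Usage would look like `extract_by_id IDs.txt source.txt > newfile`. Records come out in the order they appear in the source file, and because the `keep` flag stays set until the next header, sequences that span several lines are printed in full.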
