read regex from ID file, print regex and line below from source file

10-08-2012

I have a file of protein sequences with headers (my source file). Based on a list of IDs (which are included in some of the headers), I'd like to print out only the specified sequences, with only the ID as header.

In other words, I'd like to search source.txt for the terms in ID.txt, and print the ID as well as the sequence. Ideally the process would continue even if an ID is not found in the source file. All headers in source.txt have a similar format.

source.txt

Quote:

>m.49518 g.49518 ORF g.49518 m.49518 type:internal len:169 (-) comp100001_c0_seq1:3-509(-)
FHPPVSDSCKRCDMYKNQIKIAPENEKIQLNADHELHLRKAESARNGMNNDVELCKTDPNKVTVIAFDLMKTLSTPSLSVGVAYYKRQLSTYNLGIHNLT TNDAYMYVWNESMASRGPQEIGSCLLHFIKNYVHTEQLIMYSDQCGGQNRNIKMALICNFVVGSNDYLP
>m.54555 g.54555 ORF g.54555 m.54555 type:internal len:137 (+) comp1001102_c0_seq1:3-416(+)
YGDLDDSALDAEGPAGPVYRFSRRKSDTKSDDNSQSNGEGVMMMINGELVKVEQLKREEIINCTCGYTEEDGLMIQCDLCLCWQHGHCNGIEREKDVPEK YICYICSHPYRQRPSRKYIHDQDWIKEGKLVSLTKRK
>m.54557 g.54557 ORF g.54557 m.54557 type:internal len:113 (+) comp1002314_c0_seq1:2-343(+)
SIKARQIYDSRGNPTVEVDLVTENGLFRAAVPSGASTGVHEALELRDNDKSMYHGKSVFKAVDNINSIIAPELLKANIEVTEQAEIDNFLLKLDGTPNKS KLGANAILGVSLA

ID.txt

Quote:

comp100001_c0_seq1
comp1002314_c0_seq1

desired output:

Quote:

>comp100001_c0_seq1
FHPPVSDSCKRCDMYKNQIKIAPENEKIQLNADHELHLRKAESARNGMNNDVELCKTDPNKVTVIAFDLMKTLSTPSLSVGVAYYKRQLSTYNLGIHNLT TNDAYMYVWNESMASRGPQEIGSCLLHFIKNYVHTEQLIMYSDQCGGQNRNIKMALICNFVVGSNDYLP
>comp1002314_c0_seq1
SIKARQIYDSRGNPTVEVDLVTENGLFRAAVPSGASTGVHEALELRDNDKSMYHGKSVFKAVDNINSIIAPELLKANIEVTEQAEIDNFLLKLDGTPNKS KLGANAILGVSLA

I am able to pull out the sequences based on the ID one-by-one, but this is slow and doesn't give me the header.

Code:

awk '/comp51893_c0_seq1/ { getline; print $0 }' source.txt

I also tried extracting the entire header and the sequence by modifying a script I had for a sequence file with a different header type, but again it works on one ID at a time, and it only prints the header.

Code:

awk '{lines[NR] = $0} /comp47911_c0_seq1/ {print lines [NR]; print lines [NR+1]}' source.txt

As is probably clear, I'm still pretty low on the learning curve. Any help would be really appreciated!

pathunkathunk

10-08-2012

Code:

awk ' FILENAME=="ID.txt" {arr[$0]++}
        FILENAME=="source.txt"
        {for(i in arr) {if (i ~ $0)
                             {print ">", i; getline; print $0; getline; print $0  }
                          }
        } ' ID.txt source.txt  > newfile

Try that for starters.

jim mcnamara

10-08-2012

jim, thanks for taking a look.

Using the code you provided, I get the following in the terminal:

Quote:

awk: illegal primary in regular expression >m.54555 g.54555 ORF g.54555 m.54555 type:internal len:137 (+) comp1001102_c0_seq1:3-416(+) at ) comp1001102_c0_seq1:3-416(+)
input record number 3, file source.txt
source line number 3

cat newfile returns:

Quote:

> comp100001_c0_seq1
comp1002314_c0_seq1
>m.49518 g.49518 ORF g.49518 m.49518 type:internal len:169 (-) comp100001_c0_seq1:3-509(-)
FHPPVSDSCKRCDMYKNQIKIAPENEKIQLNADHELHLRKAESARNGMNNDVELCKTDPNKVTVIAFDLMKTLSTPSLSVGVAYYKRQLSTYNLGIHNLT TNDAYMYVWNESMASRGPQEIGSCLLHFIKNYVHTEQLIMYSDQCGGQNRNIKMALICNFVVGSNDYLP
>m.54555 g.54555 ORF g.54555 m.54555 type:internal len:137 (+) comp1001102_c0_seq1:3-416(+)

Just to verify, here are the input files:

Quote:

$ cat source.txt
>m.49518 g.49518 ORF g.49518 m.49518 type:internal len:169 (-) comp100001_c0_seq1:3-509(-)
FHPPVSDSCKRCDMYKNQIKIAPENEKIQLNADHELHLRKAESARNGMNNDVELCKTDPNKVTVIAFDLMKTLSTPSLSVGVAYYKRQLSTYNLGIHNLT TNDAYMYVWNESMASRGPQEIGSCLLHFIKNYVHTEQLIMYSDQCGGQNRNIKMALICNFVVGSNDYLP
>m.54555 g.54555 ORF g.54555 m.54555 type:internal len:137 (+) comp1001102_c0_seq1:3-416(+)
YGDLDDSALDAEGPAGPVYRFSRRKSDTKSDDNSQSNGEGVMMMINGELVKVEQLKREEIINCTCGYTEEDGLMIQCDLCLCWQHGHCNGIEREKDVPEK YICYICSHPYRQRPSRKYIHDQDWIKEGKLVSLTKRK
>m.54557 g.54557 ORF g.54557 m.54557 type:internal len:113 (+) comp1002314_c0_seq1:2-343(+)
SIKARQIYDSRGNPTVEVDLVTENGLFRAAVPSGASTGVHEALELRDNDKSMYHGKSVFKAVDNINSIIAPELLKANIEVTEQAEIDNFLLKLDGTPNKS KLGANAILGVSLA
$ cat ID.txt
comp100001_c0_seq1
comp1002314_c0_seq1

pathunkathunk

10-08-2012

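Jim is correct (cool way of performing the task), but you need to swap the operands of the match operator in the if statement: the line being tested ($0) goes on the left of ~ and the ID (i) on the right. Otherwise the header is used as the regex, and its "(+)" is what triggers the "illegal primary" error.

Code:

awk 'FILENAME=="ID.txt" {arr[$0]++}   # collect the IDs
# header line contains an ID: print ">ID", then the single sequence line after it
FILENAME=="source.txt" { for(i in arr) {if ($0 ~ i) {print ">" i; getline; print $0; break } } }' ID.txt source.txt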

scottaazz
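
For larger ID lists, a possible variant is sketched below. It is only a sketch under stated assumptions, not what anyone in this thread posted: it assumes the ID is always the part of the last header field before the colon (as in the three sample headers), it compares IDs exactly instead of as regexes (so comp100001_c0_seq1 would not also match a hypothetical comp100001_c0_seq10), and it reports IDs that never matched on stderr while carrying on. The function name extract_by_id is made up for illustration.

```shell
# Hypothetical helper (the name is not from the thread): print ">ID" plus the
# sequence line(s) for every header whose ID is listed in the first file.
extract_by_id() {
    awk '
        # First file: remember every non-empty ID (NR == FNR only holds there).
        NR == FNR { if ($0 != "") want[$0] = 1; next }

        # Header line. ASSUMPTION: the ID is the last field, up to the ":".
        /^>/ {
            split($NF, part, ":")
            id = part[1]
            keep = (id in want)
            if (keep) { print ">" id; found[id] = 1 }
            next
        }

        # Sequence line(s) belonging to the most recent header.
        keep { print }

        # Report IDs that never matched, without stopping the run.
        END {
            for (id in want)
                if (!(id in found))
                    print "not found: " id > "/dev/stderr"
        }
    ' "$1" "$2"
}
```

Usage would look like `extract_by_id ID.txt source.txt > newfile`. Because the ID lookup is a plain array test, each header is checked once rather than once per ID, and characters such as "(+)" in the headers are never interpreted as a regex.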
