read regex from ID file, print regex and line below from source file Post: 302712145


Posted by pathunkathunk on Monday 8th of October 2012 07:12:09 PM in UNIX for Dummies Questions & Answers


I have a file of protein sequences with headers (my source file). Based on a list of IDs (which are included in some of the headers), I'd like to print out only the specified sequences, with only the ID as header.

In other words, I'd like to search source.txt for the terms in IDs.txt, and print the ID as well as the sequence. Ideally the process would continue even if an ID is not found in the source file. All headers in source.txt are of similar format.

source.txt

Quote:

>m.49518 g.49518 ORF g.49518 m.49518 type:internal len:169 (-) comp100001_c0_seq1:3-509(-)
FHPPVSDSCKRCDMYKNQIKIAPENEKIQLNADHELHLRKAESARNGMNNDVELCKTDPNKVTVIAFDLMKTLSTPSLSVGVAYYKRQLSTYNLGIHNLT TNDAYMYVWNESMASRGPQEIGSCLLHFIKNYVHTEQLIMYSDQCGGQNRNIKMALICNFVVGSNDYLP
>m.54555 g.54555 ORF g.54555 m.54555 type:internal len:137 (+) comp1001102_c0_seq1:3-416(+)
YGDLDDSALDAEGPAGPVYRFSRRKSDTKSDDNSQSNGEGVMMMINGELVKVEQLKREEIINCTCGYTEEDGLMIQCDLCLCWQHGHCNGIEREKDVPEK YICYICSHPYRQRPSRKYIHDQDWIKEGKLVSLTKRK
>m.54557 g.54557 ORF g.54557 m.54557 type:internal len:113 (+) comp1002314_c0_seq1:2-343(+)
SIKARQIYDSRGNPTVEVDLVTENGLFRAAVPSGASTGVHEALELRDNDKSMYHGKSVFKAVDNINSIIAPELLKANIEVTEQAEIDNFLLKLDGTPNKS KLGANAILGVSLA

IDs.txt

Quote:

comp100001_c0_seq1
comp1002314_c0_seq1

desired output:

Quote:

>comp100001_c0_seq1
FHPPVSDSCKRCDMYKNQIKIAPENEKIQLNADHELHLRKAESARNGMNNDVELCKTDPNKVTVIAFDLMKTLSTPSLSVGVAYYKRQLSTYNLGIHNLT TNDAYMYVWNESMASRGPQEIGSCLLHFIKNYVHTEQLIMYSDQCGGQNRNIKMALICNFVVGSNDYLP
>comp1002314_c0_seq1
SIKARQIYDSRGNPTVEVDLVTENGLFRAAVPSGASTGVHEALELRDNDKSMYHGKSVFKAVDNINSIIAPELLKANIEVTEQAEIDNFLLKLDGTPNKS KLGANAILGVSLA

I am able to pull out the sequences based on the ID one-by-one, but this is slow and doesn't give me the header.

Code:

awk '/comp51893_c0_seq1/ { getline; print $0 }' source.txt

I also tried extracting the entire header and the sequence by modifying a script I had for a sequence file with a different header type, but again it works on one ID at a time, and it only prints the header.

Code:

awk '{lines[NR] = $0} /comp47911_c0_seq1/ {print lines [NR]; print lines [NR+1]}' source.txt
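A likely reason this attempt shows only the header: awk runs the rule while it is still on the matching line, so `lines[NR+1]` has not been read yet and prints as an empty line. A minimal sketch of a one-ID version that prints the header and the line below it is given here. The file demo_source.txt and its short sequences are made up for illustration, and `nxt` is just a variable name chosen for this example.

```shell
# Small stand-in for source.txt (shortened, made-up sequences).
cat > demo_source.txt <<'EOF'
>m.1 g.1 ORF g.1 m.1 type:internal len:5 (-) comp47911_c0_seq1:3-17(-)
AAAAA
>m.2 g.2 ORF g.2 m.2 type:internal len:5 (+) comp47912_c0_seq1:3-17(+)
CCCCC
EOF

# Print the matching header, then read the following line and print it too.
one_id_out=$(awk '/comp47911_c0_seq1/ { print; if ((getline nxt) > 0) print nxt }' demo_source.txt)
printf '%s\n' "$one_id_out"
```

This still handles a single hard-coded ID, and the bare regex would also match longer IDs that contain it (for example one ending in `seq10`).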

As is probably clear, I'm still pretty low on the learning curve. Any help would be really appreciated!
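One possible way to handle the whole ID list in a single pass is sketched below. It assumes that every header ends with a field of the form `ID:start-end(strand)`, as in the sample above, and that each ID in the list sits alone on its line. The demo_ file names, the truncated sequences, and the extra ID `comp_not_present` are illustrative only; `want`, `id`, and `keep` are names chosen for this sketch.

```shell
# Stand-in ID list, including one ID that does not occur in the source.
cat > demo_ids.txt <<'EOF'
comp100001_c0_seq1
comp1002314_c0_seq1
comp_not_present
EOF

# Stand-in source file: headers from the post, sequences truncated.
cat > demo_source2.txt <<'EOF'
>m.49518 g.49518 ORF g.49518 m.49518 type:internal len:169 (-) comp100001_c0_seq1:3-509(-)
FHPPVSDSCK
>m.54555 g.54555 ORF g.54555 m.54555 type:internal len:137 (+) comp1001102_c0_seq1:3-416(+)
YGDLDDSALD
>m.54557 g.54557 ORF g.54557 m.54557 type:internal len:113 (+) comp1002314_c0_seq1:2-343(+)
SIKARQIYDS
EOF

awk '
    # First file (the ID list): remember every non-empty line as a wanted ID.
    NR == FNR { if ($0 != "") want[$0] = 1; next }

    # Header line: take the last field and cut everything from the first colon.
    /^>/ {
        id = $NF
        sub(/:.*/, "", id)
        keep = (id in want)          # exact match, not a regex match
        if (keep) print ">" id       # print only the ID as the new header
        next
    }

    # Sequence line: print it only if the most recent header was wanted.
    keep
' demo_ids.txt demo_source2.txt > demo_out.txt

cat demo_out.txt
```

IDs that never appear in the source are simply skipped, so the run continues past them. Records come out in the order they occur in the source file rather than the order of the ID list, and the `NR == FNR` test relies on the ID file not being empty.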

pathunkathunk

