Fgrep or grep or awk help - scanning for delimiters.


 
# 1  
Old 02-18-2010

Hi,

I'm struggling a little here, so I figured it's time to ask for help.

I have a file with a list of several hundred IDs (the hit file, "hitfile.txt"), one per line, and a much bigger (~500 MB) text file, "FASTA.txt", with several thousand entries, each starting with ">". It's the FASTA format, for those interested.

On the same line as the ">", two IDs are given, separated by "/": an internal ID ("internalID", which is not much use) and an external ID ("externalID", which is much more useful). The file therefore looks like this:


Code:
>internalID1 / externalID1

GATTACA

>internalID2 / externalID2

GATTACA


I have been able to extract the lines containing the identifiers and, from those, the more useful external IDs.

I used:
Code:
fgrep -f hitfile.txt FASTA.txt > outfile.txt

With a hitfile of:

Code:
internalID1
internalID2

This outputs the lines as:

Code:
>internalID1 / externalID1
>internalID2 / externalID2

From these lines it is trivial to extract the externalIDs.
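
For completeness, a minimal sketch of that second step. It assumes every header line has exactly the form ">internalID / externalID", with " / " (space, slash, space) as the separator. The sample outfile.txt is recreated in a scratch directory only so that the snippet runs on its own; workdir and external_ids are illustrative names.

```shell
# Work in a scratch directory so no real file is overwritten.
workdir=$(mktemp -d)

# Recreate the two-line outfile.txt shown above (sample data only).
printf '%s\n' '>internalID1 / externalID1' \
              '>internalID2 / externalID2' > "$workdir/outfile.txt"

# Split each header on " / " and keep the second field, the external ID.
external_ids=$(awk -F' / ' '{ print $2 }' "$workdir/outfile.txt")
printf '%s\n' "$external_ids"
```

If the real headers carry more than two IDs, the field number would need adjusting.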

Now I would like to pull out not just single lines but all lines from the ID (which is always the first item after the >) up to the next >, which starts the next entry. That would give me a file containing not only the IDs but also their sequences. So with a hitfile of:
Code:
internalID1

The output would be:
Code:
>internalID1 / externalID1

GATTACA




This is where my complete n00bism and lack of bash-fu get me stuck. I have tried a couple of promising-looking awk scripts, to no avail...

Any help in this matter will be much, much appreciated.

Last edited by radoulov; 02-18-2010 at 07:12 AM.. Reason: Added code tags.
# 2  
Old 02-18-2010
Code:
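awk -F'[>/ ]' '
# First file (hitfile.txt): remember each wanted ID as an array key.
NR == FNR { ids[$0]; next }
# Header line: print the previous entry if its ID was wanted, then start a new one.
/^>/ {
  if (id in ids) print rec
  rec = ""; id = $2   # $2 is the internal ID right after ">"
  }
# Append the current line (header or sequence) to the entry being built.
{ rec = (rec == "") ? $0 : rec RS $0 }
# End of input: print the last entry if its ID was wanted.
END { if (id in ids) print rec }
' hitfile.txt FASTA.txt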

Use gawk, nawk or /usr/xpg4/bin/awk on Solaris.
# 3  
Old 02-18-2010
Hi.

The AT&T cgrep command was designed to extract sections of text within a window:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate extraction of context window, cgrep.
# http://www.bell-labs.com/project/wwexptools/cgrep/

echo
set +o nounset
LC_ALL=C ; LANG=C ; export LC_ALL LANG
echo "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version "=o" $(_eat $0 $1) cgrep
set -o nounset

FILE1=data1
FILE2=data2

echo
echo " Data file $FILE1:"
cat $FILE1

echo
echo " Data file $FILE2:"
cat $FILE2

echo
echo " Results:"
cgrep -D +I2 +w '>' -f $FILE2 $FILE1

exit 0

producing:
Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 
GNU bash 3.2.39
cgrep - (local: ~/executable/cgrep May 29 2009 )

 Data file data1:
>internalID1 / externalID1

GATTACA

>internalID2 / externalID2

GATTACA

 Data file data2:
internalID1

 Results:
>internalID1 / externalID1

GATTACA

See the URL noted in the script. You would need to download and compile the code, but I have done that in 32-bit and 64-bit environments without trouble.

If you are not comfortable with that, then someone may stop by shortly to offer an awk or Perl solution.

Best wishes ... cheers, drl
# 4  
Old 02-18-2010
Both wonderful replies. Both work very well. Thank you so much!
# 5  
Old 02-18-2010
Use the command below (with gawk, nawk or /usr/xpg4/bin/awk):

Code:
nawk 'NR==FNR {a[$1] ; next} $1 in a{print RS$0}' hitfile.txt RS=">" FASTA.txt
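
How this works, as far as I can tell: the RS assignment sits between the two file names, so hitfile.txt is still read line by line while FASTA.txt is split into one record per ">" block, and $1 of each block is the internal ID. The sketch below is the same idea with printf instead of print, so that no extra newline is appended after each block. It assumes ">" appears nowhere except at the start of a header line; the sample files and the names workdir, wanted and result are illustrative only.

```shell
# Build small sample inputs in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' '>internalID1 / externalID1' '' 'GATTACA' '' \
              '>internalID2 / externalID2' '' 'GATTACA' > "$workdir/FASTA.txt"
printf '%s\n' 'internalID2' > "$workdir/hitfile.txt"

# RS is changed only after hitfile.txt has been read.
result=$(awk 'NR == FNR { wanted[$1]; next }
              $1 in wanted { printf "%s%s", RS, $0 }
' "$workdir/hitfile.txt" RS='>' "$workdir/FASTA.txt")
printf '%s\n' "$result"
```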

# 6  
Old 03-06-2010
@ahmad.diab Wow, really great work, a true piece of art. I wonder if there is a similarly short and concise way to do that in Perl?
# 7  
Old 03-06-2010
Note that some awk implementations have problems with long multiline records (a large number of bytes in the > blocks, in this case).

With Perl, I'm not able to come up with a more concise solution than this one:
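
One way around that limit, offered as a sketch and not tested on the implementations in question, is to leave RS at its default so that awk only ever holds a single line, and to switch a flag on or off at each header line. The sample files and the names workdir, wanted, keep and result are illustrative only.

```shell
# Build small sample inputs in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' '>internalID1 / externalID1' '' 'GATTACA' '' \
              '>internalID2 / externalID2' '' 'GATTACA' > "$workdir/FASTA.txt"
printf '%s\n' 'internalID1' > "$workdir/hitfile.txt"

result=$(awk -F'[>/ ]' '
  NR == FNR { wanted[$0]; next }        # first file: remember each ID
  /^>/      { keep = ($2 in wanted) }   # header: decide for the whole block
  keep                                  # print every line of a wanted block
' "$workdir/hitfile.txt" "$workdir/FASTA.txt")
printf '%s\n' "$result"
```

Nothing is buffered here, so memory use does not depend on the size of an entry.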

Code:
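perl -ne'
  # First file: store each ID as a hash key.
  @_{/([^\n]+)/} = 1 and next if @ARGV;
  # New header: print the buffered entry if its ID is wanted, then reset.
  if (/^>/) {
    @_{$x =~ /^>([^ \/]+)/} and print $x;
    $x = "";
    }
  $x .= $_;
  # Last line of input: flush the final entry, including this line.
  @_{$x =~ /^>([^ \/]+)/} and print $x if eof' hitfile.txt FASTA.txt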

You can modify the input record separator $/ with Perl too, but, as far as I know, not the way awk allows you to.

Last edited by radoulov; 03-06-2010 at 11:39 AM..