Go Back   The UNIX and Linux Forums > Top Forums > Shell Programming and Scripting
google site



Shell Programming and Scripting Post questions about KSH, CSH, SH, BASH, PERL, PHP, SED, AWK and OTHER shell scripts and shell scripting languages here.

Reply
English Japanese Spanish French German Portuguese Italian Powered by Powered by Google
 
Thread Tools Search this Thread Display Modes
  #1  
Old 02-18-2010
Registered User
 

Join Date: Feb 2010
Posts: 2
Thanks: 0
Thanked 0 Times in 0 Posts
Fgrep or grep or awk help - scanning for delimiters.

Hi,

I'm struggling a little here, so I figured it's time to ask for help.

I have a file with a list of several hundred IDs (the hit file- "hitfile.txt"), which is newline delimited, and a much bigger (~500Mb) text file, "FASTA.txt" with several thousand entries, delimited by ">". It's the FASTA format, for those interested.

On the same line as the >, several different IDs are contained, delimited by "/". One of them is an internal ID ("internalID" which is not much use) and the other an external ID ("externalID" which is much more useful). The file therefore looks like this:



Code:
>internalID1 / externalID1

GATTACA

>internalID2 / externalID2

GATTACA


I have been able to extract the Identifier containing lines and also extract the more useful external ID.

I used:

Code:
fgrep -f hitfile.txt FASTA.txt > outfile.txt

With a hitfile of:


Code:
internalID1
internalID2

This outputs the lines as:


Code:
>internalID1 / externalID1
>internalID2 / externalID2

From which it is trivial to further extract the externalIDs.

Now, I would like to not only pull out single lines, but pull out all lines from the ID (which is always the first item after the >) until the next >, which is the next entry. This will mean I have a file not only of the IDs but also the sequences therein. So with a hitfile of:

Code:
internalID1

The output is:

Code:
>internalID1 / externalID1

GATTACA




This is where my complete n00bism and lack of bash-fu get me stuck. I have tried a couple of promising looking awk scripts, to no avail...

Any help in this matter will be much, much appreciated.

Last edited by radoulov; 02-18-2010 at 06:12 AM.. Reason: Added code tags.
Sponsored Links
  #2  
Old 02-18-2010
radoulov's Avatar
--
 

Join Date: Jan 2007
Location: Варна, България / Milano, Italia
Posts: 3,619
Thanks: 15
Thanked 55 Times in 54 Posts

Code:
awk -F'[>/ ]' 'END { 
  if (_2 in _) print r
 }
NR == FNR { _[$0]; next }
/^>/ { 
  if (_2 in _) print r 
  r = x; _2 = $2 
  }
{ r = r ? r RS $0 : $0 }  
' hitfile.txt FASTA.txt

Use gawk, nawk or /usr/xpg4/bin/awk on Solaris.
  #3  
Old 02-18-2010
drl's Avatar
drl drl is offline Forum Advisor  
Registered User
 

Join Date: Apr 2007
Location: Saint Paul, MN USA / BSD, CentOS, Debian, OS X, Solaris
Posts: 898
Thanks: 0
Thanked 10 Times in 10 Posts
Hi.

The AT&T cgrep command was designed to extract sections of text within a window:

Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate extraction of context window, cgrep.
# http://www.bell-labs.com/project/wwexptools/cgrep/

echo
set +o nounset
LC_ALL=C ; LANG=C ; export LC_ALL LANG
echo "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version "=o" $(_eat $0 $1) cgrep
set -o nounset

FILE1=data1
FILE2=data2

echo
echo " Data file $FILE1:"
cat $FILE1

echo
echo " Data file $FILE2:"
cat $FILE2

echo
echo " Results:"
cgrep -D +I2 +w '>' -f $FILE2 $FILE1

exit 0

producing:

Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 
GNU bash 3.2.39
cgrep - (local: ~/executable/cgrep May 29 2009 )

 Data file data1:
>internalID1 / externalID1

GATTACA

>internalID2 / externalID2

GATTACA

 Data file data2:
internalID1

 Results:
>internalID1 / externalID1

GATTACA

See the URL noted in the script. You would need to download and compile the code, but I have done that in 32-bit and 64-bit environments without trouble.

If you are not comfortable with that, then someone may stop by shortly to offer an awk or perl code.

Best wishes ... cheers, drl
  #4  
Old 02-18-2010
Registered User
 

Join Date: Feb 2010
Posts: 2
Thanks: 0
Thanked 0 Times in 0 Posts
Both wonderful replies. Both work very well. Thank you so much!
  #5  
Old 02-18-2010
ahmad.diab's Avatar
Registered User
 

Join Date: May 2008
Location: Amman Jordan in MEA
Posts: 537
Thanks: 0
Thanked 1 Time in 1 Post
use below:- (use gawk or nawk or /usr/xpg4/bin/awk)


Code:
nawk 'NR==FNR {a[$1] ; next} $1 in a{print RS$0}' hitfile.txt RS="\>" FASTA.txt

  #6  
Old 03-06-2010
Registered User
 

Join Date: Mar 2010
Location: Vienna
Posts: 1
Thanks: 0
Thanked 0 Times in 0 Posts
@ahmad.diab wow really great work, a true piece of art. I wonder if there is a similar short and concise way to do that in perl ?
  #7  
Old 03-06-2010
radoulov's Avatar
--
 

Join Date: Jan 2007
Location: Варна, България / Milano, Italia
Posts: 3,619
Thanks: 15
Thanked 55 Times in 54 Posts
Note that some awk implementations have problems with long multiline records (high number of bytes in the < blocks, in this case).

With Perl, I'm not able to come up with more concise solution than this one:


Code:
perl -ne'
  @_{/([^\n]+)/} = 1 and next if @ARGV;
  if (/^>/ or eof) {
    @_{$x =~ /^>([^ \/]+)/} and print $x;
    $x = "";
    } 
  $x .= $_' hitfile.txt FASTA.txt

You can modify the input record separator $/ with Perl too, but, as far as I know, not the way awk allows you to.

Last edited by radoulov; 03-06-2010 at 10:39 AM..
Sponsored Links
Reply

Bookmarks

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


More UNIX and Linux Forum Topics You Might Find Helpful
Thread Thread Starter Forum Replies Last Post
Awk Vs Fgrep morbid_angel Shell Programming and Scripting 6 12-28-2009 09:20 AM
Difference grep, egrep and fgrep ravind27 UNIX Desktop for Dummies Questions & Answers 2 06-14-2009 07:37 AM
fgrep fails...!? r_sethu Shell Programming and Scripting 5 07-26-2007 08:41 AM
fgrep krishna UNIX for Advanced & Expert Users 2 02-05-2004 09:46 AM
I need help with fgrep or grep jimmy UNIX for Dummies Questions & Answers 7 11-21-2001 12:13 PM



All times are GMT -4. The time now is 07:59 AM.