|
google site
|
|||||||
| Forums | Register | Blog | Man Pages | Forum Rules | Links | Albums | FAQ | Users | Calendar | Search | Today's Posts | Mark Forums Read |
| Shell Programming and Scripting Post questions about KSH, CSH, SH, BASH, PERL, PHP, SED, AWK and OTHER shell scripts and shell scripting languages here. |
![]() |
|
|
Thread Tools | Search this Thread | Display Modes |
|
#1
|
|||
|
|||
|
Fgrep or grep or awk help - scanning for delimiters.
Hi, I'm struggling a little here, so I figured it's time to ask for help. I have a file with a list of several hundred IDs (the hit file- "hitfile.txt"), which is newline delimited, and a much bigger (~500Mb) text file, "FASTA.txt" with several thousand entries, delimited by ">". It's the FASTA format, for those interested. On the same line as the >, several different IDs are contained, delimited by "/". One of them is an internal ID ("internalID" which is not much use) and the other an external ID ("externalID" which is much more useful). The file therefore looks like this: Code:
>internalID1 / externalID1 GATTACA >internalID2 / externalID2 GATTACA I have been able to extract the Identifier containing lines and also extract the more useful external ID. I used: Code:
fgrep -f hitfile.txt FASTA.txt > outfile.txt With a hitfile of: Code:
internalID1 internalID2 This outputs the lines as: Code:
>internalID1 / externalID1 >internalID2 / externalID2 From which it is trivial to further extract the externalIDs. Now, I would like to not only pull out single lines, but pull out all lines from the ID (which is always the first item after the >) until the next >, which is the next entry. This will mean I have a file not only of the IDs but also the sequences therein. So with a hitfile of: Code:
internalID1 The output is: Code:
>internalID1 / externalID1 GATTACA This is where my complete n00bism and lack of bash-fu get me stuck. I have tried a couple of promising looking awk scripts, to no avail... Any help in this matter will be much, much appreciated. Last edited by radoulov; 02-18-2010 at 06:12 AM.. Reason: Added code tags. |
| Sponsored Links | ||
|
|
|
#2
|
||||
|
||||
|
Code:
awk -F'[>/ ]' 'END {
if (_2 in _) print r
}
NR == FNR { _[$0]; next }
/^>/ {
if (_2 in _) print r
r = x; _2 = $2
}
{ r = r ? r RS $0 : $0 }
' hitfile.txt FASTA.txtUse gawk, nawk or /usr/xpg4/bin/awk on Solaris. |
|
#3
|
||||
|
||||
|
Hi. The AT&T cgrep command was designed to extract sections of text within a window: Code:
#!/usr/bin/env bash # @(#) s1 Demonstrate extraction of context window, cgrep. # http://www.bell-labs.com/project/wwexptools/cgrep/ echo set +o nounset LC_ALL=C ; LANG=C ; export LC_ALL LANG echo "Environment: LC_ALL = $LC_ALL, LANG = $LANG" echo "(Versions displayed with local utility \"version\")" version >/dev/null 2>&1 && version "=o" $(_eat $0 $1) cgrep set -o nounset FILE1=data1 FILE2=data2 echo echo " Data file $FILE1:" cat $FILE1 echo echo " Data file $FILE2:" cat $FILE2 echo echo " Results:" cgrep -D +I2 +w '>' -f $FILE2 $FILE1 exit 0 producing: Code:
% ./s1 Environment: LC_ALL = C, LANG = C (Versions displayed with local utility "version") OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64 Distribution : Debian GNU/Linux 5.0 GNU bash 3.2.39 cgrep - (local: ~/executable/cgrep May 29 2009 ) Data file data1: >internalID1 / externalID1 GATTACA >internalID2 / externalID2 GATTACA Data file data2: internalID1 Results: >internalID1 / externalID1 GATTACA See the URL noted in the script. You would need to download and compile the code, but I have done that in 32-bit and 64-bit environments without trouble. If you are not comfortable with that, then someone may stop by shortly to offer an awk or perl code. Best wishes ... cheers, drl |
|
#4
|
|||
|
|||
|
Both wonderful replies. Both work very well. Thank you so much!
|
|
#5
|
||||
|
||||
|
use below:- (use gawk or nawk or /usr/xpg4/bin/awk) Code:
nawk 'NR==FNR {a[$1] ; next} $1 in a{print RS$0}' hitfile.txt RS="\>" FASTA.txt![]() ![]()
|
|
#6
|
|||
|
|||
|
@ahmad.diab wow really great work, a true piece of art. I wonder if there is a similar short and concise way to do that in perl ?
|
|
#7
|
||||
|
||||
|
Note that some awk implementations have problems with long multiline records (high number of bytes in the < blocks, in this case). With Perl, I'm not able to come up with more concise solution than this one: Code:
perl -ne'
@_{/([^\n]+)/} = 1 and next if @ARGV;
if (/^>/ or eof) {
@_{$x =~ /^>([^ \/]+)/} and print $x;
$x = "";
}
$x .= $_' hitfile.txt FASTA.txtYou can modify the input record separator $/ with Perl too, but, as far as I know, not the way awk allows you to. Last edited by radoulov; 03-06-2010 at 10:39 AM.. |
| Sponsored Links | ||
|
|
![]() |
| Bookmarks |
| Thread Tools | Search this Thread |
| Display Modes | |
|
|
More UNIX and Linux Forum Topics You Might Find Helpful
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Awk Vs Fgrep | morbid_angel | Shell Programming and Scripting | 6 | 12-28-2009 09:20 AM |
| Difference grep, egrep and fgrep | ravind27 | UNIX Desktop for Dummies Questions & Answers | 2 | 06-14-2009 07:37 AM |
| fgrep fails...!? | r_sethu | Shell Programming and Scripting | 5 | 07-26-2007 08:41 AM |
| fgrep | krishna | UNIX for Advanced & Expert Users | 2 | 02-05-2004 09:46 AM |
| I need help with fgrep or grep | jimmy | UNIX for Dummies Questions & Answers | 7 | 11-21-2001 12:13 PM |