Grep a couple of consecutive lines if each line contains a certain string


 
# 8  
Old 05-30-2012
Hi, Scrutinizer.
Quote:
Originally Posted by Scrutinizer
@drl, grep cannot do this and I do not think cgrep is present on Solaris, is it? cgrep looks nice though and it is fast indeed. I presume cgrep was tested against gawk, which is one of the slowest awks. Perhaps you could compare it to the fastest awk, which is mawk.
I have only the old Solaris-X86 running in a VM:
Code:
OS, ker|rel, machine: SunOS, 5.10, i86pc
Distribution        : Solaris 10 10/08 s10x_u6wos_07b X86

There are a number of repos which may have it, but I have not searched extensively. I can try to see if cgrep will compile on Solaris (it was an easy make on Linux, both 32- and 64-bit), but that will be a low-priority task.

An excerpt from a searching benchmark on a 100MB file shows:

Code:
By cpu:
        code   cpu   real system real/cpu cpu/best real/best sys/best
       cgrep  0.15   0.30   0.12     2.00     1.00      1.30     2.40
   fgrep (2)  0.16   0.24   0.06     1.50     1.07      1.04     1.20
        grep  0.16   0.23   0.06     1.44     1.07      1.00     1.20
       agrep  0.20   0.28   0.05     1.40     1.33      1.22     1.00
         awk  1.32   1.45   0.08     1.10     8.80      6.30     1.60
        mawk  1.36   1.46   0.06     1.07     9.07      6.35     1.20
        perl  1.36   1.50   0.07     1.10     9.07      6.52     1.40
         sed  1.36   1.46   0.06     1.07     9.07      6.35     1.20
        ruby  1.96   2.14   0.12     1.09    13.07      9.30     2.40
        java  2.19   2.51   0.12     1.15    14.60     10.91     2.40

So for that task the versions used were:
Code:
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
gawk GNU Awk 3.1.5

Best wishes ... cheers, drl
# 9  
Old 05-30-2012
Strange, are you sure the mawk numbers are correct? They should not be anywhere near the gawk numbers. I ran these tests on another 100 MB file:
Code:
cgrep -a '\| REQUEST \|.*\n.*\| RESPONSE \|' infile

Code:
mawk -F\| '$2 ~ "REQUEST"{s=$0;next} s && $2~"RESPONSE"{print s RS $0; s=x}' infile

Code:
gawk -F\| '$2 ~ "REQUEST"{s=$0;next} s && $2~"RESPONSE"{print s RS $0; s=x}' infile


and I got (in seconds):

Code:
code   real     user     sys
cgrep  11.160   10.845   0.300
mawk   16.995   16.505   0.464
gawk   98.290   97.578   0.548

--
cgrep version 8.15, mawk 1.3.3, GNU Awk 3.1.6 on Ubuntu 10.04 LTS
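
For readers without a 100 MB file at hand, the awk program above can be tried on a few lines. This is only a minimal sketch: the sample records are made up in the style of the thread's input, `pair_request_response` is a hypothetical wrapper name, and plain `awk` is called so that whichever awk is installed gets used.

```shell
#!/bin/sh
# Hypothetical wrapper around the awk program from this post.
# Fields are split on "|", so field 2 holds " REQUEST ", " RESPONSE ", etc.
pair_request_response() {
    awk -F'|' '
        $2 ~ "REQUEST"       { s = $0; next }            # remember the latest REQUEST line
        s && $2 ~ "RESPONSE" { print s RS $0; s = "" }   # print the pair, then forget it
    ' "$@"
}

# Made-up sample input (an assumption, not data from the benchmark).
sample='2914 | REQUEST | whatever
2914 | RESPONSE | whatever
2914 | SUCCESS | whatever
2985 | RESPONSE | whatever
2986 | REQUEST | whatever
2990 | REQUEST | whatever
2985 | RESPONSE | whatever'

printf '%s\n' "$sample" | pair_request_response
```

On this sample the function prints two pairs: the 2914 REQUEST/RESPONSE lines, and the 2990 REQUEST line followed by the last 2985 RESPONSE line. The lone RESPONSE on the fourth line is skipped because no REQUEST is pending at that point.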

Last edited by Scrutinizer; 05-30-2012 at 12:23 PM..
# 10  
Old 05-30-2012
Hi, Scrutinizer.

Thanks for spotting that anomaly. In fact, I was running GNU awk in place of mawk. The new (interim) excerpt of the searching benchmark is:
Code:
By cpu:
        code   cpu   real system real/cpu cpu/best real/best sys/best
        grep  0.13   0.23   0.08     1.77     1.00      1.00     1.60
   fgrep (2)  0.16   0.23   0.05     1.44     1.23      1.00     1.00
       cgrep  0.17   0.30   0.11     1.76     1.31      1.30     2.20
       agrep  0.20   0.28   0.06     1.40     1.54      1.22     1.20
        mawk  0.51   0.64   0.09     1.25     3.92      2.78     1.80
         awk  1.33   1.42   0.06     1.07    10.23      6.17     1.20
        perl  1.37   1.49   0.10     1.09    10.54      6.48     2.00
         sed  1.37   1.47   0.06     1.07    10.54      6.39     1.20
        ruby  1.98   2.45   0.11     1.24    15.23     10.65     2.20
        java  2.01   3.15   0.17     1.57    15.46     13.70     3.40

which shows that for this task, mawk is 2-3 times faster than gawk in CPU time (0.51 vs. 1.33, a factor of about 2.6), although, like cgrep, its system time is greater.
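
The ratio columns in the table can be recomputed from the three measured columns, which also makes the mawk-to-gawk comparison easy to check. The sketch below assumes that each "/best" column is the row's value divided by the smallest value in that column and that real/cpu is a plain quotient. That assumption reproduces the printed figures for the rows used here, but it is inferred from the numbers, not stated in the post. `ratios` is a hypothetical helper name.

```shell
#!/bin/sh
# Recompute real/cpu, cpu/best, real/best and sys/best.
# Input: one row per tool as "code cpu real system" (values copied from the table).
ratios() {
    awk '
        {
            code[NR] = $1; cpu[NR] = $2 + 0; real[NR] = $3 + 0; sys[NR] = $4 + 0
            if (NR == 1 || cpu[NR]  < best_cpu)  best_cpu  = cpu[NR]    # smallest cpu
            if (NR == 1 || real[NR] < best_real) best_real = real[NR]   # smallest real
            if (NR == 1 || sys[NR]  < best_sys)  best_sys  = sys[NR]    # smallest system
        }
        END {
            for (i = 1; i <= NR; i++)
                printf "%s %.2f %.2f %.2f %.2f\n", code[i],
                       real[i] / cpu[i], cpu[i] / best_cpu,
                       real[i] / best_real, sys[i] / best_sys
        }
    '
}

printf '%s\n' 'grep 0.13 0.23 0.08' 'fgrep 0.16 0.23 0.05' \
              'mawk 0.51 0.64 0.09' 'awk 1.33 1.42 0.06' | ratios
```

The mawk and awk rows come out as 1.25 3.92 2.78 1.80 and 1.07 10.23 6.17 1.20, matching the table. Dividing the two cpu/best figures (10.23 / 3.92) gives the same factor of about 2.6 as dividing the raw CPU values.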

I'm sure that Michael appreciates you defending his code's honor :)

Best wishes ... cheers, drl
# 11  
Old 05-30-2012
Still quite a discrepancy, because I get a factor of 5 to 5.5. Maybe you have a strange compile, or could there be a caching effect with the others?

I also just ran the tests on OSX, with these results:
Code:
code    real    user    sys
cgrep   0.8665  0.847   0.017
mawk    0.954   0.923   0.031
awk(*)  4.582   4.492   0.037

--
cgrep version 8.15, mawk 1.3.3, (*)BWK Awk 20070501 on OSX 10.7.4

Last edited by Scrutinizer; 05-30-2012 at 01:36 PM..
# 12  
Old 05-30-2012
Hi.

This is a quickly-put-together script:
Code:
#!/usr/bin/env bash

# @(#) s2	Demonstrate comparison among cgrep, gawk, mawk.

# Small helpers: pe prints its arguments, pl prints a separator and a title.
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
# db prints debug messages to stderr; the second definition switches it off.
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
# cs inserts thousands separators into a number.
cs() { echo "$1" | perl -wp -e '1 while s/^([-+]?\d+)(\d{3})/$1,$2/; ' ; }
# clock wraps GNU time; it is defined but not used below (the shell's time is).
clock() { /usr/bin/time --format="real %e\nuser %U\nsys %S" "$@"; }
# context is a local utility that reports the environment and versions.
C=$HOME/bin/context && [ -f "$C" ] && "$C" cgrep gawk mawk

FILE=${1-/tmp/100-mb.txt}
lines=$( wc -l < "$FILE" )
chars=$( wc -c < "$FILE" )
pl " Input file $FILE is $( cs $lines ) lines, $( cs $chars ) characters:"
# specimen is a local utility; it shows the first and last lines of the file.
specimen "$FILE"

pl " Results for cgrep:"
time cgrep -a '\| REQUEST \|.*\n.*\| RESPONSE \|' "$FILE"

pl " Results for gawk:"
time gawk -F\| '$2 ~ "REQUEST"{s=$0;next} s && $2~"RESPONSE"{print s RS $0; s=x}' "$FILE"

pl " Results for mawk:"
time mawk -F\| '$2 ~ "REQUEST"{s=$0;next} s && $2~"RESPONSE"{print s RS $0; s=x}' "$FILE"

exit 0

producing:
Code:
% ./s2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
cgrep ATT cgrep 8.15
gawk GNU Awk 3.1.5
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan

-----
 Input file /tmp/100-mb.txt is 1,777,700 lines, 120,540,400 characters:
Edges: 5:0:5 of 1777700 lines in file "/tmp/100-mb.txt"
Preliminary Matter.  

This text of Melville's Moby-Dick is based on the Hendricks House edition.
It was prepared by Professor Eugene F. Irey at the University of Colorado.
Any subsequent copies of this data must include this notice  
   ---
AND FLOATED BY MY SIDE. +BUOYED UP BY THAT COFFIN, FOR ALMOST ONE WHOLE DAY
AND NIGHT, +I FLOATED ON A SOFT AND DIRGE-LIKE MAIN. +THE UNHARMING SHARKS,
THEY GLIDED BY AS IF WITH PADLOCKS ON THEIR MOUTHS; THE SAVAGE SEA-HAWKS SAILE
D WITH SHEATHED BEAKS. +ON THE SECOND DAY, A SAIL DREW NEAR, NEARER, AND PIC
KED ME UP AT LAST. +IT WAS THE DEVIOUS-CRUISING +RACHEL, THAT IN HER RETRACIN

-----
 Results for cgrep:

real	0m0.224s
user	0m0.104s
sys	0m0.100s

-----
 Results for gawk:

real	0m1.453s
user	0m1.328s
sys	0m0.092s

-----
 Results for mawk:

real	0m1.105s
user	0m0.988s
sys	0m0.096s

If anything takes the hit of a cold cache, it would be the wc, or at least the cgrep, since they read the file before gawk and mawk do ... cheers, drl
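
One way to take caching out of the comparison is to read the file once before timing anything and then time each command several times. The following is a hedged sketch, not part of the script above: `time_warm` is a hypothetical helper, and it relies on bash's built-in `time -p`, which writes the real/user/sys lines to standard error.

```shell
#!/usr/bin/env bash
# Hypothetical helper: warm the file cache, then time a command several times.
# Usage: time_warm FILE RUNS command [args...]
time_warm() {
    file=$1
    runs=$2
    shift 2
    cat "$file" > /dev/null          # the first read pays any cold-cache cost
    i=1
    while [ "$i" -le "$runs" ]; do
        time -p "$@" > /dev/null     # timing goes to stderr, command output is dropped
        i=$((i + 1))
    done
}

# Example call (path and run count are placeholders):
#   time_warm /tmp/100-mb.txt 3 mawk -F\| '$2 ~ "REQUEST"{s=$0;next} s && $2~"RESPONSE"{print s RS $0; s=x}' /tmp/100-mb.txt
```

If the timings of the repeated runs agree with each other, a caching effect is unlikely to explain the difference between tools.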
# 13  
Old 05-30-2012
The input file matters

With an input file similar to your Moby Dick text, not directly related to the problem at hand in this thread (and with which there were no matches), I also get a factor of 5 difference between gawk and mawk, so your result may be a compile thing? The difference between cgrep and mawk is a factor of 6.

With an input file that is a large version of the input file of the problem in this thread, mawk and cgrep are about the same speed, with mawk being 5-10% faster than cgrep, while the difference between mawk and gawk was still a factor of 5 to 5.5.
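
To repeat the comparison on input that resembles the thread's problem, a test file can be generated instead of reusing unrelated text. This is a sketch under stated assumptions: `make_log` is a hypothetical helper, and the record layout (an id, then REQUEST, RESPONSE or SUCCESS, then "whatever") is only modelled on the sample lines in this thread.

```shell
#!/bin/sh
# Hypothetical generator for a pipe-separated test log.
# $1 = number of lines to write; the three record types repeat in a fixed cycle,
# so every REQUEST line is directly followed by a RESPONSE line.
make_log() {
    awk -v n="$1" 'BEGIN {
        split("REQUEST RESPONSE SUCCESS", kind, " ")
        for (i = 1; i <= n; i++)
            printf "%d | %s | whatever\n", 2000 + int((i - 1) / 3), kind[(i - 1) % 3 + 1]
    }'
}

make_log 6
```

Redirecting a larger run to a file (for example `make_log 3000000 > /tmp/req.txt`, where both the count and the path are placeholders) gives an input with many matches, in contrast to a text file with none.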
# 14  
Old 05-31-2012
Quote:
Originally Posted by Franklin52
Try this:
Code:
awk -F"|" '$2 ~ "REQUEST" {s=$0;f=1;next} f && $2 ~ "RESPONSE" {print s RS $0;f=0}' file

Thanks, this worked for me
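
One caveat with this one-liner: `f` stays set until the next RESPONSE line, so a REQUEST followed by some other line and then a RESPONSE is still printed as a pair. If only directly adjacent lines should match, one option is to clear the saved line whenever any non-REQUEST line is seen. The sketch below is an assumption about what "consecutive" should mean, not a correction endorsed in the thread. `strict_pairs` is a hypothetical name and the sample lines are made up.

```shell
#!/bin/sh
# Hypothetical stricter variant: print a pair only when RESPONSE directly follows REQUEST.
strict_pairs() {
    awk -F'|' '
        $2 ~ "REQUEST"       { s = $0; next }     # candidate first line of a pair
        s && $2 ~ "RESPONSE" { print s RS $0 }    # RESPONSE on the very next line
                             { s = "" }           # any non-REQUEST line ends the candidate
    ' "$@"
}

printf '%s\n' \
    '2996 | REQUEST | whatever' \
    '2010 | SUCCESS | whatever' \
    '2013 | RESPONSE | whatever' \
    '2014 | REQUEST | whatever' \
    '2014 | RESPONSE | whatever' | strict_pairs
```

Here only the two 2014 lines are printed. The quoted one-liner would also print the 2996 REQUEST line together with the 2013 RESPONSE line, because the SUCCESS line in between does not reset its flag.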

---------- Post updated at 01:36 AM ---------- Previous update was at 01:34 AM ----------

Quote:
Originally Posted by drl
This is a quickly-put-together script: [script and output as in post # 12]
Hello,

Thank you very much for your effort; it looks like very good craftsmanship. Unfortunately, I cannot test it anywhere, as I don't have cgrep on any of my machines.