Location: Saint Paul, MN USA / BSD, CentOS, Debian, OS X, Solaris
Posts: 2,288
Thanks Given: 430
Thanked 480 Times in 395 Posts
Brief timing note
Hi.
I created a 12 MB file that had the first pattern at the end of the file, and the other not present.
I timed 5 solutions: alister's suggested 2-grep, a similar method using the more fully featured cgrep, rapgrep (a Perl script: "require all patterns"), glark (a Ruby program), and an awk script. The 2-grep, 2-cgrep, and awk solutions were the fastest, in that order, and far faster than the Perl and Ruby programs.
That 2 passes through the file would be faster than a single pass surprised me. The grep family appears to be very well-written, as is the gawk processor ... cheers, drl
Did you happen to test my suggestions as well? In brief testing I found them to be even faster, possibly due to the fact that the grep moves on to the next file after the first match and only those files that contain a match for the first pattern are grepped for the second...
Hi, Scrutinizer.
In a way I did.
The "optimized" part of the timed code for the grep segment went like this:
I think that probably does much the same thing as your suggestion. I used alister's suggestion as the base.
Note that in the case of the 12 MB file, this probably does not matter because the first string to be matched is specifically placed in the last line forcing the first grep to go all the way through, succeeding, and then the second grep to take place, failing -- a worst-case situation. However, either the -l option or the -m 1 option should work the same -- i.e. bail out at the first match, although I admit that I did not compare one to the other. In a production environment, one or the other should be used to avoid wasted time.
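For anyone comparing the two options, here is a minimal sketch of how they differ in what they print. It is not the timed code from this thread; the file and variable names are made up for the demo, and -m is a GNU/BSD extension rather than POSIX.

```shell
#!/bin/sh
# Hypothetical sample file with two matching lines.
sample=$(mktemp)
printf 'fiber one\nfiber two\n' > "$sample"

# -l prints the file name and stops reading at the first match.
name_only=$(grep -l -- 'fiber' "$sample")

# -m 1 (GNU/BSD extension, not POSIX) also stops after one match,
# but prints the matching line rather than the file name.
first_line=$(grep -m 1 -- 'fiber' "$sample")

printf 'with -l:   %s\n' "$name_only"
printf 'with -m 1: %s\n' "$first_line"
```

Both bail out early, but only -l yields a list of file names that can be handed directly to a second grep.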
I did something similar for cgrep -- the option differs in syntax but is the same semantically.
I did not use the glark feature to run recursively because I wanted most of the infrastructure to be the same -- find, xargs, etc. The find, xargs construction accounts for another possible worst-case, where one might have many (too many) files that pass the first test.
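A rough sketch of the find, xargs, 2-grep shape described above, for readers who want to try it. This is an illustration under assumptions, not the exact code that was timed: the scratch directory, file names, and variable names are invented, and the plain newline-separated pipeline assumes file names without whitespace.

```shell
#!/bin/sh
# Hypothetical demo data: three small files in a scratch directory.
demo_dir=$(mktemp -d)
printf 'alpha\nfiber\n' > "$demo_dir/both.txt"
printf 'fiber only\n'   > "$demo_dir/first_only.txt"
printf 'alpha only\n'   > "$demo_dir/second_only.txt"

# Stage 1: list the files that contain the first pattern
#          (-l stops reading each file at its first match).
# Stage 2: grep only those files for the second pattern.
# xargs splits long file lists into batches, so neither stage
# runs into the "too many arguments" limit.
matches=$(find "$demo_dir" -type f -print |
    xargs grep -l -- 'fiber' |
    xargs grep -l -- 'alpha')

printf '%s\n' "$matches"
```

Only both.txt survives both stages. Where file names may contain spaces or newlines, the null-separated variants (find -print0 with xargs -0, both non-POSIX extensions) would be the safer choice.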
Thanks for your observations & feedback ... cheers, drl
Thanks for your observations & feedback in return. I guess the observed difference probably stems from calling the grep program repeatedly inside a loop, versus two single calls to grep that each scan all the required files in one go...
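To make the loop-versus-batch contrast concrete, here is a small sketch. The loop is only a guess at the general shape of a per-file approach, not the code actually posted; the demo files and variable names are hypothetical.

```shell
#!/bin/sh
# Hypothetical demo data.
demo_dir=$(mktemp -d)
printf 'fiber\n'        > "$demo_dir/a.txt"
printf 'alpha\nfiber\n' > "$demo_dir/b.txt"

# Per-file loop: starts up to two grep processes for every file.
loop_hits=$(for f in "$demo_dir"/*.txt; do
    if grep -q -- 'fiber' "$f" && grep -q -- 'alpha' "$f"; then
        printf '%s\n' "$f"
    fi
done)

# Batch form: one grep per pattern, each scanning many files per call.
batch_hits=$(grep -l -- 'fiber' "$demo_dir"/*.txt | xargs grep -l -- 'alpha')

printf 'loop:  %s\nbatch: %s\n' "$loop_hits" "$batch_hits"
```

Both forms report the same file; the difference is only in how many processes are started, which can be checked by wrapping each form in time on a directory with many files.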
S.
---------- Post updated 14-06-10 at 08:16 ---------- Previous update was 13-06-10 at 23:58 ----------
Quote:
Originally Posted by drl
Hi.
.. The grep family appears to be very well-written, as is the gawk processor ... cheers, drl
Grep is very efficient indeed. Also have a look at mawk. In my experience it usually beats gawk on speed...
Hi.
Indeed, mawk is slightly faster than gawk. The mawk variant is occasionally mentioned in comp.lang.awk. It is in the Debian repositories, so I often have it available, but I rarely install it on the other distributions that I run as test installs in virtual machines. For example, it's not present in the standard repositories for openSUSE 11.2 "Emerald".
Both claim to be standards-compliant, although the exact phrase for gawk is " ... almost completely POSIX 1003.2 compliant ..."
Here are the versions and timings for one run against a 12 MB file that has "fiber" only at the very end of the file, and "alpha" does not occur at all -- using the techniques discussed earlier:
cheers, drl