I have a large number of input files with two columns of numbers.
For example:
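(Illustrative rows only; the real files are simply two whitespace-separated integer columns per line.)

83    1540
83    2987
17    1200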
I only wish to retain lines where the numbers fulfil two requirements, e.g.:
[X]=83
1000<=[Y]<=2000
To do this I use the following command:
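In sketch form (file names here are placeholders), the filter is a straightforward numeric comparison on both fields:

awk '$1 == 83 && $2 >= 1000 && $2 <= 2000' infile > outfile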
PROBLEM: My input files contain >50 million lines, so the awk command is too slow (it takes >2 minutes and I have thousands of input files). Is there a way to make it faster? I have been told that it would be faster if I use Perl.
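For reference, a Perl one-liner doing the same numeric test would look roughly like this (a sketch using autosplit into @F; not a tuned drop-in):

perl -ane 'print if $F[0] == 83 && $F[1] >= 1000 && $F[1] <= 2000' infile > outfile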
I don't think using an awk regex instead of a simple = is going to make it faster, either...
That's a reasonable assumption, but it turns out to be incorrect (at least with the implementation I tested).
In my testing, the following code is over three times faster than the original solution:
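The idea is to let a single ERE do both tests instead of two numeric comparisons; a sketch of that approach (the separator here is a single literal space, matching my test data, so adjust it for tabs or runs of blanks):

awk '/^83 (1[0-9][0-9][0-9]|2000)$/' infile > outfile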
I'm curious to know if this solution is also faster on other implementations (nawk and gawk, specifically), but I won't be able to test on them today.
I used an obsolete Linux system for all of my testing.
Hardware: Pentium 2 @ 350 MHz (can you feel the power?)
Software: awk is mawk 1.3.3, perl 5.8.8, GNU (e)grep 2.5.1, GNU sed 4.1.5, GNU coreutils 5.97 (cat, wc)
Data: 14 megabytes. 6-line repeating pattern. 1,783,782 lines. 297,297 matches.
Slowest to fastest:
Most surprising to me is how long it takes GNU sed to do nothing.
For everyone's amusement (GNU bash 3.1.17):
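Something along these lines — a pure-bash read loop with the built-in ERE matching (a sketch, not necessarily the exact trinket that was timed; keeping the regex in a variable is the portable way to use =~):

re='^83[[:blank:]]+(1[0-9][0-9][0-9]|2000)$'
while read -r line; do
    [[ $line =~ $re ]] && printf '%s\n' "$line"
done < infile > outfile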
Regards,
Alister
Quote:
Originally Posted by Klashxx
If the first value is fixed, try:
That regular expression says that the space is optional. That's probably not a good idea. The way it's written, 832999 2000 would match.
Quote:
Originally Posted by jayan_jay
That may require an anchor at the beginning, ^, if numbers with more than 3 digits are possible in the first column. Also, the $ anchor should probably be moved so that it's just after the parenthesized group (for a similar reason).
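With both anchors placed as suggested, the expression takes this general form (a sketch; [[:blank:]]+ stands for the run of spaces/tabs between the two columns):

grep -E '^83[[:blank:]]+(1[0-9][0-9][0-9]|2000)$' infile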
Regards,
Alister
Another factor that might prove important is which awk or which grep is used.
For example, when using the same extended regex ^83 *(1[0-9][0-9][0-9]|2000)$,
I got the following results:
gawk       1m20s
awk        25s
mawk       7s
grep -E    39s
cgrep -E   2s
For comparison, perl needed 25s...
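For reference, the invocations would be of this general form (a sketch; the data file name is a placeholder and output is discarded so only the matching cost is measured):

time mawk '/^83 *(1[0-9][0-9][0-9]|2000)$/' data > /dev/null
time grep -E '^83 *(1[0-9][0-9][0-9]|2000)$' data > /dev/null
time perl -ne 'print if /^83 *(1[0-9][0-9][0-9]|2000)$/' data > /dev/null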
--
@alister, the results of tests 1, 3 and the bash loop may be flawed, because the regex or pattern match does not match the lines of the OP's input spec.
Quote:
Originally Posted by Scrutinizer
Another factor that might prove important is which awk or which grep is used.
Absolutely. GNU tools in particular tend to be slower than their counterparts.
Quote:
Originally Posted by Scrutinizer
@alister, the results of tests 1, 3 and the bash loop may be flawed, because the regex or pattern match does not match the lines of the OP's input spec.
Whoops. My test data was delimited by a single space, so the output of the commands would be correct, but the time was slightly underestimated due to the simpler regular expression.
Using ed, I replaced the single space in each line with a <space><tab><space> sequence. I re-ran the tests, replacing the <space> in the regular expression with [<space><tab>]+, and the time for each test increased by 1 to 3 seconds with the rankings unchanged.
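For anyone repeating this, the same substitution can be done with sed (a sketch; $(printf '\t') supplies the literal tab, and the s command replaces the first — and only — space on each line with <space><tab><space>):

sed "s/ / $(printf '\t') /" infile > infile.tabbed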
Interesting observation: character classes really slowed down GNU grep.
egrep '^83[[:blank:]]+... takes twice as long as egrep '^83[ <tab>]+..., 30s versus 15s. With perl, the difference was approximately 0.6s.
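Spelled out, the two variants compared are of this general form (a sketch; the tab is injected via a shell variable so the second pattern contains a literal tab character):

t=$(printf '\t')
egrep '^83[[:blank:]]+(1[0-9][0-9][0-9]|2000)$' data      # character class: ~30s here
egrep "^83[ $t]+(1[0-9][0-9][0-9]|2000)\$" data           # literal space/tab: ~15s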
As for the bash trinket, I won't bother fixing that. I'm not _that_ bored.