Making a faster alternative to a slow awk command


 
# 1  
Old 07-04-2012
Making a faster alternative to a slow awk command

Hi,

I have a large number of input files with two columns of numbers.

For example:
Code:
[X]    [Y]
83     1453
99     3255
99     8482
99     7372
83     175

I only wish to retain lines where the numbers fulfill two requirements, e.g.:
[X]=83
1000<=[Y]<=2000

To do this I use the following command:
Code:
awk '($1==83) &&  $2>=1000 && $2<=2000' [inputfile]

PROBLEM: My input files contain >50 million lines, so the awk command is too slow (it takes >2 minutes and I have thousands of input files). Is there a way to make it faster? I have been told that it would be faster if I use Perl.
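
For reference, the direct Perl translation I would try looks something like this (just a sketch; I have not verified that it is actually any faster than the awk version):
Code:
perl -ne 'my ($x, $y) = split; print if $x == 83 && $y >= 1000 && $y <= 2000' [inputfile]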

Last edited by Scrutinizer; 07-04-2012 at 04:46 PM.. Reason: code tags also for data sample
# 2  
Old 07-04-2012
If the first value is fixed try:
Code:
awk '/^83 *[12][0-9][0-9][0-9]/{if($2>=1000 && $2<=2000){print}}' infile

# 3  
Old 07-04-2012
Performance not tested... To avoid the arithmetic comparisons, try egrep:
Code:
$ egrep "83 (1...$|2000)" infile
83 1453
$

# 4  
Old 07-04-2012
The problem, really, is that you have a huge amount of data, not a slow program. How big are your records, really?

I haven't heard perl suggested to increase speed before. I don't think using an awk regex instead of a simple == is going to make it faster, either...

The egrep solution might be worth a shot.
# 5  
Old 07-04-2012
Quote:
Originally Posted by Corona688
I don't think using an awk regex instead of a simple == is going to make it faster, either...
That's a reasonable assumption, but it turns out to be incorrect (at least with the implementation I tested).

In my testing, the following code is over three times faster than the original solution:
Code:
awk '/^83  *(1[0-9][0-9][0-9]|2000)$/' data

I'm curious to know if this solution is also faster on other implementations (nawk and gawk, specifically), but I won't be able to test on them today.

I used an obsolete Linux system for all of my testing.

Hardware: Pentium 2 @ 350 MHz (can you feel the power?)
Software: awk is mawk 1.3.3, perl 5.8.8, GNU (e)grep 2.5.1, GNU sed 4.1.5, GNU coreutils 5.97 (cat, wc)
Data: 14 megabytes. 6 line repeating pattern. 1,783,782 lines. 297,297 matches.
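
If anyone wants to regenerate something similar, a loop like the one below produces a file with the same line and match counts (this is my own guess at the 6-line repeating pattern, not necessarily the exact file I used):
Code:
awk 'BEGIN {
    # repeat a 6-line block 297297 times -> 1,783,782 lines, 297,297 matches
    for (i = 0; i < 297297; i++) {
        print "83 1453"; print "99 3255"; print "99 8482"
        print "99 7372"; print "83 175";  print "83 2500"
    }
}' > data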

Slowest to fastest:
Code:
$ time egrep '^83 (1...|2000)$' data > /dev/null

real    0m15.170s
user    0m15.089s
sys     0m0.080s

$ time awk '$1==83 && $2>=1000 && $2<=2000' data > /dev/null

real    0m11.325s
user    0m11.213s
sys     0m0.112s

$ time perl -ne 'print if /^83 (1[0-9][0-9][0-9]|2000)$/' data > /dev/null

real    0m9.728s
user    0m9.629s
sys     0m0.100s

$ time sed d data

real    0m8.357s
user    0m8.277s
sys     0m0.080s

$ time awk '/^83  *[12][0-9][0-9][0-9]$/ {if ($2>=1000 && $2<=2000) print}' data > /dev/null

real    0m6.809s
user    0m6.692s
sys     0m0.116s

$ time awk '/^83  *(1[0-9][0-9][0-9]|2000)$/' data > /dev/null

real    0m3.555s
user    0m3.404s
sys     0m0.152s

$ time awk 0 data

real    0m1.898s
user    0m1.832s
sys     0m0.068s

$ time wc -l data > /dev/null

real    0m0.721s
user    0m0.316s
sys     0m0.128s

$ time cat data > /dev/null

real    0m0.084s
user    0m0.012s
sys     0m0.072s

Most surprising to me is how long it takes GNU sed to do nothing.


For everyone's amusement (GNU bash 3.1.17):
Code:
$ cat match.sh
while read -r line; do
    case $line in
        83\ 1???|83\ 2000) echo $line;;
    esac
done

$ time sh match.sh < data > /dev/null

real    6m53.128s
user    6m28.776s
sys     0m24.150s

Regards,
Alister

---------- Post updated at 01:23 PM ---------- Previous update was at 01:17 PM ----------

Quote:
Originally Posted by Klashxx
If the first value is fixed try:
Code:
awk '/^83 *[12][0-9][0-9][0-9]/{if($2>=1000 && $2<=2000){print}}' infile

That regular expression says that the space is optional. That's probably not a good idea. The way it's written, a line like 832999 2000 would match the regex, and since its $2 is 2000 it would also pass the range test and be printed.
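
Requiring at least one space and anchoring the end of the line avoids that (it's the variant I timed above):
Code:
awk '/^83  *[12][0-9][0-9][0-9]$/ {if ($2>=1000 && $2<=2000) print}' infile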


Quote:
Originally Posted by jayan_jay
Code:
$ egrep "83 (1...$|2000)" infile
83 1453
$

That may require an anchor at the beginning, ^, if first-column numbers with more than two digits are possible (e.g., 183 1453 would also match). Also, the $ anchor should probably be moved so that it's just after the parenthesized group, so that the 2000 alternative is anchored too (for a similar reason).
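
In other words, something closer to the version benchmarked above:
Code:
egrep '^83 (1...|2000)$' infile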

Regards,
Alister

Last edited by alister; 07-04-2012 at 03:09 PM.. Reason: Added perl version information
# 6  
Old 07-04-2012
Thanks alister...

Another factor that might prove important is which awk or which grep is used.
For example, when using the same extended regex ^83 *(1[0-9][0-9][0-9]|2000)$
I got the following results:
gawk: 1m20s
awk: 25s
mawk: 7s
grep -E: 39s
cgrep -E: 2s
For comparison, perl needed 25s...
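
So if a faster implementation such as mawk happens to be installed, it can pay to call it explicitly instead of whatever awk is the default (a sketch; availability and names vary per system):
Code:
# use mawk when available, otherwise fall back to the default awk
command -v mawk >/dev/null 2>&1 && AWK=mawk || AWK=awk
$AWK '($1==83) && $2>=1000 && $2<=2000' infile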


--
@alister, the results of tests 1 and 3 and of the bash loop may be flawed, because their regexes/patterns do not match the lines of the OP's input spec.
# 7  
Old 07-04-2012
Quote:
Originally Posted by Scrutinizer
Another factor that might prove important is which awk or which grep is used.
Absolutely. GNU tools in particular tend to be slower than their counterparts.

Quote:
Originally Posted by Scrutinizer
@alister, results of tests 1,3 and the bash loop may be flawed because the regex or pattern match do not match the lines of the OP's input spec..
Whoops. My test data was delimited by a single space, so the output of the commands would be correct, but the times were slightly underestimated due to the simpler regular expression.

Using ed, I replaced the single space in each line with a <space><tab><space> sequence. I re-ran the tests, replacing the <space> in the regular expression with [<space><tab>]+, and the time for each test increased by 1 to 3 seconds with the rankings unchanged.
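
For the record, the re-run awk test looked roughly like this, where <tab> stands for a literal tab character (a reconstruction, not a copy-paste of the exact command line):
Code:
awk '/^83[ <tab>]+(1[0-9][0-9][0-9]|2000)$/' data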

Interesting observation: character classes really slowed down GNU grep.

egrep '^83[[:blank:]]+... takes twice as long as egrep '^83[ <tab>]+..., 30s versus 15s. With perl, the difference was approximately 0.6s.

As for the bash trinket, I won't bother fixing that. I'm not _that_ bored.

Thanks for living up to your nick.

Regards,
Alister