I have a large number of input files with two columns of numbers.
For example:
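(Illustrative rows only; the real files are simply two whitespace-separated integer columns per line.)

83    1540
83    2987
17    1200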
I only wish to retain lines where the numbers fulfil two requirements, e.g.:
[X]=83
1000<=[Y]<=2000
To do this I use the following command:
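In sketch form (file names here are placeholders), the filter is a straightforward numeric comparison on both fields:

awk '$1 == 83 && $2 >= 1000 && $2 <= 2000' infile > outfile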
PROBLEM: My input files contain >50 million lines, so the awk command is too slow (it takes >2 minutes and I have thousands of input files). Is there a way to make it faster? I have been told that it would be faster if I use Perl.
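For reference, a Perl one-liner doing the same numeric test would look roughly like this (a sketch using autosplit into @F; not a tuned drop-in):

perl -ane 'print if $F[0] == 83 && $F[1] >= 1000 && $F[1] <= 2000' infile > outfile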
I don't think using an awk regex instead of a simple = is going to make it faster, either...
That's a reasonable assumption, but it turns out to be incorrect (at least with the implementation I tested).
In my testing, the following code is over three times faster than the original solution:
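The idea is to let a single ERE do both tests instead of two numeric comparisons; a sketch of that approach (the separator here is a single literal space, matching my test data, so adjust it for tabs or runs of blanks):

awk '/^83 (1[0-9][0-9][0-9]|2000)$/' infile > outfile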
I'm curious to know if this solution is also faster on other implementations (nawk and gawk, specifically), but I won't be able to test on them today.
I used an obsolete Linux system for all of my testing.
Hardware: Pentium 2 @ 350 MHz (can you feel the power?)
Software: awk is mawk 1.3.3, perl 5.8.8, GNU (e)grep 2.5.1, GNU sed 4.1.5, GNU coreutils 5.97 (cat, wc)
Data: 14 megabytes. 6-line repeating pattern. 1,783,782 lines. 297,297 matches.
Slowest to fastest:
Most surprising to me is how long it takes GNU sed to do nothing.
For everyone's amusement (GNU bash 3.1.17):
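Something along these lines — a pure-bash read loop with the built-in ERE matching (a sketch, not necessarily the exact trinket that was timed; keeping the regex in a variable is the portable way to use =~):

re='^83[[:blank:]]+(1[0-9][0-9][0-9]|2000)$'
while read -r line; do
    [[ $line =~ $re ]] && printf '%s\n' "$line"
done < infile > outfile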
Regards,
Alister
Quote:
Originally Posted by Klashxx
If the first value is fixed, try:
That regular expression says that the space is optional. That's probably not a good idea. The way it's written, 832999 2000 would match.
Quote:
Originally Posted by jayan_jay
That may require an anchor at the beginning, ^, if numbers with more than 3 digits are possible in the first column. Also, the $ anchor should probably be moved so that it's just after the parenthesized group (for a similar reason).
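With both anchors placed as suggested, the expression takes this general form (a sketch; [[:blank:]]+ stands for the run of spaces/tabs between the two columns):

grep -E '^83[[:blank:]]+(1[0-9][0-9][0-9]|2000)$' infile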
Regards,
Alister
Another factor that might prove important is which awk or which grep is used.
For example, when using the same extended regex ^83 *(1[0-9][0-9][0-9]|2000)$,
I got the following results:
gawk       1m20s
awk        25s
mawk       7s
grep -E    39s
cgrep -E   2s
For comparison, perl needed 25s...
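For reference, the invocations would be of this general form (a sketch; the data file name is a placeholder and output is discarded so only the matching cost is measured):

time mawk '/^83 *(1[0-9][0-9][0-9]|2000)$/' data > /dev/null
time grep -E '^83 *(1[0-9][0-9][0-9]|2000)$' data > /dev/null
time perl -ne 'print if /^83 *(1[0-9][0-9][0-9]|2000)$/' data > /dev/null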
--
@alister, the results of tests 1, 3 and the bash loop may be flawed, because the regex or pattern match does not match the lines of the OP's input spec.
Quote:
Originally Posted by Scrutinizer
Another factor that might prove important is which awk or which grep is used.
Absolutely. GNU tools in particular tend to be slower than their counterparts.
Quote:
Originally Posted by Scrutinizer
@alister, the results of tests 1, 3 and the bash loop may be flawed, because the regex or pattern match does not match the lines of the OP's input spec.
Whoops. My test data was delimited by a single space, so the output of the commands would be correct, but the time was slightly underestimated due to the simpler regular expression.
Using ed, I replaced the single space in each line with a <space><tab><space> sequence. I re-ran the tests, replacing the <space> in the regular expression with [<space><tab>]+, and the time for each test increased by 1 to 3 seconds with the rankings unchanged.
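For anyone repeating this, the same substitution can be done with sed (a sketch; $(printf '\t') supplies the literal tab, and the s command replaces the first — and only — space on each line with <space><tab><space>):

sed "s/ / $(printf '\t') /" infile > infile.tabbed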
Interesting observation: character classes really slowed down GNU grep.
egrep '^83[[:blank:]]+... takes twice as long as egrep '^83[ <tab>]+..., 30s versus 15s. With perl, the difference was approximately 0.6s.
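Spelled out, the two variants compared are of this general form (a sketch; the tab is injected via a shell variable so the second pattern contains a literal tab character):

t=$(printf '\t')
egrep '^83[[:blank:]]+(1[0-9][0-9][0-9]|2000)$' data      # character class: ~30s here
egrep "^83[ $t]+(1[0-9][0-9][0-9]|2000)\$" data           # literal space/tab: ~15s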
As for the bash trinket, I won't bother fixing that. I'm not _that_ bored.