How to make awk command faster for a large amount of data?

# 8  
Old 10-02-2018
Assuming your input is sorted by date (which seems likely, given logfiles), processing files individually makes another really big optimization possible - once the date exceeds the cutoff, quit! You might skip entire files.

Code:
...
gunzip < "$FILE" | awk '$3 > "[20/Jun/2018:22:00:00" { exit } ; {...}' > /tmp/$$/$FNAME &
...
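
Fleshed out, the multi-file version might look roughly like this (the batch size of 4, the { print } filter body, and the 22:00 cutoff are only placeholders for whatever your real script does):

Code:
#!/bin/sh
# Rough sketch: run the early-exit awk on several compressed logs at once,
# merging the per-file results at the end.
mkdir -p /tmp/$$
i=0
for FILE in nginx*.gz
do
        FNAME=${FILE%.gz}
        gunzip < "$FILE" | awk '$3 > "[20/Jun/2018:22:00:00" { exit } { print }' > "/tmp/$$/$FNAME" &

        # Crude throttle: after starting 4 jobs, wait for the whole batch before starting more.
        i=$((i + 1))
        [ $((i % 4)) -eq 0 ] && wait
done
wait                            # catch the last, possibly partial, batch
cat /tmp/$$/nginx* > output     # merge (re-sort afterwards if strict time order matters)
rm -rf /tmp/$$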

That may be worth trying even without multithreading, actually.

Code:
for FILE in nginx*
do
        gunzip < "$FILE" | awk '$3 > "[20/Jun/2018:22:00:00" { exit } ; {...}'
done > output

# 9  
Old 10-02-2018
Corona688's reasoning is absolutely correct. I created a large file and timed several awk selection methods:

Code:
time awk '{if ($9 == "(200)" && $3 > "[13/Jul/2018:17:00:00" && $3 < "[13/Jul/2018:21:00:00") print}' file > /dev/null

real    0m0,794s
user    0m0,725s
sys    0m0,068s
time awk '$9 == "(200)" {if ($3 > "[13/Jul/2018:17:00:00" && $3 < "[13/Jul/2018:21:00:00") print}' file > /dev/null

real    0m0,787s
user    0m0,733s
sys    0m0,052s
time awk '$9 == "(200)" {X = substr ($3, 14); if (X > "17:00:00" && X < "21:00:00") print}' file > /dev/null

real    0m0,806s
user    0m0,732s
sys    0m0,072s
time awk '$9 == "(200)" {X = substr ($3, 14); if (X < "17:00:00") next; if (X > "21:00:00") exit; print}' file > /dev/null

real    0m0,775s
user    0m0,676s
sys    0m0,093s
time awk '$9 == "(200)" {X = substr ($3, 14); if (X ~ /1[7-9]|2[01]:[0-5][0-9]:[0-5][0-9]$/) print}' file > /dev/null

real    0m0,827s
user    0m0,727s
sys    0m0,078s

All of them showed very little variation in execution time. You'd better focus on the data supply / file access.
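
To see whether the awk logic or the data supply dominates on your own system, a rough check (using one of your compressed logs) is to time the decompression alone against decompression plus the filter; if the two are close, only reading and decompressing less data will help:

Code:
# Decompression only
time gunzip < nginx1.gz > /dev/null

# Decompression plus the awk filter
time gunzip < nginx1.gz | awk '$9 == "(200)" && $3 > "[13/Jul/2018:17:00:00" && $3 < "[13/Jul/2018:21:00:00"' > /dev/null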

# 10  
Old 10-04-2018
Quote:
Originally Posted by Corona688
How big are your files, how long do they take, and how fast is your disk? If you're hitting throughput limits optimizing your programs won't help one iota. If you're not, however, you can process several files at once for large gains.

File sizes vary; for instance, some files are ~300MB, others ~1GB or ~2GB.
My disk is not operating at full capacity while running awk, so I think your suggestion to read multiple files at once will benefit me.

Although I can't tell the datetime just from the file name, I can leverage the fact that the log files are ordered to choose which files I'll read, as was suggested here. I'll give an example to illustrate.

I have 832 files in one directory totaling 100GB; let's say nginx1.gz, nginx2.gz, ..., nginx832.gz.
The first line of nginx1.gz has [11/Jul/2018:18:00:01 and the last line [11/Jul/2018:21:00:01.
The first line of nginx2.gz also has [11/Jul/2018:18:00:01 and the last line [11/Jul/2018:21:00:01.

Naturally, nginx2.gz should start at the same time as nginx1.gz or later.
I could do what you've suggested with this code:

Code:
        gunzip < "$FILE" | awk '$3 > "[20/Jun/2018:22:00:00" { exit } ; {...}'

However, I'd like to avoid reading more than just the 2 hours of logs I actually need. So, I thought of reading the first and last line of each file and deciding whether to read the file at all. For instance:
the datetime of the 1st line would come from: zcat file1.gz | head -n 1
the datetime of the last line would come from: zcat file1.gz | tail -n 1

So, there are two cases where I could skip reading a file entirely:
1 - when both the 1st and last lines are before the time I want
2 - when the 1st line is after the time I want (as you've suggested)
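
Roughly, the check I have in mind looks like this (START, END and the final print are just placeholders for my real window and filter):

Code:
#!/bin/bash
# Placeholder window; the real values depend on the report I need.
START="[20/Jun/2018:20:00:00"
END="[20/Jun/2018:22:00:00"

for FILE in nginx*.gz
do
        first=$(gunzip < "$FILE" | head -n 1 | awk '{print $3}')   # datetime of the 1st line
        last=$(gunzip < "$FILE" | tail -n 1 | awk '{print $3}')    # datetime of the last line

        [[ "$last" < "$START" ]] && continue    # case 1: the whole file is before the window
        [[ "$first" > "$END" ]] && continue     # case 2: the whole file is after the window

        gunzip < "$FILE" | awk -v end="$END" '$3 > end { exit } { print }'   # real filter goes here
done > output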

This way I think it'll be much faster. I've finished this modification and I'm testing it right now.
If everything is all right, I'll test your suggestion of reading multiple files at once and report back with the results.

Thank you very much for your help. I didn't know about wait, and it'll probably help me.
# 11  
Old 10-04-2018
Unfortunately you can't seek to the end of a compressed file without decompressing the entire thing, so it really doesn't save you time. Otherwise that'd be a really good idea.
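
You can see this for yourself by timing both ends of one of your files:

Code:
# The first line comes back almost immediately (head exits, so gunzip stops early)...
time gunzip < nginx1.gz | head -n 1

# ...but getting the last line forces the whole file to be decompressed.
time gunzip < nginx1.gz | tail -n 1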

How big is your data, and how long does it take to process? These numbers will tell us how much improvement is possible worst-case.
# 12  
Old 10-04-2018
Expanding on what Corona688 said, your two commands:
Code:
zcat file1.gz | head -n 1
zcat file1.gz | tail -n 1

decompress the file twice (maybe not completing the first decompression), and if you find that there is some data in that file that you need, you'll then decompress it again for your awk script to process.

I would strongly suggest creating a separate text file that contains the timestamp of the first record in each compressed file and the name of that compressed file. (And, add a new entry to the end of that file each time you create a new compress log file.) Then you can look at that (uncompressed) text file to quickly determine which compressed file(s) you need to uncompress and feed to your awk script to get the records you want for any particular timestamp range.
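
As a rough sketch (the file names, field positions, and cutoff timestamps here are only examples), building and then using such an index might look something like this:

Code:
# Build the index once: "first-timestamp filename", one line per compressed log.
# (Append one line whenever a new log file is rotated in, so it never has to be rebuilt.)
for FILE in nginx*.gz
do
        printf '%s %s\n' "$(gunzip < "$FILE" | head -n 1 | awk '{print $3}')" "$FILE"
done > logindex.txt

# Select only the files that can overlap the window [start, end]: because the logs are
# sequential, file i covers roughly [first_i, first_i+1], so it is needed when
# first_i <= end and first_i+1 >= start (the last file is included conservatively).
awk -v start="[20/Jun/2018:20:00:00" -v end="[20/Jun/2018:22:00:00" '
        NR > 1 && prevts <= end && $1 >= start { print prevfile }
        { prevts = $1; prevfile = $2 }
        END { if (prevts <= end) print prevfile }
' logindex.txt |
while read -r FILE
do
        gunzip < "$FILE" | awk '$3 > "[20/Jun/2018:22:00:00" { exit } { print }'   # real filter goes here
done > output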
# 13  
Old 10-04-2018
I have ~120GB of compressed files per day. It's taking many hours to process:

Code:
real    217m15.559s
user    763m17.030s
sys     87m40.926s


I didn't know that using tail -n 1 would decompress the entire file. But shouldn't it still take less time, since it doesn't compare the 3rd column of every line? For instance, if a file has 100,000,000 lines, that's 100,000,000 comparisons I could eliminate whenever its last line is before the time I want.
# 14  
Old 10-04-2018
The comparison is taking a trivial amount of time compared to the time required to:
  1. read your compressed data,
  2. uncompress your data,
  3. write your uncompressed data into a pipe, and
  4. read your uncompressed data from the pipe.
Anything you can do to eliminate those four steps for data that can't match your desired time range will yield huge benefits in run-time reduction.

Increasing the number of times you perform those four steps will increase your run times, not decrease them.