Quote:
Originally Posted by
Don Cragun
I must be missing something here. But since the starting and ending timestamps in the awk code in the sample pipeline are on the same date, and the times are in 24-hour format (not 12-hour with AM/PM), I see no reason to convert the two string arguments to Seconds since the Epoch values and perform numeric comparisons on those converted values instead of comparing the input values as strings. Furthermore, performing the string comparisons should be faster than converting the strings to integers and then performing a numeric comparison. However, if the start and end timestamps are on different dates, the comments made by vgersh99 and jim mcnamara are absolutely correct.
I have never heard of the unpigz command used at the head of the pipeline, and I have no idea how the files matched by the pattern nginx* are named, nor how big they are. If there are lots of huge compressed files and unpigz is being used to produce uncompressed text from all of those files as input to awk (or if unpigz is a typo and the intended utility at the start of the pipeline was gunzip -c or, equivalently, zcat), and if the part of the name matched by the asterisk in nginx* encodes the dates contained in that file, the way to speed up your pipeline might well be to select a smaller set of files to uncompress instead of trying to speed up the awk code. The slow part of your pipeline may well be the time needed to uncompress unneeded data and then filter that unneeded data out in your awk code.
First, as you said, comparing strings should be faster than converting them to times for a numeric comparison later; that is exactly what I think happens, which is why my code is written the way it is, and it works just fine. I have one directory per day, and I ran awk on the files of just one day, so the string comparison works because of that.
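For context, the comparison in question is roughly of this shape (a sketch only; the field number and the exact timestamp format are assumptions, not my real log layout):

    # Same-day logs with zero-padded 24-hour times, so a plain string
    # comparison orders correctly and no mktime() conversion is needed.
    awk -v start="10:00:00" -v end="11:00:00" \
        '$4 >= start && $4 <= end' access.log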
pigz is a parallel implementation of gzip. It is parallel only when compressing files, but it is still a little faster than zcat for decompressing because it uses additional threads for reading, writing, and checksum calculation (I read this in an answer on Stack Overflow but can't link it here because I'm new to the forum). That's why I use it instead of zcat.

About selecting files instead of using nginx*, which matches all files in the given directory: that's not possible, because I can't easily tell what the contents of each file are. That's why I thought there might be something I could do in awk to make it a little faster.
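Put together, the pipeline is roughly this (a sketch; the glob, field number, and time window are placeholders, not my actual values):

    # unpigz -c decompresses to stdout like zcat, but with extra
    # threads for reading, writing, and checksum calculation.
    unpigz -c nginx* |
        awk -v start="10:00:00" -v end="11:00:00" '$4 >= start && $4 <= end'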