awk slowing down -- why?


 
# 1  
Old 01-07-2014

I have an awk script that extracts data from log files and generates a report. The script increments elements in arrays, sums fields depending on the contents of other fields, and then performs some operations on those arrays in END. It seemed to be taking longer than it should on my system, so I started trying to figure out where the time is going. I added a simple test to the script:

Code:
NR % 10000 == 0 {
        print NR "\t" length(time_count) "\t" systime() - prevtime
        prevtime = systime()
        }


time_count is an array that becomes rather large-- a per-second count across an hour's time for each of several servers. So, it stacks up considerably, and seems like the best candidate for causing the slowdown. Here's a snippet of that particular bit's output around when the slowdown starts:

Code:
350000  8559    1
360000  8804    2
370000  9012    1
380000  16773   3
390000  16811   4
400000  16857   4

You can see that there's a big increase in the array length around 380,000 lines, as well as in the time required to process that particular set of 10,000 lines. The time grows, slightly but measurably, as the line count increases. This will become more of a problem as the script is used to process larger files.

So, my questions:

1) Are there any general suggestions for increasing performance? This may require me to post the whole script. I don't mind, but I don't want to clutter things up here, so let me know if that would help.
2) I've noticed that my VIRT/RES/SHR in top for my script max out at 108m/6204/864. So, it's actually using about 6MB of RAM, if I read that right. I'd like it to feel free to gobble up as much as it can get. This is a system with 96GB of RAM, so not a problem. How can I encourage the process to do that?

I'd love it if I could tweak things so that disk I/O became the limiting factor. Many thanks in advance for suggestions.
# 2  
Old 01-08-2014
show your code then
# 3  
Old 01-08-2014
Quote:
Originally Posted by kurumi
show your code then
As requested! Attached as .txt.
# 4  
Old 01-08-2014
As of this posting, my attachment is still pending approval. So, here's the script:

Code:
#! /bin/awk -f

BEGIN   {
        OFS = ","
        count = 1
        prevtime = 0
        while ( "cat /root/scripts/billing/subant_list" | getline )
                {
                split($0, sublist, ",")
                subants[count] = sublist[1]
                count ++
                }
        }

## Let's chew on something now...
NR % 10000 == 0 {
        print NR "\t" length(time_count) "\t" systime() - prevtime
        prevtime = systime()
        }

$1 >= start_time && $1 < end_time       {
linecount++

## Hourly operations count

hourlyOperationsCount[substr($2,2,14)]++

## Generate a per-subant count, including:
##      Count of HTTP status codes per subant
##      Count of HTTP tx types per subant

for ( i = 1 ; i <= length(subants) ; i++)
        {
        if ( $14 == subants[i] )
                {
                ##  Subtenant count:
                found["TotalTXCount," subants[i]]++

                ##  Count of HTTP status codes per subant:
                found["StatusCount," subants[i] "," $5] ++

                ##  Count of HTTP tx types per subant
                httptype = substr($9, 2, length($9) - 1)
                found["TXTypeCount," subants[i] "," httptype] ++

                ##  Cumulative size and time of tx by subant_list and HTTP tx type
                indexInSizeByType = "InSizeByType," subants[i] "," httptype
                found[indexInSizeByType] = found[indexInSizeByType] + $17

                indexOutSizeByType = "OutSizeByType," subants[i] "," httptype
                found[indexOutSizeByType] = found[indexOutSizeByType] + $18

                indexTimeByType = "TimeByType," subants[i] "," httptype
                found[indexTimeByType] = found[indexTimeByType] + $19
                }
        }
}


## Ok, these next two sections warrant a little 'splainin.  We track:
##      1)  Concurrent connections -- connections that are ongoing during
##              a given second, whether or not they were initiated in that
##              particular second.
##      2)  initiated connections -- connections that were started in a
##              given second.

{

## First, track concurrent connections.  This one doesn't have the time
##      filter that everything else has so that connections already in
##      progress when the time window starts are counted.

for ( i = 1 ; i <= length(subants) ; i++)
        {
        if ( $14 == subants[i] )
                {
                stime = $1
                if (int(($19 + 500000) / 1000000) >= 1 )
                        {
                        for ( j = stime ; j <= (stime + int(($19 + 500000) / 1000000)) ; j ++ )
                                {
                                time_count[subants[i] "," j] ++
                                }
                        }
#               for (i in time_count) {print i "\t" time_count[i]}}
                }
        }
}

$1 >= start_time && $1 < end_time       {
## Finally, we track initiated connections.

for ( i = 1 ; i <= length(subants) ; i++)
        {
        if ( $14 == subants[i] )
                {
                txInitiated[subants[i] "," $1] ++
                }
        }
}


END     {
        print linecount
        for ( i in found )
                {
                print i "," found[i]
                }

        for ( i in time_count )
                {
                split(i,st,",")
                subant = st[1]
                subant_time = st[2]
                if ( time_count[i] > max_concurrency[subant] && subant_time >= start_time && subant_time < end_time )
                        {
                        max_concurrency[subant] = time_count[i]
                        max_concurr_time[subant] = subant_time
                        }
                }

        for ( i in txInitiated )
                {
                split(i,st,",")
                subant = st[1]
                subant_time = st[2]
                if (txInitiated[i] > max_initiated[subant])
                        {
                        max_initiated[subant] = txInitiated[i]
                        max_init_time[subant] = subant_time
                        }
                }

        for ( i in max_concurrency )
                {
                print "PeakCncrntConns," i "," max_concurrency[i] ",@" max_concurr_time[i]
                }

        for ( i in max_initiated )
                {
                print "PeakInitConns," i "," max_initiated[i] ",@" max_init_time[i]
                }

        for ( i in hourlyOperationsCount )
                {
                print "hourlyOperationsCount", i, hourlyOperationsCount[i]
                }

        }
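
A possible speed-up to try, offered as a sketch and not tested against the real logs: each of the three `for ( i = 1 ; i <= length(subants) ; i++)` loops compares $14 against every subant on every input line. If the array is keyed by subant name instead of by a counter, a single `in` test can replace the loop. The sample data, the temporary file, and the names `is_subant`, `line`, and `parts` are invented for illustration; only `found` and the "TotalTXCount," key are taken from the script. This does nothing about the growth of time_count itself.

```shell
# Hypothetical sample data: a two-entry subant list and three log-like
# lines whose 14th field is the subant name.
list=$(mktemp)
printf 'alpha,x\nbeta,y\n' > "$list"

result=$(
  printf '%s\n' \
    '1 2 3 4 5 6 7 8 9 10 11 12 13 alpha' \
    '1 2 3 4 5 6 7 8 9 10 11 12 13 gamma' \
    '1 2 3 4 5 6 7 8 9 10 11 12 13 alpha' |
  awk -v list="$list" '
    BEGIN {
      # Read the list once and key the array by subant name.
      while ((getline line < list) > 0) {
        split(line, parts, ",")
        is_subant[parts[1]] = 1
      }
      close(list)
    }
    # One hash lookup per line instead of a loop over every subant.
    $14 in is_subant { found["TotalTXCount," $14]++ }
    END { for (k in found) print k "," found[k] }
  '
)
rm -f "$list"
echo "$result"
```

With this sample input the only matching subant is alpha, seen twice, so the single line printed is `TotalTXCount,alpha,2`. Inside the matching block, `$14` can stand in wherever the original uses `subants[i]`.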

# 5  
Old 01-08-2014
Attachment approved.
You may try to run your script like this:
Code:
export LC_ALL=C
./your_awk_script args...

and see if the elapsed time decreases.

If you post sample datafiles, the analysis would be easier.
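
To compare the two runs side by side, something like the following could be used. It is only a sketch: the generated input file and the one-line awk program are stand-ins for the real logs and script, and `date +%s` gives one-second resolution, the same as the systime() check in the first post.

```shell
# Hypothetical stand-ins for the real log file and awk script.
input=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 200000; i++) print i, "GET", "/path/" i }' > "$input"
prog='{ count[$1 % 1000]++ } END { for (k in count) n++; print n }'

t0=$(date +%s)
out_default=$(awk "$prog" "$input")           # whatever locale is currently set
t1=$(date +%s)
out_c=$(LC_ALL=C awk "$prog" "$input")        # same program, C locale
t2=$(date +%s)

echo "current locale: $((t1 - t0))s, result $out_default"
echo "C locale:       $((t2 - t1))s, result $out_c"
rm -f "$input"
```

Both runs should print the same result; only the elapsed time is expected to differ, and how much (if at all) depends on the awk implementation and the current locale.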

Last edited by radoulov; 01-08-2014 at 07:36 PM..
# 6  
Old 01-08-2014
Quote:
Originally Posted by radoulov
Attachment approved.
You may try to run your script like this:
Code:
export LC_ALL=C
./your_awk_script args...

and see if the elapsed time decreases.

If you post sample datafiles, the analysis would be easier.
Thanks for the reply. I'm trying the variable export to see how the completion time compares.
# 7  
Old 01-09-2014
Answering your previous question (now deleted): yes, you can limit the scope of LC_ALL to a single command by setting it on the same line as the command:
Code:
LC_ALL=C <your_script> <args> ...
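
A quick way to confirm the scope, as a minimal sketch: the awk one-liner stands in for the real script, and the `unset` at the top is there only so the demonstration starts from a known state.

```shell
# Start from a known state for the demonstration.
unset LC_ALL

# The assignment prefix applies to this one awk invocation only.
inside=$(LC_ALL=C awk 'BEGIN { print ENVIRON["LC_ALL"] }')

# Back in the calling shell, LC_ALL is still unset.
after=${LC_ALL-unset}

echo "inside awk: $inside"
echo "after awk:  $after"
```

The first line reports `C` and the second reports `unset`, so nothing else run from the same shell is affected.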
