awk command optimization


 
# 1  
Old 09-13-2014
awk command optimization

Code:
gawk -v sw="error|fail|panic|accepted" 'NR>1 && NR<=128500 {
    for (w in a) {
        if ($0 ~ a[w])
            d[a[w]]++
    }
}
BEGIN {
    c = split(sw, a, "[|]")
}
END {
    for (i in a) {
        o = o (a[i] "=" (d[a[i]] ? d[a[i]] : 0) ",")
    }
    sub(",*$", "", o)
    print o
}' /var/log/treg.test

the above code works majestically when searching for multiple strings in a log.

the problem is that as the log gets bigger (e.g. 5MB), the time it takes to search for all the strings grows as well. it took 2 seconds to search a 5MB file with this code; a bigger file, say 10MB, would take even longer.

so i'm wondering: can this code be optimized at all to make it run faster? maybe reading the strings from a separate file would help speed things up?
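for reference, a rough sketch of what reading the strings from a separate file could look like (patterns.txt is just a made-up name here, and reading the patterns from a file by itself probably won't speed anything up, since the per-line work stays the same):

Code:
# patterns.txt (hypothetical) holds one search string per line: error, fail, panic, accepted
gawk 'FNR == NR { a[++c] = $0; next }          # first file: load the patterns
      FNR > 1 && FNR <= 128500 {               # second file: the log, same line window as before
          for (w = 1; w <= c; w++)
              if ($0 ~ a[w])
                  d[a[w]]++
      }
      END {
          for (w = 1; w <= c; w++)
              printf "%s=%d%s", a[w], d[a[w]], (w < c ? "," : "\n")
      }' patterns.txt /var/log/treg.test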

the code runs on Linux (Red Hat / Ubuntu) platforms.
# 2  
Old 09-13-2014
FWIW -

Code:
egrep -c '(error|fail|panic|accepted)' logfile

It does a lot of what your awk code does, but not all of it: it gives one total count of lines matching any of the patterns, not a count per pattern.
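If per-pattern counts are what's needed, something along these lines should also work with GNU grep (a sketch, untested here); note that patterns which never occur simply won't show up in the output, unlike the awk version, which prints 0 for them:

Code:
# print each match on its own line, then count occurrences of each pattern
grep -o -E 'error|fail|panic|accepted' /var/log/treg.test | sort | uniq -c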

Code:
for (w in a) {
    if ($0 ~ a[w])
        d[a[w]]++
}

The code above loops 4 times (once per pattern) on every line. I do not think regex in awk supports alternation, someone who knows more please comment. But that would be the first place to attack your problem. And if you search for more terms, your program will iterate even more times over each line of input.

This is the same problem we have when we use grep -f list_of_items filename with a large number of entries in list_of_items.

Edit: the comment above about alternation is flat wrong. Alternation is possible. You can rewrite the main loop to use it, for example:
Code:
# only loop over the individual patterns on lines that match at least one of them
/error|fail|panic|accepted/ { for (w in a) if ($0 ~ a[w]) d[a[w]]++ }


Last edited by jim mcnamara; 09-13-2014 at 11:55 AM..
# 3  
Old 09-13-2014
Can you post a sample of the input file as well...
# 4  
Old 09-13-2014
Quote:
Originally Posted by shamrock
Can you post a sample of the input file as well...
the input file is just any data file, no set format. the code is used to count occurrences of specific patterns in a file, so the exact file doesn't matter.
# 5  
Old 09-13-2014
Hi,
You can try:
Code:
gawk -v sw="error|fail|panic|accepted" 'NR>1 && NR<=128500 && match($0,"/"sw"/") {
    d[substr($0,RSTART,RLENGTH)]++
}
BEGIN {
    c = split(sw, a, "[|]")
}
END {
    for (i in a) {
        o = o (a[i] "=" (d[a[i]] ? d[a[i]] : 0) ",")
    }
    sub(",*$", "", o)
    print o
}' /var/log/treg.test

Regards.
# 6  
Old 09-13-2014
Well you could give this [g]awk a try...
Code:
gawk '{
    for (i=1; i<=NF; i++)
        if ($i ~ "^(error|fail|panic|accepted)$")
            a[$i]++
} END {
    for (i in a) {
        n++
        printf("%s=%s%s", i, a[i], (n < 4 ? ", " : "\n"))
    }
}' file

# 7  
Old 09-13-2014
Quote:
Originally Posted by disedorgue
Hi,
You can try:
Code:
gawk -v sw="error|fail|panic|accepted" 'NR>1 && NR<=128500 && match($0,"/"sw"/") {
    d[substr($0,RSTART,RLENGTH)]++
}
BEGIN {
    c = split(sw, a, "[|]")
}
END {
    for (i in a) {
        o = o (a[i] "=" (d[a[i]] ? d[a[i]] : 0) ",")
    }
    sub(",*$", "", o)
    print o
}' /var/log/treg.test

Regards.
thank you so much!

this looks promising. when i run it though, it only gives a count for one of the strings even though there are lines in the data file that contain the other strings:

Code:
accepted=0,error=0,fail=3859,panic=0
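a likely explanation (not confirmed in the thread): in match($0,"/"sw"/") the slashes are ordinary characters inside a dynamic regex, so the alternatives effectively become "/error", "fail", "panic" and "accepted/", which would leave some strings uncounted. dropping the slashes should let all four be counted, roughly:

Code:
gawk -v sw="error|fail|panic|accepted" 'NR>1 && NR<=128500 && match($0,sw) {
    d[substr($0,RSTART,RLENGTH)]++
}
BEGIN {
    c = split(sw, a, "[|]")
}
END {
    for (i in a) {
        o = o (a[i] "=" (d[a[i]] ? d[a[i]] : 0) ",")
    }
    sub(",*$", "", o)
    print o
}' /var/log/treg.test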

---------- Post updated at 02:01 PM ---------- Previous update was at 01:58 PM ----------

Quote:
Originally Posted by shamrock
Well you could give this [g]awk a try...
Code:
gawk '{
    for (i=1; i<=NF; i++)
        if ($i ~ "^(error|fail|panic|accepted)$")
            a[$i]++
} END {
    for (i in a) {
        n++
        printf("%s=%s%s", i, a[i], (n < 4 ? ", " : "\n"))
    }
}' file

this looks quite promising as well. thank you so much!!!

looks like the code is written in such a way that it only counts fields that exactly match the specified patterns (whole words), rather than any line containing them. but i believe i can play with it some more.

one question. is the "n < 4" setting a limit of patterns that can be specified?
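(as far as i can tell, the 4 is just the number of patterns, used to decide whether to print a comma or a newline after each count, so it isn't a hard limit; however, if fewer than 4 of the patterns actually occur in the file, the closing newline never gets printed. a sketch of the END block without the hard-coded count, using gawk's length() on the array:)

Code:
END {
    total = length(a)        # gawk: number of distinct patterns actually seen
    for (i in a) {
        n++
        printf("%s=%s%s", i, a[i], (n < total ? ", " : "\n"))
    }
}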

btw, this completed in under .3 seconds on a 5mb file. so very good news!!!