awk command optimization


 
# 1  
Old 09-13-2014
awk command optimization

Code:
gawk -v sw="error|fail|panic|accepted" 'NR>1 && NR<=128500 {
    for (w in a) {
        if ($0 ~ a[w])
            d[a[w]]++
    }
}
BEGIN {
    c = split(sw, a, "[|]")
}
END {
    for (i in a) {
        o = o (a[i] "=" (d[a[i]] ? d[a[i]] : 0) ",")
    }
    sub(",*$", "", o)
    print o
}' /var/log/treg.test

the above code works majestically when searching for multiple strings in a log.

the problem is that as the log gets bigger (e.g. 5MB), the time it takes to search for all the strings grows as well. it took 2 seconds to search a 5MB file with this code; a bigger file, say 10MB, would take even longer.

so i'm wondering: can this code be optimized at all to make it run faster? maybe reading the strings from a separate file would help speed things up?
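for reference, a rough sketch of what reading the strings from a separate file could look like (patterns.txt is just a made-up name here, and reading the patterns from a file by itself probably won't speed anything up, since the per-line work stays the same):

Code:
# patterns.txt (hypothetical) holds one search string per line: error, fail, panic, accepted
gawk 'FNR == NR { a[++c] = $0; next }          # first file: load the patterns
      FNR > 1 && FNR <= 128500 {               # second file: the log, same line window as before
          for (w = 1; w <= c; w++)
              if ($0 ~ a[w])
                  d[a[w]]++
      }
      END {
          for (w = 1; w <= c; w++)
              printf "%s=%d%s", a[w], d[a[w]], (w < c ? "," : "\n")
      }' patterns.txt /var/log/treg.test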

the code runs on Linux (Red Hat / Ubuntu) platforms.
# 2  
Old 09-13-2014
FWIW -

Code:
egrep -c '(error|fail|panic|accepted)' logfile

It does a lot of what your awk code does, but not all of it: it gives one total count of lines matching any of the patterns, not a count per pattern.
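If per-pattern counts are what's needed, something along these lines should also work with GNU grep (a sketch, untested here); note that patterns which never occur simply won't show up in the output, unlike the awk version, which prints 0 for them:

Code:
# print each match on its own line, then count occurrences of each pattern
grep -o -E 'error|fail|panic|accepted' /var/log/treg.test | sort | uniq -c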

Code:
for (w in a) {
    if ($0 ~ a[w])
        d[a[w]]++
}

The code above loops 4 times (once per pattern) on every line. I do not think regex in awk supports alternation, someone who knows more please comment. But that would be the first place to attack your problem. And if you search for more terms, your program will iterate even more times over each line of input.

This is the same problem we have when we use grep -f list_of_items filename with a large number of entries in list_of_items.

Edit: the comment above about alternation is flat wrong. Alternation is possible. You can rewrite the main loop to use it, for example:
Code:
# only loop over the individual patterns on lines that match at least one of them
/error|fail|panic|accepted/ { for (w in a) if ($0 ~ a[w]) d[a[w]]++ }


Last edited by jim mcnamara; 09-13-2014 at 11:55 AM..
# 3  
Old 09-13-2014
Can you post a sample of the input file as well...
# 4  
Old 09-13-2014
Quote:
Originally Posted by shamrock
Can you post a sample of the input file as well...
the input file is just any data file, no set format. the code is used to count occurrences of specific patterns in a file, so the exact file doesn't matter.
# 5  
Old 09-13-2014
Hi,
You can try:
Code:
gawk -v sw="error|fail|panic|accepted" 'NR>1 && NR<=128500 && match($0,"/"sw"/") {
    d[substr($0,RSTART,RLENGTH)]++
}
BEGIN {
    c = split(sw, a, "[|]")
}
END {
    for (i in a) {
        o = o (a[i] "=" (d[a[i]] ? d[a[i]] : 0) ",")
    }
    sub(",*$", "", o)
    print o
}' /var/log/treg.test

Regards.
# 6  
Old 09-13-2014
Well you could give this [g]awk a try...
Code:
gawk '{
    for (i=1; i<=NF; i++)
        if ($i ~ "^(error|fail|panic|accepted)$")
            a[$i]++
} END {
    for (i in a) {
        n++
        printf("%s=%s%s", i, a[i], (n < 4 ? ", " : "\n"))
    }
}' file

# 7  
Old 09-13-2014
Quote:
Originally Posted by disedorgue
Hi,
You can try:
Code:
gawk -v sw="error|fail|panic|accepted" 'NR>1 && NR<=128500 && match($0,"/"sw"/") {
    d[substr($0,RSTART,RLENGTH)]++
}
BEGIN {
    c = split(sw, a, "[|]")
}
END {
    for (i in a) {
        o = o (a[i] "=" (d[a[i]] ? d[a[i]] : 0) ",")
    }
    sub(",*$", "", o)
    print o
}' /var/log/treg.test

Regards.
thank you so much!

this looks promising. when i run it though, it only gives a count for one of the strings even though there are lines in the data file that contain the other strings:

Code:
accepted=0,error=0,fail=3859,panic=0
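a likely explanation (not confirmed in the thread): in match($0,"/"sw"/") the slashes are ordinary characters inside a dynamic regex, so the alternatives effectively become "/error", "fail", "panic" and "accepted/", which would leave some strings uncounted. dropping the slashes should let all four be counted, roughly:

Code:
gawk -v sw="error|fail|panic|accepted" 'NR>1 && NR<=128500 && match($0,sw) {
    d[substr($0,RSTART,RLENGTH)]++
}
BEGIN {
    c = split(sw, a, "[|]")
}
END {
    for (i in a) {
        o = o (a[i] "=" (d[a[i]] ? d[a[i]] : 0) ",")
    }
    sub(",*$", "", o)
    print o
}' /var/log/treg.test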

---------- Post updated at 02:01 PM ---------- Previous update was at 01:58 PM ----------

Quote:
Originally Posted by shamrock
Well you could give this [g]awk a try...
Code:
gawk '{
    for (i=1; i<=NF; i++)
        if ($i ~ "^(error|fail|panic|accepted)$")
            a[$i]++
} END {
    for (i in a) {
        n++
        printf("%s=%s%s", i, a[i], (n < 4 ? ", " : "\n"))
    }
}' file

this looks quite promising as well. thank you so much!!!

looks like the code is written in such a way that it only counts fields that exactly match the specified patterns (whole words), rather than any line containing them. but i believe i can play with it some more.

one question. is the "n < 4" setting a limit of patterns that can be specified?
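(as far as i can tell, the 4 is just the number of patterns, used to decide whether to print a comma or a newline after each count, so it isn't a hard limit; however, if fewer than 4 of the patterns actually occur in the file, the closing newline never gets printed. a sketch of the END block without the hard-coded count, using gawk's length() on the array:)

Code:
END {
    total = length(a)        # gawk: number of distinct patterns actually seen
    for (i in a) {
        n++
        printf("%s=%s%s", i, a[i], (n < total ? ", " : "\n"))
    }
}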

btw, this completed in under .3 seconds on a 5mb file. so very good news!!!