Can this awk statement be optimized? I ask because log.txt is a giant file with several hundred thousand lines of records.
The first column in "log.txt" contains the file name.
The second column in "log.txt" contains the last known total number of lines for each file.
myscript.sh reads the file "log.txt" and, for each file it finds, gets the line number from the second column, begins scanning that file from that line number, and counts the number of times the search term provided by the user appears.
The worst approach (performance-wise) is:
This creates three child processes for each line in the input file; 300K lines means roughly a million process creations.
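The pattern being criticized might look something like this (a hypothetical reconstruction, since the quoted code block was not preserved; the file names and sample data are invented for illustration):

```shell
# Hypothetical reconstruction of the per-line fork pattern; sample data
# is made up for illustration.
printf 'hit\nmiss\nhit\n' > data1.txt
printf 'data1.txt 1\n'    > log.txt

out=$(
    while read -r fname startline; do
        total=$(( $(wc -l < "$fname") ))                  # child process 1
        tailpart=$(tail -n +"$startline" "$fname")        # child process 2
        count=$(printf '%s\n' "$tailpart" | grep -c hit)  # child process 3
        echo "$fname $count $total"
    done < log.txt
)
echo "$out"
```

Every `$( )` here forks at least one child; over 300K input lines, that fork overhead dominates the run time.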
awk, perl, and ruby can do almost anything with a single process creation, because the needed tools are built into the language.
Consider simply using a larger awk program.
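For instance, a single awk process can read log.txt and do all of the per-file scanning itself. This is only a sketch of the idea, not the poster's actual code; the file names and search term are made up:

```shell
# One awk process handles every file listed in log.txt; no per-line forks.
# File names and the search term are invented for this example.
printf 'hit\nmiss\nhit\nhit\n' > data1.txt
printf 'data1.txt 2\n'         > log.txt   # resume data1.txt at line 2

result=$(awk -v term="hit" '
    {
        file = $1; start = $2; hits = 0; n = 0
        while ((getline line < file) > 0) {
            n++
            if (n >= start && index(line, term)) hits++
        }
        close(file)
        print file, hits, n   # file, new matches, new total line count
    }
' log.txt)
echo "$result"                # prints: data1.txt 2 4
```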
Or:
So, if your awk commands could be accomplished with parameter substitution in bash or ksh, you would speed things up enormously.
Example:
is doing nothing more than getting a field from data like this:
where it appears that you want to store "fap" in a variable.
Since you seem to have done this several times in your code, I'm hoping to get you past the problem.
Can you think of a way to get "fap" using one of bash's ${ } constructs, or maybe use set
and one of the bash commands to get "fap"?
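For example (the full data line is an assumption; only the field "fap" is known from the thread):

```shell
# Hypothetical record; only the "fap" field comes from the thread.
line="fap 1002 47"

# bash parameter expansion: delete everything from the first space onward
first=${line%% *}
echo "$first"     # prints: fap

# or: load the fields into the positional parameters
set -- $line
echo "$1"         # prints: fap
```

Both forms are shell builtins, so no child process is created.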
You are right, Jim. My original code is very far from being efficient; the command above can probably replace it with ease.
I can't seem to think in awk, even though I prefer it over any other language. I'm more comfortable with bash, but I really want awk to do this.
In the code I quoted, I'm interested in the second field, which is the line number. I need to store the count of the search terms and the new total line count of each file, and then send that information to another file.
Thinking in bash, anyone who knows anything about scripting can figure out exactly what I'm doing here very easily.
Translating this into an efficient awk program is where I need serious help.
Step 1 is to define "static" variables outside the loop. Step 2 is to use the full potential of the read command. Step 3 is to get both termcount and newfilelinecount in one stroke. Step 4 is to have the sole output at the end of the loop.
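Those four steps might come together like this (a sketch only; the variable names, sample data, and output format are assumptions, not the actual code from the thread):

```shell
# Sketch of the four steps; variable names, sample data, and output
# format are assumptions, not the actual code from the thread.
printf 'hit\nmiss\nhit\nhit\n' > data1.txt
printf 'data1.txt 2\n'         > log.txt

term="hit"                        # Step 1: "static" variables set once,
outfile="newlog.txt"              # outside the loop

while read -r fname lastline; do  # Step 2: read splits both fields itself
    # Step 3: one awk run yields termcount and the new line count together
    counts=$(awk -v start="$lastline" -v t="$term" '
        NR >= start && index($0, t) { c++ }
        END { print c+0, NR }
    ' "$fname")
    termcount=${counts% *}
    newtotal=${counts#* }
    printf '%s %s %s\n' "$fname" "$newtotal" "$termcount"
done < log.txt > "$outfile"       # Step 4: the sole output, at the end

cat "$outfile"                    # prints: data1.txt 4 2
```

This still forks one awk per listed file, but that is one fork per file instead of three, and the output file is opened once instead of on every iteration.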