Optimizing awk script


 
# 1  
Old 08-04-2013

Can this awk statement be optimized? I ask because log.txt is a giant file with several hundred thousand lines of records.

Code:
myscript.sh:

while read line
do
        searchterm="${1}"
        datecurr=$(date  +%s)
        file=$(awk 'BEGIN{split(ARGV[1],var,",");print var[1]}' $line)
        llnum=$(awk 'BEGIN{split(ARGV[1],var,",");print var[2]}' $line)
        termcount=$(awk -v llnum=${llnum} 'NR>llnum' $file | egrep -c "${searchterm}")
        newfilelinecount=$(wc -l $file)
        echo "${file},${newfilelinecount},${termcount},${datecurr}" >> /tmp/log.txt_2
done < log.txt

Code:
log.txt:

/tmp/text1.txt,343,193,833
/tmp/text2.txt,43,93,533

The first column in "log.txt" contains the file name.
The second column in "log.txt" contains the last known total number of lines for each file.

myscript.sh reads "log.txt" and, for each file listed there, takes the line number from the second column, scans the file starting after that line, and counts the lines that match the search term provided by the user.

Can this be optimized?

OS:
Linux (redhat, centos, ubuntu)/SunOS

Last edited by SkySmart; 08-04-2013 at 12:07 PM..
# 2  
Old 08-04-2013
The worst approach (performance-wise) is
Code:
while read rec
do
var=$(awk command here)
var2=$(awk command here)
var3=$(sed command here)
done < somefile

This creates three child processes for each line in the input file; 300K lines means roughly 1 million process creations.

awk, perl, and ruby can do almost anything with a single process creation because they have tools built in as part of the language.

Consider simply using a larger awk program.

Or, if your awk commands could be accomplished with parameter substitution in bash or ksh, you would speed things up enormously.

Example:
Code:
file=$(awk 'BEGIN{split(ARGV[1],var,",");print var[1]}' $line)

is doing nothing more than getting a field from data like this:
Code:
fap,boo bar

where it appears that you want to store "fap" in a variable.

Since you seem to have done this several times in code before, I'm hoping to get you past the problem.

Can you think of a way to get "fap" using one of bash's ${ } constructs, or maybe set
Code:
IFS=,

and use one of the bash commands to get "fap"?
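
For reference, a minimal sketch of both ideas, assuming a record shaped like the sample above. The variable names (first, rest, first2, rest2) are made up for illustration and do not come from the original script:

```shell
#!/bin/sh
# Sample record from the post: we want "fap" without starting a child process.
line="fap,boo bar"

# Option 1: parameter expansion.
first=${line%%,*}    # drop the longest suffix that begins at the first comma
rest=${line#*,}      # drop the shortest prefix that ends at the first comma

# Option 2: let read do the splitting. IFS is changed for this one command only.
IFS=, read -r first2 rest2 <<EOF
$line
EOF

echo "$first | $rest"
echo "$first2 | $rest2"
```

Neither form forks anything, which is where the per-line savings come from.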
This User Gave Thanks to jim mcnamara For This Post:
# 3  
Old 08-04-2013
Code:
file=$(awk 'BEGIN{split(ARGV[1],var,",");print var[1]}' $line)

You are right, jim. My original code is very far from being efficient. The command above can probably be replaced with ease.

I can't seem to think in awk, even though I prefer it over any other language. I'm more comfortable with bash, but I really want awk to do this.

In the code I quoted, I'm interested in the second field, which is the line number. I also need to store the count of search-term matches and the new total line count of each file, and then send that information to another file.

Thinking in bash, anyone who knows anything about scripting can figure out exactly what I'm doing here very easily.

Code:
while read line
do
        searchterm="${1}"
        datecurr=$(date  +%s)
        file=$(awk 'BEGIN{split(ARGV[1],var,",");print var[1]}' $line)
        llnum=$(awk 'BEGIN{split(ARGV[1],var,",");print var[2]}' $line)
        termcount=$(awk -v llnum=${llnum} 'NR>llnum' $file | egrep -c "${searchterm}")
        newfilelinecount=$(wc -l $file)
        echo "${file},${newfilelinecount},${termcount},${datecurr}" >> /tmp/log.txt_2
done < log.txt

Translating this into an efficient awk program is where I need serious help.
# 4  
Old 08-04-2013
Step 1 is to define "static" variables outside the loop.
Step 2 is to use the full potential of the read command.
Step 3 is to get both termcount and newfilelinecount in one stroke.
Step 4 is to have the sole output at the end of the loop.
Code:
searchterm="${1}"
datecurr=$(date +%s)
# Let read split each record on commas; no awk is needed to extract fields.
while IFS="," read -r file llnum rest
do
  # One awk pass yields both counts, printed as shell assignments for eval.
  eval "$(awk -v llnum="${llnum}" -v search="$searchterm" '
NR>llnum && $0~search {++tcnt}
END {print "termcount=" tcnt+0, "newfilelinecount=" NR}
' "$file")"
  echo "${file},${newfilelinecount},${termcount},${datecurr}"
done < log.txt > /tmp/log.txt_2
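
If the goal is a single awk process for the whole run, as asked in post #3, the loop above could be folded into awk itself. This is only a sketch, not a tested replacement: the function name count_new_matches is hypothetical, and it assumes the file names in log.txt contain no commas and that the search term is meant as an extended regular expression, as with egrep.

```shell
#!/bin/sh
# count_new_matches SEARCHTERM LOGFILE   (hypothetical helper name)
# Prints one "file,linecount,termcount,epoch" record per line of LOGFILE.
count_new_matches() {
    awk -F, -v search="$1" -v now="$(date +%s)" '
    $1 == "" { next }               # skip records without a file name
    {
        file = $1
        start = $2 + 0              # last known line count
        total = 0; hits = 0
        # Read the named file inside awk instead of starting a process for it.
        while ((getline rec < file) > 0) {
            total++
            if (total > start && rec ~ search) hits++
        }
        close(file)                 # release the descriptor before the next file
        print file "," total "," hits "," now
    }' "$2"
}

# Usage mirroring the original script:
# count_new_matches "$1" log.txt > /tmp/log.txt_2
```

A file that cannot be opened is reported with counts of 0 rather than an error. Note that -v interprets backslash escapes in the search term, and on SunOS the default /usr/bin/awk may be the old awk without -v support, in which case nawk or /usr/xpg4/bin/awk is the usual substitute.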

This User Gave Thanks to MadeInGermany For This Post: