Help 'speeding' up this 'parsing' script - taking 24+ hours to run
Post 303015332 by bakunin, 04-03-2018
The following is probably not the answer you hoped for - I will tell you what you did wrong, why it was wrong and how you could do it better. You will still have to implement what I tell you yourself. Also, I will keep my explanation very short and introductory: you will need to research many of the pointers I give you on your own to explore the full capabilities of the things I explain.

If you want to show us the fruit of your efforts once you have reimplemented the script and seek further advice - you will be welcome.

Quote:
Originally Posted by newbie_01
I've written a ksh script that reads a file and parses/filters/formats each line. The script runs as expected but it runs for 24+ hours for a file that has 2 million lines. And sometimes the input file has 10 million lines, which means it can be running for more than 2 days and still not finish.
This is a good start. Whenever you write code, always take the time to estimate how long it will run depending on the amount of input you expect. You don't need exact calculations; a rough estimate for the expected orders of magnitude will do. There is a whole mathematical theory about this (see "Landau symbols" or "Big O notation"), but we won't need it here - a glimpse of it will suffice.

Look at the following code:

Code:
while read LINE ; do
     program -abc "$LINE" >> firstresult
     program -def "$LINE" >> secondresult
done < /some/input

How long will this run? Well, obviously that depends on how long "program" runs, yes? But even without knowing that, we can already say that "program" has to be run twice for every line of input. Now we can examine the input, and if it contains, say, 1 million lines, we know that "program" will be called 2 million times. If we estimate that "program" needs 1 millisecond for a single run, the script will take 0.001s x 2 000 000 = 2 000s, i.e. roughly 33 minutes. Add to that some overhead for reading the input file, writing the output files, loading "program" two million times into memory and starting it, etc., and we will probably end up at about 1 hour of runtime.
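One way to replace the guessed 1 millisecond with a real figure is to measure a single representative invocation with the time command, which we will look at more closely in a moment. A minimal sketch - "program", its option and the sample line are placeholders for whatever you actually run:

Code:
# time one representative call; discard the output, we only want the duration
time program -abc "one typical input line" > /dev/null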

Especially for large inputs it makes sense to test the finished script with a short input and measure the time it takes. For this there is the time command. For instance, you can take your script, save it under the name myscript and then execute it with a test input of, say, 1000 lines, like this:

Code:
time ./myscript <maybe necessary options/arguments here>

You will get an output like the following:

Code:
time ./myscript -some options

real    0m0,41s
user    0m0,03s
sys     0m0,08s

If you are interested you may want to explore performance tuning and measurement, but for a start we are only concerned with the "real" line of the output: this is how long your program ran overall. Now that you have an estimate of how long it takes to process a thousand lines, it is easy to extrapolate how long it will take to process a million or ten million.
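For instance, taking the (sample) figure of 0,41s for 1000 lines from above, a quick back-of-the-envelope calculation for a 2-million-line file could look like this:

Code:
$ # 0.41s per 1000 lines, scaled to 2 000 000 lines, expressed in minutes
$ echo "scale=1; 0.41 * (2000000 / 1000) / 60" | bc
13.6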


The next thing I want to talk about is probably more like what you expected: how to make code faster. First, here is a part of your code, which I have trimmed down a bit. Let us use our new tool to estimate its runtime:

Code:
for LOG in *search_string_found.out
#for LOG in *xyz
do
   server_db=`echo $LOG | awk -F"_" '{ print $1 }'`
   server_app=`echo $LOG | awk -F"_" '{ print $2 }'`
   echo "- [ `date` ] // `wc -l $LOG | awk '{ print $1 }'` lines ==> Processing $LOG // ${server_db} from ${server_app}"

#while IFS="*" read TS CS HOST RESULT SERVICE RETURNCODE
oIFS=$IFS
while read line
do
   IFS="*"
   echo "$line" | read TS CS HOST RESULT SERVICE RETURNCODE
   timestamp=`echo $TS | awk '{ print $2 }'`
   year=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $3 }'`
   day=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $1 }'`
   month=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $2 }'`
done
done

You immediately see why it pays off to indent properly: as the code stands, you cannot tell at a glance how many levels of nesting there are. Therefore, let us first reindent your code:

Code:
for LOG in *search_string_found.out ; do
     server_db=`echo $LOG | awk -F"_" '{ print $1 }'`
     server_app=`echo $LOG | awk -F"_" '{ print $2 }'`
     echo "- [ `date` ] // `wc -l $LOG | awk '{ print $1 }'` lines ==> Processing $LOG // ${server_db} from ${server_app}"

     oIFS=$IFS
     while read line ; do
          IFS="*"
          echo "$line" | read TS CS HOST RESULT SERVICE RETURNCODE
          timestamp=`echo $TS | awk '{ print $2 }'`
          year=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $3 }'`
          day=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $1 }'`
          month=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $2 }'`
     done < $LOG
done

Now we see immediately that the inner while-loop is executed completely for every pass of the outer for-loop. If we estimate that the for-loop finds 10 files and each file has 100 lines, the while-loop as a whole will be executed 10 times and every line within it will be executed 1000 times.

Most lines within the while-loop look like this:
Code:
variable=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print something }'`

What does the shell do to process this code? First, the shell creates an extra process in which echo is run; this generates an output stream from echo $TS. Next, a child process is created and started in which the awk program is loaded and awk '{ print $1 }' is run; the output generated by the echo is fed to it as input. This awk generates some output of its own, and a third sub-process is created and started, into which another instance of awk is loaded. The output of the first awk is now fed as input to the second awk, which in turn generates some output based on that input. This final output is caught and put into the variable.

Sounds complicated? Yes - because it is! Calling an external program is one of the most "expensive" operations (in terms of needed system resources and time) there is! Fast shell scripts differ from slow ones mostly in this regard: how well they avoid calling external programs.
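You can make this cost visible yourself. Here is a minimal sketch comparing 1000 external-awk passes with 1000 passes of the parameter-expansion construct explained in the next section (the variable content is made up and the absolute numbers will differ from system to system):

Code:
mydate="31-03-2018"

time for (( i=0; i<1000; i++ )) ; do
     junk=`echo $mydate | awk -F"-" '{ print $3 }'`    # a subshell plus an external awk per pass
done

time for (( i=0; i<1000; i++ )) ; do
     junk="${mydate##*-}"                              # pure shell, no extra process at all
done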

That raises the question: if we don't use awk to filter the part we need from the rest, what should we use instead? Luckily, the inventors of the shell asked themselves this question, and they invented variable expansion (also called "parameter expansion").

I won't explain it completely here, just give a short introduction: suppose we have a variable holding a date, like this (notice that I use the ISO date format: YYYY-MM-DD):

Code:
var="2018-03-31"

Now, we want to split that into a year, month and day part.

There is a device which will cut off part of a variable's content based on a pattern:

Code:
${variable#pattern}     # cut off from the front, shortest match
${variable%pattern}     # cut off from the rear, shortest match

${variable##pattern}    # cut off from the front, longest match
${variable%%pattern}    # cut off from the rear, longest match

In our case the pattern we look for is "-", because it separates the year, month and day parts. You can also use wildcards, like "*" (any number of any characters) and "?" (any single character), just like in filenames when you do an ls -l *.txt.

Now let us try (I absolutely suggest that you play around with this - create your own variable contents, try different patterns and see what comes out):

Code:
$ mydate="2018-03-31"
$ echo "${mydate#*-}"
03-31
$ echo "${mydate##*-}"
31
$ echo "${mydate%-*}"
2018-03
$ echo "${mydate%%-*}"
2018

Notice that the content of the variable is not changed at all - only what is displayed changes! If you want to save the result, you need to assign it to another (or the same) variable:

Code:
$ mydate="2018-03-31"
$ myday="${mydate##*-}"
$ myyear="${mydate%%-*}"
$ echo "YEAR: $myyear   DAY: $myday"

Notice that I have left out the month here. We need a two-step approach to filter it out:

Code:
$ mydate="2018-03-31"
$ echo "${mydate#*-}"
03-31
$ mymonth="${mydate#*-}"
$ echo "${mymonth%-*}"
03
$ mymonth="${mymonth%-*}"

Now we have a complete solution:

Code:
$ mydate="2018-03-31"
$ myday="${mydate##*-}"
$ myyear="${mydate%%-*}"
$ mymonth="${mydate#*-}"
$ mymonth="${mymonth%-*}"
$ echo "YEAR: $myyear   MONTH: $mymonth DAY: $myday"

You may well ask how much all this influences the runtime. You are right to ask, but seeing is believing, as they say. Prepare a log file with 1000 lines and run these two scripts, each under the time command I showed you above:

Code:
while read line ; do
     echo "$line" | read TS CS HOST RESULT SERVICE RETURNCODE
     timestamp=`echo $TS | awk '{ print $2 }'`
     year=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $3 }'`
     day=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $1 }'`
     month=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $2 }'`
     echo "SCRIPT1: YEAR: $year   MONTH: $month  DAY: $day"
done < /your/file

Code:
while read TS junk ; do
     year="${TS##*-}"
     day="${TS%%-*}"
     month="${TS#*-}"
     month="${month%%-*}"
     echo "SCRIPT2: YEAR: $year   MONTH: $month  DAY: $day"
done < /your/file

And see what comes out.

I have used another device above to further speed things up: the shell's ability to split input into fields. This is usually done along whitespace delimiters. Consider the following command:

Code:
command -abc file1 file2

Somehow we expect the shell to interpret file1 as the name of one file and file2 as the name of another. We do NOT expect the shell to mistake this for a file called "-abc file1" or "file1 file2" or so. This works because of the shell's innate splitting ability and the fact that the strings file1 and file2 are surrounded by whitespace.

We can use this ability to our advantage when reading input, too. You are doing it already when you write:

Code:
echo "$line" | read TS CS HOST RESULT SERVICE RETURNCODE

The content of the variable "line" is split along whitespace, and the first part goes into a variable named TS, the second part into a variable named CS, and so on. (On a passing note: "HOST" is a bad name for a variable because it is often a fixed environment variable holding the name of the system you are running on. Use something else.)

But instead of doing:

Code:
while read line ; do
     echo $line | read var1 var2 var3 ...
done

You can do immediately:

Code:
while read var1 var2 var3 ... ; do
     ....
done

This is what I have done above. Notice that you may still need the line as a whole, and it might make sense to retain it as you did - I just didn't need it for this part, so I left it out. You should just be aware of what is possible.
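Putting the pieces together, the inner loop of your script could shrink to something like the following sketch. The field names are taken from your script ("HOST" renamed as suggested), and I assume the first star-separated field looks like "31-03-2018 12:34:56", as your awk calls imply. Note that placing IFS="*" directly before read sets it for that command only, so the rest of the loop is unaffected:

Code:
while IFS="*" read TS CS RHOST RESULT SERVICE RETURNCODE ; do
     mydate="${TS%% *}"          # part before the first blank: 31-03-2018
     timestamp="${TS#* }"        # part after the first blank:  12:34:56
     year="${mydate##*-}"
     day="${mydate%%-*}"
     month="${mydate#*-}"
     month="${month%-*}"
     # ... your processing here ...
done < "$LOG"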

There are some further rules for this splitting: if you have fewer variables than fields, everything left over is put into the last variable:

Code:
$ echo one two three four five | read var1 var2 var3
$ echo $var1
one
$ echo $var2
two
$ echo $var3
three four five

So, if you need only, say, the second field of a list of values:

Code:
while read junk VAR junk ; do
     echo $VAR
done < /your/input

If you have more variables than available fields, the surplus variables will simply be empty.
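For example (a quick sketch to try at the prompt; the pipe-into-read idiom works in ksh because the last part of a pipeline runs in the current shell):

Code:
$ echo one two | read var1 var2 var3
$ echo "var3 is >$var3<"
var3 is ><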

Now, I suggest you first play around with what I have told you and explore the possibilities. Only then try to reimplement your script in light of it.

I hope this helps.

bakunin