Help speeding up this parsing script - taking 24+ hours to run
Hi,
I've written a ksh script that reads a file and parses/filters/formats each line. The script runs as expected, but it runs for 24+ hours on a file that has 2 million lines. Sometimes the input file has 10 million lines, which means it can run for more than 2 days and still not finish. And of course, the SAs have been chasing me up, as it shows in top as running practically forever.
I need some advice: maybe instead of reading one line at a time, I could run an awk one-liner instead. I wish I could code it in Perl, but I'm not sure how. Most people say it is faster in Perl, but I don't know how to use Perl-like equivalents of the UNIX commands other than calling them via system.
Anyway, hopefully I can interest someone in looking into this.
Below is the excerpt / part of the script that is taking the most time:
Below are example entries of the input file that the script reads; it is at least 2 million lines and can grow to as much as 10 million lines. I've changed entries as they are customer data.
What I am wanting to do, in the simplest terms, is as below:
Change the date format to YYYY-MM-DD. The main reason is that this format is the most convenient for sorting.
Filter some information from each line, i.e. host name, IP, program name, service name, return code etc.
I then redirect these formatted lines/records to a file that I can check grouped by return code value, or simply run through sort | uniq -c so it displays a count of occurrences.
Running awk once and only once would be so much faster than running awk 180,000,000 times; it'd be done in under a minute, maybe even single-digit seconds.
Perl is not faster. If you wrote this code the same way in Perl it'd be just as slow or slower.
Unfortunately, the program you've given doesn't seem to work, so I can't tell what output you want. Could you post the output you want?
Last edited by Corona688; 03-28-2018 at 04:46 PM..
You have shown us an input file and you have shown us a script that invokes awk and sed at least 30 times for every line read from your file. It is no wonder that running this script is burning up CPU cycles to the detriment of anyone else trying to use the same system you're using.
Please describe in English exactly what output you're trying to produce and show us the exact output you hope to produce from your sample input. Saying that you want to filter the host name for each line doesn't really describe what you're trying to do, especially since many of your sample input lines contain more than one (HOST=value) string.
Please also tell us what operating system you're using. (Different operating systems have different utilities and different options available for some utilities.)
Sorry Corona688 and Don Cragun, I should have realized how difficult and unfair it was of me not to post an example output.
You are right that it is indeed a lot, lot, lot faster if it reads the whole file at once instead of line by line. I kicked off the script on a 10-million-line file over the weekend; I didn't get an Easter miracle of any sort, it is still running at this time.
You can ignore, or ideally forget, the horrible code that I posted. Let me explain what I've been trying to do below.
So, here is an example raw input file, unfiltered.
There can be millions of these lines, and at the moment the script reads one line at a time and generates formatted output like below.
I then use sort | uniq -c to do some sort of a count, which comes up with the below:
All fields of the output file come from the input file, with the exception of the second field that shows up as runserver01. This comes from running hostname. It doesn't have to be the second field; it can be anywhere, or can be added later on after all the filtering. It is basically just a way for me to figure out where I ran the script from.
Most of the lines are of the following format:
Sometimes, it can be like below:
I don't know how to make awk differentiate between the two formats and filter/get the right information. Note that the information appears in a different order in these two formats.
And yes, running the whole file through awk is faster than reading one line at a time, but I don't know how to get awk to do what I want so that it produces the output format I'm after.
I am looking at maybe doing one run of awk to change the date format first, and then a second awk run to split the CONNECT_DATA string into different parts.
But I can't figure out what to do, so for the first pass, I need to change
to
How do I tell awk -F"*" to print $1 and the rest of the fields, with $1 further changed to YYYY-MM-DD format? The real reason for formatting it as YYYY-MM-DD is that it works best when doing the sort.
And then the next pass is supposed to filter it to be like
Or ideally be like
Please advise on how best to do what I am wanting to do. Apologies for not giving enough information earlier.
P.S:
That ksh script I ran to process a file that has 9890943 lines is still running; ps -o etime= -p 3036 says it has been running for 5-14:38:03. Time to CTRL-C it.
The following is probably not the answer you hoped for - i will tell you what you did wrong, why it was wrong and how you could do it better. You will still have to implement what i tell you yourself. Also, i will keep my explanation very short and introductory. You will need to research many of the pointers i give you on your own to explore the full capabilities of the things i explain.
If you want to show us the fruit of your efforts once you reimplemented the script and seek further advice - you will be welcome.
Quote:
Originally Posted by newbie_01
I've written a ksh script that read a file and parse/filter/format each line. The script runs as expected but it runs for 24+ hours for a file that has 2million lines. And sometimes, the input file has 10million lines which means it can be running for more than 2 days and still not finish.
This is a good start. Whenever you write code always take the time to estimate how long it will run, depending on the amount of input you expect. You don't need exact calculations, just a rough estimation for some expected orders of magnitude will suffice. There is a whole mathematical theory about this (see "Landau symbols" or "Big O notation"), but we won't need it. A glimpse of it will suffice.
Look at the following code:
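A sketch in the spirit of such code, where "program" stands for any external command (here it is a stub function so the example is self-contained):

```shell
# "program" stands in for any external command you might call
program() { echo "out:$1" ; }

calls=0
while read line ; do
    a=$(program "$line")        # first call per input line
    b=$(program "$line")        # second call per input line
    calls=$((calls + 2))
done <<EOF
line one
line two
EOF
echo "$calls calls for 2 lines of input"
```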
How long will this run? Well, obviously that depends on how long "program" will run, yes? But even without knowing that, we can already say that for every line of input we will have to run "program" twice. Now we can examine the input, and if it contains, say, 1 million lines, we know that "program" will be called 2 million times. If we estimate that "program" needs 1 millisecond for a single run, the script will take 0.001s x 2,000,000 = 2,000s, i.e. roughly 33 minutes. Add to that some overhead for reading the input file, writing the output files, loading "program" two million times into memory and starting it, etc., and we probably end up at about an hour of runtime.
Especially for large inputs it makes sense to test the finished program (script) with a short input and measure the time it takes. For this there is the time command. For instance you can take your script, save it under the name of myscript and then execute it with a test input of, say, 1000 lines, like this:
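For instance like this (myscript and the test input are placeholders; here both are fabricated so the example actually runs):

```shell
# fabricate a 1000-line test input and a stand-in "myscript"
seq 1000 > testinput
printf '#!/bin/sh\nwc -l < "$1"\n' > myscript
chmod +x myscript

time ./myscript testinput
```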
You will get an output like the following:
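Something along these lines (the numbers are of course illustrative):

```text
real    0m0.85s
user    0m0.42s
sys     0m0.31s
```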
If you are interested you may want to explore performance tuning and measuring, but for a start we are only concerned with the "real" line of the output. This is how long your program has run overall. Now that you have an estimate of how long it takes to process a thousand lines, it is easy to extrapolate how long it will take to process a million or ten million.
The next thing i want to talk about is probably more of what you expected: how to make code faster. First, here is a part of your code which i have trimmed down a bit. Let us use our new tool to estimate the runtime:
You immediately see why it pays off to indent properly: here you can't tell at a glance how many levels of nesting you have. Therefore, let us first reindent your code:
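The reindented version looks roughly like this (file names and field contents are placeholders; the shape - an outer for-loop, an inner while-loop, and an external awk call for every single line - is what matters):

```shell
# two tiny placeholder log files so the loop can actually run
printf '28-MAR-2018 one\n29-MAR-2018 two\n' > a.log
printf '30-MAR-2018 three\n' > b.log

for f in a.log b.log ; do                       # one pass per file
    while read line ; do                        # one pass per line
        TS=$(echo $line | awk '{ print $1 }')   # external awk call per line!
        echo "$f $TS"
    done < "$f"
done
```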
Now we see immediately that the inner while-loop is executed completely every time the outer for-loop does one pass. If we estimate that the for-loop finds 10 files and each file has 100 lines, the while-loop as a whole will be executed 10 times and every line within the while-loop will be executed 1000 times.
Most lines within the while-loop look like this:
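Something of this shape (the timestamp content here is made up):

```shell
TS="28-MAR-2018 06:05:56"     # made-up sample content
TS1=$(echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $3 }')
echo "$TS1"
```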
What does the shell do to process this code? First, the shell creates an extra process, in which the echo program is started. Some output stream is generated running echo $TS. Next, the awk program is loaded and executed by starting a child process and running awk '{ print $1 }' inside it. To this process the output generated by the echo is fed as input. The awk program generates some output of its own and a third sub-process is created and started, into which another instance of the awk program is loaded. The output of the first awk program is now fed as input to the second awk program, which itself generates some output based on that input. This output is caught and put into the variable.
Sounds complicated? Yes - because it is! Calling an external program is one of the most "expensive" (in terms of needed system resources and time) system calls there are! Fast shell scripts differ mostly in this regard from slow ones: how well they avoid calling external programs.
That begs the question: if we don't filter the part we need from the rest of the output with awk, what should we use instead? Luckily, the inventors of the shell asked themselves this question and they invented: variable expansion (also called "parameter expansion").
I won't explain it completely here, but only give a short introduction: suppose we have a variable holding a date, like this (notice that i am using the ISO date format: YYYY-MM-DD):
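For example (the value is arbitrary):

```shell
mydate="2018-03-28"
```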
Now, we want to split that into a year, month and day part.
There is a device which will cut off a part of a variable's content based on some pattern:
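In ksh (and any POSIX shell) it comes in four forms, shown here with a throwaway variable:

```shell
var="aa-bb-cc"
echo "${var#*-}"     # cut shortest matching prefix  -> bb-cc
echo "${var##*-}"    # cut longest matching prefix   -> cc
echo "${var%-*}"     # cut shortest matching suffix  -> aa-bb
echo "${var%%-*}"    # cut longest matching suffix   -> aa
```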
In our case the pattern we look for is "-", because this separates the days, months and the year. You can also use wildcards, like "*" (any number of any characters) and "?" (any single character), just like in filenames, when you do a ls -l *.txt.
Now let us try (i absolutely suggest that you play around with this - create your own variable contents and try different patterns and what comes out):
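For instance, with the date variable from above:

```shell
mydate="2018-03-28"
echo "${mydate%%-*}"   # -> 2018  (the year)
echo "${mydate##*-}"   # -> 28    (the day)
```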
Notice that the content of the variable is not changed at all - just the part which is displayed changes! If you want to save the result you will need to assign it to another (or the same) variable:
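For example:

```shell
mydate="2018-03-28"
year="${mydate%%-*}"
day="${mydate##*-}"
```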
Notice that i have left out the month here. We need a two-step approach to filter that out:
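One way to do it:

```shell
mydate="2018-03-28"
tmp="${mydate#*-}"     # first step:  cut the year -> 03-28
month="${tmp%-*}"      # second step: cut the day  -> 03
```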
Now we have a complete solution:
You probably ask right now how much this influences the runtime. You are right to ask, but seeing is believing, as they say. Prepare a log file with 1000 lines and run these two scripts, each with the time command i showed you above:
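For instance like this, run inline (the input is fabricated; the point is the comparison, not the absolute numbers):

```shell
seq 1000 > testlog                      # 1000-line stand-in for your log

# variant 1: one external awk call per line
time sh -c 'while read line ; do TS=$(echo $line | awk "{ print \$1 }") ; done < testlog'

# variant 2: the shell splits the line itself - no external calls
time sh -c 'while read TS rest ; do : ; done < testlog'
```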
And see what comes out.
I have used another device above to further speed up things: the shell has the ability to split input into fields. This is usually done along delimiters of whitespace. Consider the following command:
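For example (using ls, whose -a, -b and -c options combine into -abc; the two files are created first so the line can run):

```shell
touch file1 file2      # make the example files exist
ls -abc file1 file2    # three options, then two separate file operands
```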
Somehow we expect the shell to interpret file1 as the name of one file and file2 as the name of another. We do NOT expect the shell to confuse this with a file called "-abc file1" or "file1 file2" or so. This is because of this innate splitting ability and the fact that the strings file1 and file2 are surrounded by whitespace.
We can use this ability to our advantage when we read input too. You do it already when you do:
The content of the variable "line" is split along whitespace and the first part goes into a variable named TS, the second part to a variable named CS and so on. (On a passing note: "HOST" is a bad name for a variable because it is often a - fixed - value with the name of the system you are running on. Use something else.)
But instead of doing:
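That is, something of this shape (sample content invented):

```shell
echo "28-MAR-2018 06:05:56 more" > sample
read line < sample
TS=$(echo $line | awk '{ print $1 }')   # external call just to split!
```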
You can do immediately:
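That is (same invented sample line, recreated so this stands alone):

```shell
echo "28-MAR-2018 06:05:56 more" > sample
read TS rest < sample      # the shell does the splitting itself
```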
This is what i have done above. Notice that you may still need the line as a whole and it might make sense to retain it like you did - i just didn't need it for this part, so i left it out. You should just be aware of what is possible.
There are some further rules for this splitting: if you have less variables than fields everything left over will be put into the last variable:
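For example:

```shell
echo "one two three four" > sample
read a b c < sample
echo "$c"     # the leftover fields end up together in the last variable
```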
So, if you need only the, say, second part of a list of values:
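you can reuse a junk variable for everything you don't need:

```shell
echo "alpha beta gamma delta" > sample
read junk second junk < sample
echo "$second"
```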
If you have more variables than available fields the last variables will be simply empty.
Now, i suggest you first play around with what i told you and explore the possibilities. Only then try to reimplement your script in light of what i told you.
I hope this helps.
bakunin
Not sure why the service name comes in field $4 sometimes, shoving other fields right, and in field $6 other times...
How far do you get with
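A sketch of the idea - test which field holds the service name and pick accordingly (the field numbers and sample lines here are guesses, not your real log format):

```shell
# Guess: the service name is in $4 for one format and $6 for the other.
awk '{ svc = ($4 ~ /SERVICE_NAME/) ? $4 : $6 ; print svc }' <<EOF
a b c SERVICE_NAME=orcl x y
a b c d x SERVICE_NAME=test
EOF
```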
Yeah, I hate that fact too, that the service name moves from field to field. Looking at the lines, it has to do with whether the request is a JDBC connection or not. I'll give the awk bit a go. Thanks a lot.
Sorry, I've been sick for a while. Thanks a lot for all your advice. I will try all of the suggestions with a cut-down version of the file. I will have a real long read and work out how to implement your suggestions. Wish me luck. Thanks again everyone.