Script Optimization - large delimited file, for loop with many greps

04-21-2011

Registered User

30, 0

Join Date: Apr 2011

Last Activity: 6 July 2011, 7:09 PM EDT

Posts: 30

Thanks Given: 12

Thanked 0 Times in 0 Posts

Script Optimization - large delimited file, for loop with many greps

Since there are approximately 75K gsfiles and hundreds of stfiles per gsfile, this script can take hours. How can I rewrite this script, so that it's much faster? I'm not as familiar with perl but I'm open to all suggestions.

Code:

ls file.list>$split
for gsfile in `cat $split`;
do
  csplit -ks -n 6 -f $gsfile.ST $gsfile /^ST/ {100000} 2>>$diagnostic
  for stfile in `ls $gsfile.ST*|sort -n`; 
  do 
    delim=`LC_ALL=C grep "^GS" $gsfile|cut -c3` 2>>$diagnostic
  gscode=`LC_ALL=C grep "^GS" $gsfile|cut -d "$delim" -f3` 2>>$diagnostic
  supcd=`LC_ALL=C grep "^N1.SU" $stfile|cut -d "$delim" -f5|head -1` 2>>$diagnostic
    sellcd=`LC_ALL=C grep "^N1.SE" $stfile|cut -d "$delim" -f5|head -1` 2>>$diagnostic
    firponum=`LC_ALL=C grep "^IT1" $stfile|cut -d "$delim" -f10|head -1` 2>>$diagnostic
    invtl=`LC_ALL=C grep "^TDS" $stfile|cut -d "$delim" -f2|tr -cd '[[:digit:]]'` 2>>$diagnostic 
    #I have about ten more greps here
    echo "$gscode,$supcd,$sellcd,$firponum,$invtl">>$detail_file
    rm -f $stfile 2>>$diagnostic                                                                                          
  done 
done

Here's an example of an input file. The delimiters can be any non-word character.

Code:

 
gsfile_1
GS*IN*TPU*TPM*110303*0634*65433*X*002000 
ST*810*0001  
N1*SU*TPUNAME*92*TPUCD21 
N1*SE*SELNAME*92*789 
IT1*1*8*EA*909234.12**BP*PARTNUM123*PO*PONUM342342*PL*526 
IT1*2*3*EA*53342.65**BP*PARTNUM456*PO*PONUM31131*PL*528 
TDS*32424214  
SE*7*0001
ST*810*0002  
N1*SU*TPUNAME*92*TPUCD43 
N1*SE*SELNAME*92*543 
DTM*011*110302 
IT1*1*10*EA*909234.12**BP*PARTNUM575*PO*PONUM1253123*PL*001  
IT1*2*15*EA*53342.65**BP*PARTNUM483*PO*PONUM646456*PL*002 
TDS*989248095 
SE*8*0002 
GE*2*65433
gs_file2
GS~IN~TPT~TPM~110302~2055~2321123~X~003010~
ST~810~000027324~
N1~SU~TPMNAME~92~TPUCD87
N1~SE~SELMNAME~92~23234
IT1~001~3450~EA~1234.67~~BP~PARTNUM6546-048~PO~PONUM99484~PL~235~
TDS~425961150~
SE~6~2321123~
GE~1~3201~

output should look like this ...
TPU,TPUCD21,789,PONUM342342,32424214
TPU,TPUCD43,543,PONUM1253123,989248095
TPT,TPUCD87,23234,PONUM99484,425961150

I hope this isn't too long! I'm new and not yet familiar with the forum posting style. Thanks so much for your help.

verge

View Public Profile for verge

Find all posts by verge

04-21-2011

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

Code:

while read file
do
...
done < foo

much more efficiently. You can also do

Code:

 while stuff ; do ... ; done 2>filename

to redirect stderr once for the whole loop instead of doing a special redirection for each and every individual command.

You can also set LC_ALL once instead of doing so for each and every individual command.

In your defense, you've been forced to deal with input data that looks like line noise!

I don't entirely understand what you're doing. Why are you csplitting on 10000 and /^ST/ ? Are two non-word characters in a row, **, supposed to imply a blank record between them? Finally, what is your system, what is your shell? That will have a big effect on the tools available to you.

I've started writing a solution in awk.

Last edited by Corona688; 04-21-2011 at 06:27 PM..

This User Gave Thanks to Corona688 For This Post:

Corona688

View Public Profile for Corona688

Visit Corona688's homepage!

Find all posts by Corona688

04-21-2011

Registered User

30, 0

Join Date: Apr 2011

Last Activity: 6 July 2011, 7:09 PM EDT

Posts: 30

Thanks Given: 12

Thanked 0 Times in 0 Posts

Thanks Corona,

I'm using Korn Shell on Microsoft Windows Services for UNIX 3.5 which supports
Sun Microsystems Solaris versions 7 and 8
Red Hat Linux version 8.0
IBM AIX version 5L 5.2
Hewlett-Packard HP-UX version 11i

Thanks for the tip about the backticks, sterr redirect and the while read ... I'll change that.

Yeah, the file is cumbersome

. As for the splitting, each /^ST/ is a new group, I had taken the {100000} to be the max number of times to execute the csplit.

Yes, two non-word characters in a row, is a blank field

Hope this clarifies the structure of the file ... the initial file is approx 3 million lines

Code:

GS*IN*TPU*TPM*110303*0634*65433*X*002000 #there are approx 75K GS to GE groups
  ST*810*0001  #potentially thousands of ST to SE groups, I have to relate the ST/SE group to the GS line
     N1*SU*TPUNAME*92*TPUCD21 
     N1*SE*SELNAME*92*789 
     IT1*1*8*EA*909234.12**BP*PARTNUM123*PO*PONUM342342*PL*526 
     IT1*2*3*EA*53342.65**BP*PARTNUM456*PO*PONUM31131*PL*528 
     TDS*32424214  
  SE*7*0001
     ST*810*0002  
       N1*SU*TPUNAME*92*TPUCD43 
       N1*SE*SELNAME*92*543 
       DTM*011*110302 
       IT1*1*10*EA*909234.12**BP*PARTNUM575*PO*PONUM1253123*PL*001  
       IT1*2*15*EA*53342.65**BP*PARTNUM483*PO*PONUM646456*PL*002 
       TDS*989248095 
     SE*8*0002 
GE*2*65433
GS~IN~TPT~TPM~110302~2055~2321123~X~003010~
   ST~810~000027324~
     N1~SU~TPMNAME~92~TPUCD87
     N1~SE~SELMNAME~92~23234
     IT1~001~3450~EA~1234.67~~BP~PARTNUM6546-048~PO~PONUM99484~PL~235~
    TDS~425961150~
   SE~6~2321123~
GE~1~3201~

verge

View Public Profile for verge

Find all posts by verge

04-21-2011

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

How about this:

Code:

#!/bin/awk -f
# This section gets run only once, before anything's read.
# using it for variable setup.
BEGIN {
        # Don't have to check what the delimiter is, just split on
        # any single character that's not a-z, A-Z, 0-9, _
        FS="[^a-zA-Z0-9_]"
        # Print separated by commas
        OFS=","
}

# Each of the following expressions gets executed once for every
# line that matches the regex.

# Sometimes this one's column 11, sometimes it's column 12
/^IT1/  {       if(!FIRPONUM)
                {
                        FIRPONUM=$11
                        if(!(FIRPONUM ~ /^PONUM/))
                                FIRPONUM=$12;
                }
        }
# Matching these lines is easy
/^TDS/  {       INVTL=$2        }
/^N1.SE/{       SELLCD=$5       }
/^N1.SU/{       SUPCD=$5        }
/^GS/   {       GSCODE=$3       }
# Print on this only once we've read FIRPONUM
/^ST/   {
                if(FIRPONUM)
                        print GSCODE,SUPCD,SELLCD,FIRPONUM,INVTL;

                FIRPONUM=""
        }

# Have to print once on exit or we'll lose the last line
END {   print GSCODE,SUPCD,SELLCD,FIRPONUM,INVTL;       }

Not complete since neither's your example, but much more efficient than grep | cut for every line, and might be enough to get you started.

---------- Post updated at 03:46 PM ---------- Previous update was at 03:44 PM ----------

Quote:

I'm using Korn Shell on Microsoft Windows Services for UNIX 3.5

Blech. Poor imitation of a korn shell.

And since you're not actually running UNIX my awk script of course can't run as a script like I intended. Small difference though. Just run it like awk -f script.awk inputfile

---------- Post updated at 03:52 PM ---------- Previous update was at 03:46 PM ----------

Whoa, is your data actually indented like that? That changes things.

This User Gave Thanks to Corona688 For This Post:

Corona688

View Public Profile for Corona688

Visit Corona688's homepage!

Find all posts by Corona688

04-21-2011

Registered User

30, 0

Join Date: Apr 2011

Last Activity: 6 July 2011, 7:09 PM EDT

Posts: 30

Thanks Given: 12

Thanked 0 Times in 0 Posts

No it isn't indented ... I just indented it, to point out the relationship between the groups

ok ... you've given me something to chew on here, this is a great start, I'm going to start rewriting

would I call this awk script from within my ksh script?

Thanks Corona!

---------- Post updated at 03:09 PM ---------- Previous update was at 03:08 PM ----------

I was trying to be brief ... if I can make my example more complete, please let me know

verge

View Public Profile for verge

Find all posts by verge

04-21-2011

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

Quote:

Originally Posted by verge

would I call this awk script from within my ksh script?

Yes. You could dump everything I wrote into a text file named script.awk (name unimportant), then run awk on that file in your ksh script with awk -f script.awk datafile

Or you could embed the entire thing into your ksh script like

Code:

<datafile awk 'BEGIN { a=b; c=d;

/^WTF/ { stuff }

...

...

...

}'

If your shell supports multi-line strings, that is.

I'll be happy to help with troubles you have improving it but it's probably best for you to match it to your needs. I'm not as likely to notice if things go just slightly wrong.

This User Gave Thanks to Corona688 For This Post:

Corona688

View Public Profile for Corona688

Visit Corona688's homepage!

Find all posts by Corona688

04-21-2011

Registered User

30, 0

Join Date: Apr 2011

Last Activity: 6 July 2011, 7:09 PM EDT

Posts: 30

Thanks Given: 12

Thanked 0 Times in 0 Posts

Thanks a lot Corona, I really appreciate your help ... I have a few other parsing issues but solving this piece helps me a great deal ... I knew there was a better way then grep|cut etc.

I just started scripting by stringing commands together and I'm noticing more and more that's the wrong approach

I'm going to try your awk now

verge

View Public Profile for verge

Find all posts by verge

Shell Programming and Scripting

Script Optimization - large delimited file, for loop with many greps

10 More Discussions You Might Find Interesting

1. UNIX for Advanced & Expert Users

Need Optimization shell/awk script to aggreagte (sum) for all the columns of Huge data file

Discussion started by: kartikirans

2. UNIX for Advanced & Expert Users

Need optimized awk/perl/shell to give the statistics for the Large delimited file

Discussion started by: kartikirans

3. Shell Programming and Scripting

Tab Delimited file in loop

Discussion started by: sukhdip

4. Shell Programming and Scripting

Need a script to convert comma delimited files to semi colon delimited

Discussion started by: CarpKing

5. Shell Programming and Scripting

Removing dupes within 2 delimited areas in a large dictionary file

Discussion started by: gimley

6. Shell Programming and Scripting

help with a shell script that greps an error from the logs

Discussion started by: adak2010

7. Shell Programming and Scripting

Extracting a portion of data from a very large tab delimited text file

Discussion started by: Lucky Ali

8. Shell Programming and Scripting

Large pipe delimited file that I need to add CR/LF every n fields

Discussion started by: clintrpeterson

9. UNIX for Dummies Questions & Answers

Command that creates file and also greps that file?

Discussion started by: Sepia

10. Shell Programming and Scripting

Directory sizes loop optimization

Discussion started by: la_womn