Script Optimization - large delimited file, for loop with many greps

# 1
Old 04-21-2011

Since there are approximately 75K gsfiles and hundreds of stfiles per gsfile, this script can take hours. How can I rewrite this script, so that it's much faster? I'm not as familiar with perl but I'm open to all suggestions.

Code:
ls file.list>$split
for gsfile in `cat $split`;
do
  csplit -ks -n 6 -f $gsfile.ST $gsfile /^ST/ {100000} 2>>$diagnostic
  for stfile in `ls $gsfile.ST*|sort -n`; 
  do 
    delim=`LC_ALL=C grep "^GS" $gsfile|cut -c3` 2>>$diagnostic
    gscode=`LC_ALL=C grep "^GS" $gsfile|cut -d "$delim" -f3` 2>>$diagnostic
    supcd=`LC_ALL=C grep "^N1.SU" $stfile|cut -d "$delim" -f5|head -1` 2>>$diagnostic
    sellcd=`LC_ALL=C grep "^N1.SE" $stfile|cut -d "$delim" -f5|head -1` 2>>$diagnostic
    firponum=`LC_ALL=C grep "^IT1" $stfile|cut -d "$delim" -f10|head -1` 2>>$diagnostic
    invtl=`LC_ALL=C grep "^TDS" $stfile|cut -d "$delim" -f2|tr -cd '[[:digit:]]'` 2>>$diagnostic 
    #I have about ten more greps here
    echo "$gscode,$supcd,$sellcd,$firponum,$invtl">>$detail_file
    rm -f $stfile 2>>$diagnostic                                                                                          
  done 
done

Here's an example of an input file. The delimiters can be any non-word character.
Code:
 
gsfile_1
GS*IN*TPU*TPM*110303*0634*65433*X*002000 
ST*810*0001  
N1*SU*TPUNAME*92*TPUCD21 
N1*SE*SELNAME*92*789 
IT1*1*8*EA*909234.12**BP*PARTNUM123*PO*PONUM342342*PL*526 
IT1*2*3*EA*53342.65**BP*PARTNUM456*PO*PONUM31131*PL*528 
TDS*32424214  
SE*7*0001
ST*810*0002  
N1*SU*TPUNAME*92*TPUCD43 
N1*SE*SELNAME*92*543 
DTM*011*110302 
IT1*1*10*EA*909234.12**BP*PARTNUM575*PO*PONUM1253123*PL*001  
IT1*2*15*EA*53342.65**BP*PARTNUM483*PO*PONUM646456*PL*002 
TDS*989248095 
SE*8*0002 
GE*2*65433
gs_file2
GS~IN~TPT~TPM~110302~2055~2321123~X~003010~
ST~810~000027324~
N1~SU~TPMNAME~92~TPUCD87
N1~SE~SELMNAME~92~23234
IT1~001~3450~EA~1234.67~~BP~PARTNUM6546-048~PO~PONUM99484~PL~235~
TDS~425961150~
SE~6~2321123~
GE~1~3201~

output should look like this ...
TPU,TPUCD21,789,PONUM342342,32424214
TPU,TPUCD43,543,PONUM1253123,989248095
TPT,TPUCD87,23234,PONUM99484,425961150

I hope this isn't too long! I'm new and not yet familiar with the forum posting style. Thanks so much for your help.
# 2  
Old 04-21-2011
foo | grep | cut | sed | really | long | pipe | chain is never efficient, and you're doing this on almost every line. You've also got a lot of useless use of backticks, and useless use of cat. Whenever you have for file in `cat foo`, you could've done
Code:
while read file
do
...
done < foo

much more efficiently. You can also do
Code:
 while stuff ; do ... ; done 2>filename

to redirect stderr once for the whole loop instead of doing a special redirection for each and every individual command.

You can also set LC_ALL once instead of doing so for each and every individual command.
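
For example, a minimal sketch: export it once at the top of the script, and every command the script runs will inherit it.
Code:
# set once; inherited by every child process the script starts
export LC_ALL=C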

In your defense, you've been forced to deal with input data that looks like line noise! I don't entirely understand what you're doing. Why are you csplitting on {100000} and /^ST/ ? Are two non-word characters in a row, **, supposed to imply a blank record between them? Finally, what is your system, and what is your shell? That will have a big effect on the tools available to you.

I've started writing a solution in awk.

# 3  
Old 04-21-2011
Thanks Corona,

I'm using Korn Shell on Microsoft Windows Services for UNIX 3.5, which supports:
Sun Microsystems Solaris versions 7 and 8
Red Hat Linux version 8.0
IBM AIX version 5L 5.2
Hewlett-Packard HP-UX version 11i

Thanks for the tips about the backticks, the stderr redirect and the while read ... I'll change those.

Yeah, the file is cumbersome. As for the splitting, each /^ST/ starts a new group; I had taken the {100000} to be the max number of times to execute the csplit.
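
In other words, the call in the script above is meant to work like this (same flags as in the original, shown here with comments):
Code:
# -s: quiet, -k: keep the pieces even if the pattern matches fewer
# than 100000 times, -n 6: six-digit suffixes on the output files;
# {100000} repeats the /^ST/ split up to 100000 more times
csplit -ks -n 6 -f $gsfile.ST $gsfile /^ST/ {100000}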

Yes, two non-word characters in a row is a blank field.
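
For example, splitting one of the IT1 lines on its delimiter shows the empty field where ** appears:
Code:
# the ** after the price yields an empty sixth field
echo 'IT1*1*8*EA*909234.12**BP*PARTNUM123' | awk -F'*' '{ print "f6=[" $6 "] f7=" $7 }'
# prints: f6=[] f7=BP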

Hope this clarifies the structure of the file ... the initial file is approx 3 million lines
Code:
GS*IN*TPU*TPM*110303*0634*65433*X*002000 #there are approx 75K GS to GE groups
  ST*810*0001  #potentially thousands of ST to SE groups, I have to relate the ST/SE group to the GS line
     N1*SU*TPUNAME*92*TPUCD21 
     N1*SE*SELNAME*92*789 
     IT1*1*8*EA*909234.12**BP*PARTNUM123*PO*PONUM342342*PL*526 
     IT1*2*3*EA*53342.65**BP*PARTNUM456*PO*PONUM31131*PL*528 
     TDS*32424214  
  SE*7*0001
     ST*810*0002  
       N1*SU*TPUNAME*92*TPUCD43 
       N1*SE*SELNAME*92*543 
       DTM*011*110302 
       IT1*1*10*EA*909234.12**BP*PARTNUM575*PO*PONUM1253123*PL*001  
       IT1*2*15*EA*53342.65**BP*PARTNUM483*PO*PONUM646456*PL*002 
       TDS*989248095 
     SE*8*0002 
GE*2*65433
GS~IN~TPT~TPM~110302~2055~2321123~X~003010~
   ST~810~000027324~
     N1~SU~TPMNAME~92~TPUCD87
     N1~SE~SELMNAME~92~23234
     IT1~001~3450~EA~1234.67~~BP~PARTNUM6546-048~PO~PONUM99484~PL~235~
    TDS~425961150~
   SE~6~2321123~
GE~1~3201~

# 4  
Old 04-21-2011
How about this:
Code:
#!/bin/awk -f
# This section gets run only once, before anything's read.
# using it for variable setup.
BEGIN {
        # Don't have to check what the delimiter is, just split on
        # any single character that's not a-z, A-Z, 0-9, _
        FS="[^a-zA-Z0-9_]"
        # Print separated by commas
        OFS=","
}

# Each of the following expressions gets executed once for every
# line that matches the regex.

# Sometimes this one's column 11, sometimes it's column 12
/^IT1/  {       if(!FIRPONUM)
                {
                        FIRPONUM=$11
                        if(!(FIRPONUM ~ /^PONUM/))
                                FIRPONUM=$12;
                }
        }
# Matching these lines is easy
/^TDS/  {       INVTL=$2        }
/^N1.SE/{       SELLCD=$5       }
/^N1.SU/{       SUPCD=$5        }
/^GS/   {       GSCODE=$3       }
# Print on this only once we've read FIRPONUM
/^ST/   {
                if(FIRPONUM)
                        print GSCODE,SUPCD,SELLCD,FIRPONUM,INVTL;

                FIRPONUM=""
        }

# Have to print once on exit or we'll lose the last line
END {   print GSCODE,SUPCD,SELLCD,FIRPONUM,INVTL;       }

Not complete, since your example isn't either, but it's much more efficient than grep | cut for every line, and might be enough to get you started.

---------- Post updated at 03:46 PM ---------- Previous update was at 03:44 PM ----------

Quote:
I'm using Korn Shell on Microsoft Windows Services for UNIX 3.5
Blech. Poor imitation of a korn shell.

And since you're not actually running UNIX, my awk script can't run as a standalone script like I intended (the #!/bin/awk interpreter path may not exist there). Small difference though. Just run it like awk -f script.awk inputfile

---------- Post updated at 03:52 PM ---------- Previous update was at 03:46 PM ----------

Whoa, is your data actually indented like that? That changes things.
# 5  
Old 04-21-2011
No, it isn't indented ... I just indented it to point out the relationship between the groups.

OK ... you've given me something to chew on here. This is a great start; I'm going to start rewriting.

would I call this awk script from within my ksh script?

Thanks Corona!

---------- Post updated at 03:09 PM ---------- Previous update was at 03:08 PM ----------

I was trying to be brief ... if I can make my example more complete, please let me know
# 6  
Old 04-21-2011
Quote:
Originally Posted by verge
would I call this awk script from within my ksh script?
Yes. You could dump everything I wrote into a text file named script.awk (name unimportant), then run awk on that file in your ksh script with awk -f script.awk datafile
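
For instance, pulling the earlier suggestions together (while read instead of for/cat, one stderr redirect for the whole loop, LC_ALL set once), a wrapper might look something like this. It's only a sketch; names like file.list, detail.csv, and diagnostic.log are placeholders.
Code:
#!/bin/ksh
# sketch: one awk pass per gsfile replaces the grep|cut chains
export LC_ALL=C
while read gsfile
do
        awk -f script.awk "$gsfile" >> detail.csv
done < file.list 2>> diagnostic.log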

Or you could embed the entire thing into your ksh script like

Code:
<datafile awk 'BEGIN { a=b; c=d }

/^WTF/ { stuff }

...

...

...
'

If your shell supports multi-line strings, that is.

I'll be happy to help with troubles you have improving it but it's probably best for you to match it to your needs. I'm not as likely to notice if things go just slightly wrong.
# 7  
Old 04-21-2011
Thanks a lot Corona, I really appreciate your help ... I have a few other parsing issues but solving this piece helps me a great deal ... I knew there was a better way than grep|cut etc.

I just started scripting by stringing commands together, and I'm noticing more and more that it's the wrong approach.

I'm going to try your awk now