The UNIX and Linux Forums  

  #1 (permalink)  
Old 06-11-2008
Registered User
 

Join Date: Mar 2008
Posts: 20
Big data file - sed/grep/awk?

Morning guys. Another day another question.

I am knocking up a script to pull some data from a file. The problem is the file is very big (up to 1 gig in size), so this solution:

for results in `grep "^\[$STARTHOUR" ANYOLDFILE | awk -F'|' '{print $4}'`
do
  stuff
done


... works, but takes ages (we're talking minutes) to run. The data is held in this format:

[06:26] [200806] [INFO] |58|33|81|UserID : 00012345|
[07:26] [200806] [INFO] |63|72|79|UserID : 00012345|
[08:26] [200806] [INFO] |41|34|32|UserID : 00012345|
[09:26] [200806] [INFO] |54|55|44|UserID : 00012345|

I'm guessing that instead of the grep part I should be using a stream editor, but I'm struggling to find out which is best, and what the syntax would be.

Any ideas?
  #2 (permalink)  
Old 06-11-2008
era
Herder of Useless Cats
 

Join Date: Mar 2008
Location: /there/is/only/bin/sh
Posts: 2,707
Just getting rid of the grep should save you some cycles. Also reading the result into backticks and then looping over the result is wasteful (although some shells probably optimize that into a loop internally).

Code:
awk -F'|' "/^\[$STARTHOUR/"'{print $4}' ANYOLDFILE |
while read results; do
  stuff
done
If you can implement "stuff" in awk you are probably as fast as this is going to get, unless of course you are in a position to sit down and write your own little C program. Anyway, the "stuff" part is probably responsible for most of the processing time, not the loop scaffolding around it.
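One small variation on the one-liner above, offered as a sketch rather than a drop-in: passing the hour to awk with `-v` avoids splicing a shell variable between two quoting styles. The awk variable name `hour` and the temporary sample file are made up for illustration; the sample lines just follow the format from the first post. `index()` does a literal prefix match, so the `[` needs no escaping.

```shell
STARTHOUR=07

# Throwaway sample file in the log format from the first post
sample=$(mktemp)
cat > "$sample" <<'EOF'
[06:26] [200806] [INFO] |58|33|81|UserID : 00012345|
[07:26] [200806] [INFO] |63|72|79|UserID : 00012345|
[08:26] [200806] [INFO] |41|34|32|UserID : 00012345|
EOF

# Keep lines that start with "[07" and print the fourth |-separated field
fields=$(awk -F'|' -v hour="$STARTHOUR" 'index($0, "[" hour) == 1 { print $4 }' "$sample")
echo "$fields"

rm -f "$sample"
```

With `-F'|'` the first field is everything before the first pipe, so `$4` is the third number on each line.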
  #3 (permalink)  
Old 06-11-2008
Registered User
 

Join Date: Mar 2008
Posts: 20
Thanks, I'll implement that bit now and see if that makes much difference. The "stuff" part is:

for results in `grep "^\[$STARTHOUR" ANYOLDFILE | awk -F'|' '{print $4}'`
do
  if [ $results -gt 9999 ]
  then
    (( NUMOFSECONDS[10]=NUMOFSECONDS[10]+1 ))
  else
    typeset -Z4 results
    AMOUNTOFSECONDSFINDER=`echo $results| cut -c1`
    (( NUMOFSECONDS[AMOUNTOFSECONDSFINDER]=NUMOFSECONDS[AMOUNTOFSECONDSFINDER]+1 ))
  fi
done


... so I'm not sure what fat can be trimmed from here. Feel free to tell me if I'm missing anything obvious!
  #4 (permalink)  
Old 06-11-2008
Registered User
 

Join Date: Mar 2008
Posts: 20
Hmmm, looks like you were right. It actually slows it down slightly reading the file in the way you suggested, so the problem is obviously in the "stuff" part.

If the loop only has a few records to handle it's fast, once it gets to a few thousand it slows to a crawl. Curses!

Anyone got any thoughts on ways to improve the performance of the loop?

Last edited by dlam; 06-11-2008 at 02:47 AM.
  #5 (permalink)  
Old 06-11-2008
Moderator
 

Join Date: Dec 2003
Location: /ksh93
Posts: 849
This is the line of code that you need to find a way of optimizing. As the number of results increases, this line of code is going to take longer and longer to execute.
Code:
AMOUNTOFSECONDSFINDER=`echo $results| cut -c1`
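For what it's worth, the backticks plus pipe start a subshell and a `cut` process on every pass through the loop. A minimal sketch of one way around that, assuming the field is a plain non-negative integer below 10000 with no leading zeros (a leading zero would be read as octal in shell arithmetic): the first digit of the zero-padded four-digit value is just the value divided by 1000. The variable names `bucket` and `buckets` are made up for the example.

```shell
buckets=""
for results in 81 999 1000 4321 9999
do
  # Integer division replaces "pad to 4 digits, then cut -c1"
  bucket=$(( results / 1000 ))
  buckets="$buckets$bucket "
done
echo "$buckets"
```

No external command runs inside the loop, so the per-record cost stays inside the shell itself.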
  #6 (permalink)  
Old 06-11-2008
era
Herder of Useless Cats
 

Join Date: Mar 2008
Location: /there/is/only/bin/sh
Posts: 2,707
@fpmurphy: I don't think it's growing, it's just looping over the fourth field. As far as I can tell, the root cause would seem to be that the shell's arrays are not scaling nicely.

Most of what you're doing can be accomplished in awk directly just as well. typeset -Z4 appears to be a kshism to pad a number with leading zeros to the specified width, correct?
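For readers without ksh to hand, a rough portable equivalent of that padding, as a sketch only: `printf '%04d'` pads to four digits, and a suffix-stripping parameter expansion leaves the first character. It assumes the value is a non-negative integer of at most four digits with no leading zeros; `padded` and `first` are names invented for the example.

```shell
results=81

# Pad to four digits with leading zeros, like typeset -Z4 would
padded=$(printf '%04d' "$results")

# Drop the last three characters, leaving the first one
first=${padded%???}

echo "$padded $first"
```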

I don't know if this captures all the nuances of your script, but perhaps it can be refined to do what you need.

Code:
awk -F '|' "/^\[$STARTHOUR/"'{
    # Values above 9999 all land in bucket 10; otherwise bucket = thousands digit
    if ($4 > 9999) $4=10000; ++m[int($4/1000)]}
  END { for (i=0; i<=10; ++i) printf ("NUMOFSECONDS[%i]=%04i\n", i, m[i]) }' ANYOLDFILE

Last edited by era; 06-11-2008 at 03:41 AM. Reason: Use int() to truncate division
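To see the bucketing idea run end to end, here is a small self-contained variation on the script above. It is a sketch under assumptions, not the poster's exact code: the sample lines are invented in the thread's log format, the hour is passed in with `-v` under the made-up name `hour`, and the output is a plain `bucket=count` list rather than shell assignments.

```shell
STARTHOUR=07

# Invented sample data: four lines for hour 07, one for hour 08
sample=$(mktemp)
cat > "$sample" <<'EOF'
[07:01] [200806] [INFO] |63|72|79|UserID : 00012345|
[07:02] [200806] [INFO] |63|72|1500|UserID : 00012345|
[07:03] [200806] [INFO] |63|72|1999|UserID : 00012345|
[07:04] [200806] [INFO] |63|72|12000|UserID : 00012345|
[08:26] [200806] [INFO] |41|34|32|UserID : 00012345|
EOF

histogram=$(awk -F'|' -v hour="$STARTHOUR" '
  index($0, "[" hour) == 1 {
    # Anything above 9999 goes to bucket 10, the rest by thousands digit
    bucket = ($4 > 9999) ? 10 : int($4 / 1000)
    ++count[bucket]
  }
  END { for (i = 0; i <= 10; ++i) printf "%d=%d\n", i, count[i] }
' "$sample")
echo "$histogram"

rm -f "$sample"
```

The whole file is read once and the counting happens inside awk, so the shell never loops over individual records.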
  #7 (permalink)  
Old 06-11-2008
Registered User
 

Join Date: Mar 2008
Posts: 20
Thanks guys. A little bit of editing has shown it is definitely that line causing the problem, but as era says, it shouldn't be the size that is causing it, because it's just holding one variable at a time. So the arrays not performing well could certainly be the reason.

I'll have a play with your script and see if I can slot it in.

Thanks again.
All times are GMT -7.

