Big data file - sed/grep/awk?


 
# 1  
Old 06-11-2008

Morning guys. Another day, another question.

I am knocking up a script to pull some data from a file. The problem is the file is very big (up to 1 gig in size), so this solution:

Code:
for results in `grep "^\[$STARTHOUR" ANYOLDFILE | awk -F'|' '{print $4}'`
do
  stuff
done


... works, but takes ages (we're talking minutes) to run. The data is held in this format:

Code:
[06:26] [200806] [INFO] |58|33|81|UserID : 00012345|
[07:26] [200806] [INFO] |63|72|79|UserID : 00012345|
[08:26] [200806] [INFO] |41|34|32|UserID : 00012345|
[09:26] [200806] [INFO] |54|55|44|UserID : 00012345|

I'm guessing that instead of the grep part I should be using a stream editor, but I'm struggling to find out which is best, and what the syntax would be.

Any ideas?
# 2  
Old 06-11-2008
Just getting rid of the grep should save you some cycles. Also reading the result into backticks and then looping over the result is wasteful (although some shells probably optimize that into a loop internally).

Code:
awk -F'|' "/^\[$STARTHOUR/"'{print $4}' ANYOLDFILE |
while read results; do
  stuff
done

If you can implement "stuff" in awk you are probably as fast as this is going to get, unless of course you are in a position to sit down and write your own little C program. Anyway, the "stuff" part is probably responsible for most of the processing time, not the loop scaffolding around it.
# 3  
Old 06-11-2008
Thanks, I'll implement that bit now and see if that makes much difference. The "stuff" part is:

Code:
for results in `grep "^\[$STARTHOUR" ANYOLDFILE | awk -F'|' '{print $4}'`
do
  if [ $results -gt 9999 ]
  then
    (( NUMOFSECONDS[10]=NUMOFSECONDS[10]+1 ))
  else
    typeset -Z4 results
    AMOUNTOFSECONDSFINDER=`echo $results| cut -c1`
    (( NUMOFSECONDS[AMOUNTOFSECONDSFINDER]=NUMOFSECONDS[AMOUNTOFSECONDSFINDER]+1 ))
  fi
done


... so I'm not sure what fat can be trimmed from here. Feel free to tell me if I'm missing anything obvious!
# 4  
Old 06-11-2008
Hmmm, looks like you were right. It actually slows it down slightly reading the file in the way you suggested, so the problem is obviously in the "stuff" part.

If the loop only has a few records to handle it's fast, once it gets to a few thousand it slows to a crawl. Curses!

Anyone got any thoughts on ways to improve the performance of the loop?

Last edited by dlam; 06-11-2008 at 06:47 AM..
# 5  
Old 06-11-2008
This is the line of code that you need to find a way of optimizing. As the number of results increases, this line of code is going to take longer and longer to execute.
Code:
AMOUNTOFSECONDSFINDER=`echo $results| cut -c1`
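
One way to take that line out of the loop, as a rough sketch only: zero-padding a value to four digits and cutting the first character gives the same digit as integer division by 1000, so shell arithmetic can stand in for the `echo | cut` pipeline, which starts extra processes for every record. The helper name `set_bucket` and the variable `BUCKET` are made up for illustration, and the sketch assumes the values are plain non-negative decimal integers without leading zeros (a leading zero would be read as octal by shell arithmetic).

```shell
# Hypothetical helper (not from the original script): sets the global
# BUCKET to the array index (0-10) for the value passed in $1.
# Uses only shell built-ins, so nothing is forked per record.
set_bucket() {
    if [ "$1" -gt 9999 ]; then
        # Anything above 9999 goes into the overflow slot, as in the original.
        BUCKET=10
    else
        # Same digit as: typeset -Z4 results; echo $results | cut -c1
        # (assumes $1 has no leading zeros)
        BUCKET=$(( $1 / 1000 ))
    fi
}
```

Inside the loop this would be used as `set_bucket "$results"` followed by `(( NUMOFSECONDS[BUCKET]=NUMOFSECONDS[BUCKET]+1 ))`. Whether this is where the time actually goes would need to be measured on the real file.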

# 6  
Old 06-11-2008
@fpmurphy: I don't think it's growing, it's just looping over the fourth field. As far as I can tell, the root cause would seem to be that the shell's arrays are not scaling nicely.

Most of what you're doing can be accomplished in awk directly just as well. typeset -Z4 appears to be a kshism to pad a number with leading zeros to the specified width, correct?

I don't know if this captures all the nuances of your script, but perhaps it can be refined to do what you need.

Code:
awk -F '|' "/^\[$STARTHOUR/"'{
    if ($4 > 9999) $4=10000; ++m[int($4/1000)]}
  END { for (i=0; i<=10; ++i) printf ("NUMOFSECONDS[%i]=%04i\n", i, m[i]) }' ANYOLDFILE
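
A variant of the same idea, offered as a sketch rather than a drop-in replacement: the hour is handed to awk with `-v` instead of being spliced into the program text, and the line prefix is matched with `index()` so the `[` needs no regex escaping. The wrapper name `count_buckets` and its two arguments (hour, file) are assumptions for illustration, and the counts are printed unpadded.

```shell
# Hypothetical wrapper: $1 is the hour prefix (e.g. 06), $2 is the data file.
count_buckets() {
    awk -F'|' -v hour="$1" '
        # Keep only lines that begin with "[" followed by the hour.
        index($0, "[" hour) == 1 {
            v = $4 + 0                              # force numeric value
            b = (v > 9999) ? 10 : int(v / 1000)     # bucket index 0-10
            ++m[b]
        }
        END {
            # Print all eleven buckets; unused ones print as 0.
            for (i = 0; i <= 10; ++i)
                printf "NUMOFSECONDS[%d]=%d\n", i, m[i]
        }' "$2"
}
```

It would be called as `count_buckets "$STARTHOUR" ANYOLDFILE`, and it reads the file once with no per-record work in the shell.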


Last edited by era; 06-11-2008 at 07:41 AM.. Reason: Use int() to truncate division
# 7  
Old 06-11-2008
Thanks guys. A little bit of editing has shown it is definitely that line causing the problem, but as era says, it shouldn't be the size that is causing it, because it's just holding one variable at a time. So the arrays not performing well could certainly be the reason.

I'll have a play with your script and see if I can slot it in.

Thanks again.