Help speeding up this parsing script - taking 24+ hours to run
Hi,
I've written a ksh script that reads a file and parses/filters/formats each line. The script runs as expected, but it runs for 24+ hours on a file that has 2 million lines. Sometimes the input file has 10 million lines, which means it can run for more than 2 days and still not finish. And of course, the SAs have been chasing me up, as it shows in top as running practically forever.
I need some advice: maybe instead of reading one line at a time, I could run an awk one-liner instead. I wish I could code it in Perl, but I'm not sure how. Most people say it is faster in Perl, but I don't know how to use Perl-like equivalents of the UNIX commands other than calling them via system.
Anyway, hopefully I can interest someone in looking into this.
Below is the excerpt / part of the script that is taking the most time:
Below are example entries of the input file that the script reads; it is at least 2 million lines and can grow to as much as 10 million lines. I've changed entries as they are customer data.
What I am wanting to do, in the simplest terms, is as below:
Change the date format to YYYY-MM-DD. The main reason is that this format is the most convenient for sorting.
Filter some information from each line, i.e. host name, IP, program name, service name, return code etc.
I then redirect these formatted lines/records to a file that I can check grouped by return code value, or simply run through sort | uniq -c so it displays a count of occurrences.
Running awk once and only once would be so much faster than running awk 180,000,000 times; it'd be done in under a minute, maybe even single-digit seconds.
Perl is not faster. If you wrote this code the same way in Perl it'd be just as slow or slower.
Unfortunately, the program you've given doesn't seem to work, so I can't tell what output you want. Could you post the output you want?
Last edited by Corona688; 03-28-2018 at 04:46 PM..
You have shown us an input file and you have shown us a script that invokes awk and sed at least 30 times for every line read from your file. It is no wonder that running this script is burning up CPU cycles to the detriment of anyone else trying to use the same system you're using.
Please describe in English exactly what output you're trying to produce and show us the exact output you hope to produce from your sample input. Saying that you want to filter the host name for each line doesn't really describe what you're trying to do, especially since many of your sample input lines contain more than one (HOST=value) string.
Please also tell us what operating system you're using. (Different operating systems have different utilities and different options available for some utilities.)
Sorry Corona688 and Don Cragun, I should have realized how difficult and unfair it was of me not to post an example output.
You are right that it is indeed a lot, lot, lot faster if it reads the whole file at once instead of line by line. I kicked off the script on a 10-million-line file over the weekend; I didn't get an Easter miracle of any sort, it is still running at this time.
You can ignore, or ideally forget, the horrible code that I posted. Let me explain what I've been trying to do below.
So, here is an example raw input file, unfiltered.
There can be millions of these lines, and at the moment the script reads one line at a time and generates formatted output like below.
I then use sort | uniq -c to do some sort of a count, which comes up with the below:
All fields of the output file come from the input file, with the exception of the second field that shows up as runserver01. This comes from running hostname. It doesn't have to be the second field; it can be anywhere, or can be added later on after all the filtering. It is basically just a way for me to figure out where I ran the script from.
Most of the lines are of the following format:
Sometimes, it can be like below:
I don't know how to make awk differentiate between the two formats and filter/get the right information. Note that the information appears in a different order in these two formats.
And yes, running the whole file through awk is faster than reading one line at a time, but I don't know how to get awk to do what I want so that it produces the output format I'm after.
I am looking at maybe doing one run of awk to change the date format first, and then a second awk run to split the CONNECT_DATA string into different parts.
But I can't figure out what to do, so for the first pass, I need to change
to
How do I tell awk -F"*" to print $1 and the rest of the fields, with $1 further changed to YYYY-MM-DD format? The real reason for formatting it as YYYY-MM-DD is that it works best when doing the sort.
And then the next pass is supposed to filter it to be like
Or ideally be like
Please advise on how best to do what I am wanting to do. Apologies for not giving enough information earlier.
P.S:
That ksh script I ran to process a file that has 9890943 lines is still running; ps -o etime= -p 3036 says it has been running for 5-14:38:03. Time to CTRL-C it.
The following is probably not the answer you hoped for - i will tell you what you did wrong, why it was wrong and how you could do it better. You will still have to implement what i tell you yourself. Also, i will keep my explanation very short and introductory. You will need to research many of the pointers i give you on your own to explore the full capabilities of the things i explain.
If you want to show us the fruit of your efforts once you reimplemented the script and seek further advice - you will be welcome.
Quote:
Originally Posted by newbie_01
I've written a ksh script that read a file and parse/filter/format each line. The script runs as expected but it runs for 24+ hours for a file that has 2million lines. And sometimes, the input file has 10million lines which means it can be running for more than 2 days and still not finish.
This is a good start. Whenever you write code always take the time to estimate how long it will run, depending on the amount of input you expect. You don't need exact calculations, just a rough estimation for some expected orders of magnitude will suffice. There is a whole mathematical theory about this (see "Landau symbols" or "Big O notation"), but we won't need it. A glimpse of it will suffice.
Look at the following code:
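A sketch in the spirit of such code, where "program" stands for any external command (here it is a stub function so the example is self-contained):

```shell
# "program" stands in for any external command you might call
program() { echo "out:$1" ; }

calls=0
while read line ; do
    a=$(program "$line")        # first call per input line
    b=$(program "$line")        # second call per input line
    calls=$((calls + 2))
done <<EOF
line one
line two
EOF
echo "$calls calls for 2 lines of input"
```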
How long will this run? Well, obviously that depends on how long "program" will run, yes? But even without knowing that, we can already say that for every line of input we will have to run "program" twice. Now we can examine the input, and if it contains, say, 1 million lines, we know that "program" will be called 2 million times. If we estimate that "program" needs 1 millisecond for a single run, the script will take 0.001s x 2,000,000 = 2,000s, i.e. roughly 33 minutes. Add to that some overhead for reading the input file, writing the output files, loading "program" two million times into memory and starting it, etc., and we probably end up at about an hour of runtime.
Especially for large inputs it makes sense to test the finished program (script) with a short input and measure the time it takes. For this there is the time command. For instance you can take your script, save it under the name of myscript and then execute it with a test input of, say, 1000 lines, like this:
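For instance like this (myscript and the test input are placeholders; here both are fabricated so the example actually runs):

```shell
# fabricate a 1000-line test input and a stand-in "myscript"
seq 1000 > testinput
printf '#!/bin/sh\nwc -l < "$1"\n' > myscript
chmod +x myscript

time ./myscript testinput
```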
You will get an output like the following:
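Something along these lines (the numbers are of course illustrative):

```text
real    0m0.85s
user    0m0.42s
sys     0m0.31s
```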
If you are interested you may want to explore performance tuning and measuring, but for a start we are only concerned with the "real" line of the output. This is how long your program has run overall. Now that you have an estimate of how long it takes to process a thousand lines, it is easy to extrapolate how long it will take to process a million or ten million.
The next thing i want to talk about is probably more of what you expected: how to make code faster. First, here is a part of your code which i have trimmed down a bit. Let us use our new tool to estimate the runtime:
You immediately see why it pays off to indent properly: here you can't tell at a glance how many levels of nesting you have. Therefore, let us first reindent your code:
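The reindented version looks roughly like this (file names and field contents are placeholders; the shape - an outer for-loop, an inner while-loop, and an external awk call for every single line - is what matters):

```shell
# two tiny placeholder log files so the loop can actually run
printf '28-MAR-2018 one\n29-MAR-2018 two\n' > a.log
printf '30-MAR-2018 three\n' > b.log

for f in a.log b.log ; do                       # one pass per file
    while read line ; do                        # one pass per line
        TS=$(echo $line | awk '{ print $1 }')   # external awk call per line!
        echo "$f $TS"
    done < "$f"
done
```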
Now we see immediately that the inner while-loop is executed completely every time the outer for-loop does one pass. If we estimate that the for-loop finds 10 files and each file has 100 lines, the while-loop as a whole will be executed 10 times and every line within the while-loop will be executed 1000 times.
Most lines within the while-loop look like this:
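Something of this shape (the timestamp content here is made up):

```shell
TS="28-MAR-2018 06:05:56"     # made-up sample content
TS1=$(echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $3 }')
echo "$TS1"
```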
What does the shell do to process this code? First, the shell creates an extra process, in which the echo program is started. Some output stream is generated running echo $TS. Next, the awk program is loaded and executed by starting a child process and running awk '{ print $1 }' inside it. To this process the output generated by the echo is fed as input. The awk program generates some output of its own and a third sub-process is created and started, into which another instance of the awk program is loaded. The output of the first awk program is now fed as input to the second awk program, which itself generates some output based on that input. This output is caught and put into the variable.
Sounds complicated? Yes - because it is! Calling an external program is one of the most "expensive" (in terms of needed system resources and time) system calls there are! Fast shell scripts differ mostly in this regard from slow ones: how well they avoid calling external programs.
That begs the question: if we don't filter the part we need from the rest of the output with awk, what should we use instead? Luckily, the inventors of the shell asked themselves this question and they invented: variable expansion (also called "parameter expansion").
I won't explain it completely here, but only give a short introduction: suppose we have a variable holding a date, like this (notice that i am using the ISO date format: YYYY-MM-DD):
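For example (the value is arbitrary):

```shell
mydate="2018-03-28"
```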
Now, we want to split that into a year, month and day part.
There is a device which will cut off a part of a variable's content based on some pattern:
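In ksh (and any POSIX shell) it comes in four forms, shown here with a throwaway variable:

```shell
var="aa-bb-cc"
echo "${var#*-}"     # cut shortest matching prefix  -> bb-cc
echo "${var##*-}"    # cut longest matching prefix   -> cc
echo "${var%-*}"     # cut shortest matching suffix  -> aa-bb
echo "${var%%-*}"    # cut longest matching suffix   -> aa
```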
In our case the pattern we look for is "-", because this separates the days, months and the year. You can also use wildcards, like "*" (any number of any characters) and "?" (any single character), just like in filenames, when you do a ls -l *.txt.
Now let us try (i absolutely suggest that you play around with this - create your own variable contents and try different patterns and what comes out):
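For instance, with the date variable from above:

```shell
mydate="2018-03-28"
echo "${mydate%%-*}"   # -> 2018  (the year)
echo "${mydate##*-}"   # -> 28    (the day)
```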
Notice that the content of the variable is not changed at all - just the part which is displayed changes! If you want to save the result you will need to assign it to another (or the same) variable:
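For example:

```shell
mydate="2018-03-28"
year="${mydate%%-*}"
day="${mydate##*-}"
```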
Notice that i have left out the month here. We need a two-step approach to filter that out:
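One way to do it:

```shell
mydate="2018-03-28"
tmp="${mydate#*-}"     # first step:  cut the year -> 03-28
month="${tmp%-*}"      # second step: cut the day  -> 03
```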
Now we have a complete solution:
You probably ask right now how much this influences the runtime. You are right to ask, but seeing is believing, as they say. Prepare a log file with 1000 lines and run these two scripts, each with the time command i showed you above:
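For instance like this, run inline (the input is fabricated; the point is the comparison, not the absolute numbers):

```shell
seq 1000 > testlog                      # 1000-line stand-in for your log

# variant 1: one external awk call per line
time sh -c 'while read line ; do TS=$(echo $line | awk "{ print \$1 }") ; done < testlog'

# variant 2: the shell splits the line itself - no external calls
time sh -c 'while read TS rest ; do : ; done < testlog'
```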
And see what comes out.
I have used another device above to further speed up things: the shell has the ability to split input into fields. This is usually done along delimiters of whitespace. Consider the following command:
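For example (using ls, whose -a, -b and -c options combine into -abc; the two files are created first so the line can run):

```shell
touch file1 file2      # make the example files exist
ls -abc file1 file2    # three options, then two separate file operands
```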
Somehow we expect the shell to interpret file1 as the name of one file and file2 as the name of another. We do NOT expect the shell to confuse this with a file called "-abc file1" or "file1 file2" or so. This is because of this innate splitting ability and the fact that the strings file1 and file2 are surrounded by whitespace.
We can use this ability to our advantage when we read input too. You do it already when you do:
The content of the variable "line" is split along whitespace and the first part goes into a variable named TS, the second part to a variable named CS and so on. (On a passing note: "HOST" is a bad name for a variable because it is often a - fixed - value with the name of the system you are running on. Use something else.)
But instead of doing:
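That is, something of this shape (sample content invented):

```shell
echo "28-MAR-2018 06:05:56 more" > sample
read line < sample
TS=$(echo $line | awk '{ print $1 }')   # external call just to split!
```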
You can do immediately:
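That is (same invented sample line, recreated so this stands alone):

```shell
echo "28-MAR-2018 06:05:56 more" > sample
read TS rest < sample      # the shell does the splitting itself
```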
This is what i have done above. Notice that you may still need the line as a whole and it might make sense to retain it like you did - i just didn't need it for this part, so i left it out. You should just be aware of what is possible.
There are some further rules for this splitting: if you have less variables than fields everything left over will be put into the last variable:
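For example:

```shell
echo "one two three four" > sample
read a b c < sample
echo "$c"     # the leftover fields end up together in the last variable
```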
So, if you need only the, say, second part of a list of values:
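you can reuse a junk variable for everything you don't need:

```shell
echo "alpha beta gamma delta" > sample
read junk second junk < sample
echo "$second"
```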
If you have more variables than available fields the last variables will be simply empty.
Now, i suggest you first play around with what i told you and explore the possibilities. Only then try to reimplement your script in light of what i told you.
I hope this helps.
bakunin
Not sure why the service name comes in field $4 sometimes, shoving other fields right, and in field $6 other times...
How far do you get with
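A sketch of the idea - test which field holds the service name and pick accordingly (the field numbers and sample lines here are guesses, not your real log format):

```shell
# Guess: the service name is in $4 for one format and $6 for the other.
awk '{ svc = ($4 ~ /SERVICE_NAME/) ? $4 : $6 ; print svc }' <<EOF
a b c SERVICE_NAME=orcl x y
a b c d x SERVICE_NAME=test
EOF
```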
Yeah, I hate that fact too, that the service name moves from field to field. Looking at the lines, it has to do with whether the request is a JDBC connection or not. I'll give the awk bit a go. Thanks a lot.
Sorry, I've been sick for a while. Thanks a lot for all your advice. I will try all of the suggestions with a cut-down version of the file. I will have a real long read and work out how to implement your suggestions. Wish me luck. Thanks again everyone.