I have an awk script that extracts data from log files and generates a report. The script increments elements in arrays, sums fields depending on the contents of other fields, and then performs some operations on those arrays in the END block. It seemed to be taking longer than it should on my system, so I started trying to figure out where the time is going. I added a simple test to the script (it's the same NR % 10000 block that appears in the full listing below):
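Code:
NR % 10000 == 0 {
    print NR "\t" length(time_count) "\t" systime() - prevtime
    prevtime = systime()
}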
time_count is an array that becomes rather large: it holds a per-second count across an hour's time for each of several servers. So it stacks up considerably, and seems like the best candidate for causing the slowdown.
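To give a sense of the shape of those keys (the subtenant name here is made up):
Code:
## Each key in time_count is a "subant,second" string:
##   time_count["subant01,1304567890"]++
## so an hour's window is roughly (number of subants) x 3600 keys,
## plus however far transactions run past the end of the window.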
Here's a snippet of that particular bit's output around when the slowdown starts: You can see that there's a big increase in the array length around 380,000 lines, as well as in the time required to process that particular set of 10,000 lines. The time grows, slightly but measurably, as the line count increases, and that will only become more of a problem as the script is used to process larger files.
So, my questions:
1) Are there any general suggestions for increasing performance? This may require me to post the whole script. I don't mind, but I don't want to clutter things up here, so let me know if that would help.
2) I've noticed that VIRT/RES/SHR in top for my script max out at 108m/6204/864, so it's actually using only about 6 MB of resident RAM, if I read that right. I'd like it to feel free to gobble up as much as it can get; this is a system with 96 GB of RAM, so memory isn't a problem. How can I encourage the process to do that?
I'd love it if I could tweak things so that disk I/O became the limiting factor. Many thanks in advance for suggestions.
As of this posting, my attachment is still pending approval. So, here's the script:
Code:
#! /bin/awk -f
BEGIN {
    OFS = ","
    count = 1
    prevtime = 0
    ## Read the subtenant list. Checking getline's return value avoids
    ## an infinite loop if the file can't be read.
    while ( (getline < "/root/scripts/billing/subant_list") > 0 )
    {
        split($0, sublist, ",")
        subants[count] = sublist[1]
        count++
    }
    close("/root/scripts/billing/subant_list")
}
## Let's chew on something now...
NR % 10000 == 0 {
    print NR "\t" length(time_count) "\t" systime() - prevtime
    prevtime = systime()
}
$1 >= start_time && $1 < end_time {
    linecount++
    ## Hourly operations count
    hourlyOperationsCount[substr($2,2,14)]++
    ## Generate a per-subant count, including:
    ##   Count of HTTP status codes per subant
    ##   Count of HTTP tx types per subant
    for ( i = 1 ; i <= length(subants) ; i++ )
    {
        if ( $14 == subants[i] )
        {
            ## Subtenant count:
            found["TotalTXCount," subants[i]]++
            ## Count of HTTP status codes per subant:
            found["StatusCount," subants[i] "," $5]++
            ## Count of HTTP tx types per subant:
            httptype = substr($9, 2, length($9) - 1)
            found["TXTypeCount," subants[i] "," httptype]++
            ## Cumulative size and time of tx by subant and HTTP tx type:
            indexInSizeByType = "InSizeByType," subants[i] "," httptype
            found[indexInSizeByType] += $17
            indexOutSizeByType = "OutSizeByType," subants[i] "," httptype
            found[indexOutSizeByType] += $18
            indexTimeByType = "TimeByType," subants[i] "," httptype
            found[indexTimeByType] += $19
        }
    }
}
## Ok, these next two sections warrant a little 'splainin. We track:
## 1) Concurrent connections -- connections that are ongoing during
##    a given second, whether or not they were initiated in that
##    particular second.
## 2) Initiated connections -- connections that were started in a
##    given second.
{
    ## First, track concurrent connections. This one doesn't have the
    ## time filter that everything else has, so that connections already
    ## in progress when the time window starts are counted.
    for ( i = 1 ; i <= length(subants) ; i++ )
    {
        if ( $14 == subants[i] )
        {
            stime = $1
            ## Transaction duration from $19, rounded to whole seconds.
            dur = int(($19 + 500000) / 1000000)
            if ( dur >= 1 )
            {
                for ( j = stime ; j <= stime + dur ; j++ )
                {
                    time_count[subants[i] "," j]++
                }
            }
            # for (i in time_count) {print i "\t" time_count[i]}
        }
    }
}
$1 >= start_time && $1 < end_time {
    ## Finally, we track initiated connections.
    for ( i = 1 ; i <= length(subants) ; i++ )
    {
        if ( $14 == subants[i] )
        {
            txInitiated[subants[i] "," $1]++
        }
    }
}
END {
    print linecount
    for ( i in found )
    {
        print i "," found[i]
    }
    ## Find each subant's peak concurrency within the time window.
    for ( i in time_count )
    {
        split(i, st, ",")
        subant = st[1]
        subant_time = st[2]
        if ( time_count[i] > max_concurrency[subant] && subant_time >= start_time && subant_time < end_time )
        {
            max_concurrency[subant] = time_count[i]
            max_concurr_time[subant] = subant_time
        }
    }
    ## Find each subant's peak per-second initiation count.
    for ( i in txInitiated )
    {
        split(i, st, ",")
        subant = st[1]
        subant_time = st[2]
        if ( txInitiated[i] > max_initiated[subant] )
        {
            max_initiated[subant] = txInitiated[i]
            max_init_time[subant] = subant_time
        }
    }
    for ( i in max_concurrency )
    {
        print "PeakCncrntConns," i "," max_concurrency[i] ",@" max_concurr_time[i]
    }
    for ( i in max_initiated )
    {
        print "PeakInitConns," i "," max_initiated[i] ",@" max_init_time[i]
    }
    for ( i in hourlyOperationsCount )
    {
        print "hourlyOperationsCount", i, hourlyOperationsCount[i]
    }
}
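For reference, subant_list is a comma-delimited file whose first field is the subtenant name, and start_time/end_time are epoch seconds passed in on the command line. A hypothetical invocation (file names and values made up):
Code:
./subant_report.awk -v start_time=1304566000 -v end_time=1304569600 access.log > report.csv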