awk - find average interarrival times for each unique page


 
# 1  
Old 05-01-2011

All,

I have a test file as specified below. The 1st column is <arrival time> and the 2nd column is <Page #>. I want to find the inter-arrival time of requests for each page # (I've done this part already). Once I have this, I want to calculate the average inter-arrival time. Note that I want the average inter-arrival time of the requests that arrive for each unique page. In other words, I don't want the average inter-arrival time over all of the requests in the trace regardless of page, because that would be trivial to do.

I know how to do the calculation, but I'm not sure of the best way to store the inter-arrival times. Before I calculate the average, I probably need to store all of the inter-arrival times for each unique page; then I can compute it. Or maybe someone knows of an easier way to do this. Here is my example.

My testfile.txt (the file is sorted by Page # (2nd col))
Code:
0.000 55588
0.000 55592
3.2320 55592
117.964 55596
530.841 55596
928.232 55596
117.964 55600
530.841 55600
630.789 55600
700.232 55600

For the average inter-arrival time, I would just add all the interarrival times up for that page and then divide by [the number of requests for that page - 1]. It is minus one because it is the inter-arrival time between 2 requests.

My desired output should be something like this:
Code:
<Page #> <Average inter-arrival time for each Page #>
55588 0
55592 3.232
55596 405.134
55600 194.089

Here is the code I have so far.
Code:
#!/bin/bash

cat testfile.txt | awk '
{currTS=$1; currPage=$2}

{
if(currPage==prevPage)
        { intArriv=currTS-prevTS }
else
        { intArriv=0 }
print "prevTS="prevTS" prevPage="prevPage" currTS="currTS" currPage="currPage" intArriv="intArriv

prevTS=currTS
prevPage=currPage
} '

Thank you in advance for your help!
Jonathan
# 2  
Old 05-01-2011
Not sure if this is what you are looking for...
Code:
#!/bin/bash
cat testfile.txt | awk '
{currTS=$1; currPage=$2}

{
if(currPage==prevPage)
        { intArriv=currTS-prevTS }
else
        { intArriv=0 }
print "prevTS="prevTS" prevPage="prevPage" currTS="currTS" currPage="currPage" intArriv="intArriv

prevTS=currTS
prevPage=currPage

a[$2]=a[$2]+intArriv;
b[$2]++;
}
END{
for(i in a){div=b[i]-1;print "Average Inter-Arrival Time for "i"\t:\t"a[i]/(div?div:1)}
}'

regards,
Ahamed

Last edited by ahamed101; 05-01-2011 at 06:20 PM..
# 3  
Old 05-01-2011
Ahamed,

That definitely worked for the small sample file I posted! Thanks. However, I am running this on a very large file and for some reason I am getting negative numbers. I'm guessing it's because I need to account for very large numbers? Do I need to cast some of the variables as float, or handle large values some other way?

Thanks again for your help!
Jonathan

Here is the complete testfile.txt that I am using. I have put it in my dropbox since it is about 18MB.
http://dl.dropbox.com/u/9867823/testfile.txt

I've also modified the script slightly. The updated script is below:
Code:
#!/bin/bash

FILE=$1

cat $FILE | sort -n -k2 | awk '
{currTS=$1; currPage=$2}

{
if(currPage==prevPage)
        { intArriv=currTS-prevTS }
else
        { intArriv=0 }

prevTS=currTS
prevPage=currPage

a[$2]=a[$2]+intArriv;
b[$2]++
} 
END{
for(i in a){div=b[i]-1;print i"\t"a[i]/(div?div:1)
}
}
' > ${FILE}_interArrivalTimes

# 4  
Old 05-02-2011
For floating-point notation, use printf with %f in your END block.
With this slight modification, the values are printed in fixed-point form:
Code:
END{
for(i in a){div=b[i]-1;printf "%s %f\n",i,a[i]/(div?div:1)}
}
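
To illustrate what the %f change affects: awk's print converts non-integer numbers using OFMT (default "%.6g"), so large values come out in exponent form, while printf with %f keeps fixed-point notation. A minimal sketch follows; the value and the function name show_formats are made up for illustration and are not from the thread. This only changes how results are displayed, not how they are computed.

```shell
#!/bin/bash
# Sketch: compare print (uses OFMT, default "%.6g") with printf "%f".
# show_formats is a hypothetical name; the value is made up.
show_formats() {
    awk 'BEGIN {
        x = 1234567.891
        print x              # 1.23457e+06
        printf "%f\n", x     # 1234567.891000
    }'
}

show_formats
```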

# 5  
Old 05-02-2011
Try this,
Code:
sort -nk2 -nk1 testfile.txt | awk '{if($2 in a){diff=diff+$1-a[$2];a[$2]=$1;i++;b=$2;next}
else {
if(i>0) {--i;if(i==0){printf "%s %f\n",b,0}else{printf "%s %f\n",b,diff/i;}}
i=0;diff=0;a[$2]=$1;i++;b=$2
}
}
END {
 if(i>0) { --i;if(i==0){printf "%s %f\n",b,0}else{printf "%s %f\n",b,diff/i;}}
}'
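
A likely reason the sort order matters here (an assumption, since the large input file was not inspected): sorting on the page column alone leaves rows with the same page ordered by a byte-wise comparison of the whole line, so a timestamp such as 1000.5 can land before 999.1 and produce a negative difference. Adding a numeric key on the first column avoids that. The sketch below uses made-up rows and hypothetical function names; `-k2,2n -k1,1n` is a stricter spelling of the same idea as `-nk2 -nk1`.

```shell
#!/bin/bash
# Made-up rows: two requests for the same page, at times 999.1 and 1000.5.
rows() { printf '%s\n' '999.1 7' '1000.5 7'; }

# Page key only: the tie is broken by comparing the whole lines as bytes,
# so "1000.5 7" comes out before "999.1 7".
by_page_only() { rows | LC_ALL=C sort -n -k2; }

# Page first, then timestamp, both compared numerically.
by_page_then_time() { rows | LC_ALL=C sort -k2,2n -k1,1n; }

echo "sort -n -k2:"
by_page_only
echo "sort -k2,2n -k1,1n:"
by_page_then_time
```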

# 6  
Old 05-02-2011
Code:
echo '0.000 55588
0.000 55592
3.2320 55592
117.964 55596
530.841 55596
928.232 55596
117.964 55600
530.841 55600
630.789 55600
700.232 55600' |awk '{v1=$2==v2?v1:$1;a[$2]=$1-v1;v2=$2;b[$2]++}END{for(i in a) print i,a[i]/(b[i]==1?1:b[i]-1)}'
55592 3.232
55600 194.089
55596 405.134
55588 0
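
Why this one-liner gives the same averages as summing every gap: for each page it keeps only the first timestamp (v1) and the most recent one, and the sum of consecutive gaps telescopes to last minus first. Writing t_1, ..., t_n for the arrival times of one page in time order (notation introduced here, not from the thread):

```latex
\[
\frac{1}{n-1}\sum_{k=2}^{n}\left(t_k - t_{k-1}\right)
  = \frac{t_n - t_1}{n-1}, \qquad n \ge 2 .
\]
```

For page 55596 this is (928.232 - 117.964) / 2 = 405.134. The shortcut assumes the rows of each page are contiguous and in time order, as they are in the sample.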

# 7  
Old 05-02-2011
Peasant,

I tried this but I'm still getting negative inter-arrival times. Is the inter-arrival calculation happening correctly? It should be interArrivTime=currTime-prevTime (unless currTime is 0, in which case the inter-arrival time for that line should just be 0).

---------- Post updated at 03:22 PM ---------- Previous update was at 03:19 PM ----------

Pravin27,

This looks like it's working perfectly! Thank you!

Jonathan

---------- Post updated at 03:32 PM ---------- Previous update was at 03:22 PM ----------

Thanks everybody for all your help on this...how much harder would it be to also add a 3rd column that gives me the standard deviation for the average inter arrival time for each page?

The formula for standard deviation is:
stand dev = square_root{ Summation[ (x - aveIntArrivTime)^2] / (N-1) }

where
x = the intArrivalTime for each page
aveIntArrivTime = the average InterArrivalTime for each page (which we now have)
N = the number of requests for each page

The formula is also shown here:
Simple Example of Calculating Standard Deviation
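
One possible way to add the third column, offered as a sketch rather than a tested answer for the full 18MB file: keep a running sum and sum of squares of the gaps for the current page and print both statistics when the page changes. It takes the formula as written above, where N is the number of requests, so the divisor N-1 equals the number of gaps. The function name interarrival_stats, the LC_ALL=C setting, and the choice to print 0 for single-request pages are assumptions made here.

```shell
#!/bin/bash
# Sketch only. interarrival_stats is a hypothetical helper name.
# Prints: <page> <mean inter-arrival time> <standard deviation>
interarrival_stats() {
    # Order rows by page, then by arrival time, both numerically.
    LC_ALL=C sort -k2,2n -k1,1n "$@" | LC_ALL=C awk '
    function report(    mean, var) {
        # n counts the gaps of the current page (requests minus one).
        if (n == 0) { printf "%s %f %f\n", page, 0, 0; return }
        mean = sum / n
        var = (sumsq - n * mean * mean) / n   # same as sum((x-mean)^2)/n
        if (var < 0) var = 0                  # guard against rounding
        printf "%s %f %f\n", page, mean, sqrt(var)
    }
    NR > 1 && $2 != page { report(); n = sum = sumsq = 0 }
    NR > 1 && $2 == page { gap = $1 - prev; sum += gap; sumsq += gap * gap; n++ }
    { page = $2; prev = $1 }
    END { if (NR > 0) report() }
    '
}

# Sample rows from the first post
printf '%s\n' '0.000 55588' '0.000 55592' '3.2320 55592' \
    '117.964 55596' '530.841 55596' '928.232 55596' \
    '117.964 55600' '530.841 55600' '630.789 55600' '700.232 55600' |
    interarrival_stats
```

On the sample, page 55596 has gaps 412.877 and 397.391, giving a mean of 405.134 and a standard deviation of 7.743. If the sample standard deviation over the gaps is wanted instead, the divisor would be one less than the number of gaps, and pages with fewer than three requests would need special handling.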