awk - find average interarrival times for each unique page

05-01-2011


All,

I have a test file as specified below. 1st col is <arrival time> and 2nd col is <Page #>. I want to find the inter-arrival time of requests for each page # (I've done this part already). Once I have this, I want to calculate the average interarrival time. Note, that I am trying to have the average interarrival time for the requests that arrive for each unique page. In other words, I don't want the average inter-arrival time for all of the requests in the trace with no respect to pages, b/c that would be trivial to do.

I know how to do the calculation but my problem is I'm not sure what the best way to store these would be. Before I calculate it, I probably need to store all of the inter-arrival times for each unique page first, then I can calculate the average. Or maybe someone knows of an easier way to do this. Here is my example.

My testfile.txt (the file is sorted by Page # (2nd col))

Code:

0.000 55588
0.000 55592
3.2320 55592
117.964 55596
530.841 55596
928.232 55596
117.964 55600
530.841 55600
630.789 55600
700.232 55600

For the average inter-arrival time, I would just add all the interarrival times up for that page and then divide by [the number of requests for that page - 1]. It is minus one because it is the inter-arrival time between 2 requests.
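As a quick check of that rule, here is a minimal sketch that averages the gaps for one page at a time. The helper name `avg_gap` is made up for illustration, and it assumes the arrival times are passed in increasing order. Because the gaps telescope, their sum is just the last arrival minus the first.

```shell
#!/bin/bash
# Hypothetical helper: average inter-arrival time for ONE page.
# Arguments: the arrival times of that page, in increasing order.
avg_gap() {
    printf '%s\n' "$@" | awk '
        NR > 1 { sum += $1 - prev }   # add the gap to the previous request
        { prev = $1 }                 # remember this arrival time
        END { printf "%.3f\n", (NR > 1) ? sum / (NR - 1) : 0 }'
}
```

For page 55596, `avg_gap 117.964 530.841 928.232` gives (412.877 + 397.391) / 2 = 405.134, and a page with a single request gives 0.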

My desired output should be something like this:

Code:

<Page #> <Average inter-arrival time for each Page #>
55588 0
55592 3.232
55596 405.134
55600 194.089

Here is the code I have so far.

Code:

#!/bin/bash

cat testfile.txt | awk '
{currTS=$1; currPage=$2}

{
if(currPage==prevPage)
        { intArriv=currTS-prevTS }
else
        { intArriv=0 }
print "prevTS="prevTS" prevPage="prevPage" currTS="currTS" currPage="currPage" intArriv="intArriv

prevTS=currTS
prevPage=currPage
} '

Thank you in advance for your help!
Jonathan

jontjioe


05-01-2011


Not sure if this is what you are looking for...

Code:

#!/bin/bash
cat testfile.txt | awk '
{currTS=$1; currPage=$2}

{
if(currPage==prevPage)
        { intArriv=currTS-prevTS }
else
        { intArriv=0 }
print "prevTS="prevTS" prevPage="prevPage" currTS="currTS" currPage="currPage" intArriv="intArriv

prevTS=currTS
prevPage=currPage

a[$2]=a[$2]+intArriv;
b[$2]++;
}
END{
for(i in a){div=b[i]-1;print "Average Inter-Arrival Time for "i"\t:\t"a[i]/(div?div:1)}
}'

regards,
Ahamed

Last edited by ahamed101; 05-01-2011 at 06:20 PM..

ahamed101


05-01-2011


Ahamed,

That definitely worked for the small sample file I posted! Thanks. However, I am running this on a very large file and for some reason I am getting negative numbers. I'm guessing I need to account for very large numbers? Do I need to cast some of the variables to float, or handle large values some other way?

Thanks again for your help!
Jonathan

Here is the complete testfile.txt that I am using. I have put it in my dropbox since it is about 18MB.
http://dl.dropbox.com/u/9867823/testfile.txt

I've also modified the script slightly. The updated script is below:

Code:

#!/bin/bash

FILE=$1

cat $FILE | sort -n -k2 | awk '
{currTS=$1; currPage=$2}

{
if(currPage==prevPage)
        { intArriv=currTS-prevTS }
else
        { intArriv=0 }

prevTS=currTS
prevPage=currPage

a[$2]=a[$2]+intArriv;
b[$2]++
} 
END{
for(i in a){div=b[i]-1;print i"\t"a[i]/(div?div:1)
}
}
' > ${FILE}_interArrivalTimes

jontjioe


05-02-2011


For floating-point notation, use printf with %f in your END block.
This slight modification should display everything as you wish:

Code:

END{
for(i in a){div=b[i]-1;printf "%s %f\n",i,a[i]/(div?div:1)}
}
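To illustrate the difference between print and printf, here is a small sketch. The function name `show_formats` and the value 1234567.891 are made up for the demonstration. A plain print formats non-integer numbers with OFMT (default "%.6g"), so large values come out in scientific notation, while printf with %f keeps fixed notation. This only affects how a value is displayed; it would not by itself turn a positive value negative.

```shell
#!/bin/bash
# Hypothetical demo: the same number shown by print and by printf "%f".
show_formats() {
    awk 'BEGIN {
        x = 1234567.891
        print x               # formatted with OFMT ("%.6g" by default)
        printf "%f\n", x      # fixed notation, six digits after the point
    }'
}
```

Calling `show_formats` prints 1.23457e+06 followed by 1234567.891000.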

Peasant


05-02-2011


Try this,

Code:

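sort -nk2 -nk1 testfile.txt | awk '
# a[page] = last timestamp seen for that page; i = requests so far; diff = sum of gaps
{
  if ($2 in a) { diff = diff + $1 - a[$2]; a[$2] = $1; i++; b = $2; next }   # same page again
  else {
    # new page: print the finished page (0 if it had a single request), then reset
    if (i > 0) { --i; if (i == 0) { printf "%s %f\n", b, 0 } else { printf "%s %f\n", b, diff / i } }
    i = 0; diff = 0; a[$2] = $1; i++; b = $2
  }
}
END {
  if (i > 0) { --i; if (i == 0) { printf "%s %f\n", b, 0 } else { printf "%s %f\n", b, diff / i } }
}'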


pravin27


05-02-2011


Code:

echo '0.000 55588
0.000 55592
3.2320 55592
117.964 55596
530.841 55596
928.232 55596
117.964 55600
530.841 55600
630.789 55600
700.232 55600' |awk '{v1=$2==v2?v1:$1;a[$2]=$1-v1;v2=$2;b[$2]++}END{for(i in a) print i,a[i]/(b[i]==1?1:b[i]-1)}'
55592 3.232
55600 194.089
55596 405.134
55588 0

yinyuemi


05-02-2011


Peasant,

I tried this but I'm still getting negative timestamps. Is the inter-arrival calculation happening correctly? It should be interArrivTime=currTime-prevTime (unless currTime is 0...in which case the ArrivTime for that line should just be 0).
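One plausible cause of the negative values is the sort step rather than the arithmetic. This is an assumption, since the full data file is not shown here. With `sort -n -k2` the key runs from the page column to the end of the line, and lines with the same page fall back to a byte-wise comparison of the whole line, so timestamps within a page are ordered as strings. The sketch below uses two made-up requests for the same page; the function names are hypothetical.

```shell
#!/bin/bash
# Two made-up requests for the same page; the later one has more integer digits.
sample() { printf '%s\n' '928.232 55596' '1000.500 55596'; }

# Page key only: ties are broken by comparing whole lines as text,
# so "1000.500" lands before "928.232" and the computed gap is negative.
sort_page_only() { sample | LC_ALL=C sort -n -k2; }

# Page first, then timestamp, each compared numerically.
sort_page_then_time() { sample | LC_ALL=C sort -k2,2n -k1,1n; }
```

Here `sort_page_only` puts 1000.500 first, while `sort_page_then_time` puts 928.232 first, which is the order the inter-arrival subtraction needs.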


Pravin27,

This looks like it's working perfectly! Thank you!

Jonathan


Thanks everybody for all your help on this...how much harder would it be to also add a 3rd column that gives me the standard deviation for the average inter arrival time for each page?

The formula for standard deviation is:
stand dev = square_root{ Summation[ (x - aveIntArrivTime)^2] / (N-1) }

where
x = the intArrivalTime for each page
aveIntArrivTime = the average InterArrivalTime for each page (which we now have)
N = the number of requests for each page

The formula is also shown here:
Simple Example of Calculating Standard Deviation
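A rough sketch of how the third column could be computed in a single pass is given here; it is not taken from any of the replies. It uses Welford's running-variance update instead of storing every gap, and it assumes the divisor counts inter-arrival gaps (requests minus one) rather than requests, so adjust `n - 1` if the formula above is meant literally. The function name `interarrival_stats` is made up.

```shell
#!/bin/bash
# Hypothetical helper: prints "<page> <mean gap> <sample std dev of gaps>" per page.
interarrival_stats() {
    sort -k2,2n -k1,1n "$1" | awk '
    # Print the finished page; n is the number of gaps, not the number of requests.
    function report() {
        sd = (n > 1) ? sqrt(m2 / (n - 1)) : 0
        printf "%s %.3f %.3f\n", page, mean, sd
    }
    NR == 1 || $2 != page {          # first request of a new page
        if (NR > 1) report()
        page = $2; prev = $1
        n = 0; mean = 0; m2 = 0
        next
    }
    {                                # later request of the same page
        x = $1 - prev; prev = $1     # x is the current gap
        n++
        d = x - mean
        mean += d / n                # running mean of the gaps
        m2 += d * (x - mean)         # running sum of squared deviations
    }
    END { if (NR > 0) report() }'
}
```

Run it as `interarrival_stats testfile.txt`. Pages with fewer than two gaps get a standard deviation of 0, mirroring how single-request pages get an average of 0.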

jontjioe
