Script to find the top 100 most popular pages

04-27-2011

Registered User

21, 1

Join Date: Jun 2010

Last Activity: 9 September 2011, 9:34 AM EDT

Posts: 21

Thanks Given: 14

Thanked 1 Time in 1 Post

Script to find the top 100 most popular pages

Ok, this is really beyond my scripting skill level so I'm hoping somebody can help me out with this. I have a trace file in the following format:

Code:

<timestame> <devicenum> <sector address> <size in sectors> <0 or 1 (write or read)>

Here is what I need to do. I need to use the <sector address>, <size in sectors>, and the <0 or 1> fields.

I need to first check that the last field is a 0. If it is 0, I will need to check more fields on this line. If it is 1, I can skip it and go on to the next line.

So, if the last field is 0, I need to calculate the "pages" that are in this line. My requirements for pages are:
1) A page will be made up of 4 sectors.
2) A page must start off at a <sector number> that is evenly divisible by 4. If it does not, the <sector address> should rounded DOWN to the nearest sector number that is evenly divisible by 4.
3) Additionally, since a page is 4 sectors, if the <size in sectors> is less than a multiple of 4, it will need to be rounded UP to the closest multiple of 4. In other words, if there are only 3 sectors in the <size in sectors>, that still takes up at least 1 page.

What I want to do is find the top 100 pages that are the most popular in terms of writes (the last column is 0) in a trace file.

Here is a small example to illustrate:

Code:

123.257 0 12 6 0
456.579 0 13 8 0
458.780 0 2 1
500.579 0 5 9 0

For the 1st line, there will be 2 pages: the 1st page starts at 12 and the 2nd page starts at 16.

For the 2nd line, there will also be 2 pages: 1st page starts at 12 and 2nd page starts at 16. Note that both of these pages are actually the same pages from line 1.

The 3rd line is ignored because the last column is a 1 (read request).

For the 4th line, there will be 3 pages: the 1st page starts at 4, the 2nd page starts at 8, and the last page starts at 12. Note that the page starting at 12 is the same as the page in lines 1 and 2.

So for this small example, I want to have a printout similar to this. It should be sorted by the 2nd column in descending order so I can see the most popular files.

Code:

Page (starting sector #)   |   # of Writes
---------------------------------------------
12 3
16 2
4 1
8 1

And if I haven't already asked you for the world...the faster it runs, the better! I will have to run this on several million lines, so speed is important. I already have awk or perl installed so hopefully it will be one of those. Perl seems to be much faster.

Thank you so much in advance! You guys are awesome!

A longer example of the trace is below for testing:

Code:

5839.257 0 303884 7 0
5839.257 0 206070 6 0
5839.257 0 817773 6 0
5878.579 0 303891 7 0
5878.579 0 361650 6 0
5878.579 0 973353 6 0
5970.329 0 841315 24 0
6009.651 0 16601 1 0
6009.651 0 285602 1 0
6009.651 0 140952 6 0
6009.651 0 211173 6 0
6009.651 0 878233 2 0
6009.651 0 1002247 2 0
6009.651 0 725319 1 0
6016.204 0 206070 6 0
6016.204 0 817773 6 0
6016.204 0 760113 1 0
6022.758 0 303898 24 0
6042.419 0 303922 7 0

jontjioe

View Public Profile for jontjioe

Find all posts by jontjioe

04-27-2011

Registered User

4,673, 588

Join Date: Oct 2010

Last Activity: 1 February 2016, 3:35 PM EST

Location: Southern NJ, USA (Nord)

Posts: 4,673

Thanks Given: 8

Thanked 588 Times in 561 Posts

Well, it sounds like you should not try to do too much at once! You have a sector start and length, so convert to sector range first last, and then to page range first last, and then iterate the pages in that inclusive range. For instance, 123 45 is 123 - 167 is 30 - 41 is 30, 31, 32, ..., 41. This assumes a zero-based sector and page numbering.

Use grep to filter out just the writes, do the conversions, and send the page writes to:

Code:

sort | uniq -c | sort -nr | head -100

If your conversion is too slow in bash/ksh/sh, you can move it to PERL/JAVA/C++/C. If your sort set gets too big, in place of "sort | uniq -c", you can use my aggsx.c in memory aggregator -l option: https://www.unix.com/shell-programmin...roup-unix.html If you want it really fast, do it all in C using a big array of int or long long. You might even pipe the trace to it, and print out periodic reports. Even better would be to decontruct your trace and integrate part of it, so you can work all in int/long, not string numbers.

Last edited by DGPickett; 04-27-2011 at 01:49 PM..

This User Gave Thanks to DGPickett For This Post:

DGPickett

View Public Profile for DGPickett

Find all posts by DGPickett

05-01-2011

Registered User

21, 1

Join Date: Jun 2010

Last Activity: 9 September 2011, 9:34 AM EDT

Posts: 21

Thanks Given: 14

Thanked 1 Time in 1 Post

Ok, I got it working. I split up the problem and I was able to figure it out...just took me a while. Here it is for those who are interested. I left all of the pages instead of just keeping the top 100.

Thanks!

Code:

#!/bin/bash
#===============================================================================
# 
#===============================================================================
#@| SYNOPSIS
# |
#-|   Filename: findPopularPages.sh
#-|   Version:  1.0 
# |
#-|   Purpose:
#-|            This script finds the popularity of pages of a disksim
#-|            formatted trace file.
#-|
#-|   Inputs:  Real world trace file in the DiskSim format
#-|
#-|   Outputs: Prints trace statistics to screen
#-|
#------------------------------------------------------------------------------
#@| REQUIREMENTS / DEPENDENCIES / ASSUMPTIONS
# |
#------------------------------------------------------------------------------
#@| REVISION HISTORY
# |
#-|   Date Created/Revised: 4/30/2011  Author/Revisor: Jonathan Tjioe
#-|   STR #:  Bug ID:  Description:
#-|   1.0  Jonathan Tjioe (4/30/2011)
# |
#===============================================================================

FILE=$1
SCRIPT=findPopularPages.sh
PWD=`pwd`

printf "\nStarted $SCRIPT on $FILE at:\t\t `date`"

#Only print the Write requests ($5==0)
removeReads() {
cat $FILE | awk '$5==0 {print $0}' > TEMP1
}

#The LBA must start at a block address that is divisible by 4
#The size must be in multipes of 4
adjustBlockAndSize() {
cat TEMP1 | awk '$3=(int($3/4)*4)' |awk '{
if (($4%4)!=0) {print $1" "$2" "$3" "((int($4/4)*4)+4)" "$5}
else {print $0} }' > TEMP2
} 

#Need to save a list of all of the pages for each line
createPageListing() {
cat TEMP2 | awk '{for (i=$3; i<($3+$4); i+=4) {print i}}' > TEMP3
}

#Count how many times a page was written to
countUniquePages() {
cat TEMP3 | awk '{count[$1]++} END {for(i in count) print i, count[i]}' > TEMP4
}

#Sort the pages by the most popular in descending order
sortPopularity() {
cat TEMP4 | sort -nr -k2 > ${FILE}_mostPopularPages
}

#Delete temporary files
cleanUp() {
rm TEMP1 TEMP2 TEMP3 TEMP4
}

############## Start of Main ##############
removeReads
adjustBlockAndSize
createPageListing
countUniquePages
sortPopularity
cleanUp
############## End of Main ################

printf "\nFinished $SCRIPT on $FILE at:\t\t `date`\n\n"
exit 0

jontjioe

View Public Profile for jontjioe

Find all posts by jontjioe

05-01-2011

Registered User

2,759, 420

Join Date: Jun 2006

Last Activity: 13 September 2015, 8:58 PM EDT

Posts: 2,759

Thanks Given: 44

Thanked 420 Times in 408 Posts

All in one awk.

Code:

awk '$5==0 {$3=int($3/4)*4; $4=($4%4)?(int($4/4)*4)+4:$4; for (i=$3;i<=$3+$4;i+=4) a[i]++}
    END{for (i in a) print i,a[i]|"sort -nr -k2|head -100"}' $FILE

rdcwayx

View Public Profile for rdcwayx

Find all posts by rdcwayx

Shell Programming and Scripting

Script to find the top 100 most popular pages

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Find and cat top lines recursively

Discussion started by: darbs121

2. Red Hat

How to find memory taken by a process using top command?

Discussion started by: RHCE

3. UNIX for Dummies Questions & Answers

how to find top 3 users currently logged on

Discussion started by: whyatepies

4. Shell Programming and Scripting

find top 4 users currently logged on can i use grep

Discussion started by: whyatepies

5. Programming

code to find the top of the stack, not able to figure it out

Discussion started by: holla4ni

6. UNIX for Dummies Questions & Answers

find the size of a database by counting all the used pages

Discussion started by: family_guy

7. Shell Programming and Scripting

find top 100 files and move them

Discussion started by: ali560045

8. Cybersecurity

Recursively find and change Permissions on Man pages

Discussion started by: altamaha

9. Shell Programming and Scripting

How to exclude top level directory with find?

Discussion started by: siegfried

10. AIX

How to find the top 6 users (which consume most space)?

Discussion started by: RebelDac