02-18-2011
Join Date: Oct 2008
Last Activity: 11 November 2015, 6:40 PM EST
Location: Orem, Utah
Posts: 176
Thanks Given: 16
Thanked 5 Times in 5 Posts
"Elements per page"... seeking ideas...
I work for a web hosting company that uses Apache. We like to come up with composite models of what our customers do so that we can tailor our servers to what they need. One question we'd like to answer is, "For a given page downloaded from a customer's virtual server, what is the mean number of elements on that page?" An "element", roughly defined, is a transfer that appears in the Apache log in order to populate a page requested by a visitor. The most common "element" type, of course, is an image.
So, we'd like to have some reasonable way to determine the mean and standard deviation of the number of "elements" per page. It may help that we are just building a general model, so the errors introduced by simplifying assumptions should tend to even out over large numbers of log files. And I should mention that effectively our only source of information about this is the Apache logs from customers' virtual servers.
How would you approach this problem? We're certainly not helped by the fact that Apache logs really weren't designed for this. For that matter, neither was HTTP. Even so, without prejudicing your various brains toward one approach, here's a thought...
We know already that almost all pages served by our servers are transferred in less than 6 seconds. That's the HTML source page (or whatever dynamic page type it may be...) and the elements it calls. So, suppose we were to say that all log entries with a certain client IP address appearing within 6 seconds of each other are likely to be associated with a single customer page request. Then we could just record the IP and associated times and look for "clusters" of 6 seconds or less and count the number of elements in that grouping. But I'm uncertain of how to code that sort of "sliding window". I tend to do these things best in awk, and I'm not seeing how to do that.
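To make the idea concrete, here is a rough sketch of the grouping, written in Python rather than awk only because that was easier to lay out. It rests on several assumptions that may not hold: the logs are in Common or Combined Log Format (client IP first, timestamp in square brackets), a cluster is anchored at its first request and closes once a request from the same IP arrives more than 6 seconds after that start, and the names `cluster_sizes`, `LOG_RE`, and `WINDOW_SECONDS` are made up for illustration.

```python
import re
import statistics
from collections import defaultdict
from datetime import datetime

WINDOW_SECONDS = 6  # assumed upper bound on a page load, taken from the post

# Captures the client IP and the bracketed timestamp of a CLF-style line.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')


def cluster_sizes(lines, window=WINDOW_SECONDS):
    """Return the number of requests in each per-IP cluster.

    A cluster starts at a request and absorbs every later request from the
    same IP that arrives no more than `window` seconds after that start.
    """
    times_by_ip = defaultdict(list)
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that do not look like access-log entries
        ts = datetime.strptime(m.group(2), "%d/%b/%Y:%H:%M:%S %z")
        times_by_ip[m.group(1)].append(ts.timestamp())

    sizes = []
    for times in times_by_ip.values():
        times.sort()
        start = None
        count = 0
        for t in times:
            if start is None or t - start > window:
                if count:
                    sizes.append(count)  # close the previous cluster
                start, count = t, 0      # open a new one at this request
            count += 1
        if count:
            sizes.append(count)
    return sizes


if __name__ == "__main__":
    import sys
    sizes = cluster_sizes(sys.stdin)
    if len(sizes) > 1:
        print("clusters:", len(sizes))
        print("mean requests per cluster:", statistics.mean(sizes))
        print("std dev:", statistics.stdev(sizes))
```

Saved under some name such as `clusters.py` (hypothetical), it would read a log on standard input. Each count includes the page request itself, so subtracting one per cluster would give "elements" if the first hit in a cluster really is the page. The same bookkeeping (a per-IP cluster start time and a running count) should translate to awk associative arrays, provided the log is already in time order.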
Any thoughts? Like I said, that's just one approach, and certainly not necessarily the best. Thanks in advance.