Sed Comparing Parenthesized Values In Previous Line To Current Line

10-01-2012


I am trying to delete lines in archived Apache httpd logs.

Each line has the pattern:

<ip-address> - - <date-time> <document-request-URL> <http-response> <size-of-req'd-doc> <referring-document-URL>

This pattern is shown in the example of 6 lines from the log in the code box below. These 6 lines are consecutive, and all start with the same IP address. They are actually the first 6 of about 20-25 lines in which a document was served with multiple gif images.

My purpose is to delete lines under the following conditions:

First, determine whether a line in the log becomes a "reference line".
This happens if the line being tested:

has a different IP address from the reference line, or there is no reference line to evaluate against yet (uninitialized reference line)
has the same IP address as the reference line, but the referring document is not from my web site
has the same IP address as the reference line and the requested (GET) document is from my web site, but it is a different HTML document
has the same IP address as the current reference line, but the requested document is logged more than 60 seconds after the requested document in the current reference line

Note that a "subsequent line" is one that does not qualify to become a "reference line."

I have only one sed command line now, basically the regular expression that corresponds to the pattern identifying the line and the parenthesized expressions/fields in it. To be safe, I am using old-style (basic) regular expression syntax and not any "extended" kind, nor shortcuts such as the `\d` metacharacter for digits.

Code:

/\(\([0-9]\{1,3\}\.\)\{3\}[0-9]\{1,3\}\).*\([0-9]\{2\}\/[a-zA-Z]\{3\}\/[0-9]\{4\}:[0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}\) [-+][0-9]\{4\}] \"GET \(.*\) HTTP\/1\.1\" [0-9]\{3\} [0-9]\{1,\} \"http:\/\/my\.website\.org\//

I have wrapped three fields in parentheses: the IP address, the date, and the requested document (between GET and HTTP/1.1).
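As a rough illustration of how those three fields might be pulled out with backreferences, here is a small sketch in basic (old-style) syntax. The shortened sample line, the `fields` variable and the `|` separator are assumptions made for the example. Because the IP group contains a nested group for the octets, the nested group takes `\2`, so the date and the requested document end up in `\3` and `\4`.

```shell
# Sample log line with the user-agent string shortened for readability.
line='172.16.77.182 - - [18/Sep/2012:20:48:17 +0300] "GET /reference/style/std.css HTTP/1.1" 200 5429 "http://my.website.org/reference/histo.html" "Mozilla/5.0"'

# -n together with the p flag prints only lines on which the substitution succeeds.
# \1 = IP address, \2 = nested octet group (unused), \3 = date/time, \4 = requested document.
fields=$(printf '%s\n' "$line" | sed -n 's/^\(\([0-9]\{1,3\}\.\)\{3\}[0-9]\{1,3\}\) .*\[\([^ ]*\) [^]]*] "GET \(.*\) HTTP\/1\.1" [0-9]\{3\} [0-9]\{1,\} "http:\/\/my\.website\.org\/.*/\1|\3|\4/p')

printf '%s\n' "$fields"
```

With the sample line above this prints the IP address, the timestamp without the zone offset, and the requested path, separated by `|`.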

This becomes the line to be tested, using the tests described above: (1) check for an IP address difference, (2) check for a time difference, (3) examine the requested document field for file types (gif | ico | css | js | png | jpg | jpeg | etc.); basically, if they are not HTML, they get deleted, (4) make sure the hostname/server name in the referring document is 'http://my.website.org/'.

I am thinking that I need to use the hold space and the exchange (x) command, but am not sure how to go about that. More importantly, I must do comparisons on the text: converting date/time expressions into integers to be compared, and doing string comparisons as well. The sed utility, as far as I know, has no built-in features for arithmetic, so I may have to pass these values as parameters to a shell (?) to do the comparisons and return a result that sed can work with.
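For the date/time part, one possible sketch hands the arithmetic to awk rather than sed. It assumes a POSIX shell and awk are available, that timestamps look like the `18/Sep/2012:20:48:16` field in the log, and that every entry uses the same zone offset (the offset is ignored). The function name `seconds_between` is made up for this example.

```shell
# Print the number of seconds from timestamp $1 to timestamp $2.
seconds_between() {
  printf '%s\n%s\n' "$1" "$2" | awk -F'[/:]' '
    BEGIN {
      # Map month abbreviations to numbers 1..12.
      n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
      for (i = 1; i <= n; i++) month[name[i]] = i
    }
    {
      d = $1 + 0; m = month[$2]; y = $3 + 0
      # Treat January and February as months 13 and 14 of the previous year.
      if (m <= 2) { y--; m += 12 }
      # Day count on the Gregorian calendar, up to a constant offset.
      days = 365 * y + int(y / 4) - int(y / 100) + int(y / 400) + int((153 * (m - 3) + 2) / 5) + d
      t[NR] = days * 86400 + $4 * 3600 + $5 * 60 + $6
    }
    END { print t[2] - t[1] }'
}

gap=$(seconds_between '18/Sep/2012:20:48:16' '18/Sep/2012:20:48:19')
if [ "$gap" -gt 60 ]; then
  echo "more than 60 seconds apart"
else
  echo "within 60 seconds"
fi
```

Only the difference is printed, so the large absolute day counts never have to be formatted. The day-count formula also handles gaps that cross midnight or a month boundary.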

I have even more of a challenge too: see the NB below.

What I need is a good pointer or reference to what I should be telling sed to do, aside from just being given the answer. Thanks.

Code:

[line lengths broken up to avoid an annoying presentation]

172.16.77.182 - - [18/Sep/2012:20:48:16 +0300] "GET /reference/imagesHistoHTML/ethyl-eosin.gif HTTP/1.1" 200 3300 "http://www.google.com/imgres?hl=en&sa=X&rlz=1C1CHKZ_enUS433US433&biw=1280&
   bih=670&tbm=isch&prmd=imvns&tbnid=fEILNrTkTl2MzM:&imgrefurl=http://my.website.org/reference/histo.html&docid=r1tvojxQaLyVFM&imgurl=http://my.website.org/reference/imagesHistoHTML/ethyl-eosin.gif
   &w=366&h=265&ei=ybNYUPD5CNSO0QHoqYG4BQ&zoom=1&iact=hc&vpx=635&vpy=80&dur=2959&hovh=191&hovw=264&tx=168&ty=103&sig=114532110125230912831&page=1&tbnh=120&tbnw=166&start=0&
   ndsp=18&ved=1t:429,r:15,s:0,i:123" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1"
172.16.77.182 - - [18/Sep/2012:20:48:17 +0300] "GET /reference/style/std.css HTTP/1.1" 200 5429 "http://my.website.org/reference/histo.html" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 
   (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1"
172.16.77.182 - - [18/Sep/2012:20:48:17 +0300] "GET /style/std.css HTTP/1.1" 200 5429 "http://my.website.org/reference/histo.html" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) 
    Chrome/21.0.1180.89 Safari/537.1"
172.16.77.182 - - [18/Sep/2012:20:48:16 +0300] "GET /reference/histo.html HTTP/1.1" 200 118818 "http://www.google.com/imgres?hl=en&sa=X&rlz=1C1CHKZ_enUS433US433&biw=1280&bih=670&tbm=isch&
     prmd=imvns&tbnid=fEILNrTkTl2MzM:&imgrefurl=http://my.website.org/reference/histo.html&docid=r1tvojxQaLyVFM&imgurl=http://my.website.org/reference/imagesHistoHTML/ethyl-eosin.gif
    &w=366&h=265&ei=ybNYUPD5CNSO0QHoqYG4BQ&zoom=1&iact=hc&vpx=635&vpy=80&dur=2959&hovh=191&hovw=264&tx=168&ty=103&sig=114532110125230912831&page=1&tbnh=120
    &tbnw=166&start=0&ndsp=18&ved=1t:429,r:15,s:0,i:123" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1"
172.16.77.182 - - [18/Sep/2012:20:48:18 +0300] "GET /reference/imagesHistoHTML/dichlorotriazinyl.gif HTTP/1.1" 200 1406 "http://my.website.org/reference/histo.html" "Mozilla/5.0 (Windows NT 6.1; WOW64) 
    AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1"
172.16.77.182 - - [18/Sep/2012:20:48:19 +0300] "GET /reference/imagesHistoHTML/nitroso%20dye%20structure.gif HTTP/1.1" 200 2259 "http://my.website.org/reference/histo.html" "Mozilla/5.0
    (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1"

(To respect the privacy of those accessing the server, I changed the IP address to a recognized private LAN address [I hope].)

NB: I am running this sed script as sed.exe (GNU sed version 4.2.1 (c) 2009) under Microsoft Windows 7, so any solution that requires a shell should use a command processor that is already installed (MS cmd version 6..7601) or installable within Windows 7. I am aware that I can process the text of the logs within a VM running a Linux distro (I have, for instance, Ubuntu and TinyCore Linux installed as VMs), but (1) I have not given up the MS Windows environment as my everyday system and (2) I last did any bash scripting more than a decade ago.

Last edited by Proteomist; 10-01-2012 at 03:03 AM.. Reason: break up code-boxed line lengths

Proteomist


10-04-2012


Sed scripts that handle multiple lines usually have a different flavor -- I like to call them loopers.

You add more lines using N. Often, the only line not read with N is the first! The behavior of N at $ (eof) was buggy in some early versions, so I test for that before the N.
Then you can write regexes that span or hook onto the '\n' between lines, which '.' also still matches.
Using :labels and t or b branching, you can pile up lines in the buffer to your heart's content (or your old sed version's fixed buffer size).
You can use P to spit out just the first line.
With s and \( ... \) and \1 \2 ... you can swap lines around.
Not much use for D, since you start over.
The '\n' does not seem to be something you can put in [ ... ].
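To tie this back to the original question, here is a minimal sketch of a looper that compares a parenthesized value on the previous line with the current line. It handles only the "same IP address" test, not the time or referrer tests. The sample input, with its first field standing in for the IP address, is made up for the example.

```shell
# Hypothetical sample input: the first field stands in for the IP address.
input='10.0.0.1 a
10.0.0.1 b
10.0.0.2 c
10.0.0.2 d
10.0.0.1 e'

result=$(printf '%s\n' "$input" | sed '
  :loop
  $!{
    # Append the next line so both lines sit in the pattern space.
    N
    # If the second line starts with the same first field (\1), drop it.
    s/^\([^ ]*\) \(.*\)\n\1 .*$/\1 \2/
    t loop
    # Different first field: print the first line, then restart with the second.
    P
    D
  }
')

printf '%s\n' "$result"
```

The backreference `\1` after the `\n` is what does the string comparison: the substitution only succeeds when the second line begins with exactly the text captured from the first. Here D is useful after all, because restarting the script with the remaining line is exactly what is wanted.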

My sed to remove extra blank lines in a row:

Code:

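sed '
  :loop
  # On the last line there is nothing left to append: print and quit.
  $b
  # Append the next input line to the pattern space.
  N
  # Two blank lines in a row: collapse them into one and keep looking.
  s/^\n$//
  t loop
  # Otherwise print the first line, drop it, and continue with the second.
  P
  s/.*\n//
  t loop
 '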

DGPickett

