mail log parsing script in need of makeover


 
# 1  
Old 05-13-2008

Dear unix forum members,

I'm working on a script that will parse a mail machine's logs and print a list of email addresses in this format:

sender@domain,recipient@domain

The logs look something like this:

06:50:04 0048317AC863: client=localhost.com[127.0.0.1]
06:50:04 0048317AC863: message-id=<user@domain>
06:50:04 0048317AC863: from=<user@domain>,
06:50:04 0048317AC863: to=<user@domain>,
06:50:06 0048317AC863: to=<user@domain>,
06:50:18 0048317AC863: to=<user@domain>,
06:50:18 0048317AC863: to=<user@domain>,
06:50:18 0048317AC863: removed

The "from" and "to" are on different lines, and there is another challenge: the results should be limited to messages that have 5 or fewer recipients.

I thought it would be easy enough, and I wrote a script that first gets a list of the tag numbers (e.g. 0048317AC863) which belong to messages with 5 or fewer recipients:

#!/bin/sh
grep "to=<" /data/log/maillog | grep postfix | grep -vi noqueue | awk '{print $6}' | sort |uniq -c > all_ids

cat all_ids |awk '{print " "$1, $2}' | egrep " 1 | 2 | 3 | 4 | 5 " | cut -f 3 -d " " > ids

Very crude and spaghetti-like... and even worse is the FOR loop that follows, which involves grepping through the entire 4000 MB maillog file 33,000 times in order to print the sender and recipient addresses.

Needless to say, it's not an efficient script; there must be a better way. Please help!! Any responses are appreciated. Maybe someone can just point me in the right direction?
Thanks,
JJ
# 2  
Old 05-13-2008
The script which parses the log file to find the IDs might as well be a little more complex, and extract the final information you want while it's reading through the file anyway. If you're not very familiar with scripting languages, though, that's going to be a bit of a learning curve. I would pick Perl, but there are certainly ample facilities for this sort of processing in awk as well.

In a typical mail log, there's rarely more than a few transactions going on at the same time, so memory requirements are probably not a problem. Just write and flush a transaction as soon as you see it's done, and you should never have more than a few in memory at any one time.

Without a more representative log snippet, it's a bit shaky, but here's a first start.

Code:
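perl -nae 'if ($F[2] eq "removed") { flush($F[1]); next; }
if ($F[2] =~ m/^from=<(.*)>/) { $s{$F[1]} = $1; next; }
if ($F[2] =~ m/^to=<(.*)>/) { push @{$r{$F[1]}}, $1; next; }
sub flush {
  my ($id) = @_;
  # An entry with no recorded recipients yields an empty list
  my @rcpts = @{ $r{$id} || [] };
  # Print sender,recipient pairs without the angle brackets
  if (@rcpts <= 5) { print "$s{$id},$_\n" for @rcpts }
  # delete (rather than undef) so the keys themselves are freed too
  delete $s{$id};
  delete $r{$id};
}' logfile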

This keeps a hash %s of senders and a hash %r of recipient lists (array references), both keyed by queue ID. When a queue entry is removed, it is printed if it has 5 or fewer recipients, and deleted from both hashes. The use of references makes it a bit hard to read, I'm afraid. It would be simpler but also a bit less efficient if there was just one long string for each sender, which then might or might not be printed. (Or, ahem, use Python.)
# 3  
Old 05-13-2008
Thanks for the advice, era!

I was thinking Perl would be the way to go, but you also mention Python. What would be the specific advantage of using Python in this situation?

I'm not very familiar with Python, but I've used Perl a bit and I like it. I'm also in a bit of a hurry to produce this script, so I'm not likely to try and tackle it in Python.

Anyway, your snippet is more than enough and much appreciated!
# 4  
Old 05-13-2008
I haven't used Python much myself, but it does seem more elegant when it comes to handling nested data structures. If you can get by with Perl, then by all means use that.
# 5  
Old 05-14-2008
OK, this is what I have for the Perl script so far. I got into trouble towards the end, and it doesn't like what I'm trying to do with the variable.


#!/usr/local/bin/perl
#use strict;
use locale;
use DBI;
use Cwd ;

my %sender_emails = () ;
my %recipient_emails = () ;
my %recipient_count = () ;

$logfile = '/data/log/maillog';


open(LOG, $logfile);
while (<LOG>)
{
($msgMon, $msgDay, $msgTime, $msgHost, $msgCmd, $QID, $from_to) = split(/\s+/, $_) ;

next if (/from=<>/) ;
next if (/from=<root>/) ;

if (($_ =~ /from=</) && ($_ =~ /qmgr/))
{
($tmpString, $from) = split("from=<", $_);
($from,$tmpString) = split(">", $from);
$sender_emails {$QID} = $from;
}
elsif (($_ =~ /to=</) && ($_ =~ /smtp/))
{
($tmpString, $to) = split("to=<", $_);
($to,$tmpString) = split(">", $to);
$recipient_emails {$QID} = $recipient_emails {$QID} . "$to " ;
$recipient_count {$QID}++ ;
}
}
close(LOG);

foreach $myQID (keys %sender_emails)
{
$myto = $recipient_emails{$myQID} ;
$myfrom = $sender_emails{$myQID} ;
$tocount = $recipient_count{$myQID} ;
next if $tocount >= 6;
foreach $rcpt_group (values %sender_emails)
{
($1, $2, $3, $4, $5) = split(/\s+/, $_);
@rcpt = ("$1", "$2", "$3", "$4", "$5");
{
foreach $rcpt (@rcpt)
{
print $myfrom . "," . $rcpt . \n;
}
}
}
}


This is a working version of the last portion of the script.


foreach $myQID (keys %sender_emails)
{
$myto = $recipient_emails{$myQID} ;
$myfrom = $sender_emails{$myQID} ;
$tocount = $recipient_count{$myQID} ;
next if $tocount >= 6;
{
print $myfrom . "," . $myto . \n;
}
}

The only problem is that for messages with more than one recipient, it prints all the recipients on a single line, like this:

sender@domain,recipient1@domain recipient2@domain etc.

when I ultimately need:

sender@domain,recipient1@domain
sender@domain,recipient2@domain
sender@domain,recipient3@domain
and so on...


This is what the log entries actually look like:

May 14 01:08:38 mail11 postfix/smtpd[86997]: 21F9C17ADDEB: client=domain.com[127.0.0.1]
May 14 01:08:38 mail11 postfix/cleanup[87530]: 21F9C17ADDEB: message-id=<00ec01c8b580$73d85d60$da0ba8c0@domain>
May 14 01:08:38 mail11 postfix/qmgr[9455]: 21F9C17ADDEB: from=<user@domain>, size=18310, nrcpt=3 (queue active)
May 14 01:08:39 mail11 postfix/smtp[86884]: 21F9C17ADDEB: to=<user@domain>, relay=domain [127.0.0.1]:25, delay=1, delays=0.21/0/0.45/0.39, dsn=2.0.0, status=sent (250 ok: Message 149052398 accepted)
May 14 01:08:39 mail11 postfix/smtp[87444]: 21F9C17ADDEB: to=<user@domain>, relay=domain.com[127.0.0.1]:25, delay=1.8, delays=0.21/0/1.1/0.51, dsn=2.0.0, status=sent (250 Ok: queued as E572B24807B)
May 14 01:08:39 mail11 postfix/smtp[87444]: 21F9C17ADDEB: to=<user@domain>, relay=mail.domain.com[127.0.0.1]:25, delay=1.8, delays=0.21/0/1.1/0.51, dsn=2.0.0, status=sent (250 Ok: queued as E572B24807B)
May 14 01:08:39 mail11 postfix/qmgr[9455]: 21F9C17ADDEB: removed


As always any comments, criticisms, and questions are welcome and appreciated.
-JJ
# 6  
Old 05-14-2008
Quote:
Originally Posted by jjamd64
As always any comments, criticisms, and questions are welcome and appreciated.
Please use code tags for legibility. Fortunately I get proper indentation when I quote your message so it's not completely unreadable.

Code:
use locale;
use DBI;
use Cwd ;

You don't seem to be using the DBI stuff or Cwd; also doubtful what the locale is for. The logs don't use locale-dependent formatting, do they?

Code:
foreach $myQID (keys %sender_emails)
{
        $myto = $recipient_emails{$myQID} ;
        $myfrom = $sender_emails{$myQID} ;
        $tocount = $recipient_count{$myQID} ;
        next if $tocount >= 6;
        foreach $rcpt_group (values %sender_emails)
        {
                ($1, $2, $3, $4, $5) = split(/\s+/, $_);
                @rcpt = ("$1", "$2", "$3", "$4", "$5");
                {
                        foreach $rcpt (@rcpt)
                        {
                                print $myfrom . "," . $rcpt . \n;
                        }
                }
        }
}

You can certainly defer all processing until you have read the whole log file, but like I suggested earlier, it might be more memory-efficient to process and forget queue entries as you see them, assuming that "removed" is a good pattern for seeing when you have a full entry.

Assigning to $1, $2 etc. is wrong: those are the regex capture variables, and they are read-only, so the assignment fails at run time. Why don't you assign the result of the split directly to the array @rcpt anyway? If you want to restrict it to just five fields and throw away the rest, you can pass a limit to split, or splice away the remainder after splitting. And I guess you want to be splitting $rcpt_group, not $_; for that matter, the inner loop should go over this message's recipients ($myto), not over values %sender_emails.
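
To make the split suggestion concrete, here is a minimal sketch. The helper name first_recipients and the example.com addresses are made up for illustration, and it assumes the recipients are stored as one whitespace-separated string, as in the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): turn a
# whitespace-separated recipient string into a list of at most $max addresses.
sub first_recipients {
    my ($rcpt_group, $max) = @_;
    # split ' ' skips leading whitespace and drops trailing empty fields,
    # so the trailing space left by the collecting loop is harmless.
    my @rcpt = split ' ', $rcpt_group;
    # Throw away everything from index $max onwards.
    splice(@rcpt, $max) if @rcpt > $max;
    return @rcpt;
}

# Example with invented addresses: keep only the first two recipients.
my @rcpt = first_recipients('a@example.com b@example.com c@example.com ', 2);
print "sender\@example.com,$_\n" for @rcpt;
```

Passing a limit instead, as in split ' ', $rcpt_group, $max + 1, would leave the unsplit remainder in the last element, which then still has to be dropped; the splice variant avoids that extra step.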

Better yet, collect the formatted output already in the initial loop, and just don't print it if the count is too large. I wanted to avoid having two variables and needlessly collect information which was not going to be printed, so I used references to lists instead, but this is certainly a workable solution as well.

Using the string concatenation (dot) operator just to print the result is mildly inefficient; you can pass print a list without first gluing its elements together into a single string.
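
A small sketch of printing a list rather than a concatenated string. The helper emit_pair and the addresses are hypothetical; the filehandle argument is only there so the same code can write to STDOUT or to anything else.

```perl
use strict;
use warnings;

# Hypothetical helper: write one "sender,recipient" line to a filehandle.
sub emit_pair {
    my ($fh, $from, $rcpt) = @_;
    # print accepts a list, so nothing needs to be concatenated first.
    # Note the double quotes around \n: that is what makes it a newline.
    print {$fh} $from, ",", $rcpt, "\n";
}

# Invented addresses, one output line per recipient.
emit_pair(\*STDOUT, 'sender@example.com', 'rcpt1@example.com');
emit_pair(\*STDOUT, 'sender@example.com', 'rcpt2@example.com');
```

This relies on the output separator variables $, and $\ being left at their defaults; if a script sets them, the list form and the concatenated form no longer print the same thing.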

You need to double-quote the \n to make it into a newline; as it is, it's a reference to an undefined symbol (you should have used strict after all!) with a blank print value; that's why you aren't getting any newlines in your output.

To reiterate, that's basically as if you had an identifier UNDEFINED_SYMBOL and used a backslash to produce a reference to it: \UNDEFINED_SYMBOL. Perl is unacceptably forgiving about these things when you don't use strict and use warnings, for legacy reasons -- you should always use strict and use warnings for new scripts (I wrote "nontrivial" scripts, but all scripts appear trivial when you start optimistically working on them). I confess that I too sinned against this -- much against my better conscience.
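
A sketch of how strict catches the unquoted \n. It compiles two snippets in a string eval so the failure can be inspected instead of aborting the script; compile_error is a made-up helper name, and the exact wording of Perl's error message may vary between versions.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a snippet under strict and warnings.
# Returns the error text, or an empty string if the snippet compiled and ran.
sub compile_error {
    my ($code) = @_;
    my $ok = eval "use strict; use warnings; $code; 1";
    return $ok ? '' : $@;
}

# Unquoted \n: a reference to the bareword n, which strict should reject.
my $unquoted = compile_error(q{my $line = "a,b" . \n});
# Quoted "\n": an ordinary newline.
my $quoted   = compile_error(q{my $line = "a,b" . "\n"});

print "unquoted: ", ($unquoted ? "rejected: $unquoted" : "accepted\n");
print "quoted:   ", ($quoted   ? "rejected: $quoted"   : "accepted\n");
```

With strict in place the mistake becomes a compile-time error rather than something that only shows up as odd output.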

Last edited by era; 05-14-2008 at 03:08 AM.. Reason: Elaborate on what unquoted \n means and why to use strict
# 7  
Old 05-14-2008
Here's an attempt at combining our efforts. The test data you posted is still not very useful; the recipients should be unique, so you can see which ones are being printed, and there should be multiple transactions, some with too many recipients, so we can test that too. But this does seem to do ... something.
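
For what it's worth, test data of that kind could be generated with something like the sketch below. Everything in it is synthetic: the helper name fake_transaction, the queue IDs and the example.com addresses are invented, and the line layout is only modelled loosely on the log excerpt posted above.

```perl
use strict;
use warnings;

# Hypothetical generator: build log lines for one queue ID with $n
# distinct recipients, followed by the "removed" line.
sub fake_transaction {
    my ($qid, $n) = @_;
    my $prefix = "May 14 01:08:38 mail11";
    my @lines = ("$prefix postfix/qmgr[9455]: $qid: from=<sender-$qid\@example.com>, size=1000, nrcpt=$n (queue active)");
    # Each recipient address embeds its index and the queue ID, so it is
    # obvious from the output which ones were printed.
    push @lines, "$prefix postfix/smtp[86884]: $qid: to=<rcpt$_-$qid\@example.com>, status=sent"
        for 1 .. $n;
    push @lines, "$prefix postfix/qmgr[9455]: $qid: removed";
    return @lines;
}

# One message under the limit, one exactly at it, one over it.
print "$_\n" for fake_transaction('AAAA00000001', 2),
                 fake_transaction('AAAA00000002', 5),
                 fake_transaction('AAAA00000003', 6);
```

The transactions here are written out one after another; interleaving lines from different queue IDs would be a further test of the per-ID bookkeeping.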

Code:
#!/usr/local/bin/perl

use strict;
use warnings;

my %sender_emails = () ;
my %recipient_emails = () ;
my %recipient_count = () ;

while (<>)
{
    # Do this before split for efficiency
    next if (/from=<>/) ;
    next if (/from=<root>/) ;

    my ($msgMon, $msgDay, $msgTime, $msgHost, $msgCmd, $QID, $from_to)
	= split(/\s+/, $_) ;

    if (/qmgr/ && /from=<([^<>]+)>/)
    {
	my $from = $1;
	$sender_emails {$QID} = $from;
    }
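    elsif (/smtp/ && /to=<([^<>]+)>/)
    {
	my ($to) = $1;
	# Skip recipients whose sender was filtered out above (or never logged)
	next unless defined $sender_emails{$QID};
	# Preformat output the way we want it
	$recipient_emails {$QID} .= $sender_emails{$QID} . "," . $to . "\n";
	$recipient_count {$QID}++;
    }
    elsif (m/: removed/)
    {
	# Guard against entries with no recorded recipients
	if (defined $recipient_emails{$QID} && $recipient_count{$QID} <= 5)
	{
	    print $recipient_emails{$QID};
	}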
	# Flush this key from memory; we're done with it
	delete $sender_emails{$QID};
	delete $recipient_emails{$QID};
	delete $recipient_count{$QID};
    }
}

I took out the hard-coded mail log path; I'd keep it outside the script, just to make it easier to test, or run it on older or just other log files than the main one.