awk: Print fields between two delimiters on separate lines and send to variables

08-10-2012

Registered User

9, 0

Join Date: Aug 2012

Last Activity: 12 December 2012, 8:18 AM EST

Posts: 9

Thanks Given: 1

Thanked 0 Times in 0 Posts

I have email headers that look like the following. In the end I would like to accomplish sending each email address to its own variable, such as:
user1@domain.com='user1@domain.com'
user2@domain.com='user2@domain.com'
user3@domain.com='user3@domain.com'
etc...

I know the sed to get rid of the extra characters but I just need to know how to get all of the email addresses. So basically I need to run the awk between "X-Envelope-To: " and "X-". Then I need to know how to send the printed fields to their own variables as explained above. I've been at this for a while and can't figure it out. Any help is appreciated.

Code:

Return-Path: <>
Delivered-To: spam-quarantine
X-Envelope-From: <spammer@vnyu.com>
X-Envelope-To: <user1@domain.com>, <user2@domain.com>,
	<user3@domain.com>, <user4@domain.com>,
	<user5@domain.com>, <user6@domain.com>
X-Envelope-To-Blocked: <user1@domain.com>, <user2@domain.com>,
	<user3@domain.com>, <user4@domain.com>,
	<user5@domain.com>, <user6@domain.com>
X-Quarantine-ID: <83alzf-jEjhM>
X-Spam-Flag: YES
X-Spam-Score: 14.344
X-Spam-Level: **************

Code:

Return-Path: <>
Delivered-To: spam-quarantine
X-Envelope-From: <spammer@tortasgaby.net>
X-Envelope-To: <user@domain.com>
X-Envelope-To-Blocked: <user@domain.com>
X-Quarantine-ID: <80z4iI_8Fgy2>
X-Spam-Flag: YES
X-Spam-Score: 15.925
X-Spam-Level: ***************

tay9000

08-10-2012

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

This will get the emails. I'm not sure what you mean by 'set to its own variable', since user1@domain.com is not a valid variable name in either shell or awk.

Code:

$  cat env.awk
BEGIN { FS="<" }

/^X-/ { E=0 }   # Any new X- header ends the address block
/^X-Envelope-To:/ { E=1; sub(/^X-Envelope-To:/, ""); }

E {
        gsub(/[ >,\t]+/, ""); # Strip out all junk
        # Print any non-blank emails
        for(N=1; N<=NF; N++) if(length($N)) print $N;
}

$ awk -f env.awk data

user1@domain.com
user2@domain.com
user3@domain.com
user4@domain.com
user5@domain.com
user6@domain.com

$
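A variant sketch, assuming the headers follow the usual folding convention in which continuation lines start with a space or tab. Instead of resetting only on X- headers, it stops collecting at any line that starts a new header. The names sample_headers and extract_to are made up for illustration, and the sample data is copied from the first post.

```shell
# sample_headers: hypothetical stand-in for `zcat filename`
sample_headers() {
cat <<'EOF'
Return-Path: <>
Delivered-To: spam-quarantine
X-Envelope-From: <spammer@vnyu.com>
X-Envelope-To: <user1@domain.com>, <user2@domain.com>,
	<user3@domain.com>, <user4@domain.com>,
	<user5@domain.com>, <user6@domain.com>
X-Envelope-To-Blocked: <user1@domain.com>, <user2@domain.com>,
	<user3@domain.com>, <user4@domain.com>,
	<user5@domain.com>, <user6@domain.com>
X-Quarantine-ID: <83alzf-jEjhM>
X-Spam-Flag: YES
EOF
}

# extract_to: print one X-Envelope-To address per line
extract_to() {
    awk '
        /^[^ \t]/         { in_to = 0 }   # a line not starting with a blank begins a new header
        /^X-Envelope-To:/ { in_to = 1; sub(/^X-Envelope-To:/, "") }
        in_to {
            # split on runs of brackets, commas and blanks; skip empty pieces
            n = split($0, part, /[<>, \t]+/)
            for (i = 1; i <= n; i++) if (part[i] != "") print part[i]
        }
    '
}

result=$(sample_headers | extract_to)
printf '%s\n' "$result"
```

With real data the call would look like zcat filename | extract_to.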

Corona688

08-10-2012

Sorry, I am a bit new to programming, so maybe I don't understand the different environments. I am having trouble with this. I am trying to use awk inside the bash shell on FreeBSD. The "userx" and "domain.com" in my example are of course variable. The source files are in .gz format, so I am using zcat to feed the data to the script. I've never tried using an @ sign in a variable name, so I didn't know that was invalid. But I guess just being able to set the variables like user1='user1@domain.com', user2='user2@domain.com' would work.

I'll try again later but wanted to put that on the table because I think our scripting environments are different. =P

tay9000

08-10-2012

If it doesn't work, please tell me in what way it did not work. Show me exactly what you did, word for word, letter for letter, keystroke for keystroke. I can't see your computer from here.

If nawk doesn't work, try awk.

You can feed it into awk on stdin, like zcat filename | awk -f env.awk

Why do you need each one to be its own variable? How would you even know which variable names to use, if they're always different? Wouldn't you rather put them in one string instead, so you could put it in a loop?

Code:

EMAILS=`zcat email.gz | awk -f env.awk`

for EMAIL in $EMAILS
do
        echo "got email $EMAIL"
done

Or if you have thousands, put it in a while-read loop?

Code:

zcat email.gz | awk -f env.awk | while read EMAIL
do
        echo "Got email $EMAIL"
done

...or save to a temp file so you can use it more than once?

Code:

zcat email.gz | awk -f env.awk > /tmp/$$

while read EMAIL
do
        echo "got email $EMAIL"
done < /tmp/$$

while read EMAIL
do
        echo "got email $EMAIL"
done < /tmp/$$

rm -f /tmp/$$

Or if you really do want variable names:

Code:

N=1
zcat email.gz | awk -f env.awk | while read EMAIL
do
        # Set variables like email_0001, email_0002, ...
        printf "email_%04d='%s'\n" "$N" "$EMAIL"
        N=`expr $N + 1`
done > /tmp/$$
. /tmp/$$
rm -f /tmp/$$

...but I really can't picture how this'd be more useful than the options above.
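One detail behind the temp-file variants: in many shells (bash and dash among them) a while loop on the receiving end of a pipe runs in a subshell, so variables set inside it are gone once the loop ends. Redirecting the loop's input from a file keeps it in the current shell. A minimal sketch, assuming a mktemp that accepts a template (as the GNU and BSD versions do); list_emails is a hypothetical stand-in for the zcat | awk pipeline.

```shell
# list_emails: hypothetical stand-in for `zcat email.gz | awk -f env.awk`
list_emails() {
    printf '%s\n' user1@domain.com user2@domain.com user3@domain.com
}

# mktemp picks an unpredictable name, unlike /tmp/$$
TMP=$(mktemp /tmp/emails.XXXXXX) || exit 1
list_emails > "$TMP"

COUNT=0
LAST=
while read -r EMAIL
do
        # These assignments survive the loop because no pipe is involved
        COUNT=$((COUNT + 1))
        LAST=$EMAIL
done < "$TMP"
rm -f "$TMP"

echo "read $COUNT addresses, last was $LAST"
```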

Corona688

08-10-2012

Below is my entire script. Yes, I am aware it is highly inefficient, but I am a newbie and so far have just been writing very dirty scripts that get the job done. This script is going to go through about 600 files per day. Right now it is just echoing out all of the lines that I need. My idea with the variables was for the script to create a file inside $WorkingDir for each email address it finds, and then write the lines I am echoing into the appropriate file for each address the message is for. I haven't figured out the code to do this yet. I need to figure out how to extract those email addresses first! I did not try again after your last post, but now that I know how to pipe the email content to awk -f env.awk I think I can get further than before. Thank you. At first I was using the $To variable to get the addresses, but then I noticed it was only getting the first line's worth of addresses. And then, after banging my head on the keyboard enough, I came here.

Code:

#!/usr/local/bin/bash
# Variables
SpamDir='/home/tay/spam'
CurrentLine='0'
MaxLines=`ls $SpamDir/*.gz | wc -l`
WorkingDir='/tmp/spam-summary'

cd $SpamDir

# Variable Control
function VariableControl() {
CurrentLine=$(expr $CurrentLine + 1)
Mail=`ls *.gz | head -$CurrentLine | tail -1`
#MailConent=`zgrep $Mail`
From=`zgrep $Mail -e 'X-Envelope-From:' | awk '{print $2}' | sed 's/<//g;s/>//g'`
To=`zgrep $Mail -e 'X-Envelope-To:' | awk 'FS=":" {print $2}' | sed 's/<//g;s/>//g;s/,//g'`
Subject=`zgrep $Mail -e 'Subject:' | awk '{$1=""; print $0}'`
Score=`zgrep $Mail -e 'X-Spam-Score:' | awk '{print $2}'`
TimeEpoch=`ls -lh -D %s $Mail | awk '{print $6}'`; TimeHuman=`date -r $TimeEpoch +"%Y-%m-%d %l:%M %p"`
ID=`ls -lh $Mail | awk '{print $9}'`
RunControl
}

# AddQuery
function AddQuery() {
echo "$From	$Subject	$Score	$TimeHuman	$ID"
VariableControl
}

# Run Control
function RunControl() {
if [ $CurrentLine -gt $MaxLines ]
        then
exit
fi
AddQuery
}

VariableControl

Last edited by tay9000; 08-10-2012 at 02:23 PM..

tay9000

08-10-2012

I'm not sure how writing random variable names into a file is going to help you figure out which variable names to use later, either.

Other problems with this script that I've spotted on first blush include...

Why run ls | head | tail 999 times to pick out 999 filenames? The shell is capable of reading lines one by one with read in a while loop.

Why read lines to get filenames at all? You can just do a loop over *.gz very easily.

Processing single lines with awk is like using an orbiting laser weapon to light a campfire, an awful lot of effort and expense to accomplish something simple. awk is meant to process thousands of lines at a go.

If you find yourself using awk | sed | grep, you might as well just use awk. awk is a power-tool which can accomplish all three in one operation, not a glorified cut.

And all of what you're doing here can be done in a basic shell without externals. Especially useful is set, which can be used to set your $1 $2 ... variables, like so:

Code:

set -- a b c
echo $1 # should print a
echo $2 # should print b

IFS=":." # Will split on any of these characters. Adjacent : or . give empty fields; only whitespace delimiters are merged.
VAR="1:2.3:.4:.:5"
set -- $VAR

echo $1 # Should print 1
echo $2 # should print 2
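A small sketch of how field splitting treats different delimiters, since it affects what ends up in $1, $2 and so on. Each non-whitespace character in IFS ends a field on its own, so two in a row produce an empty field, while runs of blanks count as a single separator. The helper name count_fields is made up for this illustration.

```shell
# count_fields STRING DELIMS: hypothetical helper that reports how many
# fields the shell produces when STRING is split on the characters in DELIMS.
count_fields() {
    (
        IFS=$2      # changed only inside this subshell
        set -f      # no filename expansion while splitting
        set -- $1   # unquoted on purpose: this is where the splitting happens
        echo "$#"
    )
}

colon_dot=$(count_fields "1:2.3:.4:.:5" ":.")
blanks=$(count_fields "a  b   c" " ")
echo "colon/dot: $colon_dot fields, blanks: $blanks fields"
# prints: colon/dot: 8 fields, blanks: 3 fields
```

The same rule applies when splitting an address list on "<>, ": a sequence like ">, <" contains several non-whitespace delimiters, so empty fields show up between the addresses.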

I'd try reducing the script to something like this:

Code:

#!/bin/sh

SpamDir='/home/tay/spam'
WorkingDir='/tmp/spam-summary'

IFS=":"

# Loop on the files directly, instead of doing loops on line numbers
for FILE in ${SpamDir}/*.gz
do
        # Clear out variables
        From=
        To=
        Subject=
        Score=

        # ? What is column 9 on your ls -l ?
        ID=`ls -lh $Mail | awk '{print $9}'`

        # Your time functions look okay
        TimeEpoch=`ls -lh -D %s "$FILE" | awk '{print $6}'`
        TimeHuman=`date -r $TimeEpoch +"%Y-%m-%d %l:%M %p"`

        # Decompress file once instead of 9 times
        zcat "$FILE" > /tmp/$$

        # Read and process lines from the decompressed file one by one
        while read LINE
        do
                IFS=":" # Split on : so $1=X-Envelope-From, $2=<spammer@vnyu.com>
                set -- $LINE
                # If line has a : in it, save the header, then get rid of $1
                if [ "$#" -gt 1 ]
                then
                        HEADER="$1"
                        shift
                fi

                # Split on spaces, commas, and <>
                IFS="<>, "
                # Split <spammer@vnyu.com>, <whatever@...> into $1=spammer@vnyu.com, $2=whatever@..., etc
                set -- $1

                case "$HEADER" in
                X-Envelope-From) From="$From $@" ;;
                X-Envelope-To)     To="$To $@" ;;
                Subject)              Subject="$@" ;;
                X-Spam-Score)     Score="$@" ;;
                esac
        done < /tmp/$$

        echo
        echo "Processed $FILE"
        echo "To:$To"
        echo "From:$From"
        echo "Subject:$Subject"
        echo "Score:$Score"
done

rm -f /tmp/$$
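For comparison, the earlier point about letting awk do the whole job can be sketched as a single pass over the headers. This is only an illustration under some assumptions: headers end at the first blank line, folded lines start with a space or tab, and the names sample_message and summarize are invented here. The Subject line in the sample is an assumed example, and the other headers come from the first post.

```shell
# sample_message: hypothetical stand-in for `zcat "$FILE"`
sample_message() {
cat <<'EOF'
Return-Path: <>
Delivered-To: spam-quarantine
X-Envelope-From: <spammer@vnyu.com>
X-Envelope-To: <user1@domain.com>, <user2@domain.com>,
	<user3@domain.com>, <user4@domain.com>,
	<user5@domain.com>, <user6@domain.com>
X-Envelope-To-Blocked: <user1@domain.com>, <user2@domain.com>,
	<user3@domain.com>, <user4@domain.com>,
	<user5@domain.com>, <user6@domain.com>
X-Spam-Score: 14.344
Subject: Your education information

message body, ignored
EOF
}

# summarize: print From, To, Subject and Score as one tab-separated line
summarize() {
    awk '
    function squeeze(s) { gsub(/[ \t]+/, " ", s); sub(/^ /, "", s); sub(/ $/, "", s); return s }
    /^$/      { exit }                                # blank line ends the headers
    /^[^ \t]/ { name = $0;  sub(/:.*/, "", name)      # header name before the first colon
                value = $0; sub(/^[^:]*:[ \t]*/, "", value) }
    /^[ \t]/  { value = $0 }                          # folded line continues the last header
    name ~ /^X-Envelope-(From|To)$/ { gsub(/[<>,]/, " ", value) }
    name == "X-Envelope-From" { from = from " " value }
    name == "X-Envelope-To"   { to = to " " value }
    name == "Subject"         { subject = subject " " value }
    name == "X-Spam-Score"    { score = value }
    END { printf "%s\t%s\t%s\t%s\n", squeeze(from), squeeze(to), squeeze(subject), score }
    '
}

summary=$(sample_message | summarize)
printf '%s\n' "$summary"
```

Inside the file loop this would be called as zcat "$FILE" | summarize, which decompresses each file once and needs no temp file.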

Corona688

08-10-2012

Thanks a lot for your help. It is about 4x faster than my version, and it makes me feel better about the amount of disk I/O and CPU I am using. I used your script as-is but updated the "ID" variable to use the $FILE variable. The ID is actually just the filename without the path before it, but since the script no longer runs from inside the folder, the full path gets printed. =[

Now I am getting output like so. I want to go back to my old habits and use sed to remove the extra characters but you'd probably want to smack me haha. And now the major challenge is to create a file for each user in the To: fields and redirect the line output to those files so I can email them to the receiver... again thank you for all the help!

Code:

Processed /home/tay/spam/spam-0fWSqXDpwom4.gz
To: <user1@domain1.com<<<user2@domain1.com< 	<user3@domain1.com<<<user4@domain1.com< 	<user5@domain1.com<<<user6@domain2.com< 	<user7@domain2.com
From: <ret@your.schoolsearch.us
Subject:Your<education<information
Score:12.403
ID:/home/tay/spam/spam-0fWSqXDpwom4.gz

Code:

Processed /home/tay/spam/spam-0fycklYG3rfD.gz
To: <user@domain1.com
From: <searchdentalinsurance.net@beastertaps.com
Subject:Find<affordable<dental<insurance
Score:18.222
ID:/home/tay/spam/spam-0fycklYG3rfD.gz

tay9000
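The two open points in this last post, the full path in ID and one file per recipient, could be approached along these lines. This is a sketch rather than a tested solution for the real data: append_summary is a hypothetical helper, the summary lines are placeholders, and a throwaway directory from mktemp -d stands in for /tmp/spam-summary. It assumes the addresses have already been split cleanly into separate words.

```shell
# Stand-in for /tmp/spam-summary so the sketch does not touch real data
WorkingDir=$(mktemp -d /tmp/spam-summary.XXXXXX) || exit 1

# The ID is the filename without its directory; the shell can strip the path itself
FILE='/home/tay/spam/spam-0fWSqXDpwom4.gz'
ID=${FILE##*/}

# append_summary LINE ADDRESS... (hypothetical helper)
# Appends LINE to one file per address inside $WorkingDir
append_summary() {
    line=$1
    shift
    for rcpt in "$@"
    do
        case $rcpt in
        ''|*/*) continue ;;   # skip empty names and anything containing a slash
        esac
        printf '%s\n' "$line" >> "$WorkingDir/$rcpt"
    done
}

# Placeholder summary lines; real ones would carry From, Subject, Score and so on
append_summary "summary line A	$ID" user1@domain1.com user2@domain1.com
append_summary "summary line B	$ID" user1@domain1.com

user1_lines=$(wc -l < "$WorkingDir/user1@domain1.com" | tr -d ' ')
user2_lines=$(wc -l < "$WorkingDir/user2@domain1.com" | tr -d ' ')
echo "user1: $user1_lines line(s), user2: $user2_lines line(s), ID: $ID"

rm -rf "$WorkingDir"
```

Each file then holds every summary line addressed to that recipient and can be mailed out once all the messages have been processed.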
