awk: Print fields between two delimiters on separate lines and send to variables


 
# 1  
Old 08-10-2012

I have email headers that look like the following. In the end I would like to accomplish sending each email address to its own variable, such as:
user1@domain.com='user1@domain.com'
user2@domain.com='user2@domain.com'
user3@domain.com='user3@domain.com'
etc...

I know the sed to get rid of the extra characters but I just need to know how to get all of the email addresses. So basically I need to run the awk between "X-Envelope-To: " and "X-". Then I need to know how to send the printed fields to their own variables as explained above. I've been at this for a while and can't figure it out. Any help is appreciated.
Code:
Return-Path: <>
Delivered-To: spam-quarantine
X-Envelope-From: <spammer@vnyu.com>
X-Envelope-To: <user1@domain.com>, <user2@domain.com>,
	<user3@domain.com>, <user4@domain.com>,
	<user5@domain.com>, <user6@domain.com>
X-Envelope-To-Blocked: <user1@domain.com>, <user2@domain.com>,
	<user3@domain.com>, <user4@domain.com>,
	<user5@domain.com>, <user6@domain.com>
X-Quarantine-ID: <83alzf-jEjhM>
X-Spam-Flag: YES
X-Spam-Score: 14.344
X-Spam-Level: **************

Code:
Return-Path: <>
Delivered-To: spam-quarantine
X-Envelope-From: <spammer@tortasgaby.net>
X-Envelope-To: <user@domain.com>
X-Envelope-To-Blocked: <user@domain.com>
X-Quarantine-ID: <80z4iI_8Fgy2>
X-Spam-Flag: YES
X-Spam-Score: 15.925
X-Spam-Level: ***************

# 2  
Old 08-10-2012
This will get the emails. I'm not sure what you mean by 'set to its own variable', since user1@domain.com is not a valid variable name in either shell or awk.

Code:
$  cat env.awk
BEGIN { FS="<" }

# Any X- header ends the block; X-Envelope-To: itself (re)starts it.
/^X-/ { E=0 }    /^X-Envelope-To:/ { E=1; sub(/^X-Envelope-To:/, ""); }

E {
        gsub(/[ >,\t]+/, ""); # Strip out all junk
        # Print any non-blank emails
        for(N=1; N<=NF; N++) if(length($N)) print $N;
}

$ awk -f env.awk data

user1@domain.com
user2@domain.com
user3@domain.com
user4@domain.com
user5@domain.com
user6@domain.com

$

# 3  
Old 08-10-2012
Sorry, I am a bit new to programming, so maybe I don't understand the different environments. I am having trouble with this. I am trying to use awk inside of the bash shell on FreeBSD. The "userx" and "domain.com" in my example are of course variable. The source files are in .gz format, so I am using zcat to feed the data to the script. I've never tried using an @ sign in a variable name, so I didn't know that was invalid. But I guess just being able to set the variables like user1='user1@domain.com', user2='user2@domain.com' would work.

I'll try again later but wanted to put that on the table because I think our scripting environments are different. =P
# 4  
Old 08-10-2012
If it doesn't work, please tell me in what way it did not work. Show me exactly what you did, word for word, letter for letter, keystroke for keystroke. I can't see your computer from here.

If nawk doesn't work, try awk.

You can feed it into awk on stdin, like zcat filename | awk -f env.awk

Why do you need each one to be its own variable? How would you even know which variable names to use, if they're always different? Wouldn't you rather put them in one string instead, so you could put it in a loop?

Code:
EMAILS=`zcat email.gz | awk -f env.awk`

for EMAIL in $EMAILS
do
        echo "got email $EMAIL"
done

Or if you have thousands, put it in a while-read loop?

Code:
zcat email.gz | awk -f env.awk | while read EMAIL
do
        echo "Got email $EMAIL"
done

...or save to a temp file so you can use it more than once?

Code:
zcat email.gz | awk -f env.awk > /tmp/$$

while read EMAIL
do
        echo "got email $EMAIL"
done < /tmp/$$

while read EMAIL
do
        echo "got email $EMAIL"
done < /tmp/$$

rm -f /tmp/$$

Or if you really do want variable names:

Code:
N=1
zcat email.gz | awk -f env.awk | while read EMAIL
do
        # Set variables like email_0001, email_0002, ...
        printf "email_%04d='%s'\n" "$N" "$EMAIL"
        N=`expr $N + 1`
done > /tmp/$$
. /tmp/$$
rm -f /tmp/$$

...but I really can't picture how this'd be more useful than the options above.
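
For what it's worth, numbered values are also available without a temp file: set -- loads the addresses into $1, $2, and so on. The following is only a minimal sketch. It assumes the addresses contain no whitespace or glob characters, and list_emails is a hypothetical stand-in for zcat email.gz | awk -f env.awk.

```shell
# Hypothetical stand-in for: zcat email.gz | awk -f env.awk
list_emails() {
        printf '%s\n' user1@domain.com user2@domain.com user3@domain.com
}

# Unquoted on purpose: the shell splits the output on whitespace,
# so each address becomes one positional parameter.
set -- $(list_emails)

echo "count: $#"
echo "first: $1"
echo "third: $3"

# "$@" still works for looping over all of them.
for EMAIL in "$@"
do
        echo "got email $EMAIL"
done
```

The count is in $#, so nothing has to guess how many names exist.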
# 5  
Old 08-10-2012
Below is my entire script. Yes, I am aware it is highly inefficient, but I am a newbie and so far have just been writing very dirty scripts that get the job done. This script is going to go through about 600 files per day. Right now it is just echoing out all of the lines that I need. My idea with the variables was for the script to create a file inside of $WorkingDir for each email address it finds, and then write the lines I am echoing into the appropriate file for each address the message is for. I haven't figured out the code to do this yet; I need to figure out how to extract those email addresses first! I have not tried again since your last post, but now that I know how to pipe the email content to awk -f env.awk I think I can get further than before. Thank you. At first I was using the $To variable to get the addresses, but then I noticed it was only getting the first line's worth of addresses. After banging my head on the keyboard enough, I came here.
Code:
#!/usr/local/bin/bash
# Variables
SpamDir='/home/tay/spam'
CurrentLine='0'
MaxLines=`ls $SpamDir/*.gz | wc -l`
WorkingDir='/tmp/spam-summary'

cd $SpamDir

# Variable Control
function VariableControl() {
CurrentLine=$(expr $CurrentLine + 1)
Mail=`ls *.gz | head -$CurrentLine | tail -1`
#MailConent=`zgrep $Mail`
From=`zgrep $Mail -e 'X-Envelope-From:' | awk '{print $2}' | sed 's/<//g;s/>//g'`
To=`zgrep $Mail -e 'X-Envelope-To:' | awk 'FS=":" {print $2}' | sed 's/<//g;s/>//g;s/,//g'`
Subject=`zgrep $Mail -e 'Subject:' | awk '{$1=""; print $0}'`
Score=`zgrep $Mail -e 'X-Spam-Score:' | awk '{print $2}'`
TimeEpoch=`ls -lh -D %s $Mail | awk '{print $6}'`; TimeHuman=`date -r $TimeEpoch +"%Y-%m-%d %l:%M %p"`
ID=`ls -lh $Mail | awk '{print $9}'`
RunControl
}

# AddQuery
function AddQuery() {
echo "$From	$Subject	$Score	$TimeHuman	$ID"
VariableControl
}

# Run Control
function RunControl() {
if [ $CurrentLine -gt $MaxLines ]
        then
exit
fi
AddQuery
}

VariableControl


Last edited by tay9000; 08-10-2012 at 02:23 PM..
# 6  
Old 08-10-2012
I'm not sure how writing random variable names into a file is going to help you figure out which variable names to use later, either.

Other problems with this script that I've spotted on first blush include...

Why run zcat | head 999 times, to read 999 lines? The shell is capable of reading lines one by one with read in a while loop.

Why read lines to get filenames? You can loop over *.gz directly, very easily.

Processing single lines with awk is like using an orbiting laser weapon to light a campfire, an awful lot of effort and expense to accomplish something simple. awk is meant to process thousands of lines at a go.

If you find yourself using awk | sed | grep, you might as well just use awk. awk is a power-tool which can accomplish all three in one operation, not a glorified cut.
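
As a rough illustration of that last point, here is a sketch of one awk pass doing the work of the zgrep | awk | sed chains for two of the fields. The sample_header function is a made-up stand-in for zcat "$FILE", and the variable names inside awk are only for illustration.

```shell
# Sample header; in the real script this would come from: zcat "$FILE"
sample_header() {
        cat <<'EOF'
Return-Path: <>
X-Envelope-From: <spammer@vnyu.com>
X-Spam-Flag: YES
X-Spam-Score: 14.344
EOF
}

# One awk pass matches the headers, strips the <> and prints both values.
RESULT=$(sample_header | awk '
        /^X-Envelope-From:/ { gsub(/[<>]/, "", $2); from = $2 }
        /^X-Spam-Score:/    { score = $2 }
        END                 { print from, score }
')

echo "$RESULT"
```

The file is read once, and no sed is needed because gsub does the cleanup in the same process.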

And all of what you're doing here can be done in a basic shell without externals. Especially useful is set, which can be used to set your $1 $2 ... variables, like so:

Code:
set -- a b c
echo $1 # should print a
echo $2 # should print b

IFS=":." # Will split on any of these characters (adjacent delimiters give empty fields).
VAR="1:2.3:.4:.:5"
set -- $VAR

echo $1 # Should print 1
echo $2 # should print 2
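
One caveat worth a quick sketch: only whitespace characters in IFS are collapsed into a single separator. Every non-whitespace delimiter ends a field by itself, so two in a row leave an empty field behind. The OLDIFS, CLEAN and F names below are just for illustration.

```shell
OLDIFS=$IFS
IFS=":."
VAR="1:2.3:.4:.:5"
set -- $VAR
IFS=$OLDIFS             # restore, so later commands split normally

echo "fields: $#"       # 8 rather than 5, because of the empty fields
echo "fourth: [$4]"     # the empty field between ':' and '.'

# Dropping the empty fields afterwards:
CLEAN=
for F in "$@"
do
        [ -n "$F" ] && CLEAN="$CLEAN $F"
done
echo "clean:$CLEAN"
```

Restoring IFS right after the split keeps the odd setting from leaking into read, for loops and anything else further down.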

I'd try reducing the script to something like this:

Code:
#!/bin/sh

SpamDir='/home/tay/spam'
WorkingDir='/tmp/spam-summary'

IFS=":"

# Loop on the files directly, instead of doing loops on line numbers
for FILE in ${SpamDir}/*.gz
do
        # Clear out variables
        From=
        To=
        Subject=
        Score=

        # ? What is column 9 on your ls -l ?
        ID=`ls -lh $Mail | awk '{print $9}'`

        # Your time functions look okay
        TimeEpoch=`ls -lh -D %s "$FILE" | awk '{print $6}'`
        TimeHuman=`date -r $TimeEpoch +"%Y-%m-%d %l:%M %p"`

        # Decompress file once instead of 9 times
        zcat "$FILE" > /tmp/$$

        # Read and process lines from the decompressed file one by one
        while read LINE
        do
                IFS=":" # Split on : so $1=X-Envelope-From, $2=<spammer@vnyu.com>
                set -- $LINE
                # If line has a : in it, save the header, then get rid of $1
                if [ "$#" -gt 1 ]
                then
                        HEADER="$1"
                        shift
                fi

                # Split on spaces, commas, and <>
                IFS="<>, "
                # Split <spammer@vnyu.com>, <whatever@...> into $1=spammer@vnyu.com, $2=whatever@..., etc
                set -- $1

                case "$HEADER" in
                X-Envelope-From) From="$From $@" ;;
                X-Envelope-To)     To="$To $@" ;;
                Subject)              Subject="$@" ;;
                X-Spam-Score)     Score="$@" ;;
                esac
        done < /tmp/$$

        echo
        echo "Processed $FILE"
        echo "To:$To"
        echo "From:$From"
        echo "Subject:$Subject"
        echo "Score:$Score"
done

rm -f /tmp/$$

This User Gave Thanks to Corona688 For This Post:
# 7  
Old 08-10-2012
Thanks a lot for your help. It is about 4x faster than my version. And makes me feel better about the amount of disk i/o and processor I am using. I used your script as-is but updated the "ID" variable to use the $FILE variable. The ID is actually just the filename without the path before it. But now since the script is no longer working inside of the folder, the full path gets printed. =[
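
For the path problem, a small sketch using plain parameter expansion, so no ls, awk or basename call is needed. The file name is the one from the output in this post; NAME is a hypothetical extra variable.

```shell
FILE='/home/tay/spam/spam-0fWSqXDpwom4.gz'

ID=${FILE##*/}      # strip everything up to and including the last slash
NAME=${ID%.gz}      # optionally drop the .gz suffix as well

echo "ID:$ID"
echo "NAME:$NAME"
```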

Now I am getting output like so. I want to go back to my old habits and use sed to remove the extra characters but you'd probably want to smack me haha. And now the major challenge is to create a file for each user in the To: fields and redirect the line output to those files so I can email them to the receiver... again thank you for all the help!
Code:
Processed /home/tay/spam/spam-0fWSqXDpwom4.gz
To: <user1@domain1.com<<<user2@domain1.com< 	<user3@domain1.com<<<user4@domain1.com< 	<user5@domain1.com<<<user6@domain2.com< 	<user7@domain2.com
From: <ret@your.schoolsearch.us
Subject:Your<education<information
Score:12.403
ID:/home/tay/spam/spam-0fWSqXDpwom4.gz

Code:
Processed /home/tay/spam/spam-0fycklYG3rfD.gz
To: <user@domain1.com
From: <searchdentalinsurance.net@beastertaps.com
Subject:Find<affordable<dental<insurance
Score:18.222
ID:/home/tay/spam/spam-0fycklYG3rfD.gz
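
The stray < characters look like empty fields being joined with the first character of IFS while it was still set to "<>, ", though that is a guess from the output alone. Below is a rough sketch of cleaning such a value and appending one summary line per recipient. It assumes addresses never contain a slash, it uses a scratch directory from mktemp instead of the real $WorkingDir, and RAW_TO, LINE and RCPT are made-up names.

```shell
# Raw value shaped like the To: line above.
RAW_TO=' <user1@domain1.com<<<user2@domain1.com< <user3@domain1.com'
# Stand-in for the summary line the script echoes for each message.
LINE='ret@your.schoolsearch.us 12.403 spam-0fWSqXDpwom4.gz'

# Scratch directory so the sketch does not touch real data.
WorkingDir=$(mktemp -d) || exit 1

# Turn every delimiter into a newline, then drop the empty lines.
printf '%s\n' "$RAW_TO" | tr '<>, \t' '\n\n\n\n\n' | grep -v '^$' |
while read -r RCPT
do
        # One file per recipient, appended to for every message.
        printf '%s\n' "$LINE" >> "$WorkingDir/$RCPT"
done

ls "$WorkingDir"
```

Because the files are appended to, running this over every message leaves each recipient with a complete list that can be mailed out afterwards.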
