awk: Print fields between two delimiters on separate lines and send to variables

08-13-2012

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

Show your existing code.

Also, show your input. I cannot actually test my programs before giving them to you, if I don't have any actual input!

Also, show the output you want.

You posted some data there but didn't explain which was which, what your input was, or what you actually wanted.

With the little you've posted and the script I gave you I get:

Code:

Processed .//data1.gz
To:  user1@domain.com   user2@domain.com   user3@domain.com   user4@domain.com   user5@domain.com   user6@domain.com
From:  spammer@vnyu.com
Subject:
Score:14.344

Processed .//data2.gz
To:  user@domain.com
From:  spammer@tortasgaby.net
Subject:
Score:15.925

You may wish to modify one line in case your data includes tabs:

Code:

IFS="<>, $(printf '\t')" # One space and one tab; printf supplies the tab so it survives copy and paste

Also: Did you change $@ into $*? That would cause some of the problems you're seeing. $@ is not a typo, they are different.
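To make the difference concrete, here is a minimal sketch. The helper name count_args and the two sample arguments are made up for the illustration; they are not from the script in this thread.

```shell
#!/bin/sh
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { echo "$#"; }

set -- "first arg" "second arg"

star_count=$(count_args "$*")   # "$*" joins all arguments into one word
at_count=$(count_args "$@")     # "$@" keeps each argument as its own word

# "$*" joins with the first character of IFS, which matters when IFS=":"
OLDIFS=$IFS
IFS=":"
joined="$*"
IFS=$OLDIFS

echo "\"\$*\" passed $star_count argument(s)"
echo "\"\$@\" passed $at_count argument(s)"
echo "joined with IFS=':' gives: $joined"
```

So if a script changes IFS, as this one does, "$*" will glue the pieces together with whatever IFS starts with, while "$@" leaves them alone.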

Last edited by Corona688; 08-13-2012 at 01:01 PM..

Corona688

08-13-2012

Slightly updated script which gets rid of some extra spaces:

Code:

#!/bin/sh

SpamDir='./'
WorkingDir='/tmp/spam-summary'

IFS=":"

# Loop on the files directly, instead of doing loops on line numbers
for FILE in ${SpamDir}/*.gz
do
        # Clear out variables
        From=
        To=
        Subject=
        Score=

        # ? What is column 9 on your ls -l ?
        ID=`ls -lh $Mail | awk '{print $9}'`

        # Your time functions look okay
#        TimeEpoch=`ls -lh -D %s "$FILE" | awk '{print $6}'`
#        TimeHuman=`date -r $TimeEpoch +"%Y-%m-%d %l:%M %p"`

        # Decompress file once instead of 9 times
        zcat "$FILE" > /tmp/$$

        # Read and process lines from the decompressed file one by one
        while read LINE
        do
                IFS=":" # Split on : so $1=X-Envelope-From, $2=<spammer@vnyu.com>
                set -- $LINE
                # If line has a : in it, save the header, then get rid of $1
                if [ "$#" -gt 1 ]
                then
                        HEADER="$1"
                        shift
                fi

                # Split on spaces, commas, and <>
                IFS="<>, $(printf '\t')" # One space and one tab; printf supplies the tab
                # Split <spammer@vnyu.com>, <whatever@...> into $1=spammer@vnyu.com, $2=whatever@..., etc
                set -- $1
                IFS=" "
                set -- $* # Get rid of extra spaces in input

                case "$HEADER" in
                X-Envelope-From) From="$From $*" ;;
                X-Envelope-To)     To="$To $*" ;;
                Subject)              Subject="$*" ;;
                X-Spam-Score)     Score="$*" ;;
                esac
        done < /tmp/$$

        echo
        echo "Processed $FILE"
        echo "To:$To"
        echo "From:$From"
        echo "Subject:$Subject"
        echo "Score:$Score"
done

rm -f /tmp/$$
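
The two-pass splitting in the loop can be tried on its own. This is only a sketch: the sample header line is invented for the demonstration, and the variable RECIPIENTS is not in the script above.

```shell
#!/bin/sh
# Invented sample line, not taken from the real spam files.
LINE='X-Envelope-To: <user1@domain.com>, <user2@domain.com>'

OLDIFS=$IFS

IFS=":"            # First pass: header name vs. the rest
set -- $LINE
HEADER=$1
shift

IFS="<>, "         # Second pass: split on brackets, commas and spaces
set -- $1          # This leaves some empty fields behind
IFS=" "
set -- $*          # Re-split on spaces, which drops the empty fields
RECIPIENTS="$*"    # Join what is left with single spaces

IFS=$OLDIFS

echo "$HEADER -> $RECIPIENTS"
```

Without the extra set -- $* step, the empty fields show up as runs of spaces, like the ones in the To: line of the earlier output.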

Corona688

08-14-2012

Registered User

9, 0

Join Date: Aug 2012

Last Activity: 12 December 2012, 8:18 AM EST

Posts: 9

Thanks Given: 1

Thanked 0 Times in 0 Posts

Thanks a lot for your help. I've learned a bit. I've basically got the script completely working as below (still need to clean it up a little). The only big problem I have right now is I would like to loop on output from find "/home/tay/spam-all/spam" -iname "*.gz" -mtime 1 | xargs ls -t instead of the entire directory. Keep in mind, I want to loop on them in order by the file creation date with the newest ending up at the top of the file.

I've tried for FILE in `find "/home/tay/spam-all/spam" -iname "*.gz" -mtime 1 | xargs ls -t` and for FILE in $Spams but the script treats the output as one filename and then errors out saying the filename is too long.

The other minor issue is that I need to add <table> and </table> to the beginning and ends of the outputted files. I am thinking about having the script, after it is done writing all of the files, go through each one and add the tags to the beginning and end. I am sure I can figure out some sort of way to do that but haven't gotten there yet.

One last thing! I would like to shorten the $Subject to 70 characters. I will probably end up using sed for that unless you suggest a better way.

Thanks a lot!

Code:

#!/bin/sh

SpamDir='/home/tay/spam-all/spam'
WorkingDir='/tmp/spam-summary'
Spams=`find "/home/tay/spam-all/spam" -iname "*.gz" -mtime 1 | xargs ls -t`

IFS=":"

# Loop on the files directly, instead of doing loops on line numbers
for FILE in ${SpamDir}/*.gz
do
        # Clear out variables
        From=
        To=
        Subject=
        Score=

        # Changed this to get the basename of the file.
        ID=`ls $FILE | xargs -n1 basename`

        # Your time functions look okay
        TimeEpoch=`ls -lh -D %s "$FILE" | awk '{print $6}'`
        TimeHuman=`date -r $TimeEpoch +"%Y-%m-%d %l:%M %p"`

        # Decompress file once instead of 9 times
        zcat "$FILE" > /tmp/$$

        # Read and process lines from the decompressed file one by one
        while read LINE
        do
                IFS=":" # Split on : so $1=X-Envelope-From, $2=<spammer@vnyu.com>
                set -- $LINE
                # If line has a : in it, save the header, then get rid of $1
                if [ "$#" -gt 1 ]
                then
                        HEADER="$1"
                        shift
                fi

                # Split on spaces, commas, and <>
                IFS="<>, "
                # Split <spammer@vnyu.com>, <whatever@...> into $1=spammer@vnyu.com, $2=whatever@..., etc
                set -- $1

                case "$HEADER" in
                X-Envelope-From) From=`echo "$From $@" | sed 's/<//g'`;;
                X-Envelope-To)     To=`echo "$To $@" | sed 's/<//g;s/[	]//g'`;;
                Subject)              Subject=`echo "$Subject $@" | sed 's/<//g'`;;
                X-Spam-Score)     Score="$@" ;;
                esac
        done < /tmp/$$

echo "$From	$Subject	$Score	$TimeHuman	$ID	$To" #Debug Output

set -- $To

for i in $To
do
echo "<tr><td>$From</td><td>$Subject</td><td>$Score</td><td>$TimeHuman</td><td>$ID</td></tr>" >> $WorkingDir/$i
done

done

rm -f /tmp/$$

---------- Post updated 2012-08-14 at 02:23 AM ---------- Previous update was 2012-08-13 at 08:49 PM ----------

Okay I got 2/3 down. For the shortening of the subject line, I updated the following line to:

Code:

Subject)              Subject=`echo "$Subject $@" | sed 's/<//g' | cut -c -70`;;
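
Another way to cap the length, without the extra cut process, might be printf with a precision. This is a sketch only; the sample subject and the variable name Short are made up, and note that many shells count bytes rather than characters here.

```shell
#!/bin/sh
# Made-up subject line, long enough to need trimming.
Subject='This is a deliberately long subject line that keeps going well past seventy characters in total'

# %.70s prints at most 70 characters of the argument
# (bytes, in many shells, so multi-byte text may be cut short).
Short=$(printf '%.70s' "$Subject")

echo "$Short"
echo "Length is now ${#Short}"
```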

To add <table> at the beginning and </table> at the end of each file, the end of the script now looks like this. There are probably better ways to accomplish this, but it works! =P

Code:

        for i in $To
        do
                echo "<tr><td>$From</td><td>$Subject</td><td>$Score</td><td>$TimeHuman</td><td>$ID</td></tr>" >> $WorkingDir/$i
        done

done

        for SUMMARY in $WorkingDir/*.com
        do
                text="<table>"; exec 3<> $SUMMARY && awk -v TEXT="$text" 'BEGIN {print TEXT}{print}' $SUMMARY >&3
                echo '</table>' >> $SUMMARY
        done
rm -f /tmp/$$
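
Reading and writing the same file through one descriptor is fragile, so a temp file may be a safer way to do the wrapping. The sketch below assumes that approach; wrap_table and the demo file name are hypothetical, not part of the script above.

```shell
#!/bin/sh
# wrap_table is a hypothetical helper: it rewrites a file with
# <table> before its contents and </table> after them.
wrap_table() {
        summary=$1
        tmp="$summary.tmp.$$"
        {
                echo '<table>'
                cat "$summary"
                echo '</table>'
        } > "$tmp" && mv "$tmp" "$summary"
}

# Try it on a throwaway file
demo="/tmp/wrap-demo.$$"
echo '<tr><td>row</td></tr>' > "$demo"
wrap_table "$demo"
result=$(cat "$demo")
rm -f "$demo"

echo "$result"
```

In the real script the call would go inside the for SUMMARY loop, one wrap_table "$SUMMARY" per file.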

Last edited by tay9000; 08-14-2012 at 12:59 AM..

tay9000

08-14-2012

Mind incorporating some of the fixes I gave you...? They do make it make less mess...

Quote:

Originally Posted by tay9000

When all you have is a for loop, all problems look like nails, but not everything needs a hammer. There are better options. See Useless Use of Backticks.

Code:

find "/home/tay/spam-all/spam" -iname "*.gz" -mtime 1 | xargs ls -t > /tmp/$$-spams

while read LINE
do
        echo got "$LINE"
done </tmp/$$-spams
rm -f /tmp/$$-spams
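
As for why the for loop saw one giant filename: a likely cause is the IFS=":" near the top of the script. Unquoted backticks and variables are split on IFS, and with IFS set to a colon, newlines no longer separate anything. A small sketch of the effect, using made-up file names and a hypothetical count_words helper:

```shell
#!/bin/sh
# Made-up names standing in for the output of find | xargs ls -t.
list=$(printf '%s\n' a.gz b.gz c.gz)

# count_words is a hypothetical helper: it prints how many words it was given.
count_words() { echo "$#"; }

OLDIFS=$IFS

IFS=":"
colon_count=$(count_words $list)     # newlines do not split, so one word

IFS=$OLDIFS
default_count=$(count_words $list)   # default IFS splits on newlines

echo "IFS=':' gives $colon_count word(s), default IFS gives $default_count"
```

The while read loop does not depend on word splitting of the whole list, which is one reason it behaves better here.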

Quote:

The other minor issue is that I need to add <table> and </table> to the beginning and ends of the outputted files.

Show the input you have and the output you want. A script which doesn't do what you want can't tell me what you do want.

Putting it in a temp file avoids the problem of variables inside the while-loop not being seen in the rest of the script.
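
A quick sketch of that problem, assuming a shell such as bash or dash that runs the last part of a pipeline in a subshell (some shells, like ksh, do not). The counter names and the demo file are made up.

```shell
#!/bin/sh
printf '%s\n' one two three > /tmp/$$-demo

# Piped into while: in many shells the loop runs in a subshell,
# so changes to the counter can be lost when the loop ends.
piped=0
cat /tmp/$$-demo | while read LINE
do
        piped=$((piped + 1))
done

# Redirected from a file: the loop runs in the current shell,
# so the counter keeps its value afterwards.
redirected=0
while read LINE
do
        redirected=$((redirected + 1))
done < /tmp/$$-demo

rm -f /tmp/$$-demo
echo "piped=$piped redirected=$redirected"
```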

Also: You don't need to open a file 500 times to write 500 lines to it.

Code:

for i in $To
do
echo "<tr><td>$From</td><td>$Subject</td><td>$Score</td><td>$TimeHuman</td><td>$ID</td></tr>"
done > $WorkingDir/$i
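
One thing to keep in mind with this pattern: the file name after done is expanded once, when the loop starts, not on every pass. So it fits the case where every line goes to the same file. If each recipient needs their own file, the redirect has to stay inside the loop, with >> so earlier lines are kept. A minimal sketch of the single-file case, with a made-up output path:

```shell
#!/bin/sh
out="/tmp/$$-rows"     # hypothetical single output file

# The file is opened once for the whole loop, and > is safe here
# because nothing reopens (and truncates) it between iterations.
for n in 1 2 3
do
        echo "row $n"
done > "$out"

result=$(cat "$out")
rm -f "$out"
echo "$result"
```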

Your set -- $To appears useless, remove it.

Last edited by Corona688; 08-14-2012 at 12:29 PM..

Corona688

08-14-2012

Quote:

Originally Posted by Corona688

Mind incorporating some of the fixes I gave you...? They do make it make less mess...

Oh I do plan to take another look at what you've supplied and clean up my script. It's just that I was working on it myself before you posted a response. Geeze, I wasn't trying to get you to write my entire script! Although you pretty much have... and I'm thankful for that.

I'll post up what I've got when I finish.

Quote:

Also: You don't need to open a file 500 times to write 500 lines to it.

Oh I didn't really realize it was opening up the file 500 times. My train of thought was that I wanted to write each line to the end of the file so I use >> because I guess I thought > might just overwrite the file with the one line it was putting in each time it wrote the line. =P I like the page you linked. I'll read through and try not to do useless things!

tay9000
