awk: Print fields between two delimiters on separate lines and send to variables


 
# 8  
Old 08-13-2012
Show your existing code.

Also, show your input. I cannot actually test my programs before giving them to you, if I don't have any actual input!

Also, show the output you want.

You posted some data there but didn't explain what was which, what your input was, and what you actually wanted.

With the little you've posted and the script I gave you I get:

Code:
Processed .//data1.gz
To:  user1@domain.com   user2@domain.com   user3@domain.com   user4@domain.com   user5@domain.com   user6@domain.com
From:  spammer@vnyu.com
Subject:
Score:14.344

Processed .//data2.gz
To:  user@domain.com
From:  spammer@tortasgaby.net
Subject:
Score:15.925

You may wish to modify one line in case your data includes tabs:

Code:
IFS="<>, $(printf '\t')" # <, >, comma, one space, and one tab (printf supplies the tab so it survives copy and paste)

Also: did you change $@ into $*? That would cause some of the problems you're seeing. $@ is not a typo; the two behave differently.
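
To make the difference concrete, here is a minimal sketch. The helper count_args and the sample arguments are made up for illustration and are not part of the script in this thread.

```shell
#!/bin/sh
# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

# Two positional parameters, each containing a space.
set -- "user one" "user two"

at_count=$(count_args "$@")    # "$@" keeps every parameter as its own word
star_count=$(count_args "$*")  # "$*" joins all parameters into one word

echo "\"\$@\" passed $at_count argument(s), \"\$*\" passed $star_count"
```

Quoted "$*" joins the parameters with the first character of IFS, so its result also changes whenever the script changes IFS.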

Last edited by Corona688; 08-13-2012 at 01:01 PM..
# 9  
Old 08-13-2012
Slightly updated script which gets rid of some extra spaces:

Code:
#!/bin/sh

SpamDir='./'
WorkingDir='/tmp/spam-summary'

IFS=":"

# Loop on the files directly, instead of doing loops on line numbers
for FILE in ${SpamDir}/*.gz
do
        # Clear out variables
        From=
        To=
        Subject=
        Score=

        # ? What is column 9 on your ls -l ?
        ID=`ls -lh $Mail | awk '{print $9}'`

        # Your time functions look okay
#        TimeEpoch=`ls -lh -D %s "$FILE" | awk '{print $6}'`
#        TimeHuman=`date -r $TimeEpoch +"%Y-%m-%d %l:%M %p"`

        # Decompress file once instead of 9 times
        zcat "$FILE" > /tmp/$$

        # Read and process lines from the decompressed file one by one
        while read LINE
        do
                IFS=":" # Split on : so $1=X-Envelope-From, $2=<spammer@vnyu.com>
                set -- $LINE
                # If line has a : in it, save the header, then get rid of $1
                if [ "$#" -gt 1 ]
                then
                        HEADER="$1"
                        shift
                fi

                # Split on spaces, commas, and <>
                IFS="<>, $(printf '\t')" # <, >, comma, one space, and one tab
                # Split <spammer@vnyu.com>, <whatever@...> into $1=spammer@vnyu.com, $2=whatever@..., etc
                set -- $1
                IFS=" "
                set -- $* # Get rid of extra spaces in input

                case "$HEADER" in
                X-Envelope-From) From="$From $*" ;;
                X-Envelope-To)     To="$To $*" ;;
                Subject)              Subject="$*" ;;
                X-Spam-Score)     Score="$*" ;;
                esac
        done < /tmp/$$

        echo
        echo "Processed $FILE"
        echo "To:$To"
        echo "From:$From"
        echo "Subject:$Subject"
        echo "Score:$Score"
done

rm -f /tmp/$$
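
For anyone following along, this is roughly what the two-stage split in the loop does to a single line. It is a sketch under assumptions: the sample LINE is invented, and OLDIFS, ADDRESSES and the set -f guard are additions that are not in the script above.

```shell
#!/bin/sh
# Hypothetical sample line, modelled on the headers discussed in this thread.
LINE='X-Envelope-To: <user1@domain.com>, <user2@domain.com>'

OLDIFS=$IFS
set -f                      # no filename globbing while variables are unquoted

IFS=":"                     # stage 1: separate the header name from its value
set -- $LINE
HEADER=$1
shift

IFS="<>, $(printf '\t')"    # stage 2: split the value on <, >, comma, space, tab
set -- $1
IFS=" "
set -- $*                   # drop the empty fields left by adjacent delimiters
ADDRESSES="$*"              # join what is left with single spaces

set +f
IFS=$OLDIFS

echo "$HEADER -> $ADDRESSES"
```

Note that a value containing a further colon (a Subject with "Re:" in it, for example) is cut at that colon, because only $1 is kept after the shift.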

# 10  
Old 08-14-2012
Thanks a lot for your help. I've learned a bit. I've basically got the script completely working as below (still need to clean it up a little). The only big problem I have right now is I would like to loop on output from find "/home/tay/spam-all/spam" -iname "*.gz" -mtime 1 | xargs ls -t instead of the entire directory. Keep in mind, I want to loop on them in order by the file creation date with the newest ending up at the top of the file.

I've tried for FILE in `find "/home/tay/spam-all/spam" -iname "*.gz" -mtime 1 | xargs ls -t` and for FILE in $Spams but the script treats the output as one filename and then errors out saying the filename is too long.

The other minor issue is that I need to add <table> and </table> to the beginning and ends of the outputted files. I am thinking about having the script, after it is done writing all of the files, go through each one and add the tags to the beginning and end. I am sure I can figure out some sort of way to do that but haven't gotten there yet.

One last thing! I would like to shorten the $Subject to 70 characters. I will probably end up using sed for that unless you suggest a better way.

Thanks a lot!

Code:
#!/bin/sh

SpamDir='/home/tay/spam-all/spam'
WorkingDir='/tmp/spam-summary'
Spams=`find "/home/tay/spam-all/spam" -iname "*.gz" -mtime 1 | xargs ls -t`

IFS=":"

# Loop on the files directly, instead of doing loops on line numbers
for FILE in ${SpamDir}/*.gz
do
        # Clear out variables
        From=
        To=
        Subject=
        Score=

        # Changed this to get the basename of the file.
        ID=`ls $FILE | xargs -n1 basename`

        # Your time functions look okay
        TimeEpoch=`ls -lh -D %s "$FILE" | awk '{print $6}'`
        TimeHuman=`date -r $TimeEpoch +"%Y-%m-%d %l:%M %p"`

        # Decompress file once instead of 9 times
        zcat "$FILE" > /tmp/$$

        # Read and process lines from the decompressed file one by one
        while read LINE
        do
                IFS=":" # Split on : so $1=X-Envelope-From, $2=<spammer@vnyu.com>
                set -- $LINE
                # If line has a : in it, save the header, then get rid of $1
                if [ "$#" -gt 1 ]
                then
                        HEADER="$1"
                        shift
                fi

                # Split on spaces, commas, and <>
                IFS="<>, "
                # Split <spammer@vnyu.com>, <whatever@...> into $1=spammer@vnyu.com, $2=whatever@..., etc
                set -- $1

                case "$HEADER" in
                X-Envelope-From) From=`echo "$From $@" | sed 's/<//g'`;;
                X-Envelope-To)     To=`echo "$To $@" | sed 's/<//g;s/[	]//g'`;;
                Subject)              Subject=`echo "$Subject $@" | sed 's/<//g'`;;
                X-Spam-Score)     Score="$@" ;;
                esac
        done < /tmp/$$

echo "$From	$Subject	$Score	$TimeHuman	$ID	$To" #Debug Output

set -- $To

for i in $To
do
echo "<tr><td>$From</td><td>$Subject</td><td>$Score</td><td>$TimeHuman</td><td>$ID</td></tr>" >> $WorkingDir/$i
done

done

rm -f /tmp/$$
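
A side note on the ID line: the ls | xargs basename pipeline can probably be replaced by a single expansion. This is only a sketch; the path below is a made-up example in the style of the spam directory, and ID2 exists only to show the equivalent basename call.

```shell
#!/bin/sh
# Hypothetical path in the style of the spam directory used in this thread.
FILE='/home/tay/spam-all/spam/data1.gz'

ID=${FILE##*/}            # strip everything up to and including the last slash
ID2=$(basename "$FILE")   # same result from the basename utility

echo "$ID $ID2"
```

The parameter expansion avoids starting three extra processes per file, which adds up when the directory is large.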

---------- Post updated 2012-08-14 at 02:23 AM ---------- Previous update was 2012-08-13 at 08:49 PM ----------

Okay I got 2/3 down. For the shortening of the subject line, I updated the following line to:
Code:
Subject)              Subject=`echo "$Subject $@" | sed 's/<//g' | cut -c -70`;;
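
Another way to do the truncation, sketched here with a made-up subject line, is the precision field of printf, which avoids the extra cut process. The variable Short is an illustrative name only.

```shell
#!/bin/sh
# Hypothetical subject line, longer than the 70-character limit.
Subject='Re: this is a deliberately long subject line that keeps going well past the limit'

# %.70s prints at most 70 characters of the argument
# (most shells count bytes here, which matters for multibyte text).
Short=$(printf '%.70s' "$Subject")

echo "${#Subject} -> ${#Short}"
```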

To add <table> at the beginning and </table> at the end of each file, the end of the script now looks like this. There are probably better ways to accomplish this, but it works! =P
Code:
        for i in $To
        do
                echo "<tr><td>$From</td><td>$Subject</td><td>$Score</td><td>$TimeHuman</td><td>$ID</td></tr>" >> $WorkingDir/$i
        done

done

        for SUMMARY in $WorkingDir/*.com
        do
                text="<table>"; exec 3<> $SUMMARY && awk -v TEXT="$text" 'BEGIN {print TEXT}{print}' $SUMMARY >&3
                echo '</table>' >> $SUMMARY
        done
rm -f /tmp/$$
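
Reading and writing the same file through exec 3<> is fragile, so a more conventional route is to build the wrapped copy in a second file and rename it. The following is a sketch only: the scratch directory from mktemp -d (not strictly POSIX, but widely available), the sample rows and the .tmp suffix are all assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical summary file holding two rows, in a scratch directory.
WorkingDir=$(mktemp -d)
SUMMARY="$WorkingDir/user@domain.com"
printf '%s\n' '<tr><td>a</td></tr>' '<tr><td>b</td></tr>' > "$SUMMARY"

# Write the wrapped copy next to the original, then replace the original.
{
        echo '<table>'
        cat "$SUMMARY"
        echo '</table>'
} > "$SUMMARY.tmp" && mv "$SUMMARY.tmp" "$SUMMARY"

cat "$SUMMARY"
```

The braces group the three commands so the output file is opened once, and the original is only replaced if the copy was written successfully.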


Last edited by tay9000; 08-14-2012 at 12:59 AM..
# 11  
Old 08-14-2012
Mind incorporating some of the fixes I gave you...? They do make it make less mess...
Quote:
Originally Posted by tay9000
Thanks a lot for your help. I've learned a bit. I've basically got the script completely working as below (still need to clean it up a little). The only big problem I have right now is I would like to loop on output from find "/home/tay/spam-all/spam" -iname "*.gz" -mtime 1 | xargs ls -t instead of the entire directory.

I've tried for FILE in `find "/home/tay/spam-all/spam" -iname "*.gz" -mtime 1 | xargs ls -t` and for FILE in $Spams but the script treats the output as one filename and then errors out saying the filename is too long.
When all you have is a for loop, all problems look like nails, but not everything needs a hammer. There are better options. See Useless Use of Backticks.

Code:
find "/home/tay/spam-all/spam" -iname "*.gz" -mtime 1 | xargs ls -t > /tmp/$$-spams

while read LINE
do
        echo got "$LINE"
done </tmp/$$-spams
rm -f /tmp/$$-spams
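
One likely reason the for loop saw a single huge filename is that the script sets IFS=":" near the top, so the newlines in the backtick output no longer separate words. A small sketch of that effect follows; LIST stands in for the find output and count_args is a made-up helper.

```shell
#!/bin/sh
# Hypothetical two-line listing standing in for the find | xargs ls -t output.
LIST='newest.gz
older.gz'

# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

IFS=":"                             # as set near the top of the script
words_colon=$(count_args $LIST)     # newline is not a separator here

IFS='
'
words_newline=$(count_args $LIST)   # IFS is now a lone newline

unset IFS                           # back to default splitting

# read takes one line per call, however word splitting is configured.
lines=0
while IFS= read -r LINE
do
        lines=$((lines + 1))
done <<EOF
$LIST
EOF

echo "IFS=':' gives $words_colon word(s), newline IFS gives $words_newline, read saw $lines line(s)"
```

The while read form also keeps filenames containing spaces intact, which a for loop over unquoted output does not.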

Quote:
The other minor issue is that I need to add <table> and </table> to the beginning and ends of the outputted files.
Show the input you have and the output you want. A script which doesn't do what you want can't tell me what you do want.

Putting it in a temp file avoids the problem of variables inside the while-loop not being seen in the rest of the script.

Also: You don't need to open a file 500 times to write 500 lines to it.

Code:
for i in $To
do
echo "<tr><td>$From</td><td>$Subject</td><td>$Score</td><td>$TimeHuman</td><td>$ID</td></tr>"
done > $WorkingDir/$i

Your set -- $To appears useless, remove it.
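
One caveat when moving the redirection onto done: the target is expanded once, before the loop starts, so a name that depends on $i there is not the current recipient. A sketch of both cases, using a scratch directory from mktemp -d and made-up recipients and rows:

```shell
#!/bin/sh
# Hypothetical scratch directory and recipient list for illustration.
WorkingDir=$(mktemp -d)
To='user1@domain.com user2@domain.com'

# Fixed target: one redirection on done opens the file once for the whole loop.
for n in 1 2 3
do
        echo "<tr><td>row $n</td></tr>"
done > "$WorkingDir/all-rows"

# Per-recipient target: the name depends on $i, so it has to be written
# inside the loop body, and >> keeps rows added for earlier messages.
for i in $To
do
        echo "<tr><td>row for $i</td></tr>" >> "$WorkingDir/$i"
done

ls "$WorkingDir"
```

So the single-open trick fits loops that write to one known file, while the per-recipient summaries still need the append inside the loop.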

Last edited by Corona688; 08-14-2012 at 12:29 PM..
# 12  
Old 08-14-2012
Quote:
Originally Posted by Corona688
Mind incorporating some of the fixes I gave you...? They do make it make less mess...
Oh, I do plan to take another look at what you've supplied and clean up my script. It's just that I was working on it myself before you posted a response. Geez, I wasn't trying to get you to write my entire script! Although you pretty much have... and I'm thankful for that. :)

I'll post up what I've got when I finish.
Quote:
Also: You don't need to open a file 500 times to write 500 lines to it.
Oh I didn't really realize it was opening up the file 500 times. My train of thought was that I wanted to write each line to the end of the file so I use >> because I guess I thought > might just overwrite the file with the one line it was putting in each time it wrote the line. =P I like the page you linked. I'll read through and try not to do useless things!