text manipulation and pattern matching


 
# 1  
Old 07-24-2008
text manipulation and pattern matching

Hi guys,

I need help:
I started receiving automatic emails containing download information. The problem is that these emails arrive in a rich format (I have no control over this), so the important information is buried under a bunch of mumbo-jumbo. To complicate things even further, I need to automate the download process too, so I need to somehow identify and extract the exact path to the file and forward it for further processing.

the relevant part of the email looks something like this:

more_blah_before
style=3D"font-size: 11px; margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: =
0px; padding-bottom: 0px; padding-left: 0px; ">Software</td><td =
style=3D"font-size: 11px; margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: =
0px; padding-bottom: 0px; padding-left: 0px; "><a =
href=3D"afp://server.company.com/del/e/QQ888-9999/Q=
Q888-9999-3/QQ888-9999-3.dmg">del/QQ888-9999/QQ888-9999-3</a></td=
></tr><tr style=3D"vertical-align: top; margin-top: 0px; margin-right: =
0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; =
padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "><td =
style=3D"font-size: 11px; margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: =
more_blah_after

so the part that I need to extract from here is
afp://server.company.com/del/e/QQ888-9999/QQ888-9999-3/QQ888-9999-3.dmg

the problem is that the path to the file is split with "=" so that would have to be removed somehow (if present)

also I am not sure how to remove anything present before afp:// (like href=3D" in this case) or anything present after .dmg (">del/QQ888-9999/QQ888-9999-3</a></td= in this case)

any help would be appreciated

thank you
# 2  
Old 07-24-2008
This should do the job:
Code:
tr -d '\n' < file | sed 's/^.*"afp/"afp/;s/>.*$//'
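
Run against the sample above, that one-liner appears to leave the surrounding double quotes and the "=" from the wrapped line in its output. The trailing "=" is a quoted-printable soft line break (and "=3D" is an encoded "="), so a variant that drops it before joining the lines could look like the sketch below. The function name extract_afp is made up for illustration, and the sketch assumes the message contains a single afp:// link (with several, it keeps only the last one).

```shell
# Hypothetical helper: read a quoted-printable HTML mail body on stdin
# and print the afp:// URL it contains (the last one if there are several).
extract_afp() {
    tr -d '\r' |                          # tolerate CRLF line endings
    sed 's/=$//' |                        # drop the "=" soft line break at end of line
    tr -d '\n' |                          # join everything into one long line
    sed 's/^.*"afp:/afp:/; s/".*$//'      # cut up to afp:, then from the closing quote on
}
```

Usage would be something like extract_afp < file, where file is the saved message.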

# 3  
Old 07-24-2008
wow. you're amazing. thank you!

to expand on this, most of the time I would get an email with not one, but two files to download (and two to avoid).
would you mind suggesting a loop that would extract both afp links?

for example:
afps to get:

afp://MYserver.company.com/del/e/QQ888-9999/QQ888-9999-3/QQ888-9999-3.dmg
and
afp://MYserver.company.com/del/e/QQ666-7777/QQ666-7777-3/QQ666-7777-3.dmg

both buried in the rich formatting nonsense.

to make things a bit more complicated, the email would also contain a couple of afp links to a different server that would need to be skipped

for example

afps to be skipped:
afp://NOTMYserver.company.com/del/e/QQ888-9999/QQ888-9999-3/QQ888-9999-3.dmg
and
afp://NOTMYserver.company.com/del/e/QQ666-7777/QQ666-7777-3/QQ666-7777-3.dmg


the sample email would look something like this:

more_blah_before
0px; padding-bottom: 0px; padding-left: 0px; ">Software</td><td =
style=3D"font-size: 11px; margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: =
0px; padding-bottom: 0px; padding-left: 0px; "><a =
href=3D"afp://NOTMYserver.company.com/del/e/QQ888-9999/Q=
Q888-9999-3/QQ888-9999-3.dmg">del/QQ888-9999/QQ888-9999-3</a></td=
></tr><tr style=3D"vertical-align: top; margin-top: 0px; margin-right: =
padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "><td =
href=3D"afp://NOTmyserver.company.com/del/e/QQ666-7777/Q=
Q666-7777-3/QQ666-7777-3.dmg">del/QQ666-7777/QQ666-7777-3</a></td=
style=3D"font-size: 11px; margin-top: 0px; margin-right: 0px; =
0px; padding-bottom: 0px; padding-left: 0px; ">Software</td><td =
style=3D"font-size: 11px; margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: =
0px; padding-bottom: 0px; padding-left: 0px; "><a =
href=3D"afp://MYserver.company.com/del/e/QQ888-9999/Q=
Q888-9999-3/QQ888-9999-3.dmg">del/QQ888-9999/QQ888-9999-3</a></td=
></tr><tr style=3D"vertical-align: top; margin-top: 0px; margin-right: =
padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "><td =
href=3D"afp://MYserver.company.com/del/e/QQ666-7777/Q=
Q666-7777-3/QQ666-7777-3.dmg">del/QQ666-7777/QQ666-7777-3</a></td=
style=3D"font-size: 11px; margin-top: 0px; margin-right: 0px; =
more_blah_after

thanks again, much appreciated
# 4  
Old 07-25-2008
Take a look at grep

The grep command should be able to select the records you want to include. Using grep -v allows you to exclude records that match a criterion.

So, you might want to append

grep "afp://MYserver.company.com"
or
grep -v "afp://NOTMYserver.company.com"

to the end of the previous command string.
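
One caveat: grep selects whole lines, and after tr -d '\n' the entire message is a single line, so the filter only helps once each link sits on a line of its own. A rough sketch of that idea follows. It assumes a grep that supports -o (GNU and BSD grep do, but it is not in POSIX), and the function names afp_links and wanted_links are invented for illustration.

```shell
# Hypothetical helper: print every afp:// link found on stdin, one per line.
afp_links() {
    tr -d '\r' |                # tolerate CRLF line endings
    sed 's/=$//' |              # drop quoted-printable soft line breaks
    tr -d '\n' |                # join into one line so wrapped links are whole
    grep -o 'afp://[^"]*'       # -o prints each match on its own line
}

# Hypothetical helper: keep only the links that point at the host given as $1.
wanted_links() {
    afp_links | grep -F "afp://$1/"
}
```

With the sample email saved as file, wanted_links MYserver.company.com < file should list only the two MYserver links, since "afp://NOTMYserver" never contains the string "afp://MYserver".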
# 5  
Old 07-25-2008
... or replace sed with awk:
Code:
tr -d '\n' < file | awk -F'"' -v v="MYserver" '{for(i=1;i<=NF;i++){if(match($i,"/"v)) print $i}}'
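
For the loop that was asked about, the awk output can be fed into a while read loop, one link per iteration. The sketch below is only an illustration: the function name process_links is invented, the printf stands in for whatever download step comes next, and the soft line breaks are stripped first so the printed paths no longer contain a stray "=".

```shell
# Hypothetical helper: loop over every afp:// link on stdin that points at host $1.
process_links() {
    host=$1
    tr -d '\r' | sed 's/=$//' | tr -d '\n' |
    # Split the joined line on double quotes; a field that starts with
    # afp://<host>/ is one of the wanted links.
    awk -F'"' -v v="$host" '{
        for (i = 1; i <= NF; i++)
            if (index($i, "afp://" v "/") == 1) print $i
    }' |
    while IFS= read -r url; do
        # Placeholder action: replace with the real download command.
        printf 'would fetch: %s\n' "$url"
    done
}
```

It would be called as process_links MYserver.company.com < file.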

# 6  
Old 07-25-2008
again, unbelievable. thank you guys, what would take me days (if not weeks) to figure out is sometimes just a couple of posts away. anyway this will be a great starting point for me to learn something new and useful.

thanks again.
# 7  
Old 07-25-2008
almost there:

I am now able to get the desired paths, and do further string replacements which is great.

my output ends up being something like this:

output="file1path file2path"

I'd like to further process this and have the two paths in two separate variables:

file1="file1path"
file2="file2path"

what's the best approach here? I don't think I need an array, just two simple variables.

thanks again for any hints
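
A minimal sketch for this last question, assuming the two paths are separated by a space and neither path contains whitespace itself: read can split the string into two plain variables, no array needed. The value of output is the placeholder from the post above.

```shell
output="file1path file2path"

# read assigns the first word to file1 and the rest of the line to file2.
read -r file1 file2 <<EOF
$output
EOF

echo "file1=$file1"
echo "file2=$file2"
```

The same split can be done with parameter expansion, file1=${output%% *} and file2=${output#* }, which avoids the here-document.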