Awk: print all URL addresses between iframe tags without repeating an already printed URL


 
# 1  
Old 02-28-2012
Here is what I have so far:

Code:
find . -name "*php*" -or -name "*htm*" | xargs grep -i iframe | awk -F'"' '/<iframe*/{gsub(/.\*iframe>/,"\"");print $2}'

Here is an example content of a PHP or HTM(HTML) file:

HTML Code:
<iframe src="http://ADDRESS_1/?click=5BBB08\" width=1 height=1 style="visibility:hidden;position:absolute"></iframe><iframe src="http://ADDRESS_2/?click=5BBB08\" width=1 height=1 style="visibility:hidden;position:absolute"></iframe>

<iframe src="http://ADDRESS_3/?click=5BBB08\" width=1 height=1 style="visibility:hidden;position:absolute"></iframe> <iframe src="http://ADDRESS_4/?click=5BBB08\" width=1 height=1 style="visibility:hidden;position:absolute"></iframe>

<iframe src="http://ADDRESS_5/?click=5BBB08\" width=1 height=1 style="visibility:hidden;position:absolute"></iframe>
<iframe src="http://ADDRESS_6/?click=5BBB08\" width=1 height=1 style="visibility:hidden;position:absolute"></iframe>
<iframe src="http://ADDRESS_7/?click=5BBB08\" width=1 height=1 style="visibility:hidden;position:absolute"></iframe>
<iframe src="http://ADDRESS_6/?click=5BBB08\" width=1 height=1 style="visibility:hidden;position:absolute"></iframe>
<iframe src="http://ADDRESS_7/?click=5BBB08\" width=1 height=1 style="visibility:hidden;position:absolute"></iframe>
Here is the wanted output:

, and here is my output:

Any suggestions? Better ideas? What I want to do is get a single result for each URL address in an iframe, on a single line.

Thanks! ;->
# 2  
Old 02-28-2012
If you're using awk, you don't need grep: awk '/regex/ { print }' filenames is equivalent to grep "regex" filenames.
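
A quick way to check that equivalence, as a sketch only: the temporary file and its two lines are made up for this demonstration and are not from the original post.

```shell
# Hypothetical sample file, created only for this demonstration.
tmp=$(mktemp)
printf '%s\n' 'plain line' '<iframe src="http://ADDRESS_1/"></iframe>' > "$tmp"

# Both commands print the lines that match the regex.
grep_out=$(grep 'iframe' "$tmp")
awk_out=$(awk '/iframe/ { print }' "$tmp")

[ "$grep_out" = "$awk_out" ] && echo "same output"
rm -f "$tmp"
```

One difference to keep in mind: when grep is given more than one file it prefixes each match with the file name, and awk does not, so a later awk stage that counts fields sees different input.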

I'm abusing awk's record separator, RS, here so that awk treats each < character as a "newline". It will split the two iframes apart by itself, and "iframe src=" will always show up at the beginning of the 'line' if present, which I test with the regex /^iframe src=/. If a match is found, it strips src= and the quotes from field $2 with a gsub, leaving just the URL, tests whether we've printed it already, and if not, prints it.

Code:
find . -name "*php*" -or -name "*htm*" |
        xargs awk -v RS='<' '
# For all records beginning with "iframe src":
/^iframe src=/ {
        # get rid of src= and "
        gsub(/(src=)|"/, "", $2);
        # Print only if we haven't seen it before
        if(!X[$2]++) print $2
}'
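
To see what the record splitting does before the gsub runs, here is a small sketch. The sample string is a shortened, made-up version of the line from post #1, and the variable names are arbitrary.

```shell
# Shortened sample modeled on post #1: two iframes on one line.
sample='<iframe src="http://ADDRESS_1/" width=1></iframe><iframe src="http://ADDRESS_2/" width=1></iframe>'

# With RS set to "<", each tag becomes its own record, and with the
# default field separator $2 is the whole src="..." attribute.
records=$(printf '%s\n' "$sample" | awk -v RS='<' '/^iframe src=/ { print $2 }')
printf '%s\n' "$records"
# prints:
# src="http://ADDRESS_1/"
# src="http://ADDRESS_2/"
```

This is why the gsub is there: $2 still carries src= and the surrounding quotes until they are removed.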

# 3  
Old 02-28-2012
Try:
Code:
awk '/http/' RS=\" infile

# 4  
Old 02-28-2012
Quote:
Originally Posted by Scrutinizer
Try:
Code:
awk '/http/' RS=\" infile

That never occurred to me. Thanks, it works very well. However, the output is:

Code:
http://ADDRESS_1/?click=5BBB08\
http://ADDRESS_2/?click=5BBB08\
http://ADDRESS_3/?click=5BBB08\
http://ADDRESS_4/?click=5BBB08\
http://ADDRESS_5/?click=5BBB08\
http://ADDRESS_6/?click=5BBB08\
http://ADDRESS_7/?click=5BBB08\
http://ADDRESS_6/?click=5BBB08\
http://ADDRESS_7/?click=5BBB08\

Now I only need to get rid of the duplicate entries.

I did try sort -u and got what I wanted.

So final:

Code:
awk '/http/' RS=\" infile | sort -u
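
Note that sort -u also reorders the output. If first-seen order matters, one possible alternative is to drop duplicates inside awk with an associative array. This is a sketch under assumptions: the sample text is made up, the array name seen is arbitrary, and -v RS='"' is used here to set the same separator as RS=\" above.

```shell
# Made-up sample: ADDRESS_7 comes first on purpose, ADDRESS_6 is repeated.
sample='<iframe src="http://ADDRESS_7/"></iframe>
<iframe src="http://ADDRESS_6/"></iframe>
<iframe src="http://ADDRESS_6/"></iframe>'

# sort -u removes duplicates but also sorts the lines.
sorted=$(printf '%s\n' "$sample" | awk -v RS='"' '/http/' | sort -u)

# The seen[] test prints a record only the first time it appears,
# so the original order is kept.
ordered=$(printf '%s\n' "$sample" | awk -v RS='"' '/http/ && !seen[$0]++')

printf 'sort -u:\n%s\n' "$sorted"
printf 'awk seen[]:\n%s\n' "$ordered"
```

With this input, sort -u lists ADDRESS_6 before ADDRESS_7, while the awk version keeps ADDRESS_7 first.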


Last edited by striker4o; 02-28-2012 at 04:03 PM.. Reason: Nevermind, I made it. Thanks for help.
# 5  
Old 02-28-2012
Wow, that's neat.

Combining these two ideas into one which checks duplicates and doesn't need grep:

Code:
find . -name "*php*" -or -name "*htm*" |
        xargs awk -F\" -v RS='<' '/^iframe src=/ { if(!X[$2]++) print $2 }'

# 6  
Old 02-28-2012
Code:
awk '/http/' RS=\" infile | sort -u

# 7  
Old 02-28-2012
Yeah, thanks. I sorted out "sort -u" almost immediately after my dumb question :)

So here is the final version:

Code:
find . -type f \( -name "*php*" -o -name "*htm*" \) | xargs awk '/http/' RS=\" | sort -u
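
File names that contain spaces can trip up a find | xargs pipeline. A possible variant, sketched here under assumptions, lets find start awk itself with -exec ... {} +. The scratch directory and file names below are hypothetical and exist only for the demonstration.

```shell
# Hypothetical scratch directory; one file name contains a space.
dir=$(mktemp -d)
printf '%s\n' '<iframe src="http://ADDRESS_1/"></iframe>' > "$dir/my page.html"
printf '%s\n' '<iframe src="http://ADDRESS_1/"></iframe>' > "$dir/index.php"

# -exec ... {} + hands the file names to awk directly, so no xargs is needed.
# The parentheses make -type f apply to both -name tests.
urls=$(find "$dir" -type f \( -name '*php*' -o -name '*htm*' \) \
        -exec awk '/http/' RS=\" {} + | sort -u)
printf '%s\n' "$urls"
rm -rf "$dir"
```

Both sample files contain the same address, so a single line is printed after sort -u.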

 
