Help with url filtering


 
# 1  
Old 10-22-2011
Hi,

Here's my input file:

Code:
http://google.com/1.exe
http://google.com/2.exe
http://google.com/3.exe
http://rediff.jp/sample.zip
http://yahoo.com/1.exe

And here's what my output file should ideally look like:

Code:
http://google.com
http://rediff.jp/sample.zip
http://yahoo.com/1.exe

So what I'm trying to do is keep just one entry for any site whose URLs repeat, and keep the ones that don't repeat as they are, for URL blacklisting.

Can this be done using sed/grep/awk/uniq etc.?

Thanks in advance
# 2  
Old 10-22-2011
Try this...
Code:
awk -F/ '{if($3 in _1){gsub(/[a-z0-9.]*$/,"")} _1[$3]=$0}END{for(j in _1){print _1[j]}}' input_file

--ahamed
# 3  
Old 10-23-2011
Hi ahamed,

Thanks for the reply. This works, but only partially.

If I use this code on this set of URLs:

Code:
http://google.com/dogs/1.exe
http://google.com/cats/2.exe
http://google.com/apples/3.exe
http://rediff.jp/sample.zip
http://yahoo.com/1.exe

I should be getting this:

Code:
http://google.com/
http://rediff.jp/sample.zip
http://yahoo.com/1.exe

But instead I get this:

Code:
http://google.com/apples/
http://rediff.jp/sample.zip
http://yahoo.com/1.exe

Also, could you please explain your code?

Thanks in advance
# 4  
Old 10-23-2011
Try this...

Code:
awk '{t=match($0,"[a-z]/");v=substr($0,0,t+1);v in _1?_1[v]=v:_1[v]=$0} END{for(j in _1){print _1[j]}}' input_file

--ahamed

Last edited by ahamed101; 10-23-2011 at 06:34 AM..
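For anyone wondering what this one-liner does, here is the same idea spelled out as a minimal sketch. The function name `collapse_by_host` and the array name `seen` are made up for illustration. It assumes every host ends in a lowercase letter followed by `/`, starts `substr` at index 1 (awk implementations can differ in how they treat a start index of 0), and pipes through `sort` because the order of `for (k in seen)` is unspecified.

```shell
# Hypothetical helper: reads URLs from stdin or from the files given as arguments.
collapse_by_host() {
  awk '
  {
    # Position of the first lowercase letter directly followed by "/".
    # In URLs like these, that is the last letter of the host name.
    t = match($0, /[a-z]\//)

    # Everything up to and including that slash, e.g. "http://google.com/".
    v = substr($0, 1, t + 1)

    # First time this prefix shows up: remember the full URL.
    # Any later time: replace the stored value with the bare prefix.
    if (v in seen)
      seen[v] = v
    else
      seen[v] = $0
  }
  END { for (k in seen) print seen[k] }
  ' "$@" | sort
}
```

Run it as `collapse_by_host input_file`. A host that ends in a digit or an uppercase letter (an IP address, for example) would not be matched at the right place by `[a-z]/`, so this is not a general URL parser.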
# 5  
Old 10-23-2011
Thanks, that did the trick. Could you explain the code, please?

Thanks in advance
# 6  
Old 10-23-2011
You can try this if you want to strip "google" and leave the others intact:
Code:
sed "s#\(http://google.com\)/.*#\1/#g" inp | sort -u

# 7  
Old 10-24-2011
Hi peasant,

Thanks for the reply, but I won't know the URLs beforehand, so I wouldn't be able to hard-code them.
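If the hosts are not known in advance, one possible approach is to read the file twice: count the URLs per host on the first pass, then collapse only the hosts that occur more than once on the second. This is only a sketch; the function name `collapse_repeated_hosts` and the array names `count` and `printed` are made up, and it assumes every line has the form `scheme://host/path`, so that with `/` as the field separator the host is field 3.

```shell
# Hypothetical helper: takes a single file name, because the file is read twice.
collapse_repeated_hosts() {
  awk -F/ '
    # Pass 1 (first reading of the file): count how many URLs each host has.
    NR == FNR { count[$3]++; next }

    # Pass 2: a host with several URLs is printed once, reduced to scheme://host/
    count[$3] > 1 {
      if (!printed[$3]++) print $1 "//" $3 "/"
      next
    }

    # A host that appears only once keeps its full URL.
    { print }
  ' "$1" "$1"
}
```

Called as `collapse_repeated_hosts input_file`, this keeps the lines in their original order, so no `sort` is needed.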