URL partial matching


 
# 1  
Old 08-01-2014

I have two files. File 1:
Code:
http://www.hello.com        http://neo.com/peace/development.html, www.japan.com,  http://example.com/abc/abc.html
http://news.net             http://lolz.com/country/list.html,www.telecom.net, www.highlands.net, www.software.com
http://example2.com         http://earth.net, http://abc.gov.cn/department/1.html

File 2:
Code:
www.neo.com/1/2/3/names.html
http://abc.gov.cn/script.aspx
http://example.com/abc/abc.html

File 2 contains the search URLs, which should be partially matched against column 2 of file 1. For each line of file 1, the output should contain the column 1 URL followed by whichever column 2 URLs partially match, like this:

Desired output:
Code:
http://www.hello.com    http://neo.com/peace/development.html, http://example.com/abc/abc.html
http://news.net
http://example2.com     http://abc.gov.cn/department/1.html

I am using the following script, which finds exact URL matches in column 2 but does not handle partial matches:

Code:
awk -F '[ \t,]' '
FNR == NR {
    a[$1]
    next
}
{    o = $1
    c = 0
    for(i = 2; i <= NF; i++)
        if($i in a)
            o = o (c++ ? ", " : "\t") $i
    print o
}' file2 file1

The output is:
Code:
http://www.hello.com    http://example.com/abc/abc.html
http://news.net
http://example2.com

Any suggestions on how to fix this?
# 2  
Old 08-02-2014
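For context, here is a minimal sketch of why the script in the question only reports exact matches: awk's `in` operator looks up the whole string as an array key, so a column 2 URL is found only when it is identical to a line of file2. The function name `exact_lookup` is hypothetical and exists only for this illustration.

```shell
# exact_lookup KEY URL  (hypothetical helper, for illustration only)
# Stores KEY as an array index, then tests URL with `in`,
# the same test the original script applies to each field.
exact_lookup() {
    awk -v key="$1" -v url="$2" 'BEGIN {
        a[key]                                    # referencing a[key] creates the element
        print ((url in a) ? "match" : "no match") # exact string comparison of keys
    }'
}

# Same host, different path: not found.
exact_lookup "http://abc.gov.cn/script.aspx" "http://abc.gov.cn/department/1.html"   # no match

# Identical strings: found.
exact_lookup "http://example.com/abc/abc.html" "http://example.com/abc/abc.html"     # match
```

So both sides need to be reduced to a common key before the lookup, which is what the script below does.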
Code:
akshay@nio:/tmp$ cat file1
http://www.hello.com        http://neo.com/peace/development.html, www.japan.com,  http://example.com/abc/abc.html
http://news.net             http://lolz.com/country/list.html,www.telecom.net, www.highlands.net, www.software.com
http://example2.com         http://earth.net, http://abc.gov.cn/department/1.html

Code:
akshay@nio:/tmp$ cat file2
www.neo.com/1/2/3/names.html
http://abc.gov.cn/script.aspx
http://example.com/abc/abc.html

Code:
akshay@nio:/tmp$ cat cmp_url.awk 
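# Reduce a URL to its bare host name: drop the scheme, the path,
# any whitespace, and a leading "www.".
function host(s){
	sub(/^https?:\/\//,"",s)
	gsub(/\/.*|[[:space:]]+/,"",s)
	sub(/^www\./,"",s)
	return s
}
# First file (file2): remember the host of every search URL.
FNR==NR{
	HOSTS_IN_FILE2[host($0)]
	next
}
# Second file (file1): keep the column 2 URLs whose host was seen in file2.
NF{
	gsub(/,/," "); str = ""
	for(i=2;i<=NF;i++)
	{
		if( host($i) in HOSTS_IN_FILE2 )
		{
			str = length(str) ? str "," $i : $i
		}
	}
	print $1 ( length(str)? OFS str : "" )
}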

Result:
Code:
akshay@nio:/tmp$ awk -vOFS="\t" -f cmp_url.awk file2 file1
http://www.hello.com	http://neo.com/peace/development.html,http://example.com/abc/abc.html
http://news.net
http://example2.com	http://abc.gov.cn/department/1.html
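
Note that the "partial match" here is a match on host name only, which is an interpretation inferred from the desired output in the question. As a rough sketch, the normalization can be tried in isolation; `host_of` is a hypothetical wrapper name, not part of the script above.

```shell
# host_of URL  (hypothetical helper, for illustration only)
# Prints the bare host name of URL: no scheme, no path, no leading "www.".
host_of() {
    printf '%s\n' "$1" | awk '{
        s = $0
        sub(/^https?:\/\//, "", s)   # drop the scheme
        sub(/\/.*/, "", s)           # drop everything from the first "/"
        sub(/^www\./, "", s)         # drop a leading "www."
        print s
    }'
}

host_of "www.neo.com/1/2/3/names.html"            # neo.com
host_of "http://neo.com/peace/development.html"   # neo.com
host_of "http://abc.gov.cn/script.aspx"           # abc.gov.cn
```

Because both URLs on neo.com reduce to the same key, they match even though their paths differ. The flip side is that any URL on a listed host will match, whatever its path.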
