Get only domain from url file bind


 
# 1  
Old 11-01-2015

Hello everybody
I have been trying to extract the domain name from the BIND query log using different approaches, but I always get stuck on domains that end in a two-level suffix like .co.uk or .co.nz.

I tried the following, but it only returns the last two labels, which for such domains is just the suffix:
Code:
awk -F"." '{print $(NF-1)"."$NF}' list.txt > test.txt
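
To make the problem concrete, here is a minimal sketch that runs the same awk one-liner on three of the sample hosts. The function name `last_two_labels` is only an illustrative wrapper and is not part of the original command.

```shell
# Hypothetical wrapper around the one-liner above: keep the last two
# dot-separated labels of every line read from standard input.
last_two_labels() {
    awk -F"." '{print $(NF-1)"."$NF}'
}

printf '%s\n' api.smoot.apple.com www.google.co.nz bbc.co.uk | last_two_labels
# prints:
#   apple.com
#   co.nz
#   co.uk
```

The first host comes out as wanted, but the other two collapse to the bare suffix because the command has no way of knowing that co.nz and co.uk are not registrable names themselves.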

Current raw file:
Code:
csc.beap.bc.yahoo.com
googleads.g.doubleclick.net
dances.with.wolves.tracker.prq.to
fbcdn-sphotos-g-a.akamaihd.net
api.smoot.apple.com
glistockisti.it
apps.ad-x.co.uk
configuration.apple.com.edgekey.net
walter-producer-cdn.api.bbci.co.uk
www.google.co.nz
bbc.co.uk

Desired output file:
Code:
yahoo.com
doubleclick.net
tracker.prq.to
akamaihd.net
apple.com
glistockisti.it
ad-x.co.uk
edgekey.net
bbci.co.uk
google.co.nz
bbc.co.uk

Is it possible to get the domain names through a command or must the list be compared to another file that contains a list of all domains on the internet?
Moderator's Comments:
Mod Comment Please do not use FONT and SIZE tags when posting to The UNIX & Linux Forums.
Please use CODE tags; not ICODE tags for multi-line sample input, output, and code.

Last edited by Don Cragun; 11-01-2015 at 06:13 PM.. Reason: Change ICODE tags to CODE tags, get rid of FONT and SIZE tags.
# 2  
Old 11-01-2015
Quote:
Originally Posted by omuhans123
Is it possible to get the domain names through a command or must the list be compared to another file that contains a list of all domains on the internet?
You have to compare to another list that defines the sub-domains.
Follow this link, if still active, for more information.
A compilation list can be found here.
# 3  
Old 11-03-2015
Thank you for the response, Aia. However, that post is quite old, no longer seems to be active, and does not offer a solution as such. Thank you also for the publicsuffix list; it is very helpful and has given me a new possible approach to the challenge.

Unfortunately I am very new to the shell scripting world and would appreciate assistance in this regard. Here is the idea:

The URL is longer than the entries in the publicsuffix list, and its labels are separated by ".", so is it possible to grep or search the URL starting from the right-hand side and find the most specific match? Let me provide an example:

Code:
walter-producer-cdn.api.bbci.co.uk

Starting from the right-hand side and matching against the publicsuffix list:
publicsuffix list for uk:
Code:
uk
ac.uk
co.uk
gov.uk
ltd.uk
me.uk
net.uk
nhs.uk
org.uk
plc.uk
police.uk
*.sch.uk

URL lookup:
Code:
uk

-Match
Code:
co.uk

-Match
Code:
bbci.co.uk

-No Match

When No Match is returned, take the last match (co.uk) and add one more segment of the URL to end up with bbci.co.uk.

Would it be possible to script this?
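
The right-to-left lookup described in this post can be sketched in awk roughly as follows. This is only one possible reading of the idea, not a tested solution from the thread: the function name `registrable_domain` is made up for illustration, the suffix file is assumed to hold one plain suffix per line, and wildcard entries such as *.sch.uk are not interpreted.

```shell
# Hypothetical helper: $1 is a file with one public suffix per line;
# host names are read from standard input.
registrable_domain() {
    awk -F"." '
    NR == FNR { suffix[$0]; next }          # first file: remember every suffix
    {
        best = 0                            # label count of the longest matching suffix
        cand = ""
        for (i = NF; i >= 1; i--) {         # grow the candidate from the right
            cand = (cand == "") ? $i : $i "." cand
            if (cand in suffix) best = NF - i + 1
        }
        if (best == 0) best = 1             # assumption: treat an unlisted TLD as a one-label suffix
        start = NF - best                   # one extra label to the left of the suffix
        if (start < 1) start = 1            # the host is itself a suffix
        out = $start
        for (i = start + 1; i <= NF; i++) out = out "." $i
        print out
    }' "$1" -
}

list=$(mktemp)
printf '%s\n' uk ac.uk co.uk > "$list"
printf '%s\n' walter-producer-cdn.api.bbci.co.uk | registrable_domain "$list"
# prints: bbci.co.uk
rm -f "$list"
```

Checking every candidate and keeping the longest hit means the order of entries in the suffix file does not matter.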
# 4  
Old 11-03-2015
A slightly different approach:
Code:
awk '
NR==FNR                 {C[$0]
                         next
                        }
$(NF-1) OFS $(NF) in C  {print $(NF-2) OFS $(NF-1) OFS $NF
                         next
                        }
                        {print $(NF-1) OFS $NF
                        }
' FS="." OFS="." publicsuffix.lst raw

Output:
Code:
yahoo.com
doubleclick.net
prq.to
akamaihd.net
apple.com
glistockisti.it
ad-x.co.uk
edgekey.net
bbci.co.uk
co.nz
bbc.co.uk

# 5  
Old 11-03-2015
Thank you RudiC. Could I kindly ask you to elaborate on the code? As mentioned before, I am very new to this. I have two files: one that contains the URLs and another that contains the publicsuffix list. Thank you.
# 6  
Old 11-03-2015
Code:
awk '
NR==FNR                 {C[$0]                  # first file only (NR==FNR): store each line as an index of the associative array C
                         next                   # stop processing the current line; proceed with the next line
                        }
$(NF-1) OFS $(NF) in C  {print $(NF-2) OFS $(NF-1) OFS $NF
                                                # if the second-last ($(NF-1)) and last ($NF) fields, joined by a dot, are found in C,
                                                # print the third-last, second-last, and last fields
                         next                   # stop ... see above
                        }
                        {print $(NF-1) OFS $NF  # if the above does not apply, print the second-last and last fields
                        }
' FS="." OFS="." publicsuffix.lst raw           # supply the field separators and the two files to awk

This code certainly is not perfect; e.g. the co.nz is missing in the publicsuffix.lst, but it may serve as a starting point...
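
For readers new to awk, the NR==FNR idiom used above may be easier to see in isolation. The sketch below is an illustration only; the function name `in_first_file` and the temporary files are assumptions made for the demo.

```shell
# Hypothetical helper: print the lines of file $2 that also occur in file $1.
in_first_file() {
    awk '
    NR == FNR { seen[$0]; next }   # NR equals FNR only while the first file is read
    $0 in seen                     # second file: print lines remembered from the first
    ' "$1" "$2"
}

a=$(mktemp); b=$(mktemp)
printf '%s\n' co.uk co.nz > "$a"
printf '%s\n' co.uk prq.to co.nz > "$b"
in_first_file "$a" "$b"
# prints:
#   co.uk
#   co.nz
rm -f "$a" "$b"
```

NR counts lines across all input files while FNR restarts at 1 for each file, so the two are equal only during the first file.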
# 7  
Old 11-04-2015
RudiC, thank you very much for providing this solution; it is truly appreciated. I checked the publicsuffix list and found that the longest entry has 4 labels, so I extended your script to cover that. Now it works and returns all the different domains. Here is the code I am now using:
Code:
awk '
NR==FNR                 {C[$0]
                         next
                        }
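NF>4 && ($(NF-3) OFS $(NF-2) OFS $(NF-1) OFS $NF in C) {       # 4-label suffix found: print 5 labels
                         print $(NF-4) OFS $(NF-3) OFS $(NF-2) OFS $(NF-1) OFS $NF
                         next
                        }
NF>3 && ($(NF-2) OFS $(NF-1) OFS $NF in C) {                   # 3-label suffix found: print 4 labels
                         print $(NF-3) OFS $(NF-2) OFS $(NF-1) OFS $NF
                         next
                        }
NF>2 && ($(NF-1) OFS $NF in C) {                               # 2-label suffix found: print 3 labels
                         print $(NF-2) OFS $(NF-1) OFS $NF
                         next
                        }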
                        {print $(NF-1) OFS $NF
                        }
' FS="." OFS="." public_suffix_list.dat url.txt
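
The label count mentioned in this post can be checked with a short awk sketch like the one below. The function name `max_labels` is hypothetical, and the sketch assumes the list uses // for comment lines, as public_suffix_list.dat does. Wildcard markers such as * are simply counted as one label.

```shell
# Hypothetical helper: report the largest number of dot-separated labels
# found in a suffix list given as file arguments or on standard input.
max_labels() {
    awk -F"." '
    /^[ \t]*$/ || /^\/\// { next }      # skip blank lines and // comments
    NF > max { max = NF }               # remember the largest label count
    END { print max + 0 }
    ' "$@"
}

printf '%s\n' '// United Kingdom' uk co.uk '*.sch.uk' | max_labels
# prints: 3
```

Running `max_labels public_suffix_list.dat` on the real list shows how many suffix lengths the rules in the script need to cover.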
