RudiC, thank you very much for providing this solution, it is truly appreciated. I checked through the public suffix list and found that the longest suffix is four labels, so I added that case to the script you provided. Now it works and provides all the different domains. Here is the code I am now using:
Code:
awk '
NR==FNR {C[$0]
next
}
$(NF-1) OFS $(NF) in C {print $(NF-2) OFS $(NF-1) OFS $NF
next
}
$(NF-1) OFS $(NF) in C {print $(NF-3) OFS $(NF-1) OFS $NF
next
}
$(NF-1) OFS $(NF) in C {print $(NF-4) OFS $(NF-1) OFS $NF
}
{print $(NF-1) OFS $NF
}
' FS="." OFS="." public_suffix_list.dat url.txt
I'm surprised this is working for you. There seem to be a few problems:
The second and third condition/action pairs in your awk script (the ones printing $(NF-3) and $(NF-4)) can never be executed. Their condition is identical to the first pair's condition, and since the action associated with that first condition ends with a next command, control never reaches the later pairs.
I believe your code should explicitly ignore blank lines and comment lines in public_suffix_list.dat (unless you pruned those lines out when you downloaded the public suffix list into your file).
I don't see how this code handles wildcards in rules (e.g., *.sch.uk).
I don't see how this code handles exception rules (although there aren't any exception rules if you're just trying to process UK domains).
And, according to the rules published for the public list, you should be loading values in your array with C[$1] instead of C[$0], but I don't see anything in the public list that includes a comment at the end of any rules so (if you ignored comment lines and blank lines) it might not matter.
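Putting those last two points together, a minimal sketch (mine, not RudiC's script) of loading the rules: comment and blank lines are skipped, and only the first field of each rule is stored. The sample rules are fed on stdin here instead of reading public_suffix_list.dat:

```shell
printf '// comment\n\nuk\nco.uk\n*.sch.uk\n' |
awk '
/^\/\//  {next}     # skip comment lines
/^ *$/   {next}     # skip blank lines
         {C[$1]}    # store only the rule itself (first field), per the PSL format
END      {for (r in C) print r}
'
```

Only the three real rules (uk, co.uk, *.sch.uk) end up in the array; wildcard and exception rules would still need their own handling on lookup.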
Don Cragun, you are correct, I got excited too early. After running the script through a few hundred examples I found it is not working as desired. Do you maybe have a suggestion on how to extract the domain from the URL?
Hi RudiC, Thank you for the script, I am trying to resolve one challenge to check if it is working. I am currently getting: warning: escape sequence `\.' treated as plain `.'
Will try and figure out the sequence.
In an ERE . matches any character. The intent is to match only a period at the start of those patterns. Change each occurrence of "\." in the script to "[.]" and it should get rid of the warnings and restrict the match to what was intended. (You could also use "\\.", but I find the matching list expression easier to use than trying to remember how many times a quoted expression will be evaluated by awk in cases like this.)
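To illustrate the difference (a standalone example, not part of the script):

```shell
# "." matches any character, so "axb" wrongly matches the pattern "a.b":
echo "axb" | awk '$0 ~ "a.b" {print "matched"}'     # prints "matched"
# "[.]" matches only a literal period:
echo "axb" | awk '$0 ~ "a[.]b" {print "matched"}'   # prints nothing
echo "a.b" | awk '$0 ~ "a[.]b" {print "matched"}'   # prints "matched"
```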
Thank you RudiC, for the script and assistance, it is truly appreciated. The script works very well now and extracts the Domain from the URL.
Also thank you Don Cragun, for the assistance.
Here is the final script I am currently using that was written by RudiC:
Code:
awk '
/^\/\/|^ *$/ {next}
FNR!=NR {for (f in FIVE) if ($0 ~ "[.]" f "$") {print $(NF-5), $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF; next}
for (f in FOUR) if ($0 ~ "[.]" f "$") {print $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF; next}
for (t in THREE) if ($0 ~ "[.]" t "$") {print $(NF-3), $(NF-2), $(NF-1), $NF; next}
for (t in TWO) if ($0 ~ "[.]" t "$") {print $(NF-2), $(NF-1), $NF; next}
for (o in ONE) if ($0 ~ "[.]" o "$") {print $(NF-1), $NF; next}
next
}
/^\*/ {next}
NF==5 {FIVE[$0]}
NF==4 {FOUR[$0]}
NF==3 {THREE[$0]}
NF==2 {TWO[$0]}
NF==1 {ONE[$0]}
' FS="." OFS="." public_suffix_list.dat rawfile
---------- Post updated 11-07-15 at 01:36 PM ---------- Previous update was 11-06-15 at 02:53 PM ----------
Hi RudiC and Don Cragun, could I kindly ask you one final favor: to optimize the script that I currently have? The objective is to take the raw log from BIND and enrich it by extracting the domain from the URL and adding content categorization, then writing the results to different files to summarize them. The challenge is that the script below processes 3.83 lines a second, and I have 9 million lines a day.
The input log from the DNS1 file look like the following:
Code:
04-Nov-2015 08:28:39.261 queries: info: client 192.168.169.122#59319: query: istatic.eshopcomp.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.269 queries: info: client 192.168.212.136#48872: query: idsync.rlcdn.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.269 queries: info: client 192.168.19.61#53970: query: 3-courier.sandbox.push.apple.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.270 queries: info: client 192.168.169.122#59319: query: ajax.googleapis.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.272 queries: info: client 192.168.251.24#37028: query: um.simpli.fi IN A + (10.10.80.50)
04-Nov-2015 08:28:39.272 queries: info: client 192.168.251.24#37028: query: www.wtp101.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.273 queries: info: client 192.168.251.24#37028: query: magnetic.t.domdex.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.273 queries: info: client 172.25.111.175#59612: query: api.smoot.apple.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.275 queries: info: client 192.168.7.181#45913: query: www.miniclip.com IN A + (10.10.80.50)
Code:
while read -r line
do
dt=$(awk -F " " '/ / {print $1}' <<< $line) #Reading the date from the log file into variable
tm=$(awk -F " " '/ / {print $2}' <<< $line) #Reading the time from the log file into variable
ipt=$(awk -F " " '/ / {print $6}'<<< $line) #Reading the IP address from the log file into variable
ip=$(cut -d'#' -f1 <<< $ipt) #removing the port from the IP address and write into variable
url=$(awk -F " " '/ / {print $8}' <<< $line) #Reading the URL from the log file into variable
type=$(awk -F " " '/ / {print $10}' <<< $line) #Reading the Record Type from the log file into variable
echo $url > temp-url #Writing the URL into a temp file as I could not get awk to read the variable directly in the statement below
dom=$(awk '
/^\/\/|^ *$/ {next}
FNR!=NR {for (f in FIVE) if ($0 ~ "[.]" f "$") {print $(NF-5), $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF; next}
for (f in FOUR) if ($0 ~ "[.]" f "$") {print $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF ; next}
for (t in THREE) if ($0 ~ "[.]" t "$") {print $(NF-3), $(NF-2), $(NF-1), $NF; next}
for (t in TWO) if ($0 ~ "[.]" t "$") {print $(NF-2), $(NF-1), $NF; next}
for (o in ONE) if ($0 ~ "[.]" o "$") {print $(NF-1), $NF; next}
next
}
/^\*/ {next}
NF==5 {FIVE[$0]}
NF==4 {FOUR[$0]}
NF==3 {THREE[$0]}
NF==2 {TWO[$0]}
NF==1 {ONE[$0]}
' FS="." OFS="." public_suffix_list.dat temp-url) #extracting the Domain from the URL
ct=$(grep -i -r $dom /opt/URL/BL/ | cut -d'/' -f5 | uniq -d | head) #Using the http://www.shalla.de/ categorization database: looking up every domain and reading the folder location to get the category it is in
echo $dt,$tm,$ip,$url,$dom,$type,$ct >> DNS1_Logs #Rewriting the log line, now also containing the domain and category of the lookup, with unnecessary information removed
echo $dom >> DNS1_DOM #Writing only the domain names into a separate file
echo $dom,$ct >> DNS1_CT #Writing the domain and category names into a separate file
done < DNS1
sort DNS1_DOM | uniq -cd | sort -nr > DNS1_Sort #Sorting the domains to get the most utilized ones
Thank you very much already in advance.
Last edited by omuhans123; 11-07-2015 at 07:48 AM..
Replacing this code:
while read -r line
do
dt=$(awk -F " " '/ / {print $1}' <<< $line) #Reading the date from the log file into variable
tm=$(awk -F " " '/ / {print $2}' <<< $line) #Reading the time from the log file into variable
ipt=$(awk -F " " '/ / {print $6}'<<< $line) #Reading the IP address from the log file into variable
ip=$(cut -d'#' -f1 <<< $ipt) #removing the port from the IP address and write into variable
url=$(awk -F " " '/ / {print $8}' <<< $line) #Reading the URL from the log file into variable
type=$(awk -F " " '/ / {print $10}' <<< $line) #Reading the Record Type from the log file into variable
with:
Code:
while read -r dt tm _ _ _ ipt _ url _ type _
do ip=${ipt%%#*}
(which eliminates 5 executions of awk and 1 execution of cut per line in your log file) should let you process MANY more lines per second. Or, just build this into an awk script that will do all of this and do the URL processing you requested before in a single awk (instead of invoking awk again for every line in your log file).
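As a rough sketch of that single-awk idea (the field positions assume the BIND log format shown above; the domain lookup and categorization would be folded into the same script), the per-line parsing can be done like this:

```shell
# One awk pass over the log: date, time, client IP (port stripped), URL, record type.
awk '{
    ip = $6
    sub(/#.*/, "", ip)                      # strip "#port:" from the client field
    print $1 "," $2 "," ip "," $8 "," $10
}' <<'EOF'
04-Nov-2015 08:28:39.261 queries: info: client 192.168.169.122#59319: query: istatic.eshopcomp.com IN A + (10.10.80.50)
EOF
```

This prints 04-Nov-2015,08:28:39.261,192.168.169.122,istatic.eshopcomp.com,A and starts awk once for the whole file instead of five times per line.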
What is the format of the files in the directory /opt/URL/BL? How many files are there? How many categories are there? Running 5 processes for every line in your log file to grab whatever it is that you want to get is going to keep things running slow. If we can preprocess those files into a table we can search more efficiently for each line's data, that would help immensely.
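One way to do that preprocessing, assuming the shallalist layout /opt/URL/BL/&lt;category&gt;/domains with one domain per line (a sketch using a tiny mock tree, not your real data):

```shell
# Build a "domain<TAB>category" table once, instead of running grep -r per log line.
mkdir -p BL/adv BL/news
echo "wtp101.com" > BL/adv/domains
echo "wn.com"     > BL/news/domains

for f in BL/*/domains; do
    cat=${f#BL/}; cat=${cat%/domains}            # category = directory name
    awk -v c="$cat" '{print $0 "\t" c}' "$f"
done | sort > category.tab
```

A later awk pass can load category.tab into an array keyed by domain, turning each categorization into an O(1) in-memory lookup.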