Get only domain from url file bind

11-07-2015

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

Would this come close to what you need?

Code:

awk '
/^\/\/|^ *$|^\*/        {next
                        }

FNR!=NR         {split ($6, X, "#")
                 printf "%s  %s  %s  %s  ", $1, $2, X[1], $8
                 n=split ($8, U, ".")
                 found = 0
                 if (!found && n >= 5)  for (f in FIVE) if ($8 ~ "\." f "$")    {found = 1; st = 5
                                                                                }
                 if (!found && n >= 4)  for (f in FOUR)  if ($8 ~ "\." f "$")   {found = 1; st = 4
                                                                                }
                 if (!found && n >= 3)  for (t in THREE) if ($8 ~ "\." t "$")   {found = 1; st = 3
                                                                                }
                 if (!found && n >= 2)  for (t in TWO)   if ($8 ~ "\." t "$")   {found = 1; st = 2
                                                                                }
                 if (!found && n >= 1)  for (o in ONE)   if ($8 ~ "\." o "$")   {found = 1; st = 1
                                                                                }
                 for (i=st; i>0; i--) printf "%s.", U[n-i]
                 printf "%s  %s  $CT_goes_here\n", U[n], $10
                }


NF==5           {FIVE[$0]}
NF==4           {FOUR[$0]}
NF==3           {THREE[$0]}
NF==2           {TWO[$0]}  
NF==1           {ONE[$0]}  

' FS="."  public_suffix_list.dat  FS=" "  DNS1
04-Nov-2015  08:28:39.261  192.168.169.122  istatic.eshopcomp.com  eshopcomp.com  A  $CT_goes_here
04-Nov-2015  08:28:39.269  192.168.212.136  idsync.rlcdn.com  rlcdn.com  A  $CT_goes_here
04-Nov-2015  08:28:39.269  192.168.19.61  3-courier.sandbox.push.apple.com  apple.com  A  $CT_goes_here
04-Nov-2015  08:28:39.270  192.168.169.122  ajax.googleapis.com  ajax.googleapis.com  A  $CT_goes_here
04-Nov-2015  08:28:39.272  192.168.251.24  um.simpli.fi  simpli.fi  A  $CT_goes_here
04-Nov-2015  08:28:39.272  192.168.251.24  www.wtp101.com  wtp101.com  A  $CT_goes_here
04-Nov-2015  08:28:39.273  192.168.251.24  magnetic.t.domdex.com  domdex.com  A  $CT_goes_here
04-Nov-2015  08:28:39.273  172.25.111.175  api.smoot.apple.com  apple.com  A  $CT_goes_here
04-Nov-2015  08:28:39.275  192.168.7.181  www.miniclip.com  miniclip.com  A  $CT_goes_here

It still takes its time, though, as it needs to creep through on average half the suffixes for every line in DNS1.

RudiC

View Public Profile for RudiC

Find all posts by RudiC

11-08-2015

Registered User

23, 0

Join Date: Nov 2015

Last Activity: 27 January 2018, 9:09 AM EST

Posts: 23

Thanks Given: 2

Thanked 0 Times in 0 Posts

Thank you Don Cragun for the optimization of the code, I was not aware of this method of defining variables in one line. With this method I was able to improve the line reading speed by 5 times.

Regarding the

Code:

/opt/URL/BL

directory, here the file downloaded has different directories that contains files with the list of URLs defined (here is the link: wget http://www.shallalist.de/Downloads/shallalist.tar.gz). In the directory there are the following categories:

Code:

adv
aggressive
alcohol
anonvpn
automobile
chat
COPYRIGHT
costtraps
dating
downloads
drugs
dynamic
education
finance
fortunetelling
forum
gamble
global_usage
government
hacking
hobby
homestyle
hospitals
imagehosting
isp
jobsearch
library
military
models
movies
music
news
podcasts
politics
porn
radiotv
recreation
redirector
religion
remotecontrol
ringtones
science
searchengines
sex
shopping
socialnet
spyware
tracker
updatesites
urlshortener
violence
warez
weapons
webmail
webphone
webradio
webtv

In these category directories these are two files domains and urls. There are numerous hits for one domain e.g: facebook.com

Code:

adv porn movies hobby socialnet spyware redirector finance chat

So with the below line I am trying to grep all directory names that define the category and add these to the variable $ct with spaces between them if more than one.

Code:

ct=$(grep -i -r $dom /opt/URL/BL/ | cut -d'/' -f5 | uniq -d | head )

Here is the entire code now after updating:

Code:

while read -r dt tm _ _ _ ipt _ url _ type _
do      ip=${ipt%%#*}

echo $url > temp-url

dom=$(awk '
/^\/\/|^ *$/    {next}

FNR!=NR         {for (f in FIVE)  if ($0 ~ "[.]" f "$")  {print $(NF-5), $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF; next}
                 for (f in FOUR)  if ($0 ~ "[.]" f "$")  {print $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF ; next}
                 for (t in THREE) if ($0 ~ "[.]" t "$")  {print $(NF-3), $(NF-2), $(NF-1), $NF; next}
                 for (t in TWO)   if ($0 ~ "[.]" t "$")  {print $(NF-2), $(NF-1), $NF; next}
                 for (o in ONE)   if ($0 ~ "[.]" o "$")  {print $(NF-1), $NF; next}
                 next
                }

/^\*/           {next}

NF==5           {FIVE[$0]}
NF==4           {FOUR[$0]}
NF==3           {THREE[$0]}
NF==2           {TWO[$0]}
NF==1           {ONE[$0]}
' FS="." OFS="." public_suffix_list.dat temp-url)

ct=$(grep -i -r $dom /opt/URL/BL/ | cut -d'/' -f5 | uniq -d | head )

echo $dt,$tm,$ip,$url,$dom,$type,$ct >> DNS1_Logs
echo $dom >> DNS1_DOM
echo $dom,$ct >> DNS1_CT
done < DNS1

sort DNS1_DOM | uniq -cd | sort -nr > DNS1_Sort

one additional question, the domain awk code, is it possible to read a variable like $dom instead of the tmp-url that I am currently first wiring to a temp file? and is it possible to do additional optimization?

---------- Post updated at 03:59 PM ---------- Previous update was at 03:48 PM ----------

Hi RudiC, thank you very much, WOW this is an amazing code. This code is much faster than the code currently working with. Two challenges that I am currently facing with this code is to read the Category files from the folders

Code:

/opt/URL/BL

(http://www.shallalist.de/Downloads/shallalist.tar.gz) and the second challenge is that the spaces are there not comma as separators. I have tried to figure out where exactly the spaces are defined however I have not been able to find this until now.

---------- Post updated at 03:59 PM ---------- Previous update was at 03:59 PM ----------

Hi RudiC, thank you very much, WOW this is an amazing code. This code is much faster than the code currently working with. Two challenges that I am currently facing with this code is to read the Category files from the folders

Code:

/opt/URL/BL

omuhans123

View Public Profile for omuhans123

Find all posts by omuhans123

11-08-2015

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

The commas you can insert easily in the format strings used by the printf statements.

Actually, I don't really understand what you want to achieve with your processing of the category files. cutting the fifth / separated field yields many empty strings, uniq -d prints duplicate lines that occur randomly unless preceded by a sort operation, and head prints 10 lines, yielding an unpredictable result. Please specify exactly what you want/need.

RudiC

View Public Profile for RudiC

Find all posts by RudiC

11-09-2015

Registered User

23, 0

Join Date: Nov 2015

Last Activity: 27 January 2018, 9:09 AM EST

Posts: 23

Thanks Given: 2

Thanked 0 Times in 0 Posts

Hi RudiC, apologies for not being clear on the point of the categories. lets me try and elaborate, for every URL of the logs I would like to extract the domain, this is the script you already provided. However what I would like to know is also the category of the url. this is where the shalla files or BL directory comes in. The Domain is then searched through the directories and returning the directory names they are in. Challenge is that the domains and urls files in the BL directories are only containing the one item that is the url or domains or ip addresses. as such what I am trying to do is identify from the initial url what the category is of the domain. this will enable me to check how many url's are browsing news or socialnet, etc.

omuhans123

View Public Profile for omuhans123

Find all posts by omuhans123

11-09-2015

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

Sorry, not clear. If you want help on categorizing your domains, you'd best describe an algorithm how to extract that from those BL files/domains/URLs. As already stated, right now your approach as well as the data structures are quite diffuse to me.

RudiC

View Public Profile for RudiC

Find all posts by RudiC

11-10-2015

Registered User

23, 0

Join Date: Nov 2015

Last Activity: 27 January 2018, 9:09 AM EST

Posts: 23

Thanks Given: 2

Thanked 0 Times in 0 Posts

Hi RudiC, I have thought about the categorization part of each line however it is taking too much processing power. I will rather do this at the end when I have summarized the files and sorted them. Thank you for the clarification on the uniq -d and sort part, I was not aware of the sort that needs to be there. Also the head that it only prints 10 lines. The last script you provided is working great and I have managed to get the commas in place now.

I however still am faced with a challenge as there a 9 million records and 3 servers, so in total 27 million records to be processed. The optimized script is now doing about 50 thousand lines an hour. As such I must find an alternative method of doing this even though it is working great, I need to be able to process these records as close as possible to real time.

Thank you RudiC, it is truly a pleasure and appreciated your assistance.

omuhans123

View Public Profile for omuhans123

Find all posts by omuhans123

11-10-2015

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

As I said before, crunching through those data amount will take its time.

Are you sure that you are taking the right approach? Why don't you explain you problem here in a broader context? Somebody might come up with a more efficient solution...

RudiC

View Public Profile for RudiC

Find all posts by RudiC

Shell Programming and Scripting

Get only domain from url file bind

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Extracting URL with domain

Discussion started by: csim_mohan

2. UNIX for Dummies Questions & Answers

Putting the colon infront of the URL domain

Discussion started by: csim_mohan

3. Shell Programming and Scripting

Reading URL using Mechanize and dump all the contents of the URL to a file

Discussion started by: scott_cog

4. Shell Programming and Scripting

Hit multiple URL from a text file and store result in other test file

Discussion started by: mukulverma2408

5. UNIX for Dummies Questions & Answers

Awk: print all URL addresses between iframe tags without repeating an already printed URL

Discussion started by: striker4o

6. Web Development

Regex to rewrite URL to another URL based on HTTP_HOST?

Discussion started by: EXT3FSCK

7. Windows & DOS: Issues & Discussions

How to: Linux BOX in Windows Domain (w/out joining the domain)

Discussion started by: regmaster

8. UNIX for Dummies Questions & Answers

ReDirecting a URL to another URL - Linux

Discussion started by: SkySmart

9. Shell Programming and Scripting

url calling and parameter passing to url in script

Discussion started by: gander_ss

10. UNIX for Advanced & Expert Users

url calling and parameter passing to url in script

Discussion started by: gander_ss