Sponsored Content
Top Forums Shell Programming and Scripting Get only domain from url file bind Post 302959947 by omuhans123 on Sunday 8th of November 2015 08:59:33 AM
Old 11-08-2015
Thank you Don Cragun for the optimization of the code, I was not aware of this method of defining variables in one line. With this method I was able to improve the line reading speed by 5 times.

Regarding the
Code:
/opt/URL/BL

directory, here the file downloaded has different directories that contains files with the list of URLs defined (here is the link: wget http://www.shallalist.de/Downloads/shallalist.tar.gz). In the directory there are the following categories:
Code:
adv
aggressive
alcohol
anonvpn
automobile
chat
COPYRIGHT
costtraps
dating
downloads
drugs
dynamic
education
finance
fortunetelling
forum
gamble
global_usage
government
hacking
hobby
homestyle
hospitals
imagehosting
isp
jobsearch
library
military
models
movies
music
news
podcasts
politics
porn
radiotv
recreation
redirector
religion
remotecontrol
ringtones
science
searchengines
sex
shopping
socialnet
spyware
tracker
updatesites
urlshortener
violence
warez
weapons
webmail
webphone
webradio
webtv

In these category directories these are two files domains and urls. There are numerous hits for one domain e.g: facebook.com
Code:
adv porn movies hobby socialnet spyware redirector finance chat

So with the below line I am trying to grep all directory names that define the category and add these to the variable $ct with spaces between them if more than one.

Code:
ct=$(grep -i -r $dom /opt/URL/BL/ | cut -d'/' -f5 | uniq -d | head )

Here is the entire code now after updating:
Code:
while read -r dt tm _ _ _ ipt _ url _ type _
do      ip=${ipt%%#*}

echo $url > temp-url

dom=$(awk '
/^\/\/|^ *$/    {next}

FNR!=NR         {for (f in FIVE)  if ($0 ~ "[.]" f "$")  {print $(NF-5), $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF; next}
                 for (f in FOUR)  if ($0 ~ "[.]" f "$")  {print $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF ; next}
                 for (t in THREE) if ($0 ~ "[.]" t "$")  {print $(NF-3), $(NF-2), $(NF-1), $NF; next}
                 for (t in TWO)   if ($0 ~ "[.]" t "$")  {print $(NF-2), $(NF-1), $NF; next}
                 for (o in ONE)   if ($0 ~ "[.]" o "$")  {print $(NF-1), $NF; next}
                 next
                }

/^\*/           {next}

NF==5           {FIVE[$0]}
NF==4           {FOUR[$0]}
NF==3           {THREE[$0]}
NF==2           {TWO[$0]}
NF==1           {ONE[$0]}
' FS="." OFS="." public_suffix_list.dat temp-url)

ct=$(grep -i -r $dom /opt/URL/BL/ | cut -d'/' -f5 | uniq -d | head )

echo $dt,$tm,$ip,$url,$dom,$type,$ct >> DNS1_Logs
echo $dom >> DNS1_DOM
echo $dom,$ct >> DNS1_CT
done < DNS1

sort DNS1_DOM | uniq -cd | sort -nr > DNS1_Sort

one additional question, the domain awk code, is it possible to read a variable like $dom instead of the tmp-url that I am currently first wiring to a temp file? and is it possible to do additional optimization?

---------- Post updated at 03:59 PM ---------- Previous update was at 03:48 PM ----------

Hi RudiC, thank you very much, WOW this is an amazing code. This code is much faster than the code currently working with. Two challenges that I am currently facing with this code is to read the Category files from the folders
Code:
/opt/URL/BL

(http://www.shallalist.de/Downloads/shallalist.tar.gz) and the second challenge is that the spaces are there not comma as separators. I have tried to figure out where exactly the spaces are defined however I have not been able to find this until now.

---------- Post updated at 03:59 PM ---------- Previous update was at 03:59 PM ----------

Hi RudiC, thank you very much, WOW this is an amazing code. This code is much faster than the code currently working with. Two challenges that I am currently facing with this code is to read the Category files from the folders
Code:
/opt/URL/BL

(http://www.shallalist.de/Downloads/shallalist.tar.gz) and the second challenge is that the spaces are there not comma as separators. I have tried to figure out where exactly the spaces are defined however I have not been able to find this until now.
 

10 More Discussions You Might Find Interesting

1. UNIX for Advanced & Expert Users

url calling and parameter passing to url in script

Hi all, I need to write a unix script in which need to call a url. Then need to pass parameters to that url. please help. Regards, gander_ss (1 Reply)
Discussion started by: gander_ss
1 Replies

2. Shell Programming and Scripting

url calling and parameter passing to url in script

Hi all, I need to write a unix script in which need to call a url. Then need to pass parameters to that url. please help. Regards, gander_ss (1 Reply)
Discussion started by: gander_ss
1 Replies

3. UNIX for Dummies Questions & Answers

ReDirecting a URL to another URL - Linux

Hello, I need to redirect an existing URL, how can i do that? There's a current web address to a GUI that I have to redirect to another webaddress. Does anyone know how to do this? This is on Unix boxes Linux. example: https://m45.testing.address.net/host.php make it so the... (3 Replies)
Discussion started by: SkySmart
3 Replies

4. Windows & DOS: Issues & Discussions

How to: Linux BOX in Windows Domain (w/out joining the domain)

Dear Expert, i have linux box that is running in the windows domain, BUT did not being a member of the domain. as I am not the System Administrator so I have no control on the server in the network, such as modify dns entry , add the linux box in AD and domain record and so on that relevant. ... (2 Replies)
Discussion started by: regmaster
2 Replies

5. Web Development

Regex to rewrite URL to another URL based on HTTP_HOST?

I am trying to find a way to test some code, but I need to rewrite a specific URL only from a specific HTTP_HOST The call goes out to http://SUB.DOMAIN.COM/showAssignment/7bde10b45efdd7a97629ef2fe01f7303/jsmodule/Nevow.Athena The ID in the middle is always random due to the cookie. I... (5 Replies)
Discussion started by: EXT3FSCK
5 Replies

6. UNIX for Dummies Questions & Answers

Awk: print all URL addresses between iframe tags without repeating an already printed URL

Here is what I have so far: find . -name "*php*" -or -name "*htm*" | xargs grep -i iframe | awk -F'"' '/<iframe*/{gsub(/.\*iframe>/,"\"");print $2}' Here is an example content of a PHP or HTM(HTML) file: <iframe src="http://ADDRESS_1/?click=5BBB08\" width=1 height=1... (18 Replies)
Discussion started by: striker4o
18 Replies

7. Shell Programming and Scripting

Hit multiple URL from a text file and store result in other test file

Hi, I have a problem where i have to hit multiple URL that are stored in a text file (input.txt) and save their output in different text file (output.txt) somewhat like : cat input.txt http://192.168.21.20:8080/PPUPS/international?NUmber=917875446856... (3 Replies)
Discussion started by: mukulverma2408
3 Replies

8. Shell Programming and Scripting

Reading URL using Mechanize and dump all the contents of the URL to a file

Hello, Am very new to perl , please help me here !! I need help in reading a URL from command line using PERL:: Mechanize and needs all the contents from the URL to get into a file. below is the script which i have written so far , #!/usr/bin/perl use LWP::UserAgent; use... (2 Replies)
Discussion started by: scott_cog
2 Replies

9. UNIX for Dummies Questions & Answers

Putting the colon infront of the URL domain

I have a file like this: http://hello.com www.examplecom computer Company I wanted to keep dot (.) infront of com. to make the file like this http://hello.com www.example.com computer Company I applied this expression sed -r 's/com/.com/g'but what I get is: http://hello.com ... (4 Replies)
Discussion started by: csim_mohan
4 Replies

10. UNIX for Dummies Questions & Answers

Extracting URL with domain

I have a file like this: http://article.wn.com/view/2010/11/26/IV_drug_policy_feels_HIV_patients_Red_Cross/ http://aidsjournal.com/,www.cfpa.org.cn/page1/page2 , www.youtube.com http://seattletimes.nwsource.com/html/jerrybrewer/2013517803_brewer25.html... (1 Reply)
Discussion started by: csim_mohan
1 Replies
All times are GMT -4. The time now is 04:08 AM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy