Get only domain from url file bind


 
# 22  
Old 11-11-2015
Hi RudiC. OK, let me try to explain the challenge I am faced with; you are correct that there are many great minds out there who might have a possible solution for this.

So there are three DNS servers, and each one of them produces around 9 million lines a day, a total of 27 million records a day. The information produced by these DNS servers does not offer much to work with beyond the date, time, client IP, URL and record type. The objective is therefore to enrich this information by deriving additional fields from the base content received.

Original Log File:

Code:
04-Nov-2015 08:28:39.261 queries: info: client 192.168.169.122#59319: query: istatic.eshopcomp.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.269 queries: info: client 192.168.212.136#48872: query: idsync.rlcdn.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.269 queries: info: client 192.168.19.61#53970: query: 3-courier.sandbox.push.apple.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.270 queries: info: client 192.168.169.122#59319: query: ajax.googleapis.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.272 queries: info: client 192.168.251.24#37028: query: um.simpli.fi IN A + (10.10.80.50)
04-Nov-2015 08:28:39.272 queries: info: client 192.168.251.24#37028: query: www.wtp101.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.273 queries: info: client 192.168.251.24#37028: query: magnetic.t.domdex.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.273 queries: info: client 172.25.111.175#59612: query: api.smoot.apple.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.275 queries: info: client 192.168.7.181#45913: query: www.miniclip.com IN A + (10.10.80.50)

From the domain or URL one should be able to assign it to different categories, like socialnet or news (some have multiple categories), and additionally look up the GeoIP information for the location. That makes it possible to identify the destinations of these URLs, since different URLs are hosted in different countries; Google, for example, has multiple locations for different content. Below is a sample of the output:

Code:
04-Nov-2015,08:28:39.261,192.168.169.122,istatic.eshopcomp.com,205.185.208.26,eshopcomp.com,A,US,UnitedStates,AZ,Arizona,Phoenix,85012,33.508301,-112.071701,602,
04-Nov-2015,08:28:39.269,192.168.212.136,idsync.rlcdn.com,54.172.162.24,rlcdn.com,A,US,UnitedStates,VA,Virginia,Ashburn,20147,39.033501,-77.483803,703,
04-Nov-2015,08:28:39.269,192.168.19.61,3-courier.sandbox.push.apple.com,17.172.232.11,apple.com,A,US,UnitedStates,CA,California,Cupertino,95014,37.304199,-122.094597,408,updatesites news forum porn movies hobby podcasts downloads shopping webradio
04-Nov-2015,08:28:39.270,192.168.169.122,ajax.googleapis.com,216.58.223.10,ajax.googleapis.com,A,US,UnitedStates,CA,California,MountainView,94043,37.419201,-122.057404,650,
04-Nov-2015,08:28:39.272,192.168.251.24,um.simpli.fi,158.85.41.203,simpli.fi,A,US,UnitedStates,VA,Virginia,Chantilly,22022,38.894299,-77.431099,703,
04-Nov-2015,08:28:39.272,192.168.251.24,www.wtp101.com,54.86.5.94,wtp101.com,A,US,UnitedStates,VA,Virginia,Ashburn,20147,39.033501,-77.483803,703,
04-Nov-2015,08:28:39.273,192.168.251.24,magnetic.t.domdex.com,54.217.251.207,domdex.com,A,IE,Ireland,07,Dublin,Dublin,N/A,53.333099,-6.248900,0,tracker
04-Nov-2015,08:28:39.273,172.25.111.175,api.smoot.apple.com,17.252.75.246,apple.com,A,US,UnitedStates,CA,California,Cupertino,95014,37.304199,-122.094597,408,updatesites news forum porn movies hobby podcasts downloads shopping webradio
04-Nov-2015,08:28:39.275,192.168.7.181,www.miniclip.com,54.192.207.82,miniclip.com,A,US,UnitedStates,WA,Washington,Seattle,98144,47.583900,-122.299500,206,hobby

Now the information has been enriched and adds additional value; it can be imported into a GIS tool, which provides a nice graphical interface, etc.
So the challenge is that the 27 million records produced each day are taking days to be processed and enriched with additional information. Is there a solution to get all this information into a file within minutes or hours, not days?
# 23  
Old 11-11-2015
Well, you still didn't say where and how to get that enriching information, and what I see in your sample above is not found in the BL files you presented earlier - at least I didn't find it.

Some thoughts to improve processing:
- could you remove duplicates from the input files?
- could you preprocess (condense) those BL files?
- could you split the info needed into separate outputs, each requiring less processing?
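To illustrate the second point, a condensing pass could be run once per day instead of searching the whole BL tree for every log line. This is only a sketch: it assumes each category is a subdirectory of /opt/URL/BL/ containing a file named domains with one domain per line (adjust to your actual layout), and bl_map.csv is a name made up here.

```shell
# Condense the BL tree into a single domain,category map - built once,
# instead of grep -r'ing the whole tree for every one of 27M lines.
# Assumed layout: $BL_DIR/<category>/domains, one domain per line.
BL_DIR=${BL_DIR:-/opt/URL/BL}
for d in "$BL_DIR"/*/; do
    cat=$(basename "$d")                                   # category = directory name
    [ -f "$d/domains" ] && awk -v c="$cat" '{print $0 "," c}' "$d/domains"
done | sort -t, -k1,1 > bl_map.csv

# A per-line category lookup then becomes a single cheap grep:
# grep "^$dom," bl_map.csv
```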

Last edited by RudiC; 11-11-2015 at 04:48 PM..
# 24  
Old 11-11-2015
Yes, here is the information:
Some thoughts to improve processing:
- could you remove duplicates from the input files?
No, as they have different time-stamps.
- could you preprocess (condense) those BL files?
For the BL information it would be possible to do this only at the end, not for every line.
- could you split the info needed into separate outputs, each requiring less processing?
Yes, as long as there is one file that includes everything, or the outputs are linked by common information.

Here is the current script used:
Code:
while read -r line
do
    # basic fields from the query-log line
    dt=$(awk '{print $1}' <<< "$line")
    tm=$(awk '{print $2}' <<< "$line")
    ipt=$(awk '{print $6}' <<< "$line")
    ip=$(cut -d'#' -f1 <<< "$ipt")
    url=$(awk '{print $8}' <<< "$line")
    type=$(awk '{print $10}' <<< "$line")

    # GeoIP lookups for the queried name
    urlip=$(geoiplookup -i -f /usr/share/GeoIP/GeoIP.dat "$url" | awk -F ":" '{print $2}' | cut -d',' -f1 | awk 'NR==2' | tr -d '[:space:]')
    countrys=$(geoiplookup -f /usr/share/GeoIP/GeoIP.dat "$url" | awk -F ":" '{print $2}' | cut -d',' -f1 | awk 'NR==1' | tr -d '[:space:]')
    country=$(geoiplookup -f /usr/share/GeoIP/GeoIP.dat "$url" | awk -F "," '{print $2}' | awk 'NR==1' | tr -d '[:space:]')
    as=$(geoiplookup -f /usr/share/GeoIP/GeoIPASNum.dat "$url" | awk '{print $4}' | cut -d',' -f1 | awk 'NR==2' | tr -d '[:space:]')
    regions=$(geoiplookup -f /usr/share/GeoIP/GeoLiteCity.dat "$url" | awk -F "," '{print $3}' | tr -d '[:space:]')
    region=$(geoiplookup -f /usr/share/GeoIP/GeoLiteCity.dat "$url" | awk -F "," '{print $4}' | tr -d '[:space:]')
    city=$(geoiplookup -f /usr/share/GeoIP/GeoLiteCity.dat "$url" | awk -F "," '{print $5}' | tr -d '[:space:]')
    postalCode=$(geoiplookup -f /usr/share/GeoIP/GeoLiteCity.dat "$url" | awk -F "," '{print $6}' | tr -d '[:space:]')
    lat=$(geoiplookup -f /usr/share/GeoIP/GeoLiteCity.dat "$url" | awk -F "," '{print $7}' | tr -d '[:space:]')
    long=$(geoiplookup -f /usr/share/GeoIP/GeoLiteCity.dat "$url" | awk -F "," '{print $8}' | tr -d '[:space:]')
    areaCode=$(geoiplookup -f /usr/share/GeoIP/GeoLiteCity.dat "$url" | awk -F "," '{print $10}' | tr -d '[:space:]')

    echo "$url" > temp-url

    # reduce the query name to its registered domain via the public suffix list
    dom=$(awk '
    /^\/\/|^ *$/    {next}

    FNR!=NR         {for (f in FIVE)  if ($0 ~ "[.]" f "$")  {print $(NF-5), $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF; next}
                     for (f in FOUR)  if ($0 ~ "[.]" f "$")  {print $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF; next}
                     for (t in THREE) if ($0 ~ "[.]" t "$")  {print $(NF-3), $(NF-2), $(NF-1), $NF; next}
                     for (t in TWO)   if ($0 ~ "[.]" t "$")  {print $(NF-2), $(NF-1), $NF; next}
                     for (o in ONE)   if ($0 ~ "[.]" o "$")  {print $(NF-1), $NF; next}
                     next
                    }

    /^\*/           {next}

    NF==5           {FIVE[$0]}
    NF==4           {FOUR[$0]}
    NF==3           {THREE[$0]}
    NF==2           {TWO[$0]}
    NF==1           {ONE[$0]}
    ' FS="." OFS="." public_suffix_list.dat temp-url)

    # category = directory level under the blocklist tree that matches the domain
    ct=$(grep -i -r "$dom" /opt/URL/BL/ | cut -d'/' -f5 | uniq -d | head)

    echo "$dt,$tm,$ip,$url,$urlip,$dom,$type,$countrys,$country,$regions,$region,$city,$postalCode,$lat,$long,$areaCode,$ct" >> tmp_Logs
    echo "$dom" >> tmp_DOM
    echo "$dom,$country,$city,$city,$lat,$long,$ct" >> tmp_CT
done < tmp

sort tmp_DOM | uniq -cd | sort -nr > tmp_Sort
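For comparison, the plain field extraction in the loop above can be done in a single awk pass over the whole file, instead of one awk call per field per line. This is only a sketch against the sample log format shown earlier (the GeoIP and category lookups would still need separate handling); query.log and parsed.csv are names made up here.

```shell
# Sample line in the format shown above (normally: the full bind query log)
printf '%s\n' '04-Nov-2015 08:28:39.261 queries: info: client 192.168.169.122#59319: query: istatic.eshopcomp.com IN A + (10.10.80.50)' > query.log

# One awk pass extracts date, time, client IP, query name and record type
# for every line at once; field numbers match the query-log sample above.
awk '{
    ip = $6
    sub(/#.*/, "", ip)        # drop the "#port:" suffix from the client IP
    print $1, $2, ip, $8, $10
}' OFS=',' query.log > parsed.csv
```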

# 25  
Old 11-22-2015
Hi RudiC, quick question: would it not be possible to split the file into chunks of, say, 1 million lines and have multiple instances running, then join the files after they complete? So for the 27 million lines there would be 27 simultaneous instances of the script running?
# 26  
Old 11-22-2015
First thought: yes, why not, as long as you make sure the output files are unique. Still, the instances might compete for resources like memory, CPU, ...
# 27  
Old 11-22-2015
Thank you for the feedback. Do you maybe have any suggestions or examples on how best to script this?
I was thinking of maybe starting off with a
Code:
wc -l filename

and then dividing that line count by the number of files, say 30, then splitting the file accordingly and starting 30 scripts? I just do not know exactly how to do this.

Something like this, however I am not sure how to split the input into different files:
Code:
A=0 N=0
while IFS= read -r LINE ; do
  printf '%s\n' "$LINE" >> newfile$A        # append; with > each line overwrote its file
  (( ++N % 1000000 == 0 )) && (( A++ ))     # start a new file every million lines
done < "$INPUTFILE"
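A split-and-run sketch of the same idea, using split(1) instead of a read loop. The tr stage below is only a stand-in for the enrichment script (which would read its own chunk and write its own output file); demo_input.txt, out_* and results.csv are names made up here.

```shell
# Stand-in input so the sketch runs end-to-end; replace with the real log
seq 1 100 > demo_input.txt

# Cut the input into fixed-size chunks (chunk_aa, chunk_ab, ...);
# for the 27-million-line file, -l 1000000 would give 27 chunks
split -l 25 demo_input.txt chunk_

# One background instance per chunk, each writing its own output file;
# the tr subshell is a placeholder for the enrichment while-loop
for f in chunk_*; do
    ( tr '[:lower:]' '[:upper:]' < "$f" > "out_$f" ) &
done
wait                                  # block until every instance has finished

# Stitch the per-chunk results back together in order
cat out_chunk_* > results.csv
```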


Last edited by omuhans123; 11-22-2015 at 11:36 AM..
# 28  
Old 11-22-2015
man split