RudiC, thank you very much for providing this solution; it is truly appreciated. I checked through the public suffix list and found that the longest entry has 4 levels, so I added this to the script you provided. Now it works and provides all the different domains. Here is the code I am now using:
I'm surprised this is working for you. There seem to be a few problems:
The code shown in red in your awk script will never be executed. Since the condition on the two red condition/action sets is identical to the orange condition and the action section with that condition ends with a next command, the actions shown in red cannot be executed.
I believe your code should explicitly ignore blank lines and comment lines in public_suffix_list.dat (unless you pruned those lines out of the public suffix list provided here when you downloaded the public list into your file).
I don't see how this code handles wildcards in rules (e.g., *.sch.uk).
I don't see how this code handles exception rules (although there aren't any exception rules if you're just trying to process UK domains).
And, according to the rules published for the public list, you should be loading values in your array with C[$1] instead of C[$0], but I don't see anything in the public list that includes a comment at the end of any rules so (if you ignored comment lines and blank lines) it might not matter.
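The points above can be sketched as a single awk script. This is only a rough illustration, not the script posted in this thread: the function name registrable_domain and the array names exc, wild and plain are made up for the example, and it assumes one host name per input line. It skips blank and comment lines, stores $1 rather than $0, and treats wildcard and exception rules separately.

```shell
# Sketch only: registrable_domain is a hypothetical helper name.
# $1 = rule file in public_suffix_list.dat format; host names arrive on stdin.
registrable_domain() {
    awk '
    NR == FNR {                                     # first file: the rule list
        if ($0 ~ /^[ \t]*$/ || $0 ~ /^\/\//) next   # ignore blank and comment lines
        rule = $1                                   # the rule is the first field only
        if (rule ~ /^!/)           exc[substr(rule, 2)] = 1    # exception rule
        else if (rule ~ /^[*][.]/) wild[substr(rule, 3)] = 1   # wildcard rule
        else                       plain[rule] = 1
        next
    }
    NF {                                            # remaining input: host names
        n = split(tolower($1), lab, ".")
        suffix_len = 1                              # default rule: last label is the suffix
        for (i = 1; i <= n; i++) {                  # candidate = labels i..n
            cand = lab[i]
            for (j = i + 1; j <= n; j++) cand = cand "." lab[j]
            len = n - i + 1
            if (cand in exc) { suffix_len = len - 1; break }   # exceptions win
            if ((cand in plain) && len > suffix_len) suffix_len = len
            rest = substr(cand, length(lab[i]) + 2) # candidate minus its first label
            if (i < n && (rest in wild) && len > suffix_len) suffix_len = len
        }
        if (n <= suffix_len) { print "-"; next }    # host is itself a public suffix
        dom = lab[n - suffix_len]                   # suffix plus one more label
        for (j = n - suffix_len + 1; j <= n; j++) dom = dom "." lab[j]
        print dom
    }' "$1" -
}
```

It could be called as `registrable_domain public_suffix_list.dat < hosts.txt`, where hosts.txt is a hypothetical file with one host name per line.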
Don Cragun, you are correct, I got excited too early. After running the script through a few hundred examples I found it is not working as desired. Do you maybe have a suggestion for how to extract the domain from the URL?
Hi RudiC, Thank you for the script, I am trying to resolve one challenge to check if it is working. I am currently getting: warning: escape sequence `\.' treated as plain `.'
Will try and figure out the sequence.
In an ERE . matches any character. The intent is to match only a period at the start of those patterns. Change each occurrence of "\." in the script to "[.]" and it should get rid of the warnings and restrict the match to what was intended. (You could also use "\\.", but I find the matching list expression easier to use than trying to remember how many times a quoted expression will be evaluated by awk in cases like this.)
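As a small illustration of the bracket-expression form, here is a sketch that builds a dynamic regular expression from a string. The function name ends_with_suffix and the variable names are made up for this example and are not taken from the script in this thread.

```shell
# Sketch only: ends_with_suffix is a hypothetical helper name.
# $1 = a suffix such as "co.uk"; host names arrive on stdin.
ends_with_suffix() {
    awk -v suffix="$1" '
    BEGIN {
        gsub(/[.]/, "[.]", suffix)   # turn every period in the suffix into a literal match
        re = "[.]" suffix "$"        # for co.uk this gives [.]co[.]uk$
    }
    $0 ~ re                          # print only lines that end in .suffix
    '
}
```

With the suffix co.uk, a line such as www.example.co.uk is printed, while www.example-co-uk is not, even though an unescaped ".co.uk$" would have matched it.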
Thank you RudiC, for the script and assistance, it is truly appreciated. The script works very well now and extracts the Domain from the URL.
Also thank you Don Cragun, for the assistance.
Here is the final script I am currently using that was written by RudiC:
---------- Post updated 11-07-15 at 01:36 PM ---------- Previous update was 11-06-15 at 02:53 PM ----------
Hi RudiC and Don Cragun, could I kindly ask you one final favor: to optimize the script that I currently have. The objective is to take the raw log from BIND and enrich it by extracting the URL and adding content categorization, then to write the results to different files to summarize them. The challenge is that the script below processes 3.83 lines a second and I have 9 million lines a day.
The input log from the DNS1 file looks like the following:
Thank you very much already in advance.
Last edited by omuhans123; 11-07-2015 at 07:48 AM..
Obviously, replacing:
with:
(which eliminates 5 executions of awk and 1 execution of cut per line in your log file) should let you process MANY more lines per second. Or, just build this into an awk script that will do all of this and do the URL processing you requested before in a single awk (instead of invoking awk again for every line in your log file).
What is the format of the files in the directory /opt/URL/BL? How many files are there? How many categories are there? Running 5 processes for every line in your log file to grab whatever it is that you want to get is going to keep things running slow. If we can preprocess those files into a table we can search more efficiently for each line's data, that would help immensely.
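To show what preprocessing into a table might look like, here is a sketch that makes loud assumptions, since neither the format of the files in /opt/URL/BL nor the log layout is known yet. It assumes the category data has already been flattened into a hypothetical two-column file ("domain category" per line) and that the queried name sits in a log field whose number is passed in. The function name categorize and the array name category are made up for the example.

```shell
# Sketch only: categorize is a hypothetical helper name.
# $1 = table file with "domain category" per line (assumed format)
# $2 = number of the log field that holds the queried name (assumed layout)
# Log lines arrive on stdin.
categorize() {
    awk -v fld="$2" '
    NR == FNR { category[$1] = $2; next }   # load the whole table once
    {
        name = $fld
        found = "uncategorized"
        while (1) {                         # try the name, then each parent domain
            if (name in category) { found = category[name]; break }
            p = index(name, ".")
            if (!p) break
            name = substr(name, p + 1)
        }
        print $0, found                     # log line with its category appended
    }' "$1" -
}
```

The table is read once and every log line then costs only a few array lookups inside one awk process, instead of several new processes per line.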