Help with a gawk script that aggregates IP address ranges
Howdy folks,
Perhaps someone can help me with this problem. My knowledge of awk is not the best... but I've managed to a certain degree and now I'm stuck.
These are the steps and the output formats; the problem itself is described after STEP 2:
STEP 1: An unformatted text file (100+ MB) containing IP addresses, "unformated-unsorted-IP-list.txt".
The script is called, and the output file "formated-unsorted-ip-list.p2p" would look like this:
Explanation: the IP addresses must be formatted this way because of sorting issues. The script grows the original ~100MB input file into a result file of around ~300MB. Yeah, that's a HUGE BUNCH (9+ million IP addresses) of spammers and malware sources...
This step takes around a minute on my puter. Gawk is REALLY really fast - even when running under Windows unxtools.
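In case it clarifies things, step 1 is essentially this one-liner (a sketch only - I'm assuming the "formatting" is zero-padding every octet, which is what makes a plain byte-wise text sort order the addresses numerically):

```shell
# Sketch of step 1 (assumption: zero-pad each octet to three digits so a
# later byte-wise text sort orders IPs correctly). File names as above.
gawk -F. '{ printf "%03d.%03d.%03d.%03d\n", $1, $2, $3, $4 }' \
    unformated-unsorted-IP-list.txt > formated-unsorted-ip-list.p2p
```

This is also why the file triples in size: every "9" becomes "009".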
STEP 2: Sorting of the IP addresses
You will notice that the output of the previous step contains an array for each IP address, but those arrays are not sorted. This step does that by using the cmsort or sort utility.
So, let's give an example with cmsort - it sorts the 400MB file in under 25 minutes.
The script is called, and the output file "formated-sorted-ip-list.p2p" would look like this:
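For anyone who wants to reproduce step 2 without cmsort, plain sort works on the zero-padded list. A sketch - note that -S (memory buffer) and -T (temp dir) are GNU sort extensions and may not exist in every build:

```shell
# Byte-wise sort of the padded list; -u also drops duplicate IPs for free.
# LC_ALL=C forces plain byte ordering. -S/-T are GNU extensions - drop them
# if your sort build doesn't have them.
LC_ALL=C sort -u -S 512M -T . \
    formated-unsorted-ip-list.p2p > formated-sorted-ip-list.p2p
```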
Dum ta dum... and this is as far as I've managed.
The problem is in STEP 3
**** PROBLEM **** How to make a gawk script that checks for adjacent IP addresses and groups them into ranges, thus cutting down the number of arrays and significantly reducing the file size?
*****************
STEP 3:
Aggregating the formatted, sorted IP list into arrays consisting of adjacent IP addresses.
Calling the script (the one I need your help to create) should convert the input from formated-sorted-ip-list.p2p into this:
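Roughly what I'm after, sketched out in case it clarifies the ask (untested; the file names and the zero-padded output form are my assumptions - stripping the padding or prepending a p2p label could be a final pass): a single streaming run that folds each run of consecutive addresses into a start-end pair, keeping only the current run in memory instead of a 9-million-entry gawk array.

```shell
# Sketch of step 3: merge runs of consecutive IPs into start-end ranges.
# Reads one sorted, zero-padded IP per line; constant memory.
gawk '
function ip2num(ip,    a) {
    split(ip, a, ".")
    return ((a[1] * 256 + a[2]) * 256 + a[3]) * 256 + a[4]
}
NR == 1 { start = $0; prev = ip2num($0); last = $0; next }
{
    n = ip2num($0)
    if (n == prev) next                               # duplicate line, skip
    if (n == prev + 1) { prev = n; last = $0; next }  # run continues
    print start "-" last                              # run ended, emit range
    start = $0; prev = n; last = $0
}
END { if (NR > 0) print start "-" last }
' formated-sorted-ip-list.p2p > aggregated-ip-list.p2p
```

The run detection is just "next number equals previous plus one", so duplicates are skipped and a gap of even one address closes the current range.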
I don't mind if STEP 1 and STEP 2 are combined into a single line... but somehow I think that would increase the time it takes to produce the sorted, p2p-formatted output. Currently it takes around 25 minutes for 9+ million arrays, each consisting of a single IP address.
I even tried doing this with my script:
....but that unfortunately blocked a helluva lot of IP addresses that were definitely NOT spammers or malware spreaders, so I can't use that method.
In case you were wondering why I need this particular script... it's because I need it for Protowall or PeerGuardian... I am sure some of you use that software for torrents (legal ofc ), but I need it to block spammers... I hate the number of DNS queries that go out from my server when it checks whether an inbound mail sender's IP address is on their DNSBL lists.
And for curiosity's sake =) I wonder what happens if I load it with more than a handful of IPs to block.
So, what do you say folks, can someone help me with this script? =)
Hope it's not as complicated as I've presented it, har har!
When I tried Ygor's script:
After a few minutes my puter froze and bam... this error popped up.
I believe I didn't mention the hardware specs of my PC and the OS:
3GB RAM (that's 2x1GB and 2x512MB, not 4GB on a 32-bit OS), 2x dual-core Intel procs (that's a dual-proc mobo), and 4 HDDs, of which two are mirrored and two are striped for better filesystem performance.
I still believe the prog should've had enough free mem to perform... but as I mentioned before, the file I need to run the aggregation on is around 300-400MB in size (after being converted from the initial 100-150MB of pure unique IPs) - that's not a small chunk =) - so it might have attempted a huge memory allocation as it worked its way through the array creation?
If you want, I can provide you the file via some file-exchange service so you can test it on or something?
The OS is WinXP 32bit since that's where protowall is located, but scripts are all running via unxutils.
I tried running the script on a 6MB file containing just IP addresses:
Will take a peek and see what's up with the asorti function.
When I tried to run summer_cherry's script:
Oopsie... did I just do that? Guess my gawk doesn't like it. Perhaps my perl would like it better?
I am running gawk via unxtools and not in a native unix environment, so that might be a slight problem in some cases.
Let's see if we (well - you is more correct ) can make these scripts work. Har har!
I still gotta properly figure out the fine magic behind those lines.
And again - thanks a bunch for helping me work this one out!