Help with gawk script that aggregates ip address ranges


 
# 1  
Old 07-13-2009

Howdy folks,

Perhaps someone can help me with this problem. My knowledge of awk is not the best... I've managed up to a certain point, and now I'm stuck.

These are the steps and their output formats; the problem itself is described after STEP 2:


STEP 1
Unformatted text file (100+ MB) containing IP addresses
"unformated-unsorted-IP-list.txt"
Code:
#TEXTLINE1
#TEXTLINE2
1.1.1.1
2.2.2.2
1.1.1.2
1.1.1.3
3.3.3.3
4.4.4.4

|
|
V
script is called:
Code:
gawk -f convert2p2p.awk unformated-unsorted-IP-list.txt > formated-unsorted-ip-list.p2p

Code:
 
# content of convert2p2p.awk (a gawk script, run with gawk -f)
# It skips the first two lines of the file, since they are not IP addresses,
# then zero-pads every octet to the 000.000.000.000 format, puts the
# CBLSPAMMER label in front, and prints each address as a one-address range.
# (My problem: I also want the ranges sorted and aggregated, as explained below.)
 
BEGIN {FS="\\.*"}
NR > 2 {printf "CBLSPAMMER:%03d.%03d.%03d.%03d-%03d.%03d.%03d.%03d\n" ,$1,$2,$3,$4,$1,$2,$3,$4}

|
|
V
output of "formated-unsorted-ip-list.p2p" file would look like this:
Code:
EXAMPLE:001.001.001.001-001.001.001.001
EXAMPLE:002.002.002.002-002.002.002.002
EXAMPLE:001.001.001.002-001.001.001.002
EXAMPLE:001.001.001.003-001.001.001.003
EXAMPLE:003.003.003.003-003.003.003.003
EXAMPLE:004.004.004.004-004.004.004.004

Explanation: the IP addresses must be zero-padded this way so that a plain text sort puts them in numeric order. The script grows the file from ~100MB of input to ~300MB of output. Yeah, that's a HUGE BUNCH (9+ million IP addresses) of spammers and malware sources...

This step takes around a minute on my PC. Gawk is REALLY fast, even when running under Windows via unxutils.
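
A quick way to see why the padding matters for sorting is sketched below. It is only an illustration: pad_ips is a made-up helper name, and plain awk and sort are assumed to be on the PATH (gawk should behave the same here).

```shell
# Hypothetical helper: zero-pad every octet so text order equals numeric order.
# Lines that do not split into exactly four dot-separated fields are skipped.
pad_ips() {
    awk -F'[.]' 'NF == 4 { printf "%03d.%03d.%03d.%03d\n", $1, $2, $3, $4 }'
}

# Unpadded, a plain text sort puts 10.0.0.1 before 9.0.0.1.
printf '9.0.0.1\n10.0.0.1\n' | LC_ALL=C sort

# Padded, the same sort puts 009.000.000.001 before 010.000.000.001.
printf '9.0.0.1\n10.0.0.1\n' | pad_ips | LC_ALL=C sort
```

With the padding in place, any byte-wise sort utility orders the addresses numerically, which is what the next step relies on.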


STEP 2:
Sorting of the IP addresses

You will notice that the output of the previous step contains a one-address range for each IP address, but the ranges are not sorted. This step sorts them using the cmsort or sort utility.

So, here is an example with cmsort, which sorts a 400MB file in under 25 minutes.
script is called:
Code:
cmsort /Q /B /T=d:\temp formated-unsorted-ip-list.p2p formated-sorted-ip-list.p2p

the output file "formated-sorted-ip-list.p2p" would look like this:
Code:
EXAMPLE:001.001.001.001-001.001.001.001
EXAMPLE:001.001.001.002-001.001.001.002
EXAMPLE:001.001.001.003-001.001.001.003
EXAMPLE:002.002.002.002-002.002.002.002
EXAMPLE:003.003.003.003-003.003.003.003
EXAMPLE:004.004.004.004-004.004.004.004

Dum ta dum... and this is as far as I've managed.
The problem is in STEP 3

**** PROBLEM ****
HOW do I make a gawk script that checks adjacent IP addresses and groups them into ranges, thus cutting down the number of entries and significantly reducing the file size?

*****************

STEP 3:

Aggregating the formatted, sorted IP list into ranges of adjacent IP addresses.

Calling the script that I need your help to create:
Code:
 
gawk -f theuberscript.awk formated-sorted-ip-list.p2p > formated-sorted-AGGREGATED-ip-list.p2p

output of that script should convert the input from formated-sorted-ip-list.p2p into this:
Code:
EXAMPLE:001.001.001.001-001.001.001.003       (merging three entries into one line - sure cuts down the file size, right? hehe)
EXAMPLE:002.002.002.002-002.002.002.002
EXAMPLE:003.003.003.003-003.003.003.003
EXAMPLE:004.004.004.004-004.004.004.004

I don't mind if STEP 1 and STEP 2 are combined into a single command... but I suspect that would increase the time it takes to produce the sorted, p2p-formatted output. Currently it takes around 25 minutes for 9+ million entries, each consisting of a single IP address.
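
For reference, one possible shape for STEP 3 is sketched below. It is a sketch under assumptions, not a tested solution for the full 9+ million line file: aggregate_ranges is a made-up name, the input must already be sorted and use the LABEL:start-end format shown above, and the label itself must not contain a dot, colon or dash. It reads one line at a time and keeps only the currently open range, so it does not build any array.

```shell
# Hypothetical helper: merge sorted LABEL:start-end lines whose ranges touch or overlap.
aggregate_ranges() {
    awk -F'[-:.]' '
    # Turn four octet fields into one number so that "adjacent" means "+1".
    function ipnum(a, b, c, d) { return ((a * 256 + b) * 256 + c) * 256 + d }

    NF == 9 {
        split($0, part, /[-:]/)      # part[1]=label, part[2]=start, part[3]=end
        lo = ipnum($2, $3, $4, $5)
        hi = ipnum($6, $7, $8, $9)
        if (have && part[1] == label && lo <= last + 1) {
            # This line touches or overlaps the open range: extend it.
            if (hi > last) { last = hi; last_txt = part[3] }
            next
        }
        # Otherwise close the open range (if any) and start a new one.
        if (have) print label ":" first_txt "-" last_txt
        label = part[1]; first_txt = part[2]; last_txt = part[3]
        last = hi; have = 1
    }
    END { if (have) print label ":" first_txt "-" last_txt }
    '
}
```

It would be called as aggregate_ranges < formated-sorted-ip-list.p2p > formated-sorted-AGGREGATED-ip-list.p2p. Because adjacency is checked on the whole address as a number, 001.001.001.255 and 001.001.002.000 would also be merged into one range.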

I even tried doing this with my script:

Code:
 
BEGIN {FS="\\.*"}
NR > 2 {printf "CBLSPAMMER:%03d.%03d.%03d.1-%03d.%03d.%03d.254\r\n" ,$1,$2,$3,$1,$2,$3}

....but since that widens every address to its whole x.x.x.1-x.x.x.254 block, it unfortunately blocked a hell of a lot of IP addresses that were definitely NOT spammers or malware spreaders, so I can't use that method.

In case you were wondering why I need this particular script... it's because I need it for ProtoWall or PeerGuardian... I am sure some of you use that software for torrents (legal ofc), but I need it to block spammers... I hate the number of DNS queries that go out from my server when checking whether an inbound mail sender's IP address is on a DNSBL.
And for curiosity's sake =) I wonder what happens if I load it with more than a handful of IPs to block.

So, what do you say folks, can someone help me with this script? =)
Hope it's not as complicated as I've presented it, har har!

Best regards,

Matt
# 2  
Old 07-13-2009
Code:
use strict;
my %hash;
while(<DATA>){
	chomp;
	my @tmp=split("[.]",$_);
	map {$_=sprintf("%03d",$_)} @tmp;	# zero-pad every octet
	my $key = sprintf("%s.%s.%s",$tmp[0],$tmp[1],$tmp[2]);	# first three octets
	push @{$hash{$key}}, $tmp[3];	# collect the last octets per key
}
foreach my $key (sort keys %hash){
	my @tmp=@{$hash{$key}};
	@tmp=sort @tmp;
	# prints lowest--highest last octet per key, including any gaps in between
	print $key,".",$tmp[0],"--",$key,".",$tmp[$#tmp],"\n";
}
__DATA__
1.1.1.1
2.2.2.2
1.1.1.2
4.4.4.3
1.1.1.3
3.3.3.3
4.4.4.4
2.2.2.3

# 3  
Old 07-14-2009
Try...
Code:
BEGIN {
   FS = "."
}
NF == 4 {
   Key = sprintf("%03d.%03d.%03d.%03d", $1, $2, $3, $4)
   Arr[Key] = Key   # every address is held in memory; duplicates collapse
}
END {
   n = asorti(Arr)  # gawk extension: sorted indices end up in Arr[1..n]
   for (Idx = 1; Idx <= n; Idx++) {
      Curr = substr(Arr[Idx], 1, 11) (substr(Arr[Idx], 13, 3) + 0)
      Prev = substr(Arr[Idx-1], 1, 11) (substr(Arr[Idx-1], 13, 3) + 1)
      Next = substr(Arr[Idx+1], 1, 11) (substr(Arr[Idx+1], 13, 3) - 1)
      if (Curr != Prev && Curr != Next ) {
          print Arr[Idx] "-" Arr[Idx]   # isolated address
      } else if (Curr != Prev && Curr == Next ) {
          printf "%s-", Arr[Idx]        # first address of a range
      } else if (Curr == Prev && Curr != Next ) {
          print Arr[Idx]                # last address of a range
      }
   }
}

Tested...
Code:
$ cat file1
#TEXTLINE1
#TEXTLINE2
5.5.5.53
5.5.5.54
5.5.5.55
1.1.1.1
2.2.2.2
1.1.1.2
1.1.1.3
3.3.3.3
4.4.4.4
5.5.5.5
5.5.5.51
5.5.5.52

$ gawk -f a1.awk file1 > file2

$ cat file2
001.001.001.001-001.001.001.003
002.002.002.002-002.002.002.002
003.003.003.003-003.003.003.003
004.004.004.004-004.004.004.004
005.005.005.005-005.005.005.005
005.005.005.051-005.005.005.055

$


Last edited by Ygor; 07-14-2009 at 03:12 AM..
# 4  
Old 07-14-2009
Howdy Ygor and summer_cherry,

Thanks for trying to help!


Okie, let's see now:

When I tried Ygor's script:
Code:
gawk -f test.awk cbl-block-list.txt > cbl-block-list.p2p

After a few minutes my PC froze and bam... this error popped up.

Code:
gawk: test.awk:5: (FILENAME=cbl-block-list.txt FNR=3102944) fatal: format_tree:
obuf: can't allocate memory (Not enough space)

I believe I didn't mention the hardware specs of my PC and the OS:
3GB of RAM (that's 2x1GB and 2x512MB, not 4GB on a 32-bit OS), 2x dual-core Intel processors (that's a dual-processor mobo), 4 HDDs of which two are mirrored and two are striped for better filesystem performance.

I still believe the program should've had enough free memory to run... but as I mentioned before, the file I need to aggregate is around 300-400MB in size (after being converted from the initial 100-150MB of pure unique IPs). That's not a small chunk =) so it might have attempted a huge memory allocation as it worked its way through building the array?
If you want, I can provide the file via some file-exchange service so you can test with it.

The OS is WinXP 32-bit since that's where ProtoWall is installed, but the scripts all run via unxutils.

I tried running the script on a 6MB file containing just IP addresses:
Code:
gawk -f test.awk small-chunk.txt > small-chunk.p2p
gawk: test.awk:9: (FILENAME=small-chunk.txt FNR=443897) fatal: function `asorti'
 not defined

Will take a peek and see what's up with the asorti function.


When I tried to run summer_cherry's script:
Code:
 
gawk -f test2.awk cbl-block-list.txt > cbl-block-list.p2p
gawk: test2.awk:3: while(<DATA>){
gawk: test2.awk:3: ^ parse error
gawk: test2.awk:3: while(<DATA>){
gawk: test2.awk:3:              ^ parse error
gawk: test2.awk:5:      my @tmp=split("[.]",$_);
gawk: test2.awk:5:         ^ invalid char '@' in expression

Oopsie... did I just do that? Guess my gawk doesn't like it. Perhaps my perl would like it better?

I am running gawk via unxutils and not in a native Unix environment, so that might be a slight problem in some cases.

Let's see if we (well, "you" is more correct) can make these scripts work. Har har.
I still gotta properly figure out the fine magic behind those lines.

And again - thanks a bunch for helping me work this one out!

Cheers,

Matt

PS. My apologies for the broken English .)

Last edited by gustisok; 07-14-2009 at 04:56 AM..