Help with gawk script that aggregates ip address ranges

07-13-2009

Registered User

4, 0

Join Date: Jul 2009

Last Activity: 21 September 2009, 10:50 AM EDT

Location: Zageb, Croatia

Posts: 4

Thanks Given: 0

Thanked 0 Times in 0 Posts

Help with gawk script that aggregates ip address ranges

Howdy folks,

perhaps someone can help me with this problem. My knowledge of awk is not the best... but I've managed to a certain degree and now I'm stuck.

These are the steps and the format outputs, problem is written in red text after the STEP 2:

STEP 1
Unformated text file (100+ megs ) containing IP addresses
"unformated-unsorted-IP-list.txt"

Code:

#TEXTLINE1
#TEXTLINE2
1.1.1.1
2.2.2.2
1.1.1.2
1.1.1.3
3.3.3.3
4.4.4.4

|
|
V
script is called:

Code:

gawk -f convert2p2p.awk unformated-unsorted-IP-list.txt > formated-unsorted-ip-list.p2p

Code:

 
content of convert2p2p.awk script
#!/bin/sh
#this is a gawk script     
#this script strips first two lines from the file since they are not IP addresses, then adds CBLSPAMMER and 0 in front of IPs that are not in 000.000.000.000 format, finaly - it creates an array for each ip address in the unformated IP list file. (this is my problem, I want it to sort em, and aggregate the ranges, explained as you read onwards)
 
BEGIN {FS="\\.*"}
NR > 2 {printf "CBLSPAMMER:%03d.%03d.%03d.%03d-%03d.%03d.%03d.%03d\n" ,$1,$2,$3,$4,$1,$2,$3,$4}

|
|
V
output of "formated-unsorted-ip-list.p2p" file would look like this:

Code:

EXAMPLE:001.001.001.001-001.001.001.001
EXAMPLE:002.002.002.002-002.002.002.002
EXAMPLE:001.001.001.002-001.001.001.002
EXAMPLE:001.001.001.003-001.001.001.003
EXAMPLE:003.003.003.003-003.003.003.003
EXAMPLE:004.004.004.004-004.004.004.004

Explanation: IP addresses must be formated this way cause of sorting issues. The script generates the file size from original input file of some ~100MB to result file of around ~300MB. Yea, that's a HUGE BUNCH (9+ million ip addresses) of spammers and malware sources...

This step takes around a minute on my puter. Gawk is REALLY really fast - even when running under windows unxtools

STEP 2:
Sorting of the IP addresses

You will notice that the output of the previous step contains an array for each IP address, but that arrays are not sorted. This step does that by using cmsort or sort utility.

so, lets give example with cmsort - sorts 400MB file in under 25 mins.
script is called:

Code:

cmsort /Q /B /T=d:\temp formated-unsorted-ip-list.p2p formated-sorted-ip-list.p2p

the output file "formated-sorted-ip-list.p2p" would look like this:

Code:

EXAMPLE:001.001.001.001-001.001.001.001
EXAMPLE:001.001.001.002-001.001.001.002
EXAMPLE:001.001.001.003-001.001.001.003
EXAMPLE:002.002.002.002-002.002.002.002
EXAMPLE:003.003.003.003-003.003.003.003
EXAMPLE:004.004.004.004-004.004.004.004

Dum ta dum... and this is as far as I've managed.
The problem is in STEP 3

**** PROBLEM ****
HOW to make an gawk script that would check the adjenced IP addresses and group them into ranges, thus cutting down the number of arrays and significantly reducing the file size.

*****************

STEP 3:

Aggregating the formated, sorted IP list into arrays consisting of adjenced IP addresses.

calling the script that I need your help to create

Code:

 
gawk -f theuberscript.awk formated-sorted-ip-list.p2p > formated-sorted-AGGREGATED-ip-list.p2p

output of that script should convert the input from formated-sorted-ip-list.p2p into this:

Code:

EXAMPLE:001.001.001.001-001.001.001.003       (putting three arrays into one line - sure cuts down the filesize, right? hehe)
EXAMPLE:002.002.002.002-002.002.002.002
EXAMPLE:003.003.003.003-003.003.003.003
EXAMPLE:004.004.004.004-004.004.004.004

I don't mind if the STEP 1 and STEP 2 are placed in a single line... but somehow I think it would increase the time it will take to produce the sorted-p2p-formated output. Currently it takes around 25 minutes for 9+ milion arrays consisting of a single ip address.

I even tried doing this with my script:

Code:

 
BEGIN {FS="\\.*"}
NR > 2 {printf "CBLSPAMMER:%03d.%03d.%03d.1-%03d.%03d.%03d.254\r\n" ,$1,$2,$3,$1,$2,$3}

....but that unfortunately blocked helluvalot of IP addresses that were definitely NOT spammers or malware spreaders, so I can't use that method

In case you were wondering why I need this particular script... its cause I need it for protowall or peerguardian ... I am sure some of you use that software for torrents (legal ofc

), but I need it to block spammers...

I hate the number of DNS queries that go from my server when checking if the inbound mail sender's ip address is in their dnsbls list.
And for curiosity sake =) wonder what happens if I load it with more then a handful of IPs to block

So, what do you say folks, can someone help me with this script? =)
Hope its not as complicated as I've presented it har har!

Best regards,

Matt

gustisok

View Public Profile for gustisok

Find all posts by gustisok

07-13-2009

Registered User

1,305, 26

Join Date: Jun 2007

Last Activity: 11 November 2016, 3:44 AM EST

Location: Beijing China

Posts: 1,305

Thanks Given: 0

Thanked 26 Times in 26 Posts

Code:

use strict;
my %hash;
while(<DATA>){
	chomp;
	my @tmp=split("[.]",$_);
	map {$_=sprintf("%03d",$_)} @tmp;
	my $key = sprintf("%s.%s.%s",$tmp[0],$tmp[1],$tmp[2]);
	push @{$hash{$key}}, $tmp[3];
}
foreach my $key (sort keys %hash){
	my @tmp=@{$hash{$key}};
	@tmp=sort @tmp;
	print $key,".",$tmp[0],"--",$key,".",$tmp[$#tmp],"\n";
}
__DATA__
1.1.1.1
2.2.2.2
1.1.1.2
4.4.4.3
1.1.1.3
3.3.3.3
4.4.4.4
2.2.2.3

summer_cherry

View Public Profile for summer_cherry

Find all posts by summer_cherry

07-14-2009

Registered User

1,801, 116

Join Date: Oct 2003

Last Activity: 15 May 2015, 11:55 AM EDT

Location: 54.23, -4.53

Posts: 1,801

Thanks Given: 1

Thanked 116 Times in 101 Posts

Try...

Code:

BEGIN {
   FS = "."
}
NF == 4 {
   Key = sprintf("%03d.%03d.%03d.%03d", $1, $2, $3, $4)
   Arr[Key] = Key
}
END {
   n = asorti(Arr)
   for (Idx = 1; Idx <= n; Idx++) {
      Curr = substr(Arr[Idx], 1, 11) (substr(Arr[Idx], 13, 3) + 0)
      Prev = substr(Arr[Idx-1], 1, 11) (substr(Arr[Idx-1], 13, 3) + 1)
      Next = substr(Arr[Idx+1], 1, 11) (substr(Arr[Idx+1], 13, 3) - 1)
      if (Curr != Prev && Curr != Next ) {
          print Arr[Idx] "-" Arr[Idx]
      } else if (Curr != Prev && Curr == Next ) {
          printf Arr[Idx] "-"
      } else if (Curr == Prev && Curr != Next ) {
          print Arr[Idx]
      }
   }
}

Tested...

Code:

$ cat file1
#TEXTLINE1
#TEXTLINE2
5.5.5.53
5.5.5.54
5.5.5.55
1.1.1.1
2.2.2.2
1.1.1.2
1.1.1.3
3.3.3.3
4.4.4.4
5.5.5.5
5.5.5.51
5.5.5.52

$ gawk -f a1.awk file1 > file2

$ cat file2
001.001.001.001-001.001.001.003
002.002.002.002-002.002.002.002
003.003.003.003-003.003.003.003
004.004.004.004-004.004.004.004
005.005.005.005-005.005.005.005
005.005.005.051-005.005.005.055

$

Last edited by Ygor; 07-14-2009 at 03:12 AM..

Ygor

View Public Profile for Ygor

Find all posts by Ygor

07-14-2009

Registered User

4, 0

Join Date: Jul 2009

Last Activity: 21 September 2009, 10:50 AM EDT

Location: Zageb, Croatia

Posts: 4

Thanks Given: 0

Thanked 0 Times in 0 Posts

Howdy Ygor and summer_cherry

Thanks for trying to help!

Okie, lets see now:

When I tried Ygor's script:

Code:

gawk -f test.awk cbl-block-list.txt > cbl-block-list.p2p

After a few minutes my puter froze and bam... this error popped up.

Code:

gawk: test.awk:5: (FILENAME=cbl-block-list.txt FNR=3102944) fatal: format_tree:
obuf: can't allocate memory (Not enough space)

I believe I didn't mention the hardware specs of my PC and the OS:
3GB (thats 2x1GB and 2x512MB, not 4 GB on a 32bit OS), 2x dual core Intel proc (that's dual proc mobo), 4 HDDs of which two are in mirror and two are striped for better filesystem performance.

I still believe the prog should've had enough free mem to perform... but as I mentioned before - the file I need to run aggregation on is around 300-400MB in size (after being converted from initial 100-150MB of pure unique IPs) - that's not a small chunk =) - so it might have attempted a huge allocation of memory as it worked its way through array creations?
If you want, I can provide you the file via some file-exchange service so you can test it on or something?

The OS is WinXP 32bit since that's where protowall is located, but scripts are all running via unxutils.

I tried running the script on a 6MB file containing just IP addresses:

Code:

gawk -f test.awk small-chunk.txt > small-chunk.p2p
gawk: test.awk:9: (FILENAME=small-chunk.txt FNR=443897) fatal: function `asorti'
 not defined

Will take a peek and see whats with the asorti function.

When I tried to run summer_cherry's script:

Code:

 
gawk -f test2.awk cbl-block-list.txt > cbl-block-list.p2p
gawk: test2.awk:3: while(<DATA>){
gawk: test2.awk:3: ^ parse error
gawk: test2.awk:3: while(<DATA>){
gawk: test2.awk:3:              ^ parse error
gawk: test2.awk:5:      my @tmp=split("[.]",$_);
gawk: test2.awk:5:         ^ invalid char '@' in expression

Oopsie... did I just do that?

Guess my gawk doesn't like it

Perhaps my perl would like it better?

I am running gawk via unxtools and not on a native unix so environment, so that might be a slight problem in some cases.

Lets see if we (well - you is more correct

) can make these scripts work. Har Har

I still gotta properly figure out the fine magic behind those lines.

And again - thanks a bunch for helping me work this one out!

Cheers,

Matt

PS. My appologies on broken English .)

Last edited by gustisok; 07-14-2009 at 04:56 AM..

gustisok

View Public Profile for gustisok

Find all posts by gustisok

Shell Programming and Scripting

Help with gawk script that aggregates ip address ranges

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Gawk program not working in a script

Discussion started by: RUCerius

2. Shell Programming and Scripting

[BASH] Gawk + MYSQL script

Discussion started by: Zimny

3. Shell Programming and Scripting

Search IP Address in list of ranges -- not working great

Discussion started by: tsunami4u

4. Shell Programming and Scripting

Help with gawk array, loop in tcsh script

Discussion started by: theflamingmoe

5. Windows & DOS: Issues & Discussions

Gawk Script in Windows batch file - Help

Discussion started by: 10000springs

6. Programming

need help with gawk script

Discussion started by: gengar

7. Shell Programming and Scripting

Issues with filtering duplicate records using gawk script

Discussion started by: nmumbarkar

8. Shell Programming and Scripting

gawk script

Discussion started by: zoo591

9. UNIX for Dummies Questions & Answers

Calculating field using AWK, or GAWK script

Discussion started by: Trellot

10. Shell Programming and Scripting

how to use variables and gawk in a script?

Discussion started by: pau