Post 302333610 by gustisok on Monday 13th of July 2009 02:00:04 PM
Help with gawk script that aggregates ip address ranges

Howdy folks,

Perhaps someone can help me with this problem. My knowledge of awk is not the best... I've managed to get part of the way, and now I'm stuck.

These are the steps and their output formats; the problem itself is described after STEP 2:


STEP 1:
Unformatted text file (100+ MB) containing IP addresses:
"unformated-unsorted-IP-list.txt"
Code:
#TEXTLINE1
#TEXTLINE2
1.1.1.1
2.2.2.2
1.1.1.2
1.1.1.3
3.3.3.3
4.4.4.4

|
|
V
The script is called like this:
Code:
gawk -f convert2p2p.awk unformated-unsorted-IP-list.txt > formated-unsorted-ip-list.p2p

Contents of the convert2p2p.awk script:
Code:
# This is a gawk script; run it with "gawk -f convert2p2p.awk <input file>".
# It skips the first two lines of the file, since they are not IP addresses,
# then zero-pads every octet to the 000.000.000.000 format, prefixes the
# label CBLSPAMMER, and writes each IP as a single-address range (the start
# and the end of the range are the same IP).
# (This is my problem: I want the result sorted and the ranges aggregated,
# as explained further down.)

BEGIN { FS = "\\." }   # split each line on literal dots
NR > 2 {printf "CBLSPAMMER:%03d.%03d.%03d.%03d-%03d.%03d.%03d.%03d\n" ,$1,$2,$3,$4,$1,$2,$3,$4}

|
|
V
The output file "formated-unsorted-ip-list.p2p" would look like this (the samples show the label EXAMPLE where the script above writes CBLSPAMMER):
Code:
EXAMPLE:001.001.001.001-001.001.001.001
EXAMPLE:002.002.002.002-002.002.002.002
EXAMPLE:001.001.001.002-001.001.001.002
EXAMPLE:001.001.001.003-001.001.001.003
EXAMPLE:003.003.003.003-003.003.003.003
EXAMPLE:004.004.004.004-004.004.004.004

Explanation: the IP addresses must be zero-padded this way so that a plain text sort puts them in numeric order. The script turns the original input file of some ~100MB into a result file of around ~300MB. Yeah, that's a HUGE BUNCH (9+ million IP addresses) of spammers and malware sources...
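
To illustrate the sorting issue, here is a small sketch comparing a byte-wise sort of unpadded and zero-padded addresses. The helper name `pad_ips` and the three sample addresses are made up for this example, and plain `awk` is used so it runs with any awk (gawk works the same way).

```shell
#!/bin/sh
# pad_ips: hypothetical helper that zero-pads each octet to three digits.
pad_ips() {
    awk -F '[.]' '{ printf "%03d.%03d.%03d.%03d\n", $1, $2, $3, $4 }'
}

# Made-up sample addresses, one per line.
ips='10.0.0.1
9.0.0.1
100.0.0.1'

# Byte-wise sort of the unpadded addresses: 9.0.0.1 ends up last,
# because the character "9" sorts after "1".
printf '%s\n' "$ips" | LC_ALL=C sort

# The same sort after padding: byte order now matches numeric order
# (009..., 010..., 100...).
printf '%s\n' "$ips" | pad_ips | LC_ALL=C sort
```

With padding, every octet has the same width, so comparing the lines character by character gives the same order as comparing the addresses as numbers.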

This step takes around a minute on my computer. Gawk is REALLY really fast - even when running under windows unxtools.


STEP 2:
Sorting the IP addresses

You will notice that the output of the previous step contains one single-address range for each IP address, but those ranges are not sorted. This step sorts them with the cmsort or sort utility.

So, let's take cmsort as the example - it sorts a 400MB file in under 25 mins.
It is called like this:
Code:
cmsort /Q /B /T=d:\temp formated-unsorted-ip-list.p2p formated-sorted-ip-list.p2p

the output file "formated-sorted-ip-list.p2p" would look like this:
Code:
EXAMPLE:001.001.001.001-001.001.001.001
EXAMPLE:001.001.001.002-001.001.001.002
EXAMPLE:001.001.001.003-001.001.001.003
EXAMPLE:002.002.002.002-002.002.002.002
EXAMPLE:003.003.003.003-003.003.003.003
EXAMPLE:004.004.004.004-004.004.004.004

Dum ta dum... and this is as far as I've managed.
The problem is in STEP 3.

**** PROBLEM ****
How do I write a gawk script that checks for adjacent IP addresses and groups them into ranges, thus cutting down the number of entries and significantly reducing the file size?

*****************

STEP 3:

Aggregating the formatted, sorted IP list into ranges of adjacent IP addresses.

Calling the script that I need your help to create:
Code:
 
gawk -f theuberscript.awk formated-sorted-ip-list.p2p > formated-sorted-AGGREGATED-ip-list.p2p

That script should convert the input from formated-sorted-ip-list.p2p into this (the first line puts three single-address entries into one range - sure cuts down the file size, right? hehe):
Code:
EXAMPLE:001.001.001.001-001.001.001.003
EXAMPLE:002.002.002.002-002.002.002.002
EXAMPLE:003.003.003.003-003.003.003.003
EXAMPLE:004.004.004.004-004.004.004.004
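
One possible sketch of the aggregation step, under a few assumptions: the input is already sorted, every line has the form LABEL:start-end, and the label contains no ":" or "-". The function name `aggregate_ranges` is made up for this example. Each address is turned into a single number, so two entries are adjacent when the next start is at most one above the current end, even across an octet boundary (x.x.1.255 followed by x.x.2.000). The awk program between the single quotes could also be saved on its own as theuberscript.awk and run with gawk -f.

```shell
#!/bin/sh
# aggregate_ranges: hypothetical name for the STEP 3 helper.
# Reads sorted LABEL:start-end lines (from files or stdin) and merges
# entries whose addresses are adjacent or overlapping.
aggregate_ranges() {
    awk '
    # Convert a dotted (optionally zero-padded) IPv4 address to one number.
    function ip2num(ip,    o) {
        split(ip, o, "[.]")
        return ((o[1] * 256 + o[2]) * 256 + o[3]) * 256 + o[4]
    }
    # Print the range collected so far, if there is one.
    function emit() {
        if (have) print label ":" first "-" last
    }
    BEGIN { FS = "[-:]" }            # fields: label, start IP, end IP
    { sub(/\r$/, "") }               # tolerate Windows line endings
    NF == 3 {
        s = ip2num($2); e = ip2num($3)
        if (have && $1 == label && s <= lastnum + 1) {
            # Touches or overlaps the current range: extend it if needed.
            if (e > lastnum) { last = $3; lastnum = e }
            next
        }
        emit()                       # gap found: finish the previous range
        label = $1; first = $2; last = $3; lastnum = e; have = 1
    }
    END { emit() }
    ' "$@"
}

# Demo with the six sorted sample lines from STEP 2.
printf '%s\n' \
    'EXAMPLE:001.001.001.001-001.001.001.001' \
    'EXAMPLE:001.001.001.002-001.001.001.002' \
    'EXAMPLE:001.001.001.003-001.001.001.003' \
    'EXAMPLE:002.002.002.002-002.002.002.002' \
    'EXAMPLE:003.003.003.003-003.003.003.003' \
    'EXAMPLE:004.004.004.004-004.004.004.004' | aggregate_ranges
# prints:
# EXAMPLE:001.001.001.001-001.001.001.003
# EXAMPLE:002.002.002.002-002.002.002.002
# EXAMPLE:003.003.003.003-003.003.003.003
# EXAMPLE:004.004.004.004-004.004.004.004
```

The sketch keeps only the current range in memory, so it makes a single pass over the file regardless of its size. Duplicate lines are absorbed into the current range rather than printed twice.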

I don't mind if STEP 1 and STEP 2 are combined into a single command line... but somehow I think it would increase the time it takes to produce the sorted, p2p-formatted output. Currently it takes around 25 minutes for 9+ million entries, each consisting of a single IP address.
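
For reference, combining STEP 1 and STEP 2 could look like the sketch below, which pipes the conversion straight into a sort so the unsorted intermediate file is never written. It assumes a Unix-style `sort` is available (not cmsort), and the function name `convert_and_sort` is made up for this example. Whether this is faster or slower than the two separate steps would have to be measured.

```shell
#!/bin/sh
# convert_and_sort: hypothetical helper combining STEP 1 and STEP 2.
# $1 = raw list (two header lines, then one IP per line).
# Writes the sorted single-address ranges to standard output.
convert_and_sort() {
    awk -F '[.]' 'NR > 2 {
        printf "CBLSPAMMER:%03d.%03d.%03d.%03d-%03d.%03d.%03d.%03d\n",
               $1, $2, $3, $4, $1, $2, $3, $4
    }' "$1" | LC_ALL=C sort
}

# Usage:
#   convert_and_sort unformated-unsorted-IP-list.txt > formated-sorted-ip-list.p2p
```

LC_ALL=C makes sort compare raw bytes, which is enough here because every line has the same label and zero-padded octets.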

I even tried widening every entry to the whole x.x.x.1 - x.x.x.254 range with this script:

Code:

BEGIN { FS = "\\." }   # split each line on literal dots
NR > 2 {printf "CBLSPAMMER:%03d.%03d.%03d.1-%03d.%03d.%03d.254\r\n" ,$1,$2,$3,$1,$2,$3}

...but that unfortunately blocked a helluva lot of IP addresses that were definitely NOT spammers or malware spreaders, so I can't use that method.

In case you were wondering why I need this particular script... it's because I need it for ProtoWall or PeerGuardian. I am sure some of you use that software for torrents (legal ofc), but I need it to block spammers. I hate the number of DNS queries that go out from my server when it checks whether an inbound mail sender's IP address is on the DNSBL lists.
And for curiosity's sake =) I wonder what happens if I load it with more than a handful of IPs to block.

So, what do you say folks, can someone help me with this script? =)
Hope it's not as complicated as I've presented it, har har!

Best regards,

Matt
 
