Counting entries in a file

10-06-2011

It's the no. of new IPs in the current interval compared to the previous interval.

sajal.bhatia

10-06-2011

Cheers.

Code:

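awk -v s="$interval" '
    NR == 1 { min = $1 }
    {
        NoP[$1]++                       # packets per second
        UnIP[$1 FS $2]++                # (second, IP) pairs seen
        IP[$2]                          # set of all IPs in the file
        min = min > $1 ? $1 : min
        max = max > $1 ? max : $1
        byte[$1] += $3                  # bytes per second
    }
    END {
        for (i = min; i <= max; i += s) {
            for (b = i; b < i + s; b++) {
                t += NoP[b]; u += byte[b]
                for (j in IP)
                    if ((b FS j) in UnIP) {         # "in" avoids creating empty entries
                        x[j]                        # unique IPs in this interval
                        v = (z[j]++) ? v : v + 1    # IPs seen for the first time in the file
                    }
            }
            # interval number, packets, unique IPs, new IPs, bytes
            print ++e, t, length(x), v, u
            t = 0; u = 0; v = 0; delete x
        }
    }' infile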

Code:

interval=1

1 2 1 1 30
2 3 3 2 130
3 5 3 2 120
4 2 2 1 40

Code:

interval=2

1 5 3 3 160
2 7 4 3 160

rdcwayx

10-06-2011

Thanks! I am trying to optimize it, as the input files are pretty big (~2 GB).

Cheers,

sajal.bhatia

02-15-2012

Hi!

I would like to add another user input parameter to the script Agama posted on 08-10-11: a history period (ranging from no history at all to everything up to the current sampling interval), to be used when calculating the last column, i.e. new IPs.

Can someone help with this?

Cheers,

sajal.bhatia

02-22-2012

If I understand correctly, the awk needs just a couple of changes to allow the start time and end time (both in epoch seconds) to be passed into the script as parameters 2 and 3. The changes are the startt and endt variables on the awk command line and the two pattern rules that use them:

Code:

#!/usr/bin/env ksh

awk -v startt="${2:-0}" -v endt="${3:-9000000000}" -v bin_size="${1:-5}" '

    function dump( )
    {
        if( total == 0 )        # nothing binned yet; NR == 1 misses this when startt skips leading records
            return;

        new_count = 0;
        for( u in unique )              # compute total in this bin that were not in last bin
            if( last_bin[u] == 0 )
                new_count++;

        printf( "%3d %3d %3d\n", bin+1, total, new_count );
        bin++;
    }

    
    $1 < startt { next; }
    $1 > endt { exit( 0 ); }
    

    {
        if( $1+0 >= next_bin )
        {
            dump( );
            next_bin = $1 + bin_size;

            delete last_bin;
            for( u in unique )              # copy hits from this bin
                last_bin[u] = 1;
            delete unique;
            total = 0;
        }

        unique[$2]++
        total++;
    }
    END {
        if( total )
            dump( );
    }
'
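
To see the two filter rules on their own, here is a minimal sketch run against a few made-up records. The function name filter_window, the timestamps and the addresses are invented for illustration, and the input is assumed to be sorted by time, which is what makes the early exit safe.

```shell
# Hypothetical helper: keep only records whose timestamp (field 1) lies in
# [start, end]. Assumes the input is sorted by time, so exiting early is safe.
filter_window() {
    awk -v startt="$1" -v endt="$2" '
        $1 < startt { next }    # before the window: skip
        $1 > endt   { exit }    # past the window: stop reading
        { print }
    '
}

# Made-up records: epoch seconds, then an address.
printf '%s\n' '100 10.0.0.1' '101 10.0.0.2' '102 10.0.0.3' '103 10.0.0.4' |
    filter_window 101 102
```

With a start of 101 and an end of 102, only the two middle records are printed. On a large file the exit rule matters most, since awk stops reading as soon as the first record past the end time shows up.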

agama

02-22-2012

Hey!

Thanks for the reply! I realize I wasn't clear enough; my apologies for the confusion. I will try to explain the problem again.

If you have a look at posts #2 and #9 of the thread, the scripts take one user input, the interval (in seconds), and return four things: the interval number (bin+1), the total no. of packets in that interval (total), the no. of unique IPs in that interval (length(unique)) and the no. of new IPs compared to the previous interval (new_count). Now I am looking to add another user parameter, history_period (in seconds), which should be used to evaluate the last output (no. of new IPs) by comparing against the "history_period" window and NOT against the "immediately previous interval" as it currently does.

For example, if the user gives 1 (interval) and 10 (history_period) as the inputs, the script should return values for every 1 second, BUT for calculating the "no. of new IPs" it should use the previous 10 seconds as the history (or comparison) period. So essentially the 4th column would be empty for the first 10 seconds (or, in general, until the history has been formed), and from then on the history would be a "moving" window (the last 10 seconds in this example).

I hope I was clearer this time. Looking for a solution.

Thanks,

sajal.bhatia

02-22-2012

Yep, I completely misunderstood!! The script below has the same function as before, with these additions:
1) A second parameter on the command line is interpreted as the long interval length. This length is given in multiples of the short interval, not in seconds, because the output cycle and the binning are both tied directly to the short interval. So, if the short interval is 2 seconds and the long interval is given as 5, the long interval covers 10 seconds.

2) Once the long interval has passed, a 4th column is printed. This column contains the number of unique addresses observed during the previous n short intervals, where n is the long interval value.

3) To better identify the data, I added a header line.

The output looks like this when run with a short interval of 2 seconds and a long interval of 4:

Code:

test_count 2 4
 INT   TOT NEW-S NEW-L
   1     4     4     -
   2     7     4     -
   3     6     5     -
   4     4     3    10
   5     1     1    10
   6     4     4    12
   7     7     4    13
   8     6     5    18
   9     4     3    19
  10     1     1    19

And to be sure I'm on track with your thinking, here is the dummy input I used, with comments showing how the long interval groups break up and which groups contribute to the count in the 4th column.

Code:

899726401 112.254.1.0       long interval group 1
899726402 154.162.38.0
899726402 160.114.12.0
899726402 165.161.7.0

899726403 101.226.38.0      long interval group 2
899726403 101.226.38.0   
899726403 101.226.38.0   
899726403 73.214.29.0
899726403 144.12.40.0
899726404 144.12.40.0    
899726404 1.14.4.0

899726405 112.254.1.0       long interval group 3
899726405 154.162.38.0   
899726405 160.114.12.0   
899726406 165.161.7.0    
899726406 101.226.38.0   
899726406 101.226.38.1

899726407 101.226.38.2      long interval group 4
899726407 73.214.29.0    
899726407 144.12.40.0    
899726408 144.12.40.0    
---------------------------- write 4th output line -- 10 unique addresses in groups 1-4

899726409 1.14.4.0          long interval group 5
---------------------------- write 5th output line -- 10 unique addresses in groups 2-5

899726411 112.254.1.4       long interval group 6
---------------------------- write 6th output line -- 12 unique addresses in groups 3-6

899726412 154.162.38.0      long interval group 7
899726412 160.114.12.r
899726412 165.161.7.0
---------------------------- write 7th output line -- 13 unique addresses in groups 4-7

899726413 101.226.38.0      long interval group 8
899726413 101.226.38.0
899726413 101.226.38.0
899726413 73.214.29.0
899726413 144.12.40.0
---------------------------- write 8th output line -- 18 unique addresses in groups 5-8

899726414 144.12.40.0
899726414 1.14.4.5

899726415 112.254.1.5
899726415 154.162.38.5
899726415 160.114.12.5

899726416 165.161.7.5
899726416 101.226.38.5
899726416 101.226.38.5
899726417 101.226.38.8
899726417 73.214.29.0
899726417 144.12.40.0
899726418 144.12.40.0
899726419 1.14.4.0

And finally the augmented script:

Code:

#!/usr/bin/env ksh

# 4 columns of output:
#   1 - short interval number
#   2 - total observations during the short interval
#   3 - unique addresses in the short interval that were not in the previous short interval
#   4 - unique addresses seen during the long interval (after first complete long interval)
awk  -v lbin_size="${2:-10}" -v bin_size="${1:-1}" '
    function dump( )
    {
        if( NR == 1 )
            return;

        new_count = 0;
        for( u in unique )              # compute total in this bin that were not in last bin
            if( last_bin[u] == 0 )
                new_count++;


        if( ++lidx >= lbin_size )           # spot for next list; roll if needed
        {
            lidx = 0;
            lwrap = 1;
        }


        if( !lwrap )
            printf( "%5d %5d %5d %5s\n", bin+1, total, new_count, "  -" );  # no value until we have wrapped round
        else
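        {                                   # compute the unique addresses in the long interval
            delete lunique;                 # start from an empty set; otherwise the count accumulates
                                            # over the whole run (as it does in the sample output above)
            for( l in llist )               # go through each long interval list to weed out duplicates
            {
                n = split( llist[l], a, " " );          # split returns the element count
                for( i = 1; i <= n; i++ )
                    lunique[a[i]] = 1;                  # get unique set across the long interval
            }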

            ltotal = 0;
            for( u in lunique )
                ltotal++;                   # finally total the unique addresses seen in long interval

            printf( "%5d %5d %5d %5d\n", bin+1, total, new_count, ltotal );
        }

        llist[lidx] = "";               # clear the next list

        bin++;
    }

    BEGIN {
        # comment next line out if no header is needed
        printf( "%5s %5s %5s %5s\n", "INT", "TOT", "NEW-S", "NEW-L" );
        lwrap = 0;
        lidx = 0;
    }

    {
        if( $1+0 >= next_bin )              # short interval expires
        {
            dump( );                        # write a line of data
            next_bin = $1 + bin_size;       # set new expiry time

            delete last_bin;
            for( u in unique )              # copy hits from this bin
                last_bin[u] = 1;
            delete unique;
            total = 0;
        }

        llist[lidx] = llist[lidx] $2 " ";   # add this to list of addresses for the long interval

        unique[$2] = 1;                     # track unique addrs in the short interval
        total++;
    }
    END {
        if( total )
            dump( );
    }
'

exit

Last edited by agama; 02-22-2012 at 09:02 PM.. Reason: comments
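
The fourth column above counts unique addresses over the long interval. For the other reading of the request, where an address in the current short interval counts as new only if it did not appear during the preceding history period, a minimal sketch might look like the following. It is not agama's script. The function name new_vs_history, the variables hist and hbins and the UNIQ/NEW column labels are made up for illustration. It assumes input sorted by time with the timestamp in field 1 and the address in field 2, bins aligned to the first timestamp, and a history period that is a multiple of the short interval (otherwise it is rounded down).

```shell
# Hypothetical helper, not the script from the thread.
#   $1 = short interval in seconds, $2 = history period in seconds
# Reads "epoch_seconds address" records, sorted by time, on stdin.
new_vs_history() {
    awk -v bin_size="${1:-1}" -v hist="${2:-10}" '
        function emit() {
            # the NEW column stays "-" until a full history window precedes this interval
            printf("%5d %5d %5d %5s\n", k + 1, total, uniq, (k >= hbins) ? newc : "-");
            split("", cur);                 # portable way to empty an array
            total = uniq = newc = 0;
        }
        BEGIN {
            hbins = int(hist / bin_size);   # history length in short intervals (rounded down)
            printf("%5s %5s %5s %5s\n", "INT", "TOT", "UNIQ", "NEW");
        }
        NR == 1 { t0 = $1; k = 0 }
        {
            n = int(($1 - t0) / bin_size);  # zero-based interval this record falls in
            if (n != k) { emit(); k = n }   # intervals with no records are not printed
            total++;
            if (!($2 in cur)) {             # first sighting of this address in this interval
                cur[$2] = 1; uniq++;
                if (!($2 in last) || k - last[$2] > hbins)
                    newc++;                 # not seen in any of the previous hbins intervals
            }
            last[$2] = k;                   # most recent interval in which the address appeared
        }
        END { if (NR) emit() }
    '
}

# Made-up records: interval of 1 second, history of 2 seconds.
printf '%s\n' '1 A' '1 B' '2 A' '3 C' '4 A' '5 B' | new_vs_history 1 2
```

For the made-up input this prints:

```
  INT   TOT  UNIQ   NEW
    1     2     2     -
    2     1     1     -
    3     1     1     1
    4     1     1     0
    5     1     1     1
```

Only the most recent interval number is kept per address, so memory grows with the number of distinct addresses rather than with the length of the history, which may help with inputs in the 2 GB range.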

agama
