Counting entries in a file

02-23-2012

I think I understand now! Here's the code with changes, and I found the performance bug. I tested with 1.5 million records, which took just under 4 seconds on my small laptop. Rounding up to 4 seconds, that works out to just under 100 seconds for the 35 million records -- a few seconds faster than the 12 hours (and lots of memory) that the bug was causing it to take.
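The estimate above is a plain linear extrapolation, so it assumes run time grows in proportion to the number of records. A minimal sketch of the arithmetic (`estimate_seconds` is a made-up helper name, not part of the script):

```shell
# Linear extrapolation of run time: sample_seconds * target_records / sample_records.
# estimate_seconds is a hypothetical helper used only for this illustration.
estimate_seconds() {
    awk -v sample="$1" -v secs="$2" -v target="$3" \
        'BEGIN { printf( "%.1f\n", secs * target / sample ) }'
}

estimate_seconds 1500000 4 35000000    # prints 93.3
```

Real timings will drift from this figure if memory use or I/O does not scale linearly.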

Code:

# 4 columns of output:
#   1 - interval or bin number
#   2 - total observations during the interval
#   3 - unique observations during the interval
#   4 - unique observations during the long interval, or first-time observations
#
#   if lbin_size is 0, then the 4th column contains
#   the number of ip addresses that were observed
#   for the very first time.
awk  -v lbin_size="${2:-10}" -v bin_size="${1:-1}" '
    function dump( )
    {
        if( lbin_size > 0 )                 # long interval turned on (same test as the header and llist code)
        {
            if( ++lidx >= lbin_size )           # spot for next list; roll if needed
            {
                lidx = 0;
                lwrap = 1;
            }


            if( !lwrap )
                printf( "%5d %5d %5d %5s\n", bin+1, total, length( unique ), "  -" );  # no value until we have wrapped round
            else
            {                                   # compute the unique addresses in the long interval
                delete lunique;                 # start fresh so addresses from expired bins do not linger
                for( l in llist )               # go through each long interval list to weed out duplicates
                {
                    n = split( llist[l], a, " " );
                    for( i = 1; i <= n; i++ )
                        lunique[a[i]] = 1;                  # get unique set across the long interval
                }

                ltotal = 0;
                for( u in lunique )
                    ltotal++;                   # finally total the unique addresses seen in long interval

                printf( "%5d %5d %5d %5d\n", bin+1, total, length( unique ), ltotal );
            }

            llist[lidx] = "";               # clear the next list
        }
        else                                # long interval off, show stats and since beginning of time value in col 4
            printf( "%5d %5d %5d %6d\n", bin+1, total, length( unique ), brand_new );

        brand_new = 0;                      # reset
        bin++;
    }

    BEGIN {
        # comment next line out if no header is needed
        if( lbin_size > 0 )
            printf( "%5s %5s %5s %5s\n", "INT", "TOT", "NEW-S", "NEW-L" );
        else
            printf( "%5s %5s %5s %6s\n", "INT", "TOT", "NEW-S", "1ST-TM" );
        lwrap = 0;
        lidx = 0;
    }

    {
        if( $1+0 >= next_bin )              # short interval expires
        {
            if( NR > 1 )
            {
                dump( );                        # write a line of data
                delete unique;
            }

            next_bin = $1 + bin_size;       # set new expiry time
            total = 0;
        }

        if( lbin_size > 0 )
            llist[lidx] = llist[lidx] $2 " ";   # add this to list of addresses for the long interval

        if( !seen[$2]++ )                   # never seen at all
            brand_new++;                    # never seen before count

        unique[$2] = 1;                     # track unique addrs in the short interval
        total++;
    }
    END {
        if( total )
            dump( );
    }
'

The script still wants a 0 as the second argument (the long interval size) to show the number of addresses seen for the first time (over all of the input) rather than over the longer interval. The number of unique addresses (col 3) is the number observed in the current interval, without regard to the previous interval.
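To make the difference between col 3 and the first-time count concrete, here is a stripped-down sketch. It is not the script above: it assumes one bin per distinct timestamp, has no long interval, and the names `interval_counts` and `flush_bin` are invented for the example.

```shell
# Simplified sketch: one bin per distinct timestamp, no long interval.
# Output columns: bin, total, unique in this bin, seen for the first time ever.
interval_counts() {
    awk '
        function flush_bin() {
            printf( "%d %d %d %d\n", ++bin, total, uniq, first );
            delete inbin;                   # forget per-bin state only
            total = uniq = first = 0;
        }
        NR > 1 && $1 != ts { flush_bin(); }
        {
            ts = $1;
            total++;
            if( !inbin[$2]++ ) uniq++;      # first sighting within this bin
            if( !seen[$2]++ )  first++;     # first sighting over all input
        }
        END { if( total ) flush_bin(); }
    '
}

interval_counts <<'EOF'
899726403 101.226.38.0
899726403 101.226.38.0
899726403 144.12.40.0
899726404 144.12.40.0
899726404 1.14.4.0
EOF
# prints:
# 1 3 2 2
# 2 2 2 1
```

In the second bin 144.12.40.0 still counts as unique for that bin, but not as a first-time address, because it already appeared in the first bin.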

Have fun and let me know how it goes!

EDIT: It did just occur to me that my performance tests were writing output to /dev/null, so your times might be longer given that it will need to do real I/O to write the results someplace. Still, should be better than 12 hours.

Last edited by agama; 02-23-2012 at 10:31 PM.. Reason: Additional thought

agama

02-23-2012

Hey! that worked and took no time to compute

Doing some more testing, but I think it's fine. Thanks! Just a quick question: if the user inputs are 2 and 10 (not 0), will it still use 2*10 as the history period for comparing and calculating the values for column 1?

sajal.bhatia

02-23-2012

Quote:

Originally Posted by sajal.bhatia

Hey! that worked and took no time to compute

Yes, that function is unchanged, but it writes the result in column 4, not 1 as column 1 is always the bin (interval) number -- I assume that 1 was a typo.
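With inputs 2 and 10 the history covers the last 10 short bins of 2 time units each, so 2*10 = 20 time units. A rough sketch of that rolling window, assuming the input is already binned (one line per short bin, addresses separated by spaces); `window_unique` is a made-up name and this is a simplification, not the function from the script:

```shell
# Sketch of the rolling history: each input line is one short bin (a
# space-separated list of addresses); the window covers the last n bins.
window_unique() {
    awk -v n="${1:-10}" '
        {
            slot[NR % n] = $0;                      # overwrite the oldest bin
            if( NR < n ) { print NR, "-"; next; }   # window not full yet

            delete uniq;
            count = 0;
            for( s in slot )
            {
                m = split( slot[s], a, " " );
                for( i = 1; i <= m; i++ )
                    if( !uniq[a[i]]++ )
                        count++;                    # distinct addresses in the window
            }
            print NR, count;
        }
    '
}

printf '%s\n' 'a b' 'b c' 'c' 'd' | window_unique 2
# prints:
# 1 -
# 2 3
# 3 2
# 4 2
```

Each output line is the bin number followed by the number of distinct addresses across the window, with "-" until the window has filled.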

agama

02-23-2012

Yep it was a typo, sorry! Thanks for your help. Will let you know if I have any further issues! Thanks again

sajal.bhatia

02-24-2012

Code:

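# use Getopt::Std;
# getopts('c:');  
# my $tmp = $cnt/$opt_c;

use strict;
use warnings;

my %hash;       # timestamp => { address => number of observations }
my $cnt = 0;    # number of distinct timestamps

while(<DATA>){
	chomp;
	my @tmp = split;
	$cnt++ unless $hash{$tmp[0]};
	$hash{$tmp[0]}->{$tmp[1]}++;
}

my $tmp = 2;    # number of groups to split the timestamps into

my @keys = sort {$a <=> $b} keys %hash;    # timestamps in numeric order

for(my $i=0;$i<=$tmp-1;$i++){

	my $count = 0;      # total observations in this group
	my %tmp_hash;       # distinct addresses in this group

	my $total = $cnt/$tmp;    # timestamps per group

	for(my $j=0;$j<=$total-1;$j++){
		my $key = shift @keys;
		my $addrs = $hash{$key};    # renamed from $a, which sort reserves
		foreach my $k (keys %{$addrs}){
			$count += $addrs->{$k};
			$tmp_hash{$k}=1;
		}
	}

	my $distinct = keys %tmp_hash;    # key count (hash in scalar context)

	# group number, total observations, distinct addresses
	print $i+1," ",$count," ",$distinct,"\n";

}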

__DATA__
899726401 112.254.1.0
899726401 112.254.1.0
899726402 154.162.38.0
899726402 160.114.12.0
899726402 165.161.7.0
899726403 101.226.38.0
899726403 101.226.38.0
899726403 101.226.38.0
899726403 73.214.29.0
899726403 144.12.40.0
899726404 144.12.40.0
899726404 1.14.4.0

Last edited by Franklin52; 02-24-2012 at 06:47 AM.. Reason: Code tags

summer_cherry
