Counting entries in a file

02-23-2012
I think I understand now! Here's the code with changes, and I found the performance bug. I tested with 1.5 million records, which took just under 4 seconds on my small laptop. Rounding up to 4 seconds, the 35 million records should take just under 100 seconds -- a few seconds faster than the 12 hours (and lots of memory) that the bug was causing it to take.
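For anyone checking the estimate, here is a small sketch of the arithmetic. It assumes run time scales linearly with the number of records, which is only an approximation; the function name `extrapolate_seconds` is made up for this example.

```shell
# Hypothetical helper: scale a measured run time to a larger record count.
# Assumes linear scaling (seconds per record stays constant).
#   $1 = records tested, $2 = seconds measured, $3 = records in the real file
extrapolate_seconds() {
    awk -v tested="$1" -v secs="$2" -v target="$3" \
        'BEGIN { printf( "%.1f\n", secs * target / tested ) }'
}

extrapolate_seconds 1500000 4 35000000     # prints 93.3
```

That is where the "just under 100 seconds" figure comes from.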

# 4 columns of output:
#   1 - interval or bin number
#   2 - total observations during the interval
#   3 - unique observations during the interval
#   4 - new observations during long interval or new first time observations
#   if lbin_size is 0, then the 4th column contains
#   the number of ip addresses that were observed
#   for the very first time.
# Reads "<time> <address>" records from standard input.
awk  -v lbin_size="${2:-10}" -v bin_size="${1:-1}" '
    function dump( )
    {
        if( lbin_size > 1 )                 # long interval turned on
        {
            if( ++lidx >= lbin_size )           # spot for next list; roll if needed
            {
                lidx = 0;
                lwrap = 1;
            }

            if( !lwrap )
                printf( "%5d %5d %5d %5s\n", bin+1, total, length( unique ), "  -" );  # no value until we have wrapped round
            else
            {                                   # compute the unique addresses in the long interval
                delete lunique;                 # start from an empty set each time
                for( l in llist )               # go through each long interval list to weed out duplicates
                {
                    n = split( llist[l], a, " " );
                    for( i = 1; i <= n; i++ )
                        lunique[a[i]] = 1;                  # get unique set across the long interval
                }

                ltotal = 0;
                for( u in lunique )
                    ltotal++;                   # finally total the unique addresses seen in long interval

                printf( "%5d %5d %5d %5d\n", bin+1, total, length( unique ), ltotal );
            }

            llist[lidx] = "";               # clear the next list
        }
        else                                # long interval off, show stats and since beginning of time value in col 4
            printf( "%5d %5d %5d %6d\n", bin+1, total, length( unique ), brand_new );

        bin++;                              # move on to the next interval number
        brand_new = 0;                      # reset
    }

    BEGIN {
        # comment next four lines out if no header is needed
        if( lbin_size > 0 )
            printf( "%5s %5s %5s %5s\n", "INT", "TOT", "NEW-S", "NEW-L" );
        else
            printf( "%5s %5s %5s %6s\n", "INT", "TOT", "NEW-S", "1ST-TM" );
        lwrap = 0;
        lidx = 0;
    }

    {
        if( $1+0 >= next_bin )              # short interval expires
        {
            if( NR > 1 )
            {
                dump( );                        # write a line of data
                delete unique;
            }

            next_bin = $1 + bin_size;       # set new expiry time
            total = 0;
        }

        total++;                            # count every observation in the short interval

        if( lbin_size > 0 )
            llist[lidx] = llist[lidx] $2 " ";   # add this to list of addresses for the long interval

        if( !seen[$2]++ )                   # never seen at all
            brand_new++;                    # never seen before count

        unique[$2] = 1;                     # track unique addrs in the short interval
    }

    END {
        if( total )
            dump( );
    }
'

The script still wants a 0 as the second argument (the long interval size) to show the number of addresses seen for the first time over all of the input, rather than the number seen over the longer interval. The number of unique addresses (col 3) is the number observed in the current interval, without regard to the previous interval.
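To make the "first time" mode easier to follow on its own, here is a stripped-down sketch of just that part. It is not the full script: it assumes input lines of the form `<time> <address>` sorted by time, it prints unpadded columns, and the names `first_seen_per_bin` and `emit` are invented for the example.

```shell
# Hypothetical helper: per short interval, print
#   interval-number  total  unique-in-interval  seen-for-the-very-first-time
# Assumes "<time> <address>" records on stdin, sorted by time.
first_seen_per_bin() {
    awk -v bin_size="${1:-1}" '
        function emit( ) {
            printf( "%d %d %d %d\n", ++bin, total, nuniq, brand_new );
            total = nuniq = brand_new = 0;
            split( "", unique );            # portable way to empty an array
        }
        NR == 1 { next_bin = $1 + bin_size }
        $1+0 >= next_bin {                  # short interval expired
            emit( );
            next_bin = $1 + bin_size;
        }
        {
            total++;
            if( !seen[$2]++ )               # never seen in any interval
                brand_new++;
            if( !unique[$2]++ )             # not yet seen in this interval
                nuniq++;
        }
        END { if( total ) emit( ); }
    '
}
```

For example, `printf '0 a\n0 b\n0 a\n1 a\n1 c\n2 b\n' | first_seen_per_bin 1` gives three lines: `1 3 2 2`, `2 2 2 1` and `3 1 1 0`. Address `a` is unique in interval 2 but not new, because `seen` is never cleared while `unique` is emptied at every interval boundary.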

Have fun and let me know how it goes!

EDIT: It did just occur to me that my performance tests were writing output to /dev/null, so your times might be longer given that it will need to do real I/O to write the results someplace. Still, should be better than 12 hours.

02-24-2012
Hey! That worked and took no time to compute. Doing some more testing, but I think it's fine. Thanks! Just a quick question: if the user inputs are 2 and 10 (not 0), will it still use 2*10 as the history period for comparing and calculating the values for column 1?
02-24-2012
Yes, that function is unchanged, but it writes the result in column 4, not column 1; column 1 is always the bin (interval) number. I assume that 1 was a typo.
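As a rough illustration of what "2*10 as the history period" means, here is a minimal sketch that keeps one address list per short interval in a ring of `nbins` slots and counts the distinct addresses across the whole ring. It is a simplification, not the script above: it assumes `<time> <address>` input sorted by time, and it reports a count from the first interval instead of printing `-` until the ring has wrapped. The names `long_window_unique`, `ring` and `emit` are made up for the example.

```shell
# Hypothetical helper: distinct addresses over the last `nbins` short intervals.
#   $1 = short interval size (default 2), $2 = number of intervals kept (default 10)
# Assumes "<time> <address>" records on stdin, sorted by time.
long_window_unique() {
    awk -v bin_size="${1:-2}" -v nbins="${2:-10}" '
        function emit(    i, k, n, a, merged, count ) {
            count = 0;
            for( i = 0; i < nbins; i++ ) {      # merge every slot of the ring
                n = split( ring[i], a, " " );
                for( k = 1; k <= n; k++ )
                    if( !merged[a[k]]++ )       # count each address only once
                        count++;
            }
            printf( "%d %d\n", ++bin, count );
            idx = (idx + 1) % nbins;            # advance to the oldest slot
            ring[idx] = "";                     # and recycle it
        }
        BEGIN   { idx = 0 }
        NR == 1 { next_bin = $1 + bin_size }
        $1+0 >= next_bin { emit( ); next_bin = $1 + bin_size }
        { ring[idx] = ring[idx] $2 " " }        # remember address in current slot
        END { if( NR ) emit( ); }
    '
}
```

With arguments 2 and 10 the ring covers 10 intervals of 2 time units each, so the count looks back over 20 time units. A tiny run such as `printf '0 a\n1 a\n2 b\n3 b\n' | long_window_unique 1 2` prints `1 1`, `2 1`, `3 2`, `4 1`: the count drops back to 1 once the slot holding `a` has been recycled.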
02-24-2012
Yep, it was a typo, sorry! Thanks for your help. Will let you know if I have any further issues! Thanks again.
02-24-2012
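# use Getopt::Std;
# getopts('c:');
# my $tmp = $cnt/$opt_c;

# Assumed input: one record per line, key (e.g. time) in field 1, address in field 2.
my $cnt = 0;
my %hash;

while(<>){
	my @tmp = split;
	$cnt++ unless $hash{$tmp[0]};		# count each distinct key once
	$hash{$tmp[0]}->{$tmp[1]}++;		# observations per address under this key
}

my $tmp = 2;					# number of groups to split the keys into

my @keys =  sort {$a cmp $b}  keys %hash;

for(my $i=0;$i<=$tmp-1;$i++){
	my $count = 0;
	my %tmp_hash;
	my $total = $cnt/$tmp;			# keys per group
	for(my $j=0;$j<=$total-1;$j++){
		my $key = shift @keys;
		last unless defined $key;	# ran out of keys
		my $addrs = $hash{$key};
		foreach my $k (keys %{$addrs}){
			$count += $addrs->{$k};	# total observations in this group
			$tmp_hash{$k} = 1;	# remember each address once
		}
	}
	my @tt = keys %tmp_hash;
	my $distinct = $#tt+1 ;
	print $i+1," ",$count," ",$distinct,"\n";
}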

