Counting entries in a file


 
# 36  
Old 02-23-2012
I think I understand now! Here's the code with changes, and I found the performance bug. I tested with 1.5 million records, which took just under 4 seconds on my small laptop. Rounding up to 4 seconds, that should come to just under 100 seconds for the 35 million records, which is a few seconds faster than the 12 hours (and lots of memory) that the bug was causing it to take :)
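As a quick check of that estimate, here is the arithmetic as a one-liner. It is only a sketch and assumes run time grows linearly with the number of records, which the test run alone does not prove:

```shell
# Scale the measured time linearly: 4 s for 1.5 million records -> ? s for 35 million.
# Assumption: run time is proportional to the record count.
estimate=$(awk 'BEGIN { printf "%.1f", 35e6 / 1.5e6 * 4 }')
echo "$estimate seconds"
```

This prints 93.3 seconds, which matches the "just under 100 seconds" figure.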

Code:
# 4 columns of output:
#   1 - interval or bin number
#   2 - total observations during the interval
#   3 - unique observations during the interval
#   4 - new observations during long interval or new first time observations
#
#   if lbin_size is 0, then the 4th column contains
#   the number of ip addresses that were observed
#   for the very first time.
awk  -v lbin_size="${2:-10}" -v bin_size="${1:-1}" '
    function dump( )
    {
        if( lbin_size > 0 )                 # long interval turned on (same test as the header and the list building below)
        {
            if( ++lidx >= lbin_size )           # spot for next list; roll if needed
            {
                lidx = 0;
                lwrap = 1;
            }


            if( !lwrap )
                printf( "%5d %5d %5d %5s\n", bin+1, total, length( unique ), "  -" );  # no value until we have wrapped round
            else
            {                                   # compute the unique addresses in the long interval
                delete lunique;                 # forget addresses left over from earlier long intervals
                for( l in llist )               # go through each long interval list to weed out duplicates
                {
                    n = split( llist[l], a, " " );      # split returns the element count
                    for( i = 1; i <= n; i++ )
                        lunique[a[i]] = 1;                  # get unique set across the long interval
                }

                ltotal = 0;
                for( u in lunique )
                    ltotal++;                   # finally total the unique addresses seen in long interval

                printf( "%5d %5d %5d %5d\n", bin+1, total, length( unique ), ltotal );
            }

            llist[lidx] = "";               # clear the next list
        }
        else                                # long interval off, show stats and since beginning of time value in col 4
            printf( "%5d %5d %5d %6d\n", bin+1, total, length( unique ), brand_new );

        brand_new = 0;                      # reset
        bin++;
    }

    BEGIN {
        # comment next line out if no header is needed
        if( lbin_size > 0 )
            printf( "%5s %5s %5s %5s\n", "INT", "TOT", "NEW-S", "NEW-L" );
        else
            printf( "%5s %5s %5s %6s\n", "INT", "TOT", "NEW-S", "1ST-TM" );
        lwrap = 0;
        lidx = 0;
    }

    {
        if( $1+0 >= next_bin )              # short interval expires
        {
            if( NR > 1 )
            {
                dump( );                        # write a line of data
                delete unique;
            }

            next_bin = $1 + bin_size;       # set new expiry time
            total = 0;
        }

        if( lbin_size > 0 )
            llist[lidx] = llist[lidx] $2 " ";   # add this to list of addresses for the long interval

        if( !seen[$2]++ )                   # never seen at all
            brand_new++;                    # never seen before count

        unique[$2] = 1;                     # track unique addrs in the short interval
        total++;
    }
    END {
        if( total )
            dump( );
    }
'



The script still expects 0 as the second argument (lbin_size) to show the number of addresses seen for the first time over all of the input, rather than the count over the longer interval. The number of unique addresses (col 3) is the number observed in the current interval, without regard to the previous interval.
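To make the difference between "unique in this interval" and "seen for the very first time" concrete, here is a stripped-down sketch of just those two counters. It is not the full script: `count_bins` is a hypothetical helper name, every distinct timestamp is treated as its own bin, and the input records are made up for illustration.

```shell
# Sketch: per bin, print  bin-number  total  unique-in-bin  first-time-ever.
# Input on stdin is "timestamp address"; each distinct timestamp is one bin.
count_bins() {
    awk '
        function flush_bin() {
            printf( "%d %d %d %d\n", ++bin, total, nuniq, brand_new );
            total = nuniq = brand_new = 0;
            split( "", unique );                # portable way to empty an array
        }
        NR > 1 && $1 != cur { flush_bin() }     # timestamp changed: close the bin
        {
            cur = $1;
            total++;
            if( !( $2 in unique ) ) { unique[$2] = 1; nuniq++ }   # unique within this bin
            if( !seen[$2]++ ) brand_new++;                        # never seen in any bin
        }
        END { if( total ) flush_bin() }
    '
}

# Made-up sample records, purely for illustration.
printf '%s\n' "1 10.0.0.1" "1 10.0.0.1" "1 10.0.0.2" \
              "2 10.0.0.2" "2 10.0.0.3" "3 10.0.0.1" | count_bins
```

The second bin reports 2 unique addresses but only 1 first-time address, because 10.0.0.2 was already seen in the first bin. The `unique` array is reset at every bin boundary while `seen` is never reset.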

Have fun and let me know how it goes!


EDIT: It did just occur to me that my performance tests were writing output to /dev/null, so your times might be longer given that it will need to do real I/O to write the results someplace. Still, should be better than 12 hours.

Last edited by agama; 02-23-2012 at 10:31 PM.. Reason: Additional thought
# 37  
Old 02-23-2012
Hey! That worked and took no time to compute :) I'm doing some more testing, but I think it's fine. Thanks! Just a quick question: if the user inputs are 2 and 10 (not 0), will it still use 2*10 as the history period for comparing and calculating the values for column 1?
# 38  
Old 02-23-2012
Quote:
Originally Posted by sajal.bhatia
Hey! That worked and took no time to compute :) I'm doing some more testing, but I think it's fine. Thanks! Just a quick question: if the user inputs are 2 and 10 (not 0), will it still use 2*10 as the history period for comparing and calculating the values for column 1?
Yes, that function is unchanged, but it writes the result in column 4, not column 1, since column 1 is always the bin (interval) number. I assume the 1 was a typo.
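For anyone following along, the history period idea can be sketched on its own: keep the addresses of the last lbin_size short bins in a ring of lists and count the distinct addresses across the ring each time a bin closes. This is a simplified illustration, not the script above. `window_unique` is a hypothetical name, and unlike the real script it prints a partial-window count before the ring has wrapped instead of a "-".

```shell
# Sketch: unique addresses over the last lbin_size short bins.
# Arguments: bin_size lbin_size (e.g. 2 and 10 cover 2*10 = 20 time units,
# assuming every short bin contains at least one record).
window_unique() {
    awk -v bin_size="${1:-2}" -v lbin_size="${2:-10}" '
        function close_bin(   s, i, n, a, u, cnt) {
            for( s = 0; s < lbin_size; s++ ) {      # merge every slot in the ring
                n = split( ring[s], a, " " );
                for( i = 1; i <= n; i++ )
                    if( !( a[i] in u ) ) { u[a[i]] = 1; cnt++ }
            }
            printf( "%d %d\n", ++bin, cnt );
            slot = ( slot + 1 ) % lbin_size;        # the oldest slot is reused next
            ring[slot] = "";
        }
        BEGIN { slot = 0 }
        NR == 1 { next_bin = $1 + bin_size }
        $1 + 0 >= next_bin { close_bin(); next_bin = $1 + bin_size }
        { ring[slot] = ring[slot] $2 " " }          # remember this address in the current bin
        END { if( NR ) close_bin() }
    '
}

# Made-up records; bins of 1 time unit, history of 2 bins.
printf '%s\n' "1 10.0.0.1" "1 10.0.0.2" "2 10.0.0.2" "3 10.0.0.3" \
              "3 10.0.0.4" "3 10.0.0.1" "4 10.0.0.4" | window_unique 1 2
```

Each output line is the bin number followed by the number of distinct addresses in the current window. Storing addresses as space-separated strings keeps the ring simple, at the cost of re-splitting every list each time a bin closes.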
# 39  
Old 02-23-2012
Yep, it was a typo, sorry! Thanks for your help. I'll let you know if I have any further issues. Thanks again :)
# 40  
Old 02-24-2012
Code:
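# use Getopt::Std;
# getopts('c:');  
# my $tmp = $cnt/$opt_c;

use strict;
use warnings;

my %hash;       # timestamp -> { address -> number of occurrences }
my $cnt = 0;    # number of distinct timestamps

while(<DATA>){
	chomp;
	my @tmp = split;
	next unless @tmp >= 2;              # skip blank or short lines
	$cnt++ unless $hash{$tmp[0]};
	$hash{$tmp[0]}->{$tmp[1]}++;	
}

my $tmp = 2;    # number of groups to split the timestamps into

my @keys =  sort {$a <=> $b}  keys %hash;   # timestamps are numeric

for(my $i=0;$i<=$tmp-1;$i++){
	
	my $count = 0;      # total records in this group
	
	my %tmp_hash;       # distinct addresses seen in this group
	
	my $total = $cnt/$tmp;      # timestamps per group
	
	for(my $j=0;$j<=$total-1;$j++){
		my $key = shift @keys;
		last unless defined $key;       # ran out of timestamps
		my $addrs = $hash{$key};        # avoid $a, which sort uses
		foreach my $k (keys %{$addrs}){
			$count += $addrs->{$k};	
			$tmp_hash{$k}=1;
		}
	}
	
	my $distinct = keys %tmp_hash;      # hash in scalar context gives the key count
	
	print $i+1," ",$count," ",$distinct,"\n";
	
}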

__DATA__
899726401 112.254.1.0
899726401 112.254.1.0
899726402 154.162.38.0
899726402 160.114.12.0
899726402 165.161.7.0
899726403 101.226.38.0
899726403 101.226.38.0
899726403 101.226.38.0
899726403 73.214.29.0
899726403 144.12.40.0
899726404 144.12.40.0
899726404 1.14.4.0


Last edited by Franklin52; 02-24-2012 at 06:47 AM.. Reason: Code tags