Counting entries in a file


 
# 29  
Old 02-22-2012
Hey! Thanks for your effort, really appreciate it! I still have to test it, but it looks like it will be fine. Just a quick question off the top of my head: if I had to make the history period (the long interval, in your terms) the 'entire thing' rather than just the previous 'x' intervals, how would we do that?

In other words, if the user gives a sampling interval (short interval) of 2 seconds, then for calculating the fourth column (number of new IPs) it should compare against everything seen up to that sampling time, NOT just the previous 'x' intervals (which is what it does now). So basically maintain some sort of 'seen IPs' history (preferably a hash, to make the lookup quick) and, for each sampling interval, count the IPs that have NOT appeared in that history before. That takes us back to just one user parameter, i.e. the sampling time.

Thanks a ton for your help! Cheers,
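
PS: something along these lines is what I mean by the 'seen' hash, just to illustrate the membership test (the file name is a placeholder, and this counts over the whole file rather than per interval):

Code:
# column 2 holds the IP; 'seen' is the history hash
awk '!($2 in seen) { new++; seen[$2] = 1 } END { print new+0, "IPs never seen before" }' ip_data.txt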
# 30  
Old 02-22-2012
Whoops -- you did say that you'd like to do that and I completely forgot. Well, I'm glad I was on the right track. Code below adds the 'never seen before' logic. If you set the long interval to 0, then column 4 will have the number of IP addresses seen in the short interval that were never seen before.

You're right, it's an easy hash reference so it's quick.

Code:
#!/usr/bin/env ksh
# 4 columns of output:
#   1 - interval or bin number
#   2 - total observations during the interval
#   3 - new observations during the interval
#   4 - new observations during long interval
#
#   if lbin_size is 0, the 4th column contains the number
#   of ip addresses observed for the very first time.
#   lbin_size defaults to 10; pass 0 as the second argument
#   to run the script in 'single interval' mode.
awk  -v lbin_size=${2:-10} -v bin_size=${1:-1} '
    function dump( )
    {
        if( NR == 1 )
            return;

        new_count = 0;
        for( u in unique )              # compute total in this bin that were not in last bin
            if( last_bin[u] == 0 )
                new_count++;

        if( lbin_size > 1 )                 # long interval is turned on
        {
            if( ++lidx >= lbin_size )           # spot for next list; roll if needed
            {
                lidx = 0;
                lwrap = 1;
            }


            if( !lwrap )
                printf( "%5d %5d %5d %5s\n", bin+1, total, new_count, "  -" );  # no value until we have wrapped round
            else
            {                                   # compute the unique addresses in the long interval
                delete lunique;                 # reset before recomputing across the window
                for( l in llist )               # go through each long interval list to weed out duplicates
                {
                    n = split( llist[l], a, " " );
                    for( i = 1; i <= n; i++ )
                        lunique[a[i]] = 1;      # get unique set across the long interval
                }

                ltotal = 0;
                for( u in lunique )
                    ltotal++;                   # finally total the unique addresses seen in long interval

                printf( "%5d %5d %5d %5d\n", bin+1, total, new_count, ltotal );
            }

            llist[lidx] = "";               # clear the next list
        }
        else                                # long interval off, show stats and since beginning of time value in col 4
            printf( "%5d %5d %5d %6d\n", bin+1, total, new_count, brand_new );

        brand_new = 0;                      # reset
        bin++;
    }

    BEGIN {
        # comment next line out if no header is needed
        if( lbin_size > 0 )
            printf( "%5s %5s %5s %5s\n", "INT", "TOT", "NEW-S", "NEW-L" );
        else
            printf( "%5s %5s %5s %6s\n", "INT", "TOT", "NEW-S", "1ST-TM" );
        lwrap = 0;
        lidx = 0;
    }

    {
        if( $1+0 >= next_bin )              # short interval expires
        {
            dump( );                        # write a line of data
            next_bin = $1 + bin_size;       # set new expiry time

            delete last_bin;
            for( u in unique )              # copy hits from this bin
                last_bin[u] = 1;
            delete unique;
            total = 0;
        }

        if( lbin_size > 1 )
                llist[lidx] = llist[lidx] $2 " ";   # add this to list of addresses for the long interval

        if( !seen[$2]++ )                   # never seen at all
            brand_new++;                    # never seen before count

        unique[$2] = 1;                     # track unique addrs in the short interval
        total++;
    }
    END {
        if( total )
            dump( );
    }
'
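
For reference, a couple of ways to invoke it, assuming you save it as something like count_ips.ksh and feed the epoch/IP file on stdin (both names are just placeholders):

Code:
# default: 1 second sampling interval, long interval of 10 sampling intervals
./count_ips.ksh < ip_data.txt

# 2 second sampling interval, long interval set to 0 (column 4 = never seen before)
./count_ips.ksh 2 0 < ip_data.txt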


Last edited by agama; 02-23-2012 at 10:29 PM. Reason: corrected performance bug in script
# 31  
Old 02-23-2012
Hey! Using 0 as the second input works, but somehow it's disturbing the calculation of column 3. Also, I need to run this on a large dataset (~35 million entries), so can you suggest some optimizations as well? To be honest, I haven't tried it on the whole dataset yet (since I saw it messing up the column 3 calculations on the test data), but if the solution can be optimized any further, that would be great. Cheers,
# 32  
Old 02-23-2012
Ok, I've generated some random data (2000 records or so) and I'm not seeing anything odd. Can you post a sample of the input that is giving you problems?
# 33  
Old 02-23-2012
Hey!

Try running it on this:

899787600 78.169.38.0
899787601 52.72.7.0
899787601 52.72.7.0
899787601 154.225.0.0
899787602 118.82.0.0
899787602 252.184.37.0
899787603 211.20.38.0
899787604 211.20.38.0
899787604 64.184.37.0
899787605 116.96.9.0
899787606 118.82.0.0
899787606 202.184.37.0
899787607 202.184.37.0

The correct values for column 3 (number of different IPs in each sampling interval) should be:
1
2
2
1
2
1
2
1

However, the values generated by this script are:
1
2
2
1
1
1
2
0

One more thing: I ran this script on the actual data (some 35 million entries) and it ran for nearly 12 hours without finishing. So I suspect there is some issue with large files.

Cheers,
# 34  
Old 02-23-2012
Maybe I've misunderstood, but the output from the script is what I would have expected. Here is a breakdown of your sample, with each group annotated with the number of addresses counted as new (i.e. what ends up in column 3):

Code:
899787600 78.169.38.0       1 address not in prev group (all new, as it is the first group)

899787601 52.72.7.0         2 addresses not in previous group
899787601 52.72.7.0         (duplicate within this group)
899787601 154.225.0.0

899787602 118.82.0.0        2 addresses not in previous group
899787602 252.184.37.0

899787603 211.20.38.0       1 address not in previous group

899787604 211.20.38.0       1 address not in previous group
899787604 64.184.37.0

899787605 116.96.9.0        1 address not in previous group

899787606 118.82.0.0        2 addresses not in previous group
899787606 202.184.37.0

899787607 202.184.37.0      0 addresses not in previous group



This is per your post on 08-07-11 21:16. The original post asked for a count of unique addresses within the interval, but I assumed it should be relative to the previous group. Given your desired output, I'm guessing you want the original functionality. I'll tweak it to work that way and post it.

Haven't thought about performance -- want to get it working before worrying about that.
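
In the meantime, here is a rough sketch of the direction I have in mind for that rework: column 3 becomes the number of distinct addresses within the sampling interval, and column 4 the number of addresses never observed in any earlier interval. It keeps only one per-interval hash plus one 'ever seen' hash, so it should also be a lot lighter for your 35 million records. Like the current script it skips empty intervals, and it hasn't been run against your full data, so treat it as a sketch rather than the finished version:

Code:
#!/usr/bin/env ksh
# sketch -- columns: interval number, total records in interval,
#           distinct IPs in the interval, IPs never seen in any earlier interval
awk -v bin_size=${1:-1} '
    function dump( )
    {
        printf( "%5d %5d %5d %6d\n", ++bin, total, new_in_bin, first_time );
        delete in_bin;                      # per-interval hash; reset each bin
        total = 0;
        new_in_bin = 0;
        first_time = 0;
    }

    NR == 1 { next_bin = $1 + bin_size; }   # first record opens the first bin

    $1 >= next_bin {                        # short interval expired; emit and roll
        dump( );
        next_bin = $1 + bin_size;
    }

    {
        if( !( $2 in in_bin ) )             # distinct within this sampling interval
        {
            in_bin[$2] = 1;
            new_in_bin++;
        }
        if( !( $2 in seen ) )               # never observed before, in any interval
        {
            seen[$2] = 1;
            first_time++;
        }
        total++;
    }

    END {
        if( total )
            dump( );
    }
'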
# 35  
Old 02-23-2012
Hi,

You're correct, you assumed it to be relative to the previous group. What I need is for the 3rd column to give a count of unique/different IPs within that sampling interval (the original requirement), and for the 4th column to give a count of IPs that are new compared to the ENTIRE history up to that (current) sampling interval (earlier it was comparing against just the previous interval, not the entire history). The latter is working fine with your current script, but only on small test data.

I hope I am clear enough.

Cheers,