Counting multiple entries in a file using awk


 
# 1  
Old 09-23-2010
Counting multiple entries in a file using awk

Hi,

I have a big file (~960 MB) of epoch time values (~50 million entries) that looks like this:

897393601
897393601
897393601
897393601
897393602
897393602
897393602
897393602
897393602
897393603
897393603
897393603
897393603

and so on. Each timestamp occurs more than once.

I want an AWK / SED program (as the file is considerably big) that reads this file, counts the number of entries within a fixed interval (e.g. 2 hrs or 7200 secs, given by the user) and returns an output file which would look something like

first 2 hrs X entries
next 2 hrs Y entries
next 2 hrs Z entries
.
.
.
and so on
where "first 2 hrs" means the start time (897393601) plus the time interval (7200), and so on.

I have written a bash script doing the desired thing but it is way too slow for such a big file. So I am looking for a solution in AWK or SED.

Any quick help will be highly appreciated.

Thanks !

Last edited by sajal.bhatia; 09-23-2010 at 11:26 PM.. Reason: Typing error
# 2  
Old 09-24-2010
This should get you started. You'll have to add conversion if you want to allow the user to supply 2 hours rather than 7200 seconds. Not sure how fast this will be, don't have the patience tonight to create a large data set, but it will likely be faster than bash.

Code:
#!/usr/bin/env ksh

awk -v window="${1:-7200}" '
        {
                if( $1 >= end_window )    # first record, or reached the end of the time window
                {
                        if( idx )              # if not the first window, print the finished count
                                printf( "range %d: %.0f values\n", idx, count );
                        idx++;                 # number of the window that starts here
                        count = 0;             # reset count and set next end of window

                        end_window = $1 + window;
                }

                count++;    # count observations in this window
        }

        END {
                printf( "range %d: %.0f values\n", idx, count );   # print count in progress as we reach eof
        }
'
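
The conversion mentioned above could be done in the shell before awk is called. A minimal sketch, assuming a hypothetical helper named to_seconds and an "h" / "m" suffix convention (2h, 90m); neither is part of the script above, they are only for illustration:

```shell
# to_seconds (hypothetical helper): turn "2h", "90m" or plain seconds into seconds.
to_seconds() {
    case "$1" in
        *h) echo $(( ${1%h} * 3600 )) ;;   # strip the "h" suffix, hours -> seconds
        *m) echo $(( ${1%m} * 60 )) ;;     # strip the "m" suffix, minutes -> seconds
        *)  echo "$1" ;;                   # no suffix: assume the value is already seconds
    esac
}

to_seconds 2h      # prints 7200
to_seconds 90m     # prints 5400
to_seconds 7200    # prints 7200
```

The result can then be handed to awk in place of the raw argument, for example awk -v window="$(to_seconds "${1:-2h}")" followed by the same program.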



---------- Post updated at 23:52 ---------- Previous update was at 23:50 ----------

Forgot to mention that this reads from stdin and counts duplicate timestamps. If you need to drop the duplicates, you can take the easy way out and run
Code:
sort -u

on the file, piping its output into the awk.
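
To see what dropping the duplicates does to the counts, here is a small sketch using the 13 sample lines from the first post. The function name sample is only a stand-in for the real input file:

```shell
# sample: stand-in for the input file, using the 13 lines shown in the first post.
sample() {
    printf '%s\n' 897393601 897393601 897393601 897393601 \
                  897393602 897393602 897393602 897393602 897393602 \
                  897393603 897393603 897393603 897393603
}

# With duplicates every line is counted.
sample | awk 'END { print NR, "lines with duplicates" }'

# After sort -u each timestamp is counted once.
sample | sort -u | awk 'END { print NR, "distinct timestamps" }'
```

If the file is already in time order, as the sample suggests, uniq should give the same result without the cost of sorting ~50 million lines.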
This User Gave Thanks to agama For This Post:
# 3  
Old 09-24-2010
I guess sometimes a window has no entries at all. To test that case, I changed two lines of the sample input file.

Code:
awk '
NR==1{start=$1} 
{t=int(($1-start)/7200);a[t]++;s=(t>s)?t:s}
END{
        print "first 2 hours", a[0] , "entries"
        for (i=1;i<=s;i++) print "next 2 hours", (a[i])?a[i]:"0" , "entries"
    }' infile

first 2 hours 11 entries
next 2 hours 0 entries
next 2 hours 0 entries
next 2 hours 2 entries
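
The same bucketing idea can take the interval as a parameter instead of the hard-coded 7200. A sketch only: the function name bucket_counts, the variable name window and the tiny test input are my own choices, not from the post above:

```shell
# bucket_counts (hypothetical name): read timestamps on stdin, $1 = window in seconds.
bucket_counts() {
    awk -v window="${1:-7200}" '
        NR == 1 { start = $1 }                 # first timestamp anchors window 0
        {
            t = int(($1 - start) / window)     # index of the window this record falls in
            count[t]++
            if (t > last) last = t             # remember the highest window seen
        }
        END {
            if (NR == 0) exit                  # empty input: print nothing
            for (i = 0; i <= last; i++)        # empty windows print as 0
                printf "window %d: %d entries\n", i + 1, count[i]
        }
    '
}

# Two records in the first 10-second window, none in the second, one in the third.
printf '%s\n' 100 105 125 | bucket_counts 10
```

The array only holds one element per window, so memory use should stay small even for a very large input file.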

# 4  
Old 09-24-2010
Hi, this script gives syntax errors when I execute it. Can you help me fix them, as I am new to AWK?

When I run test.awk with the command awk -f test.awk input_file.txt, this is the error I get:

awk: test.awk:3: awk -v window=${1:-7200} '
awk: test.awk:3: ^ syntax error
awk: test.awk:3: awk -v window=${1:-7200} '
awk: test.awk:3: ^ invalid char ''' in expression


Please help !
# 5  
Old 09-24-2010
Quote:
Originally Posted by sajal.bhatia
Hi, this script gives syntax errors when I execute it. Can you help me fix them, as I am new to AWK?

When I run test.awk with the command awk -f test.awk input_file.txt, this is the error I get:

awk: test.awk:3: awk -v window=${1:-7200} '
awk: test.awk:3: ^ syntax error
awk: test.awk:3: awk -v window=${1:-7200} '
awk: test.awk:3: ^ invalid char ''' in expression


Please help !
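You can run agama's code directly; there is no need to wrap it in another awk command. The script already calls awk itself, so running awk -f on it makes awk try to parse the shell line "awk -v window=..." as awk code, which is what produces the syntax error.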

Code:
awk -v window="${1:-7200}" '
        {
                if( $1 >= end_window )    # first record, or reached the end of the time window
                {
                        if( idx )              # if not the first window, print the finished count
                                printf( "range %d: %.0f values\n", idx, count );
                        idx++;                 # number of the window that starts here
                        count = 0;             # reset count and set next end of window

                        end_window = $1 + window;
                }

                count++;    # count observations in this window
        }

        END {
                printf( "range %d: %.0f values\n", idx, count );   # print count in progress as we reach eof
        }
' input_file.txt
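
If you would rather keep using awk -f, the file must contain only the awk program itself, without the "awk -v ..." line and the surrounding quotes. A rough sketch with a condensed version of the same logic; the temporary directory and the three-line input are just placeholders for a real test.awk and input_file.txt:

```shell
dir=$(mktemp -d)

# test.awk holds only the awk program: no "awk -v ..." line, no surrounding quotes.
cat > "$dir/test.awk" <<'EOF'
$1 >= end_window {                 # first record, or past the end of the current window
    if (idx) printf("range %d: %d values\n", idx, count)
    idx++
    count = 0
    end_window = $1 + window       # next window starts at this record
}
{ count++ }
END { if (NR) printf("range %d: %d values\n", idx, count) }
EOF

# Small stand-in for input_file.txt.
printf '%s\n' 100 105 125 > "$dir/input_file.txt"

# The window is still passed on the command line with -v.
awk -v window=10 -f "$dir/test.awk" "$dir/input_file.txt"
```

Note that with this logic a new window starts at the first record after the previous one ended, so windows without any entries are skipped rather than printed as 0.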

These 2 Users Gave Thanks to rdcwayx For This Post:
# 6  
Old 09-24-2010
Thanks a lot :-)
# 7  
Old 09-24-2010
Another approach:
Code:
awk '!e{e=$1+7200}
{while($1>=e){print "Range " (++i), c " entries"; e+=7200; c=0}}
{c++}
END{print "Range " (++i), c " entries"}
' file
