Frequency Count of chunked data


 
# 1  
Old 12-11-2014

Dear all,
I have an AWK script which produces a frequency count of words. However, I am now interested in the frequency of chunked data. That is, I have already produced valid chunks of running text, with each chunk on its own line, and what I need is a script to count the frequency of each such line. A sample is provided below:
Code:
this interesting event
has been going on
since years
in this country
the two actors
met
one another
in this country
Mary
met
her husband
in this country

The desired output would be:
Code:
Mary	1
has been going on	1
her husband	1
in this country	3
met	2
one another	1
since years	1
the two actors	1
this interesting event	1

I have been able to sort the data so that all identical strings are grouped together:
Code:
Mary
has been going on
her husband
in this country
in this country
in this country
met
met
one another
since years
the two actors
this interesting event

My question is: how do I modify my script so that a whole line is treated as a single entity, and matching lines (I have got that far with the sort) are counted as one unit with a frequency counter?
My awk script handles space as the delimiter, but I do not know how to make it treat the start and end of a line (the CRLF) as delimiters.
I am sure such a tool will be useful to people who work with chunked big data.
Many thanks
# 2  
Old 12-11-2014
Code:
$ cat file
this interesting event
has been going on
since years
in this country
the two actors
met
one another
in this country
Mary
met
her husband
in this country
$ sort file | uniq -c
   1 Mary
   1 has been going on
   1 her husband
   3 in this country
   2 met
   1 one another
   1 since years
   1 the two actors
   1 this interesting event

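If you prefer the chunk first and its count after it, tab-separated as in the requested output, the uniq -c output can be post-processed. A sketch:
Code:
# save the count, strip it from the front of the line, and print "chunk<TAB>count"
sort file | uniq -c | awk '{n = $1; sub(/^[ \t]*[0-9]+[ \t]+/, ""); print $0 "\t" n}'
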
# 3  
Old 12-11-2014
Hello gimley,

The following may help you. I am still not completely sure about your requirement; if this doesn't fulfil it, please let us know your expected output, what you have tried, and the OS you are using.

Code:
awk '{X[$0]++;Y[$0]=$0;} END{for(i in X){print Y[i] OFS X[i]}}' Input_file | sort

Output will be as follows.
Code:
Mary 1
has been going on 1
her husband 1
in this country 3
met 2
one another 1
since years 1
the two actors 1
this interesting event 1

Thanks,
R. Singh
# 4  
Old 12-11-2014
Quote:
Originally Posted by gimley
What I need is a script to count the frequency of each string. ... My awk script handles space as the delimiter, but I do not know how to make it treat the start and end of a line (the CRLF) as delimiters.
anbu23's suggestion using uniq -c is an excellent choice for this task, but if you want to know how to do it with awk, read on...

When you were dealing with words (instead of lines), your awk script probably looped from 1 to NF on each input line, treating each field as a "word", and counted occurrences with something like:
Code:
{for(i = 1; i <= NF; i++) freq[$i]++}

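Spelled out in full, such a word-counting script might look like the following (a minimal sketch; Input_file is a placeholder name):
Code:
awk '
# count every whitespace-separated field as a word
{ for (i = 1; i <= NF; i++) freq[$i]++ }
# print each word with its count
END { for (w in freq) print w "\t" freq[w] }
' Input_file | sort
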
To count occurrences of lines, it is simpler:
Code:
{freq[$0]++}

Note that awk assumes input files have LF line terminators, not CR/LF. But if every line is CR/LF terminated, it won't matter when you're working on whole lines. It would, however, screw up individual word counts, because the last word on each line would be stored in a different bucket (one for "wordCR") than the other words on the line, which would be counted in the "word" bucket.

Note that your awk script can't have CR/LF line terminators; the CR will be treated as part of whatever awk command is on that line, frequently generating syntax errors.
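If you ever need to normalise CR/LF input (for example, when some files are Unix-format and some DOS-format, or to keep a stray CR out of the output), a sketch that strips the trailing CR before counting:
Code:
# remove a trailing carriage return, if any, then count whole lines as before
awk '{sub(/\r$/, ""); freq[$0]++} END {for (line in freq) print line "\t" freq[line]}' Input_file | sort
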

RavinderSingh13 provided an example of how to use awk to do this. It can be simplified a little bit to just:
Code:
awk '{X[$0]++} END{for(i in X){print i, X[i]}}' Input_file | sort


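And if you want the counts tab-separated, exactly like the sample output in post #1, a small variation on the same one-liner:
Code:
awk '{X[$0]++} END{for(i in X){print i "\t" X[i]}}' Input_file | sort
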
# 5  
Old 12-11-2014
Many thanks to all for their help. Especially to Don for his kind and helpful explanation. One is never too old to learn (just turned 65) and this forum is a wonderful place to learn with helpful people.