word frequency counter - awk solution?

03-06-2011

Thanks Chubler_XL. Based on your note, I have changed my code a little to make it more robust:

Code:

awk 'NR==FNR{for(i=1;i<=NF;i++) {a[$i]++}}
NR>FNR&&FNR==1{for(i in a) {if( a[i]>=1) b[j++]=i;printf i " "}print ""}
NR>FNR{split($0,c,FS);for(m=0;m<j;m++) {for(n=1;n<=NF;n++){if(b[m]== c[n]) {d++}};printf d" ";d=0};print""}' file1 file1
bb cc dd ee aa
2 1 1 3 2
1 2 1 3 2

Hi Chubler_XL, thanks for improving my code.

Last edited by yinyuemi; 03-07-2011 at 12:21 AM..

yinyuemi

03-06-2011

Sorry to be pedantic, yinyuemi, but it now misses the last word on each line: change n<NF to n<=NF.

---------- Post updated at 11:56 AM ---------- Previous update was at 11:33 AM ----------

And, just for fun, here is a version that does it in one pass (change <2 to <40 for a limit of 40 on the total count):

Code:

awk '{t++;for(w=1;w<=NF;w++){l[t,$w]++;g[$w]++}}
END {for(w in g) if(g[w]<2) delete g[w];
for(w in g) printf w " "; print "";
for(i=1;i<=t;i++) { for(w in g) printf l[i,w]" "; print ""}}' file

Chubler_XL

03-06-2011

Quote:

Originally Posted by Chubler_XL

Code:

printf l[i,w]" "

I did not run the code, only skimmed it, but it seems to me that if a word that meets the frequency threshold does not occur in a line, the printf statement will not print a 0. I'm assuming a 0 would be desirable as opposed to an empty string. Perhaps a "+0" or a format string with a numeric conversion specifier would be in order?

For example:

Code:

printf l[i,w]+0 " "
printf "%d ", l[i,w]
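
To make the difference concrete, here is a minimal sketch (the helper name show_missing and the sample key are made up for illustration) that prints an array element that was never assigned in three ways: as a plain string, with +0, and with a %d conversion.

```shell
# show_missing: hypothetical helper, not part of the thread's script.
# It prints an array element that was never assigned, three different ways.
show_missing() {
  awk 'BEGIN {
    printf "[%s]\n", l[1,"fox"]        # as a string: empty
    printf "[%s]\n", l[1,"fox"] + 0    # forced to a number: 0
    printf "[%d]\n", l[1,"fox"]        # numeric conversion specifier: 0
  }'
}

show_missing
```

This should print `[]`, `[0]` and `[0]` on three lines, which is why a line that lacks a popular word would otherwise show a blank instead of a zero.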

Regards,
Alister

alister

03-07-2011

Well spotted - my test data didn't have a line with a zero count. Fixed below.
It also matches words regardless of their case and removes common punctuation (e.g. comma, full stop, semicolon, colon, brackets, etc.):

Code:

awk '{$0=tolower($0);gsub("[:;.,()!]"," ");t++;
  for(w=1;w<=NF;w++){l[t,$w]++;g[$w]++}}
END {for(w in g) if(g[w]<2) delete g[w]; else printf w " "; print "";
  for(i=1;i<=t;i++){ for(w in g) printf +l[i,w]" "; print ""}}' infile

Last edited by Chubler_XL; 03-07-2011 at 12:33 AM..

Chubler_XL

03-09-2011

Works

Chubler_XL's works perfectly... in gawk, nawk, awk. Trying to see how...

irrevocabile

03-09-2011

Quote:

Originally Posted by irrevocabile

Chubler_XL's works perfectly... in gawk, nawk, awk. Trying to see how...

It probably does need some explanation.

Consider the following input:

Code:

The quick brown fox jumped over the lazy
brown fox.

It produces two arrays from this. g is a global word count:

Code:

g[the]=2
g[fox]=2
g[quick]=1
g[brown]=2
g[jumped]=1
...

l is a word count for each line:

Code:

l[1,the]=2
l[1,quick]=1
l[1,brown]=1
...
l[2,brown]=1
l[2,fox]=1

The whole file is processed like this; note that t is also counting the number of lines in the file. At the end we go through the g array and delete any entries whose count is below our limit (2 in the code above; use 40 for a limit of 40), which turns g into the popular word list. Now for each line (i = 1 through t) we print the count in l[i,w], where w is each word remaining in g.

If no entry exists for the line (i.e. this popular word is not on line i), l[i,w] will be null (an empty string), but the + in front of l[i,w] causes awk to treat it as numeric and print a zero for us instead of a blank.
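
For anyone who wants to step through it, here is a commented sketch of the same idea, run on the sample input above. It is a variant, not the script as posted: the wrapper name wordfreq, the min parameter, and the order and keep arrays are additions for illustration. The posted script walks g with for (w in g), and awk does not guarantee any particular order for that loop, so this sketch records words in first-seen order to make the columns predictable. It also uses printf "%s " and "%d " so that a word containing % is not treated as a format string.

```shell
# wordfreq: hypothetical wrapper around a commented variant of the one-pass script.
# $1 is an assumed parameter: the minimum total count a word needs (default 2).
wordfreq() {
  awk -v min="${1:-2}" '
  {
    $0 = tolower($0)                    # fold case
    gsub(/[:;.,()!]/, " ")              # blank out common punctuation; fields are re-split
    t++                                 # t counts input lines
    for (w = 1; w <= NF; w++) {
      if (!($w in g)) order[++n] = $w   # remember first-seen order (addition)
      l[t, $w]++                        # per-line count
      g[$w]++                           # global count
    }
  }
  END {
    # keep only the words that reach the threshold, in first-seen order
    for (k = 1; k <= n; k++)
      if (g[order[k]] >= min) keep[++m] = order[k]

    # header row: the popular words
    for (k = 1; k <= m; k++) printf "%s ", keep[k]
    print ""

    # one row per input line; %d turns a missing entry into 0
    for (i = 1; i <= t; i++) {
      for (k = 1; k <= m; k++) printf "%d ", l[i, keep[k]]
      print ""
    }
  }'
}

printf 'The quick brown fox jumped over the lazy\nbrown fox.\n' | wordfreq 2
```

With the two sample lines this should print the header `the brown fox`, then `2 1 1` for line 1 and `0 1 1` for line 2 (each row ends with a trailing space). The 0 in the last row is the case discussed above: "the" is a popular word but does not occur on line 2.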

Chubler_XL

03-11-2011

This is as clear as a child's tear drop... thank you.

irrevocabile
