word frequency counter - awk solution?


 
# 8  
Old 03-06-2011
Thanks Chubler_XL. Based on your note, I have changed my code a little to make it more robust:
Code:
awk 'NR==FNR{for(i=1;i<=NF;i++) {a[$i]++}}
NR>FNR&&FNR==1{for(i in a) {if( a[i]>=1) b[j++]=i;printf i " "}print ""}
NR>FNR{split($0,c,FS);for(m=0;m<j;m++) {for(n=1;n<=NF;n++){if(b[m]== c[n]) {d++}};printf d" ";d=0};print""}' file1 file1
bb cc dd ee aa
2 1 1 3 2
1 2 1 3 2

Hi Chubler_XL, thanks for improving my code.

Last edited by yinyuemi; 03-07-2011 at 12:21 AM..
# 9  
Old 03-06-2011
Sorry to be pedantic, yinyuemi, but it now misses the last word on each line; change n<NF to n<=NF.
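
A minimal sketch of that off-by-one, not taken from either posted script: the two helper functions below (visit_lt and visit_le are made-up names) only print the fields each loop condition visits, so the missing last word is easy to see.

```shell
# Hypothetical helpers, for illustration only: print every field the loop visits.
visit_lt() { awk '{ for (n = 1; n <  NF; n++) printf "%s ", $n; print "" }'; }
visit_le() { awk '{ for (n = 1; n <= NF; n++) printf "%s ", $n; print "" }'; }

echo "bb cc dd" | visit_lt   # stops before the last field, so dd is skipped
echo "bb cc dd" | visit_le   # visits all three fields
```

Because NF is the number of the last field, a loop that must reach it needs n<=NF.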

---------- Post updated at 11:56 AM ---------- Previous update was at 11:33 AM ----------

And, just for fun, here is a version that does it in one pass (change <2 to <40 to keep only words with a total count of at least 40):

Code:
awk '{t++;for(w=1;w<=NF;w++){l[t,$w]++;g[$w]++}}
END {for(w in g) if(g[w]<2) delete g[w];
for(w in g) printf w " "; print "";
for(i=1;i<=t;i++) { for(w in g) printf l[i,w]" "; print ""}}' file

# 10  
Old 03-06-2011
Quote:
Originally Posted by Chubler_XL
Code:
printf l[i,w]" "

I did not run the code, only skimmed it, but it seems to me that if a word that meets the frequency threshold does not occur in a line, the printf statement will not print a 0. I'm assuming a 0 would be desirable as opposed to an empty string. Perhaps a "+0" or a format string with a numeric conversion specifier would be in order?

For example:
Code:
printf l[i,w]+0 " "
printf "%d ", l[i,w]
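
As a small sketch of the difference (the array x and the three function names are invented for this illustration, they are not part of the scripts in this thread): an array element that was never assigned prints as an empty string in string context, while adding 0 or using %d yields a 0.

```shell
# Sketch: x[1] is never assigned, so it holds an uninitialized value.
show_raw()   { awk 'BEGIN { printf "[%s]\n", x[1] }'; }       # string context
show_plus0() { awk 'BEGIN { printf "[%s]\n", x[1] + 0 }'; }   # forced numeric with +0
show_fmt_d() { awk 'BEGIN { printf "[%d]\n", x[1] }'; }       # numeric conversion specifier

show_raw
show_plus0
show_fmt_d
```

The first call prints empty brackets; the other two print a 0 between the brackets.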

Regards,
Alister
# 11  
Old 03-07-2011
Well spotted - my test data didn't have a line with zero count, fixed below.
It also matches words regardless of their case and removes common punctuation (e.g. comma, full stop, semi-colon, colon, brackets, etc.):

Code:
awk '{$0=tolower($0);gsub("[:;.,()!]"," ");t++;
  for(w=1;w<=NF;w++){l[t,$w]++;g[$w]++}}
END {for(w in g) if(g[w]<2) delete g[w]; else printf w " "; print "";
  for(i=1;i<=t;i++){ for(w in g) printf +l[i,w]" "; print ""}}' infile
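
To look at the normalization step on its own, here is a rough sketch that applies the same tolower() and gsub() calls and then prints the resulting fields one per line. The function name normalize and the sample sentence are assumptions made up for this example.

```shell
# Sketch: fold case, turn the listed punctuation into spaces, then list the fields.
normalize() {
  awk '{ $0 = tolower($0); gsub("[:;.,()!]", " ")
         for (w = 1; w <= NF; w++) print $w }'
}

echo "Brown fox. (The END!)" | normalize
```

Assigning to $0 (directly or through gsub) makes awk split the record into fields again, so the loop sees brown, fox, the and end with no punctuation attached.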


Last edited by Chubler_XL; 03-07-2011 at 12:33 AM..
# 12  
Old 03-09-2011
Works

Chubler_XL's works perfectly... in gawk, nawk, awk. Trying to see how...
# 13  
Old 03-09-2011
Quote:
Originally Posted by irrevocabile
Chubler_XL's works perfectly... in gawk, nawk, awk. Trying to see how...
It probably does need some explanation.

Consider the following input
Code:
The quick brown fox jumped over the lazy
brown fox.

It produces 2 arrays from this. g is a global word count:
Code:
g[the]=2
g[fox]=2
g[quick]=1
g[brown]=2
g[jumped]=1
...

l is a word count for each line, indexed by line number and word:
Code:
l[1,the]=2
l[1,quick]=1
l[1,brown]=1
...
l[2,brown]=1
l[2,fox]=1

The whole file is processed like this; note that t is also counting the number of lines in the file. At the end we go through the g array and delete any entries below the limit (2 in the code above, or 40 if you change <2 to <40), which turns g into the popular word list. Now for each line (i = 1 through t) we print the count in l[i,w], where w is each word remaining in g.

If no entry exists for the line (i.e. this popular word is not on line i), l[i,w] will be null, but the + in front of +l[i,w] causes awk to treat it as numeric and print a zero for us instead of a blank.
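
The same steps can be written out as a commented sketch. This is a variation, not the script posted above: the wordfreq function, the min variable and the order array are all additions for illustration. The order array records each word the first time it is seen, because for (w in g) visits keys in an unspecified order, and the threshold is tested while printing instead of deleting entries from g.

```shell
# Sketch of the one-pass counter; wordfreq, min and order are hypothetical names.
wordfreq() {
  awk -v min="${1:-2}" '
  {
    $0 = tolower($0)                  # fold case
    gsub("[:;.,()!]", " ")            # turn common punctuation into spaces
    t++                               # t counts input lines
    for (w = 1; w <= NF; w++) {
      word = $w
      if (!(word in g)) order[++n] = word   # remember first-seen order (an addition)
      l[t, word]++                    # count for this line
      g[word]++                       # global count
    }
  }
  END {
    # Header: every word whose global count reaches the threshold.
    for (k = 1; k <= n; k++)
      if (g[order[k]] >= min) printf "%s ", order[k]
    print ""
    # One row per input line; %d with +0 prints 0 for a missing entry.
    for (i = 1; i <= t; i++) {
      for (k = 1; k <= n; k++)
        if (g[order[k]] >= min) printf "%d ", l[i, order[k]] + 0
      print ""
    }
  }'
}

printf 'The quick brown fox jumped over the lazy\nbrown fox.\n' | wordfreq 2
```

With the two-line sample input this prints the header "the brown fox", then "2 1 1" for line 1 and "0 1 1" for line 2 (each row ends with a trailing space). The 0 in the last row is the case the unary plus handles in the original.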
# 14  
Old 03-11-2011
This is as clear as a child's tear drop... thank you.