word frequency counter - awk solution?
# 1  
Old 03-04-2011

Dear all,

I need your help with this. I have a text file. For every word whose total frequency in the file is >40, I need to count how often it occurs in each line, and write the counts to another file with columns like this:

word1,word2,word3, ...wordn
0,0,1
1,2,0
3,2,0 etc -- each row represents the word counts for one line of the original text file

The numbers are the frequencies of word1 ... wordn in the corresponding line of the original file.

This awk script of course does the first part (it collects the list of words to count):
Code:
{
    for (i = 1; i <= NF; i++)
        words[$i]++          # total occurrences of each word
}

END {
    for (i in words)
        if (words[i] > 40)
            print i          # words that occur more than 40 times
}

This does the searching and counting:

Code:
{
res=gsub(i, " ", all)

print res
}

How do I put them together in awk? Sorry, I am a complete newbie.
# 2  
Old 03-04-2011
Your description is very vague. Should it do this:

Code:
# input data
aa bb aa bb cc dd ee ee ee
# resulting count line
2,2,2,2,1,1,3

...because I think that's what your gsub would end up doing.
# 3  
Old 03-04-2011
Thank you for your reply!!!
It needs to be:
Code:
# input data
aa bb aa bb cc dd ee ee ee
aa aa bb cc ee ee dd ee cc
# resulting count line
aa,bb,cc,dd,ee
2,2,1,1,3
2,1,2,1,3

I made a shell script for that, but I would really prefer to have it all done inside awk.
Thank you again.
# 4  
Old 03-04-2011
How about this?

Code:
 awk 'NR==FNR{for(i=1;i<=NF;i++) {a[$i]++}}
NR>FNR&&FNR==1{for(i in a) {if( a[i]>=1) {b[j++]=i;printf "%s ", i}}print ""}
NR>FNR{for(m=0;m<j;m++) printf gsub(b[m],b[m])" ";print""}' file file

# 5  
Old 03-06-2011
Thank you, yinyuemi. I have been trying to make this work, with no success so far: the output is an empty file. Thank you nevertheless.
# 6  
Old 03-06-2011
It worked on my computer:

Code:
>cat file
aa bb aa bb cc dd ee ee ee
aa aa bb cc ee ee dd ee cc
>awk 'NR==FNR{for(i=1;i<=NF;i++) {a[$i]++}}
NR>FNR&&FNR==1{for(i in a) {if( a[i]>=1) {b[j++]=i;printf "%s ", i}}print ""}
NR>FNR{for(m=0;m<j;m++) printf gsub(b[m],b[m])" ";print""}' file file
bb cc dd ee aa
2 1 1 3 2
1 2 1 3 2

Best,
Y
# 7  
Old 03-06-2011
Worked on my PC too; perhaps the OP should use nawk instead of awk.

One thing to note: yinyuemi's code counts matches with a search and replace (gsub), so if one word is a substring of another, e.g. "the" and "thesis", the counts go wrong.
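
To make the substring problem concrete, here is a small sketch (the variable names `line`, `sub_count` and `field_count` are only for illustration). It counts "the" in one line twice: once with gsub, which also matches inside "thesis", and once by comparing whole fields.

```shell
line="thesis the thesis the cc"

# Substring counting: gsub() returns the number of replacements made,
# and /the/ also matches inside "thesis".
sub_count=$(printf '%s\n' "$line" | awk '{ print gsub(/the/, "&") }')

# Whole-field counting: compare each field to the word exactly.
field_count=$(printf '%s\n' "$line" |
    awk '{ n = 0; for (i = 1; i <= NF; i++) if ($i == "the") n++; print n }')

echo "gsub: $sub_count, fields: $field_count"
```

The gsub count comes out as 4 and the field count as 2, because "the" is counted twice more inside the two "thesis" fields.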

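This update fixes this issue for me (change >=1 to >=40 when you're ready to limit the output to words with 40 or more total occurrences, or to >40 if the limit should be strictly greater than 40, as in the original post):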

Code:
$ cat file
thesis the thesis the cc dd ee ee ee
thesis thesis the cc ee ee dd ee cc
$ awk 'NR==FNR{for(i=1;i<=NF;i++) {a[$i]++};next}
FNR==1{for(i in a)if(a[i]>=1)b[i]=0;for(i in b)printf (k++?",":"")i;print ""}
{for(i in b) k=b[i]=0;for(w=1;w<=NF;w++)if($w in b)b[$w]++;for(i in b) printf (k++?",":"")b[i];print ""}' file file
thesis,cc,the,dd,ee
2,1,2,1,3
2,2,1,1,3
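
For anyone who finds the one-liner hard to read, the same two-pass idea could be spelled out roughly as below. This is only a sketch, not a drop-in replacement: the names `min`, `sample.txt` and `counts.csv` are made up for the example, and the columns are printed in order of first appearance (the order of `for (i in array)` is not specified in awk, so the column order above may differ between implementations).

```shell
cat > sample.txt <<'EOF'
thesis the thesis the cc dd ee ee ee
thesis thesis the cc ee ee dd ee cc
EOF

awk -v min=1 '
# Pass 1: total occurrences of every word, remembering first-seen order.
NR == FNR {
    for (i = 1; i <= NF; i++)
        if (!(total[$i]++)) seen[++n] = $i
    next
}
# Start of pass 2: choose the columns and print the header once.
FNR == 1 {
    for (s = 1; s <= n; s++)
        if (total[seen[s]] >= min) col[++nc] = seen[s]
    for (c = 1; c <= nc; c++)
        printf "%s%s", col[c], (c < nc ? "," : "\n")
}
# Pass 2: count whole fields on this line, then print one CSV row.
{
    split("", cnt)                      # clear the per-line counts
    for (i = 1; i <= NF; i++) cnt[$i]++
    for (c = 1; c <= nc; c++)
        printf "%d%s", cnt[col[c]], (c < nc ? "," : "\n")
}' sample.txt sample.txt > counts.csv

cat counts.csv
```

With min=1 this should write the header thesis,the,cc,dd,ee followed by the rows 2,2,1,1,3 and 2,1,2,1,3. Raise min (e.g. -v min=41 for "more than 40") once the test data looks right; if no word reaches the limit, the output file is empty.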


Last edited by Chubler_XL; 03-06-2011 at 09:16 PM..