Word Occurrences script using awk


 
# 1  

I'm putting together a script that will count the occurrences of words in text documents. It works fine so far, but I'd like to make a couple of tweaks/additions:

1) I'm having a hard time displaying the array index number. I tried freq[$i], which just spat 0's back at me.
2) Is there any way to eliminate the whitespace (spaces) from the word count?

I'm relatively new to Unix, so any help would be greatly appreciated. Thank you!
Code:
{
        $0 = tolower($0)
        for ( i = 1; i <= NF; i++ )
                freq[$i]++
}
BEGIN { printf "%-20s %-6s\n", "Word", "Count" }
END {
        sort = "sort -k 2nr"
        for (word in freq)
                printf "%-20s %-6s\n", word, freq[word] | sort
        close(sort)
}

# 2  
maybe try this?
Code:
freq[i]++

awk splits fields on whitespace by default.
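
For anyone comparing the two subscripts, here is a minimal sketch on a made-up one-line input. `$i` is the text of field number i, while `i` is only its position, so `freq[i]++` tallies by column number rather than by word. The function names are illustrative only.

```shell
# Subscript is the field content: one counter per distinct word.
count_by_word() {
    awk '{ for (i = 1; i <= NF; i++) freq[$i]++ }
         END { for (w in freq) print w, freq[w] }' | sort
}

# Subscript is the field position: one counter per column number.
count_by_position() {
    awk '{ for (i = 1; i <= NF; i++) freq[i]++ }
         END { for (p in freq) print p, freq[p] }' | sort
}

printf 'the cat the\n' | count_by_word
printf 'the cat the\n' | count_by_position
```

The first call reports words with their counts; the second reports positions 1, 2 and 3, each seen once, which is probably not what a word counter wants.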
This User Gave Thanks to ghostdog74 For This Post:
# 3  
Thank you, ghostdog. I'll try your suggestion about freq[i]++ instead of freq[$i]++.

The reason I mentioned the spaces - when viewing the output it lists blank space as having a count of 243. I can't figure out exactly what it's picking up.
# 4  
Perhaps you have some non-printing characters in the file.
Maybe it's from MS-DOS and has CR (carriage return) characters; you could try dos2unix filename first

or try
Code:
{
    $0 = tolower($0)
    gsub(/\r/, "", $0)      # strip DOS carriage returns; fields are re-split
    for ( i = 1; i <= NF; i++ )
        freq[$i]++
}
BEGIN { printf "%-20s %-6s\n", "Word", "Count" }
END {
    sort = "sort -k 2nr"
    for (word in freq)
        printf "%-20s %-6s\n", word, freq[word] | sort
    close(sort)
}
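
To illustrate the carriage-return theory, here is a small sketch using made-up input with DOS line endings. In awk implementations that do not treat CR as whitespace (gawk, for example), an "empty" DOS line still holds a lone `\r`, which becomes a one-character field that looks blank in the output. Stripping it first leaves the line truly empty. The name `strip_cr_words` is just for this example.

```shell
# Count words after removing carriage returns; brackets make any
# invisible characters in a word easy to spot.
strip_cr_words() {
    awk '{ gsub(/\r/, ""); for (i = 1; i <= NF; i++) freq[$i]++ }
         END { for (w in freq) printf "[%s] %d\n", w, freq[w] }' | sort
}

# Three DOS-style lines: "the", an empty line, and "the" again.
printf 'the\r\n\r\nthe\r\n' | strip_cr_words
```

Dropping the `gsub` call and rerunning is a quick way to check whether a particular awk counts the stray CR as a word.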


Last edited by Chubler_XL; 10-31-2014 at 01:30 AM..
# 5  
Thank you, Chubler! Any idea how I can print off the index value as well? Should I be using asorti instead of sort? I'd like my output to appear like the following example:

Index Word Count
1 the 247
2 a 215
3 to 201
# 6  
How about:

Code:
{
    $0 = tolower($0)
    gsub(/\r/, "", $0)      # strip DOS carriage returns; fields are re-split
    for ( i = 1; i <= NF; i++ )
        freq[$i]++
}
BEGIN { printf "Index\t%-20s %-6s\n", "Word", "Count" }
END {
    # sort by count (descending), then let cat -n number the sorted lines
    sort = "sort -k 2nr | cat -n"
    for (word in freq)
        printf "%-20s %-6s\n", word, freq[word] | sort
    close(sort)
}

# 7  
Huge improvement, thank you Chubler! The only remaining issues are with the alignment:
- The index heading is left aligned, but the index numbers are right aligned (I'd like both left aligned).
- The word heading and results are left aligned (they need to be right aligned).
- The word count heading and results are left aligned (they need to be right aligned).

Also, is there any way to do the sort using the asorti function? It was recommended I use that.

Again, thank you so much for your help!
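
One way to get that layout, sketched here under the assumption that only POSIX awk and sort are available: let a second awk stage number the sorted lines instead of `cat -n`, so the widths are fully under printf's control. `%-5d` left-aligns the index, while `%20s` and `%6d` right-align the word and the count. The function name `numbered_word_count` is made up for this example.

```shell
numbered_word_count() {
    awk '
    {
        $0 = tolower($0)
        for (i = 1; i <= NF; i++)
            freq[$i]++
    }
    END {
        # Emit "count word" so sort can order on the first column.
        for (word in freq)
            print freq[word], word
    }' |
    sort -k1,1nr -k2,2 |
    awk '
    BEGIN { printf "%-5s %20s %6s\n", "Index", "Word", "Count" }
          { printf "%-5d %20s %6d\n", NR, $2, $1 }'
}

printf 'the cat the dog\nthe cat\n' | numbered_word_count
```

Ties in the count are broken alphabetically by the second sort key; that choice is an assumption, not something the thread specifies.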

---------- Post updated at 09:58 PM ---------- Previous update was at 02:43 PM ----------

I've completely redone the script because I wasn't using the actual index values (which the output needs to be sorted by). I've come up with the following, which seems close to working but isn't quite there. I've spent the past 4 hours on this and am completely at my wits' end. Any help would be appreciated. Thanks.

Code:
{
j = 1
for (i in freq)
ind[j] = i
j++
}
{
$0 = tolower($0)
for (i = 1; i <= NF; i++ )
freq [$i]++
}
BEGIN { printf "%-5s %20s %6s\n", "Index", "Word", "Count"}
END {
        asorti(freq)
        for (word in freq)
        printf "%-5s %20s %6s\n", ind[j], word, freq[word]
}
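
A possible direction, offered as a sketch rather than a definitive fix: `asorti()` is a GNU awk extension, and calling it with a single argument replaces the contents of `freq` with the sorted words, so the counts are gone by the time they are printed. The two-argument form writes the sorted words into a separate array and leaves `freq` intact. The third argument `"@val_num_desc"` (gawk 4.0 and later) orders the words by their counts, highest first. The function name `index_word_count` and the array name `words` are assumptions made for this example.

```shell
# Requires GNU awk (gawk); asorti() is not part of POSIX awk.
# Usage: index_word_count < document.txt
index_word_count() {
    gawk '
    {
        $0 = tolower($0)
        gsub(/\r/, "")              # drop DOS carriage returns
        for (i = 1; i <= NF; i++)
            freq[$i]++
    }
    END {
        printf "%-5s %20s %6s\n", "Index", "Word", "Count"
        # words[1..n] receives the words ordered by count; freq is kept.
        n = asorti(freq, words, "@val_num_desc")
        for (k = 1; k <= n; k++)
            printf "%-5d %20s %6d\n", k, words[k], freq[words[k]]
    }'
}
```

Here the loop counter `k` doubles as the index column, so no separate `ind` array is needed.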

