searching and displaying most commonly used words

09-10-2007

Hi guys,

I need to find the most commonly occurring words in a file of about 30,000 words and display their counts. Words of the type listed in file2 should not be counted, i.e. words like is, for, the, an, he, she, etc.


file1:
ALICE was beginning to get very tired of sitting by her sister on the bank and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, "and what is the use of a book," thought Alice, "without pictures or conversations?'

So she was considering, in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.

file2:
was to get of by her on the etc....

output:

ALICE : 1
beginning : 1 etc...

Could you help me with this?

arunsubbhian

09-10-2007

This looks suspiciously like homework.

cassj

09-10-2007

This is actually quite a famous problem. I quote verbatim from "Classic Shell Scripting":

Quote:

From 1983 to 1987, Bell Labs researcher Jon Bentley wrote an interesting column in Communications of the ACM titled Programming Pearls. Some of the columns were later collected, with substantial changes, into two books listed in the Chapter 16. In one of the columns, Bentley posed this challenge: write a program to process a text file, and output a list of the n most-frequent words, with counts of their frequency of occurrence, sorted by descending count. Noted computer scientists Donald Knuth and David Hanson responded separately with interesting and clever literate programs,[7] each of which took several hours to write. Bentley's original specification was imprecise, so Hanson rephrased it this way: Given a text file and an integer n, you are to print the words (and their frequencies of occurrence) whose frequencies of occurrence are among the n largest in order of decreasing frequency.

[7] Programming Pearls: A Literate Program: A WEB program for common words, Comm. ACM 29(6), 471-483, June (1986), and Programming Pearls: Literate Programming: Printing Common Words, 30(7), 594-599, July (1987). Knuth's paper is also reprinted in his book Literate Programming, Stanford University Center for the Study of Language and Information, 1992, ISBN 0-937073-80-6 (paper) and 0-937073-81-4 (cloth).

In the first of Bentley's articles, fellow Bell Labs researcher Doug McIlroy reviewed Knuth's program, and offered a six-step Unix solution that took only a couple of minutes to develop and worked correctly the first time. Moreover, unlike the two other programs, McIlroy's is devoid of explicit magic constants that limit the word lengths, the number of unique words, and the input file size. Also, its notion of what constitutes a word is defined entirely by simple patterns given in its first two executable statements, making changes to the word-recognition algorithm easy.

McIlroy's program illustrates the power of the Unix tools approach: break a complex problem into simpler parts that you already know how to handle. To solve the word-frequency problem, McIlroy converted the text file to a list of words, one per line (tr does the job), mapped words to a single lettercase (tr again), sorted the list (sort), reduced it to a list of unique words with counts (uniq), sorted that list by descending counts (sort), and finally, printed the first several entries in the list (sed, though head would work too).

The resulting program is worth being given a name (wf, for word frequency) and wrapped in a shell script with a comment header. We also extend McIlroy's original sed command to make the output list-length argument optional, and we modernize the sort options. We show the complete program in Example 5-5.

And here's the code, right from the book:

Code:

#! /bin/sh
# Read a text stream on standard input, and output a list of
# the n (default: 25) most frequently occurring words and
# their frequency counts, in order of descending counts, on
# standard output.
#
# Usage:
#       wf [n]

tr -cs A-Za-z\' '\n' |         # Replace nonletters with newlines
  tr A-Z a-z |                 # Map uppercase to lowercase
    sort |                     # Sort the words in ascending order
      uniq -c |                # Eliminate duplicates, showing their counts
        sort -k1,1nr -k2 |     # Sort by descending count, and then by ascending word
          sed ${1:-25}q        # Print only the first n (default: 25) lines

It's not exactly what you need, but you now have a strong model to work from. Enjoy.
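One way the wf pipeline might be adapted to skip the words in file2 is sketched below. This is not from the book: the function name wf_stop, its argument order, and the use of mktemp for a scratch file are all assumptions made for illustration. The idea is to normalize the stop list the same way as the text (one lowercase word per line) and then let grep drop every line that exactly matches one of those words before counting.

```shell
#!/bin/sh
# wf_stop: hypothetical helper (not from "Classic Shell Scripting").
# Prints the n (default: 25) most frequent words in TEXT, skipping
# every word that also appears in STOPFILE.
#
# Usage:
#       wf_stop TEXT STOPFILE [n]

wf_stop() {
    # Assumption: mktemp is available (it is common but not required by POSIX).
    stoplist=$(mktemp) || return 1

    # Normalize the stop words like the text: one per line, lowercase.
    tr -cs "A-Za-z'" '\n' < "$2" | tr A-Z a-z > "$stoplist"

    tr -cs "A-Za-z'" '\n' < "$1" |   # Replace nonletters with newlines
      tr A-Z a-z |                   # Map uppercase to lowercase
        grep -vxFf "$stoplist" |     # Drop lines that exactly match a stop word
          sort |                     # Sort the words in ascending order
            uniq -c |                # Collapse duplicates, showing their counts
              sort -k1,1nr -k2 |     # Descending count, then ascending word
                sed "${3:-25}q"      # Print only the first n lines

    rm -f "$stoplist"
}
```

Something like wf_stop file1 file2 10 would then list the ten most frequent remaining words. The counts come first in the output, as with wf; piping the result through awk '{ print $2, ":", $1 }' should give the "word : count" layout from the question. Note that the words are folded to lowercase, so ALICE and Alice are counted together as alice.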

lev_lafayette
