Sponsored Content
Top Forums UNIX for Dummies Questions & Answers searching and displaying most commonly used words Post 302135575 by lev_lafayette on Monday 10th of September 2007 07:58:10 PM
Old 09-10-2007
This is actually quite a famous problem. I quote verbatim from "Classic Shell Scripting"

Quote:
From 1983 to 1987, Bell Labs researcher Jon Bentley wrote an interesting column in Communications of the ACM titled Programming Pearls. Some of the columns were later collected, with substantial changes, into two books listed in the Chapter 16. In one of the columns, Bentley posed this challenge: write a program to process a text file, and output a list of the n most-frequent words, with counts of their frequency of occurrence, sorted by descending count. Noted computer scientists Donald Knuth and David Hanson responded separately with interesting and clever literate programs,[7] each of which took several hours to write. Bentley's original specification was imprecise, so Hanson rephrased it this way: Given a text file and an integer n, you are to print the words (and their frequencies of occurrence) whose frequencies of occurrence are among the n largest in order of decreasing frequency.

[7] Programming Pearls: A Literate Program: A WEB program for common words, Comm. ACM 29(6), 471-483, June (1986), and Programming Pearls: Literate Programming: Printing Common Words, 30(7), 594-599, July (1987). Knuth's paper is also reprinted in his book Literate Programming, Stanford University Center for the Study of Language and Information, 1992, ISBN 0-937073-80-6 (paper) and 0-937073-81-4 (cloth).

In the first of Bentley's articles, fellow Bell Labs researcher Doug McIlroy reviewed Knuth's program, and offered a six-step Unix solution that took only a couple of minutes to develop and worked correctly the first time. Moreover, unlike the two other programs, McIlroy's is devoid of explicit magic constants that limit the word lengths, the number of unique words, and the input file size. Also, its notion of what constitutes a word is defined entirely by simple patterns given in its first two executable statements, making changes to the word-recognition algorithm easy.

McIlroy's program illustrates the power of the Unix tools approach: break a complex problem into simpler parts that you already know how to handle. To solve the word-frequency problem, McIlroy converted the text file to a list of words, one per line (tr does the job), mapped words to a single lettercase (tr again), sorted the list (sort), reduced it to a list of unique words with counts (uniq), sorted that list by descending counts (sort), and finally, printed the first several entries in the list (sed, though head would work too).

The resulting program is worth being given a name (wf, for word frequency) and wrapped in a shell script with a comment header. We also extend McIlroy's original sed command to make the output list-length argument optional, and we modernize the sort options. We show the complete program in Example 5-5.
And here's the code, right from the book

Code:
#! /bin/sh
# Read a text stream on standard input, and output a list of
# the n (default: 25) most frequently occurring words and
# their frequency counts, in order of descending counts, on
# standard output.
#
# Usage:
#       wf [n]

tr -cs A-Za-z\' '\n' |         
# Replace nonletters with newlines
  tr A-Z a-z |                 
#Map uppercase to lowercase
    sort |                     
#Sort the words in ascending order
      uniq -c |                
#Eliminate duplicates, showing their counts
        sort -k1,1nr -k2 |     
#Sort by descending count, and then by ascending word
          sed ${1:-25}q        
#Print only the first n (default: 25) lines

It's not exactly what you need, but you now have a strong model to work from. Enjoy
 

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

searching and displaying help

I have two input files like this: File-1 ----- a1234 abc town f2345 def village t5678 pqr county File-2 ------ 123456 test1 test2 test3 id-a1234 789012 test2 test4 id-t5678 456789 test7 id-b1234 I want to check the lines that match the first field of File-1 in... (7 Replies)
Discussion started by: ajay41aj
7 Replies

2. Shell Programming and Scripting

Searching words in a file containing a pattern

Hi all, I would like to print words in a file seperated by whitespaces containing a specific pattern like "=" e.g. I have a file1 containing strings like %cat file1 The= some= in wish= born <eof> .I want to display only those words containing = i.e The= , some=,wish= ... (5 Replies)
Discussion started by: sree_123
5 Replies

3. Shell Programming and Scripting

searching for words between delimeters from the rear

Hi, i need to pick up dates and times from the file names which are of unequal length. The dates and time are delimited by dot. I am interested in getting the strings between the delimeter for fields -3, -4, -5 from behind (rear) so that the out put looks like : 071118.011300.556 I have... (2 Replies)
Discussion started by: oktbabs
2 Replies

4. Shell Programming and Scripting

Perl searching special words in lines

Hi , i am a new with perl, i want to made a script that find in file rows that start with specil words, as an example a line will start with" ............................................. specialword aaa=2 bbb=5 ............................................. and to put this in a new file... (3 Replies)
Discussion started by: alinalin
3 Replies

5. Shell Programming and Scripting

Shell script to find out words, replace them and count words

hello, i 'd like your help about a bash script which: 1. finds inside the html file (it is attached with my post) the code number of the Latest Stable Kernel, 2.finds the link which leads to the download location of the Latest Stable Kernel version, (the right link should lead to the file... (3 Replies)
Discussion started by: alex83
3 Replies

6. UNIX for Dummies Questions & Answers

Searching for multiple words on a line in any order issue

Hi again I have figured out how to be able to sort through lines in a file with multiple words in any order and display them using this command: cat file | grep -i $OPTION1 | grep -i $OPTION2 | grep -i $OPTION3 OPTION1 is 2008, OPTION2 is Mar, OPTION 3 is Tue Result: Tue Mar 25... (4 Replies)
Discussion started by: semaj
4 Replies

7. Shell Programming and Scripting

Awk: Searching for length of words between slash character

Dear UNIX Community, I have a set of file paths like the one below: \\folder name \ folder1 \ folder2 \ folder3 \ folder4 \\folder name \ very long folder name \ even longer name I would like to find the length of the characters (including space) between the \'s. However, I want... (6 Replies)
Discussion started by: vnayak
6 Replies

8. Shell Programming and Scripting

Finding my lost file by searching for words in it

Got a question for you guys...I am searching through a public directory (that has tons of files) trying to find a file that I was working on a longggggg time ago. I can't remember what it is called, but I do remember the content. It should contains words like this: Joe Pulvo botnet zeus... (5 Replies)
Discussion started by: statichazard
5 Replies

9. UNIX for Dummies Questions & Answers

searching words & print prefixed string after it

I have a text which I divided them into sentences and now printed them in a rows. I want to get the list of most of words ( the, and, a) and print 5 words after them (so 6 with the word itself). I have created an acceptfile with those rows and using grep but I have rows that have these words more... (2 Replies)
Discussion started by: A-V
2 Replies

10. Shell Programming and Scripting

Gawk gensub, match capital words and lowercase words

Hi I have strings like these : Vengeance mitt Men Vengeance gloves Women Quatro Windstopper Etip gloves Quatro Windstopper Etip gloves Girls Thermobite hooded jacket Thermobite Triclimate snow jacket Boys Thermobite Triclimate snow jacket and I would like to get the lower case words at... (2 Replies)
Discussion started by: louisJ
2 Replies
QSORT(3)						   BSD Library Functions Manual 						  QSORT(3)

NAME
heapsort, mergesort -- sort functions LIBRARY
Utility functions from BSD systems (libbsd, -lbsd) SYNOPSIS
#include <bsd/stdlib.h> int heapsort(void *base, size_t nmemb, size_t size, int (*compar)(const void *, const void *)); int mergesort(void *base, size_t nmemb, size_t size, int (*compar)(const void *, const void *)); DESCRIPTION
The heapsort() function is a modified selection sort. The mergesort() function is a modified merge sort with exponential search intended for sorting data with pre-existing order. The heapsort() function sorts an array of nmemb objects, the initial member of which is pointed to by base. The size of each object is spec- ified by size. The mergesort() function behaves similarly, but requires that size be greater than ``sizeof(void *) / 2''. The contents of the array base are sorted in ascending order according to a comparison function pointed to by compar, which requires two arguments pointing to the objects being compared. The comparison function must return an integer less than, equal to, or greater than zero if the first argument is considered to be respec- tively less than, equal to, or greater than the second. The algorithm implemented by heapsort() is not stable, that is, if two members compare as equal, their order in the sorted array is unde- fined. The mergesort() algorithm is stable. The heapsort() function is an implementation of J.W.J. William's ``heapsort'' algorithm, a variant of selection sorting; in particular, see D.E. Knuth's Algorithm H. Heapsort takes O N lg N worst-case time. Its only advantage over qsort() is that it uses almost no additional memory; while qsort() does not allocate memory, it is implemented using recursion. The function mergesort() requires additional memory of size nmemb * size bytes; it should be used only when space is not at a premium. The mergesort() function is optimized for data with pre-existing order; its worst case time is O N lg N; its best case is O N. Normally, qsort() is faster than mergesort() is faster than heapsort(). Memory availability and pre-existing order in the data can make this untrue. RETURN VALUES
The heapsort() and mergesort() functions return the value 0 if successful; otherwise the value -1 is returned and the global variable errno is set to indicate the error. ERRORS
The heapsort() and mergesort() functions succeed unless: [EINVAL] The size argument is zero, or, the size argument to mergesort() is less than ``sizeof(void *) / 2''. [ENOMEM] The heapsort() or mergesort() functions were unable to allocate memory. SEE ALSO
sort(1), radixsort(3) Williams, J.W.J, "Heapsort", Communications of the ACM, 7:1, pp. 347-348, 1964. Knuth, D.E., "Sorting and Searching", The Art of Computer Programming, Vol. 3, pp. 114-123, 145-149, 1968. McIlroy, P.M., "Optimistic Sorting and Information Theoretic Complexity", Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1992. Bentley, J.L. and McIlroy, M.D., "Engineering a Sort Function", Software--Practice and Experience, Vol. 23(11), pp. 1249-1265, November 1993. BSD
September 30, 2003 BSD
All times are GMT -4. The time now is 10:49 PM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy