This is actually quite a famous problem. I quote verbatim from "Classic Shell Scripting"
Quote:
From 1983 to 1987, Bell Labs researcher Jon Bentley wrote an interesting column in Communications of the ACM titled Programming Pearls. Some of the columns were later collected, with substantial changes, into two books listed in the Chapter 16. In one of the columns, Bentley posed this challenge: write a program to process a text file, and output a list of the n most-frequent words, with counts of their frequency of occurrence, sorted by descending count. Noted computer scientists Donald Knuth and David Hanson responded separately with interesting and clever literate programs,[7] each of which took several hours to write. Bentley's original specification was imprecise, so Hanson rephrased it this way: Given a text file and an integer n, you are to print the words (and their frequencies of occurrence) whose frequencies of occurrence are among the n largest in order of decreasing frequency.
[7] Programming Pearls: A Literate Program: A WEB program for common words, Comm. ACM 29(6), 471-483, June (1986), and Programming Pearls: Literate Programming: Printing Common Words, 30(7), 594-599, July (1987). Knuth's paper is also reprinted in his book Literate Programming, Stanford University Center for the Study of Language and Information, 1992, ISBN 0-937073-80-6 (paper) and 0-937073-81-4 (cloth).
In the first of Bentley's articles, fellow Bell Labs researcher Doug McIlroy reviewed Knuth's program, and offered a six-step Unix solution that took only a couple of minutes to develop and worked correctly the first time. Moreover, unlike the two other programs, McIlroy's is devoid of explicit magic constants that limit the word lengths, the number of unique words, and the input file size. Also, its notion of what constitutes a word is defined entirely by simple patterns given in its first two executable statements, making changes to the word-recognition algorithm easy.
McIlroy's program illustrates the power of the Unix tools approach: break a complex problem into simpler parts that you already know how to handle. To solve the word-frequency problem, McIlroy converted the text file to a list of words, one per line (tr does the job), mapped words to a single lettercase (tr again), sorted the list (sort), reduced it to a list of unique words with counts (uniq), sorted that list by descending counts (sort), and finally, printed the first several entries in the list (sed, though head would work too).
The resulting program is worth being given a name (wf, for word frequency) and wrapped in a shell script with a comment header. We also extend McIlroy's original sed command to make the output list-length argument optional, and we modernize the sort options. We show the complete program in Example 5-5.
And here's the code, right from the book
It's not exactly what you need, but you now have a strong model to work from. Enjoy
I have two input files like this:
File-1
-----
a1234 abc town
f2345 def village
t5678 pqr county
File-2
------
123456 test1 test2 test3 id-a1234
789012 test2 test4 id-t5678
456789 test7 id-b1234
I want to check the lines that match the first field of File-1 in... (7 Replies)
Hi all,
I would like to print words in a file seperated by whitespaces containing a specific pattern like "="
e.g. I have a file1 containing strings like
%cat file1
The= some= in
wish= born
<eof> .I want to display only those words containing = i.e The= , some=,wish=
... (5 Replies)
Hi,
i need to pick up dates and times from the file names which are of unequal length. The dates and time are delimited by dot. I am interested in getting the strings between the delimeter for fields -3, -4, -5 from behind (rear) so that the out put looks like :
071118.011300.556
I have... (2 Replies)
Hi , i am a new with perl, i want to made a script that find in file rows that start with specil words, as an example a line will start with"
.............................................
specialword aaa=2 bbb=5
.............................................
and to put this in a new file... (3 Replies)
hello,
i 'd like your help about a bash script which:
1. finds inside the html file (it is attached with my post) the code number of the Latest Stable Kernel,
2.finds the link which leads to the download location of the Latest Stable Kernel version,
(the right link should lead to the file... (3 Replies)
Hi again
I have figured out how to be able to sort through lines in a file with multiple words in any order and display them using this command:
cat file | grep -i $OPTION1 | grep -i $OPTION2 | grep -i $OPTION3 OPTION1 is 2008, OPTION2 is Mar, OPTION 3 is Tue
Result:
Tue Mar 25... (4 Replies)
Dear UNIX Community,
I have a set of file paths like the one below:
\\folder name \ folder1 \ folder2 \ folder3 \ folder4
\\folder name \ very long folder name \ even longer name
I would like to find the length of the characters (including space) between the \'s. However, I want... (6 Replies)
Got a question for you guys...I am searching through a public directory (that has tons of files) trying to find a file that I was working on a longggggg time ago. I can't remember what it is called, but I do remember the content. It should contains words like this:
Joe
Pulvo
botnet
zeus... (5 Replies)
I have a text which I divided them into sentences and now printed them in a rows.
I want to get the list of most of words ( the, and, a) and print 5 words after them (so 6 with the word itself). I have created an acceptfile with those rows and using grep but I have rows that have these words more... (2 Replies)
Hi
I have strings like these :
Vengeance mitt
Men Vengeance gloves
Women Quatro Windstopper Etip gloves
Quatro Windstopper Etip gloves
Girls Thermobite hooded jacket
Thermobite Triclimate snow jacket
Boys Thermobite Triclimate snow jacket
and I would like to get the lower case words at... (2 Replies)
Discussion started by: louisJ
2 Replies
LEARN ABOUT DEBIAN
heapsort
QSORT(3) BSD Library Functions Manual QSORT(3)NAME
heapsort, mergesort -- sort functions
LIBRARY
Utility functions from BSD systems (libbsd, -lbsd)
SYNOPSIS
#include <bsd/stdlib.h>
int
heapsort(void *base, size_t nmemb, size_t size, int (*compar)(const void *, const void *));
int
mergesort(void *base, size_t nmemb, size_t size, int (*compar)(const void *, const void *));
DESCRIPTION
The heapsort() function is a modified selection sort. The mergesort() function is a modified merge sort with exponential search intended for
sorting data with pre-existing order.
The heapsort() function sorts an array of nmemb objects, the initial member of which is pointed to by base. The size of each object is spec-
ified by size. The mergesort() function behaves similarly, but requires that size be greater than ``sizeof(void *) / 2''.
The contents of the array base are sorted in ascending order according to a comparison function pointed to by compar, which requires two
arguments pointing to the objects being compared.
The comparison function must return an integer less than, equal to, or greater than zero if the first argument is considered to be respec-
tively less than, equal to, or greater than the second.
The algorithm implemented by heapsort() is not stable, that is, if two members compare as equal, their order in the sorted array is unde-
fined. The mergesort() algorithm is stable.
The heapsort() function is an implementation of J.W.J. William's ``heapsort'' algorithm, a variant of selection sorting; in particular, see
D.E. Knuth's Algorithm H. Heapsort takes O N lg N worst-case time. Its only advantage over qsort() is that it uses almost no additional
memory; while qsort() does not allocate memory, it is implemented using recursion.
The function mergesort() requires additional memory of size nmemb * size bytes; it should be used only when space is not at a premium. The
mergesort() function is optimized for data with pre-existing order; its worst case time is O N lg N; its best case is O N.
Normally, qsort() is faster than mergesort() is faster than heapsort(). Memory availability and pre-existing order in the data can make this
untrue.
RETURN VALUES
The heapsort() and mergesort() functions return the value 0 if successful; otherwise the value -1 is returned and the global variable errno
is set to indicate the error.
ERRORS
The heapsort() and mergesort() functions succeed unless:
[EINVAL] The size argument is zero, or, the size argument to mergesort() is less than ``sizeof(void *) / 2''.
[ENOMEM] The heapsort() or mergesort() functions were unable to allocate memory.
SEE ALSO sort(1), radixsort(3)
Williams, J.W.J, "Heapsort", Communications of the ACM, 7:1, pp. 347-348, 1964.
Knuth, D.E., "Sorting and Searching", The Art of Computer Programming, Vol. 3, pp. 114-123, 145-149, 1968.
McIlroy, P.M., "Optimistic Sorting and Information Theoretic Complexity", Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, January
1992.
Bentley, J.L. and McIlroy, M.D., "Engineering a Sort Function", Software--Practice and Experience, Vol. 23(11), pp. 1249-1265,
November 1993.
BSD September 30, 2003 BSD