Post 302657007 by Cronk on Friday 15th of June 2012 08:38:31 PM
Script to count word occurrences, but exclude some?

I am trying to count the occurrences of ALL words in a file. However, I want to exclude certain words: short words (i.e. fewer than 3 characters) and words listed in a blacklist file. I would also like to count words that are capitalized (e.g. proper names). I am not 100% sure where to draw the line on capitalization: should the first word of a sentence be counted differently? What about a word that would be capitalized even in the middle of a sentence, e.g. a name? Working on the other parts is more important, but any input on this would be appreciated too.

I have put together a command to do the word counting (I borrowed code that I found here in other postings). It lives in a script and takes the filename as a command-line argument:

Code:
tr -cs "[:alpha:]'" "\n" < "$1" | sort | uniq -c | sort -rn > w_counts.txt

In the tr command, I have put an apostrophe in the match set so that it doesn't break up contractions (e.g. "doesn't"). The output of tr is a newline-separated list of words, one per line. That list is sorted so that identical words end up on adjacent lines, which is what 'uniq -c' needs in order to count correctly. The counts are then sorted numerically in descending order (we want the most frequent words first) and written to the text file. (This will eventually be imported back into a database.)
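To make the stages easier to follow, here is the same pipeline wrapped in a function with one comment per stage. This is only a sketch: the function name `count_words` is made up for illustration, and it writes to standard output instead of w_counts.txt.

```shell
#!/bin/sh
# count_words FILE
# Print "count word" lines for FILE, most frequent word first.
# (count_words is a hypothetical name, not part of the original script.)
count_words() {
    # -c complements the set (everything that is NOT a letter or apostrophe),
    # -s squeezes each run of replaced characters into a single newline.
    tr -cs "[:alpha:]'" '\n' < "$1" |
        sort |       # put identical words on adjacent lines
        uniq -c |    # collapse each group into "count word"
        sort -rn     # numeric sort, highest count first
}
```

Two things to keep in mind with this approach: "The" and "the" are counted as different words because nothing folds case, and a file that begins with a non-letter produces one empty "word" at the top of the tr output.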

This runs in about 0.5 seconds on a 4000+ word file. I am pretty happy with that.

Any comments or suggestions about excluding short words or words from a blacklist file, or even the counting capitalized words, would be appreciated.
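One possible direction, as a rough sketch and not a tested solution: filter the word list between the tr and sort stages. The function names, the argument order, and the assumption that the blacklist file holds one word per line are all mine. It uses only awk and standard grep options (-v, -x, -i, -F, -f), so it should behave the same on OS X and under Cygwin, though I have not checked both.

```shell
#!/bin/sh
# All function names below are hypothetical helpers for illustration.

# filter_words MINLEN BLACKLIST
# Read one word per line on stdin; drop words shorter than MINLEN
# characters and words listed in BLACKLIST (assumed: one word per line).
filter_words() {
    awk -v min="$1" 'length($0) >= min' |
        # -F fixed strings, -x whole-line match, -i ignore case, -v invert
        grep -v -x -i -F -f "$2"
}

# word_counts FILE MINLEN BLACKLIST
# Same pipeline as before, with the filter applied before sorting.
word_counts() {
    tr -cs "[:alpha:]'" '\n' < "$1" |
        filter_words "$2" "$3" |
        sort | uniq -c | sort -rn
}

# count_capitalized
# Read one word per line on stdin; count only words that start with an
# uppercase letter. This does not tell sentence-initial words from names.
count_capitalized() {
    grep '^[[:upper:]]' | sort | uniq -c | sort -rn
}
```

It could be called as `word_counts "$1" 3 blacklist.txt > w_counts.txt`, where blacklist.txt is a placeholder filename. A blacklist saved with Windows line endings would probably stop matching, since the trailing carriage return becomes part of each pattern.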

I am working on Mac OS X 10.6.8, but would hope to get a solution that will work under a Windows Unix-like shell (e.g. Cygwin).


Thanks,
J
 
