Script to count word occurrences, but exclude some?
I am trying to count the occurrences of all words in a file. However, I want to exclude certain words: short words (i.e. fewer than 3 characters) and words listed in a blacklist file. I would also like to count capitalized words (e.g. proper names). I am not 100% sure where to draw the line on capitalization: do we count the first word of a sentence differently? What about a word that would be capitalized even in the middle of a sentence, e.g. a name? The other parts are more important, but any input on this would be appreciated.
I have put together a command to do the word counting in the file (I borrowed code that I found here in other postings). It is in a script here, and uses command line arguments for the filename, too:
In the tr command, I have included an apostrophe in the match set so that it doesn't break up contractions (e.g. "doesn't"). The output of tr is a newline-separated list of words that is then fed into the other commands, where it gets sorted so that 'uniq' will count correctly. That result is then reverse sorted (we want the most frequent words first) and written to the text file. (This will eventually be imported back into a database.)
This runs in about 0.5 seconds on a 4000+ word file. I am pretty happy with that.
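The script itself does not appear in the post, so the following is only a sketch of the pipeline as described, not the original code. The function name count_words and the exact tr character set are assumptions.

```shell
#!/bin/sh
# Sketch only: count_words is a hypothetical name, and the character set
# passed to tr is an assumption based on the description in the post.
count_words() {
    # Turn every run of characters that is NOT a letter or an apostrophe
    # into a single newline, so each word ends up on its own line.
    tr -cs "A-Za-z'" '\n' < "$1" |
        sed '/^$/d' |   # drop the empty line tr emits for leading punctuation
        sort |          # group identical words so uniq can count them
        uniq -c |       # prefix each distinct word with its count
        sort -rn        # numeric reverse sort: most frequent words first
}
```

It could be called as `count_words input.txt > counts.txt`. Counting here is case-sensitive, so "The" and "the" are tallied separately.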
Any comments or suggestions about excluding short words or words from a blacklist file, or even the counting capitalized words, would be appreciated.
I am working on Mac OS X 10.6.8, but would hope to get a solution that will work under a Windows Unix-like shell (e.g. Cygwin).
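One possible way to add the exclusions, sketched with standard tools that exist on both Mac OS X and Cygwin (tr, sed, awk, grep, sort, uniq). All function names are hypothetical. The blacklist is assumed to hold one word per line, and matching it case-insensitively is also an assumption. "Capitalized" is taken here to mean simply "starts with an uppercase letter", which does not settle the sentence-start question.

```shell
#!/bin/sh
# Sketch only: all function names below are hypothetical.

# split_words FILE: print one word per line (letters and apostrophes only).
split_words() {
    tr -cs "A-Za-z'" '\n' < "$1" | sed '/^$/d'
}

# filter_words BLACKLIST: read words on stdin and drop
#   - words shorter than 3 characters
#   - words listed in BLACKLIST (assumed: one word per line)
filter_words() {
    awk 'length($0) >= 3' |
        # -x whole-line match, -F fixed strings, -f read patterns from file,
        # -i ignore case (assumption), -v keep only the non-matching words
        grep -v -i -x -F -f "$1"
}

# count_filtered FILE BLACKLIST: frequency table, most frequent first.
count_filtered() {
    split_words "$1" | filter_words "$2" | sort | uniq -c | sort -rn
}

# count_capitalized FILE BLACKLIST: same table, restricted to words that
# begin with an uppercase letter (includes sentence-initial words).
count_capitalized() {
    split_words "$1" | filter_words "$2" |
        grep '^[[:upper:]]' | sort | uniq -c | sort -rn
}
```

Usage might look like `count_filtered input.txt blacklist.txt > counts.txt`. If the blacklist was saved with Windows line endings, the carriage returns would need stripping first (for example with `tr -d '\r'`), otherwise no word will match it.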
Hi Unix-Experts,
I have a textfile with several occurrences of some string XXX. I'd like to count all the occurrences and number them in reverse order.
E.g. input: XXX bla XXX foo XXX
output: 3 bla 2 foo 1
I tried to achieve this with sed, but failed. Any suggestions?
Thanks in... (4 Replies)
I am a newbie in UNIX shell script and seeking help on this UNIX function. Please give me a hand. Thanks.
I have a large file named 'MyFile'. It is tab-delimited. I am told to write a shell function that counts the number of occurrences of the word “mysring” in the file 'MyFile'. (1 Reply)
Hello,
I have an output from GDB with many entries that looks like this
0x00007ffff7dece94 39 in dl-fini.c
0x00007ffff7dece97 39 in dl-fini.c
0x00007ffff7ab356c 50 in exit.c
0x00007ffff7aed9db in _IO_cleanup () at genops.c:1022
115 in dl-fini.c
0x00007ffff7decf7b in _dl_sort_fini (l=0x0,... (6 Replies)
Hi,
I need help to count the number of occurrences in $3 of file1.txt. I only know how to count by checking one by one and the code is like this:
awk '$3 ~ /aku hanya poyo/ {++c} END {print c}' FS="\t" file1.txt
But this is not a wise approach, as I have hundreds of different occurrences in that... (10 Replies)
I am in need of a basic format to
1. list all files in a directory
2. list the # of lines in each file
3. list the # of words in each file
If someone could give me a basic format I would appreciate it
***ALSO i can not use the FIND command*** (4 Replies)
I'm putting together a script that will count the occurrences of words in text documents. It works fine so far, but I'd like to make a couple tweaks/additions:
1) I'm having a hard time displaying the array index number, tried freq which just spit 0's back at me
2) Is there any way to... (12 Replies)
input
amex-11 10 abc
amex-11 20 bcn
amed-12 1 abc
I tried something like this.
awk '{h[$1]++}; END { for(k in h) print k, h[k] }' rm1
output
amex-11 1 10 abc
amex-11 1 20 bcn
amed-12 2 1 abc
Note: The second column represents the occurrences. amex-11 is first one and amed-12 is the... (5 Replies)
I am trying to figure out how to find the word count of each word in my file
sample file
hi how are you
hi are you ok
sample output
hi 1
how 1
are 1
you 1
hi 1
are 1
you 1
ok 1
wc -l filename is not helping. I think we will have to split the lines and count and then print and also... (4 Replies)
Hi Friends ,
I am having a problem, as stated below.
Having an input CSV file as shown in the code
U_TOP_LOGIC/U_HPB2/U_HBRIDGE2/i_core/i_paddr_reg_2_/Q,1,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,0,0... (4 Replies)
Discussion started by: kshitij
LEARN ABOUT OPENSOLARIS
look
look(1) User Commands look(1)
NAME
look - find words in the system dictionary or lines in a sorted list
SYNOPSIS
/usr/bin/look [-d] [-f] [-tc] string [filename]
DESCRIPTION
The look command consults a sorted filename and prints all lines that begin with string.
If no filename is specified, look uses /usr/share/lib/dict/words with collating sequence -df.
look limits the length of a word to search for to 256 characters.
OPTIONS
-d Dictionary order. Only letters, digits, TAB and SPACE characters are used in comparisons.
-f Fold case. Upper case letters are not distinguished from lower case in comparisons.
-tc Set termination character. All characters to the right of c in string are ignored.
FILES
/usr/share/lib/dict/words spelling list
ATTRIBUTES
See attributes(5) for descriptions of the following attributes:
+-----------------------------+-----------------------------+
| ATTRIBUTE TYPE | ATTRIBUTE VALUE |
+-----------------------------+-----------------------------+
|Availability |SUNWesu |
+-----------------------------+-----------------------------+
SEE ALSO
grep(1), sort(1), attributes(5)
SunOS 5.11 29 Mar 1994 look(1)