Script to count word occurrences, but exclude some? Post: 302657143


Post 302657143 by agama on Saturday 16th of June 2012 12:05:49 PM


I would take a slightly different approach. There is no need for the leading sed, and I would apply the black list to the output of the awk, assuming that will be a shorter list than the output of an initial sed. I'd also strip punctuation and special characters so that something like "(word" is counted as "word", without the paren. I'd check the length after removing the punctuation, so that "(and" is dropped if you want only words longer than 3 characters.

This can be smashed onto one line, but it's easier to read and to comment when written with some structure:

Code:

awk '
    BEGIN { RS = "[" FS "\n]" }         # break into records on space and newline (this may require gnu awk and not work in older versions)
    {
        gsub( "[:,%?<>&@!=+.()]", "", $0 );     # ditch unwanted punctuation before looking at len
        if( length( $0 ) > 3 )                  # keep only words long enough
            count[$0]++;
    }

    END {
        for( x in count )
            print x, count[x];
    }'  data-file | grep -v -f black-list | sort -k 2rn,2
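
If the regex record separator is a problem on an older awk, one possible variant is to loop over the fields of each line instead. The sketch below is not part of the pipeline above. It assumes the black list holds one word per line, and the function name count_words is only illustrative. It also moves the black-list check into awk so that entries are compared against whole words; grep -v -f treats each entry as a pattern and drops any line that contains it, so an entry like "the" would also remove "there".

```shell
# count_words DATA_FILE BLACK_LIST   (hypothetical helper name)
# Prints "word count" pairs, highest count first.
count_words() {
    awk -v bl="$2" '
        BEGIN {
            # load the black list, one word per line (assumed format)
            while ((getline line < bl) > 0)
                skip[line] = 1
            close(bl)
        }
        {
            for (i = 1; i <= NF; i++) {             # default FS splits on blanks and tabs
                w = $i
                gsub(/[,:%?<>&@!=+.()]/, "", w)     # same punctuation set as above
                if (length(w) > 3 && !(w in skip))  # long enough and not black-listed
                    count[w]++
            }
        }
        END {
            for (w in count)
                print w, count[w]
        }
    ' "$1" | sort -k2,2rn -k1,1                     # by count descending, then by word
}
```

It would be called as count_words data-file black-list. Matching is still case sensitive, as in the original.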

Last edited by agama; 06-16-2012 at 01:06 PM.. Reason: clarification


