Creating Frequency of words from a file by accessing a corpus

07-23-2013

Hello,
I have a large file of syllables/strings in Urdu, with each word on a separate line.
Example in English:

Code:

be
at
for
if
being
attract

I need to find the frequency of each of these strings in a large corpus (which unfortunately I cannot attach because of size limitations).
Is there a Perl or awk script that can do the job?
Many thanks for your help.

gimley

07-23-2013

Code:

$ awk '{txt[$0]++} END{for (i in txt) { printf "%-8d %s\n"  ,txt[i],i }}' list
1        attract
1        if
1        at
1        being
1        for
1        be
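
Note that the one-liner above counts how often each line occurs within list itself. A minimal sketch of counting the list words against a separate corpus follows. It assumes the corpus is whitespace-separated text, that only whole tokens should match, and that the word list is not empty. The function name count_in_corpus and the sample files are made up for illustration.

```shell
#!/bin/sh
# count_in_corpus WORDLIST CORPUS
# Hypothetical helper: prints "word=count" for every word in WORDLIST,
# counting whole whitespace-separated tokens in CORPUS.
# Assumes WORDLIST is non-empty (the NR == FNR idiom relies on that).
count_in_corpus() {
    awk '
        # First file: remember each listed word with a count of zero
        NR == FNR {
            if ($0 != "") count[$0] = 0
            next
        }
        # Second file: bump the count of every token that is in the list
        {
            for (i = 1; i <= NF; i++)
                if ($i in count) count[$i]++
        }
        END { for (w in count) printf "%s=%d\n", w, count[w] }
    ' "$1" "$2" | sort    # awk array order is unspecified, so sort the result
}

# Example with throwaway sample files
tmp=$(mktemp -d)
printf 'be\nat\nfor\n' > "$tmp/list"
printf 'to be or not to be\nlook at that\n' > "$tmp/corpus"
count_in_corpus "$tmp/list" "$tmp/corpus"
# prints:
# at=1
# be=2
# for=0
rm -rf "$tmp"
```

Words from the list that never occur in the corpus are still reported, with a count of 0.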

MR.bean

07-24-2013

You could also use sort and uniq like this:

Code:

$ sort corpus | uniq -c
      1 at
      1 attract
      1 be
      1 being
      1 for
      1 if
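
To see the most frequent entries first, the same pipeline can be followed by a reverse numeric sort. A small sketch, assuming one word per line as in the example above; rank_by_frequency is a name made up here.

```shell
#!/bin/sh
# rank_by_frequency FILE
# Hypothetical helper: counts identical lines and lists the most frequent first.
rank_by_frequency() {
    sort "$1" | uniq -c | sort -rn
}

# Example with a throwaway sample file
tmp=$(mktemp -d)
printf 'at\nbe\nat\nfor\nat\nbe\n' > "$tmp/corpus"
rank_by_frequency "$tmp/corpus"
rm -rf "$tmp"
```

The width of the count column printed by uniq -c varies between systems, so any script reading this output should split on whitespace rather than rely on fixed columns.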

Chubler_XL

07-24-2013

Quote:

Originally Posted by gimley

...I need to identify the frequency of each of these strings from a large corpus (which I cannot attach unfortunately because of size limitations) and identify the frequency of each string.
...

Code:

$ 
$ # "wordlist.txt" is a list of words that we have to check
$ cat wordlist.txt
be
at
for
if
being
attract
$ 
$ # "poe_the_gold_bug.txt" is a text file against which we have to
$ # check the words. This file contains the story "The Gold Bug" by
$ # Edgar Allan Poe from the Project Gutenberg website.
$ wc poe_the_gold_bug.txt
 1460 13462 76460 poe_the_gold_bug.txt
$ 
$ # A Perl program to check the frequency of words from "wordlist.txt"
$ # in the file "poe_the_gold_bug.txt"
$ cat word_occurrences.pl
#!/usr/bin/perl
use strict;
use warnings;
# Usage: perl word_occurrences.pl WORDFILE TESTFILE
my ($wordfile, $testfile) = @ARGV;
my %occurrences;
# Load the words to look for, one per line, each starting at zero
open(my $wf, "<", $wordfile) or die "Can't open $wordfile: $!";
while (<$wf>) {
  chomp;
  $occurrences{$_} = 0;
}
close($wf) or die "Can't close $wordfile: $!";
# Count every whole word in the text that is also in the word list
open(my $tf, "<", $testfile) or die "Can't open $testfile: $!";
while (<$tf>) {
  chomp;
  while (/(\w+)/g) {
    $occurrences{$1}++ if exists $occurrences{$1};
  }
}
close($tf) or die "Can't close $testfile: $!";
while (my ($k, $v) = each %occurrences) {
  printf("%-10s occurs %5d times\n", $k, $v);
}
$ 
$ # Execution of the Perl program
$ perl word_occurrences.pl wordlist.txt poe_the_gold_bug.txt
attract    occurs     0 times
for        occurs   109 times
be         occurs    72 times
at         occurs    96 times
being      occurs    13 times
if         occurs    24 times
$ 
$

durden_tyler

07-24-2013

Hello,
I tried the awk script, but it does not work.
I created a file called txt, which is the source file whose frequencies have to be found:

Code:

eng
book
shop
writ

and a large file of English words, which I am attaching as a zip for testing.
The idea is that the script should look up each string from the input file in the corpus and print its frequency, followed by all the words that contain it.
For example, 1134 instances of eng were found in the corpus (I checked this in UltraEdit), and a sample of the desired output is given below:

Code:

eng=1134
engine
strength
revenge
engaged
challenge
passengers
engineer
engagement
engines
messenger
length
vengeance
passenger
engage
avenge
engineering
engine
engineers
Deng
challenged
challenging
penguin

Many thanks for the help. Please note that I cannot use Unix tools since I work in Windows/DOS.

en.zip (2.02 MB)

gimley

07-24-2013

It looks like grep returns a different count for "eng" (1123) than UltraEdit (1134), but the counts from grep and Perl agree with each other.

Code:

$ 
$ cat words.txt
eng
book
shop
writ
$ 
$ 
$ grep eng en.txt | wc -l
1123
$ 
$ grep book en.txt | wc -l
220
$ 
$ grep shop en.txt | wc -l
147
$ 
$ grep writ en.txt | wc -l
176
$ 
$ # The Perl program
$ cat word_frequency.pl
#!/usr/bin/perl
use strict;
use warnings;
# Usage: perl word_frequency.pl WORDFILE TESTFILE
my ($wordfile, $testfile) = @ARGV;
my %occurrences;
# Load the strings to look for, one per line
open(my $wf, "<", $wordfile) or die "Can't open $wordfile: $!";
while (<$wf>) {
  chomp;
  $occurrences{$_} = 0;
}
close($wf) or die "Can't close $wordfile: $!";
# For each line of the corpus, count every string it contains.
# A line counts once per string, however often the string repeats in it.
open(my $tf, "<", $testfile) or die "Can't open $testfile: $!";
while (<$tf>) {
  chomp(my $word = $_);
  foreach my $k (keys %occurrences) {
    # \Q...\E matches $k literally, even if it has regex metacharacters
    $occurrences{$k}++ if $word =~ /\Q$k\E/;
  }
}
close($tf) or die "Can't close $testfile: $!";
while (my ($k, $v) = each %occurrences) {
  printf("%-10s occurs %5d times\n", $k, $v);
}
$ 
$ # "en.txt" is the file you attached in your post
$ perl word_frequency.pl words.txt en.txt
shop       occurs   147 times
book       occurs   220 times
writ       occurs   176 times
eng        occurs  1123 times
$ 
$
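
One possible explanation for the gap, which I have not verified against en.txt: grep piped to wc -l (like the Perl loop) counts matching lines, whereas an editor search typically counts every occurrence, so a line containing the string twice is counted once by grep and twice by the editor. A case-insensitive search in the editor could also account for it. The sketch below shows the difference on a tiny made-up file; count_lines and count_occurrences are hypothetical helper names.

```shell
#!/bin/sh
# count_lines STRING FILE
# Number of lines that contain the literal STRING (what grep | wc -l reports).
count_lines() {
    grep -c -F -- "$1" "$2"
}

# count_occurrences STRING FILE
# Total number of non-overlapping occurrences of the literal STRING.
# Note: awk -v interprets backslash escapes in the value, so this sketch
# assumes STRING contains no backslashes.
count_occurrences() {
    [ -n "$1" ] || return 1          # an empty string would never advance
    awk -v s="$1" '
        {
            rest = $0
            # Repeatedly find s and continue searching after the match
            while ((p = index(rest, s)) > 0) {
                n++
                rest = substr(rest, p + length(s))
            }
        }
        END { print n + 0 }
    ' "$2"
}

# Example with a throwaway sample file
tmp=$(mktemp -d)
printf 'engine\nstrength\nengineer engines\nbook\n' > "$tmp/sample.txt"
count_lines eng "$tmp/sample.txt"        # prints 3
count_occurrences eng "$tmp/sample.txt"  # prints 4
rm -rf "$tmp"
```

If every line of the corpus holds exactly one word and no word contains the string twice, both counts are the same.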

durden_tyler

07-24-2013

In awk this is a bit long-winded:

Code:

bash-3.2$ cat list
be
at
for
if
being
attract
bash-3.2$ cat input
at
be
bash-3.2$ 
bash-3.2$ 
bash-3.2$ awk '
BEGIN {
    # Read the search strings from "input"
    while ((getline line < "input") > 0) pat[line] = 0
}
{
    # Record every line of the main file that matches each string
    for (x in pat)
        if ($0 ~ x) { pat[x]++; matched[x, pat[x]] = $0 }
}
END {
    for (x in pat) {
        print x "=" pat[x]
        for (c = 1; c <= pat[x]; c++) print matched[x, c]
    }
}' list
be=2
be
being
at=2
at
attract
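
The ~ operator above treats each search string as a regular expression, so a string containing characters such as . or * would match more than intended. A variant sketch using index() for literal substring matching follows. It also strips a trailing carriage return in case the files were saved on Windows, and prints the strings in the order they appear in the pattern file. The name list_matches is made up here, and the pattern file is assumed to be non-empty.

```shell
#!/bin/sh
# list_matches PATTERNS WORDS
# Hypothetical helper: for each string in PATTERNS prints "string=count",
# followed by every line of WORDS that contains it as a literal substring.
list_matches() {
    awk '
        # Drop a trailing CR so files with Windows line endings still match
        { sub(/\r$/, "") }
        # First file: remember each distinct search string and its order
        NR == FNR {
            if ($0 != "" && !($0 in pat)) { pat[$0] = 0; order[++np] = $0 }
            next
        }
        # Second file: store the line under every string it contains
        {
            for (k = 1; k <= np; k++) {
                x = order[k]
                if (index($0, x) > 0) matched[x, ++pat[x]] = $0
            }
        }
        END {
            for (k = 1; k <= np; k++) {
                x = order[k]
                print x "=" pat[x]
                for (c = 1; c <= pat[x]; c++) print matched[x, c]
            }
        }
    ' "$1" "$2"
}

# Example with the same data as above
tmp=$(mktemp -d)
printf 'at\nbe\n' > "$tmp/input"
printf 'be\nat\nfor\nif\nbeing\nattract\n' > "$tmp/list"
list_matches "$tmp/input" "$tmp/list"
# prints:
# at=2
# at
# attract
# be=2
# be
# being
rm -rf "$tmp"
```

Every matching line is held in memory until the end, so for a very large corpus it may be better to run one search string at a time.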

MR.bean
