Creating Frequency of words from a file by accessing a corpus Post: 302836273

Post 302836273 by durden_tyler on Wednesday 24th of July 2013 12:03:17 AM

Quote:

Originally Posted by gimley

...I need to identify the frequency of each of these strings from a large corpus (which I cannot attach unfortunately because of size limitations) and identify the frequency of each string.
...

Code:

$ 
$ # "wordlist.txt" is a list of words that we have to check
$ cat wordlist.txt
be
at
for
if
being
attract
$ 
$ # "poe_the_gold_bug.txt" is a text file against which we have to
$ # check the words. This file contains the story "The Gold Bug" by
$ # Edgar Allan Poe from the Project Gutenberg website.
$ wc poe_the_gold_bug.txt
 1460 13462 76460 poe_the_gold_bug.txt
$ 
$ # A Perl program to check the frequency of words from "wordlist.txt"
$ # in the file "poe_the_gold_bug.txt"
$ cat word_occurrences.pl
#!/usr/bin/perl
use strict;
use warnings;

my ($wordfile, $testfile) = @ARGV;
die "Usage: $0 wordfile testfile\n" unless defined $testfile;

# Seed the hash with every word from the word list, each with a count of 0.
my %occurrences;
open(my $wf, "<", $wordfile) or die "Can't open $wordfile: $!";
while (<$wf>) {
  chomp;
  $occurrences{$_} = 0;
}
close($wf) or die "Can't close $wordfile: $!";

# Scan the text word by word and count only the words that are in the list.
# Matching is case-sensitive, so "Be" is not counted as "be".
open(my $tf, "<", $testfile) or die "Can't open $testfile: $!";
while (<$tf>) {
  while (/(\w+)/g) {
    $occurrences{$1}++ if exists $occurrences{$1};
  }
}
close($tf) or die "Can't close $testfile: $!";

# Hash order is unspecified, so the words print in no particular order.
while (my ($k, $v) = each %occurrences) {
  printf("%-10s occurs %5d times\n", $k, $v);
}
$ 
$ # Execution of the Perl program
$ perl word_occurrences.pl wordlist.txt poe_the_gold_bug.txt
attract    occurs     0 times
for        occurs   109 times
be         occurs    72 times
at         occurs    96 times
being      occurs    13 times
if         occurs    24 times
$ 
$
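
If Perl is not available, the same idea can be sketched with awk. This is only a rough equivalent, not part of the original program: the function name count_words is made up for illustration, and it assumes plain ASCII text, so it treats runs of letters, digits and underscores as words (roughly what \w+ matches in the Perl script). Matching is case-sensitive here too.

```shell
#!/bin/sh
# count_words WORDLIST TEXTFILE
# Hypothetical helper (the name is an assumption, not from the original post).
# Prints one line per word in WORDLIST with the number of whole-word,
# case-sensitive occurrences found in TEXTFILE.
count_words() {
  awk '
    # First file: remember every listed word with a count of zero.
    NR == FNR { if ($0 != "") count[$0] = 0; next }

    # Second file: turn every run of non-word characters into a space,
    # then count each field that is in the word list.
    {
      gsub(/[^A-Za-z0-9_]+/, " ")
      for (i = 1; i <= NF; i++)
        if ($i in count) count[$i]++
    }

    # Same report format as the Perl program.
    END { for (w in count) printf "%-10s occurs %5d times\n", w, count[w] }
  ' "$1" "$2" | sort
}
```

Called as count_words wordlist.txt poe_the_gold_bug.txt, it should report the same counts as the Perl program for ASCII input. The trailing sort is there because awk, like Perl, walks its array in no particular order.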
