Linguistic project: extract co-occurrences from text corpus


 
# 1  
Old 06-24-2012

Hello guys,

I've got a big corpus (a huge text file in which words are separated by one or more spaces). I would like to know whether there is a simple way, using awk for instance, to extract every co-occurrence of a given word that appears at least 3 times across the whole corpus. By co-occurrence I mean here any word that appears immediately to the left of the given word.

For instance: "dog"

big dog (appearing 4 times)
mean dog (appearing 3 times)
blue dog (appearing only once, thus excluded)

The output would look something like this:

big dog 4
mean dog 3

The cherry on top would be to add a condition that would exclude any combination separated by "." in the middle to avoid this scenario (for "dogs"):
Shell scripting is hard. Dogs are...
"hard. Dogs" would be rejected.

I could try to do it on my own if you would be kind enough to point me in the right direction.

Thank you very much!
# 2  
Old 06-24-2012
Can you post some sample data?
This User Gave Thanks to bartus11 For This Post:
# 3  
Old 06-24-2012
If you use "dog|dogs" as the field separator, then, provided the number of fields is greater than 1, every resulting field is adjacent to the word "dog" or "dogs".
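
A minimal sketch of that idea, not a finished solution. It assumes POSIX awk, takes the search word in lowercase, and uses it as a regular expression for the field separator; the function name left_neighbours and the file name corpus.txt are made up for illustration. The last word of every field except the final one is then the left neighbour of a match. The sketch does not filter punctuation on the neighbour.

```shell
# left_neighbours WORD: read text on stdin and print, one per line, each word
# found directly to the left of WORD (hypothetical helper name).
left_neighbours() {
  awk -v W="$1" '
  BEGIN { FS = W }                  # the search word itself splits the line
  {
    $0 = tolower($0)                # re-splits the line; W must be lowercase
    for (i = 1; i < NF; i++) {      # every field but the last precedes a match
      # Require a blank before the match and a non-letter after it, so that
      # "hotdogs" and "dogsled" are not taken for the word itself.
      if ($i !~ /[ \t]$/ || $(i + 1) ~ /^[a-z]/) continue
      n = split($i, w, /[ \t]+/)    # the trailing blank leaves w[n] empty
      if (n >= 2 && w[n - 1] != "") print w[n - 1]
    }
  }'
}

# Usage (corpus.txt is a placeholder file name):
#   left_neighbours dogs < corpus.txt | sort | uniq -c
```

Because the separator consumes the match, two consecutive occurrences of the word ("dogs dogs") are not reported as neighbours of each other.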
This User Gave Thanks to Scrutinizer For This Post:
# 4  
Old 06-24-2012
Are you saying that if "big dog" appears 3 times or more in a given piece of text, it should return the number of occurrences, whereby the user provides the search word, in your example "dog"?
You speak of the period (".") as the delimiter, but do you ultimately want to extend this to other punctuation as well, such as ! ? ; , etc.?
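
If the answer is yes, one possible sketch is to keep the punctuation marks in a single bracket expression, so the set is easy to extend. The set `.!?;,:` and the function name pairs_without_break are assumptions for illustration, not something stated in this thread. The sketch puts one word per line with tr, then prints each pair of adjacent words unless the left word ends in one of those marks.

```shell
# pairs_without_break: read text on stdin and print "left right" for every
# pair of adjacent words, skipping pairs whose left word ends in one of the
# marks matched by P (hypothetical helper; the set of marks is an assumption).
pairs_without_break() {
  tr -s ' \t' '\n' | awk -v P='[.!?;,:]$' '
  NF == 0 { next }                      # ignore empty lines
  {
    cur = tolower($0)
    if (prev != "" && prev !~ P) {
      right = cur
      sub(/[.!?;,:]+$/, "", right)      # drop punctuation after the right word
      if (right != "") print prev, right
    }
    prev = cur                          # keep punctuation for the next check
  }'
}
```

With this, "hard. Dogs" produces no pair, while "dogs bark!" is still reported as "dogs bark".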
This User Gave Thanks to figaro For This Post:
# 5  
Old 06-24-2012
I'm sorry, I wasn't really clear in my first post. More concretely, I'm trying to see which word is most commonly associated with the word of my choice on its left (that is to say: word wordofmychoice) within a corpus.

The textfile looks like this:

Are you one of those people who prefer larger dogs? Do you know someone who has told you that they prefer larger dogs because small dogs are yappy and snappy? Whether you are a large-dog person or a small-dog person, one thing we all would agree on is that a larger percentage of small dogs tend to have a different type of temperament than medium and large dogs. Small dogs have earned the reputation of being yappy, snappy, jealous, protective, wary of strangers and not the greatest child companions.


Let's say I'm interested in the word "dogs". The output would be:
larger dogs
small dogs
large dogs

But I want to count how many times each association appears:
larger dogs 2
small dogs 3
large dogs 1

And, I only want to keep (print in a new file) associations appearing at least 3 times. Therefore, the final result (in a new textfile) I want to obtain would be:
small dogs 3
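
The counting and filtering step on its own can be sketched with standard tools, assuming the pairs have already been extracted one per line. The function name keep_frequent and the file name result.txt are placeholders.

```shell
# keep_frequent MIN: read "left word" pairs on stdin, one per line, and print
# the pairs seen at least MIN times, each followed by its count.
keep_frequent() {
  sort | uniq -c | awk -v min="$1" '$1 >= min { print $2, $3, $1 }'
}

# Usage with the counts from the example above; result.txt is a placeholder:
#   printf '%s\n' 'larger dogs' 'small dogs' 'small dogs' \
#                 'larger dogs' 'large dogs' 'small dogs' |
#     keep_frequent 3 > result.txt
```

uniq -c puts the count first, so the awk step only moves it to the end of the line and applies the threshold.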

That's it, basically. If possible (though this is not a priority), I would also like to make sure no association contains any punctuation in the middle, to avoid what I would call false results. For instance, suppose I'm looking for "small" and its associations (with one word on the left) in the previous text:

"dogs. Small"

This is what I want to avoid. But once again, that's not a priority.

Thanks for your answers, guys. I hope this was a bit clearer.
# 6  
Old 06-24-2012
How about this:

Code:
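awk -F'[- ]+' -v W=dogs '
# IGNORECASE is gawk-specific; S matches punctuation that ends a clause.
BEGIN { IGNORECASE=1; S="[.?)]" }
$0 ~ W {
  p=tolower($1)                      # p holds the previous word, lowercased
  for (i=2; i<=NF; i++) {
    # Count p when field i is W (optionally followed by punctuation)
    # and p is non-empty and carries no punctuation itself.
    if ($i ~ "^" W S "*$" && p != "" && p !~ S) c[p]++
    p=tolower($i)
  }
}
END {
  for (w in c)
    if (c[w] >= 3) print w, W, c[w]  # keep pairs seen at least 3 times
}' infile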
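
The script above relies on IGNORECASE, which only gawk honours, and it restarts the previous word on every line. A possible variant for other awks, offered as a sketch under assumptions rather than as a replacement: it lowercases explicitly, keeps the previous word across line breaks, and widens the punctuation set to `.?!;:,)`. The function name left_cooc and its two arguments are hypothetical.

```shell
# left_cooc WORD MIN: read text on stdin and print "left WORD count" for every
# left neighbour of WORD seen at least MIN times (hypothetical helper).
left_cooc() {
  awk -v W="$1" -v min="$2" '
  BEGIN { W = tolower(W) }
  {
    gsub(/-/, " ")                        # treat hyphens as word breaks
    for (i = 1; i <= NF; i++) {
      cur = tolower($i)
      bare = cur
      sub(/[.?!;:,)]+$/, "", bare)        # word without trailing punctuation
      # Count only when the previous word carries no punctuation.
      if (bare == W && prev != "" && prev !~ /[.?!;:,)]/) c[prev]++
      prev = cur                          # deliberately kept across lines
    }
  }
  END { for (w in c) if (c[w] >= min) print w, W, c[w] }'
}

# Usage (infile and outfile are placeholder names):
#   left_cooc dogs 3 < infile > outfile
```

The order of the output lines is not defined, since awk does not guarantee any order for a for-in loop; pipe the result through sort if a stable order matters.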

This User Gave Thanks to Chubler_XL For This Post:
# 7  
Old 06-25-2012
Thank you, Chubler, it's working flawlessly!