Linguistic project: extract co-occurrences from text corpus

06-24-2012

Hello guys,

I've got a big corpus (a huge text file in which words are separated by one or several spaces). I would like to know if there is a simple way, using awk for instance, to extract every co-occurrence of a given word that appears at least 3 times in the whole corpus. By co-occurrence, here, I mean any word that appears immediately to the left of the given word.

For instance: "dog"

big dog (appearing 4 times)
mean dog (appearing 3 times)
blue dog (appearing only once, thus excluded)

The output would look something like this:

big dog 4
mean dog 3

The cherry on top would be to add a condition that would exclude any combination separated by "." in the middle to avoid this scenario (for "dogs"):
Shell scripting is hard. Dogs are...
"hard. Dogs" would be rejected.

I could try to do it on my own if you would be kind enough to point me in the right direction.

Thank you very much!

bobylapointe

06-24-2012

Can you post some sample data?

bartus11

06-24-2012

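If you use "dog|dogs" as the field separator, then whenever a line splits into more than one field, every field is adjacent to an occurrence of dog or dogs. The last word of each field except the final one is the word to the left of the match.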
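
A minimal sketch of that idea, assuming a single target word and space-separated text. The helper name `left_of` is made up for illustration, and the separator here is the target word together with its surrounding spaces (rather than the bare "dog|dogs"), so that longer words containing it are not split. A target followed directly by punctuation ("dog.") is not matched, and the word is treated as a regular expression.

```shell
# left_of: hypothetical helper; prints "<left word> <target>" per occurrence.
# Usage: left_of WORD < corpus
left_of() {
  awk -v W="$1" '
  BEGIN { FS = " +" W " +" }       # the target word itself is the separator
  {
    $0 = $0 " "                    # trailing space lets a line-final target match
    for (i = 1; i < NF; i++) {     # every field but the last precedes a match
      n = split($i, words, / +/)   # last word of the field is the left neighbour
      if (n > 0 && words[n] != "") print words[n], W
    }
  }'
}

printf '%s\n' 'the big dog saw a mean dog' | left_of dog
```

Piping the result through `sort | uniq -c` would then give a count per pair.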

Scrutinizer

06-24-2012

Are you saying that if "big dog" appears 3 times or more in a given piece of text, it should return the number of occurrences, whereby the user provides the search word, in your example "dog"?
You speak of the period (".") as the delimiter, but you ultimately want to extend this to other punctuation as well, such as ! ? ; , etc?

figaro

06-24-2012

I'm sorry I wasn't really clear in my first post. More concretely, I'm trying to find which words most commonly appear immediately to the left of a word of my choice within a corpus, that is, pairs of the form: word wordofmychoice.

The textfile looks like this:

Are you one of those people who prefer larger dogs? Do you know someone who has told you that they prefer larger dogs because small dogs are yappy and snappy? Whether you are a large-dog person or a small-dog person, one thing we all would agree on is that a larger percentage of small dogs tend to have a different type of temperament than medium and large dogs. Small dogs have earned the reputation of being yappy, snappy, jealous, protective, wary of strangers and not the greatest child companions.

Let's say I'm interested in the word "dogs". The output would be:
larger dogs
small dogs
large dogs

But I want to count how many times each association appears:
larger dogs 2
small dogs 3
large dogs 1

And, I only want to keep (print in a new file) associations appearing at least 3 times. Therefore, the final result (in a new textfile) I want to obtain would be:
small dogs 3

That's it, basically. If possible (but this is not a priority), I would also like to make sure that no association contains any punctuation in the middle, to avoid what I would call false results. For instance, say I'm looking for "small" and its associations (with one word on the left) in the previous text:

"dogs. Small"

This is what I want to avoid. But once again, that's not a priority.

Thanks for your answers guys, I hope this is a bit clearer.

bobylapointe

06-24-2012

How about this:

Code:

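awk -F'[- ]+' -v W=dogs '
# S: punctuation that may trail a word; a left neighbour carrying it is skipped
BEGIN { S = "[.?!;:,)]"; W = tolower(W) }
tolower($0) ~ W {
  p = tolower($1)                    # previous word, lower-cased
  for (i = 2; i <= NF; i++) {
    w = tolower($i)
    # count p when the current word is W, optionally followed by punctuation
    if (w ~ ("^" W S "*$") && p != "" && p !~ S) c[p]++
    p = w
  }
}
END {
  for (w in c)
    if (c[w] >= 3) print w, W, c[w]  # keep pairs seen at least 3 times
}' infile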
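
One limitation of a per-line loop like the one above is that it never pairs the last word of one line with the first word of the next. If the corpus is wrapped across lines, a variant that carries the previous word over between records might help. This is only a sketch under assumptions: the function name `cooc`, its arguments, and the punctuation list are illustrative choices, not anything from the original post, and it splits on whitespace only, so "large-dog" stays one token.

```shell
# cooc: hypothetical helper. Usage: cooc WORD [MIN] < corpus
# Prints "<left word> <WORD> <count>" for pairs seen at least MIN times (default 3).
cooc() {
  awk -v W="$1" -v MIN="${2:-3}" '
  BEGIN { W = tolower(W) }
  {
    for (i = 1; i <= NF; i++) {
      w = tolower($i)
      core = w
      sub(/[.,;:!?)]+$/, "", core)   # strip trailing punctuation from the candidate
      # count only if the left word does not end in punctuation
      if (core == W && prev != "" && prev !~ /[.,;:!?)]$/) count[prev]++
      prev = w                       # prev is kept across line breaks
    }
  }
  END { for (p in count) if (count[p] >= MIN) print p, W, count[p] }'
}

# "small" ends the first line and "dogs" starts the second; the pair still counts.
printf 'Small dogs bark. Dogs bite small dogs and small\ndogs run from large dogs\n' |
  cooc dogs 3    # prints: small dogs 3
```

Here "bark. Dogs" is rejected because the left word ends in a period, and "large dogs" is dropped because it occurs only once.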

Chubler_XL

06-25-2012

Thank you Chubler, it's working flawlessly!

bobylapointe
