06-24-2012
I'm sorry I wasn't really clear in my first post. More concretely, I'm trying to find out which words most commonly appear immediately to the left of a word of my choice within a corpus, that is to say, pairs of the form: word wordofmychoice.
The text file looks like this:
Are you one of those people who prefer larger dogs? Do you know someone who has told you that they prefer larger dogs because small dogs are yappy and snappy? Whether you are a large-dog person or a small-dog person, one thing we all would agree on is that a larger percentage of small dogs tend to have a different type of temperament than medium and large dogs. Small dogs have earned the reputation of being yappy, snappy, jealous, protective, wary of strangers and not the greatest child companions.
Let's say I'm interested in the word "dogs". The output would be:
larger dogs
small dogs
large dogs
But I want to count how many times each association appears:
larger dogs 2
small dogs 3
large dogs 1
Finally, I only want to keep (print to a new file) the associations that appear at least 3 times. The final result, written to a new text file, would therefore be:
small dogs 3
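To make the steps concrete, here is a rough sketch in Python of what I'm after. Python is only for illustration, and the function name left_neighbours, the min_count parameter and the tokenising rule are my own assumptions. It lower-cases everything (otherwise "Small dogs" would not be counted with "small dogs") and it ignores punctuation entirely.

```python
import re
from collections import Counter

# The sample text from this post.
SAMPLE = (
    "Are you one of those people who prefer larger dogs? Do you know someone "
    "who has told you that they prefer larger dogs because small dogs are "
    "yappy and snappy? Whether you are a large-dog person or a small-dog "
    "person, one thing we all would agree on is that a larger percentage of "
    "small dogs tend to have a different type of temperament than medium and "
    "large dogs. Small dogs have earned the reputation of being yappy, snappy, "
    "jealous, protective, wary of strangers and not the greatest child "
    "companions."
)


def left_neighbours(text, target, min_count=1):
    """Count the words found immediately to the left of `target`.

    Returns (pair, count) tuples, most frequent first, keeping only
    pairs seen at least `min_count` times.
    """
    target = target.lower()
    # Assumption: a "word" is a run of letters, digits or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(
        prev for prev, cur in zip(words, words[1:]) if cur == target
    )
    return [
        (f"{prev} {target}", n)
        for prev, n in counts.most_common()
        if n >= min_count
    ]


# Keep only the associations seen at least 3 times.
for pair, n in left_neighbours(SAMPLE, "dogs", min_count=3):
    print(pair, n)
```

Redirecting the printed lines to a file would give the new text file I mentioned.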
That's basically it. If possible, though this is not a priority, I would also like to make sure that no association has punctuation between its two words, to avoid getting what I would call false results. For instance, suppose I'm looking for "small" and its associations (with one word on the left) in the previous text:
"dogs. Small"
This is what I want to avoid. But once again, that's not a priority.
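One way I imagine this could work, again only as a Python sketch under my own assumptions (the function name left_neighbours_no_punct and the tokenising rule are made up for illustration): keep punctuation marks as separate tokens, and skip any pair whose left-hand token is not a word.

```python
import re
from collections import Counter


def left_neighbours_no_punct(text, target):
    """Count words immediately left of `target`, skipping pairs that
    have a punctuation mark between the two words."""
    target = target.lower()
    # Assumption: a token is either a word (letters/digits, with optional
    # inner apostrophes) or a single non-space punctuation character.
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*|\S", text.lower())
    counts = Counter()
    for prev, cur in zip(tokens, tokens[1:]):
        # Word tokens always start with a letter or digit; punctuation
        # tokens never do, so "dogs. Small" is rejected here.
        if cur == target and prev[0].isalnum():
            counts[prev] += 1
    return counts


short_text = "I like large dogs. Small dogs are yappy, but small dogs are loyal."
print(left_neighbours_no_punct(short_text, "small"))
```

In this short text, "dogs. Small" is not counted because the token just before "small" is the full stop, while "but small" is.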
Thanks for your answers, guys. I hope this is a bit clearer.