06-24-2012
Linguistic project: extract co-occurrences from text corpus
Hello guys,
I've got a big corpus (a huge text file in which words are separated by one or several spaces). I would like to know if there is a simple way - using awk for instance - to extract any co-occurrence appearing at least 3times through the whole corpus for a given word. By co-occurrence, here, I mean every word that appears to the left of this given word.
For instance: "dog"
big dog (appearing 4 times)
mean dog (appearing 3 times)
blue dog (appearing only once, thus excluded)
The output would look something like this:
big dog 4
mean dog 3
The cherry on top would be to add a condition that would exclude any combination separated by "." in the middle to avoid this scenario (for "dogs"):
Shell scripting is hard. Dogs are...
"hard. Dogs" would be rejected.
I could try to do it on my own if you would be kind enough to point me in the right direction.
Thank you very much !
6 More Discussions You Might Find Interesting
1. Programming
needa c program to extract text between two delimiters from some text file.
and then storing them in to diffrent variables ?
text file like 0:
abc.txt
=========
aaaaaa|11111111|sssssssssss|333333|ddddddddd|34343454564|asass
aaaaaa|11111111|sssssssssss|333333|ddddddddd|34343454564|asass... (7 Replies)
Discussion started by: kukretiabhi13
7 Replies
2. Shell Programming and Scripting
History: large open source PHP project, school management program. Comprises about 200 scripts. Had another developer for awhile, and he wanted a version in German, so he edited all the scripts and replaced text that would show up in the browser with variables (i.e. instead of "Click Here",... (7 Replies)
Discussion started by: dougp23
7 Replies
3. Shell Programming and Scripting
Hello,
I have a large file of syllables /strings in Urdu. Each word is on a separate line.
Example in English:
be
at
for
if
being
attract
I need to identify the frequency of each of these strings from a large corpus (which I cannot attach unfortunately because of size limitations) and... (7 Replies)
Discussion started by: gimley
7 Replies
4. Shell Programming and Scripting
I want to extract verbal forms from a large corpus of English. I have identified a certain number of patterns. Each pattern has the following structure
SPACE word_CATEGORY
where word refers to the verbal form and CATEGORY refers to the class of the verb
The categories are identified as per the... (4 Replies)
Discussion started by: gimley
4 Replies
5. Shell Programming and Scripting
Hi folks!
I have a file which contains a 1000 lines. On each line i have multiple occurrences ( 26 to be exact ) of pattern folder#/folder#.
# is depicting the line number in the file
some text here folder1/folder1 some text here folder1/folder1 some text here folder1/folder1 some text... (7 Replies)
Discussion started by: martinsmith
7 Replies
6. Shell Programming and Scripting
I have two directories called English and Hindi. Each directory contains the same number of files with the only difference being that in the case of the English Directory the tag is
.english
and in the Hindi one the tag is
.Hindi
The file may contain either a single text or more than one text... (7 Replies)
Discussion started by: gimley
7 Replies
LEARN ABOUT MOJAVE
perlrequick5.18
PERLREQUICK(1) Perl Programmers Reference Guide PERLREQUICK(1)
NAME
perlrequick - Perl regular expressions quick start
DESCRIPTION
This page covers the very basics of understanding, creating and using regular expressions ('regexes') in Perl.
The Guide
Simple word matching
The simplest regex is simply a word, or more generally, a string of characters. A regex consisting of a word matches any string that
contains that word:
"Hello World" =~ /World/; # matches
In this statement, "World" is a regex and the "//" enclosing "/World/" tells Perl to search a string for a match. The operator "=~"
associates the string with the regex match and produces a true value if the regex matched, or false if the regex did not match. In our
case, "World" matches the second word in "Hello World", so the expression is true. This idea has several variations.
Expressions like this are useful in conditionals:
print "It matches
" if "Hello World" =~ /World/;
The sense of the match can be reversed by using "!~" operator:
print "It doesn't match
" if "Hello World" !~ /World/;
The literal string in the regex can be replaced by a variable:
$greeting = "World";
print "It matches
" if "Hello World" =~ /$greeting/;
If you're matching against $_, the "$_ =~" part can be omitted:
$_ = "Hello World";
print "It matches
" if /World/;
Finally, the "//" default delimiters for a match can be changed to arbitrary delimiters by putting an 'm' out front:
"Hello World" =~ m!World!; # matches, delimited by '!'
"Hello World" =~ m{World}; # matches, note the matching '{}'
"/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
# '/' becomes an ordinary char
Regexes must match a part of the string exactly in order for the statement to be true:
"Hello World" =~ /world/; # doesn't match, case sensitive
"Hello World" =~ /o W/; # matches, ' ' is an ordinary char
"Hello World" =~ /World /; # doesn't match, no ' ' at end
Perl will always match at the earliest possible point in the string:
"Hello World" =~ /o/; # matches 'o' in 'Hello'
"That hat is red" =~ /hat/; # matches 'hat' in 'That'
Not all characters can be used 'as is' in a match. Some characters, called metacharacters, are reserved for use in regex notation. The
metacharacters are
{}[]()^$.|*+?
A metacharacter can be matched by putting a backslash before it:
"2+2=4" =~ /2+2/; # doesn't match, + is a metacharacter
"2+2=4" =~ /2+2/; # matches, + is treated like an ordinary +
'C:WIN32' =~ /C:\WIN/; # matches
"/usr/bin/perl" =~ //usr/bin/perl/; # matches
In the last regex, the forward slash '/' is also backslashed, because it is used to delimit the regex.
Non-printable ASCII characters are represented by escape sequences. Common examples are " " for a tab, "
" for a newline, and "
" for a
carriage return. Arbitrary bytes are represented by octal escape sequences, e.g., "