Post 302661031 by bobylapointe on Sunday 24th of June 2012 01:30:02 PM
Linguistic project: extract co-occurrences from text corpus

Hello guys,

I've got a big corpus (a huge text file in which words are separated by one or several spaces). I would like to know if there is a simple way, using awk for instance, to extract every co-occurrence of a given word that appears at least 3 times across the whole corpus. By co-occurrence, here, I mean any word that appears immediately to the left of the given word.

For instance: "dog"

big dog (appearing 4 times)
mean dog (appearing 3 times)
blue dog (appearing only once, thus excluded)

The output would look something like this:

big dog 4
mean dog 3

The cherry on top would be a condition that excludes any pair with a "." between the two words, to avoid this scenario (for "dogs"):
Shell scripting is hard. Dogs are...
"hard. Dogs" would be rejected.

I could try to do it on my own if you would be kind enough to point me in the right direction.

Thank you very much!
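
One possible direction, as a minimal sketch rather than a definitive solution: walk through every whitespace-separated token with awk, remember the previous token, and count it whenever the current token is the given word. The function name `cooc` and the variables `target` and `min` are made up for illustration. The sketch also assumes that matching should ignore case (so "Dogs" matches "dogs"), that trailing punctuation such as "," is stripped before comparing, and that a pair may span a line break.

```shell
#!/bin/sh
# cooc: hypothetical helper name, not part of any standard tool.
# Usage: cooc WORD [MIN] < corpus.txt   (MIN defaults to 3)
cooc() {
    awk -v target="$1" -v min="${2:-3}" '
    BEGIN { target = tolower(target) }
    {
        for (i = 1; i <= NF; i++) {
            # Does the raw token end with a period (assumed sentence end)?
            ends = ($i ~ /\.$/)
            # Lowercase and drop trailing punctuation for the comparison.
            clean = tolower($i)
            sub(/[.,;:!?]+$/, "", clean)
            # Count the left neighbour unless a period separates the pair.
            if (clean == target && prev != "" && !prev_ends)
                count[prev]++
            # prev persists across lines, so pairs may span a line break.
            prev = clean
            prev_ends = ends
        }
    }
    END {
        for (w in count)
            if (count[w] >= min)
                print w, target, count[w]
    }' | sort -k3,3nr -k1,1
}
```

Called as `cooc dog 3 < corpus.txt`, this would print one "left-word dog count" line per pair that reaches the threshold, most frequent first. The counts live in an awk array keyed by the left-hand word, so memory grows with the number of distinct neighbours of the given word, not with the size of the corpus.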
 
