Grepping verbal forms from a large corpus
Post 302954830 by vgersh99 on Friday 11th of September 2015 11:02:31 AM
Something along these lines; I didn't check the output for complete correctness.
Maybe a smaller corpus file/line could help....
Run it as awk -f gim.awk gimPat.txt gimCorpus.txt, where gim.awk is:
Code:
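# First file (gimPat.txt): turn each pattern line into a regex
FNR==NR {
  # "word" stands for one token: a run of non-space characters
  gsub("word", "[^ ][^ ]*")
  pat[$0]
  next
}
# Second file (gimCorpus.txt): print every match of every pattern
{
  for(i in pat) {
    s=$0
    while (match(s,i)) {
      print substr(s,RSTART, RLENGTH)
      # continue after the match, skipping the following space
      s=substr(s,RSTART+RLENGTH+1)
    }
  }
}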

results in:
Code:
are_VBP trying_VB
can_MD be_VB used_VBN
will_MD be_VB extended_VBN
will_MD compete_VB
can_MD be_VB
will_MD allow_VB
can_MD access_VB
can_MD download_VB
will_MD be_VB
are_VBP
have_VBP
update_VBP
are_VBP
are_VBP trying_VBG
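
To see what the script does on a single line, here is a minimal sketch. The pattern line and the corpus line are made-up examples (the real gimPat.txt and gimCorpus.txt are not shown in this post), and the temp file names are hypothetical.

```shell
# Hypothetical one-line pattern file and corpus file, for illustration only.
tmp=$(mktemp -d)
printf '%s\n' 'word_MD word_VB' > "$tmp/pat.txt"
printf '%s\n' 'they_PRP will_MD compete_VB and_CC can_MD access_VB it_PRP' > "$tmp/corpus.txt"

# Same logic as gim.awk, inlined so the snippet is self-contained.
out=$(awk '
FNR==NR { gsub("word", "[^ ][^ ]*"); pat[$0]; next }   # pattern line -> regex
{
  for (i in pat) {
    s = $0
    while (match(s, i)) {
      print substr(s, RSTART, RLENGTH)        # the matched tag sequence
      s = substr(s, RSTART + RLENGTH + 1)     # resume after the match and one space
    }
  }
}' "$tmp/pat.txt" "$tmp/corpus.txt")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Here the pattern line becomes the regex [^ ][^ ]*_MD [^ ][^ ]*_VB, and the while loop picks up both will_MD compete_VB and can_MD access_VB from the same corpus line.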

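One thing that may be worth checking in the output: the regex is not anchored at the end, so a pattern ending in _VB can also match the start of a longer tag such as _VBG or _VBN, which might explain a line like are_VBP trying_VB. A possible variant, as a sketch only, requires a space or end of line after the last tag. The "( |$)" suffix is my assumption and not part of the script above; the sample lines are again made up.

```shell
# Hypothetical sample data: one real VBP+VB pair and one VBP+VBG pair.
tmp=$(mktemp -d)
printf '%s\n' 'word_VBP word_VB' > "$tmp/pat.txt"
printf '%s\n' 'they_PRP do_VBP know_VB and_CC are_VBP trying_VBG' > "$tmp/corpus.txt"

out=$(awk '
FNR==NR { gsub("word", "[^ ][^ ]*"); pat[$0]; next }
{
  for (i in pat) {
    s = $0
    # Assumption: the last tag must be followed by a space or end of line.
    while (match(s, i "( |$)")) {
      m = substr(s, RSTART, RLENGTH)
      sub(/ $/, "", m)                    # drop the trailing space, if one was matched
      print m
      s = substr(s, RSTART + RLENGTH)     # the space was already consumed by the match
    }
  }
}' "$tmp/pat.txt" "$tmp/corpus.txt")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With this sample, only do_VBP know_VB is printed; are_VBP trying_VBG is no longer reported as a truncated are_VBP trying_VB. Whether that is the behaviour you want depends on how the patterns in gimPat.txt are meant to be read.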