I want to extract verbal forms from a large corpus of English. I have identified a number of patterns. Each pattern has the following structure:
where
refers to the verbal form and
refers to the class of the verb
The categories are identified as per the Penn Treebank conventions. They can be concatenated to form larger patterns.
From a tagged corpus of English, I want to extract only the strings that conform to the patterns, leaving behind all residual data.
An example will make this clear:
I just need to extract
which conforms to the pattern
I am providing below a large pattern list which is not necessarily exhaustive
and a small corpus for testing purposes:
I hope I have made my requirement clear. I would like the patterns to be read from a file that lists them.
I work under Windows and have tried to solve the problem using regexes in Perl, but the problem seems to be beyond my skills. Any help, if possible with commented code, would be of great use to me in improving my skills. I am 65+ and believe that one is never too old to learn.
Incidentally, the analysed data will be provided to the community working on verbs in languages.
Many thanks for your help.
Something along these lines; I didn't check the output for complete correctness.
A smaller corpus file or line might help. Run it as: awk -f gim.awk gimPat.txt gimCorpus.txt where gim.awk is:
results in:
Many thanks. Could I test it on a large corpus and get back to you? I will need some time for that, but the test on a small corpus seems to work wonderfully.
Thanks once again. I tested it on a 28 MB file. I ran the script in batch mode to record the start and end times. It took around 9 minutes to execute and the results are perfect. Incidentally, this is a tool that can be used for generic pattern matching, provided the patterns are correctly specified, and I hope it will prove useful to others as well.
Thanks a lot.
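For readers who want to experiment with the idea discussed in this thread, here is a minimal sketch in Python. It is not the awk script from the reply above, and it rests on assumptions that the thread does not confirm: corpus tokens are taken to look like word_TAG and to be separated by spaces, and each line of the pattern file is taken to be a space-separated sequence of Penn Treebank tags. The function names (parse_patterns, split_token, extract_matches) and the sample sentence are made up for illustration.

```python
def parse_patterns(lines):
    """Turn each non-empty pattern line into a tuple of tags (assumed format)."""
    patterns = {tuple(line.split()) for line in lines if line.strip()}
    # Longest patterns first, so a longer match wins over its own prefix.
    return sorted(patterns, key=lambda p: (-len(p), p))


def split_token(token):
    """Split 'word_TAG' at the last underscore; untagged tokens get an empty tag."""
    word, sep, tag = token.rpartition("_")
    return (word, tag) if sep else (token, "")


def extract_matches(sentence, patterns):
    """Return (matched words, pattern) pairs found left to right in one tagged line."""
    tokens = [split_token(t) for t in sentence.split()]
    tags = [tag for _, tag in tokens]
    matches = []
    i = 0
    while i < len(tokens):
        for pattern in patterns:
            n = len(pattern)
            if tuple(tags[i:i + n]) == pattern:
                words = " ".join(word for word, _ in tokens[i:i + n])
                matches.append((words, " ".join(pattern)))
                i += n  # skip past the matched tokens
                break
        else:
            i += 1  # no pattern starts here; this token is residual data
    return matches


if __name__ == "__main__":
    # Hypothetical patterns and sentence, only to show the call sequence.
    patterns = parse_patterns(["VBZ VBN VBG", "VBZ VBG", "MD VB", "VBD"])
    line = "She_PRP has_VBZ been_VBN reading_VBG and_CC he_PRP will_MD go_VB ._."
    for words, pattern in extract_matches(line, patterns):
        print(f"{words}\t{pattern}")
```

Trying the longest patterns first is a design choice of this sketch: it keeps a three-tag pattern from being reported as its two-tag prefix. For a real corpus, the pattern lines and corpus lines would be read from files rather than written inline, and the token format would need to be adjusted to match the actual tagging.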