Grepping verbal forms from a large corpus


 
# 1  
Old 09-11-2015

I want to extract verbal forms from a large corpus of English. I have identified a number of patterns, each with the following structure:
Code:
SPACE word_CATEGORY

where
Code:
word

refers to the verbal form and
Code:
CATEGORY

refers to the class of the verb.
The categories follow the Penn Treebank tag set. These word_CATEGORY units can be concatenated to form larger patterns.
From a tagged corpus of English I want to extract only the strings that conform to the patterns, leaving behind all residual data.
An example will make this clear:
Code:
The_DT Prime_NNP Minister_NNP may_MD not_RB have_VB travelled_VBN to_TO Europe_NNP ._.

I just need to extract
Code:
 may_MD not_RB have_VB travelled_VBN

which conforms to the pattern
Code:
word_MD word_RB word_VB word_VBN
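
The correspondence between this pattern and the extracted string can be sketched as a regular expression. This is only a minimal illustration, assuming tokens are separated by single spaces and that word stands for any run of non-space characters; the function name extract_example is hypothetical.

```shell
# Hypothetical helper: print every match of word_MD word_RB word_VB word_VBN
# found in the lines read from standard input.
extract_example() {
  awk '
  BEGIN {
    w  = "[^ ][^ ]*"                               # "word": one or more non-space characters
    re = " " w "_MD " w "_RB " w "_VB " w "_VBN"   # leading space, as in the pattern
  }
  {
    s = $0
    while (match(s, re)) {                         # report each successive match on the line
      print substr(s, RSTART, RLENGTH)
      s = substr(s, RSTART + RLENGTH)              # continue right after the match
    }
  }'
}
```

Piping the example sentence into extract_example prints the verbal group with its leading space and nothing else.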

Below is a large pattern list, which is not necessarily exhaustive:
Code:
 word_VBP word_RB word_VBG
 to_TO word_VB word_VBN 
 to_TO word_VB word_VBZ word_JJ
 was_VBD word_RB word_VBG
 word_MD word_RB be_VB word_VBG
 word_MD word_RB word_VB
 word_MD word_RB word_VB word_VBN
 word_MD word_VB
 word_MD word_VB word_RB word_VBN
 word_MD word_VB word_VBG
 word_MD word_VB word_VBN
 word_RB word_VBP
 word_VB to_TO word_VB
 word_VBD
 word_VBD word_RB word_VB
 word_VBD word_RB word_VBN
 word_VBD word_VBG
 word_VBD word_VBN
 word_VBP
 word_VBP word_RB word_RB word_VB
 word_VBP word_RB word_RB word_VBN
 word_VBP word_RB word_VB
 word_VBP word_RB word_VBN
 word_VBP word_VB
 word_VBP word_VBG
 word_VBP word_VBN

and a small corpus for testing purposes:
Code:
Google_NNP has_VBZ made_VBN its_PRP$ mobile_JJ payments_NNS system_NN ,_, Android_NNP Pay_VB ,_, available_JJ at_IN more_JJR than_IN one_CD million_CD locations_NNS in_IN the_DT United_NNP States_NNPS ._. The_DT tap-to-pay_JJ system_NN will_MD compete_VB with_IN Apple_NNP Pay_VB in_IN the_DT burgeoning_VBG mobile_JJ payments_NNS market_NN ._. The_DT market_NN is_VBZ estimated_VBN to_TO be_VB worth_JJ $_$ 1tn_CD -LRB-_-LRB- #_# 650bn_CD -RRB-_-RRB- in_IN 2017_CD ._. Technology_NN companies_NNS are_VBP trying_VBG to_TO convince_VB shoppers_NNS to_TO use_VB their_PRP$ handsets_NNS ,_, rather_RB than_IN plastic_JJ cards_NNS ,_, to_TO pay_VB for_IN purchases_NNS ._. Android_NNP Pay_VB can_MD be_VB used_VBN with_IN smartphones_NNS that_WDT have_VBP near-field_JJ communication_NN -LRB-_-LRB- NFC_NN -RRB-_-RRB- capability_NN and_CC Google_NNP 's_POS KitKat_NNP 4.4_CD +_CC operating_VBG system_NN ._. It_PRP will_MD allow_VB users_NNS to_TO store_VB their_PRP$ credit_NN card_NN details_NNS on_IN their_PRP$ phones_NNS ,_, as_RB well_RB as_IN loyalty_NN cards_NNS and_CC other_JJ data_NNS ._. Existing_VBG users_NNS of_IN the_DT Google_NNP Wallet_NNP app_NN can_MD access_VB Android_NNP Pay_VB through_IN an_DT update_VBP ,_, while_IN new_JJ users_NNS can_MD download_VB it_PRP from_IN the_DT Google_NNP Play_NNP app_NN store_NN in_IN the_DT coming_JJ days_NNS ._. Retailers_NNS including_VBG Macy_NNP 's_POS ,_, Bloomingdale_NNP 's_POS and_CC Subway_NNP are_VBP among_IN the_DT first_JJ to_TO participate_VB in_IN Android_NNP Pay_VB ,_, with_IN more_JJR to_TO come_VB ._. It_PRP will_MD be_VB extended_VBN to_TO mobile_JJ checkouts_NNS in_IN some_DT apps_NNS later_RB this_DT year_NN ._.

I hope I have made my requirement clear. I would like the patterns to be read from a file that lists them.
I work under Windows and have tried to solve the problem using regexes in Perl, but it seems to be beyond my skills. Any help, if possible with commented code, would be of great use to me in improving my skills. I am 65+ and believe that one is never too old to learn.
Incidentally, the analysed data will be provided to the community working on verbs in languages.
Many thanks for your help.
# 2  
Old 09-11-2015
Something along these lines; I didn't check the output for complete correctness.
A smaller corpus file or line might help.
Run awk -f gim.awk gimPat.txt gimCorpus.txt, where gim.awk is:
Code:
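# First file (FNR==NR): the pattern list, one pattern per line.
FNR==NR {
  if ($0 ~ /^ *$/) next              # skip blank lines: an empty regex matches everywhere
  gsub("word", "[^ ][^ ]*")          # "word" stands for any run of non-space characters
  pat[$0]                            # store the resulting regex as an array key
  next
}
# Second file: the tagged corpus.
{
  for(i in pat) {                    # try every pattern (in no particular order)
    s=$0
    while (match(s,i)) {             # report each successive match on this line
      print substr(s,RSTART, RLENGTH)
      s=substr(s,RSTART+RLENGTH)     # resume right after the match, keeping the next token's leading space
    }
  }
}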

results in:
Code:
are_VBP trying_VB
can_MD be_VB used_VBN
will_MD be_VB extended_VBN
will_MD compete_VB
can_MD be_VB
will_MD allow_VB
can_MD access_VB
can_MD download_VB
will_MD be_VB
are_VBP
have_VBP
update_VBP
are_VBP
are_VBP trying_VBG
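
Note that are_VBP trying_VB in the output is a truncated copy of are_VBP trying_VBG: the regex built from word_VBP word_VB is not anchored at the end of the token, so _VB also matches the start of _VBG. Below is a minimal sketch of one way to pin both token edges, assuming single-space separators. The function name extract_verbs and the space-padding trick are illustrative assumptions, not part of the posted script.

```shell
# Hypothetical helper: extract_verbs PATTERN_FILE CORPUS_FILE
extract_verbs() {
  awk '
  FNR == NR {                                  # first file: the pattern list
    if ($0 ~ /^ *$/) next                      # ignore blank lines
    p = $0
    sub(/^ +/, "", p); sub(/ +$/, "", p)       # trim stray spaces around the pattern
    gsub(/word/, "[^ ][^ ]*", p)               # "word" = any run of non-space characters
    pat[" " p " "]                             # a space on both sides pins the token edges
    next
  }
  {                                            # second file: the tagged corpus
    for (re in pat) {
      s = " " $0 " "                           # pad so the first and last tokens can match
      while (match(s, re)) {
        print substr(s, RSTART + 1, RLENGTH - 2)   # drop the two surrounding spaces
        s = substr(s, RSTART + RLENGTH - 1)        # keep the closing space for the next match
      }
    }
  }' "$1" "$2"
}
```

With this variant word_VBP word_VB no longer fires on are_VBP trying_VBG, while word_VBP word_VBG still does. Shorter patterns that are complete sub-sequences of longer ones (word_MD word_VB inside word_MD word_VB word_VBN, for instance) are still reported separately, as before.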

This User Gave Thanks to vgersh99 For This Post:
# 3  
Old 09-11-2015
Many thanks. Could I test it on a large corpus and get back to you? I will need some time for that, but the test on the small corpus seems to work wonderfully.
# 4  
Old 09-11-2015
Let us know how the testing goes...
# 5  
Old 09-11-2015
Thanks once again. I tested it on a 28 MB file. I ran the script in batch mode to record the start and end times. It took around 9 minutes to execute and the results are perfect. Incidentally, this is a tool which can be used for generic pattern matching, provided the patterns are specified correctly, and I hope it will prove useful to others also.
Thanks a lot.
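
One caveat when reusing the script with other pattern lists: each pattern line is used as a regular expression, so a tag that contains a metacharacter is not treated literally. In the test corpus this applies to PRP$ (the $ is an end-of-line anchor) and to the . tag (a dot matches any character). A rough sketch of a variant that protects these two characters follows. It assumes that . and $ are the only metacharacters occurring in the tags, and the name extract_literal_tags is hypothetical.

```shell
# Hypothetical helper: extract_literal_tags PATTERN_FILE CORPUS_FILE
extract_literal_tags() {
  awk '
  FNR == NR {                          # first file: the pattern list
    p = $0
    gsub(/[.]/, "[.]", p)              # a literal dot, as in the ._. token
    gsub(/[$]/, "[$]", p)              # a literal dollar sign, as in the PRP$ tag
    gsub(/word/, "[^ ][^ ]*", p)       # placeholder substitution comes last
    if (p !~ /^ *$/) pat[p]            # ignore blank pattern lines
    next
  }
  {                                    # second file: the tagged corpus
    for (re in pat) {
      s = $0
      while (match(s, re)) {
        print substr(s, RSTART, RLENGTH)
        s = substr(s, RSTART + RLENGTH)
      }
    }
  }' "$1" "$2"
}
```

Wrapping each character in a bracket expression avoids backslash escapes in the gsub replacement string, whose handling differs between awk implementations.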
This User Gave Thanks to gimley For This Post: