Grepping verbal forms from a large corpus

09-11-2015

Registered User

Join Date: Feb 2011

Last Activity: 3 March 2020, 10:05 PM EST

Posts: 303

Thanks Given: 145

Thanked 4 Times in 4 Posts

I want to extract verbal forms from a large corpus of English. I have identified a number of patterns. Each pattern has the following structure:

Code:

SPACE word_CATEGORY

where word refers to the verbal form and CATEGORY refers to the class of the verb. The categories follow the Penn Treebank tagset. These units can be concatenated to form larger patterns.
From a tagged corpus of English I want to extract only the strings that conform to the patterns, leaving all residual data behind.
An example will make this clear:

Code:

The_DT Prime_NNP Minister_NNP may_MD not_RB have_VB travelled_VBN to_TO Europe_NNP ._.

I just need to extract

Code:

 may_MD not_RB have_VB travelled_VBN

which conforms to the pattern

Code:

word_MD word_RB word_VB word_VBN
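
To make the notation concrete, here is a minimal sketch of how one pattern could be turned into a regular expression and applied to the example sentence. It assumes a POSIX shell with awk available, and the function name extract_one is hypothetical. Every occurrence of the placeholder word is replaced by [^ ]+ (one or more non-space characters), so the pattern above, written with its leading space, becomes " [^ ]+_MD [^ ]+_RB [^ ]+_VB [^ ]+_VBN".

```shell
# Hypothetical helper: apply ONE pattern to ONE tagged sentence.
#   $1 = pattern, e.g. " word_MD word_VB"   $2 = tagged sentence
extract_one() {
  printf '%s\n' "$2" | awk -v pat="$1" '
    BEGIN {
      # The placeholder "word" becomes "one or more non-space characters".
      # Caveat: a literal token containing "word" (e.g. sword_NN) is rewritten too.
      gsub(/word/, "[^ ]+", pat)
    }
    # Print the leftmost match on the line, if there is one.
    match($0, pat) { print substr($0, RSTART, RLENGTH) }
  '
}

extract_one " word_MD word_RB word_VB word_VBN" \
  "The_DT Prime_NNP Minister_NNP may_MD not_RB have_VB travelled_VBN to_TO Europe_NNP ._."
```

The call prints the string from the example, including the leading space that belongs to the pattern.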

Below is a larger list of patterns, which is not necessarily exhaustive:

Code:

 word_VBP word_RB word_VBG
 to_TO word_VB word_VBN 
 to_TO word_VB word_VBZ word_JJ
 was_VBD word_RB word_VBG
 word_MD word_RB be_VB word_VBG
 word_MD word_RB word_VB
 word_MD word_RB word_VB word_VBN
 word_MD word_VB
 word_MD word_VB word_RB word_VBN
 word_MD word_VB word_VBG
 word_MD word_VB word_VBN
 word_RB word_VBP
 word_VB to_TO word_VB
 word_VBD
 word_VBD word_RB word_VB
 word_VBD word_RB word_VBN
 word_VBD word_VBG
 word_VBD word_VBN
 word_VBP
 word_VBP word_RB word_RB word_VB
 word_VBP word_RB word_RB word_VBN
 word_VBP word_RB word_VB
 word_VBP word_RB word_VBN
 word_VBP word_VB
 word_VBP word_VBG
 word_VBP word_VBN

and a small corpus for testing purposes:

Code:

Google_NNP has_VBZ made_VBN its_PRP$ mobile_JJ payments_NNS system_NN ,_, Android_NNP Pay_VB ,_, available_JJ at_IN more_JJR than_IN one_CD million_CD locations_NNS in_IN the_DT United_NNP States_NNPS ._. The_DT tap-to-pay_JJ system_NN will_MD compete_VB with_IN Apple_NNP Pay_VB in_IN the_DT burgeoning_VBG mobile_JJ payments_NNS market_NN ._. The_DT market_NN is_VBZ estimated_VBN to_TO be_VB worth_JJ $_$ 1tn_CD -LRB-_-LRB- #_# 650bn_CD -RRB-_-RRB- in_IN 2017_CD ._. Technology_NN companies_NNS are_VBP trying_VBG to_TO convince_VB shoppers_NNS to_TO use_VB their_PRP$ handsets_NNS ,_, rather_RB than_IN plastic_JJ cards_NNS ,_, to_TO pay_VB for_IN purchases_NNS ._. Android_NNP Pay_VB can_MD be_VB used_VBN with_IN smartphones_NNS that_WDT have_VBP near-field_JJ communication_NN -LRB-_-LRB- NFC_NN -RRB-_-RRB- capability_NN and_CC Google_NNP 's_POS KitKat_NNP 4.4_CD +_CC operating_VBG system_NN ._. It_PRP will_MD allow_VB users_NNS to_TO store_VB their_PRP$ credit_NN card_NN details_NNS on_IN their_PRP$ phones_NNS ,_, as_RB well_RB as_IN loyalty_NN cards_NNS and_CC other_JJ data_NNS ._. Existing_VBG users_NNS of_IN the_DT Google_NNP Wallet_NNP app_NN can_MD access_VB Android_NNP Pay_VB through_IN an_DT update_VBP ,_, while_IN new_JJ users_NNS can_MD download_VB it_PRP from_IN the_DT Google_NNP Play_NNP app_NN store_NN in_IN the_DT coming_JJ days_NNS ._. Retailers_NNS including_VBG Macy_NNP 's_POS ,_, Bloomingdale_NNP 's_POS and_CC Subway_NNP are_VBP among_IN the_DT first_JJ to_TO participate_VB in_IN Android_NNP Pay_VB ,_, with_IN more_JJR to_TO come_VB ._. It_PRP will_MD be_VB extended_VBN to_TO mobile_JJ checkouts_NNS in_IN some_DT apps_NNS later_RB this_DT year_NN ._.

I hope my requirement is clear. I would like the patterns to be read from a file that lists them.
I work under Windows and have tried to solve the problem using regexes in Perl, but it seems to be beyond my skills. Any help, if possible with commented code, would be of great use to me in improving my skills. I am 65+ and believe that one is never too old to learn.
Incidentally, the analysed data will be provided to the community working on verbs in languages.
Many thanks for your help.

gimley

09-11-2015

Moderator

Join Date: Feb 2005

Last Activity: 23 August 2021, 11:26 AM EDT

Location: Foxborough, MA

Posts: 8,825

Thanks Given: 579

Thanked 1,112 Times in 1,003 Posts

Something along these lines; I didn't check the output for complete correctness.
Maybe a smaller corpus file/line could help.
Run awk -f gim.awk gimPat.txt gimCorpus.txt, where gim.awk is:

Code:

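# First file (the pattern list): turn each pattern into a regular expression.
FNR==NR {
  gsub("word", "[^ ][^ ]*")   # "word" stands for any run of non-space characters
  pat[$0]                     # store the regex as an array key
  next
}
# Second file (the corpus): try every pattern against every line.
{
  for(i in pat) {             # awk does not guarantee the iteration order
    s=$0
    while (match(s,i)) {      # leftmost match of pattern i in the rest of the line
      print substr(s,RSTART, RLENGTH)
      s=substr(s,RSTART+RLENGTH)   # resume right after the match, so that the
                                   # space leading the next token is still there
    }
  }
}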

results in:

Code:

are_VBP trying_VB
can_MD be_VB used_VBN
will_MD be_VB extended_VBN
will_MD compete_VB
can_MD be_VB
will_MD allow_VB
can_MD access_VB
can_MD download_VB
will_MD be_VB
are_VBP
have_VBP
update_VBP
are_VBP
are_VBP trying_VBG
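
The first line of that listing, are_VBP trying_VB, is cut off in the middle of the tag VBG. The pattern word_VBP word_VB is not anchored at its end, so it also matches the beginning of trying_VBG. The following is a minimal sketch of one way to avoid this, assuming the same two input files as above. It is a variation on the script above, not a tested replacement for it, and the function name extract_anchored is hypothetical. Each pattern and each corpus line is padded with a space on both sides, so a match has to begin and end at a token boundary.

```shell
# Hypothetical variant: matches must begin and end at a token boundary.
#   $1 = pattern file   $2 = tagged corpus
extract_anchored() {
  awk '
    # First file: build one regex per pattern, wrapped in single spaces.
    FNR == NR {
      sub(/^ +/, ""); sub(/ +$/, "")     # drop leading/trailing spaces
      if ($0 == "") next                 # skip empty lines
      gsub(/word/, "[^ ]+")              # "word" = any run of non-space characters
      pat[" " $0 " "]
      next
    }
    # Second file: pad the line so the first and last tokens have a boundary too.
    {
      for (re in pat) {                  # iteration order is unspecified in awk
        s = " " $0 " "
        while (match(s, re)) {
          print substr(s, RSTART + 1, RLENGTH - 2)   # strip the two padding spaces
          s = substr(s, RSTART + RLENGTH - 1)        # keep the closing space for the next search
        }
      }
    }
  ' "$1" "$2"
}
```

It would be called as extract_anchored gimPat.txt gimCorpus.txt. Tags that contain regular-expression metacharacters, such as PRP$ in the test corpus, would need escaping if they were ever used in a pattern.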

vgersh99

09-11-2015

Many thanks. Could I test it on a large corpus and get back to you? I will need some time for that, but the test on the small corpus seems to work wonderfully.

gimley

09-11-2015

Let us know how the testing goes...

vgersh99

09-11-2015

Thanks once again. I tested it on a 28 MB file. I ran the script in batch mode to record the start and end times. It took around 9 minutes to execute and the results are perfect. Incidentally, this tool can be used for generic pattern matching, provided the patterns are correctly specified, and I hope it will prove useful to others as well.
Thanks a lot.
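
Since run time on large files came up, here is a hedged sketch of one possible variation: all patterns are joined into a single alternation, so each corpus line is scanned once rather than once per pattern. It has not been timed, so any speed-up is an assumption. It also changes the output. awk's match() returns the leftmost-longest match, so only the longest pattern at each position is reported (can_MD be_VB used_VBN, but not also can_MD be_VB), and matches come out in corpus order. The function name extract_longest is hypothetical.

```shell
# Hypothetical variant: one combined regex, longest match at each position.
#   $1 = pattern file   $2 = tagged corpus
extract_longest() {
  awk '
    # First file: join all patterns into one alternation, padded with spaces.
    FNR == NR {
      sub(/^ +/, ""); sub(/ +$/, "")     # drop leading/trailing spaces
      if ($0 == "") next                 # skip empty lines
      gsub(/word/, "[^ ]+")              # "word" = any run of non-space characters
      alt = (alt == "" ? $0 : alt "|" $0)
      re = " (" alt ") "
      next
    }
    # Second file: scan each padded line once with the combined regex.
    {
      s = " " $0 " "
      while (re != "" && match(s, re)) {
        print substr(s, RSTART + 1, RLENGTH - 2)   # strip the two padding spaces
        s = substr(s, RSTART + RLENGTH - 1)        # keep the closing space for the next search
      }
    }
  ' "$1" "$2"
}
```

Whether the shorter nested matches are wanted as well is not stated in the thread; if they are, the per-pattern loop remains the better fit.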

gimley
