awk or perl script for preposition splitter
# 1  
Old 03-20-2015

Hello,
I am writing a natural language parser, and one of the tools I need is a splitter that separates prepositional phrases, which begin with a preposition. I have a long list of such preposition markers (a sample is given below) and am looking for a script in awk or Perl that reads these prepositions from a look-up file and splits the text before each one.
An example follows. The text below has been tagged by a language parser:
Code:
[ There_EX could_MD be_VB more_RBR  casualties_NNS in_IN the_DT mishap_NN ,_, ''_null]

The expected output would be
Code:
[ There_EX could_MD be_VB more_RBR  casualties_NNS]
[ in_IN the_DT mishap_NN ,_, ''_null]

A preposition should trigger a split only when it is preceded by a word carrying one of these tags
Code:
NN
NNS
NNP
followed by space

as in the example above.
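The rule above can be sketched as a single substitution. This is only an illustration under stated assumptions: the helper name `break_before_markers` is made up for this example, the markers are passed in directly rather than read from a file, and "preceded by" is taken to mean "immediately preceded by".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: close the current bracket and open a new one wherever
# a token tagged NN, NNS or NNP is followed by whitespace and then a marker.
sub break_before_markers {
    my ($text, @markers) = @_;
    return $text unless @markers;

    # Build an alternation such as "in_IN|to_IN"; quotemeta guards any
    # regex metacharacters that a marker might contain.
    my $alt = join '|', map { quotemeta } @markers;

    # (_NN[SP]?) followed by whitespace pins the tag to the end of a token.
    # The lookahead requires a complete marker token right after the gap.
    $text =~ s/(_NN[SP]?)\s+(?=(?:$alt)(?:\s|\]|$))/${1}]\n[ /g;
    return $text;
}

print break_before_markers(
    "[ There_EX could_MD be_VB more_RBR  casualties_NNS in_IN the_DT mishap_NN ,_, ''_null]",
    'in_IN', 'to_IN',
), "\n";
```

A marker that follows any other tag (for example `be_VB in_IN`) is left in place, because the substitution only fires after `_NN`, `_NNS` or `_NNP`.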
A sample list of the preposition markers is given below:
Code:
to_IN
in_IN
towards_IN
across_IN
for_IN
into_IN
up to _IN

Many thanks in advance for any help. Commented code would help even more, so that I can see how to read from a list and insert a new line when the condition is met.

Last edited by zaxxon; 03-20-2015 at 07:13 AM.. Reason: code tag mismatch
# 2  
Old 03-20-2015
Code:
$ cat data.txt
[ There_EX could_MD be_VB more_RBR  casualties_NNS in_IN the_DT mishap_NN ,_, ''_null]
[ Some other text here_NNP towards_IN  mary had_NN upto_IN a_NNS across_IN little lamb_NNP nope_IN a false alarm.]

$ cat preposition_splitter.pl
#!/usr/bin/perl
use strict;
use warnings;

# We set up a hash to store the precedence tags. Hashes allow fast
# lookup as we iterate through the words. We look for words whose tag
# (the part after the last underscore) is a key in the "precedence" hash.
my %precedence = ( 'NN'  => 1,
                   'NNS' => 1,
                   'NNP' => 1
                 );

# And another hash for the preposition markers. Same idea as above.
my %marker = (  'to_IN'      => 1,
                'in_IN'      => 1,
                'towards_IN' => 1,
                'across_IN'  => 1,
                'for_IN'     => 1,
                'into_IN'    => 1,
                'upto_IN'    => 1
             );

# Set up the data file
my $file = "data.txt";

# String variable to hold the tokens until we reach a "potential" newline
my $str = "";

# Marker to be set if we reach a word that should potentially be followed
# by a newline
my $potential_nl = 0;

# Open the file, loop through each line and, within each line,
# loop through each word.
open (my $fh, "<", $file) or die "Can't open $file: $!";
while (<$fh>) {
    # Remove the EOL character
    chomp;
    # A note about what we mean by a "word". Perl's "\w" metacharacter matches
    # letters, digits and the underscore. If the data has other characters
    # besides these, we add them to our character class. Hence the following
    # characters have been added to our definition of a word:
    # "]", "[", ",", "'", "."
    while (/\s*([\w\]\[,'\.]+)\s*/g) {
        my $word = $1;
        # The suffix is whatever follows the last underscore in the word
        my $suffix;
        ($suffix = $word) =~ s/^.*_//;

        # If the suffix was found in our precedence hash, then
        # we set the potential newline marker
        if (defined $precedence{$suffix}) { $potential_nl = 1 }

        # If the newline marker was set in an earlier iteration and
        # the current word is present in the "marker" hash, then
        # (a)  we print the string, adding brackets at beginning/end if needed
        # (b)  flush the string $str and reset the newline marker
        if ($potential_nl and defined $marker{$word}) {
            if ( $str !~ m/^\s*\[/ ) { $str = "[".$str }
            if ( $str !~ m/\s*\]$/ ) { $str .= "]" }
            print $str, "\n";
            $str = "";
            $potential_nl = 0;
        } elsif ($word =~ m/\]$/) {  # do the same thing if we reach the end of line
            $str .= $word;
            if ( $str !~ m/^\s*\[/ ) { $str = "[".$str }
            if ( $str !~ m/\s*\]$/ ) { $str .= "]" }
            print $str, "\n";
            $str = "";
            $potential_nl = 0;
            next;
        }
        $str .= $word." ";
    }
}
close ($fh) or die "Can't close $file: $!";

$ perl preposition_splitter.pl
[ There_EX could_MD be_VB more_RBR casualties_NNS ]
[in_IN the_DT mishap_NN ,_, ''_null]
[ Some other text here_NNP ]
[towards_IN mary had_NN ]
[upto_IN a_NNS ]
[across_IN little lamb_NNP nope_IN a false alarm.]


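The script above hard-codes the markers, while the question asks for a look-up file. A minimal sketch of that variant follows. The names `load_markers` and `split_line` are made up for this example, and the marker list is read from an in-memory string so that the snippet runs on its own; in practice the filehandle would be opened on the real look-up file, one marker per line. It also assumes that a marker must directly follow the noun-tagged word and that every marker is a single token, so a multi-word entry such as "up to _IN" would need extra handling.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read one marker per line (e.g. "in_IN") into a hash.
sub load_markers {
    my ($fh) = @_;
    my %marker;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;     # trim surrounding whitespace and newline
        next if $line eq '';         # skip blank lines
        $marker{$line} = 1;
    }
    return \%marker;
}

# Hypothetical helper: split one bracketed line before every marker that
# directly follows a token tagged NN, NNS or NNP. Returns the pieces,
# each wrapped in its own brackets.
sub split_line {
    my ($line, $marker) = @_;
    (my $body = $line) =~ s/^\s*\[\s*|\s*\]\s*$//g;   # drop the outer brackets
    my (@chunks, @current);
    my $prev = '';
    for my $tok (split ' ', $body) {
        if (@current && $marker->{$tok} && $prev =~ /_(?:NN|NNS|NNP)$/) {
            push @chunks, '[ ' . join(' ', @current) . ']';
            @current = ();
        }
        push @current, $tok;
        $prev = $tok;
    }
    push @chunks, '[ ' . join(' ', @current) . ']' if @current;
    return @chunks;
}

# Stand-in for the look-up file (assumption: one marker per line).
my $list = "to_IN\nin_IN\ntowards_IN\nacross_IN\nfor_IN\ninto_IN\n";
open my $list_fh, '<', \$list or die "Can't open marker list: $!";
my $markers = load_markers($list_fh);
close $list_fh;

my $sample = "[ There_EX could_MD be_VB more_RBR  casualties_NNS in_IN the_DT mishap_NN ,_, ''_null]";
print "$_\n" for split_line($sample, $markers);
```

Because the tokens are re-joined with single spaces, runs of whitespace in the input are collapsed in the output. Unlike the script above, the noun check here applies only to the token immediately before the marker.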
This User Gave Thanks to durden_tyler For This Post:
# 3  
Old 03-20-2015
Many thanks for your ever-so-helpful script, which works perfectly. I am sorry for the delay in responding, but my server was down all evening and only came back up this morning.