Sentence delimiter in perl: modifications needed


 
Old 03-03-2015

Hello,
I found this Perl script on the EuroParl website; it does sentence splitting.

Code:
#!/usr/bin/perl -w

# Based on Preprocessor written by Philipp Koehn

binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

use FindBin qw($Bin);
use strict;

my $mydir = "$Bin/nonbreaking_prefixes";

my %NONBREAKING_PREFIX = ();
my $language = "en";
my $QUIET = 0;
my $HELP = 0;

while (@ARGV) {
	$_ = shift;
	/^-l$/ && ($language = shift, next);
	/^-q$/ && ($QUIET = 1, next);
	/^-h$/ && ($HELP = 1, next);
}

if ($HELP) {
    print "Usage ./split-sentences.perl (-l [en|de|...]) < textfile > splitfile\n";
	exit;
}
if (!$QUIET) {
	print STDERR "Sentence Splitter v3\n";
	print STDERR "Language: $language\n";
}

my $prefixfile = "$mydir/nonbreaking_prefix.$language";

#default back to English if we don't have a language-specific prefix file
if (!(-e $prefixfile)) {
	$prefixfile = "$mydir/nonbreaking_prefix.en";
	print STDERR "WARNING: No known abbreviations for language '$language', attempting fall-back to English version...\n";
	die ("ERROR: No abbreviations files found in $mydir\n") unless (-e $prefixfile);
}

if (-e "$prefixfile") {
	open(PREFIX, "<:utf8", "$prefixfile") or die("ERROR: Cannot open $prefixfile: $!\n");
	while (<PREFIX>) {
		my $item = $_;
		chomp($item);
		if (($item) && (substr($item,0,1) ne "#")) {
			if ($item =~ /(.*)[\s]+(\#NUMERIC_ONLY\#)/) {
				$NONBREAKING_PREFIX{$1} = 2;
			} else {
				$NONBREAKING_PREFIX{$item} = 1;
			}
		}
	}
	close(PREFIX);
}

##loop text, add lines together until we get a blank line or a <p>
my $text = "";
while(<STDIN>) {
	chomp;	#strip only the line ending; chop would eat a real character on an unterminated last line
	if (/^<.+>$/ || /^\s*$/) {
		#time to process this block, we've hit a blank or <p>
		&do_it_for($text,$_);
		print "<P>\n" if (/^\s*$/ && $text); ##if we have text followed by <P>
		$text = "";
	}
	else {
		#append the text, with a space
		$text .= $_. " ";
	}
}
#do the leftover text
&do_it_for($text,"") if $text;


sub do_it_for {
	my($text,$markup) = @_;
	print &preprocess($text) if $text;
	print "$markup\n" if ($markup =~ /^<.+>$/);
	#chop($text);
}

sub preprocess {
	#this is one paragraph
	my($text) = @_;
	
	# clean up spaces at head and tail of each line as well as any double-spacing
	$text =~ s/ +/ /g;
	$text =~ s/\n /\n/g;
	$text =~ s/ \n/\n/g;
	$text =~ s/^ //g;
	$text =~ s/ $//g;
	
	#####add sentence breaks as needed#####
	
	#non-period end of sentence markers (?!) followed by sentence starters.
	$text =~ s/([?!]) +([\'\"\(\[\¿\¡\p{IsPi}]*[\p{IsUpper}])/$1\n$2/g;
		
	#multi-dots followed by sentence starters
	$text =~ s/(\.[\.]+) +([\'\"\(\[\¿\¡\p{IsPi}]*[\p{IsUpper}])/$1\n$2/g;
	
	# add breaks for sentences that end with some sort of punctuation inside a quote or parenthetical and are followed by a possible sentence starter punctuation and upper case
	$text =~ s/([?!\.][\ ]*[\'\"\)\]\p{IsPf}]+) +([\'\"\(\[\¿\¡\p{IsPi}]*[\ ]*[\p{IsUpper}])/$1\n$2/g;
		
	# add breaks for sentences that end with some sort of punctuation are followed by a sentence starter punctuation and upper case
	$text =~ s/([?!\.]) +([\'\"\(\[\¿\¡\p{IsPi}]+[\ ]*[\p{IsUpper}])/$1\n$2/g;
	
	# special punctuation cases are covered. Check all remaining periods.
	my $word;
	my $i;
	my @words = split(/ /,$text);
	$text = "";
	for ($i=0;$i<(scalar(@words)-1);$i++) {
		if ($words[$i] =~ /([\p{IsAlnum}\.\-]*)([\'\"\)\]\%\p{IsPf}]*)(\.+)$/) {
			#check if $1 is a known honorific and $2 is empty, never break
			my $prefix = $1;
			my $starting_punct = $2;
			if($prefix && $NONBREAKING_PREFIX{$prefix} && $NONBREAKING_PREFIX{$prefix} == 1 && !$starting_punct) {
				#not breaking;
			} elsif ($words[$i] =~ /(\.)[\p{IsUpper}\-]+(\.+)$/) {
				#not breaking - upper case acronym	
			} elsif($words[$i+1] =~ /^([ ]*[\'\"\(\[\¿\¡\p{IsPi}]*[ ]*[\p{IsUpper}0-9])/) {
				#the next word has a bunch of initial quotes, maybe a space, then either upper case or a number
				$words[$i] = $words[$i]."\n" unless ($prefix && $NONBREAKING_PREFIX{$prefix} && $NONBREAKING_PREFIX{$prefix} == 2 && !$starting_punct && ($words[$i+1] =~ /^[0-9]+/));
				#we always add a return for these unless we have a numeric non-breaker and a number start
			}
			
		}
		$text = $text.$words[$i]." ";
	}
	
	#we stopped one token from the end to allow for easy look-ahead. Append it now.
	$text = $text.$words[$i];
	
	# clean up spaces at head and tail of each line as well as any double-spacing
	$text =~ s/ +/ /g;
	$text =~ s/\n /\n/g;
	$text =~ s/ \n/\n/g;
	$text =~ s/^ //g;
	$text =~ s/ $//g;
	
	#add trailing break
	$text .= "\n" unless $text =~ /\n$/;
	
	return $text;
	
}

The script reads a language-specific prefix file (attached as a zipped file) from a separate folder, nonbreaking_prefixes, and then splits the text into sentences accurately.
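For readers who do not have the attachment, the format of a prefix file can be inferred from the loading loop in the script: one prefix per line, lines starting with # are comments, and a prefix followed by #NUMERIC_ONLY# is stored with the value 2 (do not break only when a number follows) instead of 1 (never break). A minimal sketch of that parsing, with a hypothetical sub name and illustrative sample entries that are not copied from the attached file:

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the script's loading loop.
# Value 1 = never break after this prefix,
# value 2 = do not break only when the next word starts with a number.
sub parse_prefix_lines {
    my (@lines) = @_;
    my %prefix;
    for my $item (@lines) {
        chomp $item;
        # skip blank lines and comment lines
        next if $item eq "" || substr($item, 0, 1) eq "#";
        if ($item =~ /^(.*?)\s+\#NUMERIC_ONLY\#/) {
            $prefix{$1} = 2;
        } else {
            $prefix{$item} = 1;
        }
    }
    return %prefix;
}

# Sample entries are assumptions made up for this illustration.
my %sample = parse_prefix_lines("# honorifics\n", "Dr\n", "No #NUMERIC_ONLY#\n", "\n");
print "$_ => $sample{$_}\n" for sort keys %sample;
```

This is why "Dr." in the example text is not treated as the end of a sentence, provided "Dr" is listed in the prefix file for the chosen language.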
However there are two issues which need to be solved.

a. In quite a few corpora (especially news corpora), the full stop is inadvertently omitted and a sentence ends with just a hard return, as in the example below.

Code:
The easily accessible drug was being widely used by school and college students as revealed by Dr. Yusuf Merchant
With the ban on the drug, Dr Yusuf Merchant, the man who first brought it to public notice has heaved a sigh of relief
Dr Merchant had filed a Public Interest Litigation (PIL) asking for a ban on the drug.

In that case the script, finding no full stop, joins the lines into one running sentence instead of keeping them separate.
Code:
The easily accessible drug was being widely used by school and college students as revealed by Dr. Yusuf Merchant With the ban on the drug, Dr Yusuf Merchant, the man who first brought it to public notice has heaved a sigh of relief Dr Merchant had filed a Public Interest Litigation (PIL) asking for a ban on the drug.

How do I make Perl treat a hard return as a sentence delimiter in the script? I have tried inserting the hex values of a hard return (0A or 0D), but they do not seem to do the trick.
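One reason the hex values have no effect: the main loop strips the line ending from every input line and joins the lines with a space before preprocess() ever sees the text, so no hard return is left for a regex to match. A minimal sketch of one way around this, assuming every hard return really is a sentence boundary, is to hand each line to the block processor on its own. The name split_per_line and the stub callback are hypothetical; in the script, the callback would be preprocess().

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): treat every hard
# return as a sentence boundary by processing each input line as its own
# block instead of joining lines with a space first.
sub split_per_line {
    my ($process, @lines) = @_;
    my $out = "";
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;          # strip an LF or CRLF line ending
        next if $line =~ /^\s*$/;      # skip blank lines
        $out .= $process->($line);     # one block per input line
    }
    return $out;
}

# Stand-in for the script's preprocess(): returns the text plus a trailing break.
my $stub = sub { my ($text) = @_; return "$text\n" };

print split_per_line($stub,
    "The easily accessible drug was being widely used by school and college students as revealed by Dr. Yusuf Merchant\n",
    "With the ban on the drug, Dr Yusuf Merchant, the man who first brought it to public notice has heaved a sigh of relief\n");
```

In the script itself this corresponds to calling do_it_for on each line in the else branch of the main loop rather than appending the line to $text. The trade-off is that a sentence wrapped across two lines would also be split, so this only suits corpora where a hard return never falls inside a sentence.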

b. My second query concerns other languages, such as the Indic languages, where characters such as U+0964 DEVANAGARI DANDA are used as sentence delimiters.
If I want to add these as delimiters, where do I insert them? I inserted the danda at line 109, in the following rule:
# add breaks for sentences that end with some sort of punctuation are followed by a sentence starter punctuation and upper case
Code:
	$text =~ s/([?!\.]) +([\'\"\(\[\¿\।\¡\p{IsPi}]+[\ ]*[\p{IsUpper}])/$1\n$2/g;

But to no avail.
I am providing a test sentence below:
Code:
थोड़ा ठीक है। क्या हम मामाजी को अभी देख सकते हैं? उनको दिन में मिलना भी मुश्किल है।
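A likely reason the change has no effect: the danda was added to the class of sentence-starting punctuation, not to the terminators [?!\.], and every rule still requires an upper-case letter (\p{IsUpper}) after the break, which Devanagari does not have. A minimal sketch of a separate rule, assuming a break is wanted after every danda or double danda followed by a space, and after ? or ! when Devanagari text follows. The sub name split_indic is hypothetical; in the script the two substitutions could sit alongside the other rules in preprocess().

```perl
use strict;
use warnings;
use utf8;                                   # the source contains Devanagari literals
binmode(STDOUT, ":encoding(UTF-8)");

# Hypothetical helper: add sentence breaks for Devanagari text.
sub split_indic {
    my ($text) = @_;
    # break after DANDA (U+0964) or DOUBLE DANDA (U+0965) followed by spaces and more text
    $text =~ s/([\x{0964}\x{0965}]+) +(?=\S)/$1\n/g;
    # break after ? or ! when the next character is a Devanagari one (no case to test)
    $text =~ s/([?!]+) +(?=\p{Devanagari})/$1\n/g;
    # add trailing break, as the original script does
    $text .= "\n" unless $text =~ /\n\z/;
    return $text;
}

# The test sentence from the post; each sentence should come out on its own line.
print split_indic("थोड़ा ठीक है। क्या हम मामाजी को अभी देख सकते हैं? उनको दिन में मिलना भी मुश्किल है।");
```

Restricting the ?/! rule to a following Devanagari character is a design choice meant to leave the existing upper-case rules for Latin-script text untouched; other Indic scripts would need their own script property in that lookahead.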

Any solution to these two issues would be of great help. Since the script is open source, a solution would also help other users: I would post the modified script, with due acknowledgement, on the Moses site.
Could anybody provide a solution, please? Thank you.