Identifying a sentence and putting it on a new line

07-06-2014


I am revisiting the problem of sentence splitting. I have a Perl script that splits a paragraph into sentences, but acronyms and abbreviations cause problems: the full stop that ends an abbreviation is taken for the end of a sentence.

Code:

#!/usr/bin/perl

use feature qw/say/;
use strict;
use warnings;

my $s;
my @arr;

while(<>) {
  chomp $_;
  $s .= $_ . " ";
}
@arr = $s =~ m/[A-Z].+?[.;](?=[^.;][A-Z]|\s*$)/g;
foreach (@arr) {
    say;
}

I have extracted a list of abbreviations from a large corpus and would like to integrate it into the script, but since I am still learning Perl I have not managed to do so. The list is given below; it is not exhaustive and can be extended. Each entry uses the following syntax:

Code:

Abbr["<abbreviation>"];

Abbr["Co."];
Abbr["Corp."];
Abbr["vs."];
Abbr["e.g."];
Abbr["etc."];
Abbr["ex."];
Abbr["cf."];
Abbr["eg."];
Abbr["Jan."];
Abbr["Feb."];
Abbr["Mar."];
Abbr["Apr."];
Abbr["Jun."];
Abbr["Jul."];
Abbr["Aug."];
Abbr["Sep."];
Abbr["Sept."];
Abbr["Oct."];
Abbr["Nov."];
Abbr["Dec."];
Abbr["jan."];
Abbr["feb."];
Abbr["mar."];
Abbr["apr."];
Abbr["jun."];
Abbr["jul."];
Abbr["aug."];
Abbr["sep."];
Abbr["sept."];
Abbr["oct."];
Abbr["nov."];
Abbr["dec."];
Abbr["ed."];
Abbr["eds."];
Abbr["repr."];
Abbr["trans."];
Abbr["vol."];
Abbr["vols."];
Abbr["rev."];
Abbr["est."];
Abbr["b."];
Abbr["m."];
Abbr["bur."];
Abbr["d."];
Abbr["r."];
Abbr["M."];
Abbr["Dept."];
Abbr["Mr."];
Abbr["Jr."];
Abbr["Ms."];
Abbr["Mrs."];
Abbr["Dr."];

How do I integrate these so that, when the script encounters one of the above exceptions, it does not treat the full stop as a sentence delimiter, as in the examples below?

Code:

Mr. Smith said today is Jun. 15th.
Jones Inc. filed for bankruptcy.

A couple of examples of such integration would suffice; I will integrate the rest myself.
Many thanks.


gimley


07-06-2014


Hi.

Perhaps these CPAN modules will help. I have not used the first, but I have used the second with success. In addition to the links below, see post #9 in the thread "Using awk to find sentences" for a demonstration ... cheers, drl

Quote:

The Text::Sentence module contains the function split_sentences, which splits text into its constituent sentences, based on a fairly approximate regex. If you set the locale before calling it, it will deal correctly with locale dependant capitalization to identify sentence boundaries. Certain well know exceptions, such as abreviations, may cause incorrect segmentations.

Above from:
https://metacpan.org/pod/Text::Sentence

Quote:

The Lingua::EN::Sentence module contains the function get_sentences, which splits text into its constituent sentences, based on a regular expression and a list of abbreviations (built in and given).
Certain well know exceptions, such as abreviations, may cause incorrect segmentations. But some of them are already integrated into this code and are being taken care of. Still, if you see that there are words causing the get_sentences() to fail, you can add those to the module, so it notices them.

Above from:
https://metacpan.org/pod/Lingua::EN::Sentence
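
As a rough illustration of the second module, the sketch below calls get_sentences and add_acronyms as they appear in the Lingua::EN::Sentence synopsis. It assumes the module has been installed from CPAN. The sample text and the choice of extra abbreviations are only for demonstration, and the abbreviations are passed without the trailing full stop, following the module's own example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;

# Assumes Lingua::EN::Sentence has been installed from CPAN.
use Lingua::EN::Sentence qw( get_sentences add_acronyms );

# Register extra abbreviations on top of the module's built-in list.
# They are given without the trailing full stop (illustrative subset).
add_acronyms('inc', 'corp', 'dept', 'vol');

my $text = 'Mr. Smith said today is Jun. 15th. Jones Inc. filed for bankruptcy.';

# get_sentences returns a reference to an array of sentences.
my $sentences = get_sentences($text);
say for @$sentences;
```

How well this segments a given corpus depends on the module's built-in list, so it is worth checking the output on a sample before relying on it.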

drl


07-06-2014


How about this awk solution? The function sentence_parse populates an array with the sentences found in each input line.

Features:
- ignores acronyms from the supplied file
- ignores full stops inside quoted strings
- removes leading spaces (spaces following previous full-stop)

Code:

awk '
function rindex(str,c)
{
  return match(str,"\\" c "[^" c "]*$")? RSTART : 0
}

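# Parameters after "sentences" serve as local variables (awk idiom);
# char and inquote are included so that quote state resets on every call.
function sentence_parse(text,sentences,quote,start,num,pos,last_word,char,inquote) {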
   quote = (quote ? quote : "\"")
   start=num=1
   for(pos = 1; pos <= length(text) ; pos++) {
       char=substr(text, pos,1)
       if(start && char != " ") start=0
       if(!start) {
           sentences[num] = sentences[num] char
           if(char==quote) inquote=!inquote
           if (!inquote && char == ".") {
              last_word = substr(substr(text,1,pos), 1+rindex(substr(text,1,pos), " "))
              if(!(last_word in acro)) {
                  start=1
                  num++
              }
           }
        }
    }
}

NR==FNR{
 if($0 ~ "^Abbr\\[\"") {
     split($0,vals,"\"");
     acro[vals[2]]
 }
 next
}

{ 
   delete sen
   sentence_parse($0, sen)
   for(i=1; i<=length(sen) ; i++) print sen[i]
} ' acronyms.txt infile

Credit to vgersh99 for the rindex function
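
For the original Perl script, the same idea (look up the word that ends in a full stop in a table of known abbreviations) could be sketched roughly as below. This is only a minimal sketch under a few assumptions: the abbreviation lines are embedded in the script rather than read from a file, split_sentences is a hypothetical helper name, and "Inc." has been added to the table because it appears in the sample text but not in the posted list.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;

# A small subset of the list, in the Abbr["..."]; format from the thread.
# "Inc." is an assumed addition; it is not in the posted list.
my @abbr_lines = (
    'Abbr["Mr."];', 'Abbr["Jun."];', 'Abbr["Inc."];', 'Abbr["e.g."];',
);

# Pull the quoted abbreviation out of each line and index it in a hash.
my %abbr;
for my $line (@abbr_lines) {
    $abbr{$1} = 1 if $line =~ /^Abbr\["([^"]+)"\];/;
}

# Hypothetical helper: walk the text word by word and close a sentence
# whenever a word ends in "." or ";" and is not a known abbreviation.
sub split_sentences {
    my ($text, $abbr) = @_;
    my @sentences;
    my $current = '';
    for my $token (split ' ', $text) {
        $current .= ($current eq '' ? '' : ' ') . $token;
        if ($token =~ /[.;]$/ && !exists $abbr->{$token}) {
            push @sentences, $current;
            $current = '';
        }
    }
    # Keep any trailing text that has no closing delimiter.
    push @sentences, $current if $current ne '';
    return @sentences;
}

my $text = 'Mr. Smith said today is Jun. 15th. Jones Inc. filed for bankruptcy.';
say for split_sentences($text, \%abbr);
```

With the sample text this prints the two sentences on separate lines. The approach has the same limitation as any table lookup: an abbreviation that really does end a sentence (for example a final "etc.") is not treated as a boundary, and the comparison is case-sensitive, which is why the posted list carries both "Jan." and "jan.".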


Chubler_XL
