Identifying a sentence and putting it on a new line


 
# 1  
Old 07-06-2014

I am revisiting the problem of sentence splitting. I have a Perl script which splits a paragraph into sentences, but acronyms and abbreviations cause problems: the full stop inside them is mistaken for the end of a sentence.

Code:
#!/usr/bin/perl

use feature qw/say/;
use strict;
use warnings;

my $s;
my @arr;

while(<>) {
  chomp $_;
  $s .= $_ . " ";
}
@arr = $s =~ m/[A-Z].+?[.;](?=[^.;][A-Z]|\s*$)/g;
foreach (@arr) {
    say;
}


I have identified a list of abbreviations from a large corpus (not necessarily exhaustive), given below, which I would like to integrate into the script. Since I am still learning Perl, I have not managed to do so. The list is incomplete and can be extended. Each entry uses the following syntax:
Code:
Abbr["followed by the abbreviation"];

Abbr["Co."];
Abbr["Corp."];
Abbr["vs."];
Abbr["e.g."];
Abbr["etc."];
Abbr["ex."];
Abbr["cf."];
Abbr["eg."];
Abbr["Jan."];
Abbr["Feb."];
Abbr["Mar."];
Abbr["Apr."];
Abbr["Jun."];
Abbr["Jul."];
Abbr["Aug."];
Abbr["Sep."];
Abbr["Sept."];
Abbr["Oct."];
Abbr["Nov."];
Abbr["Dec."];
Abbr["jan."];
Abbr["feb."];
Abbr["mar."];
Abbr["apr."];
Abbr["jun."];
Abbr["jul."];
Abbr["aug."];
Abbr["sep."];
Abbr["sept."];
Abbr["oct."];
Abbr["nov."];
Abbr["dec."];
Abbr["ed."];
Abbr["eds."];
Abbr["repr."];
Abbr["trans."];
Abbr["vol."];
Abbr["vols."];
Abbr["rev."];
Abbr["est."];
Abbr["b."];
Abbr["m."];
Abbr["bur."];
Abbr["d."];
Abbr["r."];
Abbr["M."];
Abbr["Dept."];
Abbr["Mr."];
Abbr["Jr."];
Abbr["Ms."];
Abbr["Mrs."];
Abbr["Dr."];

How do I integrate these so that, when the script encounters one of the above exceptions, it does not treat the full stop as a sentence delimiter, as in the examples below?
Code:
Mr. Smith said today is Jun. 15th.
Jones Inc. filed for bankruptcy.

A couple of examples for such integration would suffice and I will integrate the rest.
Many thanks

Moderator's Comments:
Mod Comment edit by bakunin: corrected your CODE-tags

Last edited by bakunin; 07-06-2014 at 07:26 AM..
# 2  
Old 07-06-2014
Hi.

Perhaps these CPAN modules may help. I have not used the first, but I have used the second with success. In addition to the links below, see post #9 in the thread "Using awk to find sentences" for a demonstration ... cheers, drl

Quote:
The Text::Sentence module contains the function split_sentences, which splits text into its constituent sentences, based on a fairly approximate regex. If you set the locale before calling it, it will deal correctly with locale dependant capitalization to identify sentence boundaries. Certain well know exceptions, such as abreviations, may cause incorrect segmentations.
Above from:
https://metacpan.org/pod/Text::Sentence

Quote:
The Lingua::EN::Sentence module contains the function get_sentences, which splits text into its constituent sentences, based on a regular expression and a list of abbreviations (built in and given).
Certain well know exceptions, such as abreviations, may cause incorrect segmentations. But some of them are already integrated into this code and are being taken care of. Still, if you see that there are words causing the get_sentences() to fail, you can add those to the module, so it notices them.
Above from:
https://metacpan.org/pod/Lingua::EN::Sentence
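
The "regular expression plus a list of abbreviations" idea described in the quote can also be sketched in plain Perl without any module. This is only a rough illustration, not how Lingua::EN::Sentence is implemented: the sub name split_sentences, the hash %abbr and the rule "the next word must start with a capital letter" are assumptions made for this example, and the abbreviation list is just a handful of entries from post #1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A few entries from the list in post #1; extend as needed.
my %abbr = map { $_ => 1 }
    qw(Mr. Mrs. Ms. Dr. Jr. Co. Corp. Dept. vs. e.g. etc. Jan. Feb. Jun.);

# Split $text into sentences. A word ending in "." or ";" closes a sentence
# unless it is a listed abbreviation or the next word is not capitalised.
sub split_sentences {
    my ($text) = @_;
    my @words = split ' ', $text;
    my (@sentences, @current);

    for my $i (0 .. $#words) {
        my $word = $words[$i];
        push @current, $word;

        next unless $word =~ /[.;]$/;    # not a candidate boundary
        next if $abbr{$word};            # listed abbreviation: keep going

        my $next = $words[$i + 1];       # undef after the last word
        next if defined $next && $next !~ /^[A-Z]/;    # e.g. "Inc. filed"

        push @sentences, join ' ', @current;
        @current = ();
    }
    push @sentences, join ' ', @current if @current;    # trailing fragment
    return @sentences;
}

my $text = 'Mr. Smith said today is Jun. 15th. Jones Inc. filed for bankruptcy.';
print "$_\n" for split_sentences($text);
```

With the two examples from post #1 this prints one sentence per line. A known limitation of this sketch: a listed abbreviation that really does end a sentence ("... and so on, etc. The next ...") is not split, so the list and the capital-letter rule may both need tuning for a given corpus.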
# 3  
Old 07-06-2014
How about this awk solution? The function sentence_parse populates an array with one sentence per element.

Features:
- does not split at abbreviations listed in the supplied file
- does not split at full stops inside quoted strings
- removes leading spaces (spaces following the previous full stop)

Code:
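awk '
# Position of the last occurrence of c in str (0 if absent); c must be a single character.
function rindex(str,c)
{
  return match(str,"\\" c "[^" c "]*$")? RSTART : 0
}

# Fill sentences[1..n] from text and return n.
# Names after "quote" are local variables; inquote is reset for every call.
function sentence_parse(text,sentences,quote,    start,num,pos,char,inquote,head,last_word) {
   quote = (quote ? quote : "\"")
   start=num=1
   inquote=0
   for(pos = 1; pos <= length(text) ; pos++) {
       char=substr(text, pos,1)
       # skip the spaces that follow the previous full stop
       if(start && char != " ") start=0
       if(!start) {
           sentences[num] = sentences[num] char
           if(char==quote) inquote=!inquote
           if (!inquote && char == ".") {
              # word ending at this full stop, e.g. "Mr."
              head = substr(text,1,pos)
              last_word = substr(head, 1+rindex(head, " "))
              if(!(last_word in acro)) {
                  start=1
                  num++
              }
           }
        }
    }
    # if the text ended right after a full stop, slot num is still empty
    return (start ? num-1 : num)
}

# First file: collect the abbreviations from lines such as Abbr["Mr."];
NR==FNR{
 if($0 ~ "^Abbr\\[\"") {
     split($0,vals,"\"");
     acro[vals[2]]
 }
 next
}

# Second file: print each sentence of the line on its own line
{
   delete sen
   n = sentence_parse($0, sen)
   for(i=1; i<=n ; i++) print sen[i]
} ' acronyms.txt infile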


Credit to vgersh99 for the rindex function
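
For anyone who would rather stay in Perl, the abbreviation-loading step (the NR==FNR block of the awk script above) could look roughly like the sketch below. The sub name load_abbreviations and the in-memory sample are made up for this illustration; acronyms.txt is the file name used in the awk command, assumed to hold the Abbr["..."]; lines from post #1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read lines of the form  Abbr["Mr."];  from a filehandle and return a
# hash reference whose keys are the abbreviations. Other lines are skipped.
sub load_abbreviations {
    my ($fh) = @_;
    my %abbr;
    while (my $line = <$fh>) {
        $abbr{$1} = 1 if $line =~ /^Abbr\["([^"]+)"\]/;
    }
    return \%abbr;
}

# Demo on an in-memory sample; for a real file use something like
#   open my $fh, '<', 'acronyms.txt' or die "acronyms.txt: $!";
my $sample = qq{Abbr["Mr."];\nAbbr["Jun."];\nnot an entry\nAbbr["e.g."];\n};
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $abbr = load_abbreviations($fh);
close $fh;

print join(', ', sort keys %$abbr), "\n";
```

The resulting hash can then be consulted with a plain lookup such as $abbr->{$word} whenever a word ending in a full stop is found, which is the same test the awk script performs with (last_word in acro).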

Last edited by Chubler_XL; 07-06-2014 at 05:56 PM..