Regex to identify a full-stop as a sentence delimiter


 
# 1  
Old 07-28-2012

Hello,
Splitting text into sentences at the full-stop, question-mark or exclamation mark is a common device. The question-mark and the exclamation mark do not pose much of a problem, but the full-stop raises certain issues as a sentence delimiter because of its varied use:
Quote:
The temperature was 32.8 degrees Celsius. (Temperature)
His B.Sc. degree was deemed insufficient. (Acronym)
He owed the bank USD 4000.50 which he had not paid back. (Currency)
On 27.07.2004 a major earthquake occurred. (Date)
It was 17.05 by the clock. (Time)
just to name a few.

Standard parsers such as the Stanford parser do not handle this correctly and treat the full-stop as a delimiter wherever it occurs.
A Perl script would do the job, but since I am working on dynamic data where on-the-fly detection is needed, I am looking for a regex that correctly ignores the above cases and identifies only the full-stops that really end a sentence.
Using proximity, i.e. ignoring a full-stop if only a couple of words separate it from the next full-stop, is a possibility, but it does not work in all cases.
Does anyone know of a solution to this thorny issue? Many thanks in advance for your help.
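To make the problem concrete, here is a minimal sketch of what happens when every full-stop is treated as a delimiter. It is only an illustration, not the actual logic of any parser; the function name `naive_split` is made up for this example, and the newline in the replacement is written as backslash-newline so that it does not depend on a particular sed.

```shell
# naive_split: hypothetical helper that breaks the line after EVERY full-stop,
# which is exactly the behaviour that has to be avoided.
naive_split() {
    # Replace each full-stop (plus any spaces after it) with ".<newline>".
    sed 's/\. */.\
/g'
}

printf '%s\n' 'His B.Sc. degree was deemed insufficient.' | naive_split
```

The acronym is cut into `His B.`, `Sc.` and `degree was deemed insufficient.`, so a single sentence comes out as three pieces.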
# 2  
Old 07-28-2012
Do you only want to match the period at the end, just before the (xxxxx)?
# 3  
Old 07-28-2012
Hi,

The input and output you want are not clear to me, but about parsing the full-stop:

Maybe you could say that the full-stop must be followed by a \s and a capital letter, or by the end of the file?
# 4  
Old 07-28-2012
Hello,
Maybe I was not very clear. What I want is a regex that identifies the full-stop as the end of a sentence and excludes all the other full-stops listed in my post, which are not sentence delimiters but belong to entities such as temperatures, currency amounts, acronyms, dates, etc.
Many thanks once again.
# 5  
Old 07-28-2012
Hmm, I guess it is not clear when I write in English, so let's talk regex.

I said:
Quote:
Maybe you could say that the full-stop must be followed by a \s and a capital letter, or by the end of the file?
That could mean something like: '\.\s[A-Z]'
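A quick way to check this idea is to count how many full-stops it would accept as boundaries. The sketch below reads the suggestion as "full-stop, then whitespace, then a capital letter" (the whitespace part is an assumption about what was meant). The helper name `count_boundaries` is made up, `[[:space:]]` and `[[:upper:]]` are used instead of shorthand escapes and `A-Z` to avoid locale surprises, and `grep -o` is assumed to be available (GNU and BSD grep have it, POSIX does not require it).

```shell
# count_boundaries: hypothetical helper that counts the full-stops followed by
# whitespace and a capital letter, i.e. the candidate sentence boundaries.
count_boundaries() {
    # grep -o prints each match on its own line; wc -l counts those lines.
    # tr strips the padding that some wc implementations add.
    grep -o '\.[[:space:]][[:upper:]]' | wc -l | tr -d ' '
}

text='The temperature was 32.8 degrees Celsius. His B.Sc. degree was deemed insufficient.'
printf '%s\n' "$text" | count_boundaries
```

For this text the count is 1: only the full-stop after `Celsius` qualifies, while `32.8` and `B.Sc.` are left alone. The full-stop at the very end of the input is not counted and has to be covered by the "end of file" part of the rule.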
# 6  
Old 07-28-2012
Hi, many thanks.
I tried the regex you provided.
Here is the input:
Quote:
The temperature was 32.8 degrees Celsius. His B.Sc. degree was deemed insufficient. He owed the bank USD 4000.50 which he had not paid back. On 27.07.2004 a major earthquake occurred. It was 17.05 by the clock.
What I need is for the regex to identify only the sentences delimited by a full-stop.
The expected output would be:
Quote:
The temperature was 32.8 degrees Celsius.
His B.Sc. degree was deemed insufficient.
He owed the bank USD 4000.50 which he had not paid back.
On 27.07.2004 a major earthquake occurred.
It was 17.05 by the clock.
and not, for example:
Quote:
His B.
Sc.
degree was deemed insufficient.
The regex you furnished, which I applied as a Unix regex, gave me the following:
Quote:
His B.Sc.
degree was deemed insufficient.
I tried quite a few tweaks, but they made it worse.
Are there any workarounds, please? I have a huge database with strings of this type and need to identify the valid ones.
Many thanks.
# 7  
Old 07-28-2012
sed

Hi,

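Try this one:
Code:
# Break the line after a full-stop that is followed by a space and a capital letter (\n in the replacement works in GNU sed).
sed -e 's/\. \([A-Z]\)/.\n\1/g' file

I think this should help you.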
Cheers,
Ranga:-)
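The one-liner above splits wherever ". " comes before a capital letter, so it would also break after an abbreviation that is followed by a capitalised word ("Dr. Smith", "B.Sc. Degree"). One possible workaround, given here only as a sketch, is to hide the full-stop of known abbreviations behind a placeholder, split, and then restore it. The function name `split_sentences`, the short abbreviation list and the `@@DOT@@` placeholder are all assumptions made for illustration; `sed -E` is assumed to be supported (GNU and BSD sed accept it).

```shell
# split_sentences: hypothetical helper that prints one sentence per line.
# Step 1 hides the full-stop of the listed abbreviations behind @@DOT@@.
# Step 2 breaks after ". " when a capital letter follows.
# Step 3 puts the hidden full-stops back.
# The abbreviation list is illustrative only and far from complete.
split_sentences() {
    sed -E \
        -e 's/(^|[^[:alpha:]])(Dr|Mr|Mrs|Prof|B\.Sc)\. /\1\2@@DOT@@ /g' \
        -e 's/\. ([[:upper:]])/.\
\1/g' \
        -e 's/@@DOT@@/./g'
}

printf '%s\n' 'Dr. Smith left. His B.Sc. Degree was old. It was 17.05.' | split_sentences
```

With this input the output has three lines, one per sentence, and `Dr.`, `B.Sc.` and `17.05` stay intact. The trade-off is that a sentence which really ends with a listed abbreviation will not be split, and the placeholder must not occur in the data.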