It's not just Mr. It's also Mrs., Ms., Dr., Sr., Jr., and hundreds of other abbreviations, and these abbreviations don't always appear at the start of a sentence. (Or maybe you thought that the caret in (^Mr. | [?!]) means "not". It doesn't; outside a bracket expression it anchors that part of the ERE to the start of the string. And the <space> before the bracket expression is a literal <space> that must be matched exactly, and a <space> would never appear before a sentence-terminating character in English text.)
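A minimal sketch of the two points above, assuming any POSIX awk. The helper name ere_check and the sample lines are made up for illustration; ^Mr[.] is written with a bracket expression only so the period stays literal when the ERE is passed with -v.

```shell
# Hypothetical helper: print "match" or "no match" for each stdin line
# against the ERE given as $1.
ere_check() {
    awk -v ere="$1" '{ r = ($0 ~ ere) ? "match" : "no match"; print r ": " $0 }'
}

# Outside a bracket expression, ^ anchors to the start of the string,
# so only the line that begins with "Mr." matches.
printf '%s\n' 'Mr. Smith arrived.' 'I met Mr. Smith.' | ere_check '^Mr[.]'
# match: Mr. Smith arrived.
# no match: I met Mr. Smith.

# A literal space before [?!] must itself be matched; "Really?" has no
# space before the question mark, so only the second ERE matches.
printf '%s\n' 'Really?' | ere_check ' [?!]'
# no match: Really?
printf '%s\n' 'Really?' | ere_check '[?!]'
# match: Really?
```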
If your sentences all end at the end of a line, anchoring (i.e., [.!?]$ as I suggested in post #6 in this thread) should work for you. If your sentences span multiple lines, or several sentences share a line, AND every sentence that does not end at the end of a line has its sentence-terminating character immediately followed by two <space> characters, then the RS value I suggested in post #6 (with exactly two spaces before the vertical bar in that ERE) should give you records that are sentences (without the character that terminates the sentence).
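As a rough sketch of that two-space convention: the RS value below is an assumption of mine that fits the description (a terminator followed by exactly two spaces, or a terminator followed by a newline); it is not necessarily the value from post #6. It also assumes an awk that treats a multi-character RS as an ERE, such as gawk or mawk, since POSIX leaves that case unspecified. The function name split_sentences and the sample text are hypothetical.

```shell
# Hypothetical helper: split stdin into sentence records and number them.
# Assumed RS: terminator + two spaces, OR terminator + newline.
split_sentences() {
    awk 'BEGIN { RS = "[.!?]  |[.!?]\n" }
         { gsub(/\n/, " "); print NR ": " $0 }'   # join sentences that wrap lines
}

printf 'Dr. Jones left.  Did Mrs. Lee stay?  She did\nnot say.\n' | split_sentences
# 1: Dr. Jones left
# 2: Did Mrs. Lee stay
# 3: She did not say
```

The periods in "Dr." and "Mrs." are followed by a single space, so they do not match either alternative and stay inside their records.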
But if you have abbreviations followed by a single space, and sentence-terminating characters that are followed by a single space (not a double space) and do not appear at the end of a line, you are going to find it very difficult to guess which periods terminate abbreviations and which periods terminate sentences. (Note that it is also possible for an abbreviation to appear at the end of a sentence.)
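To see the ambiguity concretely, here is a small sketch that splits on a terminator followed by a single space or a newline. The RS value, the function name naive_split, and the sample text are all illustrative assumptions, and a regex RS again assumes gawk or mawk.

```shell
# Hypothetical helper: naive split on terminator + ONE space (or newline).
naive_split() {
    awk 'BEGIN { RS = "[.!?] |[.!?]\n" } { print NR ": " $0 }'
}

printf 'Dr. Jones met Mrs. Lee. They talked.\n' | naive_split
# 1: Dr
# 2: Jones met Mrs
# 3: Lee
# 4: They talked
```

Two sentences come out as four records, because the ERE cannot tell the period in "Dr." from the period that ends "Lee."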
Also, semicolons and colons do not end English sentences. I don't understand why you're including them in your EREs.
Last edited by Don Cragun; 08-08-2017 at 03:53 PM..
Reason: Fix typo: s/[.!?]?/[.!?]$/
DICTION(1) User commands DICTION(1)
NAME
diction - print wordy and commonly misused phrases in sentences
SYNOPSIS
diction [-b] [-d] [-f file [-n|-L language]] [file...]
diction [--beginner] [--ignore-double-words] [--file file [--no-default-file|--language language]] [file...]
diction -h|--help
diction --version
DESCRIPTION
Diction finds all sentences in a document that contain phrases from a database of frequently misused, bad or wordy diction. It further
checks for double words. If no files are given, the document is read from standard input. Each found phrase is enclosed in [ ] (brackets).
Suggestions and advice, if any and if asked for, are printed headed by a right arrow ->. A sentence is a sequence of words that
starts with a capitalised word and ends with a full stop, double colon, question mark or exclamation mark. A single letter followed by a
dot is considered an abbreviation, so it does not terminate a sentence. Various multi-letter abbreviations are recognized and do not
terminate a sentence either; neither do fractional numbers.
Diction understands cpp(1) #line lines for being able to give precise locations when printing sentences.
OPTIONS
-b, --beginner
Complain about mistakes typically made by beginners.
-d, --ignore-double-words
Ignore double words and do not complain about them.
-s, --suggest
Suggest better wording, if any.
-f file, --file file
Read the user specified database from the specified file in addition to the default database.
-n, --no-default-file
Do not read the default database, so only the user-specified database is used.
-L language, --language language
Set the phrase file language.
-h, --help
Print a short usage message.
--version
Print the version.
ERRORS
On usage errors, 1 is returned. Termination caused by lack of memory is signalled by exit code 2.
EXAMPLE
The following example first removes all roff constructs and headers from a document and feeds the result to diction with a German database:
deroff -s file.mm | diction -L de | fmt
ENVIRONMENT
LC_MESSAGES=de|en
specifies the message language and is also used as default for the phrase language. The default language is en.
FILES
/usr/share/diction/* databases for various languages
AUTHOR
This program is GNU software, copyright 1997-2005 Michael Haardt <michael@moria.de>.
The English phrase file contains contributions by Greg Lindahl <lindahl@pbm.com>, Wil Baden, Gary D. Kline, Kimberly Hanks and Beth Morris.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, write to the Free Software Foundation,
Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
HISTORY
There has been a diction command on old UNIX systems, which is now part of the AT&T DWB package. The original version was bound to roff by
enforcing a call to deroff. This version is a reimplementation and must run in a pipe with deroff(1) if you want to process roff
documents. Similarly, you can run it in a pipe with dehtml(1) or detex(1) to process HTML or TeX documents.
SEE ALSO
deroff(1), fmt(1), style(1)
Cherry, L.L.; Vesterman, W.: Writing Tools--The STYLE and DICTION programs, Computer Science Technical Report 91, Bell Laboratories, Murray
Hill, N.J. (1981), republished as part of the 4.4BSD User's Supplementary Documents by O'Reilly.
Strunk, William: The elements of style, Ithaca, N.Y.: Priv. print., 1918, http://coba.shsu.edu/help/strunk/
GNU June 09, 2006 DICTION(1)