Help with awk regular expression for RS record separator Post: 303001605


Shell Programming and Scripting, post 303001605 by Don Cragun on Tuesday 8th of August 2017 02:27:45 AM


It's not just Mr. It's also Mrs., Ms., Dr., Sr., Jr., and hundreds of other abbreviations, and these abbreviations don't always appear at the start of a sentence. Or maybe you thought that the caret in (^Mr. | [?!]) means "not"? It doesn't; it anchors that part of the ERE to the start of a string. And the <space> before the bracket expression is a literal <space> that must be matched exactly (and that <space> would never appear before a sentence-terminating character in English text).
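A small sketch of the two meanings of the caret may help here. The function name caret_demo and the sample lines are made up for illustration. Note that the period after Mr is escaped in this sketch; an unescaped . in an ERE matches any character.

```shell
# caret_demo is a hypothetical helper; it labels each input line it matches.
caret_demo() {
    awk '
        # Outside a bracket expression, ^ anchors the match to the start of the string.
        /^Mr\./   { print "anchored: " $0 }
        # As the first character inside a bracket expression, ^ negates the set:
        # this matches lines whose last character is NOT one of . ? !
        /[^.?!]$/ { print "negated: " $0 }
    '
}

printf '%s\n' 'Mr. Jones left.' 'Then Mr. Jones returned' | caret_demo
```

The first sample line is reported only by the anchored pattern and the second only by the negated bracket expression, even though both lines contain "Mr.".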

If your sentences all end at the end of a line, anchoring (i.e., [.!?]$ as I suggested in post #6 in this thread) should work for you. If you have sentences that span multiple lines or multiple sentences on a line, AND every sentence that does not end at the end of a line has its sentence-terminating character immediately followed by two <space> characters, then the RS value I suggested in post #6 (i.e.,

Code:

RS="[.?!](  |$)"

with exactly two spaces before the vertical bar in that ERE) should give you records that are sentences (without the character that terminates the sentence).
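A minimal sketch of an RS like this in use, under a few assumptions. POSIX leaves a multi-character RS unspecified, so this needs an awk that treats RS as an ERE (gawk, mawk, and BusyBox awk do). The gawk manual notes that ^ and $ in RS match at the start and end of the whole input rather than of each line, so whether $ catches line ends depends on your awk; this sketch spells that case as a literal newline instead. The function name split_sentences and the sample text are made up for illustration.

```shell
# split_sentences is a hypothetical helper: one sentence per output line.
split_sentences() {
    awk '
        # A terminator followed by two spaces, or by a newline.
        BEGIN { RS = "[.?!](  |\n)" }
        {
            gsub(/\n/, " ")   # rejoin sentences that wrap across lines
            print             # the terminating . ? or ! is consumed by RS
        }
    '
}

printf 'Dr. Smith arrived.  Was he\nlate?  No!\n' | split_sentences
```

With the sample input this should print three records ("Dr. Smith arrived", "Was he late", "No"); "Dr." survives because it is followed by only one space.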

But if you have abbreviations followed by a single space and sentence-terminating characters followed by a single space (not a double space) and not appearing at the end of a line, you are going to find it very difficult to guess which periods terminate abbreviations and which periods terminate sentences. (Note that it is also possible for an abbreviation to appear at the end of a sentence.)
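The single-space problem is easy to reproduce. The sketch below assumes an awk that accepts an ERE as RS (gawk, mawk, BusyBox awk); naive_split and the sample sentence are made up for illustration.

```shell
# naive_split is a hypothetical helper that treats every ". " as a sentence end.
naive_split() {
    awk 'BEGIN { RS = "[.?!]( |\n)" } { print }'
}

printf 'Dr. Smith met Mr. Jones. They talked.\n' | naive_split
```

The two sentences should come out as four records ("Dr", "Smith met Mr", "Jones", "They talked"), because nothing in the input distinguishes the period after an abbreviation from the period that ends a sentence.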

And semicolons and colons do not end English sentences. I don't understand why you're including them in your EREs.

Last edited by Don Cragun; 08-08-2017 at 03:53 PM.. Reason: Fix typo: s/[.!?]?/[.!?]$/


