Shell Programming and Scripting: Help with awk regular expression for RS record separator
Post 303001605 by Don Cragun on Tuesday 8th of August 2017 02:27:45 AM
It's not just Mr. It's also Mrs., Ms., Dr., Sr., Jr., and hundreds of other abbreviations. And these abbreviations don't always appear at the start of a sentence. (Or maybe you thought that the caret in (^Mr. | [?!]) means "not". It doesn't; it anchors that part of the ERE to the start of a string. And the <space> before the bracket expression is a literal <space> that must be matched exactly, and that <space> would never appear before a sentence-terminating character in English text.)
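
A quick way to check both points is to test a few strings against that ERE. This is only a sketch: the helper name matches_ere and the sample strings are made up for illustration, and the ERE is copied as quoted above.

```shell
# Hypothetical helper: succeeds (exit status 0) when the text matches the ERE.
matches_ere() {
    printf '%s\n' "$1" | awk '{ exit !($0 ~ /(^Mr. | [?!])/) }'
}

# Sample strings are illustrative only.
for text in 'Mr. Smith left' 'I saw Mr. Smith' 'Really?' 'Really ?'; do
    if matches_ere "$text"; then
        printf 'match:    %s\n' "$text"
    else
        printf 'no match: %s\n' "$text"
    fi
done
```

Only the first and last strings should match: "Mr. " is found only at the start of the string, and the question mark is found only when a <space> precedes it.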

If your sentences all end at the end of a line, anchoring (i.e. [.!?]$ as I suggested in post #6 in this thread) should work for you. If you have sentences that span multiple lines or multiple sentences on a line, AND every sentence that does not end at the end of a line has its sentence-terminating character immediately followed by two <space> characters, then the RS value I suggested in post #6 (i.e.
Code:
RS="[.?!](  |$)"

with exactly two spaces before the vertical bar in that ERE) should give you records that are sentences (without the character that terminates the sentence).
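
Here is a minimal sketch of that kind of RS in use. It assumes an awk that accepts a regular expression as RS (gawk and mawk do; POSIX leaves a multi-character RS unspecified). It also spells the end-of-line case as a literal \n instead of $, because gawk documents that $ in RS matches the end of the input rather than the end of each line. That substitution is mine, not part of the suggestion above, and split_sentences and the sample text are made up.

```shell
# Hypothetical helper: print each sentence as a numbered record.
# RS = sentence terminator followed by two spaces, or by a newline.
split_sentences() {
    awk 'BEGIN { RS = "[.?!](  |\n)" }
         { printf "%d: %s\n", NR, $0 }'
}

# "Mr. " is followed by a single space, so it does not end a record here.
printf 'Is it late?  Yes, it is.  Mr. Smith left early!\n' | split_sentences
```

With an awk that supports a regex RS, this should print three records: "Is it late", "Yes, it is" and "Mr. Smith left early", each without its terminating character.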

But if you have abbreviations followed by a single space, and sentence-terminating characters that are followed by a single space (not a double space) and do not appear at the end of a line, you are going to find it very difficult to guess which periods terminate abbreviations and which periods terminate sentences. (Note that it is also possible for an abbreviation to appear at the end of a sentence.)
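
To make the ambiguity concrete, here is a sketch of what happens when a single space is accepted after the terminator. It again assumes an awk with regex RS support (such as gawk or mawk); naive_split and the sample text are made up for illustration.

```shell
# Hypothetical naive splitter: terminator followed by ONE space or a newline.
naive_split() {
    awk 'BEGIN { RS = "[.?!]( |\n)" }
         { printf "%d: %s\n", NR, $0 }'
}

# The period in "Mr." looks exactly like a sentence terminator to this RS.
printf 'Mr. Smith left. He was late.\n' | naive_split
```

This should produce three records ("Mr", "Smith left" and "He was late") for a text that contains only two sentences, which is the guessing problem described above.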

And, semicolons and colons do not end English sentences. I don't understand why you're including them in your EREs.

Last edited by Don Cragun; 08-08-2017 at 03:53 PM.. Reason: Fix typo: s/[.!?]?/[.!?]$/

DICTION(1)                        User commands                        DICTION(1)

NAME
       diction - print wordy and commonly misused phrases in sentences

SYNOPSIS
       diction [-b] [-d] [-f file [-n|-L language]] [file...]
       diction [--beginner] [--ignore-double-words] [--file file [--no-default-file|--language language]] [file...]
       diction -h|--help
       diction --version

DESCRIPTION
       Diction finds all sentences in a document that contain phrases from a
       database of frequently misused, bad or wordy diction. It further checks
       for double words. If no files are given, the document is read from
       standard input. Each found phrase is enclosed in [ ] (brackets).
       Suggestions and advice, if any and if asked for, are printed headed by
       a right arrow ->.

       A sentence is a sequence of words, that starts with a capitalised word
       and ends with a full stop, double colon, question mark or exclamation
       mark. A single letter followed by a dot is considered an abbreviation,
       so it does not terminate a sentence. Various multi-letter abbreviations
       are recognized, they do not terminate a sentence as well, neither do
       fractional numbers.

       Diction understands cpp(1) #line lines for being able to give precise
       locations when printing sentences.

OPTIONS
       -b, --beginner
              Complain about mistakes typically made by beginners.

       -d, --ignore-double-words
              Ignore double words and do not complain about them.

       -s, --suggest
              Suggest better wording, if any.

       -f file, --file file
              Read the user specified database from the specified file in
              addition to the default database.

       -n, --no-default-file
              Do not read the default database, so only the user-specified
              database is used.

       -L language, --language language
              Set the phrase file language.

       -h, --help
              Print a short usage message.

       --version
              Print the version.

ERRORS
       On usage errors, 1 is returned. Termination caused by lack of memory is
       signalled by exit code 2.

EXAMPLE
       The following example first removes all roff constructs and headers
       from a document and feeds the result to diction with a German database:

              deroff -s file.mm | diction -L de | fmt

ENVIRONMENT
       LC_MESSAGES=de|en
              specifies the message language and is also used as default for
              the phrase language. The default language is en.

FILES
       /usr/share/diction/*
              databases for various languages

AUTHOR
       This program is GNU software, copyright 1997-2005 Michael Haardt
       <michael@moria.de>. The English phrase file contains contributions by
       Greg Lindahl <lindahl@pbm.com>, Wil Baden, Gary D. Kline, Kimberly
       Hanks and Beth Morris.

       This program is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published by the
       Free Software Foundation; either version 2 of the License, or (at your
       option) any later version.

       This program is distributed in the hope that it will be useful, but
       WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
       General Public License for more details.

       You should have received a copy of the GNU General Public License along
       with this program. If not, write to the Free Software Foundation, Inc.,
       59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

HISTORY
       There has been a diction command on old UNIX systems, which is now part
       of the AT&T DWB package. The original version was bound to roff by
       enforcing a call to deroff. This version is a reimplementation and must
       run in a pipe with deroff(1) if you want to process roff documents.
       Similarly, you can run it in a pipe with dehtml(1) or detex(1) to
       process HTML or TeX documents.

SEE ALSO
       deroff(1), fmt(1), style(1)

       Cherry, L.L.; Vesterman, W.: Writing Tools--The STYLE and DICTION
       programs, Computer Science Technical Report 91, Bell Laboratories,
       Murray Hill, N.J. (1981), republished as part of the 4.4BSD User's
       Supplementary Documents by O'Reilly.

       Strunk, William: The elements of style, Ithaca, N.Y.: Priv. print.,
       1918, http://coba.shsu.edu/help/strunk/

GNU                               June 09, 2006                        DICTION(1)
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.