Help with awk regular expression for RS record separator

08-07-2017


Hi,

I'm using gawk to read a text file and count the sentences.
I want the record separator to be a period, an exclamation mark, or a question mark.

The problem is that the file contains words like "Mr. Smith", so the period in the title is tripping my record separator.

This is my code snippet:

Code:

BEGIN {
       RS="[.?!]"
}

This actually works fine until the file contains words like Mr. Smith.
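A quick way to see the behaviour described above is to feed one short line through the same RS. This is only a sketch: the sample sentence and the counter n are made up for illustration, and it assumes an awk that accepts a regular expression in RS, as gawk does.

```shell
# Split on . ? ! and print each non-empty record with a running number.
# $1 = $1 rebuilds the record so leading blanks are dropped.
out=$(printf 'Mr. Smith left. Did he? Yes!\n' |
  awk 'BEGIN { RS = "[.?!]" } NF { $1 = $1; n++; print n ": " $0 }')
printf '%s\n' "$out"
# 1: Mr
# 2: Smith left
# 3: Did he
# 4: Yes
```

The title is cut off as a record of its own ("Mr"), so the sentence count comes out one too high.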

So I tried like this:

Code:

RS="[^Mr.][.?!]"

Or like this:

Code:

RS="!Mr.[.?!]"

Or like this:

Code:

RS = "!(Mr.)[.?!]"

But I couldn't get any of them to work.

Any ideas how I can do this?


1Brajesh


08-07-2017


Hi, try

Code:

RS="[^\"Mr.\"][.?!]"


ctac_


08-07-2017


No, it didn't work.

It broke the file that was working.
I have a file without any "Mr." words.

By adding your suggestion, even the file without any "Mr." words stops working.

For example, it reads "one." as "on", "two." as "tw", and "three" as "thre".

This is the same thing that happened with my own attempts.

---------- Post updated at 06:22 PM ---------- Previous update was at 06:18 PM ----------

Here's my full code:

Code:

#!/usr/bin/gawk -f

BEGIN {
       RS="[.?!;:]"           #       There is a problem with Mr. and Mrs. 
       maxWords=0
      }

{

if (maxWords<NF) 
     { 
        maxWords=NF
        longestSentence = $0
     }

for (i=1;i<=NF;i++) 
        a[$i]++

}

END{ 
      i=1;
      for(k in a) 
      {
        print i, k, a[k];
        i++;
      }
      print
      print("There were", NR, "sentences and the longest sentence had", maxWords, "words and there were", length(a), "unique words")
      print ("The longest sentence was:", longestSentence)
}

---------- Post updated at 06:25 PM ---------- Previous update was at 06:22 PM ----------

Here is the test file I'm using. It works fine with the code above, but as soon as I start changing the RS expression, even this file, which contains no "Mr.", stops working.

----start of file----

Code:

one.
two two. 
three three three!
four four four four five five five five five.
six six six six six six?

------end of file---------


1Brajesh


08-07-2017


Code:

awk '
BEGIN {
   eol = "[.?!;:]"   # punctuation to strip from the end of each word
   maxWords = 0
}

{
   # RS is left at its default, so each input line is one record.
   if (NF > maxWords) {
      maxWords = NF
      longestSentence = $0
   }

   for (i = 1; i <= NF; i++) {
      sub(eol "$", "", $i)   # drop one trailing punctuation character
      a[$i]++
   }
}

END {
   for (k in a) print ++ii, k, a[k]
   print ""
   print("There were", NR, "sentences and the longest sentence had", maxWords, "words and there were", length(a), "unique words")
   print("The longest sentence was:", longestSentence)
}
' infile
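To see what the sub() call in the loop does on its own, here is a small sketch run on one line of the test file. The array name count is just for illustration. Each field has one trailing punctuation character removed before it is counted, so "three!" and "three" end up as the same word.

```shell
# Strip one trailing . ? ! ; or : from every word, then count the words.
out=$(printf 'three three three!\n' |
  awk '{ for (i = 1; i <= NF; i++) { sub("[.?!;:]$", "", $i); count[$i]++ } }
       END { for (w in count) print w, count[w] }')
printf '%s\n' "$out"
# three 3
```

Without the sub() call the same input would give two separate entries, one for "three" and one for "three!".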


rdrtx1


08-07-2017


Hmmm, interesting. Isn't the record separator a newline now?
What if one sentence spans multiple lines? Won't it be counted as two or more sentences?

Also, I don't understand exactly what the sub command is doing.

Thank you

1Brajesh


08-07-2017


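If what you want to do is separate records at points where the last character on a line is a <period>, <question-mark>, or <exclamation-point>, you probably want to use:

Code:

RS="[.?!]\n"

which applies the end-of-line punctuation test from rdrtx1's suggestion to the record separator. A literal <newline> is used rather than $ because the gawk manual describes ^ and $ in an RS regular expression as matching the beginning and end of the input, not of each line.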

Using RS="[.?!;:]" splits records on <period>, <question-mark>, <exclamation-point>, <semicolon>, and <colon> anywhere on a line.

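Using RS="[^\"Mr.\"][.?!]" splits records on any two-character sequence where the first character is not a <double-quote>, <uppercase-M>, <lowercase-r>, or <period> (inside the awk string, each <backslash> only escapes the <double-quote> that follows it) and the second character is a <period>, <question-mark>, or <exclamation-point>. Because the first character of the match is part of the separator, it is removed from the record, which is why "one." is read as "on". This ERE makes no sense to me for this use.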
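The effect of that bracket expression can be reproduced with two lines from the test file. This is a sketch under the assumption that awk accepts a regular expression in RS, as gawk does. The character before the punctuation matches the negated class, so it is consumed as part of the separator.

```shell
# The negated class [^"Mr."] matches the "e" in "one." and the last "o" in "two.",
# so those letters disappear together with the period.
out=$(printf 'one.\ntwo two.\n' |
  awk 'BEGIN { RS = "[^\"Mr.\"][.?!]" } NF { $1 = $1; print }')
printf '%s\n' "$out"
# on
# two tw
```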

If, in addition to splitting when a set of characters is found at the end of a line, you also wanted to find that set of characters followed by two <space> characters (which is the common way of separating sentences in old-fashioned text files), you could use (with a literal <newline> standing for the end of the line):

Code:

RS="[.?!](  |\n)"
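Here is a small sketch of that kind of separator on made-up input. It uses a literal newline where the end of a line is meant, since the gawk manual describes ^ and $ in RS as matching the start and end of the input rather than of each line. The sample text and the counter n are assumptions for illustration only.

```shell
# A record ends at . ? or ! only when followed by two spaces or a newline.
out=$(printf 'Mr. Smith went home.  He was tired\nafter the long trip.\nWas it far?\n' |
  awk 'BEGIN { RS = "[.?!](  |\n)" } NF { $1 = $1; n++; print n ": " $0 }')
printf '%s\n' "$out"
# 1: Mr. Smith went home
# 2: He was tired after the long trip
# 3: Was it far
```

"Mr." survives here only because it is followed by a single space. A sentence that ends with one space in the middle of a line would not be split either, so this depends on the text following the two-space convention.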

Note that most of the above is talking about gawk and does not necessarily apply to other standards-conforming versions of awk. The standards state that if more than one character is assigned to RS, it is unspecified whether RS is treated as a multi-character ERE that acts as the record separator or only the 1st character assigned to RS acts as the record separator. If RS is set to an empty string, the record separator is a sequence of two or more adjacent <newline> characters.

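The default record separator is a <newline>. With the default FS, a <newline> inside a record separates fields just as <space> and <tab> do, so when RS is set to something other than a <newline>, <newline> acts as a field separator. When RS is the empty string, <newline> is a field separator in addition to whatever FS is set to.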


Don Cragun


08-08-2017


Hi Don,

I'm trying to capture an English sentence in a record.
This sentence could be very long and span multiple lines in a file.

My perfect record separator would be a period, exclamation point, question mark, semicolon or colon.

However, when my code sees the word "Mr.", it thinks that's the end of the sentence because of the period that is part of "Mr.". So I want it to detect that "Mr." is NOT part of the record separator.

Semantically:
Not (Mr.) but ok with any of these [.!?;:]

But syntactically I don't know how to do this. I'm trying it like this:

Code:

 RS = (^Mr. | [.!?;:])

But it's not working.


1Brajesh
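Awk's extended regular expressions have no "not this word" operator: ! is an ordinary character inside an ERE, and ^ inside a group is an anchor, so "not Mr. but any of [.!?;:]" cannot be written directly in RS. One possible workaround, offered only as a sketch, is to keep the simple RS and glue a record back onto the next one when it ends in a known title. The variable names sentence and n, the title list (Mr|Mrs), and the sample text are all assumptions for illustration, and the code assumes the separator after a title was a period. It also assumes an awk that accepts a regular expression in RS, as gawk does.

```shell
out=$(printf 'Mr. Smith met Mrs. Jones. They talked\nfor hours! Why?\n' |
  awk '
BEGIN { RS = "[.?!;:]" }
{
    sentence = sentence $0
    # A record that ends in a title is not a complete sentence:
    # put the period back and continue with the next record.
    if (sentence ~ /(^|[^A-Za-z0-9])(Mr|Mrs)$/) {
        sentence = sentence "."
        next
    }
    $0 = sentence            # re-split the joined text into fields
    sentence = ""
    if (NF) { $1 = $1; n++; print n ": " $0 }
}')
printf '%s\n' "$out"
# 1: Mr. Smith met Mrs. Jones
# 2: They talked for hours
# 3: Why
```

In gawk the variable RT holds the text that actually matched RS, so the period could be restored from RT instead of being assumed. Text left over in sentence at end of input (a file that ends right after a title) is not printed by this sketch.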
