Gawk and regexp

10-25-2014

Hello,

This is a problem I've worked on a while and can't figure out.

There is a file.txt

Code:

..some stuff..
[[Category:98 births]]
[[Category:2nd C. deaths]]
..some stuff..

The Awk program is trying to extract the year portion of the birth and death categories ("98" and "2nd C.") using the technique below:

Code:

#!/bin/awk -f

@include filefuncs

BEGIN {
  # Slurp the whole file into one string.
  fp = readfile("file.txt")

  birth = years(fp, "births")
  death = years(fp, "deaths")

  print "Birth = " birth
  print "Death = " death
}

# Return the text between "Category:" and "births"/"deaths", or -1 if absent.
# a and b are local arrays.
function years(fp, type,    a, b)
{
  if (type == "births") {
    if (match(fp, /\yCategory:[[:punct:]A-z0-9]* births\y/, a)) {
      split(a[0], b, ":|births")
      return b[2]
    }
  }
  if (type == "deaths") {
    if (match(fp, /\yCategory:[[:punct:]A-z0-9]* deaths\y/, a)) {
      split(a[0], b, ":|deaths")
      return b[2]
    }
  }
  return -1
}

There are other ways to do it via the command line, but I need it inside a function in a script using readfile().

The above code returns the correct birth year, but the death year is mangled because the regex is grabbing both the birth and death strings.

It works when I read the file line by line with getline instead of readfile(), using this regex:

Code:

match(article,"Category:[A-z0-9].* deaths", a)

The ".*" grabs everything to the end of the line, and since readfile() makes the entire file a single string, it grabs to the end of the "line" (the file). That's why I'm using the word boundary ("\y"), which works, but not if there is a space in the data, as with the death string ("2nd C."). I tried adding "[:space:]" but that didn't work. I think this is solvable with the right regex, but I'm out of ideas.

Last edited by Scrutinizer; 10-26-2014 at 03:36 AM.. Reason: extra code tags

Mid Ocean

10-26-2014

Why does it have to use readfile()? Why not use awk the way it was meant to be used? Is this homework?

neutronscott

10-26-2014

OK - I found the problem. The match line should look like this:

Code:

match(article,/\yCategory:[[:space:].A-z0-9]*deaths\y/,a)

The problem was that [:punct:] was matching the [[ and ]] characters in file.txt, so in order to match the "." in "2nd C." it's now listed directly (right before the A-z).

Thanks.

Last edited by Scrutinizer; 10-26-2014 at 03:36 AM.. Reason: code tags

Mid Ocean

10-26-2014

Quote:

Originally Posted by Mid Ocean

OK - I found the problem. The match line should look like this:

Code:

match(article,/\yCategory:[[:space:].A-z0-9]*deaths\y/,a)

The problem was that [:punct:] was matching the [[ and ]] characters in file.txt, so in order to match the "." in "2nd C." it's now listed directly (right before the A-z).

Thanks.

The RE [A-z] also includes the characters [, \, ], ^, _, and `. The RE [[:space:]] matches all whitespace characters; I'm guessing that you just want a space character instead. And, if you're trying to catch common forms of dates, you probably also want to include a comma (for dates like December 25, 1999). So, a better RE would probably be:

Code:

match(article,/\yCategory:[ .,[:alnum:]]* deaths\y/,a)
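For reference, here is a minimal, portable sketch of the same idea as a shell function wrapping POSIX awk. It is not the poster's script: the name extract_year is hypothetical, the file is slurped in an END block rather than with readfile(), and the gawk-only \y and the third argument to match() are replaced with RSTART/RLENGTH. It assumes an awk that supports POSIX character classes such as [:alnum:].

```shell
#!/bin/sh
# Hypothetical helper: print the text between "Category:" and " births"/" deaths".
# $1 = file to read, $2 = "births" or "deaths". Prints -1 if nothing matches.
extract_year() {
  awk -v type="$2" '
    # Collect the whole file into one string, roughly what readfile() returns.
    { text = text $0 "\n" }
    END {
      # The bracket list allows only space, dot, comma, letters and digits,
      # so the match cannot run across "]]" into the next category.
      re = "Category:[ .,[:alnum:]]* " type
      if (match(text, re)) {
        hit = substr(text, RSTART, RLENGTH)
        sub(/^Category:/, "", hit)   # drop the leading label
        sub(" " type "$", "", hit)   # drop the trailing " births"/" deaths"
        print hit
      } else {
        print -1
      }
    }' "$1"
}
```

Calling extract_year file.txt deaths on the sample file from the first post should print the "2nd C." part without the surrounding brackets.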

Don Cragun

10-26-2014

Note: A-z0-9 will probably not do what you want:

Code:

$ echo \[ | grep '[A-z0-9]' 
[

This is because square brackets fall within that range. Moreover, ranges like that are also dependent on locale which could produce other unexpected results. So it would be better to use [:alnum:] instead.
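To see the whole effect rather than just the "[" case, a small sketch like the following counts how many of the six ASCII characters that sit between "Z" and "a" are matched by a given bracket expression. The function name count_matches is made up for illustration, and LC_ALL=C is forced so that the range is interpreted by byte value; other locales may behave differently.

```shell
#!/bin/sh
# Hypothetical helper: count how many of [ \ ] ^ _ ` match the bracket
# expression given in $1 (for example '[A-z]' or '[[:alnum:]]').
count_matches() {
  # One character per line; grep -c prints the number of matching lines.
  # "|| true" hides grep's non-zero exit status when the count is 0.
  printf '%s\n' '[' '\' ']' '^' '_' '`' | LC_ALL=C grep -c "$1" || true
}
```

In the C locale, count_matches '[A-z]' should report all six characters as matching, while count_matches '[[:alnum:]]' should report none.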

---
Also, the code looks a bit convoluted for such a simple task. I don't see why you would need to use gawk and read the entire file into memory when this could also be done in awk's main pattern-action section, which processes input line by line and is typically used for this. I would suggest you read up on that.
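A rough sketch of what that line-at-a-time approach could look like, wrapped in a shell function so it can be tried directly. The name category_years and the output format are assumptions for illustration only, and the bracket list is the one suggested earlier in the thread. It uses only POSIX awk features (match() with RSTART/RLENGTH).

```shell
#!/bin/sh
# Hypothetical helper: print "births: <text>" / "deaths: <text>" for each
# matching Category line in the file named by $1.
category_years() {
  awk '
    # The pattern runs once per input line; no slurping is needed.
    match($0, /Category:[ .,[:alnum:]]* (births|deaths)/) {
      # Skip the 9 characters of "Category:" at the start of the match.
      hit  = substr($0, RSTART + 9, RLENGTH - 9)
      # "births" and "deaths" are both 6 characters long.
      type = substr(hit, length(hit) - 5)
      # Drop the trailing space plus the 6-character type word.
      year = substr(hit, 1, length(hit) - 7)
      print type ": " year
    }' "$1"
}
```

Because each line is handled on its own, a greedy match can never run past the end of the category it started in.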

---
You could perhaps also consider a different line-processing tool, such as GNU sed:

Code:

sed -rn 's/.*\[\[Category:(.*) (births|deaths)\]\].*/\u\2: \1/p' file

Which would maybe produce similarly acceptable results:

Code:

Births: 98
Deaths: 2nd C.

Scrutinizer

10-26-2014

Thank you, Don. That is much better. I did not realize A-z was catching the square brackets which appears to be the underlying problem.

Scrutinizer, it's part of a larger Awk script which passes the "fp" variable around to different functions for processing, not a shell script.

Last edited by Mid Ocean; 10-29-2014 at 01:28 AM..

Mid Ocean
