Gawk and regexp


 
# 1  
Old 10-25-2014

Hello,

This is a problem I've worked on a while and can't figure out.

There is a file.txt

Code:
..some stuff..
[[Category:98 births]]
[[Category:2nd C. deaths]]
..some stuff..

The Awk program is trying to extract the year portion of the birth and death categories ("98" and "2nd C.") using the technique below:

Code:
#!/bin/awk -f

@include filefuncs

BEGIN{

  fp = readfile("file.txt")

  birth = years(fp, "births")
  death = years(fp, "deaths")

  print "Birth = "birth
  print "Death = "death

}
function years(fp, type     ,a,b)
{
  if(type == "births") {
    if(match(fp,/\yCategory:[[:punct:]A-z0-9]* births\y/, a)) {
      split(a[0],b,":|births")
      return b[2]
    }
  }
  if(type == "deaths") {
    if(match(fp,/\yCategory:[[:punct:]A-z0-9]* deaths\y/, a)) {
      split(a[0],b,":|deaths")
      return b[2]
    }
  }
  return -1
}

There are other ways to do it via the command line, but I need it inside a function in a script that uses readfile().

The above code returns the correct birth year, but the death year is mangled because the regex is grabbing both the birth and death strings.

It works when I read the file with getline instead of readfile(), using this regex:

Code:
match(article,"Category:[A-z0-9].* deaths", a)

The ".*" grabs everything to the end of the line, and since readfile() makes the entire file a single string, it grabs to the end of that "line" (the file). That's why I'm using the word boundary ("\y"), which works, but not if there is a space in the data, as is the case here with the death string ("2nd C."). I tried adding "[:space:]", but that didn't work. I think this is solvable with the right regex, but I'm out of ideas.
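
To make the greedy behaviour concrete, here is a minimal sketch. It assumes a two-line sample held in one string, standing in for what readfile() would return, and it uses only POSIX awk features (RSTART/RLENGTH instead of gawk's array argument to match()). The variable names are made up for the illustration.

```shell
# Sample text standing in for the string readfile() would return (an assumption).
text='[[Category:98 births]]
[[Category:2nd C. deaths]]'

greedy=$(TEXT="$text" LC_ALL=C awk 'BEGIN {
    s = ENVIRON["TEXT"]
    # In awk "." also matches a newline, so ".*" runs across both lines
    # and the match starts at the first "Category:".
    if (match(s, /Category:[A-z0-9].* deaths/)) {
        m = substr(s, RSTART, RLENGTH)
        gsub(/\n/, "|", m)    # show the embedded newline as "|"
        print m
    }
}')
echo "$greedy"
```

The printed match contains both the births and the deaths category, which is the mangling described above.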

# 2  
Old 10-26-2014
Why does it have to use readfile()? Why not use awk the way it was meant to be used? Is this homework?
# 3  
Old 10-26-2014
OK - I found the problem. The match line should look like this:

Code:
match(article,/\yCategory:[[:space:].A-z0-9]*deaths\y/,a)

The problem was that [:punct:] was matching the [[]] characters in file.txt, so in order to match the "." in "2nd C." it's now listed directly (right before the A-z).

Thanks.

# 4  
Old 10-26-2014
Quote:
Originally Posted by Mid Ocean
OK - I found the problem. The match line should look like this:

Code:
match(article,/\yCategory:[[:space:].A-z0-9]*deaths\y/,a)

The problem was that [:punct:] was matching the [[]] characters in file.txt, so in order to match the "." in "2nd C." it's now listed directly (right before the A-z).

Thanks.

The RE [A-z] also includes the characters [, \, ], ^, _, and `. The RE [[:space:]] matches all whitespace characters; I'm guessing that you just want a space character instead. And, if you're trying to catch common forms of dates, you probably also want to include a comma (for dates like December 25, 1999). So, a better RE would probably be:
Code:
match(article,/\yCategory:[ .,[:alnum:]]* deaths\y/,a)
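
As a rough check of that bracket expression, the sketch below applies it to a whole-file string. Assumptions: the sample text stands in for file.txt, years() is a simplified variant of the function from post #1 rather than the original, years_of is a hypothetical shell wrapper, and the gawk-only \y is left out so the example also runs in other awks.

```shell
# Sample text standing in for the contents of file.txt (an assumption).
text='..some stuff..
[[Category:98 births]]
[[Category:2nd C. deaths]]
..some stuff..'

# years_of is a hypothetical wrapper; years() is a simplified variant of post #1.
years_of() {
    TEXT="$text" TYPE="$1" awk '
    function years(s, type,    re, m) {
        # Space, dot, comma and alphanumerics only: brackets cannot be crossed.
        re = "Category:[ .,[:alnum:]]* " type
        if (match(s, re)) {
            m = substr(s, RSTART, RLENGTH)
            sub(/^Category:/, "", m)    # strip the leading label
            sub(" " type "$", "", m)    # strip the trailing " births" or " deaths"
            return m
        }
        return -1
    }
    BEGIN { print years(ENVIRON["TEXT"], ENVIRON["TYPE"]) }'
}

birth=$(years_of births)
death=$(years_of deaths)
echo "Birth = $birth"
echo "Death = $death"
```

Because "]" is not in the bracket expression, a match that starts in the births category cannot extend into the deaths category.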

This User Gave Thanks to Don Cragun For This Post:
# 5  
Old 10-26-2014
Note: A-z0-9 will probably not do what you want:
Code:
$ echo \[ | grep '[A-z0-9]' 
[

This is because square brackets fall within that range. Moreover, ranges like that also depend on the locale, which could produce other unexpected results. So it would be better to use the [:alnum:] character class inside the brackets instead.
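
A small sketch to see which of a few sample characters land inside [A-z]. It forces the C locale so the range follows ASCII order; the list of characters tested is just an illustrative choice.

```shell
# Test a handful of characters against [A-z] in the C locale.
hits=""
for c in '[' '\' ']' '^' '_' '`' '.' ' '; do
    if printf '%s\n' "$c" | LC_ALL=C grep -q '[A-z]'; then
        hits="$hits$c"    # collect every character the range accepts
    fi
done
echo "matched by [A-z]: $hits"
```

The six characters between "Z" and "a" in ASCII are accepted, while the dot and the space are not.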

---
Also, the code looks a bit convoluted for such a simple task. I don't see why you would need to use gawk and read the entire file into memory, when this could also be done in awk's main pattern-action section, which processes the input line by line and is typically used for this. I would suggest you read up on that.
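
For comparison, a minimal sketch of the line-by-line approach. It assumes sample input standing in for file.txt and sticks to POSIX awk; the variable names are made up for the illustration.

```shell
# Sample input standing in for file.txt (an assumption).
sample='..some stuff..
[[Category:98 births]]
[[Category:2nd C. deaths]]
..some stuff..'

result=$(printf '%s\n' "$sample" | awk '
    # Each input line is one record, so a match cannot spill onto the next line.
    match($0, /Category:.* (births|deaths)\]\]/) {
        s = substr($0, RSTART + 9, RLENGTH - 11)   # drop "Category:" and "]]"
        n = split(s, w, " ")
        kind = w[n]                                # last word: births or deaths
        sub(" " kind "$", "", s)                   # keep only the year part
        print kind ": " s
    }')
echo "$result"
```

Here even a greedy ".*" is harmless, because it can never reach past the end of the current line.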

---
You could perhaps also consider a different line-processing tool, such as GNU sed:
Code:
sed -rn 's/.*\[\[Category:(.*) (births|deaths)\]\].*/\u\2: \1/p' file

That might produce similarly acceptable results:
Code:
Births: 98
Deaths: 2nd C.

# 6  
Old 10-26-2014
Thank you, Don. That is much better. I did not realize A-z was catching the square brackets which appears to be the underlying problem.

Scrutinizer, it's part of a larger Awk script which passes the "fp" variable around to different functions for processing, not a shell script.
