Post 302922517 by Mid Ocean on Saturday 25th of October 2014 08:05:24 PM
Gawk and regexp

Hello,

I've worked on this problem for a while and can't figure it out.

There is a file, file.txt:

Code:
..some stuff..
[[Category:98 births]]
[[Category:2nd C. deaths]]
..some stuff..

The Awk program is trying to extract the year portion of the birth and death categories ("98" and "2nd C.") using the technique below:

Code:
#!/bin/awk -f

@include filefuncs

BEGIN{

  fp = readfile("file.txt")

  birth = years(fp, "births")
  death = years(fp, "deaths")

  print "Birth = "birth
  print "Death = "death

}
# Return the text between "Category:" and " births"/" deaths", or -1 if no match.
function years(fp, type     ,a,b)
{
  if(type == "births") {
    if(match(fp,/\yCategory:[[:punct:]A-z0-9]* births\y/, a)) {
      split(a[0],b,":|births")
      return b[2]
    }
  }
  if(type == "deaths") {
    if(match(fp,/\yCategory:[[:punct:]A-z0-9]* deaths\y/, a)) {
      split(a[0],b,":|deaths")
      return b[2]
    }
  }
  return -1
}

There are other ways to do it via the command line, but I need it inside a function in a script that uses readfile().

The above code returns the correct birth year, but the death year is mangled because the regex is grabbing both the birth and death strings.

It works when I use getline instead of the readfile() function, with this regex:

Code:
match(article,"Category:[A-z0-9].* deaths", a)

The ".*" grabs everything to the end of the line, and since readfile() turns the entire file into a single string, it grabs to the end of the "line" (the whole file). That's why I'm using the word boundary ("\y"), which works, but not when the data contains a space, as with the death string here ("2nd C."). I tried adding "[:space:]" but that didn't work. I think this is solvable with the right regex, but I'm out of ideas.
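
One possible direction, offered as a sketch and not run against the real data: replace ".*" with the negated bracket expression "[^]]*", so the match may contain spaces but can never run past the "]]" that closes a category link. The sample text, the category_years wrapper name, and the line-by-line slurp (standing in for readfile()) are assumptions made for illustration. The years() function itself uses only match(), RSTART and RLENGTH, so it should paste into a gawk script unchanged.

```shell
# Hypothetical sample input mirroring file.txt from the post.
sample='..some stuff..
[[Category:98 births]]
[[Category:2nd C. deaths]]
..some stuff..'

# category_years is an illustrative wrapper name, not part of the original script.
category_years() {
  awk '
    # Return the text between "Category:" and " <type>", or -1 if absent.
    function years(fp, type,    re, s) {
      # [^]]* matches any run of characters except "]", so spaces are
      # allowed but the match cannot cross the closing "]]" of a link.
      re = "Category:[^]]* " type
      if (match(fp, re)) {
        s = substr(fp, RSTART, RLENGTH)
        # Drop the 9-character "Category:" prefix and the " <type>" suffix.
        return substr(s, 10, length(s) - 10 - length(type))
      }
      return -1
    }
    # Slurp all input into one string, standing in for readfile().
    { fp = fp $0 "\n" }
    END {
      print "Birth = " years(fp, "births")
      print "Death = " years(fp, "deaths")
    }
  '
}

printf '%s\n' "$sample" | category_years
```

With the sample above this prints "Birth = 98" and "Death = 2nd C.". Because the bracket expression excludes only "]", it also sidesteps the "A-z" range, which takes in several punctuation characters that sit between "Z" and "a" in ASCII.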

Last edited by Scrutinizer; 10-26-2014 at 03:36 AM.. Reason: extra code tags
 
