Linux 2.6 - man page for regexec (linux section 3posix)

REGCOMP(P)			    POSIX Programmer's Manual			       REGCOMP(P)

NAME
       regcomp, regerror, regexec, regfree - regular expression matching

SYNOPSIS
       #include <regex.h>

       int regcomp(regex_t *restrict preg, const char *restrict pattern,
	      int cflags);
       size_t regerror(int errcode, const regex_t *restrict preg,
	      char *restrict errbuf, size_t errbuf_size);
       int regexec(const regex_t *restrict preg, const char *restrict string,
	      size_t nmatch, regmatch_t pmatch[restrict], int eflags);
       void regfree(regex_t *preg);

DESCRIPTION
       These  functions interpret basic and extended regular expressions as described in the Base
       Definitions volume of IEEE Std 1003.1-2001, Chapter 9, Regular Expressions.

       The regex_t structure is defined in <regex.h> and contains at least the following member:

		   Member Type	Member Name  Description
		   size_t	re_nsub      Number of parenthesized subexpressions.

       The regmatch_t structure is defined in <regex.h> and contains at least the following  mem-
       bers:

		    Member Type Member Name Description
		    regoff_t	rm_so	    Byte offset from start of string to
					    start of substring.
		    regoff_t	rm_eo	    Byte offset from start of string of the
					    first character after the end of sub-
					    string.

       The regcomp() function shall compile  the  regular  expression  contained  in  the  string
       pointed	to  by	the pattern argument and place the results in the structure pointed to by
       preg.  The cflags argument is the bitwise-inclusive OR of zero or more  of  the	following
       flags, which are defined in the <regex.h> header:

       REG_EXTENDED
	      Use Extended Regular Expressions.

       REG_ICASE
	      Ignore  case  in	match.	(See the Base Definitions volume of IEEE Std 1003.1-2001,
	      Chapter 9, Regular Expressions.)

       REG_NOSUB
	      Report only success/fail in regexec().

       REG_NEWLINE
	      Change the handling of <newline>s, as described in the text.

       The default regular expression type for pattern is a Basic Regular Expression. The  appli-
       cation can specify Extended Regular Expressions using the REG_EXTENDED cflags flag.
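
       As a minimal sketch of how cflags selects the regular expression type, the following
       helper compiles a pattern with caller-supplied flags and reports only success or failure.
       The name matches_with_cflags() is hypothetical and not part of <regex.h>.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of <regex.h>): return 1 if string
 * matches pattern when compiled with cflags, 0 on no match or on
 * a compile error.
 */
int
matches_with_cflags(const char *pattern, const char *string, int cflags)
{
    regex_t re;
    int     status;

    /* REG_NOSUB: only success or failure is wanted here. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return 0;
    status = regexec(&re, string, (size_t) 0, NULL, 0);
    regfree(&re);
    return status == 0;
}
```

       With the pattern "ab+c", the string "abbc" should match only when REG_EXTENDED is given,
       because '+' is an ordinary character in a Basic Regular Expression. Likewise "ABC" should
       match the pattern "abc" only when REG_ICASE is given.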

       If  the REG_NOSUB flag was not set in cflags, then regcomp() shall set re_nsub to the num-
       ber of parenthesized subexpressions (delimited by "\(\)" in basic regular  expressions  or
       "()" in extended regular expressions) found in pattern.

       The  regexec()  function  compares the null-terminated string specified by string with the
       compiled regular expression preg initialized by a previous call to regcomp().  If it finds
       a  match,  regexec() shall return 0; otherwise, it shall return non-zero indicating either
       no match or an error. The eflags argument is the bitwise-inclusive OR of zero or  more  of
       the following flags, which are defined in the <regex.h> header:

       REG_NOTBOL
	      The  first character of the string pointed to by string is not the beginning of the
	      line. Therefore, the circumflex character ( '^' ), when taken as a special  charac-
	      ter, shall not match the beginning of string.

       REG_NOTEOL
	      The  last  character of the string pointed to by string is not the end of the line.
	      Therefore, the dollar sign ( '$' ), when taken as a special  character,  shall  not
	      match the end of string.
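
       The effect of the two eflags values can be seen with a small sketch such as the
       following. The helper name matches_with_eflags() is an assumption made for illustration;
       it compiles a Basic Regular Expression and passes eflags through to regexec().

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if the basic regular expression
 * pattern matches string under the given eflags, 0 otherwise
 * (compile errors are treated as no match).
 */
int
matches_with_eflags(const char *pattern, const char *string, int eflags)
{
    regex_t re;
    int     status;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return 0;
    status = regexec(&re, string, (size_t) 0, NULL, eflags);
    regfree(&re);
    return status == 0;
}
```

       For example, "^abc" should match "abc" with eflags 0 but not with REG_NOTBOL, and "abc$"
       should match "abc" with eflags 0 but not with REG_NOTEOL. An unanchored pattern is not
       affected by either flag.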

       If  nmatch  is  0 or REG_NOSUB was set in the cflags argument to regcomp(), then regexec()
       shall ignore the pmatch argument. Otherwise, the application shall ensure that the  pmatch
       argument points to an array with at least nmatch elements, and regexec() shall fill in the
       elements of that array with offsets of the substrings of string	that  correspond  to  the
       parenthesized subexpressions of pattern: pmatch[i].rm_so shall be the byte offset of the
       beginning and pmatch[i].rm_eo shall be one greater than the byte offset of the end of
       substring i. (Subexpression i begins at the ith matched open parenthesis, counting from
       1.) Offsets in pmatch[0] identify the substring that corresponds to the entire regular
       expression. Unused elements of pmatch up to pmatch[nmatch-1] shall be filled with -1. If
       there are more than nmatch subexpressions in pattern (pattern itself counts as a
       subexpression), then regexec() shall still do the match, but shall record only the first
       nmatch substrings.
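
       As an illustration of how the offsets in pmatch are typically consumed, the sketch below
       matches a fixed Extended Regular Expression and copies out the text of subexpressions 1
       and 2. The pattern and the names copy_span() and split_key_value() are assumptions chosen
       for the example.

```c
#include <regex.h>
#include <string.h>

/* Copy the bytes delimited by *m out of string, truncating to fit dst. */
static void
copy_span(char *dst, size_t dstsz, const char *string, const regmatch_t *m)
{
    size_t len = (size_t) (m->rm_eo - m->rm_so);

    if (dstsz == 0)
        return;
    if (len >= dstsz)
        len = dstsz - 1;
    memcpy(dst, string + m->rm_so, len);
    dst[len] = '\0';
}

/*
 * Hypothetical helper: find "key=value" in string and copy the key
 * (subexpression 1) and the value (subexpression 2) into the caller's
 * buffers.  Returns 1 on a match, 0 otherwise.
 */
int
split_key_value(const char *string, char *key, size_t keysz,
                char *val, size_t valsz)
{
    regex_t    re;
    regmatch_t pm[3];   /* pm[0]: whole match; pm[1], pm[2]: groups */
    int        status;

    if (regcomp(&re, "([a-z]+)=([0-9]+)", REG_EXTENDED) != 0)
        return 0;
    status = regexec(&re, string, 3, pm, 0);
    regfree(&re);
    if (status != 0)
        return 0;
    copy_span(key, keysz, string, &pm[1]);
    copy_span(val, valsz, string, &pm[2]);
    return 1;
}
```

       Because rm_eo is one past the last byte of the substring, rm_eo - rm_so is its length in
       bytes, and string + rm_so is its first byte.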

       When matching a basic or extended regular expression, any given	parenthesized  subexpres-
       sion  of pattern might participate in the match of several different substrings of string,
       or it might not match any substring even though the pattern as a whole did match. The fol-
       lowing rules shall be used to determine which substrings to report in pmatch when matching
       regular expressions:

	1. If subexpression i in a regular expression is not contained within another
	   subexpression, and it participated in the match several times, then the byte offsets
	   in pmatch[i] shall delimit the last such match.

	2. If subexpression i is not contained within another subexpression, and it did not
	   participate in an otherwise successful match, the byte offsets in pmatch[i] shall be
	   -1. A subexpression does not participate in the match when:

	   '*' or "\{\}" appears immediately after the subexpression in a basic regular
	   expression, or '*', '?', or "{}" appears immediately after the subexpression in an
	   extended regular expression, and the subexpression did not match (matched 0 times)

	   or:

	   '|' is used in an extended regular expression to select this subexpression or
	   another, and the other subexpression matched.

	3. If subexpression i is contained within another subexpression j, and i is not contained
	   within any other subexpression that is contained within j, and a match of
	   subexpression j is reported in pmatch[j], then the match or non-match of subexpression
	   i reported in pmatch[i] shall be as described in 1. and 2. above, but within the
	   substring reported in pmatch[j] rather than the whole string. The offsets in pmatch[i]
	   are still relative to the start of string.

	4. If subexpression i is contained in subexpression j, and the byte offsets in pmatch[j]
	   are -1, then the byte offsets in pmatch[i] shall also be -1.

	5. If subexpression i matched a zero-length string, then both byte offsets in pmatch[i]
	   shall be the byte offset of the character or null terminator immediately following the
	   zero-length string.
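
       The rules above can be observed with a sketch like the following, which returns the
       offsets reported for one subexpression of an Extended Regular Expression. The helper name
       subexpr_offsets() and its limit of three subexpressions are assumptions made to keep the
       example short.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: match string against the ERE pattern and store
 * the offsets reported for subexpression i (0 to 3) in *so and *eo.
 * Returns 0 on a match, non-zero otherwise.
 */
int
subexpr_offsets(const char *pattern, const char *string, size_t i,
                long *so, long *eo)
{
    regex_t    re;
    regmatch_t pm[4];
    int        status;

    if (i >= 4)
        return -1;
    status = regcomp(&re, pattern, REG_EXTENDED);
    if (status != 0)
        return status;
    status = regexec(&re, string, 4, pm, 0);
    regfree(&re);
    if (status == 0) {
        *so = (long) pm[i].rm_so;
        *eo = (long) pm[i].rm_eo;
    }
    return status;
}
```

       On a conforming implementation, "(a|b)*c" against "abc" should report offsets 1 and 2 for
       subexpression 1 (rule 1, the last iteration), "(a)|(b)" against "b" should report -1 for
       subexpression 1 (rule 2), "((a)|b)?c" against "c" should report -1 for subexpression 2
       (rule 4), and "x(y*)z" against "xz" should report 1 and 1 for subexpression 1 (rule 5).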

       If, when regexec() is called, the locale is different from when the regular expression was
       compiled, the result is undefined.

       If  REG_NEWLINE	is  not  set  in  cflags,  then a <newline> in pattern or string shall be
       treated as an ordinary character. If REG_NEWLINE is set, then <newline> shall  be  treated
       as an ordinary character except as follows:

	1. A <newline> in string shall not be matched by a period outside a bracket expression or
	   by  any  form  of  a  non-matching  list  (see  the	 Base	Definitions   volume   of
	   IEEE Std 1003.1-2001, Chapter 9, Regular Expressions).

	2. A  circumflex  (  '^' ) in pattern, when used to specify expression anchoring (see the
	   Base Definitions volume of IEEE Std 1003.1-2001, Section 9.3.8, BRE Expression Anchor-
	   ing),  shall  match	the  zero-length  string immediately after a <newline> in string,
	   regardless of the setting of REG_NOTBOL.

	3. A dollar sign ( '$' ) in pattern, when used to  specify  expression	anchoring,  shall
	   match  the  zero-length string immediately before a <newline> in string, regardless of
	   the setting of REG_NOTEOL.
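
       A short sketch of the difference REG_NEWLINE makes follows. The helper name
       match_in_text() and its use_newline parameter are hypothetical; the helper compiles an
       Extended Regular Expression with or without REG_NEWLINE and reports whether it matches
       anywhere in a multi-line string.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if the ERE pattern matches somewhere
 * in text, 0 otherwise.  REG_NEWLINE is added to cflags only when
 * use_newline is non-zero.
 */
int
match_in_text(const char *pattern, const char *text, int use_newline)
{
    regex_t re;
    int     cflags = REG_EXTENDED | REG_NOSUB;
    int     status;

    if (use_newline)
        cflags |= REG_NEWLINE;
    if (regcomp(&re, pattern, cflags) != 0)
        return 0;
    status = regexec(&re, text, (size_t) 0, NULL, 0);
    regfree(&re);
    return status == 0;
}
```

       Against the text "foo\nbar\nbaz", the pattern "^bar$" should match only with REG_NEWLINE
       set, since the anchors then apply at each <newline>. Conversely, "foo.bar" should match
       "foo\nbar" only without REG_NEWLINE, since the period no longer matches a <newline> once
       the flag is set.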

       The regfree() function frees any memory allocated by regcomp() associated with preg.

       The following constants are defined as error return values:

       REG_NOMATCH
	      regexec() failed to match.

       REG_BADPAT
	      Invalid regular expression.

       REG_ECOLLATE
	      Invalid collating element referenced.

       REG_ECTYPE
	      Invalid character class type referenced.

       REG_EESCAPE
	      Trailing '\' in pattern.

       REG_ESUBREG
	      Number in "\digit" invalid or in error.

       REG_EBRACK
	      "[]" imbalance.

       REG_EPAREN
	      "\(\)" or "()" imbalance.

       REG_EBRACE
	      "\{\}" imbalance.

       REG_BADBR
	      Content of "\{\}" invalid: not a number, number too large, more than  two  numbers,
	      first larger than second.

       REG_ERANGE
	      Invalid endpoint in range expression.

       REG_ESPACE
	      Out of memory.

       REG_BADRPT
	      '?', '*', or '+' not preceded by valid regular expression.

       The  regerror()	function  provides  a  mapping from error codes returned by regcomp() and
       regexec() to unspecified printable strings. It generates a  string  corresponding  to  the
       value  of  the  errcode	argument, which the application shall ensure is the last non-zero
       value returned by regcomp() or regexec() with the given value of preg. If errcode  is  not
       such a value, the content of the generated string is unspecified.

       If preg is a null pointer, but errcode is a value returned by a previous call to regexec()
       or regcomp(), the regerror() function still generates an error string corresponding to the
       value of errcode, but it might not be as detailed under some implementations.

       If the errbuf_size argument is not 0, regerror() shall place the generated string into the
       buffer of size errbuf_size bytes pointed to by errbuf. If the string (including the termi-
       nating  null) cannot fit in the buffer, regerror() shall truncate the string and null-ter-
       minate the result.

       If errbuf_size is 0, regerror() shall ignore the errbuf argument, and return the  size  of
       the buffer needed to hold the generated string.

       If  the	preg  argument	to  regexec()  or  regfree() is not a compiled regular expression
       returned by regcomp(), the result is undefined. A preg is no longer treated as a  compiled
       regular expression after it is given to regfree().

RETURN VALUE
       Upon  successful  completion,  the  regcomp() function shall return 0. Otherwise, it shall
       return an integer value indicating an error as described in <regex.h>, and the content  of
       preg  is  undefined.  If  a  code  is  returned,  the  interpretation shall be as given in
       <regex.h>.

       If regcomp() detects an invalid RE, it may return REG_BADPAT, or it may return one of  the
       error codes that more precisely describes the error.

       Upon  successful  completion,  the  regexec() function shall return 0. Otherwise, it shall
       return REG_NOMATCH to indicate no match.

       Upon successful completion, the regerror() function  shall  return  the	number	of  bytes
       needed  to hold the entire generated string, including the null termination. If the return
       value is greater than errbuf_size, the string returned in the buffer pointed to by  errbuf
       has been truncated.

       The regfree() function shall not return a value.

ERRORS
       No errors are defined.

       The following sections are informative.

EXAMPLES
	      #include <regex.h>
	      #include <stddef.h>     /* For NULL. */

	      /*
	       * Match string against the extended regular expression in
	       * pattern, treating errors as no match.
	       *
	       * Return 1 for match, 0 for no match.
	       */

	      int
	      match(const char *string, char *pattern)
	      {
		  int	 status;
		  regex_t    re;

		  if (regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB) != 0) {
		      return(0);      /* Report error. */
		  }
		  status = regexec(&re, string, (size_t) 0, NULL, 0);
		  regfree(&re);
		  if (status != 0) {
		      return(0);      /* Report error. */
		  }
		  return(1);
	      }

       The  following  demonstrates  how the REG_NOTBOL flag could be used with regexec() to find
       all substrings in a line that match a pattern supplied by a user. (For simplicity  of  the
       example, very little error checking is done.)

	      const char *p = buffer;  /* Start of the text not yet searched. */

	      (void) regcomp (&re, pattern, 0);
	      /* This call to regexec() finds the first match on the line. */
	      error = regexec (&re, p, 1, &pm, 0);
	      while (error == 0) {  /* While matches found. */
		  /* Substring found between p + pm.rm_so and p + pm.rm_eo
		     (offsets are relative to the string passed to regexec()). */
		  p += pm.rm_eo;
		  /* This call to regexec() finds the next match. */
		  error = regexec (&re, p, 1, &pm, REG_NOTBOL);
	      }
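
       A self-contained variant of the loop above might look like the following sketch. The
       helper name count_matches() is hypothetical. It keeps a pointer to the unsearched
       remainder of the line, because the offsets returned by each call are relative to the
       string passed to that call, and it steps over empty matches so that the loop terminates.
       Stepping by one byte assumes a single-byte character encoding.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the non-overlapping matches of the basic
 * regular expression pattern in line.  Returns -1 if the pattern does
 * not compile.
 */
int
count_matches(const char *pattern, const char *line)
{
    regex_t     re;
    regmatch_t  pm;
    const char *p = line;   /* start of the unsearched remainder */
    int         eflags = 0;
    int         count = 0;

    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    while (regexec(&re, p, 1, &pm, eflags) == 0) {
        count++;
        p += pm.rm_eo;                /* offsets are relative to p */
        if (pm.rm_so == pm.rm_eo) {   /* empty match: step past it */
            if (*p == '\0')
                break;
            p++;                      /* assumes single-byte characters */
        }
        eflags = REG_NOTBOL;          /* p no longer starts the line */
    }
    regfree(&re);
    return count;
}
```

       With REG_NOTBOL set on every call after the first, the pattern "^ab" should be counted
       once in "abab" rather than twice.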

APPLICATION USAGE
       An application could use:

	      regerror(code,preg,(char *)NULL,(size_t)0)

       to find out how big a buffer is needed for the generated string, malloc() a buffer to hold
       the string, and then call regerror() again to get  the  string.	Alternatively,	it  could
       allocate a fixed, static buffer that is big enough to hold most strings, and then use mal-
       loc() to allocate a larger buffer if it finds that this is too small.
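
       The first approach could be sketched as follows. The helper name regex_error_string() is
       an assumption made for illustration; the caller is responsible for freeing the returned
       buffer.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a malloc()ed copy of the message for
 * errcode, or NULL if memory is exhausted.  The caller frees it.
 */
char *
regex_error_string(int errcode, const regex_t *preg)
{
    size_t  needed;
    char   *buf;

    /* First call: errbuf_size 0 asks only for the required size. */
    needed = regerror(errcode, preg, (char *) NULL, (size_t) 0);
    if (needed == 0)
        needed = 1;           /* defensive: room for a terminator */
    buf = malloc(needed);
    if (buf == NULL)
        return NULL;
    buf[0] = '\0';
    /* Second call: the buffer is large enough, so nothing is truncated. */
    (void) regerror(errcode, preg, buf, needed);
    return buf;
}
```

       A typical caller would pass the non-zero value returned by regcomp() together with the
       same preg, print the resulting string, and then free() it.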

       To match a pattern as described in the Shell and Utilities volume of IEEE Std 1003.1-2001,
       Section 2.13, Pattern Matching Notation, use the fnmatch() function.

RATIONALE
       The regexec() function must fill in all nmatch elements of pmatch, where nmatch and pmatch
       are supplied by the application, even if some elements of pmatch do not correspond to
       subexpressions in pattern. The application writer should note that there is probably no
       reason for using a value of nmatch that is larger than preg->re_nsub+1.
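
       One way to follow this advice is to size the pmatch array from re_nsub after compiling,
       as in the sketch below. The helper name count_participating() is hypothetical; it
       allocates re_nsub+1 elements and counts how many of them were not filled with -1.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: match string against the ERE pattern using a
 * pmatch array sized from re_nsub, and return how many entries
 * (including pmatch[0]) took part in the match.  Returns -1 on a
 * compile error, allocation failure, or no match.
 */
int
count_participating(const char *pattern, const char *string)
{
    regex_t     re;
    regmatch_t *pm;
    size_t      nmatch, i;
    int         n = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    nmatch = re.re_nsub + 1;   /* one per subexpression, plus pmatch[0] */
    pm = malloc(nmatch * sizeof *pm);
    if (pm != NULL && regexec(&re, string, nmatch, pm, 0) == 0) {
        n = 0;
        for (i = 0; i < nmatch; i++)
            if (pm[i].rm_so != -1)
                n++;
    }
    free(pm);
    regfree(&re);
    return n;
}
```

       For the pattern "(a)(b)?(c)" and the string "ac", the array has four elements, and the
       one for the optional subexpression should hold -1.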

       The REG_NEWLINE flag supports a use of RE matching that is  needed  in  some  applications
       like text editors. In such applications, the user supplies an RE asking the application to
       find a line that matches the given expression. An anchor in such  an  RE  anchors  at  the
       beginning  or  end of any line. Such an application can pass a sequence of <newline>-sepa-
       rated lines to regexec() as a single long string and specify REG_NEWLINE to  regcomp()  to
       get  the  desired  behavior.  The application must ensure that there are no explicit <new-
       line>s in pattern if it wants to ensure that any match occurs  entirely	within	a  single
       line.

       The  REG_NEWLINE flag affects the behavior of regexec(), but it is in the cflags parameter
       to regcomp() to allow flexibility of implementation. Some  implementations  will  want  to
       generate  the  same  compiled RE in regcomp() regardless of the setting of REG_NEWLINE and
       have regexec() handle anchors differently based on the setting of the flag.  Other  imple-
       mentations will generate different compiled REs based on the REG_NEWLINE.

       The  REG_ICASE flag supports the operations taken by the grep -i option and the historical
       implementations of ex and vi.  Including this flag will make  it  easier  for  application
       code to be written that does the same thing as these utilities.

       The substrings reported in pmatch[] are defined using offsets from the start of the string
       rather than pointers. Since this is a new interface, there should be no impact on histori-
       cal  implementations  or applications, and offsets should be just as easy to use as point-
       ers. The change to offsets was made to facilitate future extensions in which the string to
       be  searched is presented to regexec() in blocks, allowing a string to be searched that is
       not all in memory at once.

       The type regoff_t is used for the elements of pmatch[] to ensure that the application  can
       represent  either  the largest possible array in memory (important for an application con-
       forming to the Shell and Utilities volume of IEEE Std 1003.1-2001) or the largest possible
       file  (important  for  an  application  using  the  extension  where a file is searched in
       chunks).

       The standard developers rejected the inclusion of a regsub() function that would  be  used
       to  do substitutions for a matched RE. While such a routine would be useful to some appli-
       cations, its utility would be much more limited than the matching function described here.
       Both RE parsing and substitution are possible to implement without support other than that
       required by the ISO C standard, but matching is much more complex than substituting.   The
       only difficult part of substitution, given the information supplied by regexec(), is find-
       ing the next character in a string when there can be multi-byte characters. That is a much
       larger issue, and one that needs a more general solution.

       The  errno  variable  has  not been used for error returns to avoid filling the errno name
       space for this feature.

       The interface is defined so that the matched substrings rm_so and rm_eo are in a separate
       regmatch_t  structure  instead  of in regex_t. This allows a single compiled RE to be used
       simultaneously in several contexts; in main() and a signal handler, perhaps, or in  multi-
       ple  threads  of  lightweight  processes. (The preg argument to regexec() is declared with
       type const, so the implementation is not permitted to use the structure to store  interme-
       diate results.) It also allows an application to request an arbitrary number of substrings
       from an RE. The number of subexpressions in the RE is reported in re_nsub in  preg.   With
       this change to regexec(), consideration was given to dropping the REG_NOSUB flag since the
       user can now specify this with a zero nmatch  argument  to  regexec().	However,  keeping
       REG_NOSUB  allows  an implementation to use a different (perhaps more efficient) algorithm
       if it knows in regcomp() that no subexpressions need be reported.  The  implementation  is
       only  required  to fill in pmatch if nmatch is not zero and if REG_NOSUB is not specified.
       Note that the size_t type, as defined in the ISO C standard, is unsigned, so the  descrip-
       tion of regexec() does not need to address negative values of nmatch.

       REG_NOTBOL  was added to allow an application to do repeated searches for the same pattern
       in a line. If the pattern contains a circumflex character that should match the	beginning
       of  a  line,  then the pattern should only match when matched against the beginning of the
       line. Without the REG_NOTBOL flag, the application could rewrite the expression for subse-
       quent matches, but in the general case this would require parsing the expression. The need
       for REG_NOTEOL is not as clear; it was added for symmetry.

       The addition of the regerror() function	addresses  the	historical  need  for  conforming
       application  programs  to  have	access to error information more than "Function failed to
       compile/match your RE for unknown reasons".

       This interface provides for two different methods of dealing with  error  conditions.  The
       specific error codes (REG_EBRACE, for example), defined in <regex.h>, allow an application
       to recover from an error if it is so able. Many applications, especially  those	that  use
       patterns supplied by a user, will not try to deal with specific error cases, but will just
       use regerror() to obtain a human-readable error message to present to the user.

       The regerror() function uses a scheme similar to confstr() to deal  with  the  problem  of
       allocating memory to hold the generated string. The scheme used by strerror() in the ISO C
       standard was considered unacceptable since  it  creates	difficulties  for  multi-threaded
       applications.

       The  preg argument is provided to regerror() to allow an implementation to generate a more
       descriptive message than would be possible with errcode alone.  An  implementation  might,
       for  example,  save  the  character  offset of the offending character of the pattern in a
       field of preg, and then include that in the generated message string.  The  implementation
       may also ignore preg.

       A  REG_FILENAME flag was considered, but omitted. This flag caused regexec() to match pat-
       terns as described in the Shell and  Utilities  volume  of  IEEE Std 1003.1-2001,  Section
       2.13,  Pattern  Matching  Notation  instead  of	REs.  This service is now provided by the
       fnmatch() function.

       Notice that there is a difference in philosophy between the ISO POSIX-2:1993 standard  and
       IEEE Std 1003.1-2001  in  how  to  handle a "bad" regular expression. The ISO POSIX-2:1993
       standard says that many bad constructs "produce undefined results", or that "the interpre-
       tation  is undefined". IEEE Std 1003.1-2001, however, says that the interpretation of such
       REs is unspecified. The term "undefined" means that the action by the  application  is  an
       error, of similar severity to passing a bad pointer to a function.

       The regcomp() and regexec() functions are required to accept any null-terminated string as
       the pattern argument. If the meaning of the string is "undefined",  the	behavior  of  the
       function  is  "unspecified".  IEEE Std 1003.1-2001 does not specify how the functions will
       interpret the pattern; they might return error codes, or they might do pattern matching in
       some completely unexpected way, but they should not do something like abort the process.

FUTURE DIRECTIONS
       None.

SEE ALSO
       fnmatch(), glob(), Shell and Utilities volume of IEEE Std 1003.1-2001, Section 2.13,
       Pattern Matching Notation, Base Definitions volume of IEEE Std 1003.1-2001, Chapter 9,
       Regular Expressions, <regex.h>, <sys/types.h>

COPYRIGHT
       Portions  of  this  text  are  reprinted  and  reproduced in electronic form from IEEE Std
       1003.1, 2003 Edition, Standard for Information Technology  --  Portable	Operating  System
       Interface  (POSIX), The Open Group Base Specifications Issue 6, Copyright (C) 2001-2003 by
       the Institute of Electrical and Electronics Engineers, Inc and  The  Open  Group.  In  the
       event  of  any  discrepancy  between this version and the original IEEE and The Open Group
       Standard, the original IEEE and The Open Group Standard is the referee document. The orig-
       inal Standard can be obtained online at http://www.opengroup.org/unix/online.html .

IEEE/The Open Group			       2003				       REGCOMP(P)

