OpenSolaris 2009.06 - man page for pcre (opensolaris section 3)

PCRE(3) 										  PCRE(3)

NAME
       PCRE - Perl-compatible regular expressions

INTRODUCTION
       The  PCRE library is a set of functions that implement regular expression pattern matching
       using the same syntax and semantics as Perl, with just a few differences. Certain features
       that appeared in Python and PCRE before they appeared in Perl are also available using the
       Python syntax. There is also some support for certain .NET and Oniguruma syntax items, and
       there  is an option for requesting some minor changes that give better JavaScript compati-
       bility.

       The current implementation of PCRE (release 7.x) corresponds approximately with Perl 5.10,
       including  support for UTF-8 encoded strings and Unicode general category properties. How-
       ever, UTF-8 and Unicode support has to be explicitly enabled; it is not the  default.  The
       Unicode tables correspond to Unicode release 5.0.0.

       In  addition to the Perl-compatible matching function, PCRE contains an alternative match-
       ing function that matches the same compiled patterns in a different way. In  certain  cir-
       cumstances,  the  alternative  function	has  some advantages. For a discussion of the two
       matching algorithms, see the pcrematching page.

       PCRE is written in C and released as a C library. A number of people have written wrappers
       and interfaces of various kinds. In particular, Google Inc.  have provided a comprehensive
       C++ wrapper. This is now included as part of the PCRE distribution. The pcrecpp	page  has
       details of this interface. Other people's contributions can be found in the Contrib direc-
       tory at the primary FTP site, which is:


       Details of exactly which Perl regular expression features are and  are  not  supported  by
       PCRE are given in separate documents. See the pcrepattern and pcrecompat pages. There is a
       syntax summary in the pcresyntax page.

       Some features of PCRE can be included, excluded, or changed when the library is built. The
       pcre_config()  function	makes  it  possible  for  a client to discover which features are
       available. The features themselves are described  in  the  pcrebuild  page.  Documentation
       about  building	PCRE for various operating systems can be found in the README file in the
       source distribution.

       The library contains a number of undocumented internal functions and data tables that  are
       used  by  more than one of the exported external functions, but which are not intended for
       use by external callers. Their names all begin with "_pcre_",  which  hopefully	will  not
       provoke	any  name clashes. In some environments, it is possible to control which external
       symbols are exported when a shared library is built, and in these cases	the  undocumented
       symbols are not exported.

USER DOCUMENTATION
       The  user  documentation  for  PCRE comprises a number of different sections. In the "man"
       format, each of these is a separate "man page". In the HTML format,  each  is  a  separate
       page,  linked from the index page. In the plain text format, all the sections are concate-
       nated, for ease of searching. The sections are as follows:

	 pcre		   this document
	 pcre-config	   show PCRE installation configuration information
	 pcreapi	   details of PCRE's native C API
	 pcrebuild	   options for building PCRE
	 pcrecallout	   details of the callout feature
	 pcrecompat	   discussion of Perl compatibility
	 pcrecpp	   details of the C++ wrapper
	 pcregrep	   description of the pcregrep command
	 pcrematching	   discussion of the two matching algorithms
	 pcrepartial	   details of the partial matching facility
	 pcrepattern	   syntax and semantics of supported
			     regular expressions
	 pcresyntax	   quick syntax reference
	 pcreperform	   discussion of performance issues
	 pcreposix	   the POSIX-compatible C API
	 pcreprecompile    details of saving and re-using precompiled patterns
	 pcresample	   discussion of the sample program
	 pcrestack	   discussion of stack usage
	 pcretest	   description of the pcretest testing command

       In addition, in the "man" and HTML formats, there is a short page for each C library func-
       tion, listing its arguments and results.

LIMITATIONS
       There  are  some size limitations in PCRE but it is hoped that they will never in practice
       be relevant.

       The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE is compiled with the
       default	internal  linkage  size of 2. If you want to process regular expressions that are
       truly enormous, you can compile PCRE with an internal linkage size of  3  or  4	(see  the
       README  file  in  the source distribution and the pcrebuild documentation for details). In
       these cases the limit is substantially larger.  However, the speed of execution is slower.

       All values in repeating quantifiers must be less than 65536.

       There is no limit to the number of parenthesized subpatterns, but there	can  be  no  more
       than 65535 capturing subpatterns.

       The maximum length of name for a named subpattern is 32 characters, and the maximum number
       of named subpatterns is 10000.

       The maximum length of a subject string is the largest  positive	number	that  an  integer
       variable can hold. However, when using the traditional matching function, PCRE uses recur-
       sion to handle subpatterns and indefinite repetition.  This means that the available stack
       space  may  limit  the size of a subject string that can be processed by certain patterns.
       For a discussion of stack issues, see the pcrestack documentation.

UTF-8 AND UNICODE PROPERTY SUPPORT
       From release 3.3, PCRE has had some support for character strings  encoded  in  the  UTF-8
       format.	For  release 4.0 this was greatly extended to cover most common requirements, and
       in release 5.0 additional support for Unicode general category properties was added.

       In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in the code,
       and, in addition, you must call pcre_compile() with the PCRE_UTF8 option flag. When you do
       this, both the pattern and any subject strings that are matched against it are treated  as
       UTF-8 strings instead of just strings of bytes.

       If you compile PCRE with UTF-8 support, but do not use it at run time, the library will be
       a bit bigger, but the additional run time overhead is limited  to  testing  the	PCRE_UTF8
       flag occasionally, so should not be very big.

       If  PCRE  is  built with Unicode character property support (which implies UTF-8 support),
       the escape sequences \p{..}, \P{..}, and \X are supported.  The available properties  that
       can  be tested are limited to the general category properties such as Lu for an upper case
       letter or Nd for a decimal number, the Unicode script names such as Arabic or Han, and the
       derived properties Any and L&. A full list is given in the pcrepattern documentation. Only
       the short names for properties are supported. For example, \p{L}  matches  a  letter.  Its
       Perl  synonym,  \p{Letter},  is	not supported.	Furthermore, in Perl, many properties may
       optionally be prefixed by "Is", for compatibility with Perl 5.6.  PCRE  does  not  support
       this.

   Validity of UTF-8 strings

       When  you  set  the  PCRE_UTF8  flag,  the strings passed as patterns and subjects are (by
       default) checked for validity on entry to the relevant  functions.  From  release  7.3  of
       PCRE,  the check is according to the rules of RFC 3629, which are themselves derived from the
       Unicode specification. Earlier releases of PCRE followed the  rules  of	RFC  2279,  which
       allows  the  full  range of 31-bit values (0 to 0x7FFFFFFF). The current check allows only
       values in the range U+0 to U+10FFFF, excluding U+D800 to U+DFFF.

       The excluded code points are the "Low Surrogate Area" of Unicode,  of  which  the  Unicode
       Standard  says  this:  "The Low Surrogate Area does not contain any character assignments,
       consequently no character code charts or namelists are provided for this area.  Surrogates
       are reserved for use with UTF-16 and then must be used in pairs." The code points that are
       encoded by UTF-16 pairs are available as independent code points in  the  UTF-8	encoding.
       (In  other  words,  the	whole  surrogate  thing is a fudge for UTF-16 which unfortunately
       messes up UTF-8.)

       If an invalid UTF-8 string is passed to PCRE,  an  error  return  (PCRE_ERROR_BADUTF8)  is
       given. In some situations, you may already know that your strings are valid, and therefore
       want  to  skip  these  checks  in  order  to  improve  performance.   If   you	set   the
       PCRE_NO_UTF8_CHECK  flag  at compile time or at run time, PCRE assumes that the pattern or
       subject it is given (respectively) contains only valid UTF-8 codes. In this case, it  does
       not diagnose an invalid UTF-8 string.
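
       A caller that sets PCRE_NO_UTF8_CHECK may want to validate strings once, up  front.  The
       following  is a minimal sketch of such a check under the RFC 3629 rules described above.
       The function utf8_is_valid() is hypothetical and is not PCRE's own  validity  test;  it
       also rejects overlong encodings, which is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE.
   Returns 1 if the first len bytes of s are well-formed UTF-8 under
   RFC 3629 (U+0 to U+10FFFF, excluding U+D800 to U+DFFF), otherwise 0. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        unsigned long cp, min;
        size_t extra, k;

        if (b < 0x80) {                      /* single-byte character */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {     /* two-byte lead */
            extra = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {     /* three-byte lead */
            extra = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {     /* four-byte lead */
            extra = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return 0;   /* stray continuation byte, or RFC 2279 5/6-byte lead */
        }

        if (len - i - 1 < extra)
            return 0;                        /* truncated sequence */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                    /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min)
            return 0;                        /* overlong encoding */
        if (cp > 0x10FFFF)
            return 0;                        /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                        /* surrogate code point */

        i += extra + 1;
    }
    return 1;
}
```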

       If  you	pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, what happens depends
       on why the string is invalid. If the string conforms to the "old" definition of UTF-8 (RFC
       2279),  it  is  processed as a string of characters in the range 0 to 0x7FFFFFFF. In other
       words, apart from the initial validity test, PCRE (when in  UTF-8  mode)  handles  strings
       according to the more liberal rules of RFC 2279. However, if the string does not even con-
       form to RFC 2279, the result is undefined. Your program may crash.

       If you want to process strings of values in the full range 0 to 0x7FFFFFFF, encoded  in	a
       UTF-8-like  manner  as  per the old RFC, you can set PCRE_NO_UTF8_CHECK to bypass the more
       restrictive test. However, in this situation, you will have to  apply  your  own  validity
       check.

   General comments about UTF-8 mode

       1. An unbraced hexadecimal escape sequence (such as \xb3) matches a two-byte UTF-8 charac-
       ter if the value is greater than 127.

       2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 characters for values
       greater than \177.

       3.  Repeat  quantifiers	apply  to complete UTF-8 characters, not to individual bytes, for
       example: \x{100}{3}.

       4. The dot metacharacter matches one UTF-8 character instead of a single byte.

       5. The escape sequence \C can be used to match a single byte in UTF-8 mode,  but  its  use
       can lead to some strange effects. This facility is not available in the alternative match-
       ing function, pcre_dfa_exec().

       6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test  characters  of
       any code value, but the characters that PCRE recognizes as digits, spaces, or word charac-
       ters remain the same set as before, all with values less than 256. This remains true  even
       when  PCRE includes Unicode property support, because to do otherwise would slow down PCRE
       in many common cases. If you really want to test for a wider sense of, say,  "digit",  you
       must use Unicode property tests such as \p{Nd}.

       7.  Similarly,  characters that match the POSIX named character classes are all low-valued
       characters.

       8. However, the Perl 5.10 horizontal and vertical whitespace matching escapes (\h, \H, \v,
       and \V) do match all the appropriate Unicode characters.

       9.  Case-insensitive  matching  applies only to characters whose values are less than 128,
       unless PCRE is built with Unicode property support. Even when Unicode property support  is
       available,  PCRE  still uses its own character tables when checking the case of low-valued
       characters, so as not to degrade performance.  The Unicode property  information  is  used
       only  for  characters with higher values. Even when Unicode property support is available,
       PCRE supports case-insensitive matching only when there is a one-to-one mapping between	a
       letter's cases. There are a small number of many-to-one mappings in Unicode; these are not
       supported by PCRE.
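
       Items 1 and 2 above follow from the way UTF-8 encodes values: anything from  128  up  to
       0x7FF  occupies  two  bytes. The sketch below illustrates this with a small encoder. The
       function utf8_encode() is hypothetical and is not part of PCRE; for brevity it does  not
       reject surrogate code points.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE.
   Encodes code point cp (U+0 to U+10FFFF) into out, which must have room
   for 4 bytes. Returns the number of bytes written, or 0 if cp is too
   large. Surrogates are not rejected in this sketch. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                         /* 0 to 127: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                        /* 128 to 0x7FF: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                      /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                    /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

       For  example,  the  value 0xb3 encodes as the two bytes 0xC2 0xB3, and octal 777 (0x1FF)
       encodes as 0xC7 0xBF, whereas octal 177 still fits in a single byte.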


AUTHOR
       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

       Putting an actual email address here seems to have been a spam magnet, so  I've	taken  it
       away.  If you want to email me, use my two initials, followed by the two digits 10, at the
       domain cam.ac.uk.

REVISION
       Last updated: 12 April 2008
       Copyright (c) 1997-2008 University of Cambridge.

ATTRIBUTES
       See attributes(5) for descriptions of the following attributes:

       |Availability	    | SUNWpcre	      |
       |Interface Stability | Uncommitted     |
       Source for PCRE is available on http://opensolaris.org.
