X11R7.4 - man page for pcre (x11r4 section 3)

PCRE(3) 			     Library Functions Manual				  PCRE(3)

NAME
       PCRE - Perl-compatible regular expressions


INTRODUCTION
       The  PCRE library is a set of functions that implement regular expression pattern matching
       using the same syntax and semantics as Perl, with just a few differences. Certain features
       that appeared in Python and PCRE before they appeared in Perl are also available using the
       Python syntax. There is also some support for certain .NET and Oniguruma syntax items, and
       there  is an option for requesting some minor changes that give better JavaScript compati-
       bility.

       The current implementation of PCRE (release 7.x) corresponds approximately with Perl 5.10,
       including  support for UTF-8 encoded strings and Unicode general category properties. How-
       ever, UTF-8 and Unicode support has to be explicitly enabled; it is not the  default.  The
       Unicode tables correspond to Unicode release 5.1.

       In  addition to the Perl-compatible matching function, PCRE contains an alternative match-
       ing function that matches the same compiled patterns in a different way. In  certain  cir-
       cumstances,  the  alternative  function  has  some advantages. For a discussion of the two
       matching algorithms, see the pcrematching page.

       PCRE is written in C and released as a C library. A number of people have written wrappers
       and interfaces of various kinds. In particular, Google Inc.  have provided a comprehensive
       C++ wrapper. This is now included as part of the PCRE distribution. The pcrecpp  page  has
       details of this interface. Other people's contributions can be found in the Contrib direc-
       tory at the primary FTP site, which is:


       Details of exactly which Perl regular expression features are and  are  not  supported  by
       PCRE are given in separate documents. See the pcrepattern and pcrecompat pages. There is a
       syntax summary in the pcresyntax page.

       Some features of PCRE can be included, excluded, or changed when the library is built. The
       pcre_config()  function  makes  it  possible  for  a client to discover which features are
       available. The features themselves are described  in  the  pcrebuild  page.  Documentation
       about  building  PCRE for various operating systems can be found in the README file in the
       source distribution.

       The library contains a number of undocumented internal functions and data tables that  are
       used  by  more than one of the exported external functions, but which are not intended for
       use by external callers. Their names all begin with "_pcre_",  which  hopefully  will  not
       provoke  any  name clashes. In some environments, it is possible to control which external
       symbols are exported when a shared library is built, and in these cases  the  undocumented
       symbols are not exported.


USER DOCUMENTATION
       The  user  documentation  for  PCRE comprises a number of different sections. In the "man"
       format, each of these is a separate "man page". In the HTML format,  each  is  a  separate
       page,  linked from the index page. In the plain text format, all the sections are concate-
       nated, for ease of searching. The sections are as follows:

	 pcre		   this document
	 pcre-config	   show PCRE installation configuration information
	 pcreapi	   details of PCRE's native C API
	 pcrebuild	   options for building PCRE
	 pcrecallout	   details of the callout feature
	 pcrecompat	   discussion of Perl compatibility
	 pcrecpp	   details of the C++ wrapper
	 pcregrep	   description of the pcregrep command
	 pcrematching	   discussion of the two matching algorithms
	 pcrepartial	   details of the partial matching facility
	 pcrepattern	   syntax and semantics of supported
			     regular expressions
	 pcresyntax	   quick syntax reference
	 pcreperform	   discussion of performance issues
	 pcreposix	   the POSIX-compatible C API
	 pcreprecompile    details of saving and re-using precompiled patterns
	 pcresample	   discussion of the sample program
	 pcrestack	   discussion of stack usage
	 pcretest	   description of the pcretest testing command

       In addition, in the "man" and HTML formats, there is a short page for each C library func-
       tion, listing its arguments and results.


LIMITATIONS
       There  are  some size limitations in PCRE but it is hoped that they will never in practice
       be relevant.

       The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE is compiled with the
       default  internal  linkage  size of 2. If you want to process regular expressions that are
       truly enormous, you can compile PCRE with an internal linkage size of  3  or  4  (see  the
       README  file  in  the source distribution and the pcrebuild documentation for details). In
       these cases the limit is substantially larger.  However, the speed of execution is slower.

       All values in repeating quantifiers must be less than 65536.

       There is no limit to the number of parenthesized subpatterns, but there  can  be  no  more
       than 65535 capturing subpatterns.

       The  maximum  length  of a name for a named subpattern is 32 characters, and the maximum
       number of named subpatterns is 10000.
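
       The numeric limits above can be checked before a pattern is generated  or  passed  to  the
       library.  The following is a minimal sketch, not part of PCRE: the constant and function
       names are hypothetical, and only the numbers are taken from this page. It assumes that a
       name's length can be measured in bytes.

```c
#include <stdbool.h>
#include <string.h>

/* Limits quoted from this page. The identifiers are illustrative only
 * and are NOT defined by pcre.h. */
enum {
    MAX_QUANTIFIER_VALUE  = 65535,  /* "must be less than 65536" */
    MAX_CAPTURING_GROUPS  = 65535,
    MAX_GROUP_NAME_LENGTH = 32,
    MAX_NAMED_GROUPS      = 10000
};

/* True if a repeat count such as the 3 in a{3} is within the limit. */
static bool quantifier_value_ok(unsigned long n)
{
    return n <= MAX_QUANTIFIER_VALUE;
}

/* True if a subpattern name is no longer than the documented maximum. */
static bool group_name_length_ok(const char *name)
{
    return strlen(name) <= MAX_GROUP_NAME_LENGTH;
}

/* True if the group counts of a pattern are within the limits. */
static bool group_counts_ok(unsigned long capturing, unsigned long named)
{
    return capturing <= MAX_CAPTURING_GROUPS && named <= MAX_NAMED_GROUPS;
}
```

       A pattern generator could call these helpers and reject oversized input early instead of
       waiting for a compile-time error.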

       The maximum length of a subject string is the largest  positive  number  that  an  integer
       variable can hold. However, when using the traditional matching function, PCRE uses recur-
       sion to handle subpatterns and indefinite repetition.  This means that the available stack
       space  may  limit  the size of a subject string that can be processed by certain patterns.
       For a discussion of stack issues, see the pcrestack documentation.
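
       Because the subject length must fit in an integer variable, a caller that holds a length
       as a size_t may want to check it before narrowing. This is a minimal sketch; the function
       name is hypothetical and is not part of the PCRE API.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* True if a subject of this many bytes can be described by an int,
 * which is the limit stated on this page. Hypothetical helper. */
static bool subject_length_ok(size_t length)
{
    return length <= (size_t)INT_MAX;
}
```

       This check covers only the length limit. It says nothing about stack usage, which depends
       on the pattern as well as the subject.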


UTF-8 AND UNICODE PROPERTY SUPPORT
       From release 3.3, PCRE has had some support for character strings  encoded  in  the  UTF-8
       format.  For  release 4.0 this was greatly extended to cover most common requirements, and
       in release 5.0 additional support for Unicode general category properties was added.

       In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in the code,
       and, in addition, you must call pcre_compile() with the PCRE_UTF8 option flag, or the pat-
       tern must start with the sequence (*UTF8). When either of these is the case, both the pat-
       tern  and  any  subject  strings  that are matched against it are treated as UTF-8 strings
       instead of just strings of bytes.

       If you compile PCRE with UTF-8 support, but do not use it at run time, the library will be
       a  bit  bigger,  but  the additional run time overhead is limited to testing the PCRE_UTF8
       flag occasionally, so should not be very big.

       If PCRE is built with Unicode character property support (which  implies  UTF-8  support),
       the  escape sequences \p{..}, \P{..}, and \X are supported.  The available properties that
       can be tested are limited to the general category properties such as Lu for an upper  case
       letter or Nd for a decimal number, the Unicode script names such as Arabic or Han, and the
       derived properties Any and L&. A full list is given in the pcrepattern documentation. Only
       the  short  names  for  properties are supported. For example, \p{L} matches a letter. Its
       Perl synonym, \p{Letter}, is not supported.  Furthermore, in  Perl,  many  properties  may
       optionally  be  prefixed  by  "Is", for compatibility with Perl 5.6. PCRE does not support
       this.

   Validity of UTF-8 strings

       When you set the PCRE_UTF8 flag, the strings passed  as  patterns  and  subjects  are  (by
       default)  checked  for  validity  on  entry to the relevant functions. From release 7.3 of
       PCRE, the check is according to the rules of RFC 3629, which are themselves derived from the
       Unicode  specification.  Earlier  releases  of  PCRE followed the rules of RFC 2279, which
       allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current  check  allows  only
       values in the range U+0 to U+10FFFF, excluding U+D800 to U+DFFF.
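
       The two value ranges described above can be written as simple predicates. This is a sketch
       with hypothetical function names. It tests only whether a decoded value is in range; it
       does not examine how the value was encoded as bytes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Range accepted by the current check (RFC 3629):
 * U+0 to U+10FFFF, excluding U+D800 to U+DFFF. */
static bool code_point_ok_rfc3629(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;                    /* beyond the Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;                    /* reserved for UTF-16 surrogates */
    return true;
}

/* Range accepted by the older rules (RFC 2279): any 31-bit value. */
static bool code_point_ok_rfc2279(uint32_t cp)
{
    return cp <= 0x7FFFFFFF;
}
```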

       The  excluded  code  points  are the "Low Surrogate Area" of Unicode, of which the Unicode
       Standard says this: "The Low Surrogate Area does not contain  any  character  assignments,
       consequently  no character code charts or namelists are provided for this area. Surrogates
       are reserved for use with UTF-16 and then must be used in pairs." The code points that are
       encoded  by  UTF-16  pairs are available as independent code points in the UTF-8 encoding.
       (In other words, the whole surrogate thing is  a  fudge  for  UTF-16  which  unfortunately
       messes up UTF-8.)
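
       To see why the surrogate values need no code points of their own in UTF-8, it helps to
       look at what a UTF-16 pair stands for. The sketch below uses the standard UTF-16
       arithmetic to combine a pair into one code point in the range U+10000 to U+10FFFF, which
       UTF-8 then encodes directly. The function name is hypothetical and nothing here is part
       of PCRE.

```c
#include <stdint.h>

/* Combine a UTF-16 surrogate pair into the code point it represents.
 * Returns 0 if the arguments are not a valid high/low pair; a valid
 * pair always yields a value of at least 0x10000, so 0 is unambiguous. */
static uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    if (high < 0xD800 || high > 0xDBFF)
        return 0;                        /* not a high surrogate */
    if (low < 0xDC00 || low > 0xDFFF)
        return 0;                        /* not a low surrogate */
    return 0x10000u
         + (((uint32_t)high - 0xD800u) << 10)
         + ((uint32_t)low - 0xDC00u);
}
```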

       If  an  invalid  UTF-8  string  is passed to PCRE, an error return (PCRE_ERROR_BADUTF8) is
       given. In some situations, you may already know that your strings are valid, and therefore
       want   to   skip   these   checks  in  order  to  improve  performance.  If  you  set  the
       PCRE_NO_UTF8_CHECK flag at compile time or at run time, PCRE assumes that the  pattern  or
       subject  it is given (respectively) contains only valid UTF-8 codes. In this case, it does
       not diagnose an invalid UTF-8 string.

       If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, what  happens  depends
       on why the string is invalid. If the string conforms to the "old" definition of UTF-8 (RFC
       2279), it is processed as a string of characters in the range 0 to  0x7FFFFFFF.  In  other
       words,  apart  from  the  initial validity test, PCRE (when in UTF-8 mode) handles strings
       according to the more liberal rules of RFC 2279. However, if the string does not even con-
       form to RFC 2279, the result is undefined. Your program may crash.

       If  you  want to process strings of values in the full range 0 to 0x7FFFFFFF, encoded in a
       UTF-8-like manner as per the old RFC, you can set PCRE_NO_UTF8_CHECK to  bypass  the  more
       restrictive  test.  However,  in  this situation, you will have to apply your own validity
       check.
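
       One building block for such a check is knowing how many bytes each value occupies under
       the old rules. The sketch below follows the byte-length table of RFC 2279, which allows
       sequences of up to six bytes. The function name is hypothetical; this is not code from
       PCRE.

```c
#include <stdint.h>

/* Number of bytes used to encode a value under RFC 2279.
 * Returns 0 if the value does not fit in 31 bits. */
static int utf8_length_rfc2279(uint32_t value)
{
    if (value < 0x80)        return 1;
    if (value < 0x800)       return 2;
    if (value < 0x10000)     return 3;
    if (value < 0x200000)    return 4;
    if (value < 0x4000000)   return 5;
    if (value <= 0x7FFFFFFF) return 6;
    return 0;                /* more than 31 bits: not encodable */
}
```

       Values above U+10FFFF need four or more bytes, and these are the sequences that the RFC
       3629 check rejects.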

   General comments about UTF-8 mode

       1. An unbraced hexadecimal escape sequence (such as \xb3) matches a two-byte UTF-8 charac-
       ter if the value is greater than 127.

       2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 characters for values
       greater than \177.

       3. Repeat quantifiers apply to complete UTF-8 characters, not  to  individual  bytes,  for
       example: \x{100}{3}.

       4. The dot metacharacter matches one UTF-8 character instead of a single byte.

       5.  The  escape  sequence \C can be used to match a single byte in UTF-8 mode, but its use
       can lead to some strange effects. This facility is not available in the alternative match-
       ing function, pcre_dfa_exec().

       6.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test characters of
       any code value, but the characters that PCRE recognizes as digits, spaces, or word charac-
       ters  remain the same set as before, all with values less than 256. This remains true even
       when PCRE includes Unicode property support, because to do otherwise would slow down  PCRE
       in  many  common cases. If you really want to test for a wider sense of, say, "digit", you
       must use Unicode property tests such as \p{Nd}. Note that this also applies to \b, because
       it is defined in terms of \w and \W.

       7.  Similarly,  characters that match the POSIX named character classes are all low-valued
       characters.

       8. However, the Perl 5.10 horizontal and vertical whitespace matching escapes (\h, \H, \v,
       and \V) do match all the appropriate Unicode characters.

       9.  Case-insensitive  matching  applies only to characters whose values are less than 128,
       unless PCRE is built with Unicode property support. Even when Unicode property support  is
       available,  PCRE  still uses its own character tables when checking the case of low-valued
       characters, so as not to degrade performance.  The Unicode property  information  is  used
       only  for  characters with higher values. Even when Unicode property support is available,
       PCRE supports case-insensitive matching only when there is a one-to-one mapping between  a
       letter's cases. There are a small number of many-to-one mappings in Unicode; these are not
       supported by PCRE.
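
       Items 1 to 4 above depend on the difference between bytes and characters. The sketch
       below does not call PCRE. It encodes a value as UTF-8, which shows why \xb3 and \777
       become two-byte characters, and it counts characters in a byte string, which is the unit
       that the dot and repeat quantifiers work in. Both function names are hypothetical, and
       the counter assumes its input is already valid UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 (RFC 3629 range only).
 * Returns the number of bytes written to out, or 0 if cp is a
 * surrogate or lies beyond U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                     /* one byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                    /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                        /* surrogates are excluded */
    if (cp < 0x10000) {                  /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Count characters in a valid UTF-8 byte string by counting every
 * byte that is not a continuation byte (10xxxxxx). */
static size_t utf8_char_count(const unsigned char *s, size_t nbytes)
{
    size_t count = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

       With these helpers, the value 0xb3 encodes as the two bytes 0xC2 0xB3, and a subject made
       of three copies of U+0100 is six bytes long but counts as three characters, which is what
       \x{100}{3} matches.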


AUTHOR
       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

       Putting an actual email address here seems to have been a spam magnet, so  I've  taken  it
       away.  If you want to email me, use my two initials, followed by the two digits 10, at the
       domain cam.ac.uk.


REVISION
       Last updated: 11 April 2009
       Copyright (c) 1997-2009 University of Cambridge.

