Home Man
Today's Posts

Linux & Unix Commands - Search Man Pages
Man Page or Keyword Search:
Select Section of Man Page:
Select Man Page Repository:

Linux 2.6 - man page for pcrematching (linux section 3)


       PCRE - Perl-compatible regular expressions


       This document describes the two different algorithms that are available in PCRE for match-
       ing a compiled regular expression against a given subject string. The "standard" algorithm
       is  the	one  provided  by the pcre_exec() function.  This works in the same was as Perl's
       matching function, and provides a Perl-compatible matching operation.

       An alternative algorithm is provided by the pcre_dfa_exec() function; this operates  in	a
       different  way,	and  is not Perl-compatible. It has advantages and disadvantages compared
       with the standard algorithm, and these are described below.

       When there is only one possible way in which a given subject string can match  a  pattern,
       the two algorithms give the same answer. A difference arises, however, when there are mul-
       tiple possibilities. For example, if the pattern


       is matched against the string

	 <something> <something else> <something further>

       there are three possible answers. The standard algorithm finds only one of  them,  whereas
       the alternative algorithm finds all three.


       The  set  of strings that are matched by a regular expression can be represented as a tree
       structure. An unlimited repetition in the pattern makes the tree of infinite size, but  it
       is  still  a  tree.  Matching the pattern to a given subject string (from a given starting
       point) can be thought of as a search of the tree.  There are two ways to  search  a  tree:
       depth-first  and  breadth-first,  and these correspond to the two matching algorithms pro-
       vided by PCRE.


       In the terminology of Jeffrey Friedl's book "Mastering Regular Expressions", the  standard
       algorithm  is  an  "NFA	algorithm". It conducts a depth-first search of the pattern tree.
       That is, it proceeds along a single path through  the  tree,  checking  that  the  subject
       matches	what  is required. When there is a mismatch, the algorithm tries any alternatives
       at the current point, and if they all fail, it backs up to the previous	branch	point  in
       the tree, and tries the next alternative branch at that level. This often involves backing
       up (moving to the left) in the subject string as  well.	The  order  in	which  repetition
       branches are tried is controlled by the greedy or ungreedy nature of the quantifier.

       If  a  leaf node is reached, a matching string has been found, and at that point the algo-
       rithm stops. Thus, if there is more than one possible match, this  algorithm  returns  the
       first  one  that it finds. Whether this is the shortest, the longest, or some intermediate
       length depends on the way the greedy and ungreedy repetition quantifiers are specified  in
       the pattern.

       Because	it  ends up with a single path through the tree, it is relatively straightforward
       for this algorithm to keep track of the substrings that are matched  by	portions  of  the
       pattern	in  parentheses.  This provides support for capturing parentheses and back refer-


       This algorithm conducts a breadth-first search of the tree. Starting from the first match-
       ing  point in the subject, it scans the subject string from left to right, once, character
       by character, and as it does this, it remembers all the paths through the tree that repre-
       sent  valid matches. In Friedl's terminology, this is a kind of "DFA algorithm", though it
       is not implemented as a traditional finite state machine (it keeps multiple states  active

       Although  the  general  principle  of this matching algorithm is that it scans the subject
       string only once, without backtracking, there is one exception: when a  lookaround  asser-
       tion  is  encountered,  the characters following or preceding the current point have to be
       independently inspected.

       The scan continues until either the end of the subject is reached, or there  are  no  more
       unterminated  paths. At this point, terminated paths represent the different matching pos-
       sibilities (if there are none, the match has failed).  Thus, if there  is  more	than  one
       possible match, this algorithm finds all of them, and in particular, it finds the longest.
       The matches are returned in decreasing order of length. There is an  option  to	stop  the
       algorithm after the first match (which is necessarily the shortest) is found.

       Note  that  all	the matches that are found start at the same point in the subject. If the


       is matched against the string "the caterpillar catchment", the result will  be  the  three
       strings	"caterpillar",	"cater",  and "cat" that start at the fifth character of the sub-
       ject. The algorithm does not automatically move on to find matches  that  start	at  later

       There  are  a number of features of PCRE regular expressions that are not supported by the
       alternative matching algorithm. They are as follows:

       1. Because the algorithm finds all possible matches, the greedy or ungreedy nature of rep-
       etition	quantifiers  is  not  relevant.  Greedy  and  ungreedy quantifiers are treated in
       exactly the same way. However, possessive quantifiers can make a difference when what fol-
       lows could also match what is quantified, for example in a pattern like this:


       This  pattern  matches  "aaab!" but not "aaa!", which would be matched by a non-possessive
       quantifier. Similarly, if an atomic group is present, it is matched as if it were a stand-
       alone pattern at the current point, and the longest match is then "locked in" for the rest
       of the overall pattern.

       2. When dealing with multiple paths through the tree simultaneously, it is  not	straight-
       forward to keep track of captured substrings for the different matching possibilities, and
       PCRE's implementation of this algorithm does not attempt to do this. This  means  that  no
       captured substrings are available.

       3.  Because  no	substrings  are captured, back references within the pattern are not sup-
       ported, and cause errors if encountered.

       4. For the same reason, conditional expressions that use a backreference as the	condition
       or test for a specific group recursion are not supported.

       5. Because many paths through the tree may be active, the \K escape sequence, which resets
       the start of the match when encountered (but may be on some paths and not on  others),  is
       not supported. It causes an error if encountered.

       6.  Callouts  are  supported,  but the value of the capture_top field is always 1, and the
       value of the capture_last field is always -1.

       7. The \C escape sequence, which (in the standard algorithm) matches a single  byte,  even
       in  UTF-8  mode, is not supported because the alternative algorithm moves through the sub-
       ject string one character at a time, for all active paths through the tree.

       8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not  supported.
       (*FAIL) is supported, and behaves like a failing negative assertion.


       Using the alternative matching algorithm provides the following advantages:

       1. All possible matches (at a single point in the subject) are automatically found, and in
       particular, the longest match is found. To find more than one  match  using  the  standard
       algorithm, you have to do kludgy things with callouts.

       2.  Because  the alternative algorithm scans the subject string just once, and never needs
       to backtrack, it is possible to pass very long subject strings to the matching function in
       several	pieces,  checking  for	partial matching each time. Although it is possible to do
       multi-segment matching using the standard algorithm (pcre_exec()), by retaining	partially
       matched substrings, it is more complicated. The pcrepartial documentation gives details of
       partial matching and discusses multi-segment matching.


       The alternative algorithm suffers from a number of disadvantages:

       1. It is substantially slower than the standard algorithm. This is partly because  it  has
       to  search  for	all possible matches, but is also because it is less susceptible to opti-

       2. Capturing parentheses and back references are not supported.

       3. Although atomic groups are supported, their use does not provide the performance advan-
       tage that it does for the standard algorithm.


       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


       Last updated: 17 November 2010
       Copyright (c) 1997-2010 University of Cambridge.


All times are GMT -4. The time now is 06:11 PM.

Unix & Linux Forums Content Copyrightę1993-2018. All Rights Reserved.
Show Password