PERLREAPI(1) Perl Programmers Reference Guide PERLREAPI(1)
NAME
perlreapi - Perl regular expression plugin interface
DESCRIPTION
As of Perl 5.9.5 there is a new interface for plugging in and using regular expression engines other than the default one.
Each engine is supposed to provide access to a constant structure of the following format:
typedef struct regexp_engine {
    REGEXP* (*comp) (pTHX_
                     const SV * const pattern, const U32 flags);
    I32     (*exec) (pTHX_
                     REGEXP * const rx,
                     char* stringarg,
                     char* strend, char* strbeg,
                     I32 minend, SV* screamer,
                     void* data, U32 flags);
    char*   (*intuit) (pTHX_
                       REGEXP * const rx, SV *sv,
                       char *strpos, char *strend, U32 flags,
                       struct re_scream_pos_data_s *data);
    SV*     (*checkstr) (pTHX_ REGEXP * const rx);
    void    (*free) (pTHX_ REGEXP * const rx);
    void    (*numbered_buff_FETCH) (pTHX_
                                    REGEXP * const rx,
                                    const I32 paren,
                                    SV * const sv);
    void    (*numbered_buff_STORE) (pTHX_
                                    REGEXP * const rx,
                                    const I32 paren,
                                    SV const * const value);
    I32     (*numbered_buff_LENGTH) (pTHX_
                                     REGEXP * const rx,
                                     const SV * const sv,
                                     const I32 paren);
    SV*     (*named_buff) (pTHX_
                           REGEXP * const rx,
                           SV * const key,
                           SV * const value,
                           U32 flags);
    SV*     (*named_buff_iter) (pTHX_
                                REGEXP * const rx,
                                const SV * const lastkey,
                                const U32 flags);
    SV*     (*qr_package)(pTHX_ REGEXP * const rx);
#ifdef USE_ITHREADS
    void*   (*dupe) (pTHX_ REGEXP * const rx, CLONE_PARAMS *param);
#endif
    REGEXP* (*op_comp) (...);
} regexp_engine;
When a regexp is compiled, its "engine" field is then set to point at the appropriate structure, so that when it needs to be used Perl can
find the right routines to do so.
In order to install a new regexp handler, $^H{regcomp} is set to an integer which (when cast appropriately) resolves to one of these
structures. When compiling, the "comp" method is executed, and the resulting "regexp" structure's engine field is expected to point back
at the same structure.
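The back-pointer contract described above can be sketched in isolation. This is a self-contained model rather than Perl's real API: every demo_ name is a hypothetical stand-in (for REGEXP, regexp_engine and the callbacks), and the pTHX_ context argument is left out for brevity.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for Perl's REGEXP and regexp_engine. */
struct demo_engine;

typedef struct demo_regexp {
    const struct demo_engine *engine;  /* engine that compiled this regexp */
    const char *precomp;               /* pattern text (borrowed, not copied) */
} demo_regexp;

typedef struct demo_engine {
    demo_regexp *(*comp)(const char *pattern);
    void         (*free)(demo_regexp *rx);
} demo_engine;

demo_regexp *demo_comp(const char *pattern);
void demo_free(demo_regexp *rx);

/* One constant structure per engine. */
const demo_engine demo_engine_vtable = { demo_comp, demo_free };

/* comp: build the regexp and make it point back at its engine. */
demo_regexp *demo_comp(const char *pattern)
{
    demo_regexp *rx;

    assert(pattern != NULL);
    rx = malloc(sizeof *rx);
    if (rx == NULL)
        return NULL;
    rx->engine  = &demo_engine_vtable;
    rx->precomp = pattern;
    return rx;
}

void demo_free(demo_regexp *rx)
{
    free(rx);
}

/* A caller finds the right routine through the stored engine pointer. */
void demo_release(demo_regexp *rx)
{
    if (rx != NULL)
        rx->engine->free(rx);
}
```

Because every later operation is dispatched through rx->engine, a comp routine that left the field unset, or pointed it at another engine's structure, would send those calls to the wrong routines.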
The pTHX_ symbol in the definition is a macro used by Perl under threading to provide an extra argument to the routine holding a pointer
back to the interpreter that is executing the regexp. So under threading all routines get an extra argument.
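As a rough illustration of the extra argument, the sketch below imitates the mechanism with hypothetical DEMO_ macros. It is not taken from Perl's headers: demo_interp, DEMO_THREADED and the function names are assumptions made for this example, and DEMO_aTHX_ plays the role of Perl's argument-side counterpart to pTHX_.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Perl's interpreter structure. */
typedef struct demo_interp { int id; } demo_interp;

/* Remove this line to model a build without threading. */
#define DEMO_THREADED 1

#ifdef DEMO_THREADED
#  define DEMO_pTHX_          demo_interp *my_perl,
#  define DEMO_aTHX_          my_perl,
#  define DEMO_UNUSED_CONTEXT ((void)my_perl)
#else
#  define DEMO_pTHX_
#  define DEMO_aTHX_
#  define DEMO_UNUSED_CONTEXT ((void)0)
#endif

/* Callee: under threading it has a hidden first parameter. */
size_t demo_inner(DEMO_pTHX_ const char *s)
{
    size_t n = 0;

    DEMO_UNUSED_CONTEXT;
    assert(s != NULL);
    while (s[n] != '\0')
        n++;
    return n;
}

/* Caller: forwards its own context using the argument-side macro. */
size_t demo_outer(DEMO_pTHX_ const char *s)
{
    return demo_inner(DEMO_aTHX_ s);
}
```

The same source compiles in both configurations; only the macro definitions decide whether the interpreter pointer appears in the parameter and argument lists.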
Callbacks
comp
REGEXP* comp(pTHX_ const SV * const pattern, const U32 flags);
Compile the pattern stored in "pattern" using the given "flags" and return a pointer to a prepared "REGEXP" structure that can perform the
match. See "The REGEXP structure" below for an explanation of the individual fields in the REGEXP struct.
The "pattern" parameter is the scalar that was used as the pattern. Previous versions of Perl would pass two "char*" indicating the start
and end of the stringified pattern; the following snippet can be used to get the old parameters:
STRLEN plen;
char* exp = SvPV(pattern, plen);
char* xend = exp + plen;
Since any scalar can be passed as a pattern, it's possible to implement an engine that does something with an array
("ook" =~ [ qw/ eek hlagh / ]) or with the non-stringified form of a compiled regular expression ("ook" =~ qr/eek/). Perl's own engine
will always stringify everything using the snippet above, but that doesn't mean other engines have to.
The "flags" parameter is a bitfield which indicates which of the "msixp" flags the regex was compiled with. It also contains additional
info, such as whether "use locale" is in effect.
The "eogc" flags are stripped out before being passed to the comp routine. The regex engine does not need to know if any of these are set,
as those flags should only affect what Perl does with the pattern and its match variables, not how it gets compiled and executed.
By the time the comp callback is called, some of these flags have already taken effect (noted below where applicable). However, most of
their effect occurs after the comp callback has run, in routines that read the "rx->extflags" field which it populates.
In general the flags should be preserved in "rx->extflags" after compilation, although the regex engine might want to add or delete some of
them to invoke or disable some special behavior in Perl. The flags along with any special behavior they cause are documented below:
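A minimal sketch of that bookkeeping follows. The bit values and all demo_/DEMO_ names are hypothetical and chosen only for this example; the real RXf_* constants are defined in Perl's headers and are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t demo_u32;   /* stand-in for Perl's U32 */

/* Hypothetical bit assignments, for illustration only. */
#define DEMO_PMf_MULTILINE  ((demo_u32)1 << 0)
#define DEMO_PMf_FOLD       ((demo_u32)1 << 1)
#define DEMO_START_ONLY     ((demo_u32)1 << 8)

/*
 * Compute the value an engine might store in its extflags field:
 * keep the flags handed to comp, set the bits in "add" and clear
 * the bits in "drop".  The two masks must not overlap.
 */
demo_u32 demo_make_extflags(demo_u32 compile_flags,
                            demo_u32 add, demo_u32 drop)
{
    assert((add & drop) == 0);
    return (compile_flags | add) & ~drop;
}
```

Passing 0 for both masks gives the default behaviour of preserving the compile flags unchanged.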
The pattern modifiers:
"/m" - RXf_PMf_MULTILINE
If this is in "rx->extflags", it will be passed to "Perl_fbm_instr" by "pp_split", which will treat the subject string as a multi-line
string.
"/s" - RXf_PMf_SINGLELINE
"/i" - RXf_PMf_FOLD
"/x" - RXf_PMf_EXTENDED
If present on a regex, "#" comments will be handled differently by the tokenizer in some cases.
TODO: Document those cases.
"/p" - RXf_PMf_KEEPCOPY
TODO: Document this
Character set
The character set semantics are determined by an enum that is contained in this field. This is still experimental and subject to
change, but the current interface returns the rules by use of the in-line function "get_regex_charset(const U32 flags)". The only
currently documented value returned from it is REGEX_LOCALE_CHARSET, which is set if "use locale" is in effect. If present in
"rx->extflags", "split" will use the locale dependent definition of whitespace when RXf_SKIPWHITE or RXf_WHITE is in effect. ASCII
whitespace is defined as per isSPACE, and by the internal macros "is_utf8_space" under UTF-8, and "isSPACE_LC" under "use locale".
Additional flags:
RXf_SPLIT
This flag was removed in perl 5.18.0. "split ' '" is now special-cased solely in the parser. RXf_SPLIT is still #defined, so you can
test for it. This is how it used to work:
If "split" is invoked as "split ' '" or with no arguments (which really means "split(' ', $_)", see split), Perl will set this flag.
The regex engine can then check for it and set the SKIPWHITE and WHITE extflags. To do this, the Perl engine does:
if (flags & RXf_SPLIT && r->prelen == 1 && r->precomp[0] == ' ')
    r->extflags |= (RXf_SKIPWHITE|RXf_WHITE);
The flags below can be set during compilation to enable optimizations in the "split" operator.
RXf_SKIPWHITE
This flag was removed in perl 5.18.0. It is still #defined, so you can set it, but doing so will have no effect. This is how it used
to work:
If the flag is present in "rx->extflags", "split" will delete whitespace from the start of the subject string before it's operated on.
What is considered whitespace depends on whether the subject is a UTF-8 string and whether the "RXf_PMf_LOCALE" flag is set.
If RXf_WHITE is set in addition to this flag, "split" will behave like "split ' '" under the Perl engine.
RXf_START_ONLY
Tells the split operator to split the target string on newlines ("\n") without invoking the regex engine.
Perl's engine sets this if the pattern is "/^/" ("plen == 1 && *exp == '^'"), even under "/^/s"; see split. Of course a different
regex engine might want to use the same optimizations with a different syntax.
RXf_WHITE
Tells the split operator to split the target string on whitespace without invoking the regex engine. The definition of whitespace
varies depending on whether the target string is a UTF-8 string and on whether RXf_PMf_LOCALE is set.
Perl's engine sets this flag if the pattern is "\s+".
RXf_NULL
Tells the split operator to split the target string on characters. The definition of character varies depending on whether the target
string is a UTF-8 string.
Perl's engine sets this flag on empty patterns; this optimization makes "split //" much faster than it would otherwise be. It's even
faster than "unpack".
RXf_NO_INPLACE_SUBST
Added in perl 5.18.0, this flag indicates that a regular expression might perform an operation that would interfere with in-place
substitution. For instance, it might contain lookbehind, or assign to non-magical variables (such as $REGMARK and $REGERROR) during
matching. "s///" will skip certain optimisations when this is set.
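The split-related flags above are tied to specific pattern texts ("^", "\s+" and the empty pattern). The sketch below shows how a comp routine might pick the matching hint from the stringified pattern. It is a simplification under stated assumptions: the DEMO_ bit values and the function name are hypothetical, plain string comparison is used throughout, and a different engine is free to recognise other syntax.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t demo_u32;   /* stand-in for Perl's U32 */

/* Hypothetical bit assignments standing in for the RXf_* flags. */
#define DEMO_START_ONLY  ((demo_u32)1 << 0)
#define DEMO_WHITE       ((demo_u32)1 << 1)
#define DEMO_NULL        ((demo_u32)1 << 2)

/*
 * Return the split-optimization bit for the pattern exp[0..plen-1],
 * or 0 when the regex engine has to be invoked.
 */
demo_u32 demo_split_hint(const char *exp, size_t plen)
{
    assert(exp != NULL);

    if (plen == 0)
        return DEMO_NULL;                        /* split //    */
    if (plen == 1 && *exp == '^')
        return DEMO_START_ONLY;                  /* split /^/   */
    if (plen == 3 && memcmp(exp, "\\s+", 3) == 0)
        return DEMO_WHITE;                       /* split /\s+/ */
    return 0;
}
```

An engine would OR the returned bit into the flags it stores in its extflags field; a result of 0 leaves "split" on the general path.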
exec
I32 exec(pTHX_ REGEXP * const rx,
char *stringarg, char* strend, char* strbeg,
I32 minend, SV* screamer,
void* data, U32 flags);
Execute a regexp. The arguments are
rx The regular expression to execute.
screamer
This strangely-named arg is the SV to be matched against. Note that the actual char array to be matched against is supplied by the
arguments described below; the SV is just used to determine UTF8ness, "pos()" etc.
strbeg
Pointer to the physical start of the string.
strend
Pointer to the character following the physical end of the string (i.e. the "