PERLREAPI(1)           Perl Programmers Reference Guide           PERLREAPI(1)

NAME
       perlreapi - Perl regular expression plugin interface

DESCRIPTION
       As of Perl 5.9.5 there is a new interface for plugging and using
       regular expression engines other than the default one.

       Each engine is supposed to provide access to a constant structure of
       the following format:

           typedef struct regexp_engine {
               REGEXP* (*comp) (pTHX_
                                const SV * const pattern, const U32 flags);
               I32     (*exec) (pTHX_
                                REGEXP * const rx,
                                char* stringarg,
                                char* strend, char* strbeg,
                                I32 minend, SV* screamer,
                                void* data, U32 flags);
               char*   (*intuit) (pTHX_
                                  REGEXP * const rx, SV *sv,
                                  char *strpos, char *strend, U32 flags,
                                  struct re_scream_pos_data_s *data);
               SV*     (*checkstr) (pTHX_ REGEXP * const rx);
               void    (*free) (pTHX_ REGEXP * const rx);
               void    (*numbered_buff_FETCH) (pTHX_
                                               REGEXP * const rx,
                                               const I32 paren,
                                               SV * const sv);
               void    (*numbered_buff_STORE) (pTHX_
                                               REGEXP * const rx,
                                               const I32 paren,
                                               SV const * const value);
               I32     (*numbered_buff_LENGTH) (pTHX_
                                                REGEXP * const rx,
                                                const SV * const sv,
                                                const I32 paren);
               SV*     (*named_buff) (pTHX_
                                      REGEXP * const rx,
                                      SV * const key,
                                      SV * const value,
                                      U32 flags);
               SV*     (*named_buff_iter) (pTHX_
                                           REGEXP * const rx,
                                           const SV * const lastkey,
                                           const U32 flags);
               SV*     (*qr_package)(pTHX_ REGEXP * const rx);
           #ifdef USE_ITHREADS
               void*   (*dupe) (pTHX_ REGEXP * const rx, CLONE_PARAMS *param);
           #endif
               REGEXP* (*op_comp) (...);
           } regexp_engine;

       When a regexp is compiled, its "engine" field is then set to point at
       the appropriate structure, so that when it needs to be used Perl can
       find the right routines to do so.

       In order to install a new regexp handler, $^H{regcomp} is set to an
       integer which (when cast appropriately) resolves to one of these
       structures.  When compiling, the "comp" method is executed, and the
       resulting "regexp" structure's engine field is expected to point back
       at the same structure.
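       The back-pointer and integer-handle arrangement described above can
       be sketched without the Perl headers.  The following is a minimal
       model, not the real API: mock_engine, mock_regexp, engine_to_handle
       and handle_to_engine are hypothetical names, the callbacks omit
       pTHX_ and most of their arguments, and intptr_t merely stands in for
       the integer stored in $^H{regcomp}.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for REGEXP and regexp_engine. */
typedef struct mock_regexp mock_regexp;

typedef struct mock_engine {
    mock_regexp *(*comp)(const char *pattern, uint32_t flags);
    void         (*free)(mock_regexp *rx);
} mock_engine;

struct mock_regexp {
    const mock_engine *engine;   /* points back at the compiling engine */
    const char        *pattern;  /* borrowed, not copied, in this sketch */
    uint32_t           extflags;
};

static mock_regexp *my_comp(const char *pattern, uint32_t flags);
static void         my_free(mock_regexp *rx);

/* The constant structure that the engine exposes. */
static const mock_engine my_engine = { my_comp, my_free };

static mock_regexp *my_comp(const char *pattern, uint32_t flags)
{
    mock_regexp *rx = malloc(sizeof *rx);
    if (rx == NULL)
        return NULL;
    rx->engine   = &my_engine;   /* engine field points back at the table */
    rx->pattern  = pattern;
    rx->extflags = flags;
    return rx;
}

static void my_free(mock_regexp *rx)
{
    assert(rx != NULL);
    free(rx);
}

/* Model of storing the engine as an integer and casting it back. */
static intptr_t engine_to_handle(const mock_engine *e)
{
    return (intptr_t)e;
}

static const mock_engine *handle_to_engine(intptr_t handle)
{
    return (const mock_engine *)handle;
}
```

       In this model a caller recovers the table with handle_to_engine(),
       calls its comp member, and later dispatches through rx->engine
       rather than naming the engine directly.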
       The pTHX_ symbol in the definition is a macro used by Perl under
       threading to provide an extra argument to the routine holding a
       pointer back to the interpreter that is executing the regexp.  So
       under threading all routines get an extra argument.

Callbacks
   comp
           REGEXP* comp(pTHX_ const SV * const pattern, const U32 flags);

       Compile the pattern stored in "pattern" using the given "flags" and
       return a pointer to a prepared "REGEXP" structure that can perform
       the match.  See "The REGEXP structure" below for an explanation of
       the individual fields in the REGEXP struct.

       The "pattern" parameter is the scalar that was used as the pattern.
       Previous versions of Perl would pass two "char*" indicating the start
       and end of the stringified pattern; the following snippet can be used
       to get the old parameters:

           STRLEN plen;
           char*  exp = SvPV(pattern, plen);
           char* xend = exp + plen;

       Since any scalar can be passed as a pattern, it's possible to
       implement an engine that does something with an array
       ("ook" =~ [ qw/ eek hlagh / ]) or with the non-stringified form of a
       compiled regular expression ("ook" =~ qr/eek/).  Perl's own engine
       will always stringify everything using the snippet above, but that
       doesn't mean other engines have to.

       The "flags" parameter is a bitfield which indicates which of the
       "msixp" flags the regex was compiled with.  It also contains
       additional info, such as whether "use locale" is in effect.

       The "eogc" flags are stripped out before being passed to the comp
       routine.  The regex engine does not need to know if any of these are
       set, as those flags should only affect what Perl does with the
       pattern and its match variables, not how it gets compiled and
       executed.

       By the time the comp callback is called, some of these flags have
       already had effect (noted below where applicable).  However most of
       their effect occurs after the comp callback has run, in routines that
       read the "rx->extflags" field which it populates.
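       To illustrate the bitfield handling described above, the sketch
       below copies a flags word into an extflags field and decodes the
       pattern modifiers from it.  It is a standalone model: the MOCK_PMf_*
       bit values, mock_regexp, mock_comp_store_flags and mock_modifiers are
       hypothetical and are not Perl's real RXf_PMf_* constants or API; a
       real engine would use the definitions from Perl's headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. */
#define MOCK_PMf_MULTILINE  (UINT32_C(1) << 0)  /* /m */
#define MOCK_PMf_SINGLELINE (UINT32_C(1) << 1)  /* /s */
#define MOCK_PMf_FOLD       (UINT32_C(1) << 2)  /* /i */
#define MOCK_PMf_EXTENDED   (UINT32_C(1) << 3)  /* /x */
#define MOCK_PMf_KEEPCOPY   (UINT32_C(1) << 4)  /* /p */

/* Stand-in for the part of the regexp structure that holds extflags. */
typedef struct mock_regexp {
    uint32_t extflags;
} mock_regexp;

/* What a comp callback might do with its flags argument: keep them
 * in extflags so that later code can consult them. */
static void mock_comp_store_flags(mock_regexp *rx, uint32_t flags)
{
    assert(rx != NULL);
    rx->extflags = flags;
}

/* Write the letters of the modifiers that are set into buf, in
 * "msixp" order.  buf must have room for 5 letters plus the NUL. */
static const char *mock_modifiers(const mock_regexp *rx, char buf[6])
{
    static const struct { uint32_t bit; char letter; } table[] = {
        { MOCK_PMf_MULTILINE,  'm' },
        { MOCK_PMf_SINGLELINE, 's' },
        { MOCK_PMf_FOLD,       'i' },
        { MOCK_PMf_EXTENDED,   'x' },
        { MOCK_PMf_KEEPCOPY,   'p' },
    };
    size_t n = 0;

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (rx->extflags & table[i].bit)
            buf[n++] = table[i].letter;
    }
    buf[n] = '\0';
    return buf;
}
```

       Testing individual bits with "&", as mock_modifiers does, is the
       same idiom the flag descriptions in this document rely on.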
       In general the flags should be preserved in "rx->extflags" after
       compilation, although the regex engine might want to add or delete
       some of them to invoke or disable some special behavior in Perl.  The
       flags along with any special behavior they cause are documented
       below:

       The pattern modifiers:

       "/m" - RXf_PMf_MULTILINE
           If this is in "rx->extflags" it will be passed to
           "Perl_fbm_instr" by "pp_split" which will treat the subject
           string as a multi-line string.

       "/s" - RXf_PMf_SINGLELINE
       "/i" - RXf_PMf_FOLD
       "/x" - RXf_PMf_EXTENDED
           If present on a regex, "#" comments will be handled differently
           by the tokenizer in some cases.

           TODO: Document those cases.

       "/p" - RXf_PMf_KEEPCOPY
           TODO: Document this

       Character set
           The character set semantics are determined by an enum that is
           contained in this field.  This is still experimental and subject
           to change, but the current interface returns the rules by use of
           the in-line function "get_regex_charset(const U32 flags)".  The
           only currently documented value returned from it is
           REGEX_LOCALE_CHARSET, which is set if "use locale" is in effect.
           If present in "rx->extflags", "split" will use the locale
           dependent definition of whitespace when RXf_SKIPWHITE or
           RXf_WHITE is in effect.  ASCII whitespace is defined as per
           isSPACE, and by the internal macros "is_utf8_space" under UTF-8,
           and "isSPACE_LC" under "use locale".

       Additional flags:

       RXf_SPLIT
           This flag was removed in perl 5.18.0.  "split ' '" is now
           special-cased solely in the parser.  RXf_SPLIT is still
           #defined, so you can test for it.  This is how it used to work:

           If "split" is invoked as "split ' '" or with no arguments (which
           really means "split(' ', $_)", see split), Perl will set this
           flag.  The regex engine can then check for it and set the
           SKIPWHITE and WHITE extflags.
           To do this, the Perl engine does:

               if (flags & RXf_SPLIT && r->prelen == 1 && r->precomp[0] == ' ')
                   r->extflags |= (RXf_SKIPWHITE|RXf_WHITE);

       These flags can be set during compilation to enable optimizations in
       the "split" operator.

       RXf_SKIPWHITE
           This flag was removed in perl 5.18.0.  It is still #defined, so
           you can set it, but doing so will have no effect.  This is how it
           used to work:

           If the flag is present in "rx->extflags", "split" will delete
           whitespace from the start of the subject string before it's
           operated on.  What is considered whitespace depends on whether
           the subject is a UTF-8 string and whether the "RXf_PMf_LOCALE"
           flag is set.

           If RXf_WHITE is set in addition to this flag, "split" will
           behave like "split ' '" under the Perl engine.

       RXf_START_ONLY
           Tells the split operator to split the target string on newlines
           ("\n") without invoking the regex engine.

           Perl's engine sets this if the pattern is "/^/"
           ("plen == 1 && *exp == '^'"), even under "/^/s"; see split.  Of
           course a different regex engine might want to use the same
           optimizations with a different syntax.

       RXf_WHITE
           Tells the split operator to split the target string on
           whitespace without invoking the regex engine.  The definition of
           whitespace varies depending on whether the target string is a
           UTF-8 string and on whether RXf_PMf_LOCALE is set.

           Perl's engine sets this flag if the pattern is "\s+".

       RXf_NULL
           Tells the split operator to split the target string on
           characters.  The definition of character varies depending on
           whether the target string is a UTF-8 string.

           Perl's engine sets this flag on empty patterns; this
           optimization makes "split //" much faster than it would
           otherwise be.  It's even faster than "unpack".

       RXf_NO_INPLACE_SUBST
           Added in perl 5.18.0, this flag indicates that a regular
           expression might perform an operation that would interfere with
           inplace substitution.  For instance it might contain lookbehind,
           or assign to non-magical variables (such as $REGMARK and
           $REGERROR) during matching.
"s///" will skip certain optimisations when this is set. exec I32 exec(pTHX_ REGEXP * const rx, char *stringarg, char* strend, char* strbeg, I32 minend, SV* screamer, void* data, U32 flags); Execute a regexp. The arguments are rx The regular expression to execute. screamer This strangely-named arg is the SV to be matched against. Note that the actual char array to be matched against is supplied by the arguments described below; the SV is just used to determine UTF8ness, "pos()" etc. strbeg Pointer to the physical start of the string. strend Pointer to the character following the physical end of the string (i.e. the "