perlreapi - Perl regular expression plugin interface
As of Perl 5.9.5 there is a new interface for plugging and using regular expression engines other than the default one.
Each engine is supposed to provide access to a constant structure of the following format:
    typedef struct regexp_engine {
        REGEXP* (*comp) (pTHX_
                         const SV * const pattern, const U32 flags);
        I32     (*exec) (pTHX_
                         REGEXP * const rx,
                         char* stringarg,
                         char* strend, char* strbeg,
                         SSize_t minend, SV* sv,
                         void* data, U32 flags);
        char*   (*intuit) (pTHX_
                           REGEXP * const rx, SV *sv,
                           const char * const strbeg,
                           char *strpos, char *strend, U32 flags,
                           struct re_scream_pos_data_s *data);
        SV*     (*checkstr) (pTHX_ REGEXP * const rx);
        void    (*free) (pTHX_ REGEXP * const rx);
        void    (*numbered_buff_FETCH) (pTHX_
                                        REGEXP * const rx,
                                        const I32 paren,
                                        SV * const sv);
        void    (*numbered_buff_STORE) (pTHX_
                                        REGEXP * const rx,
                                        const I32 paren,
                                        SV const * const value);
        I32     (*numbered_buff_LENGTH) (pTHX_
                                         REGEXP * const rx,
                                         const SV * const sv,
                                         const I32 paren);
        SV*     (*named_buff) (pTHX_
                               REGEXP * const rx,
                               SV * const key,
                               SV * const value,
                               U32 flags);
        SV*     (*named_buff_iter) (pTHX_
                                    REGEXP * const rx,
                                    const SV * const lastkey,
                                    const U32 flags);
        SV*     (*qr_package)(pTHX_ REGEXP * const rx);
    #ifdef USE_ITHREADS
        void*   (*dupe) (pTHX_ REGEXP * const rx, CLONE_PARAMS *param);
    #endif
        REGEXP* (*op_comp) (...);
    } regexp_engine;
When a regexp is compiled, its engine field is then set to point at the appropriate structure, so that when it needs to be used Perl can find the right routines to do so.
In order to install a new regexp handler, $^H{regcomp} is set to an integer which (when cast appropriately) resolves to one of these structures. When compiling, the comp method is executed, and the resulting regexp structure's engine field is expected to point back at the same structure.
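The relationship between the engine structure, the integer handle, and the compiled regexp's engine field can be sketched in plain C. This is a minimal illustration only: toy_regexp, toy_engine, my_engine and every toy_* function are hypothetical stand-ins invented for this sketch, not the real REGEXP or regexp_engine types, and the "match" is a simple string comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; NOT the real Perl types. */
struct toy_engine;

typedef struct toy_regexp {
    const struct toy_engine *engine; /* set by comp, consulted on later calls */
    const char *pattern;
} toy_regexp;

typedef struct toy_engine {
    int (*comp)(toy_regexp *rx, const char *pattern);
    int (*exec)(const toy_regexp *rx, const char *subject);
} toy_engine;

static int toy_comp(toy_regexp *rx, const char *pattern);
static int toy_exec(const toy_regexp *rx, const char *subject);

/* The constant structure that the engine provides access to. */
static const toy_engine my_engine = { toy_comp, toy_exec };

static int toy_comp(toy_regexp *rx, const char *pattern)
{
    rx->pattern = pattern;
    rx->engine = &my_engine; /* point back at the structure that compiled it */
    return 0;
}

static int toy_exec(const toy_regexp *rx, const char *subject)
{
    /* Toy "match": succeed only when the subject equals the pattern. */
    return strcmp(rx->pattern, subject) == 0;
}

/* Mimics storing the engine's address as an integer, as the text
   describes for $^H{regcomp}. */
static uintptr_t toy_engine_handle(void)
{
    return (uintptr_t)&my_engine;
}

/* Mimics casting the integer back to an engine and calling its comp. */
static int toy_compile_via_handle(uintptr_t handle, toy_regexp *rx,
                                  const char *pattern)
{
    const toy_engine *eng = (const toy_engine *)handle;
    return eng->comp(rx, pattern);
}

/* Later use goes through rx->engine, so the right routines are found. */
static int toy_match(const toy_regexp *rx, const char *subject)
{
    return rx->engine->exec(rx, subject);
}
```

Because every later call is dispatched through rx->engine, a comp routine that left the field pointing elsewhere would cause a different engine's routines to be run on its regexp, which is why the field is expected to point back at the same structure.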
The pTHX_ symbol in the definition is a macro used by Perl under threading to pass each routine an extra argument that holds a pointer back to the interpreter that is executing the regexp. So under threading all routines get an extra argument.
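The effect of such a macro can be imitated in a few lines of standalone C. The names below (toy_interp, TOY_THREADED, toy_pTHX_, toy_aTHX_, toy_len, toy_len_twice) are hypothetical and exist only for this sketch; Perl's own definitions live in its headers and are more involved.

```c
#include <assert.h>

/* Hypothetical stand-in for the interpreter structure. */
typedef struct toy_interp {
    int calls; /* counts how many routines were entered */
} toy_interp;

/* Set to 0 to see the non-threaded expansion, where the macros vanish. */
#define TOY_THREADED 1

#if TOY_THREADED
/* Extra leading parameter in declarations, and the matching argument. */
#  define toy_pTHX_ toy_interp *my_perl,
#  define toy_aTHX_ my_perl,
#else
#  define toy_pTHX_
#  define toy_aTHX_
#endif

/* Declared like an engine callback: the context macro comes first. */
static int toy_len(toy_pTHX_ const char *s)
{
    int n = 0;
#if TOY_THREADED
    my_perl->calls++; /* the interpreter pointer is available here */
#endif
    while (s[n] != '\0')
        n++;
    return n;
}

/* A routine that receives the context forwards it with toy_aTHX_. */
static int toy_len_twice(toy_pTHX_ const char *s)
{
    return toy_len(toy_aTHX_ s) + toy_len(toy_aTHX_ s);
}
```

The trailing underscore in the macro name stands for the comma, which is why no comma is written between the macro and the first real parameter.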
REGEXP* comp(pTHX_ const SV * const pattern, const U32 flags);
Compile the pattern stored in pattern using the given flags and return a pointer to a prepared REGEXP structure that can perform the match. See "The REGEXP structure" below for an explanation of the individual fields in the REGEXP struct.
The pattern parameter is the scalar that was used as the pattern. Previous versions of Perl would pass two char* pointers indicating the start and end of the stringified pattern; the following snippet can be used to get the old parameters:
    STRLEN plen;
    char*  exp = SvPV(pattern, plen);
    char* xend = exp + plen;
Since any scalar can be passed as a pattern, it's possible to implement an engine that does something with an array ("ook" =~ [ qw/ eek hlagh / ]) or with the non-stringified form of a compiled regular expression ("ook" =~ qr/eek/). Perl's own engine will always stringify everything using the snippet above, but that doesn't mean other engines have to.
The flags parameter is a bitfield which indicates which of the msixpn flags the regex was compiled with. It also contains additional information, such as whether use locale is in effect.
The eogc flags are stripped out before being passed to the comp routine. The regex engine does not need to know if any of these are set, as those flags should only affect what Perl does with the pattern and its match variables, not how it gets compiled and executed.
By the time the comp callback is called, some of these flags have already taken effect (noted below where applicable). However, most of their effect occurs after the comp callback has run, in routines that read the rx->extflags field which it populates.
In general the flags should be preserved in rx->extflags after compilation, although the regex engine might want to add or delete some of them to invoke or disable some special behavior in Perl. The flags along with any special behavior they cause are documented below:
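Preserving the incoming flags while adding or deleting individual bits is ordinary bitmask arithmetic, which the following sketch illustrates. The TOY_* constants and their bit positions are assumptions made up for this example; the real RXf_PMf_* values are defined in Perl's headers and should never be hard-coded. toy_adjust_extflags and toy_has_flag are likewise hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define TOY_PMf_MULTILINE  ((uint32_t)1 << 0)
#define TOY_PMf_SINGLELINE ((uint32_t)1 << 1)
#define TOY_PMf_FOLD       ((uint32_t)1 << 2)
#define TOY_PMf_EXTENDED   ((uint32_t)1 << 3)
#define TOY_SPECIAL        ((uint32_t)1 << 8) /* some engine-requested behavior */

/* Start from the flags handed to comp, then add and delete bits.
   The result is what a comp routine would store in its extflags field. */
static uint32_t toy_adjust_extflags(uint32_t flags, uint32_t add,
                                    uint32_t remove)
{
    uint32_t extflags = flags; /* preserve what was passed in */
    extflags |= add;           /* invoke extra behavior */
    extflags &= ~remove;       /* disable behavior */
    return extflags;
}

/* Nonzero when every bit of flag is set in extflags. */
static int toy_has_flag(uint32_t extflags, uint32_t flag)
{
    return (extflags & flag) == flag;
}
```

Passing zero for both add and remove gives the default case described above, where the flags are stored unchanged.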
The pattern modifiers:
/m - RXf_PMf_MULTILINE
If this is in rx->extflags it will be passed to Perl_fbm_instr by pp_split, which will treat the subject string as a multi-line string.

/s - RXf_PMf_SINGLELINE

/i - RXf_PMf_FOLD

/x - RXf_PMf_EXTENDED
If present on a regex, "#" comments will be handled differently by the tokenizer in some cases.
TODO: Document those cases.