You are viewing the version of this documentation from Perl 5.39.7. This is a development version of Perl.

NAME

perlreapi - Perl regular expression plugin interface

DESCRIPTION

As of Perl 5.9.5 there is a new interface for plugging in and using regular expression engines other than the default one.

Each engine is supposed to provide access to a constant structure of the following format:

    typedef struct regexp_engine {
        REGEXP* (*comp) (pTHX_
                         const SV * const pattern, const U32 flags);
        I32     (*exec) (pTHX_
                         REGEXP * const rx,
                         char* stringarg,
                         char* strend, char* strbeg,
                         SSize_t minend, SV* sv,
                         void* data, U32 flags);
        char*   (*intuit) (pTHX_
                           REGEXP * const rx, SV *sv,
                           const char * const strbeg,
                           char *strpos, char *strend, U32 flags,
                           struct re_scream_pos_data_s *data);
        SV*     (*checkstr) (pTHX_ REGEXP * const rx);
        void    (*free) (pTHX_ REGEXP * const rx);
        void    (*numbered_buff_FETCH) (pTHX_
                                        REGEXP * const rx,
                                        const I32 paren,
                                        SV * const sv);
        void    (*numbered_buff_STORE) (pTHX_
                                        REGEXP * const rx,
                                        const I32 paren,
                                        SV const * const value);
        I32     (*numbered_buff_LENGTH) (pTHX_
                                         REGEXP * const rx,
                                         const SV * const sv,
                                         const I32 paren);
        SV*     (*named_buff) (pTHX_
                               REGEXP * const rx,
                               SV * const key,
                               SV * const value,
                               U32 flags);
        SV*     (*named_buff_iter) (pTHX_
                                    REGEXP * const rx,
                                    const SV * const lastkey,
                                    const U32 flags);
        SV*     (*qr_package)(pTHX_ REGEXP * const rx);
    #ifdef USE_ITHREADS
        void*   (*dupe) (pTHX_ REGEXP * const rx, CLONE_PARAMS *param);
    #endif
        REGEXP* (*op_comp) (...);
    } regexp_engine;

When a regexp is compiled, its engine field is then set to point at the appropriate structure, so that when it needs to be used Perl can find the right routines to do so.

To install a new regexp handler, $^H{regcomp} is set to an integer which (when cast appropriately) resolves to one of these structures. When compiling, the comp method is executed, and the resulting regexp structure's engine field is expected to point back at the same structure.
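The round trip from engine structure to integer and back can be modelled in plain C. The following is a minimal stand-alone sketch, not code that uses Perl's headers: every demo_ name is hypothetical, the structures are cut down to the fields needed for the illustration, and the check that the result points back at its engine is an assumption made for the model rather than a statement of what the core does. A real plugin would work with REGEXP and regexp_engine and would convert the pointer with Perl's PTR2IV and INT2PTR macros.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for REGEXP and regexp_engine. */
struct demo_engine;

typedef struct demo_regexp {
    const struct demo_engine *engine;   /* points back at the compiling engine */
    const char *pattern;
    uint32_t flags;
} demo_regexp;

typedef struct demo_engine {
    demo_regexp *(*comp)(const char *pattern, uint32_t flags);
} demo_engine;

demo_regexp *demo_comp(const char *pattern, uint32_t flags);

/* The constant structure that the engine exposes. */
const demo_engine demo_my_engine = { demo_comp };

/* comp: build the regexp and set its engine field to our own structure. */
demo_regexp *demo_comp(const char *pattern, uint32_t flags)
{
    demo_regexp *rx = malloc(sizeof *rx);
    if (rx == NULL)
        return NULL;
    rx->engine  = &demo_my_engine;
    rx->pattern = pattern;
    rx->flags   = flags;
    return rx;
}

/* Models the value stored in the hint: the engine's address as an integer. */
intptr_t demo_engine_to_hint(const demo_engine *eng)
{
    return (intptr_t)eng;
}

/* Models compilation: recover the engine from the integer and call comp. */
demo_regexp *demo_compile_with_hint(intptr_t hint, const char *pattern,
                                    uint32_t flags)
{
    const demo_engine *eng = (const demo_engine *)hint;
    demo_regexp *rx;

    assert(eng != NULL && eng->comp != NULL);
    rx = eng->comp(pattern, flags);

    /* Assumed check: reject a result that does not point back at eng. */
    if (rx != NULL && rx->engine != eng) {
        free(rx);
        return NULL;
    }
    return rx;
}
```

Calling demo_compile_with_hint(demo_engine_to_hint(&demo_my_engine), "eek", 0) yields a demo_regexp whose engine field is &demo_my_engine; the caller releases it with free().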

The pTHX_ symbol in the definition is a macro used by Perl under threading to provide an extra argument to the routine holding a pointer back to the interpreter that is executing the regexp. So under threading all routines get an extra argument.
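How such a macro adds an argument can be illustrated without Perl's headers. In this sketch the ctx_ names, the ctx_interp type and the CTX_THREADS switch are all hypothetical; they only imitate the pattern, with ctx_pTHX_ used in prototypes and ctx_aTHX_ at call sites (mirroring Perl's pTHX_ and aTHX_ pair).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the interpreter structure. */
typedef struct ctx_interp { int id; } ctx_interp;

/* Set to 0 to model a build without threading. */
#define CTX_THREADS 1

#if CTX_THREADS
   /* Prototype side: declares an extra leading parameter. */
#  define ctx_pTHX_ ctx_interp *my_perl,
   /* Call side: passes the current interpreter along. */
#  define ctx_aTHX_ my_perl,
#else
   /* Without threading both macros expand to nothing. */
#  define ctx_pTHX_
#  define ctx_aTHX_
#endif

/* A callback written once; its real parameter list depends on the build. */
int ctx_exec(ctx_pTHX_ const char *subject)
{
    assert(subject != NULL);
#if CTX_THREADS
    return my_perl->id;     /* the interpreter running the match */
#else
    return -1;              /* no interpreter argument in this build */
#endif
}

/* A caller that forwards its own interpreter pointer. */
int ctx_match(ctx_pTHX_ const char *subject)
{
    return ctx_exec(ctx_aTHX_ subject);
}
```

With CTX_THREADS set to 1, ctx_match takes two arguments, for example ctx_match(&interp, "ook"); with it set to 0 the same source compiles to a one-argument function.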

Callbacks

comp

    REGEXP* comp(pTHX_ const SV * const pattern, const U32 flags);

Compile the pattern stored in pattern using the given flags and return a pointer to a prepared REGEXP structure that can perform the match. See "The REGEXP structure" below for an explanation of the individual fields in the REGEXP struct.

The pattern parameter is the scalar that was used as the pattern. Previous versions of Perl would pass two char* indicating the start and end of the stringified pattern; the following snippet can be used to get the old parameters:

    STRLEN plen;
    char*  exp = SvPV(pattern, plen);
    char* xend = exp + plen;

Since any scalar can be passed as a pattern, it's possible to implement an engine that does something with an array ("ook" =~ [ qw/ eek hlagh / ]) or with the non-stringified form of a compiled regular expression ("ook" =~ qr/eek/). Perl's own engine will always stringify everything using the snippet above, but that doesn't mean other engines have to.

The flags parameter is a bitfield which indicates which of the msixpn flags the regex was compiled with. It also contains additional information, such as whether use locale is in effect.

The eogc flags are stripped out before being passed to the comp routine. The regex engine does not need to know if any of these are set, as those flags should only affect what Perl does with the pattern and its match variables, not how it gets compiled and executed.

By the time the comp callback is called, some of these flags have already taken effect (noted below where applicable). However, most of their effect occurs after the comp callback has run, in routines that read the rx->extflags field which it populates.
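The division of labour described above, where comp records the flags and later routines consult them, can be sketched in plain C. The bit values below are hypothetical placeholders: the real RXf_PMf_* constants come from Perl's headers and are not reproduced here. The flg_ names and the one-field flg_regexp structure are likewise invented for the illustration.

```c
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. */
#define FLG_PMf_MULTILINE   (UINT32_C(1) << 0)
#define FLG_PMf_SINGLELINE  (UINT32_C(1) << 1)
#define FLG_PMf_FOLD        (UINT32_C(1) << 2)
#define FLG_PMf_EXTENDED    (UINT32_C(1) << 3)

/* Stand-in for the part of the regexp structure that holds the flags. */
typedef struct flg_regexp {
    uint32_t extflags;
} flg_regexp;

/* Models the comp side: populate extflags from the incoming bitfield. */
void flg_comp(flg_regexp *rx, uint32_t flags)
{
    rx->extflags = flags;
}

/* Models a later routine: test a single modifier with a bit mask. */
int flg_has(const flg_regexp *rx, uint32_t flag)
{
    return (rx->extflags & flag) != 0;
}
```

After flg_comp(&rx, FLG_PMf_MULTILINE | FLG_PMf_FOLD), a call such as flg_has(&rx, FLG_PMf_MULTILINE) is true while flg_has(&rx, FLG_PMf_EXTENDED) is false.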

In general the flags should be preserved in rx->extflags after compilation, although the regex engine might want to add or delete some of them to invoke or disable some special behavior in Perl. The flags along with any special behavior they cause are documented below:

The pattern modifiers:

/m - RXf_PMf_MULTILINE

If this is in rx->extflags, pp_split passes it to Perl_fbm_instr, which then treats the subject string as a multi-line string.

/s - RXf_PMf_SINGLELINE
/i - RXf_PMf_FOLD
/x - RXf_PMf_EXTENDED

If present on a regex, "#" comments will be handled differently by the tokenizer in some cases.

TODO: Document those cases.