This is the version of this documentation from Perl 5.8.7.

NAME

Text::Balanced - Extract delimited text sequences from strings.

SYNOPSIS

 use Text::Balanced qw (
			extract_delimited
			extract_bracketed
			extract_quotelike
			extract_codeblock
			extract_variable
			extract_tagged
			extract_multiple

			gen_delimited_pat
			gen_extract_tagged
		       );

 # Extract the initial substring of $text that is delimited by
 # two (unescaped) instances of the first character in $delim.

	($extracted, $remainder) = extract_delimited($text,$delim);


 # Extract the initial substring of $text that is bracketed
 # with a delimiter(s) specified by $delim (where the string
 # in $delim contains one or more of '(){}[]<>').

	($extracted, $remainder) = extract_bracketed($text,$delim);


 # Extract the initial substring of $text that is bounded by
 # an XML tag.

	($extracted, $remainder) = extract_tagged($text);


 # Extract the initial substring of $text that is bounded by
 # a BEGIN...END pair. Don't allow nested BEGIN tags.

	($extracted, $remainder) =
		extract_tagged($text,"BEGIN","END",undef,{reject=>["BEGIN"]});


 # Extract the initial substring of $text that represents a
 # Perl "quote or quote-like operation"

	($extracted, $remainder) = extract_quotelike($text);


 # Extract the initial substring of $text that represents a block
 # of Perl code, bracketed by any of character(s) specified by $delim
 # (where the string $delim contains one or more of '(){}[]<>').

	($extracted, $remainder) = extract_codeblock($text,$delim);


 # Extract the initial substrings of $text that would be extracted by
 # one or more sequential applications of the specified functions
 # or regular expressions

	@extracted = extract_multiple($text,
				      [ \&extract_bracketed,
					\&extract_quotelike,
					\&some_other_extractor_sub,
					qr/[xyz]*/,
					'literal',
				      ]);


 # Create a string representing an optimized pattern (a la Friedl)
 # that matches a substring delimited by any of the specified characters
 # (in this case: any type of quote or a slash)

	$patstring = gen_delimited_pat(q{'"`/});


 # Generate a reference to an anonymous sub that is just like extract_tagged
 # but pre-compiled and optimized for a specific pair of tags, and consequently
 # much faster (i.e. 3 times faster). It uses qr// for better performance on
 # repeated calls, so it only works under Perl 5.005 or later.

	$extract_head = gen_extract_tagged('<HEAD>','</HEAD>');

	($extracted, $remainder) = $extract_head->($text);

DESCRIPTION

The various extract_... subroutines may be used to extract a delimited substring, possibly after skipping a specified prefix string. By default, that prefix is optional whitespace (/\s*/), but you can change it to whatever you wish (see below).

The substring to be extracted must appear at the current pos location of the string's variable (or at index zero, if no pos position is defined). In other words, the extract_... subroutines don't extract the first occurrence of a substring anywhere in a string (like an unanchored regex would). Rather, they extract an occurrence of the substring appearing immediately at the current matching position in the string (like a \G-anchored regex would).
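
As a rough illustration of this anchoring (a sketch only; the sample string and the variable names are made up for the example), the following code tries to extract a double-quoted substring that does not sit at the start of the string, retries after moving pos, and finally skips the leading text with a prefix pattern instead:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Hypothetical sample input: the quoted part starts at index 4.
my $text = q{say "hello" now};

# No pos is set, so matching starts at index 0, where there is no
# quote (even after optional whitespace). The extraction fails.
my ($miss) = extract_delimited($text, q{"});

# Move the matching position to the opening quote and try again.
pos($text) = 4;
my ($hit, $rest) = extract_delimited($text, q{"});

# Alternatively, skip the leading text with an explicit prefix pattern.
my $copy = q{say "hello" now};
my ($hit2, $rest2, $skipped) = extract_delimited($copy, q{"}, '[^"]*');

print "first attempt failed\n" unless $miss;
print "hit=[$hit] rest=[$rest] skipped=[$skipped]\n";
```

Setting pos by hand is mostly useful when another regex has already advanced through the string; for a one-off extraction the prefix argument is usually the simpler choice.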

General behaviour in list contexts

In a list context, all the subroutines return a list, the first three elements of which are always:

[0]

The extracted string, including the specified delimiters. If the extraction fails an empty string is returned.

[1]

The remainder of the input string (i.e. the characters after the extracted string). On failure, the entire string is returned.

[2]

The skipped prefix (i.e. the characters before the extracted string). On failure, the empty string is returned.

Note that in a list context, the contents of the original input text (the first argument) are not modified in any way.

However, if the input text was passed in a variable, that variable's pos value is updated to point at the first character after the extracted text. That means that in a list context the various subroutines can be used much like regular expressions. For example:

while ( $next = (extract_quotelike($text))[0] )
{
	# process next quote-like (in $next)
}
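
The list-context behaviour can be sketched with extract_delimited as follows. This is only an illustration: the input string and the variable names are assumptions chosen for the example, and the default delimiter set and prefix are used.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Hypothetical input with three differently quoted pieces.
my $text     = q{"a" 'b' `c` end};
my $original = $text;

# One call in list context: extracted text, remainder, skipped prefix.
my ($first, $remainder, $prefix) = extract_delimited($text);
my $pos_after_first = pos($text);   # just past the closing quote of "a"

# Rewind and walk the whole string; each call resumes at pos($text).
pos($text) = 0;
my @pieces;
while ( my $next = (extract_delimited($text))[0] ) {
    push @pieces, $next;
}

print "pieces: @pieces\n";
print "text unchanged\n" if $text eq $original;
```

The loop stops at the unquoted word "end", because nothing delimited starts at that position.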

General behaviour in scalar and void contexts

In a scalar context, the extracted string is returned, having first been removed from the input text. Thus, the following code also processes each quote-like operation, but actually removes them from $text:

while ( $next = extract_quotelike($text) )
{
	# process next quote-like (in $next)
}

Note that if the input text is a read-only string (i.e. a literal), no attempt is made to remove the extracted text.

In a void context the behaviour of the extraction subroutines is exactly the same as in a scalar context, except (of course) that the extracted substring is not returned.
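
A minimal sketch of the scalar-context behaviour, using extract_delimited and a made-up input string, shows how each call removes the extracted text (and its skipped prefix) from the variable:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Hypothetical input: two single-quoted strings, then plain text.
my $text = q{'first' 'second' tail};

my @found;
# Scalar context: each successful call chops the skipped prefix and
# the extracted string off the front of $text.
while ( my $next = extract_delimited($text, q{'}) ) {
    push @found, $next;
}

print "found: @found\n";
print "left: [$text]\n";
```

Once the loop finishes, $text holds only what could not be extracted, which makes this form convenient for consuming input piece by piece.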

A note about prefixes

Prefix patterns are matched without any trailing modifiers (/gimsox etc.). This can bite you if you're expecting a prefix specification like '.*?(?=<H1>)' to skip everything up to the first <H1> tag. Such a prefix pattern will only succeed if the <H1> tag is on the current line, since . normally doesn't match newlines.

To overcome this limitation, you need to turn on /s matching within the prefix pattern, using the (?s) directive: '(?s).*?(?=<H1>)'
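
The difference can be sketched with extract_tagged and explicit <H1> tags. The input text and variable names below are invented for the illustration, and the two calls work on separate copies so that neither affects the other:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Hypothetical input: the <H1> element is not on the first line.
my $html = "intro line\nmore intro\n<H1>Title</H1>\nbody\n";

# Without (?s), '.' stops at the first newline, so the prefix cannot
# reach the <H1> tag and the extraction fails.
my $copy1 = $html;
my ($plain) = extract_tagged($copy1, '<H1>', '</H1>', '.*?(?=<H1>)');

# With (?s), '.' also matches newlines and the prefix is skipped.
my $copy2 = $html;
my ($tagged, $rest, $skipped) =
    extract_tagged($copy2, '<H1>', '</H1>', '(?s).*?(?=<H1>)');

print "without (?s): extraction failed\n" unless $plain;
print "with (?s): $tagged\n";
```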

extract_delimited

The extract_delimited function formalizes the common idiom of extracting a single-character-delimited substring from the start of a string. For example, to extract a single-quote delimited string, the following code is typically used:

($remainder = $text) =~ s/\A('(\\.|[^'])*')//s;
$extracted = $1;

but with extract_delimited it can be simplified to:

($extracted,$remainder) = extract_delimited($text, "'");

extract_delimited takes up to four scalars (the input text, the delimiters, a prefix pattern to be skipped, and any escape characters) and extracts the initial substring of the text that is appropriately delimited. If the delimiter string has multiple characters, the first one encountered in the text is taken to delimit the substring. The third argument specifies a prefix pattern that is to be skipped (but must be present!) before the substring is extracted. The final argument specifies the escape character to be used for each delimiter.

All arguments are optional. If the escape characters are not specified, every delimiter is escaped with a backslash (\). If the prefix is not specified, the pattern '\s*' - optional whitespace - is used. If the delimiter set is also not specified, the set /["'`]/ is used. If the text to be processed is not specified either, $_ is used.

In list context, extract_delimited returns an array of three elements: the extracted substring (including the surrounding delimiters), the remainder of the text, and the skipped prefix (if any). If a suitable delimited substring is not found, the first element of the array is the empty string, the second is the complete original text, and the prefix returned in the third element is an empty string.

In a scalar context, just the extracted substring is returned. In a void context, the extracted substring (and any prefix) are simply removed from the beginning of the first argument.
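
The defaults and the escape argument can be illustrated with a short sketch. The sample strings and variable names are assumptions made for the example, and every call is made in list context so that the inputs are left untouched:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# All arguments omitted: the text is $_, the delimiters are any of
# ' " `, the prefix is optional whitespace and the escape is a backslash.
local $_ = q{  `ls -l` | sort};
my ($cmd, $after, $skipped) = extract_delimited();

# Default backslash escape: \' does not end the string.
my $c_style = q{'it\'s' tail};
my ($c_str) = extract_delimited($c_style, q{'});

# Pascal-style escape: a doubled quote does not end the string.
my $pascal = q{'It''s here' rest};
my ($p_str, $p_rest) = extract_delimited($pascal, q{'}, '', q{'});

print "cmd=[$cmd] after=[$after]\n";
print "c_str=[$c_str] p_str=[$p_str]\n";
```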

Examples:

# Remove a single-quoted substring from the very beginning of $text:

	$substring = extract_delimited($text, "'", '');

# Remove a single-quoted Pascalish substring (i.e. one in which
# doubling the quote character escapes it) from the very
# beginning of $text:

	$substring = extract_delimited($text, "'", '', "'");

# Extract a single- or double- quoted substring from the
# beginning of $text, optionally after some whitespace
# (note the list context to protect $text from modification):

	($substring) = extract_delimited $text, q{"'};


# Delete the substring delimited by the first '/' in $text:

	$text = join '', (extract_delimited($text,'/','[^/]*'))[2,1];

Note that this last example is not the same as deleting the first quote-like pattern. For instance, if $text contained the string:

"if ('./cmd' =~ m/$UNIXCMD/s) { $cmd = $1; }"

then after the deletion it would contain:

"if ('.$UNIXCMD/s) { $cmd = $1; }"

not:

"if ('./cmd' =~ ms) { $cmd = $1; }"

See "extract_quotelike" for a (partial) solution to this problem.
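
The slash-deletion example above can be traced step by step with the following sketch, which names the three returned pieces instead of slicing them (the variable names are chosen for the illustration only):

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# The sample string from the text above; q{} keeps $UNIXCMD literal.
my $text = q{if ('./cmd' =~ m/$UNIXCMD/s) { $cmd = $1; }};

# The prefix '[^/]*' skips everything before the first slash, and the
# delimited part then runs from that slash to the next one.
my ($cut, $remainder, $prefix) = extract_delimited($text, '/', '[^/]*');

# Dropping the extracted part leaves the prefix plus the remainder.
my $result = join '', $prefix, $remainder;

print "cut:    $cut\n";
print "result: $result\n";
```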