perlre.pod@ 3298

Last change on this file since 3298 was 3181, checked in by bird, 19 years ago
perl 5.8.8
File size: 52.6 KB

[3181]	1	=head1 NAME
	2	X<regular expression> X<regex> X<regexp>
	3
	4	perlre - Perl regular expressions
	5
	6	=head1 DESCRIPTION
	7
	8	This page describes the syntax of regular expressions in Perl.
	9
	10	If you haven't used regular expressions before, a quick-start
	11	introduction is available in L<perlrequick>, and a longer tutorial
	12	introduction is available in L<perlretut>.
	13
	14	For reference on how regular expressions are used in matching
	15	operations, plus various examples of the same, see discussions of
	16	C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
	17	Operators">.
	18
	19	Matching operations can have various modifiers. Modifiers
	20	that relate to the interpretation of the regular expression inside
	21	are listed below. Modifiers that alter the way a regular expression
	22	is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
	23	L<perlop/"Gory details of parsing quoted constructs">.
	24
	25	=over 4
	26
	27	=item i
	28	X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
	29	X<regular expression, case-insensitive>
	30
	31	Do case-insensitive pattern matching.
	32
	33	If C<use locale> is in effect, the case map is taken from the current
	34	locale. See L<perllocale>.
	35
	36	=item m
	37	X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
	38
	39	Treat string as multiple lines. That is, change "^" and "$" from matching
	40	the start or end of the string to matching the start or end of any
	41	line anywhere within the string.
	42
	43	=item s
	44	X</s> X<regex, single-line> X<regexp, single-line>
	45	X<regular expression, single-line>
	46
	47	Treat string as single line. That is, change "." to match any character
	48	whatsoever, even a newline, which normally it would not match.
	49
	50	The C</s> and C</m> modifiers both override the C<$*> setting. That
	51	is, no matter what C<$*> contains, C</s> without C</m> will force
	52	"^" to match only at the beginning of the string and "$" to match
	53	only at the end (or just before a newline at the end) of the string.
	54	Together, as /ms, they let the "." match any character whatsoever,
	55	while still allowing "^" and "$" to match, respectively, just after
	56	and just before newlines within the string.
	57
	58	=item x
	59	X</x>
	60
	61	Extend your pattern's legibility by permitting whitespace and comments.
	62
	63	=back
	64
	65	These are usually written as "the C</x> modifier", even though the delimiter
	66	in question might not really be a slash. Any of these
	67	modifiers may also be embedded within the regular expression itself using
	68	the C<(?...)> construct. See below.
	69
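As a minimal sketch of the two ways to apply a modifier (the sample string is made up for illustration), the following compares C</i> on the operator with the embedded C<(?i)> form:

```perl
use strict;
use warnings;

my $text = "Hello World";   # illustrative input, not from the original text

# /i on the operator: case-insensitive for the whole pattern.
my $on_operator = $text =~ /hello world/i ? 1 : 0;

# (?i) embedded in the pattern: same effect, written inside the pattern.
my $embedded = $text =~ /(?i)hello world/ ? 1 : 0;

# Without either, matching is case-sensitive and this fails.
my $plain = $text =~ /hello world/ ? 1 : 0;

print "$on_operator $embedded $plain\n";   # prints "1 1 0"
```
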
	70	The C</x> modifier itself needs a little more explanation. It tells
	71	the regular expression parser to ignore whitespace that is neither
	72	backslashed nor within a character class. You can use this to break up
	73	your regular expression into (slightly) more readable parts. The C<#>
	74	character is also treated as a metacharacter introducing a comment,
	75	just as in ordinary Perl code. This also means that if you want real
	76	whitespace or C<#> characters in the pattern (outside a character
	77	class, where they are unaffected by C</x>), you'll have to either
	78	escape them or encode them using octal or hex escapes. Taken together,
	79	these features go a long way towards making Perl's regular expressions
	80	more readable. Note that you have to be careful not to include the
	81	pattern delimiter in the comment--perl has no way of knowing you did
	82	not intend to close the pattern early. See the C-comment deletion code
	83	in L<perlop>.
	84	X</x>
	85
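The points above can be sketched as follows. The pattern and the input string are illustrative assumptions. The pattern uses C</x> comments, a literal space inside a character class, and an escaped C<#>. None of the comments contains the pattern delimiter.

```perl
use strict;
use warnings;

my $re = qr/
    ^ (\d+)        # major version
    \. (\d+)       # minor version, after a literal dot
    [ ]            # one literal space, protected by the character class
    \# (\w+)       # an escaped hash sign, then a tag
    $              # end of string
/x;

my @parts = "5.8 #stable" =~ $re;
print join(",", @parts), "\n";   # prints "5,8,stable"
```
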
	86	=head2 Regular Expressions
	87
	88	The patterns used in Perl pattern matching derive from those supplied in
	89	the Version 8 regex routines. (The routines are derived
	90	(distantly) from Henry Spencer's freely redistributable reimplementation
	91	of the V8 routines.) See L<Version 8 Regular Expressions> for
	92	details.
	93
	94	In particular the following metacharacters have their standard I<egrep>-ish
	95	meanings:
	96	X<metacharacter>
	97	X<\> X<^> X<.> X<$> X<\|> X<(> X<()> X<[> X<[]>
	98
	99
	100	\ Quote the next metacharacter
	101	^ Match the beginning of the line
	102	. Match any character (except newline)
	103	$ Match the end of the line (or before newline at the end)
	104	| Alternation
	105	() Grouping
	106	[] Character class
	107
	108	By default, the "^" character is guaranteed to match only the
	109	beginning of the string, the "$" character only the end (or before the
	110	newline at the end), and Perl does certain optimizations with the
	111	assumption that the string contains only one line. Embedded newlines
	112	will not be matched by "^" or "$". You may, however, wish to treat a
	113	string as a multi-line buffer, such that the "^" will match after any
	114	newline within the string, and "$" will match before any newline. At the
	115	cost of a little more overhead, you can do this by using the /m modifier
	116	on the pattern match operator. (Older programs did this by setting C<$*>,
	117	but this practice is now deprecated.)
	118	X<^> X<$> X</m>
	119
	120	To simplify multi-line substitutions, the "." character never matches a
	121	newline unless you use the C</s> modifier, which in effect tells Perl to pretend
	122	the string is a single line--even if it isn't. The C</s> modifier also
	123	overrides the setting of C<$*>, in case you have some (badly behaved) older
	124	code that sets it in another module.
	125	X<.> X</s>
	126
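A short sketch of the default behaviour versus C</m> and C</s>, using a made-up two-line buffer:

```perl
use strict;
use warnings;

my $buf = "one\ntwo\n";   # illustrative two-line buffer

# Default: "^" and "$" anchor only at the ends of the whole string,
# so no single word spans from "^" to "$".
my @default = $buf =~ /^(\w+)$/g;

# /m: "^" and "$" also match around the embedded newline.
my @per_line = $buf =~ /^(\w+)$/mg;

# "." does not match a newline unless /s is given.
my $dot_plain = $buf =~ /^one.two$/  ? 1 : 0;
my $dot_s     = $buf =~ /^one.two$/s ? 1 : 0;

print scalar(@default), " [@per_line] $dot_plain $dot_s\n";   # prints "0 [one two] 0 1"
```
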
	127	The following standard quantifiers are recognized:
	128	X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>
	129
	130	* Match 0 or more times
	131	+ Match 1 or more times
	132	? Match 1 or 0 times
	133	{n} Match exactly n times
	134	{n,} Match at least n times
	135	{n,m} Match at least n but not more than m times
	136
	137	(If a curly bracket occurs in any other context, it is treated
	138	as a regular character. In particular, the lower bound
	139	is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+"
	140	quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
	141	to integral values less than a preset limit defined when perl is built.
	142	This is usually 32766 on the most common platforms. The actual limit can
	143	be seen in the error message generated by code such as this:
	144
	145	$_ **= $_ , / {$_} / for 2 .. 42;
	146
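As a small check of the counted forms (the test strings are illustrative only), the sketch below shows C<{n,m}> and C<{n}> in use and confirms that C<*> and C<{0,}> agree on a handful of inputs:

```perl
use strict;
use warnings;

# {2,3} takes up to three "a"s; {2} takes exactly two.
my ($range) = "aaaa" =~ /^(a{2,3})/;   # "aaa"
my ($exact) = "aaaa" =~ /^(a{2})/;     # "aa"

# "*" and "{0,}" should accept and reject the same strings.
my $agree = 0;
for my $s ("", "a", "aaa", "b") {
    my $star  = $s =~ /^a*$/    ? 1 : 0;
    my $brace = $s =~ /^a{0,}$/ ? 1 : 0;
    $agree++ if $star == $brace;
}

print "$range $exact $agree\n";   # prints "aaa aa 4"
```
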
	147	By default, a quantified subpattern is "greedy", that is, it will match as
	148	many times as possible (given a particular starting location) while still
	149	allowing the rest of the pattern to match. If you want it to match the
	150	minimum number of times possible, follow the quantifier with a "?". Note
	151	that the meanings don't change, just the "greediness":
	152	X<metacharacter> X<greedy> X<greedyness>
	153	X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>
	154
	155	*? Match 0 or more times
	156	+? Match 1 or more times
	157	?? Match 0 or 1 time
	158	{n}? Match exactly n times
	159	{n,}? Match at least n times
	160	{n,m}? Match at least n but not more than m times
	161
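The difference in greediness can be seen in a minimal sketch like this one, where the sample string is an assumption chosen to contain two possible stopping points:

```perl
use strict;
use warnings;

my $s = "aXbXc";   # illustrative input with two "X" characters

# Greedy: ".*" takes as much as it can while still letting "X" match.
my ($greedy) = $s =~ /^(.*)X/;

# Non-greedy: ".*?" takes as little as it can.
my ($minimal) = $s =~ /^(.*?)X/;

print "$greedy $minimal\n";   # prints "aXb a"
```
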
	162	Because patterns are processed as double quoted strings, the following
	163	also work:
	164	X<\t> X<\n> X<\r> X<\f> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
	165	X<\0> X<\c> X<\N> X<\x>
	166
	167	\t tab (HT, TAB)
	168	\n newline (LF, NL)
	169	\r return (CR)
	170	\f form feed (FF)
	171	\a alarm (bell) (BEL)
	172	\e escape (think troff) (ESC)
	173	\033 octal char (think of a PDP-11)
	174	\x1B hex char
	175	\x{263a} wide hex char (Unicode SMILEY)
	176	\c[ control char
	177	\N{name} named char
	178	\l lowercase next char (think vi)
	179	\u uppercase next char (think vi)
	180	\L lowercase till \E (think vi)
	181	\U uppercase till \E (think vi)
	182	\E end case modification (think vi)
	183	\Q quote (disable) pattern metacharacters till \E
	184
	185	If C<use locale> is in effect, the case map used by C<\l>, C<\L>, C<\u>
	186	and C<\U> is taken from the current locale. See L<perllocale>. For
	187	documentation of C<\N{name}>, see L<charnames>.
	188
	189	You cannot include a literal C<$> or C<@> within a C<\Q> sequence.
	190	An unescaped C<$> or C<@> interpolates the corresponding variable,
	191	while escaping will cause the literal string C<\$> to be matched.
	192	You'll need to write something like C<m/\Quser\E\@\Qhost/>.
	193
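A hedged sketch of the C<\Q...\E> idiom described above follows. The variable C<$user> and the host name C<example> are hypothetical. The C<@> is escaped outside the quoted region, as the text recommends.

```perl
use strict;
use warnings;

my $user = "a.b";   # hypothetical user name containing a metacharacter

# \Q...\E quotes the interpolated value; the "@" is escaped outside it.
my $re = qr/^\Q$user\E\@example$/;

my $exact = 'a.b@example' =~ $re ? 1 : 0;   # the dot matches itself
my $other = 'aXb@example' =~ $re ? 1 : 0;   # the dot is not a wildcard here

print "$exact $other\n";   # prints "1 0"
```
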
	194	In addition, Perl defines the following:
	195	X<metacharacter>
	196	X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
	197	X<word> X<whitespace>
	198
	199	    \w   Match a "word" character (alphanumeric plus "_")
	200	    \W   Match a non-"word" character
	201	    \s   Match a whitespace character
	202	    \S   Match a non-whitespace character
	203	    \d   Match a digit character
	204	    \D   Match a non-digit character
	205	    \pP  Match P, named property. Use \p{Prop} for longer names.
	206	    \PP  Match non-P
	207	    \X   Match eXtended Unicode "combining character sequence",
	208	         equivalent to (?:\PM\pM*)
	209	    \C   Match a single C char (octet) even under Unicode.
	210	         NOTE: breaks up characters into their UTF-8 bytes,
	211	         so you may end up with malformed pieces of UTF-8.
	212	         Unsupported in lookbehind.
	213
	214	A C<\w> matches a single alphanumeric character (an alphabetic
	215	character, or a decimal digit) or C<_>, not a whole word. Use C<\w+>
	216	to match a string of Perl-identifier characters (which isn't the same
	217	as matching an English word). If C<use locale> is in effect, the list
	218	of alphabetic characters generated by C<\w> is taken from the current
	219	locale. See L<perllocale>. You may use C<\w>, C<\W>, C<\s>, C<\S>,
	220	C<\d>, and C<\D> within character classes, but if you try to use them
	221	as endpoints of a range, the "-" is understood literally rather than
	222	as a range operator. If Unicode is in effect, C<\s> also matches "\x{85}",
	223	"\x{2028}", and "\x{2029}". See L<perlunicode> for more details about
	224	C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in general.
	225	You can define your own C<\p> and C<\P> properties, see L<perlunicode>.
	226	X<\w> X<\W> X<word>
	227
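To illustrate that C<\w> matches one character and C<\w+> a run of identifier characters, and that C<\d> may appear inside a character class, here is a short sketch with made-up input strings:

```perl
use strict;
use warnings;

# \w+ picks out runs of identifier characters, digits and "_" included.
my @words = "x1 = y_2 + 10" =~ /(\w+)/g;

# \d used inside a character class together with a literal dot.
my ($version) = "v5.8.8" =~ /([\d.]+)/;

print "@words | $version\n";   # prints "x1 y_2 10 | 5.8.8"
```
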
	228	The POSIX character class syntax
	229	X<character class>
	230
	231	[:class:]
	232
	233	is also available. The available classes and their backslash
	234	equivalents (if available) are as follows:
	235	X<character class>
	236	X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
	237	X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>
	238
	239	alpha
	240	alnum
	241	ascii
	242	blank [1]
	243	cntrl
	244	digit \d
	245	graph
	246	lower
	247	print
	248	punct
	249	space \s [2]
	250	upper
	251	word \w [3]
	252	xdigit
	253
	254	=over
	255
	256	=item [1]
	257
	258	A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace".
	259
	260	=item [2]
	261
	262	Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
	263	also the (very rare) "vertical tabulator", "\ck", chr(11).
	264
	265	=item [3]
	266
	267	A Perl extension, see above.
	268
	269	=back
	270
	271	For example use C<[:upper:]> to match all the uppercase characters.
	272	Note that the C<[]> are part of the C<[::]> construct, not part of the
	273	whole character class. For example:
	274
	275	[01[:alpha:]%]
	276
	277	matches zero, one, any alphabetic character, and the percentage sign.
	278
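The class above can be tried out with a sketch like the following, which filters the characters of an illustrative string through it:

```perl
use strict;
use warnings;

# Keep only the characters accepted by the class [01[:alpha:]%].
my $kept = join '', grep { /[01[:alpha:]%]/ } split //, "0a2%Z-1";

print "$kept\n";   # prints "0a%Z1"
```
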
	279	The following equivalences to Unicode \p{} constructs and equivalent
	280	backslash character classes (if available), will hold:
	281	X<character class> X<\p> X<\p{}>
	282
	283	    [:...:]     \p{...}         backslash
	284
	285	    alpha       IsAlpha
	286	    alnum       IsAlnum
	287	    ascii       IsASCII
	288	    blank       IsSpace
	289	    cntrl       IsCntrl
	290	    digit       IsDigit        \d
	291	    graph       IsGraph
	292	    lower       IsLower
	293	    print       IsPrint
	294	    punct       IsPunct
	295	    space       IsSpace
	296	                IsSpacePerl    \s
	297	    upper       IsUpper
	298	    word        IsWord
	299	    xdigit      IsXDigit
	300
	301	For example C<[:lower:]> and C<\p{IsLower}> are equivalent.
	302
	303	If the C<utf8> pragma is not used but the C<locale> pragma is, the
	304	classes correlate with the usual isalpha(3) interface (except for
	305	"word" and "blank").
	306
	307	The classes whose names may not be self-explanatory are:
	308
	309	=over 4
	310
	311	=item cntrl
	312	X<cntrl>
	313
	314	Any control character. Usually characters that don't produce output as
	315	such but instead control the terminal somehow: for example newline and
	316	backspace are control characters. All characters with ord() less than
	317	32 are most often classified as control characters (assuming ASCII,
	318	the ISO Latin character sets, and Unicode), as is the character with
	319	the ord() value of 127 (C<DEL>).
	320
	321	=item graph
	322	X<graph>
	323
	324	Any alphanumeric or punctuation (special) character.
	325
	326	=item print
	327	X<print>
	328
	329	Any alphanumeric or punctuation (special) character or the space character.
	330
	331	=item punct
	332	X<punct>
	333
	334	Any punctuation (special) character.
	335
	336	=item xdigit
	337	X<xdigit>
	338
	339	Any hexadecimal digit. Though this may feel silly ([0-9A-Fa-f] would
	340	work just fine) it is included for completeness.
	341
	342	=back
	343
	344	You can negate the [::] character classes by prefixing the class name
	345	with a '^'. This is a Perl extension. For example:
	346	X<character class, negation>
	347
	348	    POSIX         traditional  Unicode
	349
	350	    [:^digit:]      \D         \P{IsDigit}
	351	    [:^space:]      \S         \P{IsSpace}
	352	    [:^word:]       \W         \P{IsWord}
	353
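A small sketch of the negated form, assuming ASCII input (the phone-like string is invented for the example): C<[[:^digit:]]> and C<\D> should strip the same characters.

```perl
use strict;
use warnings;

my $s = "tel: 555-0100";   # illustrative ASCII input

# Remove every non-digit, once with the POSIX form and once with \D.
(my $posix = $s) =~ s/[[:^digit:]]//g;
(my $trad  = $s) =~ s/\D//g;

print "$posix $trad\n";   # prints "5550100 5550100"
```
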
	354	Perl respects the POSIX standard in that POSIX character classes are
	355	only supported within a character class. The POSIX character classes
	356	[.cc.] and [=cc=] are recognized but B<not> supported and trying to
	357	use them will cause an error.
	358
	359	Perl defines the following zero-width assertions:
	360	X<zero-width assertion> X<assertion> X<regex, zero-width assertion>
	361	X<regexp, zero-width assertion>
	362	X<regular expression, zero-width assertion>
	363	X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>
	364
	365	    \b  Match a word boundary
	366	    \B  Match a non-(word boundary)
	367	    \A  Match only at beginning of string
	368	    \Z  Match only at end of string, or before newline at the end
	369	    \z  Match only at end of string
	370	    \G  Match only at pos() (e.g. at the end-of-match position
	371	        of prior m//g)
	372
	373	A word boundary (C<\b>) is a spot between two characters
	374	that has a C<\w> on one side of it and a C<\W> on the other side
	375	of it (in either order), counting the imaginary characters off the
	376	beginning and end of the string as matching a C<\W>. (Within
	377	character classes C<\b> represents backspace rather than a word
	378	boundary, just as it normally does in any double-quoted string.)
	379	The C<\A> and C<\Z> are just like "^" and "$", except that they
	380	won't match multiple times when the C</m> modifier is used, while
	381	"^" and "$" will match at every internal line boundary. To match
	382	the actual end of the string and not ignore an optional trailing
	383	newline, use C<\z>.
	384	X<\b> X<\A> X<\Z> X<\z> X</m>
	385
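The difference between C<\Z> and C<\z>, and the effect of C<\b>, can be sketched as follows with illustrative strings:

```perl
use strict;
use warnings;

my $line = "foo\n";

my $cap_z   = $line =~ /foo\Z/   ? 1 : 0;   # \Z tolerates the trailing newline
my $small_z = $line =~ /foo\z/   ? 1 : 0;   # \z demands the very end
my $with_nl = $line =~ /foo\n\z/ ? 1 : 0;   # the newline must be matched explicitly

# \b keeps "cat" from matching inside "concat".
my @hits = "cat concat cat." =~ /\bcat\b/g;

print "$cap_z $small_z $with_nl ", scalar(@hits), "\n";   # prints "1 0 1 2"
```
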
	386	The C<\G> assertion can be used to chain global matches (using
	387	C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
	388	It is also useful when writing C<lex>-like scanners, when you have
	389	several patterns that you want to match against successive substrings
	390	of your string; see the previous reference. The actual location
	391	where C<\G> will match can also be influenced by using C<pos()> as
	392	an lvalue: see L<perlfunc/pos>. Currently C<\G> is only fully
	393	supported when anchored to the start of the pattern; while it
	394	is permitted to use it elsewhere, as in C</(?<=\G..)./g>, some
	395	such uses (C</.\G/g>, for example) currently cause problems, and
	396	it is recommended that you avoid such usage for now.
	397	X<\G>
	398
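A minimal C<lex>-like scanner along these lines might look like the sketch below. The token names (C<NUM>, C<ID>, C<OP>) and the input are assumptions for illustration. Every pattern is anchored with C<\G> at its start, and the C</c> modifier (see L<perlop>) keeps C<pos()> intact when an alternative fails.

```perl
use strict;
use warnings;

my $src = "x1 = 42";   # illustrative input
my @tokens;

pos($src) = 0;
while (pos($src) < length $src) {
    # Each branch tries to match exactly where the previous one stopped.
    if    ($src =~ /\G\s+/gc)   { next }                       # skip whitespace
    elsif ($src =~ /\G(\d+)/gc) { push @tokens, "NUM:$1" }     # digits first
    elsif ($src =~ /\G(\w+)/gc) { push @tokens, "ID:$1" }      # then identifiers
    elsif ($src =~ /\G(.)/gcs)  { push @tokens, "OP:$1" }      # anything else
}

print "@tokens\n";   # prints "ID:x1 OP:= NUM:42"
```

The digit branch is tried before the identifier branch because C<\w> also matches digits.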
	399	The bracketing construct C<( ... )> creates capture buffers. To
	400	refer to the digit'th buffer use \<digit> within the
	401	match. Outside the match use "$" instead of "\". (The
	402	\<digit> notation works in certain circumstances outside
	403	the match. See the warning below about \1 vs $1 for details.)
	404	Referring back to another part of the match is called a
	405	I<backreference>.
	406	X<regex, capture buffer> X<regexp, capture buffer>
	407	X<regular expression, capture buffer> X<backreference>
	408
	409	There is no limit to the number of captured substrings that you may
	410	use. However Perl also uses \10, \11, etc. as aliases for \010,
	411	\011, etc. (Recall that 0 means octal, so \011 is the character at
	412	number 9 in your coded character set; which would be the 10th character,
	413	a horizontal tab under ASCII.) Perl resolves this
	414	ambiguity by interpreting \10 as a backreference only if at least 10
	415	left parentheses have opened before it. Likewise \11 is a
	416	backreference only if at least 11 left parentheses have opened
	417	before it. And so on. \1 through \9 are always interpreted as
	418	backreferences.
	419
	420	Examples:
	421
	422	    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words
	423
	424	    if (/(.)\1/) {                  # find first doubled char
	425	        print "'$1' is the first doubled character\n";
	426	    }
	427
	428	    if (/Time: (..):(..):(..)/) {   # parse out values
	429	        $hours = $1;
	430	        $minutes = $2;
	431	        $seconds = $3;
	432	    }
	433
	434	Several special variables also refer back to portions of the previous
	435	match. C<$+> returns whatever the last bracket match matched.
	436	C<$&> returns the entire matched string. (At one point C<$0> did
	437	also, but now it returns the name of the program.) C<$`> returns
	438	everything before the matched string. C<$'> returns everything
	439	after the matched string. And C<$^N> contains whatever was matched by
	440	the most-recently closed group (submatch). C<$^N> can be used in
	441	extended patterns (see below), for example to assign a submatch to a
	442	variable.
	443	X<$+> X<$^N> X<$&> X<$`> X<$'>
	444
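As a sketch of what these variables hold after one successful match (the input is illustrative, and note the performance warning about C<$&>, C<$`> and C<$'> elsewhere in this document):

```perl
use strict;
use warnings;

my ($pre, $match, $post, $last_bracket, $last_closed);
if ("hello world" =~ /o (w)/) {
    $pre          = $`;    # text before the match
    $match        = $&;    # the whole match
    $post         = $';    # text after the match
    $last_bracket = $+;    # last bracket that matched
    $last_closed  = $^N;   # most recently closed group
}

print "[$pre] [$match] [$post] $last_bracket $last_closed\n";
# prints "[hell] [o w] [orld] w w"
```
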
	445	The numbered match variables ($1, $2, $3, etc.) and the related punctuation
	446	set (C<$+>, C<$&>, C<$`>, C<$'>, and C<$^N>) are all dynamically scoped
	447	until the end of the enclosing block or until the next successful
	448	match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
	449	X<$+> X<$^N> X<$&> X<$`> X<$'>
	450	X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>
	451
	452
	453	B<NOTE>: failed matches in Perl do not reset the match variables,
	454	which makes it easier to write code that tests for a series of more
	455	specific cases and remembers the best match.
	456
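Both the dynamic scoping and the behaviour of failed matches can be observed in a short sketch such as this one (the strings are illustrative):

```perl
use strict;
use warnings;

my @seen;
if ("abc" =~ /(b)/) {
    push @seen, $1;          # "b" from the outer match
    {
        "xyz" =~ /(y)/;
        push @seen, $1;      # "y" inside the inner block
    }
    push @seen, $1;          # back to "b" once the inner block ends
    "abc" =~ /(z)/;          # this match fails
    push @seen, $1;          # still "b": a failed match resets nothing
}

print "@seen\n";   # prints "b y b b"
```
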
	457	B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
	458	C<$'> anywhere in the program, it has to provide them for every
	459	pattern match. This may substantially slow your program. Perl
	460	uses the same mechanism to produce $1, $2, etc, so you also pay a
	461	price for each pattern that contains capturing parentheses. (To
	462	avoid this cost while retaining the grouping behaviour, use the
	463	extended regular expression C<(?: ... )> instead.) But if you never
	464	use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
	465	parentheses will not be penalized. So avoid C<$&>, C<$'>, and C<$`>
	466	if you can, but if you can't (and some algorithms really appreciate
	467	them), once you've used them once, use them at will, because you've
	468	already paid the price. As of 5.005, C<$&> is not so costly as the
	469	other two.
	470	X<$&> X<$`> X<$'>
	471
	472	Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
	473	C<\w>, C<\n>. Unlike some other regular expression languages, there
	474	are no backslashed symbols that aren't alphanumeric. So anything
	475	that looks like \\, \(, \), \<, \>, \{, or \} is always
	476	interpreted as a literal character, not a metacharacter. This was
	477	once used in a common idiom to disable or quote the special meanings
	478	of regular expression metacharacters in a string that you want to
	479	use for a pattern. Simply quote all non-"word" characters:
	480
	481	$pattern =~ s/(\W)/\\$1/g;
	482
	483	(If C<use locale> is set, then this depends on the current locale.)
	484	Today it is more common to use the quotemeta() function or the C<\Q>
	485	metaquoting escape sequence to disable all metacharacters' special
	486	meanings like this:
	487
	488	/$unquoted\Q$quoted\E$unquoted/
	489
	490	Beware that if you put literal backslashes (those not inside
	491	interpolated variables) between C<\Q> and C<\E>, double-quotish
	492	backslash interpolation may lead to confusing results. If you
	493	I<need> to use literal backslashes within C<\Q...\E>,
	494	consult L<perlop/"Gory details of parsing quoted constructs">.
	495
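For a plain ASCII string, the older substitution idiom and quotemeta() should produce the same quoted text. The sketch below, with a made-up literal, checks that and shows why quoting matters:

```perl
use strict;
use warnings;

my $literal = "a.b*c";   # illustrative string containing metacharacters

# Older idiom: backslash every non-"word" character by hand.
(my $by_hand = $literal) =~ s/(\W)/\\$1/g;

# Modern equivalent.
my $quoted = quotemeta $literal;

# Unquoted, "." and "*" keep their special meanings and match too much.
my $unquoted_hit = "aXbbc" =~ /^$literal$/ ? 1 : 0;
my $quoted_hit   = "aXbbc" =~ /^$quoted$/  ? 1 : 0;
my $exact_hit    = "a.b*c" =~ /^$quoted$/  ? 1 : 0;

print "$by_hand $quoted $unquoted_hit $quoted_hit $exact_hit\n";
# prints "a\.b\*c a\.b\*c 1 0 1"
```
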
	496	=head2 Extended Patterns
	497
	498	Perl also defines a consistent extension syntax for features not
	499	found in standard tools like B<awk> and B<lex>. The syntax is a