perlre.pod@ 3439

Last change on this file since 3439 was 3181, checked in by bird, 19 years ago
perl 5.8.8
File size: 52.6 KB

1	=head1 NAME
2	X<regular expression> X<regex> X<regexp>
3
4	perlre - Perl regular expressions
5
6	=head1 DESCRIPTION
7
8	This page describes the syntax of regular expressions in Perl.
9
10	If you haven't used regular expressions before, a quick-start
11	introduction is available in L<perlrequick>, and a longer tutorial
12	introduction is available in L<perlretut>.
13
14	For reference on how regular expressions are used in matching
15	operations, plus various examples of the same, see discussions of
16	C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
17	Operators">.
18
19	Matching operations can have various modifiers. Modifiers
20	that relate to the interpretation of the regular expression inside
21	are listed below. Modifiers that alter the way a regular expression
22	is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
23	L<perlop/"Gory details of parsing quoted constructs">.
24
25	=over 4
26
27	=item i
28	X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
29	X<regular expression, case-insensitive>
30
31	Do case-insensitive pattern matching.
32
33	If C<use locale> is in effect, the case map is taken from the current
34	locale. See L<perllocale>.
35
36	=item m
37	X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
38
39	Treat string as multiple lines. That is, change "^" and "$" from matching
40	the start or end of the string to matching the start or end of any
41	line anywhere within the string.
42
43	=item s
44	X</s> X<regex, single-line> X<regexp, single-line>
45	X<regular expression, single-line>
46
47	Treat string as single line. That is, change "." to match any character
48	whatsoever, even a newline, which normally it would not match.
49
50	The C</s> and C</m> modifiers both override the C<$*> setting. That
51	is, no matter what C<$*> contains, C</s> without C</m> will force
52	"^" to match only at the beginning of the string and "$" to match
53	only at the end (or just before a newline at the end) of the string.
54	Together, as /ms, they let the "." match any character whatsoever,
55	while still allowing "^" and "$" to match, respectively, just after
56	and just before newlines within the string.
57
58	=item x
59	X</x>
60
61	Extend your pattern's legibility by permitting whitespace and comments.
62
63	=back
64
65	These are usually written as "the C</x> modifier", even though the delimiter
66	in question might not really be a slash. Any of these
67	modifiers may also be embedded within the regular expression itself using
68	the C<(?...)> construct. See below.
69
70	The C</x> modifier itself needs a little more explanation. It tells
71	the regular expression parser to ignore whitespace that is neither
72	backslashed nor within a character class. You can use this to break up
73	your regular expression into (slightly) more readable parts. The C<#>
74	character is also treated as a metacharacter introducing a comment,
75	just as in ordinary Perl code. This also means that if you want real
76	whitespace or C<#> characters in the pattern (outside a character
77	class, where they are unaffected by C</x>), you'll either have to
78	escape them or encode them using octal or hex escapes. Taken together,
79	these features go a long way towards making Perl's regular expressions
80	more readable. Note that you have to be careful not to include the
81	pattern delimiter in the comment--perl has no way of knowing you did
82	not intend to close the pattern early. See the C-comment deletion code
83	in L<perlop>.
84	X</x>
85
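As a rough illustration of the C</x> rules above, here is a minimal sketch. The "key = value" input format and the variable names are made up for the example; they are not taken from this page.

```perl
use strict;
use warnings;

# Hypothetical pattern for a simple "key = value" line, written with /x
# so that whitespace and comments inside the pattern are ignored.
my $pair = qr/
    ^ \s*
    (\w+)       # key: one or more word characters
    \s* = \s*   # equals sign with optional surrounding whitespace
    (.*?)       # value: as little as possible ...
    \s* $       # ... so trailing whitespace is not captured
/x;

my ($key, $value) = "  colour = dark blue  " =~ $pair;
print "$key => '$value'\n";

# A literal space or '#' under /x must be escaped or put in a character class.
my $tagged = "issue #42" =~ /issue [ ] \# (\d+)/x ? $1 : undef;
print "$tagged\n";
```

Note that none of the comments inside the pattern contain the delimiter, for the reason given above.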
86	=head2 Regular Expressions
87
88	The patterns used in Perl pattern matching derive from those supplied in
89	the Version 8 regex routines. (The routines are derived
90	(distantly) from Henry Spencer's freely redistributable reimplementation
91	of the V8 routines.) See L<Version 8 Regular Expressions> for
92	details.
93
94	In particular the following metacharacters have their standard I<egrep>-ish
95	meanings:
96	X<metacharacter>
97	X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>
98
99
100	    \   Quote the next metacharacter
101	    ^   Match the beginning of the line
102	    .   Match any character (except newline)
103	    $   Match the end of the line (or before newline at the end)
104	    |   Alternation
105	    ()  Grouping
106	    []  Character class
107
108	By default, the "^" character is guaranteed to match only the
109	beginning of the string, the "$" character only the end (or before the
110	newline at the end), and Perl does certain optimizations with the
111	assumption that the string contains only one line. Embedded newlines
112	will not be matched by "^" or "$". You may, however, wish to treat a
113	string as a multi-line buffer, such that the "^" will match after any
114	newline within the string, and "$" will match before any newline. At the
115	cost of a little more overhead, you can do this by using the /m modifier
116	on the pattern match operator. (Older programs did this by setting C<$*>,
117	but this practice is now deprecated.)
118	X<^> X<$> X</m>
119
120	To simplify multi-line substitutions, the "." character never matches a
121	newline unless you use the C</s> modifier, which in effect tells Perl to pretend
122	the string is a single line--even if it isn't. The C</s> modifier also
123	overrides the setting of C<$*>, in case you have some (badly behaved) older
124	code that sets it in another module.
125	X<.> X</s>
126
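The effect of C</m> and C</s> described above can be sketched as follows. The sample string and variable names are only illustrative.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, "^" matches only at the start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, "." stops at a newline; with /s it crosses it.
my ($no_s)   = $text =~ /(first.*)/;
my ($with_s) = $text =~ /(first.*)/s;

print scalar(@plain), " ", scalar(@multi), "\n";   # 1 2
print length($no_s), " ", length($with_s), "\n";   # 10 23
```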
127	The following standard quantifiers are recognized:
128	X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>
129
130	    *      Match 0 or more times
131	    +      Match 1 or more times
132	    ?      Match 1 or 0 times
133	    {n}    Match exactly n times
134	    {n,}   Match at least n times
135	    {n,m}  Match at least n but not more than m times
136
137	(If a curly bracket occurs in any other context, it is treated
138	as a regular character. In particular, the lower bound
139	is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+"
140	quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
141	to integral values less than a preset limit defined when perl is built.
142	This is usually 32766 on the most common platforms. The actual limit can
143	be seen in the error message generated by code such as this:
144
145	    $_ **= $_ , / {$_} / for 2 .. 42;
146
147	By default, a quantified subpattern is "greedy", that is, it will match as
148	many times as possible (given a particular starting location) while still
149	allowing the rest of the pattern to match. If you want it to match the
150	minimum number of times possible, follow the quantifier with a "?". Note
151	that the meanings don't change, just the "greediness":
152	X<metacharacter> X<greedy> X<greedyness>
153	X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>
154
155	    *?      Match 0 or more times
156	    +?      Match 1 or more times
157	    ??      Match 0 or 1 time
158	    {n}?    Match exactly n times
159	    {n,}?   Match at least n times
160	    {n,m}?  Match at least n but not more than m times
161
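A small sketch of the difference between greedy and minimal matching, using made-up sample strings:

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: ".+" runs to the last ">" that still lets the pattern match.
my ($greedy) = $html =~ /<(.+)>/;

# Minimal: ".+?" stops at the first ">" it can.
my ($minimal) = $html =~ /<(.+?)>/;

# The bounds of a range are unchanged; only the preference flips.
my ($most)  = "aaaa" =~ /(a{2,3})/;
my ($least) = "aaaa" =~ /(a{2,3}?)/;

print "$greedy\n";    # b>bold</b> and <i>italic</i
print "$minimal\n";   # b
print "$most\n";      # aaa
print "$least\n";     # aa
```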
162	Because patterns are processed as double quoted strings, the following
163	also work:
164	X<\t> X<\n> X<\r> X<\f> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
165	X<\0> X<\c> X<\N> X<\x>
166
167	    \t          tab                   (HT, TAB)
168	    \n          newline               (LF, NL)
169	    \r          return                (CR)
170	    \f          form feed             (FF)
171	    \a          alarm (bell)          (BEL)
172	    \e          escape (think troff)  (ESC)
173	    \033        octal char (think of a PDP-11)
174	    \x1B        hex char
175	    \x{263a}    wide hex char         (Unicode SMILEY)
176	    \c[         control char
177	    \N{name}    named char
178	    \l          lowercase next char (think vi)
179	    \u          uppercase next char (think vi)
180	    \L          lowercase till \E (think vi)
181	    \U          uppercase till \E (think vi)
182	    \E          end case modification (think vi)
183	    \Q          quote (disable) pattern metacharacters till \E
184
185	If C<use locale> is in effect, the case map used by C<\l>, C<\L>, C<\u>
186	and C<\U> is taken from the current locale. See L<perllocale>. For
187	documentation of C<\N{name}>, see L<charnames>.
188
189	You cannot include a literal C<$> or C<@> within a C<\Q> sequence.
190	An unescaped C<$> or C<@> interpolates the corresponding variable,
191	while escaping will cause the literal string C<\$> to be matched.
192	You'll need to write something like C<m/\Quser\E\@\Qhost/>.
193
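The following sketch shows what C<\Q> buys you when interpolating text that happens to contain metacharacters. The variable C<$literal> and the sample strings are hypothetical.

```perl
use strict;
use warnings;

my $literal = 'a.b*c';   # hypothetical user-supplied text

# Interpolated as is, "." and "*" keep their special meanings.
my $as_pattern = 'aXbbbc' =~ /\A$literal\z/ ? 1 : 0;

# Inside \Q...\E they are quoted and match only themselves.
my $quoted_hit  = 'a.b*c'  =~ /\A\Q$literal\E\z/ ? 1 : 0;
my $quoted_miss = 'aXbbbc' =~ /\A\Q$literal\E\z/ ? 1 : 0;

# A literal "@" has to sit outside the quoted region, as in the text.
my $address = 'user@host' =~ m/\Quser\E\@\Qhost/ ? 1 : 0;

print "$as_pattern $quoted_hit $quoted_miss $address\n";   # 1 1 0 1
```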
194	In addition, Perl defines the following:
195	X<metacharacter>
196	X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
197	X<word> X<whitespace>
198
199	    \w  Match a "word" character (alphanumeric plus "_")
200	    \W  Match a non-"word" character
201	    \s  Match a whitespace character
202	    \S  Match a non-whitespace character
203	    \d  Match a digit character
204	    \D  Match a non-digit character
205	    \pP Match P, named property. Use \p{Prop} for longer names.
206	    \PP Match non-P
207	    \X  Match eXtended Unicode "combining character sequence",
208	        equivalent to (?:\PM\pM*)
209	    \C  Match a single C char (octet) even under Unicode.
210	        NOTE: breaks up characters into their UTF-8 bytes,
211	        so you may end up with malformed pieces of UTF-8.
212	        Unsupported in lookbehind.
213
214	A C<\w> matches a single alphanumeric character (an alphabetic
215	character, or a decimal digit) or C<_>, not a whole word. Use C<\w+>
216	to match a string of Perl-identifier characters (which isn't the same
217	as matching an English word). If C<use locale> is in effect, the list
218	of alphabetic characters generated by C<\w> is taken from the current
219	locale. See L<perllocale>. You may use C<\w>, C<\W>, C<\s>, C<\S>,
220	C<\d>, and C<\D> within character classes, but if you try to use them
221	as endpoints of a range, no range is formed and the "-" is understood
222	literally. If Unicode is in effect, C<\s> also matches "\x{85}",
223	"\x{2028}", and "\x{2029}". See L<perlunicode> for more details about
224	C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in general.
225	You can define your own C<\p> and C<\P> properties; see L<perlunicode>.
226	X<\w> X<\W> X<word>
227
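A quick sketch of C<\w>, C<\s> and C<\d> at work on a made-up ASCII string (no locale or Unicode semantics are assumed here):

```perl
use strict;
use warnings;

my $line = "foo_bar  baz-42";

my @words  = $line =~ /(\w+)/g;    # runs of word characters
my @fields = split /\s+/, $line;   # split on runs of whitespace
my ($num)  = $line =~ /(\d+)/;     # first run of digits

print join("|", @words), "\n";     # foo_bar|baz|42
print join("|", @fields), "\n";    # foo_bar|baz-42
print "$num\n";                    # 42
```

Note how C<\w+> treats "foo_bar" as one unit but splits "baz-42" at the hyphen, which is why it matches identifier-like strings rather than English words.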
228	The POSIX character class syntax
229	X<character class>
230
231	    [:class:]
232
233	is also available. The available classes and their backslash
234	equivalents (if available) are as follows:
235	X<character class>
236	X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
237	X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>
238
239	    alpha
240	    alnum
241	    ascii
242	    blank               [1]
243	    cntrl
244	    digit       \d
245	    graph
246	    lower
247	    print
248	    punct
249	    space       \s      [2]
250	    upper
251	    word        \w      [3]
252	    xdigit
253
254	=over
255
256	=item [1]
257
258	A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace".
259
260	=item [2]
261
262	Not exactly equivalent to C<\s> since C<[[:space:]]> also includes
263	the (very rare) "vertical tabulator", "\ck", chr(11).
264
265	=item [3]
266
267	A Perl extension, see above.
268
269	=back
270
271	For example use C<[:upper:]> to match all the uppercase characters.
272	Note that the C<[]> are part of the C<[::]> construct, not part of the
273	whole character class. For example:
274
275	    [01[:alpha:]%]
276
277	matches zero, one, any alphabetic character, or the percent sign.
278
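To see the class from the example in use, here is a minimal sketch that keeps only the strings made up entirely of its members. The candidate list is invented for the illustration.

```perl
use strict;
use warnings;

my @candidates = ('0a1', '50%', 'abc%', '2b');

# [:alpha:] appears inside an enclosing [...] alongside 0, 1 and %.
my @hits = grep { /\A[01[:alpha:]%]+\z/ } @candidates;

print "@hits\n";   # 0a1 abc%
```

"50%" and "2b" are rejected because 5 and 2 are not members of the class.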
279	The following equivalences to Unicode \p{} constructs and equivalent
280	backslash character classes (if available), will hold:
281	X<character class> X<\p> X<\p{}>
282
283	    [:...:]     \p{...}         backslash
284
285	    alpha       IsAlpha
286	    alnum       IsAlnum
287	    ascii       IsASCII
288	    blank       IsSpace
289	    cntrl       IsCntrl
290	    digit       IsDigit         \d
291	    graph       IsGraph
292	    lower       IsLower
293	    print       IsPrint
294	    punct       IsPunct
295	    space       IsSpace
296	                IsSpacePerl     \s
297	    upper       IsUpper
298	    word        IsWord
299	    xdigit      IsXDigit
300
301	For example C<[:lower:]> and C<\p{IsLower}> are equivalent.
302
303	If the C<utf8> pragma is not used but the C<locale> pragma is, the
304	classes correlate with the usual isalpha(3) interface (except for
305	"word" and "blank").
306
307	The classes whose names may not be obvious are:
308
309	=over 4
310
311	=item cntrl
312	X<cntrl>
313
314	Any control character. Usually characters that don't produce output as
315	such but instead control the terminal somehow: for example newline and
316	backspace are control characters. All characters with ord() less than
317	32 are most often classified as control characters (assuming ASCII,
318	the ISO Latin character sets, and Unicode), as is the character with
319	the ord() value of 127 (C<DEL>).
320
321	=item graph
322	X<graph>
323
324	Any alphanumeric or punctuation (special) character.
325
326	=item print
327	X<print>
328
329	Any alphanumeric or punctuation (special) character or the space character.
330
331	=item punct
332	X<punct>
333
334	Any punctuation (special) character.
335
336	=item xdigit
337	X<xdigit>
338
339	Any hexadecimal digit. Though this may feel silly ([0-9A-Fa-f] would
340	work just fine) it is included for completeness.
341
342	=back
343
344	You can negate the [::] character classes by prefixing the class name
345	with a '^'. This is a Perl extension. For example:
346	X<character class, negation>
347
348	    POSIX         traditional   Unicode
349
350	    [:^digit:]      \D          \P{IsDigit}
351	    [:^space:]      \S          \P{IsSpace}
352	    [:^word:]       \W          \P{IsWord}
353
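As a sketch of the first row of that table, the negated POSIX class and its traditional counterpart should pick out the same pieces of a plain ASCII string. The sample string is made up.

```perl
use strict;
use warnings;

my $s = "room 101, floor 3";

my @posix = $s =~ /([[:^digit:]]+)/g;   # negated POSIX class
my @trad  = $s =~ /(\D+)/g;             # traditional equivalent

print join("|", @posix), "\n";
print join("|", @trad), "\n";
```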
354	Perl respects the POSIX standard in that POSIX character classes are
355	only supported within a character class. The POSIX character classes
356	[.cc.] and [=cc=] are recognized but B<not> supported and trying to
357	use them will cause an error.
358
359	Perl defines the following zero-width assertions:
360	X<zero-width assertion> X<assertion> X<regex, zero-width assertion>
361	X<regexp, zero-width assertion>
362	X<regular expression, zero-width assertion>
363	X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>
364
365	    \b  Match a word boundary
366	    \B  Match a non-(word boundary)
367	    \A  Match only at beginning of string
368	    \Z  Match only at end of string, or before newline at the end
369	    \z  Match only at end of string
370	    \G  Match only at pos() (e.g. at the end-of-match position
371	        of prior m//g)
372
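A few of these assertions in action, as a minimal sketch with invented sample strings. The C<\G> loop is one common way to walk a string token by token; the C</c> flag is used so that a failed match does not reset pos().

```perl
use strict;
use warnings;

my $s = "cat\n";
my $before_newline = $s =~ /cat\Z/ ? 1 : 0;   # \Z tolerates a final newline
my $very_end       = $s =~ /cat\z/ ? 1 : 0;   # \z does not

# \b keeps "cat" from matching inside "concat" or "category".
my @whole = "concat cat category" =~ /\bcat\b/g;

# \G anchors each attempt where the previous m//g match left off.
my $input = "12ab34";
my @tokens;
while ($input =~ /\G(\d+|[a-z]+)/gc) {
    push @tokens, $1;
}

print "$before_newline $very_end ", scalar(@whole), " @tokens\n";   # 1 0 1 12 ab 34
```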
373	A word boundary (C<\b>) is a spot between two characters
374	that has a C<\w> on one side of it and a C<\W> on the other side
375	of it (in either order), counting the imaginary characters off the
376	beginning and end of the string as matching a C<\W>. (Within
377	character classes C<\b> represents backspace rather than a word