source: trunk/essentials/dev-lang/perl/pod/perlre.pod@ 3439

Last change on this file since 3439 was 3181, checked in by bird, 19 years ago

perl 5.8.8

1=head1 NAME
2X<regular expression> X<regex> X<regexp>
3
4perlre - Perl regular expressions
5
6=head1 DESCRIPTION
7
8This page describes the syntax of regular expressions in Perl.
9
10If you haven't used regular expressions before, a quick-start
11introduction is available in L<perlrequick>, and a longer tutorial
12introduction is available in L<perlretut>.
13
14For reference on how regular expressions are used in matching
15operations, plus various examples of the same, see discussions of
16C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
17Operators">.
18
19Matching operations can have various modifiers. Modifiers
20that relate to the interpretation of the regular expression inside
21are listed below. Modifiers that alter the way a regular expression
22is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
23L<perlop/"Gory details of parsing quoted constructs">.
24
25=over 4
26
27=item i
28X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
29X<regular expression, case-insensitive>
30
31Do case-insensitive pattern matching.
32
33If C<use locale> is in effect, the case map is taken from the current
34locale. See L<perllocale>.
35
36=item m
37X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
38
39Treat string as multiple lines. That is, change "^" and "$" from matching
40the start or end of the string to matching the start or end of any
41line anywhere within the string.
42
43=item s
44X</s> X<regex, single-line> X<regexp, single-line>
45X<regular expression, single-line>
46
47Treat string as single line. That is, change "." to match any character
48whatsoever, even a newline, which normally it would not match.
49
50The C</s> and C</m> modifiers both override the C<$*> setting. That
51is, no matter what C<$*> contains, C</s> without C</m> will force
52"^" to match only at the beginning of the string and "$" to match
53only at the end (or just before a newline at the end) of the string.
54Together, as /ms, they let the "." match any character whatsoever,
55while still allowing "^" and "$" to match, respectively, just after
56and just before newlines within the string.
57
58=item x
59X</x>
60
61Extend your pattern's legibility by permitting whitespace and comments.
62
63=back
64
65These are usually written as "the C</x> modifier", even though the delimiter
66in question might not really be a slash. Any of these
67modifiers may also be embedded within the regular expression itself using
68the C<(?...)> construct. See below.
69
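As a minimal sketch of the trailing and embedded modifier forms described above, the following compares C</i> with C<(?i)>. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "Hello World";   # hypothetical sample input

# /i makes the match case-insensitive.
my $with_flag = $text =~ /hello/i ? 1 : 0;

# The same modifier embedded in the pattern with (?i).
my $embedded = $text =~ /(?i)hello/ ? 1 : 0;

# Without either form the match is case-sensitive and fails.
my $plain = $text =~ /hello/ ? 1 : 0;

print "flag=$with_flag embedded=$embedded plain=$plain\n";
# prints "flag=1 embedded=1 plain=0"
```
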
70The C</x> modifier itself needs a little more explanation. It tells
71the regular expression parser to ignore whitespace that is neither
72backslashed nor within a character class. You can use this to break up
73your regular expression into (slightly) more readable parts. The C<#>
74character is also treated as a metacharacter introducing a comment,
75just as in ordinary Perl code. This also means that if you want real
76whitespace or C<#> characters in the pattern (outside a character
77class, where they are unaffected by C</x>), you'll have to either
78escape them or encode them using octal or hex escapes. Taken together,
79these features go a long way towards making Perl's regular expressions
80more readable. Note that you have to be careful not to include the
81pattern delimiter in the comment--perl has no way of knowing you did
82not intend to close the pattern early. See the C-comment deletion code
83in L<perlop>.
84X</x>
85
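The effect of C</x> may be easier to see in a small example. This sketch assumes a hypothetical date format and sample strings; note that none of the comments inside the pattern contain the delimiter.

```perl
use strict;
use warnings;

# Hypothetical pattern for a date such as "2006-02-28".
my $date_re = qr/
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
/x;

my @parts = "released 2006-02-28" =~ $date_re;
print "@parts\n";                      # prints "2006 02 28"

# Under /x a literal space must be escaped (or placed in a class).
my $escaped   = "a b" =~ /a\ b/x ? 1 : 0;
my $unescaped = "a b" =~ /a b/x  ? 1 : 0;   # pattern is effectively "ab"
print "$escaped $unescaped\n";         # prints "1 0"
```
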
86=head2 Regular Expressions
87
88The patterns used in Perl pattern matching derive from those supplied in
89the Version 8 regex routines. (The routines are derived
90(distantly) from Henry Spencer's freely redistributable reimplementation
91of the V8 routines.) See L<Version 8 Regular Expressions> for
92details.
93
94In particular the following metacharacters have their standard I<egrep>-ish
95meanings:
96X<metacharacter>
97X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>
98
99
100    \    Quote the next metacharacter
101    ^    Match the beginning of the line
102    .    Match any character (except newline)
103    $    Match the end of the line (or before newline at the end)
104    |    Alternation
105    ()   Grouping
106    []   Character class
107
108By default, the "^" character is guaranteed to match only the
109beginning of the string, the "$" character only the end (or before the
110newline at the end), and Perl does certain optimizations with the
111assumption that the string contains only one line. Embedded newlines
112will not be matched by "^" or "$". You may, however, wish to treat a
113string as a multi-line buffer, such that the "^" will match after any
114newline within the string, and "$" will match before any newline. At the
115cost of a little more overhead, you can do this by using the /m modifier
116on the pattern match operator. (Older programs did this by setting C<$*>,
117but this practice is now deprecated.)
118X<^> X<$> X</m>
119
120To simplify multi-line substitutions, the "." character never matches a
121newline unless you use the C</s> modifier, which in effect tells Perl to pretend
122the string is a single line--even if it isn't. The C</s> modifier also
123overrides the setting of C<$*>, in case you have some (badly behaved) older
124code that sets it in another module.
125X<.> X</s>
126
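The two paragraphs above can be checked with a short sketch. The buffer contents and variable names are illustrative only.

```perl
use strict;
use warnings;

my $buf = "first\nsecond\n";   # hypothetical two-line buffer

# Default: "^" and "$" anchor to the whole string, so nothing matches.
my @default = $buf =~ /^(\w+)$/g;

# /m: "^" and "$" also match after and before embedded newlines.
my @multi = $buf =~ /^(\w+)$/mg;

# "." skips newlines unless /s is given.
my $dot_plain = "a\nb" =~ /a.b/  ? 1 : 0;
my $dot_s     = "a\nb" =~ /a.b/s ? 1 : 0;

print scalar(@default), " [@multi] $dot_plain $dot_s\n";
# prints "0 [first second] 0 1"
```
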
127The following standard quantifiers are recognized:
128X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>
129
130    *      Match 0 or more times
131    +      Match 1 or more times
132    ?      Match 1 or 0 times
133    {n}    Match exactly n times
134    {n,}   Match at least n times
135    {n,m}  Match at least n but not more than m times
136
137(If a curly bracket occurs in any other context, it is treated
138as a regular character. In particular, the lower bound
139is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+"
140quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
141to integral values less than a preset limit defined when perl is built.
142This is usually 32766 on the most common platforms. The actual limit can
143be seen in the error message generated by code such as this:
144
145 $_ **= $_ , / {$_} / for 2 .. 42;
146
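As a rough illustration of the quantifier table and the stated equivalences, the sketch below tries a bounded quantifier against a few made-up strings and compares "*" with C<{0,}>.

```perl
use strict;
use warnings;

# {2,3}: at least two but not more than three "a" characters.
my @candidates = ("ab", "aab", "aaab", "aaaab");   # hypothetical inputs
my @matched    = grep { /^a{2,3}b$/ } @candidates;

# "*" behaves like {0,}; "+" requires at least one occurrence.
my $star       = "b" =~ /^a*b$/    ? 1 : 0;
my $brace_star = "b" =~ /^a{0,}b$/ ? 1 : 0;
my $plus       = "b" =~ /^a+b$/    ? 1 : 0;

print "@matched | $star $brace_star $plus\n";
# prints "aab aaab | 1 1 0"
```
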
147By default, a quantified subpattern is "greedy", that is, it will match as
148many times as possible (given a particular starting location) while still
149allowing the rest of the pattern to match. If you want it to match the
150minimum number of times possible, follow the quantifier with a "?". Note
151that the meanings don't change, just the "greediness":
152X<metacharacter> X<greedy> X<greedyness>
153X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>
154
155    *?      Match 0 or more times
156    +?      Match 1 or more times
157    ??      Match 0 or 1 time
158    {n}?    Match exactly n times
159    {n,}?   Match at least n times
160    {n,m}?  Match at least n but not more than m times
161
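A small sketch of the difference in "greediness", using a made-up string. Both patterns match; only the amount of text captured differs.

```perl
use strict;
use warnings;

my $markup = "<b>bold</b>";   # hypothetical sample input

# Greedy: .+ grabs as much as it can while still letting ">" match.
my ($greedy) = $markup =~ /<(.+)>/;

# Minimal: .+? stops at the first position where ">" can match.
my ($minimal) = $markup =~ /<(.+?)>/;

print "greedy=$greedy minimal=$minimal\n";
# prints "greedy=b>bold</b minimal=b"
```
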
162Because patterns are processed as double quoted strings, the following
163also work:
164X<\t> X<\n> X<\r> X<\f> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
165X<\0> X<\c> X<\N> X<\x>
166
167    \t          tab                   (HT, TAB)
168    \n          newline               (LF, NL)
169    \r          return                (CR)
170    \f          form feed             (FF)
171    \a          alarm (bell)          (BEL)
172    \e          escape (think troff)  (ESC)
173    \033        octal char            (think of a PDP-11)
174    \x1B        hex char
175    \x{263a}    wide hex char         (Unicode SMILEY)
176    \c[         control char
177    \N{name}    named char
178    \l          lowercase next char   (think vi)
179    \u          uppercase next char   (think vi)
180    \L          lowercase till \E     (think vi)
181    \U          uppercase till \E     (think vi)
182    \E          end case modification (think vi)
183    \Q          quote (disable) pattern metacharacters till \E
184
185If C<use locale> is in effect, the case map used by C<\l>, C<\L>, C<\u>
186and C<\U> is taken from the current locale. See L<perllocale>. For
187documentation of C<\N{name}>, see L<charnames>.
188
189You cannot include a literal C<$> or C<@> within a C<\Q> sequence.
190An unescaped C<$> or C<@> interpolates the corresponding variable,
191while escaping will cause the literal string C<\$> to be matched.
192You'll need to write something like C<m/\Quser\E\@\Qhost/>.
193
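The following sketch shows C<\Q> protecting an interpolated string, the case-modification escapes, and the C<user@host> pattern quoted above. All strings and variable names are illustrative.

```perl
use strict;
use warnings;

my $literal = "a.b*c";   # hypothetical text containing metacharacters

# Unquoted, "." and "*" keep their special meaning.
my $loose = "aXbbbc" =~ /$literal/ ? 1 : 0;

# With \Q...\E the interpolated text is matched literally.
my $strict_miss = "aXbbbc"  =~ /\Q$literal\E/ ? 1 : 0;
my $strict_hit  = "xa.b*cx" =~ /\Q$literal\E/ ? 1 : 0;

# \u and \L work as in any double-quoted string.
my $name = "\u\LpERL\E";

# A literal "@" has to sit outside the \Q sequence.
my $mail = 'user@host' =~ m/\Quser\E\@\Qhost/ ? 1 : 0;

print "$loose $strict_miss $strict_hit $name $mail\n";
# prints "1 0 1 Perl 1"
```
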
194In addition, Perl defines the following:
195X<metacharacter>
196X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
197X<word> X<whitespace>
198
199    \w   Match a "word" character (alphanumeric plus "_")
200    \W   Match a non-"word" character
201    \s   Match a whitespace character
202    \S   Match a non-whitespace character
203    \d   Match a digit character
204    \D   Match a non-digit character
205    \pP  Match P, named property. Use \p{Prop} for longer names.
206    \PP  Match non-P
207    \X   Match eXtended Unicode "combining character sequence",
208         equivalent to (?:\PM\pM*)
209    \C   Match a single C char (octet) even under Unicode.
210         NOTE: breaks up characters into their UTF-8 bytes,
211         so you may end up with malformed pieces of UTF-8.
212         Unsupported in lookbehind.
213
214A C<\w> matches a single alphanumeric character (an alphabetic
215character, or a decimal digit) or C<_>, not a whole word. Use C<\w+>
216to match a string of Perl-identifier characters (which isn't the same
217as matching an English word). If C<use locale> is in effect, the list
218of alphabetic characters generated by C<\w> is taken from the current
219locale. See L<perllocale>. You may use C<\w>, C<\W>, C<\s>, C<\S>,
220C<\d>, and C<\D> within character classes, but if you try to use them
221as endpoints of a range, no range is formed and the "-" is understood
222literally. If Unicode is in effect, C<\s> also matches "\x{85}",
223"\x{2028}", and "\x{2029}". See L<perlunicode> for more details about
224C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in general.
225You can define your own C<\p> and C<\P> properties; see L<perlunicode>.
226X<\w> X<\W> X<word>
227
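A short sketch of C<\w>, C<\d> and C<\s> on a made-up line of text. The last test shows a C<\w> used as a range endpoint, where the "-" is taken literally; the C<regexp> warning category is switched off there on the assumption that Perl would otherwise warn about the false range.

```perl
use strict;
use warnings;

my $line = "id_42 = 3.14";   # hypothetical sample input

# \w+ matches runs of identifier characters, not English words.
my @words  = $line =~ /(\w+)/g;
my @digits = $line =~ /(\d+)/g;
my @fields = split /\s+/, $line;

# Inside a class, \w cannot be a range endpoint: "-" is literal here.
my $dash;
{
    no warnings 'regexp';
    $dash = "-" =~ /^[\w-.]$/ ? 1 : 0;
}

print "@words | @digits | @fields | $dash\n";
# prints "id_42 3 14 | 42 3 14 | id_42 = 3.14 | 1"
```
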
228The POSIX character class syntax
229X<character class>
230
231 [:class:]
232
233is also available. The available classes and their backslash
234equivalents (if available) are as follows:
235X<character class>
236X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
237X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>
238
239    alpha
240    alnum
241    ascii
242    blank           [1]
243    cntrl
244    digit     \d
245    graph
246    lower
247    print
248    punct
249    space     \s    [2]
250    upper
251    word      \w    [3]
252    xdigit
253
254=over
255
256=item [1]
257
258A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace".
259
260=item [2]
261
262Not exactly equivalent to C<\s> since C<[[:space:]]> also includes
263the (very rare) "vertical tabulator", "\ck", chr(11).
264
265=item [3]
266
267A Perl extension, see above.
268
269=back
270
271For example use C<[:upper:]> to match all the uppercase characters.
272Note that the C<[]> are part of the C<[::]> construct, not part of the
273whole character class. For example:
274
275 [01[:alpha:]%]
276
277matches zero, one, any alphabetic character, and the percentage sign.
278
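To make the bracket nesting concrete, here is a sketch that applies the class shown above to a handful of made-up one-character strings, and counts uppercase letters with C<[[:upper:]]>.

```perl
use strict;
use warnings;

my @samples = ("0", "1", "x", "%", "2", "#");   # hypothetical inputs

# The outer [] is the character class; [:alpha:] sits inside it.
my @accepted = grep { /^[01[:alpha:]%]$/ } @samples;

# Count the uppercase letters in a string.
my $upper_count = () = "Hello World" =~ /[[:upper:]]/g;

print "accepted: @accepted\n";      # prints "accepted: 0 1 x %"
print "uppercase: $upper_count\n";  # prints "uppercase: 2"
```
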
279The following equivalences to Unicode \p{} constructs and equivalent
280backslash character classes (if available), will hold:
281X<character class> X<\p> X<\p{}>
282
283    [:...:]   \p{...}        backslash
284
285    alpha     IsAlpha
286    alnum     IsAlnum
287    ascii     IsASCII
288    blank     IsSpace
289    cntrl     IsCntrl
290    digit     IsDigit        \d
291    graph     IsGraph
292    lower     IsLower
293    print     IsPrint
294    punct     IsPunct
295    space     IsSpace
296              IsSpacePerl    \s
297    upper     IsUpper
298    word      IsWord
299    xdigit    IsXDigit
300
301For example C<[:lower:]> and C<\p{IsLower}> are equivalent.
302
303If the C<utf8> pragma is not used but the C<locale> pragma is, the
304classes correlate with the usual isalpha(3) interface (except for
305"word" and "blank").
306
307The classes whose names may not be self-explanatory are:
308
309=over 4
310
311=item cntrl
312X<cntrl>
313
314Any control character. Usually characters that don't produce output as
315such but instead control the terminal somehow: for example newline and
316backspace are control characters. All characters with ord() less than
31732 are most often classified as control characters (assuming ASCII,
318the ISO Latin character sets, and Unicode), as is the character with
319the ord() value of 127 (C<DEL>).
320
321=item graph
322X<graph>
323
324Any alphanumeric or punctuation (special) character.
325
326=item print
327X<print>
328
329Any alphanumeric or punctuation (special) character or the space character.
330
331=item punct
332X<punct>
333
334Any punctuation (special) character.
335
336=item xdigit
337X<xdigit>
338
339Any hexadecimal digit. Though this may feel silly ([0-9A-Fa-f] would
340work just fine) it is included for completeness.
341
342=back
343
344You can negate the [::] character classes by prefixing the class name
345with a '^'. This is a Perl extension. For example:
346X<character class, negation>
347
348    POSIX         traditional   Unicode
349
350    [:^digit:]    \D            \P{IsDigit}
351    [:^space:]    \S            \P{IsSpace}
352    [:^word:]     \W            \P{IsWord}
353
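The three spellings in the first row of the table can be compared directly. This sketch strips every non-digit from a made-up string using each form in turn.

```perl
use strict;
use warnings;

my $input = "a1 b2";   # hypothetical sample input

# Each substitution deletes every character that is not a digit.
(my $posix   = $input) =~ s/[[:^digit:]]//g;
(my $classic = $input) =~ s/\D//g;
(my $unicode = $input) =~ s/\P{IsDigit}//g;

print "$posix $classic $unicode\n";   # prints "12 12 12"
```
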
354Perl respects the POSIX standard in that POSIX character classes are
355only supported within a character class. The POSIX character classes
356[.cc.] and [=cc=] are recognized but B<not> supported and trying to
357use them will cause an error.
358
359Perl defines the following zero-width assertions:
360X<zero-width assertion> X<assertion> X<regex, zero-width assertion>
361X<regexp, zero-width assertion>
362X<regular expression, zero-width assertion>
363X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>
364
365    \b   Match a word boundary
366    \B   Match a non-(word boundary)
367    \A   Match only at beginning of string
368    \Z   Match only at end of string, or before newline at the end
369    \z   Match only at end of string
370    \G   Match only at pos() (e.g. at the end-of-match position
371         of prior m//g)
372
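A brief sketch of the assertions in the table above. The strings and variable names are invented for the example.

```perl
use strict;
use warnings;

my $text = "cat concat cat\n";   # hypothetical sample input

# \b keeps "cat" from matching inside "concat".
my $whole_words = () = $text =~ /\bcat\b/g;

# \Z tolerates one trailing newline; \z does not.
my $big_z   = $text =~ /cat\Z/ ? 1 : 0;
my $small_z = $text =~ /cat\z/ ? 1 : 0;

# \A still means "start of string" under /m, unlike "^".
my $caret_m = "x\ncat" =~ /^cat/m  ? 1 : 0;
my $big_a   = "x\ncat" =~ /\Acat/m ? 1 : 0;

# \G resumes where the previous m//g on the same string stopped.
my $input = "aaabbb";
$input =~ /a+/g;                              # leaves pos($input) at 3
my $rest = $input =~ /\G(b+)/gc ? $1 : '';

print "$whole_words $big_z $small_z $caret_m $big_a $rest\n";
# prints "2 1 0 1 0 bbb"
```
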
373A word boundary (C<\b>) is a spot between two characters
374that has a C<\w> on one side of it and a C<\W> on the other side
375of it (in either order), counting the imaginary characters off the
376beginning and end of the string as matching a C<\W>. (Within
377character classes C<\b> represents backspace rather than a word