source: vendor/perl/5.8.8/pod/perlre.pod@ 3298

Last change on this file since 3298 was 3181, checked in by bird, 19 years ago

perl 5.8.8

=head1 NAME
X<regular expression> X<regex> X<regexp>

perlre - Perl regular expressions

=head1 DESCRIPTION

This page describes the syntax of regular expressions in Perl.

If you haven't used regular expressions before, a quick-start
introduction is available in L<perlrequick>, and a longer tutorial
introduction is available in L<perlretut>.

For reference on how regular expressions are used in matching
operations, plus various examples of the same, see discussions of
C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
Operators">.

Matching operations can have various modifiers. Modifiers
that relate to the interpretation of the regular expression inside
are listed below. Modifiers that alter the way a regular expression
is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">.

=over 4

=item i
X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
X<regular expression, case-insensitive>

Do case-insensitive pattern matching.

If C<use locale> is in effect, the case map is taken from the current
locale. See L<perllocale>.

=item m
X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>

Treat string as multiple lines. That is, change "^" and "$" from matching
the start or end of the string to matching the start or end of any
line anywhere within the string.

=item s
X</s> X<regex, single-line> X<regexp, single-line>
X<regular expression, single-line>

Treat string as single line. That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.

The C</s> and C</m> modifiers both override the C<$*> setting. That
is, no matter what C<$*> contains, C</s> without C</m> will force
"^" to match only at the beginning of the string and "$" to match
only at the end (or just before a newline at the end) of the string.
Together, as C</ms>, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.

=item x
X</x>

Extend your pattern's legibility by permitting whitespace and comments.

=back

These are usually written as "the C</x> modifier", even though the delimiter
in question might not really be a slash. Any of these
modifiers may also be embedded within the regular expression itself using
the C<(?...)> construct. See below.

The C</x> modifier itself needs a little more explanation. It tells
the regular expression parser to ignore whitespace that is neither
backslashed nor within a character class. You can use this to break up
your regular expression into (slightly) more readable parts. The C<#>
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code. This also means that if you want real
whitespace or C<#> characters in the pattern (outside a character
class, where they are unaffected by C</x>), you'll either have to
escape them or encode them using octal or hex escapes. Taken together,
these features go a long way towards making Perl's regular expressions
more readable. Note that you have to be careful not to include the
pattern delimiter in the comment--perl has no way of knowing you did
not intend to close the pattern early. See the C-comment deletion code
in L<perlop>.
X</x>
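
As a minimal sketch of the C</x> rules just described, the following
lays out a pattern over several lines with comments. The "key = value"
input format and the variable names are hypothetical and chosen only for
illustration. Braces are used as the delimiter, so the comments must not
contain a closing brace.

```perl
use strict;
use warnings;

# Hypothetical pattern: a simple "key = value" line, written with /x.
my $pair = qr{
    ^ \s*
    (\w+)        # key: one or more word characters
    \s* = \s*
    (.*?)        # value: as little as possible ...
    \s* $        # ... so trailing whitespace is not captured
}x;

my ($key, $value) = "  colour = dark red  " =~ $pair;
print "$key => '$value'\n";

# Under /x a literal space or '#' must be escaped or put in a character class.
my $tag = qr/ issue [ ] \# (\d+) /x;
my ($num) = "see issue #42" =~ $tag;
print "issue number: $num\n";
```

The space inside C<[ ]> and the backslashed C<#> survive because
C</x> leaves character classes and escaped characters alone.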

=head2 Regular Expressions

The patterns used in Perl pattern matching derive from those supplied in
the Version 8 regex routines. (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.) See L<Version 8 Regular Expressions> for
details.

In particular the following metacharacters have their standard I<egrep>-ish
meanings:
X<metacharacter>
X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>

    \   Quote the next metacharacter
    ^   Match the beginning of the line
    .   Match any character (except newline)
    $   Match the end of the line (or before newline at the end)
    |   Alternation
    ()  Grouping
    []  Character class

By default, the "^" character is guaranteed to match only the
beginning of the string, the "$" character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line. Embedded newlines
will not be matched by "^" or "$". You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will match after any
newline within the string, and "$" will match before any newline. At the
cost of a little more overhead, you can do this by using the C</m> modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
but this practice is now deprecated.)
X<^> X<$> X</m>

To simplify multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to pretend
the string is a single line--even if it isn't. The C</s> modifier also
overrides the setting of C<$*>, in case you have some (badly behaved) older
code that sets it in another module.
X<.> X</s>
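
The effect of C</m> on "^" and of C</s> on "." can be sketched with a
small three-line buffer. The string and variable names below are made up
for the illustration.

```perl
use strict;
use warnings;

my $text = "alpha\nbeta\ngamma\n";

# Without /m, ^ matches only at the very start of the string.
my @first = $text =~ /^(\w+)/g;      # ("alpha")
# With /m, ^ also matches just after each embedded newline.
my @all   = $text =~ /^(\w+)/mg;     # ("alpha", "beta", "gamma")

# Without /s, "." stops at a newline; with /s it crosses line boundaries.
my ($one_line) = $text =~ /^(.*)/;   # "alpha"
my ($whole)    = $text =~ /^(.*)/s;  # the entire string, newlines included

print scalar(@first), " vs ", scalar(@all), " line starts\n";
print length($one_line), " vs ", length($whole), " characters\n";
```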

The following standard quantifiers are recognized:
X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>

    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times

(If a curly bracket occurs in any other context, it is treated
as a regular character. In particular, the lower bound
is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
be seen in the error message generated by code such as this:

    $_ **= $_ , / {$_} / for 2 .. 42;

By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match. If you want it to match the
minimum number of times possible, follow the quantifier with a "?". Note
that the meanings don't change, just the "greediness":
X<metacharacter> X<greedy> X<greedyness>
X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>

    *?     Match 0 or more times
    +?     Match 1 or more times
    ??     Match 0 or 1 time
    {n}?   Match exactly n times
    {n,}?  Match at least n times
    {n,m}? Match at least n but not more than m times
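
The difference between a greedy quantifier and its minimal counterpart
is easiest to see side by side. This is only an illustrative sketch; the
sample strings are invented.

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: .+ runs to the last ">" that still lets the pattern match.
my ($greedy)  = $html =~ /<(.+)>/;     # "b>bold</b> and <i>italic</i"
# Minimal: .+? stops at the first ">" it can.
my ($minimal) = $html =~ /<(.+?)>/;    # "b"

# The same applies to counted quantifiers.
my ($most)  = "aaaaa" =~ /(a{2,4})/;   # "aaaa"
my ($least) = "aaaaa" =~ /(a{2,4}?)/;  # "aa"

print "$greedy\n$minimal\n$most\n$least\n";
```

In both forms the set of strings the quantifier may match is the same;
only the order in which the candidates are tried changes.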

Because patterns are processed as double quoted strings, the following
also work:
X<\t> X<\n> X<\r> X<\f> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
X<\0> X<\c> X<\N> X<\x>

    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \033        octal char            (think of a PDP-11)
    \x1B        hex char
    \x{263a}    wide hex char         (Unicode SMILEY)
    \c[         control char
    \N{name}    named char
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \E          end case modification (think vi)
    \Q          quote (disable) pattern metacharacters till \E

If C<use locale> is in effect, the case map used by C<\l>, C<\L>, C<\u>
and C<\U> is taken from the current locale. See L<perllocale>. For
documentation of C<\N{name}>, see L<charnames>.

You cannot include a literal C<$> or C<@> within a C<\Q> sequence.
An unescaped C<$> or C<@> interpolates the corresponding variable,
while escaping will cause the literal string C<\$> to be matched.
You'll need to write something like C<m/\Quser\E\@\Qhost/>.
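
A short sketch of that workaround follows, using a made-up address. The
"@" is kept outside the C<\Q...\E> spans and escaped on its own, while
the dots inside the spans are quoted automatically.

```perl
use strict;
use warnings;

my $address   = 'j.doe@example.com';   # hypothetical sample data
my $lookalike = 'jxdoe@example.com';

# The "@" sits between two \Q...\E spans and is backslashed separately.
my $exact = $address   =~ m/^\Qj.doe\E\@\Qexample.com\E$/ ? 1 : 0;

# "." inside \Q...\E is literal, so another character in its place fails.
my $loose = $lookalike =~ m/^\Qj.doe\E\@\Qexample.com\E$/ ? 1 : 0;

print "exact address matched: $exact\n";
print "lookalike matched:     $loose\n";
```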

In addition, Perl defines the following:
X<metacharacter>
X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
X<word> X<whitespace>

    \w  Match a "word" character (alphanumeric plus "_")
    \W  Match a non-"word" character
    \s  Match a whitespace character
    \S  Match a non-whitespace character
    \d  Match a digit character
    \D  Match a non-digit character
    \pP Match P, named property. Use \p{Prop} for longer names.
    \PP Match non-P
    \X  Match eXtended Unicode "combining character sequence",
        equivalent to (?:\PM\pM*)
    \C  Match a single C char (octet) even under Unicode.
        NOTE: breaks up characters into their UTF-8 bytes,
        so you may end up with malformed pieces of UTF-8.
        Unsupported in lookbehind.

A C<\w> matches a single alphanumeric character (an alphabetic
character, or a decimal digit) or C<_>, not a whole word. Use C<\w+>
to match a string of Perl-identifier characters (which isn't the same
as matching an English word). If C<use locale> is in effect, the list
of alphabetic characters generated by C<\w> is taken from the current
locale. See L<perllocale>. You may use C<\w>, C<\W>, C<\s>, C<\S>,
C<\d>, and C<\D> within character classes, but if you try to use them
as endpoints of a range, the result is not a range: the "-" is understood
literally. If Unicode is in effect, C<\s> also matches "\x{85}",
"\x{2028}", and "\x{2029}". See L<perlunicode> for more details about
C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in general.
You can define your own C<\p> and C<\P> properties; see L<perlunicode>.
X<\w> X<\W> X<word>

The POSIX character class syntax
X<character class>

    [:class:]

is also available. The available classes and their backslash
equivalents (if available) are as follows:
X<character class>
X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>

    alpha
    alnum
    ascii
    blank               [1]
    cntrl
    digit       \d
    graph
    lower
    print
    punct
    space       \s      [2]
    upper
    word        \w      [3]
    xdigit

=over

=item [1]

A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace".

=item [2]

Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
also the (very rare) "vertical tabulator", "\ck", chr(11).

=item [3]

A Perl extension, see above.

=back

For example, use C<[:upper:]> to match all the uppercase characters.
Note that the C<[]> are part of the C<[::]> construct, not part of the
whole character class. For example:

    [01[:alpha:]%]

matches zero, one, any alphabetic character, and the percent sign.
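
The class above can be tried directly. The following sketch anchors it
and repeats it so that a whole string must consist of those characters;
the test strings are arbitrary.

```perl
use strict;
use warnings;

# The POSIX class [:alpha:] appears inside an ordinary bracketed class.
my $class = qr/^[01[:alpha:]%]+$/;

for my $s ('a0%', '10%Zq', '2', 'a b') {
    printf "%-6s %s\n", $s, $s =~ $class ? 'matches' : 'does not match';
}
```

The first two strings match; C<2> and the string containing a space do
not, because neither character is in the class.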

The following equivalences to Unicode \p{} constructs and equivalent
backslash character classes (if available) will hold:
X<character class> X<\p> X<\p{}>

    [:...:]     \p{...}         backslash

    alpha       IsAlpha
    alnum       IsAlnum
    ascii       IsASCII
    blank       IsSpace
    cntrl       IsCntrl
    digit       IsDigit         \d
    graph       IsGraph
    lower       IsLower
    print       IsPrint
    punct       IsPunct
    space       IsSpace
                IsSpacePerl     \s
    upper       IsUpper
    word        IsWord
    xdigit      IsXDigit

For example, C<[:lower:]> and C<\p{IsLower}> are equivalent.

If the C<utf8> pragma is not used but the C<locale> pragma is, the
classes correlate with the usual isalpha(3) interface (except for
"word" and "blank").

The classes whose names may not be obvious are:

=over 4

=item cntrl
X<cntrl>

Any control character. Usually characters that don't produce output as
such but instead control the terminal somehow: for example newline and
backspace are control characters. All characters with ord() less than
32 are most often classified as control characters (assuming ASCII,
the ISO Latin character sets, and Unicode), as is the character with
the ord() value of 127 (C<DEL>).

=item graph
X<graph>

Any alphanumeric or punctuation (special) character.

=item print
X<print>

Any alphanumeric or punctuation (special) character or the space character.

=item punct
X<punct>

Any punctuation (special) character.

=item xdigit
X<xdigit>

Any hexadecimal digit. Though this may feel silly ([0-9A-Fa-f] would
work just fine) it is included for completeness.

=back

You can negate the [::] character classes by prefixing the class name
with a '^'. This is a Perl extension. For example:
X<character class, negation>

    POSIX         traditional  Unicode

    [:^digit:]      \D         \P{IsDigit}
    [:^space:]      \S         \P{IsSpace}
    [:^word:]       \W         \P{IsWord}

Perl respects the POSIX standard in that POSIX character classes are
only supported within a character class. The POSIX character classes
[.cc.] and [=cc=] are recognized but B<not> supported and trying to
use them will cause an error.
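
As a small sketch of the negated form, the two substitutions below
collapse every run of non-digits to a single space, once with the POSIX
spelling and once with the traditional one. The input string and
variable names are invented for the example.

```perl
use strict;
use warnings;

my $s = "room 101, floor 3";

# The negated POSIX class must still sit inside a bracketed class.
(my $posix = $s) =~ s/[[:^digit:]]+/ /g;
# The traditional backslash form does the same job on this input.
(my $trad  = $s) =~ s/\D+/ /g;

print "POSIX:       '$posix'\n";
print "traditional: '$trad'\n";
```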

Perl defines the following zero-width assertions:
X<zero-width assertion> X<assertion> X<regex, zero-width assertion>
X<regexp, zero-width assertion>
X<regular expression, zero-width assertion>
X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>

    \b  Match a word boundary
    \B  Match a non-(word boundary)
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)

A word boundary (C<\b>) is a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a C<\W>. (Within
character classes C<\b> represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.)
The C<\A> and C<\Z> are just like "^" and "$", except that they
won't match multiple times when the C</m> modifier is used, while
"^" and "$" will match at every internal line boundary. To match
the actual end of the string and not ignore an optional trailing
newline, use C<\z>.
X<\b> X<\A> X<\Z> X<\z> X</m>
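
The following sketch exercises C<\b>, C<\Z> and C<\z> on two invented
strings. The C<= () => idiom is used only to count matches.

```perl
use strict;
use warnings;

my $line = "cat concatenate cat\n";

# \b keeps "cat" from matching inside "concatenate".
my $whole_words = () = $line =~ /\bcat\b/g;   # 2
my $any         = () = $line =~ /cat/g;       # 3

# \Z tolerates one trailing newline; \z insists on the true end.
my $big_z   = $line =~ /cat\Z/ ? 1 : 0;       # 1
my $small_z = $line =~ /cat\z/ ? 1 : 0;       # 0

# Under /m, "$" matches before every newline, but \Z only at the end.
my $buf    = "one\ntwo\n";
my @dollar = $buf =~ /(\w+)$/mg;              # ("one", "two")
my @at_end = $buf =~ /(\w+)\Z/mg;             # ("two")

print "$whole_words of $any occurrences are whole words\n";
print "\\Z: $big_z, \\z: $small_z\n";
print "\$: @dollar; \\Z: @at_end\n";
```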

The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against consecutive substrings
of your string; see the previous reference. The actual location
where C<\G> will match can also be influenced by using C<pos()> as
an lvalue: see L<perlfunc/pos>. Currently C<\G> is only fully
supported when anchored to the start of the pattern; while it
is permitted to use it elsewhere, as in C</(?<=\G..)./g>, some
such uses (C</.\G/g>, for example) currently cause problems, and
it is recommended that you avoid such usage for now.
X<\G>
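
A C<lex>-like scanner of the kind mentioned above might be sketched as
follows. The token names and the tiny input language are hypothetical.
Each pattern is anchored with C<\G>, and the C</c> flag keeps C<pos()>
in place when an alternative fails so that the next one can be tried at
the same offset.

```perl
use strict;
use warnings;

my $src = "x = 42 + y1";    # hypothetical input
my @tokens;

pos($src) = 0;              # start scanning at the beginning
while (pos($src) < length $src) {
    if    ($src =~ /\G\s+/gc)            { next }                     # skip blanks
    elsif ($src =~ /\G(\d+)/gc)          { push @tokens, "NUM($1)" }
    elsif ($src =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "ID($1)"  }
    elsif ($src =~ /\G([=+])/gc)         { push @tokens, "OP($1)"  }
    else  { die "unexpected character at offset ", pos($src), "\n" }
}

print "@tokens\n";
```

Without C</c>, a failed match would reset C<pos()> and the scanner would
lose its place.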

The bracketing construct C<( ... )> creates capture buffers. To
refer to the digit'th buffer, use \<digit> within the
match. Outside the match use "$" instead of "\". (The
\<digit> notation works in certain circumstances outside
the match. See the warning below about \1 vs $1 for details.)
Referring back to another part of the match is called a
I<backreference>.
X<regex, capture buffer> X<regexp, capture buffer>
X<regular expression, capture buffer> X<backreference>

There is no limit to the number of captured substrings that you may
use. However Perl also uses \10, \11, etc. as aliases for \010,
\011, etc. (Recall that 0 means octal, so \011 is the character at
number 9 in your coded character set; which would be the 10th character,
a horizontal tab under ASCII.) Perl resolves this
ambiguity by interpreting \10 as a backreference only if at least 10
left parentheses have opened before it. Likewise \11 is a
backreference only if at least 11 left parentheses have opened
before it. And so on. \1 through \9 are always interpreted as
backreferences.

Examples:

    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    if (/(.)\1/) {                  # find first doubled char
        print "'$1' is the first doubled character\n";
    }

    if (/Time: (..):(..):(..)/) {   # parse out values
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

Several special variables also refer back to portions of the previous
match. C<$+> returns whatever the last bracket match matched.
C<$&> returns the entire matched string. (At one point C<$0> did
also, but now it returns the name of the program.) C<$`> returns
everything before the matched string. C<$'> returns everything
after the matched string. And C<$^N> contains whatever was matched by
the most-recently closed group (submatch). C<$^N> can be used in
extended patterns (see below), for example to assign a submatch to a
variable.
X<$+> X<$^N> X<$&> X<$`> X<$'>

The numbered match variables ($1, $2, $3, etc.) and the related punctuation
set (C<$+>, C<$&>, C<$`>, C<$'>, and C<$^N>) are all dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
X<$+> X<$^N> X<$&> X<$`> X<$'>
X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>

B<NOTE>: failed matches in Perl do not reset the match variables,
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.
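
Both points, the block scoping and the survival of C<$1> across a failed
match, can be observed in a short sketch. The sample string and the
C<@seen> array are illustrative only.

```perl
use strict;
use warnings;

my $stamp = "2006-01";      # hypothetical input: a year and a month
my @seen;                   # values of $1 observed along the way

if ($stamp =~ /^(\d{4})/) {
    # This more specific match fails, so $1 still holds the year.
    $stamp =~ /^(\d{4})-(\d\d)-(\d\d)$/;
    push @seen, $1;

    {
        # A successful match in an inner block sets $1 for that block only.
        "inner" =~ /(inn)/;
        push @seen, $1;
    }

    # Back in the outer block, $1 again refers to the outer match.
    push @seen, $1;
}

print "@seen\n";
```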

B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
C<$'> anywhere in the program, it has to provide them for every
pattern match. This may substantially slow your program. Perl
uses the same mechanism to produce $1, $2, etc, so you also pay a
price for each pattern that contains capturing parentheses. (To
avoid this cost while retaining the grouping behaviour, use the
extended regular expression C<(?: ... )> instead.) But if you never
use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
parentheses will not be penalized. So avoid C<$&>, C<$'>, and C<$`>
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
already paid the price. As of 5.005, C<$&> is not so costly as the
other two.
X<$&> X<$`> X<$'>

Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>. Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric. So anything
that looks like \\, \(, \), \<, \>, \{, or \} is always
interpreted as a literal character, not a metacharacter. This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
use for a pattern. Simply quote all non-"word" characters:

    $pattern =~ s/(\W)/\\$1/g;

(If C<use locale> is set, then this depends on the current locale.)
Today it is more common to use the quotemeta() function or the C<\Q>
metaquoting escape sequence to disable all metacharacters' special
meanings like this:

    /$unquoted\Q$quoted\E$unquoted/

Beware that if you put literal backslashes (those not inside
interpolated variables) between C<\Q> and C<\E>, double-quotish
backslash interpolation may lead to confusing results. If you
I<need> to use literal backslashes within C<\Q...\E>,
consult L<perlop/"Gory details of parsing quoted constructs">.
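
The three quoting approaches can be compared on a string full of
metacharacters. This is a sketch with invented sample data; for plain
ASCII input like this, the older substitution idiom and quotemeta()
should produce the same escaped text.

```perl
use strict;
use warnings;

my $literal = 'price (USD): $5.00+';            # hypothetical search text
my $text    = 'Listed price (USD): $5.00+ tax';

# Older idiom: backslash every non-"word" character by hand.
(my $manual = $literal) =~ s/(\W)/\\$1/g;

# Modern equivalents: quotemeta() up front, or \Q...\E inside the pattern.
my $quoted = quotemeta $literal;

my $same      = $manual eq $quoted         ? 1 : 0;
my $found_q   = $text =~ /\Q$literal\E/    ? 1 : 0;
my $found_pre = $text =~ /$quoted tax$/    ? 1 : 0;

print "same escaping: $same\n";
print "found with \\Q: $found_q, with quotemeta: $found_pre\n";
```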

=head2 Extended Patterns

Perl also defines a consistent extension syntax for features not
found in standard tools like B<awk> and B<lex>. The syntax is a