=head1 NAME
X<regular expression> X<regex> X<regexp>

perlre - Perl regular expressions

=head1 DESCRIPTION

This page describes the syntax of regular expressions in Perl.

If you haven't used regular expressions before, a quick-start
introduction is available in L<perlrequick>, and a longer tutorial
introduction is available in L<perlretut>.

For reference on how regular expressions are used in matching
operations, plus various examples of the same, see discussions of
C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
Operators">.

Matching operations can have various modifiers.  Modifiers
that relate to the interpretation of the regular expression inside
are listed below.  Modifiers that alter the way a regular expression
is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">.

=over 4

=item i
X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
X<regular expression, case-insensitive>

Do case-insensitive pattern matching.

If C<use locale> is in effect, the case map is taken from the current
locale.  See L<perllocale>.

=item m
X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>

Treat string as multiple lines.  That is, change "^" and "$" from matching
the start or end of the string to matching the start or end of any
line anywhere within the string.

=item s
X</s> X<regex, single-line> X<regexp, single-line>
X<regular expression, single-line>

Treat string as single line.  That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.

The C</s> and C</m> modifiers both override the C<$*> setting.  That
is, no matter what C<$*> contains, C</s> without C</m> will force
"^" to match only at the beginning of the string and "$" to match
only at the end (or just before a newline at the end) of the string.
Together, as C</ms>, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.

=item x
X</x>

Extend your pattern's legibility by permitting whitespace and comments.

=back

These are usually written as "the C</x> modifier", even though the delimiter
in question might not really be a slash.  Any of these
modifiers may also be embedded within the regular expression itself using
the C<(?...)> construct.  See below.
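
As a small illustration of the points above, the following sketch applies
C</i> with two different delimiters and then embeds the same modifier in the
pattern with C<(?i)>.  The sample string and the variable names are invented
for this example.

```perl
use strict;
use warnings;

my $greeting = "Hello, World";    # sample text, invented for illustration

# Without /i the match is case-sensitive and fails here.
my $plain    = $greeting =~ /hello/     ? 1 : 0;

# The modifier follows the closing delimiter, whatever that delimiter is.
my $slash    = $greeting =~ /hello/i    ? 1 : 0;
my $braces   = $greeting =~ m{world}i   ? 1 : 0;

# The same modifier embedded in the pattern itself.
my $embedded = $greeting =~ /(?i)hello/ ? 1 : 0;

print "plain=$plain slash=$slash braces=$braces embedded=$embedded\n";
```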

The C</x> modifier itself needs a little more explanation.  It tells
the regular expression parser to ignore whitespace that is neither
backslashed nor within a character class.  You can use this to break up
your regular expression into (slightly) more readable parts.  The C<#>
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code.  This also means that if you want real
whitespace or C<#> characters in the pattern (outside a character
class, where they are unaffected by C</x>), you'll either have to
escape them or encode them using octal or hex escapes.  Taken together,
these features go a long way towards making Perl's regular expressions
more readable.  Note that you have to be careful not to include the
pattern delimiter in the comment--perl has no way of knowing you did
not intend to close the pattern early.  See the C-comment deletion code
in L<perlop>.
X</x>
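
A minimal sketch of a commented C</x> pattern may make this concrete.  The
input format (a date followed by a ticket number) is hypothetical and chosen
only to show literal whitespace, a literal C<#>, and comments side by side.
Note that none of the comments contains the C</> delimiter.

```perl
use strict;
use warnings;

# Hypothetical input format: "YYYY-MM-DD #ticket".
my $entry_re = qr/
    (\d{4}) - (\d{2}) - (\d{2})   # year, month, day
    [ ]                           # a space inside a character class stays literal
    \#                            # a backslashed hash sign is matched literally
    (\d+)                         # ticket number
/x;

my $input = "2024-01-31 #42";
my @parts = $input =~ $entry_re;

print "@parts\n";
```

The unescaped spaces around the hyphens are ignored by the parser, so the
pattern still matches C<2024-01-31> with no spaces in it.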

=head2 Regular Expressions

The patterns used in Perl pattern matching derive from those supplied in
the Version 8 regex routines.  (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.)  See L<Version 8 Regular Expressions> for
details.

In particular the following metacharacters have their standard I<egrep>-ish
meanings:
X<metacharacter>
X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>

    \   Quote the next metacharacter
    ^   Match the beginning of the line
    .   Match any character (except newline)
    $   Match the end of the line (or before newline at the end)
    |   Alternation
    ()  Grouping
    []  Character class

By default, the "^" character is guaranteed to match only the
beginning of the string, the "$" character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line.  Embedded newlines
will not be matched by "^" or "$".  You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will match after any
newline within the string, and "$" will match before any newline.  At the
cost of a little more overhead, you can do this by using the C</m> modifier
on the pattern match operator.  (Older programs did this by setting C<$*>,
but this practice is now deprecated.)
X<^> X<$> X</m>
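
The difference is easy to see on a string with embedded newlines.  The
following is only a sketch; the three-line sample text is made up for the
purpose.

```perl
use strict;
use warnings;

my $text = "alpha\nbeta\ngamma\n";    # sample multi-line buffer

# Without /m, "^" matches only at the start of the string and "$" only
# at the end, so no single word spans the whole string.
my @default = $text =~ /^(\w+)$/g;

# With /m, "^" and "$" also match at the embedded line boundaries.
my @multiline = $text =~ /^(\w+)$/mg;

print scalar(@default), " match(es) without /m\n";
print scalar(@multiline), " match(es) with /m: @multiline\n";
```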

To simplify multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to pretend
the string is a single line--even if it isn't.  The C</s> modifier also
overrides the setting of C<$*>, in case you have some (badly behaved) older
code that sets it in another module.
X<.> X</s>
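
For instance, in the following sketch (the two-line string is invented for
illustration) the same pattern fails without C</s> and succeeds with it,
because only then can "." cross the newline.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";    # sample text with one newline

# "." stops at the newline, so "second" is never reached.
my ($without_s) = $text =~ /first(.*)second/;

# Under /s, "." matches the newline as well.
my ($with_s) = $text =~ /first(.*)second/s;

print defined $without_s ? "matched without /s\n" : "no match without /s\n";
print defined $with_s    ? "matched with /s\n"    : "no match with /s\n";
```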

The following standard quantifiers are recognized:
X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>

    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times

(If a curly bracket occurs in any other context, it is treated
as a regular character.  In particular, the lower bound
is not optional.)  The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>.  n and m are limited
to integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms.  The actual limit can
be seen in the error message generated by code such as this:

    $_ **= $_ , / {$_} / for 2 .. 42;
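
The equivalences between the shorthand quantifiers and their curly-bracket
forms can be checked with a short sketch such as the one below.  The test
strings are arbitrary, and agreement on a handful of inputs is of course an
illustration rather than a proof.

```perl
use strict;
use warnings;

# Each pair holds a shorthand quantifier and its curly-bracket form.
my @pairs = (
    [ qr/^a*$/, qr/^a{0,}$/  ],
    [ qr/^a+$/, qr/^a{1,}$/  ],
    [ qr/^a?$/, qr/^a{0,1}$/ ],
);

my $all_agree = 1;
for my $candidate ("", "a", "aa", "aaa") {
    for my $pair (@pairs) {
        my ($short, $curly) = @$pair;
        my $short_hit = $candidate =~ $short ? 1 : 0;
        my $curly_hit = $candidate =~ $curly ? 1 : 0;
        $all_agree = 0 if $short_hit != $curly_hit;
    }
}

print $all_agree ? "all pairs agree\n" : "mismatch found\n";
```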

By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match.  If you want it to match the
minimum number of times possible, follow the quantifier with a "?".  Note
that the meanings don't change, just the "greediness":
X<metacharacter> X<greedy> X<greedyness>
X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>

    *?     Match 0 or more times
    +?     Match 1 or more times
    ??     Match 0 or 1 time
    {n}?   Match exactly n times
    {n,}?  Match at least n times
    {n,m}? Match at least n but not more than m times
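
A short sketch shows how much the two forms can differ in practice.  The
markup-like sample string is invented for this example; both patterns start
matching at the same "<".

```perl
use strict;
use warnings;

my $markup = "<b>bold</b> and <i>italic</i>";    # sample input

# Greedy: ".+" runs to the end and backs off to the last ">".
my ($greedy) = $markup =~ /<(.+)>/;

# Non-greedy: ".+?" stops at the first ">" that lets the pattern match.
my ($minimal) = $markup =~ /<(.+?)>/;

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
```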

Because patterns are processed as double quoted strings, the following
also work:
X<\t> X<\n> X<\r> X<\f> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
X<\0> X<\c> X<\N> X<\x>

    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \033        octal char            (think of a PDP-11)
    \x1B        hex char
    \x{263a}    wide hex char         (Unicode SMILEY)
    \c[         control char
    \N{name}    named char
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \E          end case modification (think vi)
    \Q          quote (disable) pattern metacharacters till \E

If C<use locale> is in effect, the case map used by C<\l>, C<\L>, C<\u>
and C<\U> is taken from the current locale.  See L<perllocale>.  For
documentation of C<\N{name}>, see L<charnames>.
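
As a sketch of how these escapes behave inside a pattern, the code below
matches the ESC character under several spellings from the table and then
uses the case-modifying escapes on an interpolated variable.  The variable
names and sample strings are assumptions made for this example.

```perl
use strict;
use warnings;

my $esc = chr(27);    # the ESC character

# Four spellings from the table above that denote the same character.
my @spellings = (qr/\e/, qr/\033/, qr/\x1B/, qr/\x{1B}/);
my $hits = grep { $esc =~ $_ } @spellings;

# Case-modifying escapes act on the interpolated text before matching.
my $word        = "perl";
my $ucfirst_hit = "Perl"   =~ /^\u$word$/      ? 1 : 0;
my $upper_hit   = "PERL 5" =~ /^\U$word\E 5$/  ? 1 : 0;

print "$hits of ", scalar(@spellings), " patterns matched ESC\n";
print "ucfirst=$ucfirst_hit upper=$upper_hit\n";
```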

You cannot include a literal C<$> or C<@> within a C<\Q> sequence.
An unescaped C<$> or C<@> interpolates the corresponding variable,
while escaping will cause the literal string C<\$> to be matched.
You'll need to write something like C<m/\Quser\E\@\Qhost/>.
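
The following sketch contrasts an interpolated string used as a pattern with
the same string protected by C<\Q...\E>, and then builds an address pattern
in the style suggested above.  The strings C<a.b*c>, C<user.name> and
C<example.com> are placeholders chosen for illustration.

```perl
use strict;
use warnings;

my $literal = "a.b*c";                # text that contains metacharacters
my $text    = "xa.b*cx and aXbbc";    # sample input

# Interpolated as a pattern: "." and "*" keep their special meaning.
my @as_pattern = $text =~ /($literal)/g;

# Wrapped in \Q...\E: every metacharacter is matched literally.
my @as_literal = $text =~ /(\Q$literal\E)/g;

# A literal "@" sits outside the \Q...\E spans and is backslashed.
my $addr_re = qr/\Quser.name\E\@\Qexample.com\E/;
my $addr_ok = "user.name\@example.com" =~ $addr_re ? 1 : 0;
my $addr_no = "userXname\@exampleXcom" =~ $addr_re ? 1 : 0;

print "as pattern: @as_pattern\n";
print "as literal: @as_literal\n";
print "addr_ok=$addr_ok addr_no=$addr_no\n";
```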

In addition, Perl defines the following:
X<metacharacter>
X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
X<word> X<whitespace>

    \w  Match a "word" character (alphanumeric plus "_")
    \W  Match a non-"word" character
    \s  Match a whitespace character
    \S  Match a non-whitespace character
    \d  Match a digit character
    \D  Match a non-digit character
    \pP Match P, named property.  Use \p{Prop} for longer names.
    \PP Match non-P
    \X  Match eXtended Unicode "combining character sequence",
        equivalent to (?:\PM\pM*)
    \C  Match a single C char (octet) even under Unicode.
        NOTE: breaks up characters into their UTF-8 bytes,
        so you may end up with malformed pieces of UTF-8.
        Unsupported in lookbehind.

A C<\w> matches a single alphanumeric character (an alphabetic
character, or a decimal digit) or C<_>, not a whole word.  Use C<\w+>
to match a string of Perl-identifier characters (which isn't the same
as matching an English word).  If C<use locale> is in effect, the list
of alphabetic characters generated by C<\w> is taken from the current
locale.  See L<perllocale>.  You may use C<\w>, C<\W>, C<\s>, C<\S>,
C<\d>, and C<\D> within character classes, but if you try to use them
as endpoints of a range, the result is not a range: the "-" is understood
literally.  If Unicode is in effect, C<\s> also matches "\x{85}",
"\x{2028}", and "\x{2029}".  See L<perlunicode> for more details about
C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in general.
You can define your own C<\p> and C<\P> properties; see L<perlunicode>.
X<\w> X<\W> X<word>
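
To see why C<\w+> matches identifier-like runs rather than English words,
consider the following sketch.  Both sample strings are invented for
illustration.

```perl
use strict;
use warnings;

# Single quotes keep "$total_2" from being interpolated.
my $code  = 'my $total_2 = price * qty;';
my @names = $code =~ /(\w+)/g;

# An apostrophe is not a "word" character, so this splits in two.
my @pieces = "don't" =~ /(\w+)/g;

# The backslash classes may be combined inside a character class.
my $digits_and_space = "12 34" =~ /^[\d\s]+$/ ? 1 : 0;

print "names:  @names\n";
print "pieces: @pieces\n";
```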

The POSIX character class syntax
X<character class>

    [:class:]

is also available.  The available classes and their backslash
equivalents (if available) are as follows:
X<character class>
X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>

    alpha
    alnum
    ascii
    blank               [1]
    cntrl
    digit       \d
    graph
    lower
    print
    punct
    space       \s      [2]
    upper
    word        \w      [3]
    xdigit

=over

=item [1]

A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace".

=item [2]

Not exactly equivalent to C<\s> since C<[[:space:]]> also includes
the (very rare) "vertical tabulator", "\ck", chr(11).

=item [3]

A Perl extension, see above.

=back

For example, use C<[:upper:]> to match all the uppercase characters.
Note that the C<[]> are part of the C<[::]> construct, not part of the
whole character class.  For example:

    [01[:alpha:]%]

matches zero, one, any alphabetic character, and the percent sign.
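
A quick way to try that class is to filter a few candidate strings through
it, as in this sketch.  The candidate list is arbitrary.

```perl
use strict;
use warnings;

# The POSIX class sits inside the outer brackets, next to "0", "1" and "%".
my $class_re = qr/^[01[:alpha:]%]+$/;

my @candidates = ("0", "1", "abc", "50%", "10%", "a1%", "2");
my @kept       = grep { $_ =~ $class_re } @candidates;

print "kept: @kept\n";
```

Here C<50%> and C<2> are rejected because the digits "5" and "2" are not
members of the class.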

The following equivalences between POSIX classes, Unicode \p{} constructs,
and backslash character classes (where available) hold:
X<character class> X<\p> X<\p{}>

    [:...:]     \p{...}         backslash

    alpha       IsAlpha
    alnum       IsAlnum
    ascii       IsASCII
    blank       IsSpace
    cntrl       IsCntrl
    digit       IsDigit         \d
    graph       IsGraph
    lower       IsLower
    print       IsPrint
    punct       IsPunct
    space       IsSpace
                IsSpacePerl     \s
    upper       IsUpper
    word        IsWord
    xdigit      IsXDigit

For example, C<[:lower:]> and C<\p{IsLower}> are equivalent.

If the C<utf8> pragma is not used but the C<locale> pragma is, the
classes correlate with the usual isalpha(3) interface (except for
"word" and "blank").

The classes whose names may not be self-explanatory are:

=over 4

=item cntrl
X<cntrl>

Any control character.  Usually characters that don't produce output as
such but instead control the terminal somehow: for example newline and
backspace are control characters.  All characters with ord() less than
32 are most often classified as control characters (assuming ASCII,
the ISO Latin character sets, and Unicode), as is the character with
the ord() value of 127 (C<DEL>).

=item graph
X<graph>

Any alphanumeric or punctuation (special) character.

=item print
X<print>

Any alphanumeric or punctuation (special) character or the space character.

=item punct
X<punct>

Any punctuation (special) character.

=item xdigit
X<xdigit>

Any hexadecimal digit.  Though this may feel silly ([0-9A-Fa-f] would
work just fine) it is included for completeness.

=back

You can negate the [::] character classes by prefixing the class name
with a '^'.  This is a Perl extension.  For example:
X<character class, negation>

    POSIX         traditional  Unicode

    [:^digit:]      \D         \P{IsDigit}
    [:^space:]      \S         \P{IsSpace}
    [:^word:]       \W         \P{IsWord}

Perl respects the POSIX standard in that POSIX character classes are
only supported within a character class.  The POSIX collating-symbol
and equivalence-class constructs [.cc.] and [=cc=] are recognized but
B<not> supported, and trying to use them will cause an error.
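
The negated forms can be compared with their traditional counterparts on a
sample string, as sketched below.  The string is arbitrary and deliberately
contains no vertical tab, the one character on which C<[:space:]> and C<\s>
are described above as differing.

```perl
use strict;
use warnings;

my $sample = "abc 123_x!\tZ";    # sample text, no vertical tab

# Each entry: class name, negated POSIX form, traditional form.
my @forms = (
    [ digit => qr/[[:^digit:]]/, qr/\D/ ],
    [ space => qr/[[:^space:]]/, qr/\S/ ],
    [ word  => qr/[[:^word:]]/,  qr/\W/ ],
);

my %agree;
for my $form (@forms) {
    my ($name, $posix, $trad) = @$form;
    # Collect every character each form matches and compare the results.
    my $via_posix = join "", $sample =~ /($posix)/g;
    my $via_trad  = join "", $sample =~ /($trad)/g;
    $agree{$name} = $via_posix eq $via_trad ? 1 : 0;
}

print "$_: ", ($agree{$_} ? "same" : "different"), "\n" for sort keys %agree;
```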

Perl defines the following zero-width assertions:
X<zero-width assertion> X<assertion> X<regex, zero-width assertion>
X<regexp, zero-width assertion>
X<regular expression, zero-width assertion>
X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>

    \b  Match a word boundary
    \B  Match a non-(word boundary)
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)
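
The sketch below exercises three of these assertions.  The sample strings
and the toy tokenizer are assumptions made for illustration, not a
recommended lexer design.

```perl
use strict;
use warnings;

# \Z tolerates one trailing newline; \z does not.
my $line      = "foo\n";
my $cap_z_hit = $line =~ /foo\Z/ ? 1 : 0;
my $low_z_hit = $line =~ /foo\z/ ? 1 : 0;

# \b picks out "cat" as a whole word, skipping the "cat" in "catalog".
my $phrase = "catalog cat";
my $offset = $phrase =~ /\bcat\b/ ? $-[0] : -1;

# \G with /gc anchors each attempt at pos(), and a failed attempt
# leaves pos() where it was, which suits a simple tokenizer.
my $src = "12+34";
my @tokens;
pos($src) = 0;
while (pos($src) < length($src)) {
    if    ($src =~ /\G(\d+)/gc) { push @tokens, "NUM:$1" }
    elsif ($src =~ /\G(\+)/gc)  { push @tokens, "OP:$1" }
    else                        { last }    # unrecognized input: stop
}

print "Z=$cap_z_hit z=$low_z_hit offset=$offset\n";
print "@tokens\n";
```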

A word boundary (C<\b>) is a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a C<\W>.  (Within
character classes C<\b> represents backspace rather than a word