| 1 | =head1 NAME
|
|---|
| 2 |
|
|---|
| 3 | perlretut - Perl regular expressions tutorial
|
|---|
| 4 |
|
|---|
| 5 | =head1 DESCRIPTION
|
|---|
| 6 |
|
|---|
| 7 | This page provides a basic tutorial on understanding, creating and
|
|---|
| 8 | using regular expressions in Perl. It serves as a complement to the
|
|---|
| 9 | reference page on regular expressions L<perlre>. Regular expressions
|
|---|
| 10 | are an integral part of the C<m//>, C<s///>, C<qr//> and C<split>
|
|---|
| 11 | operators and so this tutorial also overlaps with
|
|---|
| 12 | L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>.
|
|---|
| 13 |
|
|---|
| 14 | Perl is widely renowned for excellence in text processing, and regular
|
|---|
| 15 | expressions are one of the big factors behind this fame. Perl regular
|
|---|
| 16 | expressions display an efficiency and flexibility unknown in most
|
|---|
| 17 | other computer languages. Mastering even the basics of regular
|
|---|
| 18 | expressions will allow you to manipulate text with surprising ease.
|
|---|
| 19 |
|
|---|
| 20 | What is a regular expression? A regular expression is simply a string
|
|---|
| 21 | that describes a pattern. Patterns are in common use these days;
|
|---|
| 22 | examples are the patterns typed into a search engine to find web pages
|
|---|
| 23 | and the patterns used to list files in a directory, e.g., C<ls *.txt>
|
|---|
| 24 | or C<dir *.*>. In Perl, the patterns described by regular expressions
|
|---|
| 25 | are used to search strings, extract desired parts of strings, and to
|
|---|
| 26 | do search and replace operations.
|
|---|
| 27 |
|
|---|
| 28 | Regular expressions have the undeserved reputation of being abstract
|
|---|
| 29 | and difficult to understand. Regular expressions are constructed using
|
|---|
| 30 | simple concepts like conditionals and loops and are no more difficult
|
|---|
| 31 | to understand than the corresponding C<if> conditionals and C<while>
|
|---|
| 32 | loops in the Perl language itself. In fact, the main challenge in
|
|---|
| 33 | learning regular expressions is just getting used to the terse
|
|---|
| 34 | notation used to express these concepts.
|
|---|
| 35 |
|
|---|
| 36 | This tutorial flattens the learning curve by discussing regular
|
|---|
| 37 | expression concepts, along with their notation, one at a time and with
|
|---|
| 38 | many examples. The first part of the tutorial will progress from the
|
|---|
| 39 | simplest word searches to the basic regular expression concepts. If
|
|---|
| 40 | you master the first part, you will have all the tools needed to solve
|
|---|
| 41 | about 98% of your needs. The second part of the tutorial is for those
|
|---|
| 42 | comfortable with the basics and hungry for more power tools. It
|
|---|
| 43 | discusses the more advanced regular expression operators and
|
|---|
| 44 | introduces the latest cutting edge innovations in 5.6.0.
|
|---|
| 45 |
|
|---|
| 46 | A note: to save time, 'regular expression' is often abbreviated as
|
|---|
| 47 | regexp or regex. Regexp is a more natural abbreviation than regex, but
|
|---|
| 48 | is harder to pronounce. The Perl pod documentation is evenly split on
|
|---|
| 49 | regexp vs regex; in Perl, there is more than one way to abbreviate it.
|
|---|
| 50 | We'll use regexp in this tutorial.
|
|---|
| 51 |
|
|---|
| 52 | =head1 Part 1: The basics
|
|---|
| 53 |
|
|---|
| 54 | =head2 Simple word matching
|
|---|
| 55 |
|
|---|
| 56 | The simplest regexp is simply a word, or more generally, a string of
|
|---|
| 57 | characters. A regexp consisting of a word matches any string that
|
|---|
| 58 | contains that word:
|
|---|
| 59 |
|
|---|
| 60 | "Hello World" =~ /World/; # matches
|
|---|
| 61 |
|
|---|
| 62 | What is this perl statement all about? C<"Hello World"> is a simple
|
|---|
| 63 | double quoted string. C<World> is the regular expression and the
|
|---|
| 64 | C<//> enclosing C</World/> tells perl to search a string for a match.
|
|---|
| 65 | The operator C<=~> associates the string with the regexp match and
|
|---|
| 66 | produces a true value if the regexp matched, or false if the regexp
|
|---|
| 67 | did not match. In our case, C<World> matches the second word in
|
|---|
| 68 | C<"Hello World">, so the expression is true. Expressions like this
|
|---|
| 69 | are useful in conditionals:
|
|---|
| 70 |
|
|---|
| 71 | if ("Hello World" =~ /World/) {
|
|---|
| 72 | print "It matches\n";
|
|---|
| 73 | }
|
|---|
| 74 | else {
|
|---|
| 75 | print "It doesn't match\n";
|
|---|
| 76 | }
|
|---|
| 77 |
|
|---|
| 78 | There are useful variations on this theme. The sense of the match can
|
|---|
| 79 | be reversed by using C<!~> operator:
|
|---|
| 80 |
|
|---|
| 81 | if ("Hello World" !~ /World/) {
|
|---|
| 82 | print "It doesn't match\n";
|
|---|
| 83 | }
|
|---|
| 84 | else {
|
|---|
| 85 | print "It matches\n";
|
|---|
| 86 | }
|
|---|
| 87 |
|
|---|
| 88 | The literal string in the regexp can be replaced by a variable:
|
|---|
| 89 |
|
|---|
| 90 | $greeting = "World";
|
|---|
| 91 | if ("Hello World" =~ /$greeting/) {
|
|---|
| 92 | print "It matches\n";
|
|---|
| 93 | }
|
|---|
| 94 | else {
|
|---|
| 95 | print "It doesn't match\n";
|
|---|
| 96 | }
|
|---|
| 97 |
|
|---|
| 98 | If you're matching against the special default variable C<$_>, the
|
|---|
| 99 | C<$_ =~> part can be omitted:
|
|---|
| 100 |
|
|---|
| 101 | $_ = "Hello World";
|
|---|
| 102 | if (/World/) {
|
|---|
| 103 | print "It matches\n";
|
|---|
| 104 | }
|
|---|
| 105 | else {
|
|---|
| 106 | print "It doesn't match\n";
|
|---|
| 107 | }
|
|---|
| 108 |
|
|---|
| 109 | And finally, the C<//> default delimiters for a match can be changed
|
|---|
| 110 | to arbitrary delimiters by putting an C<'m'> out front:
|
|---|
| 111 |
|
|---|
| 112 | "Hello World" =~ m!World!; # matches, delimited by '!'
|
|---|
| 113 | "Hello World" =~ m{World}; # matches, note the matching '{}'
|
|---|
| 114 | "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
|
|---|
| 115 | # '/' becomes an ordinary char
|
|---|
| 116 |
|
|---|
| 117 | C</World/>, C<m!World!>, and C<m{World}> all represent the
|
|---|
| 118 | same thing. When, e.g., C<""> is used as a delimiter, the forward
|
|---|
| 119 | slash C<'/'> becomes an ordinary character and can be used in a regexp
|
|---|
| 120 | without trouble.
|
|---|
| 121 |
|
|---|
| 122 | Let's consider how different regexps would match C<"Hello World">:
|
|---|
| 123 |
|
|---|
| 124 | "Hello World" =~ /world/; # doesn't match
|
|---|
| 125 | "Hello World" =~ /o W/; # matches
|
|---|
| 126 | "Hello World" =~ /oW/; # doesn't match
|
|---|
| 127 | "Hello World" =~ /World /; # doesn't match
|
|---|
| 128 |
|
|---|
| 129 | The first regexp C<world> doesn't match because regexps are
|
|---|
| 130 | case-sensitive. The second regexp matches because the substring
|
|---|
| 131 | S<C<'o W'> > occurs in the string S<C<"Hello World"> >. The space
|
|---|
| 132 | character ' ' is treated like any other character in a regexp and is
|
|---|
| 133 | needed to match in this case. The lack of a space character is the
|
|---|
| 134 | reason the third regexp C<'oW'> doesn't match. The fourth regexp
|
|---|
| 135 | C<'World '> doesn't match because there is a space at the end of the
|
|---|
| 136 | regexp, but not at the end of the string. The lesson here is that
|
|---|
| 137 | regexps must match a part of the string I<exactly> in order for the
|
|---|
| 138 | statement to be true.
|
|---|
| 139 |
|
|---|
| 140 | If a regexp matches in more than one place in the string, perl will
|
|---|
| 141 | always match at the earliest possible point in the string:
|
|---|
| 142 |
|
|---|
| 143 | "Hello World" =~ /o/; # matches 'o' in 'Hello'
|
|---|
| 144 | "That hat is red" =~ /hat/; # matches 'hat' in 'That'
|
|---|
| 145 |
|
|---|
| 146 | With respect to character matching, there are a few more points you
|
|---|
| 147 | need to know about. First of all, not all characters can be used 'as
|
|---|
| 148 | is' in a match. Some characters, called B<metacharacters>, are reserved
|
|---|
| 149 | for use in regexp notation. The metacharacters are
|
|---|
| 150 |
|
|---|
| 151 | {}[]()^$.|*+?\
|
|---|
| 152 |
|
|---|
| 153 | The significance of each of these will be explained
|
|---|
| 154 | in the rest of the tutorial, but for now, it is important only to know
|
|---|
| 155 | that a metacharacter can be matched by putting a backslash before it:
|
|---|
| 156 |
|
|---|
| 157 | "2+2=4" =~ /2+2/; # doesn't match, + is a metacharacter
|
|---|
| 158 | "2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary +
|
|---|
| 159 | "The interval is [0,1)." =~ /[0,1)./ # is a syntax error!
|
|---|
| 160 | "The interval is [0,1)." =~ /\[0,1\)\./ # matches
|
|---|
| 161 | "/usr/bin/perl" =~ /\/usr\/bin\/perl/; # matches
|
|---|
| 162 |
|
|---|
| 163 | In the last regexp, the forward slash C<'/'> is also backslashed,
|
|---|
| 164 | because it is used to delimit the regexp. This can lead to LTS
|
|---|
| 165 | (leaning toothpick syndrome), however, and it is often more readable
|
|---|
| 166 | to change delimiters.
|
|---|
| 167 |
|
|---|
| 168 | "/usr/bin/perl" =~ m!/usr/bin/perl!; # easier to read
|
|---|
| 169 |
|
|---|
| 170 | The backslash character C<'\'> is a metacharacter itself and needs to
|
|---|
| 171 | be backslashed:
|
|---|
| 172 |
|
|---|
| 173 | 'C:\WIN32' =~ /C:\\WIN/; # matches
|
|---|
| 174 |
|
|---|
| 175 | In addition to the metacharacters, there are some ASCII characters
|
|---|
| 176 | which don't have printable character equivalents and are instead
|
|---|
| 177 | represented by B<escape sequences>. Common examples are C<\t> for a
|
|---|
| 178 | tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a
|
|---|
| 179 | bell. If your string is better thought of as a sequence of arbitrary
|
|---|
| 180 | bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape
|
|---|
| 181 | sequence, e.g., C<\x1B> may be a more natural representation for your
|
|---|
| 182 | bytes. Here are some examples of escapes:
|
|---|
| 183 |
|
|---|
| 184 | "1000\t2000" =~ m(0\t2) # matches
|
|---|
| 185 | "1000\n2000" =~ /0\n20/ # matches
|
|---|
| 186 | "1000\t2000" =~ /\000\t2/ # doesn't match, "0" ne "\000"
|
|---|
| 187 | "cat" =~ /\143\x61\x74/ # matches, but a weird way to spell cat
|
|---|
| 188 |
|
|---|
| 189 | If you've been around Perl a while, all this talk of escape sequences
|
|---|
| 190 | may seem familiar. Similar escape sequences are used in double-quoted
|
|---|
| 191 | strings and in fact the regexps in Perl are mostly treated as
|
|---|
| 192 | double-quoted strings. This means that variables can be used in
|
|---|
| 193 | regexps as well. Just like double-quoted strings, the values of the
|
|---|
| 194 | variables in the regexp will be substituted in before the regexp is
|
|---|
| 195 | evaluated for matching purposes. So we have:
|
|---|
| 196 |
|
|---|
| 197 | $foo = 'house';
|
|---|
| 198 | 'housecat' =~ /$foo/; # matches
|
|---|
| 199 | 'cathouse' =~ /cat$foo/; # matches
|
|---|
| 200 | 'housecat' =~ /${foo}cat/; # matches
|
|---|
| 201 |
|
|---|
| 202 | So far, so good. With the knowledge above you can already perform
|
|---|
| 203 | searches with just about any literal string regexp you can dream up.
|
|---|
| 204 | Here is a I<very simple> emulation of the Unix grep program:
|
|---|
| 205 |
|
|---|
| 206 | % cat > simple_grep
|
|---|
| 207 | #!/usr/bin/perl
|
|---|
| 208 | $regexp = shift;
|
|---|
| 209 | while (<>) {
|
|---|
| 210 | print if /$regexp/;
|
|---|
| 211 | }
|
|---|
| 212 | ^D
|
|---|
| 213 |
|
|---|
| 214 | % chmod +x simple_grep
|
|---|
| 215 |
|
|---|
| 216 | % simple_grep abba /usr/dict/words
|
|---|
| 217 | Babbage
|
|---|
| 218 | cabbage
|
|---|
| 219 | cabbages
|
|---|
| 220 | sabbath
|
|---|
| 221 | Sabbathize
|
|---|
| 222 | Sabbathizes
|
|---|
| 223 | sabbatical
|
|---|
| 224 | scabbard
|
|---|
| 225 | scabbards
|
|---|
| 226 |
|
|---|
| 227 | This program is easy to understand. C<#!/usr/bin/perl> is the standard
|
|---|
| 228 | way to invoke a perl program from the shell.
|
|---|
| 229 | S<C<$regexp = shift;> > saves the first command line argument as the
|
|---|
| 230 | regexp to be used, leaving the rest of the command line arguments to
|
|---|
| 231 | be treated as files. S<C<< while (<>) >> > loops over all the lines in
|
|---|
| 232 | all the files. For each line, S<C<print if /$regexp/;> > prints the
|
|---|
| 233 | line if the regexp matches the line. In this line, both C<print> and
|
|---|
| 234 | C</$regexp/> use the default variable C<$_> implicitly.
|
|---|
| 235 |
|
|---|
| 236 | With all of the regexps above, if the regexp matched anywhere in the
|
|---|
| 237 | string, it was considered a match. Sometimes, however, we'd like to
|
|---|
| 238 | specify I<where> in the string the regexp should try to match. To do
|
|---|
| 239 | this, we would use the B<anchor> metacharacters C<^> and C<$>. The
|
|---|
| 240 | anchor C<^> means match at the beginning of the string and the anchor
|
|---|
| 241 | C<$> means match at the end of the string, or before a newline at the
|
|---|
| 242 | end of the string. Here is how they are used:
|
|---|
| 243 |
|
|---|
| 244 | "housekeeper" =~ /keeper/; # matches
|
|---|
| 245 | "housekeeper" =~ /^keeper/; # doesn't match
|
|---|
| 246 | "housekeeper" =~ /keeper$/; # matches
|
|---|
| 247 | "housekeeper\n" =~ /keeper$/; # matches
|
|---|
| 248 |
|
|---|
| 249 | The second regexp doesn't match because C<^> constrains C<keeper> to
|
|---|
| 250 | match only at the beginning of the string, but C<"housekeeper"> has
|
|---|
| 251 | keeper starting in the middle. The third regexp does match, since the
|
|---|
| 252 | C<$> constrains C<keeper> to match only at the end of the string.
|
|---|
| 253 |
|
|---|
| 254 | When both C<^> and C<$> are used at the same time, the regexp has to
|
|---|
| 255 | match both the beginning and the end of the string, i.e., the regexp
|
|---|
| 256 | matches the whole string. Consider
|
|---|
| 257 |
|
|---|
| 258 | "keeper" =~ /^keep$/; # doesn't match
|
|---|
| 259 | "keeper" =~ /^keeper$/; # matches
|
|---|
| 260 | "" =~ /^$/; # ^$ matches an empty string
|
|---|
| 261 |
|
|---|
| 262 | The first regexp doesn't match because the string has more to it than
|
|---|
| 263 | C<keep>. Since the second regexp is exactly the string, it
|
|---|
| 264 | matches. Using both C<^> and C<$> in a regexp forces the complete
|
|---|
| 265 | string to match, so it gives you complete control over which strings
|
|---|
| 266 | match and which don't. Suppose you are looking for a fellow named
|
|---|
| 267 | bert, off in a string by himself:
|
|---|
| 268 |
|
|---|
| 269 | "dogbert" =~ /bert/; # matches, but not what you want
|
|---|
| 270 |
|
|---|
| 271 | "dilbert" =~ /^bert/; # doesn't match, but ..
|
|---|
| 272 | "bertram" =~ /^bert/; # matches, so still not good enough
|
|---|
| 273 |
|
|---|
| 274 | "bertram" =~ /^bert$/; # doesn't match, good
|
|---|
| 275 | "dilbert" =~ /^bert$/; # doesn't match, good
|
|---|
| 276 | "bert" =~ /^bert$/; # matches, perfect
|
|---|
| 277 |
|
|---|
| 278 | Of course, in the case of a literal string, one could just as easily
|
|---|
| 279 | use the string equivalence S<C<$string eq 'bert'> > and it would be
|
|---|
| 280 | more efficient. The C<^...$> regexp really becomes useful when we
|
|---|
| 281 | add in the more powerful regexp tools below.
|
|---|
| 282 |
|
|---|
| 283 | =head2 Using character classes
|
|---|
| 284 |
|
|---|
| 285 | Although one can already do quite a lot with the literal string
|
|---|
| 286 | regexps above, we've only scratched the surface of regular expression
|
|---|
| 287 | technology. In this and subsequent sections we will introduce regexp
|
|---|
| 288 | concepts (and associated metacharacter notations) that will allow a
|
|---|
| 289 | regexp to not just represent a single character sequence, but a I<whole
|
|---|
| 290 | class> of them.
|
|---|
| 291 |
|
|---|
| 292 | One such concept is that of a B<character class>. A character class
|
|---|
| 293 | allows a set of possible characters, rather than just a single
|
|---|
| 294 | character, to match at a particular point in a regexp. Character
|
|---|
| 295 | classes are denoted by brackets C<[...]>, with the set of characters
|
|---|
| 296 | to be possibly matched inside. Here are some examples:
|
|---|
| 297 |
|
|---|
| 298 | /cat/; # matches 'cat'
|
|---|
| 299 | /[bcr]at/; # matches 'bat, 'cat', or 'rat'
|
|---|
| 300 | /item[0123456789]/; # matches 'item0' or ... or 'item9'
|
|---|
| 301 | "abc" =~ /[cab]/; # matches 'a'
|
|---|
| 302 |
|
|---|
| 303 | In the last statement, even though C<'c'> is the first character in
|
|---|
| 304 | the class, C<'a'> matches because the first character position in the
|
|---|
| 305 | string is the earliest point at which the regexp can match.
|
|---|
| 306 |
|
|---|
| 307 | /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
|
|---|
| 308 | # 'yes', 'Yes', 'YES', etc.
|
|---|
| 309 |
|
|---|
| 310 | This regexp displays a common task: perform a case-insensitive
|
|---|
| 311 | match. Perl provides away of avoiding all those brackets by simply
|
|---|
| 312 | appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;>
|
|---|
| 313 | can be rewritten as C</yes/i;>. The C<'i'> stands for
|
|---|
| 314 | case-insensitive and is an example of a B<modifier> of the matching
|
|---|
| 315 | operation. We will meet other modifiers later in the tutorial.
|
|---|
| 316 |
|
|---|
| 317 | We saw in the section above that there were ordinary characters, which
|
|---|
| 318 | represented themselves, and special characters, which needed a
|
|---|
| 319 | backslash C<\> to represent themselves. The same is true in a
|
|---|
| 320 | character class, but the sets of ordinary and special characters
|
|---|
| 321 | inside a character class are different than those outside a character
|
|---|
| 322 | class. The special characters for a character class are C<-]\^$>. C<]>
|
|---|
| 323 | is special because it denotes the end of a character class. C<$> is
|
|---|
| 324 | special because it denotes a scalar variable. C<\> is special because
|
|---|
| 325 | it is used in escape sequences, just like above. Here is how the
|
|---|
| 326 | special characters C<]$\> are handled:
|
|---|
| 327 |
|
|---|
| 328 | /[\]c]def/; # matches ']def' or 'cdef'
|
|---|
| 329 | $x = 'bcr';
|
|---|
| 330 | /[$x]at/; # matches 'bat', 'cat', or 'rat'
|
|---|
| 331 | /[\$x]at/; # matches '$at' or 'xat'
|
|---|
| 332 | /[\\$x]at/; # matches '\at', 'bat, 'cat', or 'rat'
|
|---|
| 333 |
|
|---|
| 334 | The last two are a little tricky. in C<[\$x]>, the backslash protects
|
|---|
| 335 | the dollar sign, so the character class has two members C<$> and C<x>.
|
|---|
| 336 | In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a
|
|---|
| 337 | variable and substituted in double quote fashion.
|
|---|
| 338 |
|
|---|
| 339 | The special character C<'-'> acts as a range operator within character
|
|---|
| 340 | classes, so that a contiguous set of characters can be written as a
|
|---|
| 341 | range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]>
|
|---|
| 342 | become the svelte C<[0-9]> and C<[a-z]>. Some examples are
|
|---|
| 343 |
|
|---|
| 344 | /item[0-9]/; # matches 'item0' or ... or 'item9'
|
|---|
| 345 | /[0-9bx-z]aa/; # matches '0aa', ..., '9aa',
|
|---|
| 346 | # 'baa', 'xaa', 'yaa', or 'zaa'
|
|---|
| 347 | /[0-9a-fA-F]/; # matches a hexadecimal digit
|
|---|
| 348 | /[0-9a-zA-Z_]/; # matches a "word" character,
|
|---|
| 349 | # like those in a perl variable name
|
|---|
| 350 |
|
|---|
| 351 | If C<'-'> is the first or last character in a character class, it is
|
|---|
| 352 | treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are
|
|---|
| 353 | all equivalent.
|
|---|
| 354 |
|
|---|
| 355 | The special character C<^> in the first position of a character class
|
|---|
| 356 | denotes a B<negated character class>, which matches any character but
|
|---|
| 357 | those in the brackets. Both C<[...]> and C<[^...]> must match a
|
|---|
| 358 | character, or the match fails. Then
|
|---|
| 359 |
|
|---|
| 360 | /[^a]at/; # doesn't match 'aat' or 'at', but matches
|
|---|
| 361 | # all other 'bat', 'cat, '0at', '%at', etc.
|
|---|
| 362 | /[^0-9]/; # matches a non-numeric character
|
|---|
| 363 | /[a^]at/; # matches 'aat' or '^at'; here '^' is ordinary
|
|---|
| 364 |
|
|---|
| 365 | Now, even C<[0-9]> can be a bother the write multiple times, so in the
|
|---|
| 366 | interest of saving keystrokes and making regexps more readable, Perl
|
|---|
| 367 | has several abbreviations for common character classes:
|
|---|
| 368 |
|
|---|
| 369 | =over 4
|
|---|
| 370 |
|
|---|
| 371 | =item *
|
|---|
| 372 |
|
|---|
| 373 | \d is a digit and represents [0-9]
|
|---|
| 374 |
|
|---|
| 375 | =item *
|
|---|
| 376 |
|
|---|
| 377 | \s is a whitespace character and represents [\ \t\r\n\f]
|
|---|
| 378 |
|
|---|
| 379 | =item *
|
|---|
| 380 |
|
|---|
| 381 | \w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]
|
|---|
| 382 |
|
|---|
| 383 | =item *
|
|---|
| 384 |
|
|---|
| 385 | \D is a negated \d; it represents any character but a digit [^0-9]
|
|---|
| 386 |
|
|---|
| 387 | =item *
|
|---|
| 388 |
|
|---|
| 389 | \S is a negated \s; it represents any non-whitespace character [^\s]
|
|---|
| 390 |
|
|---|
| 391 | =item *
|
|---|
| 392 |
|
|---|
| 393 | \W is a negated \w; it represents any non-word character [^\w]
|
|---|
| 394 |
|
|---|
| 395 | =item *
|
|---|
| 396 |
|
|---|
| 397 | The period '.' matches any character but "\n"
|
|---|
| 398 |
|
|---|
| 399 | =back
|
|---|
| 400 |
|
|---|
| 401 | The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
|
|---|
| 402 | of character classes. Here are some in use:
|
|---|
| 403 |
|
|---|
| 404 | /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
|
|---|
| 405 | /[\d\s]/; # matches any digit or whitespace character
|
|---|
| 406 | /\w\W\w/; # matches a word char, followed by a
|
|---|
| 407 | # non-word char, followed by a word char
|
|---|
| 408 | /..rt/; # matches any two chars, followed by 'rt'
|
|---|
| 409 | /end\./; # matches 'end.'
|
|---|
| 410 | /end[.]/; # same thing, matches 'end.'
|
|---|
| 411 |
|
|---|
| 412 | Because a period is a metacharacter, it needs to be escaped to match
|
|---|
| 413 | as an ordinary period. Because, for example, C<\d> and C<\w> are sets
|
|---|
| 414 | of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in
|
|---|
| 415 | fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as
|
|---|
| 416 | C<[\W]>. Think DeMorgan's laws.
|
|---|
| 417 |
|
|---|
| 418 | An anchor useful in basic regexps is the S<B<word anchor> >
|
|---|
| 419 | C<\b>. This matches a boundary between a word character and a non-word
|
|---|
| 420 | character C<\w\W> or C<\W\w>:
|
|---|
| 421 |
|
|---|
| 422 | $x = "Housecat catenates house and cat";
|
|---|
| 423 | $x =~ /cat/; # matches cat in 'housecat'
|
|---|
| 424 | $x =~ /\bcat/; # matches cat in 'catenates'
|
|---|
| 425 | $x =~ /cat\b/; # matches cat in 'housecat'
|
|---|
| 426 | $x =~ /\bcat\b/; # matches 'cat' at end of string
|
|---|
| 427 |
|
|---|
| 428 | Note in the last example, the end of the string is considered a word
|
|---|
| 429 | boundary.
|
|---|
| 430 |
|
|---|
| 431 | You might wonder why C<'.'> matches everything but C<"\n"> - why not
|
|---|
| 432 | every character? The reason is that often one is matching against
|
|---|
| 433 | lines and would like to ignore the newline characters. For instance,
|
|---|
| 434 | while the string C<"\n"> represents one line, we would like to think
|
|---|
| 435 | of as empty. Then
|
|---|
| 436 |
|
|---|
| 437 | "" =~ /^$/; # matches
|
|---|
| 438 | "\n" =~ /^$/; # matches, "\n" is ignored
|
|---|
| 439 |
|
|---|
| 440 | "" =~ /./; # doesn't match; it needs a char
|
|---|
| 441 | "" =~ /^.$/; # doesn't match; it needs a char
|
|---|
| 442 | "\n" =~ /^.$/; # doesn't match; it needs a char other than "\n"
|
|---|
| 443 | "a" =~ /^.$/; # matches
|
|---|
| 444 | "a\n" =~ /^.$/; # matches, ignores the "\n"
|
|---|
| 445 |
|
|---|
| 446 | This behavior is convenient, because we usually want to ignore
|
|---|
| 447 | newlines when we count and match characters in a line. Sometimes,
|
|---|
| 448 | however, we want to keep track of newlines. We might even want C<^>
|
|---|
| 449 | and C<$> to anchor at the beginning and end of lines within the
|
|---|
| 450 | string, rather than just the beginning and end of the string. Perl
|
|---|
| 451 | allows us to choose between ignoring and paying attention to newlines
|
|---|
| 452 | by using the C<//s> and C<//m> modifiers. C<//s> and C<//m> stand for
|
|---|
| 453 | single line and multi-line and they determine whether a string is to
|
|---|
| 454 | be treated as one continuous string, or as a set of lines. The two
|
|---|
| 455 | modifiers affect two aspects of how the regexp is interpreted: 1) how
|
|---|
| 456 | the C<'.'> character class is defined, and 2) where the anchors C<^>
|
|---|
| 457 | and C<$> are able to match. Here are the four possible combinations:
|
|---|
| 458 |
|
|---|
| 459 | =over 4
|
|---|
| 460 |
|
|---|
| 461 | =item *
|
|---|
| 462 |
|
|---|
| 463 | no modifiers (//): Default behavior. C<'.'> matches any character
|
|---|
| 464 | except C<"\n">. C<^> matches only at the beginning of the string and
|
|---|
| 465 | C<$> matches only at the end or before a newline at the end.
|
|---|
| 466 |
|
|---|
| 467 | =item *
|
|---|
| 468 |
|
|---|
| 469 | s modifier (//s): Treat string as a single long line. C<'.'> matches
|
|---|
| 470 | any character, even C<"\n">. C<^> matches only at the beginning of
|
|---|
| 471 | the string and C<$> matches only at the end or before a newline at the
|
|---|
| 472 | end.
|
|---|
| 473 |
|
|---|
| 474 | =item *
|
|---|
| 475 |
|
|---|
| 476 | m modifier (//m): Treat string as a set of multiple lines. C<'.'>
|
|---|
| 477 | matches any character except C<"\n">. C<^> and C<$> are able to match
|
|---|
| 478 | at the start or end of I<any> line within the string.
|
|---|
| 479 |
|
|---|
| 480 | =item *
|
|---|
| 481 |
|
|---|
| 482 | both s and m modifiers (//sm): Treat string as a single long line, but
|
|---|
| 483 | detect multiple lines. C<'.'> matches any character, even
|
|---|
| 484 | C<"\n">. C<^> and C<$>, however, are able to match at the start or end
|
|---|
| 485 | of I<any> line within the string.
|
|---|
| 486 |
|
|---|
| 487 | =back
|
|---|
| 488 |
|
|---|
| 489 | Here are examples of C<//s> and C<//m> in action:
|
|---|
| 490 |
|
|---|
| 491 | $x = "There once was a girl\nWho programmed in Perl\n";
|
|---|
| 492 |
|
|---|
| 493 | $x =~ /^Who/; # doesn't match, "Who" not at start of string
|
|---|
| 494 | $x =~ /^Who/s; # doesn't match, "Who" not at start of string
|
|---|
| 495 | $x =~ /^Who/m; # matches, "Who" at start of second line
|
|---|
| 496 | $x =~ /^Who/sm; # matches, "Who" at start of second line
|
|---|
| 497 |
|
|---|
| 498 | $x =~ /girl.Who/; # doesn't match, "." doesn't match "\n"
|
|---|
| 499 | $x =~ /girl.Who/s; # matches, "." matches "\n"
|
|---|
| 500 | $x =~ /girl.Who/m; # doesn't match, "." doesn't match "\n"
|
|---|
| 501 | $x =~ /girl.Who/sm; # matches, "." matches "\n"
|
|---|
| 502 |
|
|---|
| 503 | Most of the time, the default behavior is what is want, but C<//s> and
|
|---|
| 504 | C<//m> are occasionally very useful. If C<//m> is being used, the start
|
|---|
| 505 | of the string can still be matched with C<\A> and the end of string
|
|---|
| 506 | can still be matched with the anchors C<\Z> (matches both the end and
|
|---|
| 507 | the newline before, like C<$>), and C<\z> (matches only the end):
|
|---|
| 508 |
|
|---|
| 509 | $x =~ /^Who/m; # matches, "Who" at start of second line
|
|---|
| 510 | $x =~ /\AWho/m; # doesn't match, "Who" is not at start of string
|
|---|
| 511 |
|
|---|
| 512 | $x =~ /girl$/m; # matches, "girl" at end of first line
|
|---|
| 513 | $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string
|
|---|
| 514 |
|
|---|
| 515 | $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end
|
|---|
| 516 | $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string
|
|---|
| 517 |
|
|---|
| 518 | We now know how to create choices among classes of characters in a
|
|---|
| 519 | regexp. What about choices among words or character strings? Such
|
|---|
| 520 | choices are described in the next section.
|
|---|
| 521 |
|
|---|
| 522 | =head2 Matching this or that
|
|---|
| 523 |
|
|---|
| 524 | Sometimes we would like to our regexp to be able to match different
|
|---|
| 525 | possible words or character strings. This is accomplished by using
|
|---|
| 526 | the B<alternation> metacharacter C<|>. To match C<dog> or C<cat>, we
|
|---|
| 527 | form the regexp C<dog|cat>. As before, perl will try to match the
|
|---|
| 528 | regexp at the earliest possible point in the string. At each
|
|---|
| 529 | character position, perl will first try to match the first
|
|---|
| 530 | alternative, C<dog>. If C<dog> doesn't match, perl will then try the
|
|---|
| 531 | next alternative, C<cat>. If C<cat> doesn't match either, then the
|
|---|
| 532 | match fails and perl moves to the next position in the string. Some
|
|---|
| 533 | examples:
|
|---|
| 534 |
|
|---|
| 535 | "cats and dogs" =~ /cat|dog|bird/; # matches "cat"
|
|---|
| 536 | "cats and dogs" =~ /dog|cat|bird/; # matches "cat"
|
|---|
| 537 |
|
|---|
| 538 | Even though C<dog> is the first alternative in the second regexp,
|
|---|
| 539 | C<cat> is able to match earlier in the string.
|
|---|
| 540 |
|
|---|
| 541 | "cats" =~ /c|ca|cat|cats/; # matches "c"
|
|---|
| 542 | "cats" =~ /cats|cat|ca|c/; # matches "cats"
|
|---|
| 543 |
|
|---|
| 544 | Here, all the alternatives match at the first string position, so the
|
|---|
| 545 | first alternative is the one that matches. If some of the
|
|---|
| 546 | alternatives are truncations of the others, put the longest ones first
|
|---|
| 547 | to give them a chance to match.
|
|---|
| 548 |
|
|---|
| 549 | "cab" =~ /a|b|c/ # matches "c"
|
|---|
| 550 | # /a|b|c/ == /[abc]/
|
|---|
| 551 |
|
|---|
| 552 | The last example points out that character classes are like
|
|---|
| 553 | alternations of characters. At a given character position, the first
|
|---|
| 554 | alternative that allows the regexp match to succeed will be the one
|
|---|
| 555 | that matches.
|
|---|
| 556 |
|
|---|
| 557 | =head2 Grouping things and hierarchical matching
|
|---|
| 558 |
|
|---|
| 559 | Alternation allows a regexp to choose among alternatives, but by
|
|---|
| 560 | itself it unsatisfying. The reason is that each alternative is a whole
|
|---|
| 561 | regexp, but sometime we want alternatives for just part of a
|
|---|
| 562 | regexp. For instance, suppose we want to search for housecats or
|
|---|
| 563 | housekeepers. The regexp C<housecat|housekeeper> fits the bill, but is
|
|---|
| 564 | inefficient because we had to type C<house> twice. It would be nice to
|
|---|
| 565 | have parts of the regexp be constant, like C<house>, and some
|
|---|
| 566 | parts have alternatives, like C<cat|keeper>.
|
|---|
| 567 |
|
|---|
| 568 | The B<grouping> metacharacters C<()> solve this problem. Grouping
|
|---|
| 569 | allows parts of a regexp to be treated as a single unit. Parts of a
|
|---|
| 570 | regexp are grouped by enclosing them in parentheses. Thus we could solve
|
|---|
| 571 | the C<housecat|housekeeper> by forming the regexp as
|
|---|
| 572 | C<house(cat|keeper)>. The regexp C<house(cat|keeper)> means match
|
|---|
| 573 | C<house> followed by either C<cat> or C<keeper>. Some more examples
|
|---|
| 574 | are
|
|---|
| 575 |
|
|---|
| 576 | /(a|b)b/; # matches 'ab' or 'bb'
|
|---|
| 577 | /(ac|b)b/; # matches 'acb' or 'bb'
|
|---|
| 578 | /(^a|b)c/; # matches 'ac' at start of string or 'bc' anywhere
|
|---|
| 579 | /(a|[bc])d/; # matches 'ad', 'bd', or 'cd'
|
|---|
| 580 |
|
|---|
| 581 | /house(cat|)/; # matches either 'housecat' or 'house'
|
|---|
| 582 | /house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or
|
|---|
| 583 | # 'house'. Note groups can be nested.
|
|---|
| 584 |
|
|---|
| 585 | /(19|20|)\d\d/; # match years 19xx, 20xx, or the Y2K problem, xx
|
|---|
| 586 | "20" =~ /(19|20|)\d\d/; # matches the null alternative '()\d\d',
|
|---|
| 587 | # because '20\d\d' can't match
|
|---|
| 588 |
|
|---|
| 589 | Alternations behave the same way in groups as out of them: at a given
|
|---|
| 590 | string position, the leftmost alternative that allows the regexp to
|
|---|
| 591 | match is taken. So in the last example at the first string position,
|
|---|
| 592 | C<"20"> matches the second alternative, but there is nothing left over
|
|---|
| 593 | to match the next two digits C<\d\d>. So perl moves on to the next
|
|---|
| 594 | alternative, which is the null alternative and that works, since
|
|---|
| 595 | C<"20"> is two digits.
|
|---|
| 596 |
|
|---|
| 597 | The process of trying one alternative, seeing if it matches, and
|
|---|
| 598 | moving on to the next alternative if it doesn't, is called
|
|---|
| 599 | B<backtracking>. The term 'backtracking' comes from the idea that
|
|---|
| 600 | matching a regexp is like a walk in the woods. Successfully matching
|
|---|
| 601 | a regexp is like arriving at a destination. There are many possible
|
|---|
| 602 | trailheads, one for each string position, and each one is tried in
|
|---|
| 603 | order, left to right. From each trailhead there may be many paths,
|
|---|
| 604 | some of which get you there, and some which are dead ends. When you
|
|---|
| 605 | walk along a trail and hit a dead end, you have to backtrack along the
|
|---|
| 606 | trail to an earlier point to try another trail. If you hit your
|
|---|
| 607 | destination, you stop immediately and forget about trying all the
|
|---|
| 608 | other trails. You are persistent, and only if you have tried all the
|
|---|
| 609 | trails from all the trailheads and not arrived at your destination, do
|
|---|
| 610 | you declare failure. To be concrete, here is a step-by-step analysis
|
|---|
| 611 | of what perl does when it tries to match the regexp
|
|---|
| 612 |
|
|---|
| 613 | "abcde" =~ /(abd|abc)(df|d|de)/;
|
|---|
| 614 |
|
|---|
| 615 | =over 4
|
|---|
| 616 |
|
|---|
| 617 | =item 0
|
|---|
| 618 |
|
|---|
| 619 | Start with the first letter in the string 'a'.
|
|---|
| 620 |
|
|---|
| 621 | =item 1
|
|---|
| 622 |
|
|---|
| 623 | Try the first alternative in the first group 'abd'.
|
|---|
| 624 |
|
|---|
| 625 | =item 2
|
|---|
| 626 |
|
|---|
| 627 | Match 'a' followed by 'b'. So far so good.
|
|---|
| 628 |
|
|---|
| 629 | =item 3
|
|---|
| 630 |
|
|---|
| 631 | 'd' in the regexp doesn't match 'c' in the string - a dead
|
|---|
| 632 | end. So backtrack two characters and pick the second alternative in
|
|---|
| 633 | the first group 'abc'.
|
|---|
| 634 |
|
|---|
| 635 | =item 4
|
|---|
| 636 |
|
|---|
| 637 | Match 'a' followed by 'b' followed by 'c'. We are on a roll
|
|---|
| 638 | and have satisfied the first group. Set $1 to 'abc'.
|
|---|
| 639 |
|
|---|
| 640 | =item 5
|
|---|
| 641 |
|
|---|
| 642 | Move on to the second group and pick the first alternative
|
|---|
| 643 | 'df'.
|
|---|
| 644 |
|
|---|
| 645 | =item 6
|
|---|
| 646 |
|
|---|
| 647 | Match the 'd'.
|
|---|
| 648 |
|
|---|
| 649 | =item 7
|
|---|
| 650 |
|
|---|
| 651 | 'f' in the regexp doesn't match 'e' in the string, so a dead
|
|---|
| 652 | end. Backtrack one character and pick the second alternative in the
|
|---|
| 653 | second group 'd'.
|
|---|
| 654 |
|
|---|
| 655 | =item 8
|
|---|
| 656 |
|
|---|
| 657 | 'd' matches. The second grouping is satisfied, so set $2 to
|
|---|
| 658 | 'd'.
|
|---|
| 659 |
|
|---|
| 660 | =item 9
|
|---|
| 661 |
|
|---|
| 662 | We are at the end of the regexp, so we are done! We have
|
|---|
| 663 | matched 'abcd' out of the string "abcde".
|
|---|
| 664 |
|
|---|
| 665 | =back
|
|---|
| 666 |
|
|---|
| 667 | There are a couple of things to note about this analysis. First, the
|
|---|
| 668 | third alternative in the second group 'de' also allows a match, but we
|
|---|
| 669 | stopped before we got to it - at a given character position, leftmost
|
|---|
| 670 | wins. Second, we were able to get a match at the first character
|
|---|
| 671 | position of the string 'a'. If there were no matches at the first
|
|---|
| 672 | position, perl would move to the second character position 'b' and
|
|---|
| 673 | attempt the match all over again. Only when all possible paths at all
|
|---|
| 674 | possible character positions have been exhausted does perl give
|
|---|
| 675 | up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;> > to be false.
|
|---|
| 676 |
|
|---|
| 677 | Even with all this work, regexp matching happens remarkably fast. To
|
|---|
| 678 | speed things up, during compilation stage, perl compiles the regexp
|
|---|
| 679 | into a compact sequence of opcodes that can often fit inside a
|
|---|
| 680 | processor cache. When the code is executed, these opcodes can then run
|
|---|
| 681 | at full throttle and search very quickly.
|
|---|
| 682 |
|
|---|
| 683 | =head2 Extracting matches
|
|---|
| 684 |
|
|---|
| 685 | The grouping metacharacters C<()> also serve another completely
|
|---|
| 686 | different function: they allow the extraction of the parts of a string
|
|---|
| 687 | that matched. This is very useful to find out what matched and for
|
|---|
| 688 | text processing in general. For each grouping, the part that matched
|
|---|
| 689 | inside goes into the special variables C<$1>, C<$2>, etc. They can be
|
|---|
| 690 | used just as ordinary variables:
|
|---|
| 691 |
|
|---|
| 692 | # extract hours, minutes, seconds
|
|---|
| 693 | if ($time =~ /(\d\d):(\d\d):(\d\d)/) { # match hh:mm:ss format
|
|---|
| 694 | $hours = $1;
|
|---|
| 695 | $minutes = $2;
|
|---|
| 696 | $seconds = $3;
|
|---|
| 697 | }
|
|---|
| 698 |
|
|---|
| 699 | Now, we know that in scalar context,
|
|---|
| 700 | S<C<$time =~ /(\d\d):(\d\d):(\d\d)/> > returns a true or false
|
|---|
| 701 | value. In list context, however, it returns the list of matched values
|
|---|
| 702 | C<($1,$2,$3)>. So we could write the code more compactly as
|
|---|
| 703 |
|
|---|
| 704 | # extract hours, minutes, seconds
|
|---|
| 705 | ($hours, $minutes, $second) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
|
|---|
| 706 |
|
|---|
| 707 | If the groupings in a regexp are nested, C<$1> gets the group with the
|
|---|
| 708 | leftmost opening parenthesis, C<$2> the next opening parenthesis,
|
|---|
| 709 | etc. For example, here is a complex regexp and the matching variables
|
|---|
| 710 | indicated below it:
|
|---|
| 711 |
|
|---|
| 712 | /(ab(cd|ef)((gi)|j))/;
|
|---|
| 713 | 1 2 34
|
|---|
| 714 |
|
|---|
| 715 | so that if the regexp matched, e.g., C<$2> would contain 'cd' or 'ef'. For
|
|---|
| 716 | convenience, perl sets C<$+> to the string held by the highest numbered
|
|---|
| 717 | C<$1>, C<$2>, ... that got assigned (and, somewhat related, C<$^N> to the
|
|---|
| 718 | value of the C<$1>, C<$2>, ... most-recently assigned; i.e. the C<$1>,
|
|---|
| 719 | C<$2>, ... associated with the rightmost closing parenthesis used in the
|
|---|
| 720 | match).
|
|---|
| 721 |
|
|---|
| 722 | Closely associated with the matching variables C<$1>, C<$2>, ... are
|
|---|
| 723 | the B<backreferences> C<\1>, C<\2>, ... . Backreferences are simply
|
|---|
| 724 | matching variables that can be used I<inside> a regexp. This is a
|
|---|
| 725 | really nice feature - what matches later in a regexp can depend on
|
|---|
| 726 | what matched earlier in the regexp. Suppose we wanted to look
|
|---|
| 727 | for doubled words in text, like 'the the'. The following regexp finds
|
|---|
| 728 | all 3-letter doubles with a space in between:
|
|---|
| 729 |
|
|---|
| 730 | /(\w\w\w)\s\1/;
|
|---|
| 731 |
|
|---|
| 732 | The grouping assigns a value to \1, so that the same 3 letter sequence
|
|---|
| 733 | is used for both parts. Here are some words with repeated parts:
|
|---|
| 734 |
|
|---|
| 735 | % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
|
|---|
| 736 | beriberi
|
|---|
| 737 | booboo
|
|---|
| 738 | coco
|
|---|
| 739 | mama
|
|---|
| 740 | murmur
|
|---|
| 741 | papa
|
|---|
| 742 |
|
|---|
| 743 | The regexp has a single grouping which considers 4-letter
|
|---|
| 744 | combinations, then 3-letter combinations, etc. and uses C<\1> to look for
|
|---|
| 745 | a repeat. Although C<$1> and C<\1> represent the same thing, care should be
|
|---|
| 746 | taken to use matched variables C<$1>, C<$2>, ... only outside a regexp
|
|---|
| 747 | and backreferences C<\1>, C<\2>, ... only inside a regexp; not doing
|
|---|
| 748 | so may lead to surprising and/or undefined results.
|
|---|
| 749 |
|
|---|
| 750 | In addition to what was matched, Perl 5.6.0 also provides the
|
|---|
| 751 | positions of what was matched with the C<@-> and C<@+>
|
|---|
| 752 | arrays. C<$-[0]> is the position of the start of the entire match and
|
|---|
| 753 | C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the
|
|---|
| 754 | position of the start of the C<$n> match and C<$+[n]> is the position
|
|---|
| 755 | of the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>. Then
|
|---|
| 756 | this code
|
|---|
| 757 |
|
|---|
| 758 | $x = "Mmm...donut, thought Homer";
|
|---|
| 759 | $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
|
|---|
| 760 | foreach $expr (1..$#-) {
|
|---|
| 761 | print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";
|
|---|
| 762 | }
|
|---|
| 763 |
|
|---|
| 764 | prints
|
|---|
| 765 |
|
|---|
| 766 | Match 1: 'Mmm' at position (0,3)
|
|---|
| 767 | Match 2: 'donut' at position (6,11)
|
|---|
| 768 |
|
|---|
| 769 | Even if there are no groupings in a regexp, it is still possible to
|
|---|
| 770 | find out what exactly matched in a string. If you use them, perl
|
|---|
| 771 | will set C<$`> to the part of the string before the match, will set C<$&>
|
|---|
| 772 | to the part of the string that matched, and will set C<$'> to the part
|
|---|
| 773 | of the string after the match. An example:
|
|---|
| 774 |
|
|---|
| 775 | $x = "the cat caught the mouse";
|
|---|
| 776 | $x =~ /cat/; # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
|
|---|
| 777 | $x =~ /the/; # $` = '', $& = 'the', $' = ' cat caught the mouse'
|
|---|
| 778 |
|
|---|
| 779 | In the second match, S<C<$` = ''> > because the regexp matched at the
|
|---|
| 780 | first character position in the string and stopped, it never saw the
|
|---|
| 781 | second 'the'. It is important to note that using C<$`> and C<$'>
|
|---|
| 782 | slows down regexp matching quite a bit, and C< $& > slows it down to a
|
|---|
| 783 | lesser extent, because if they are used in one regexp in a program,
|
|---|
| 784 | they are generated for <all> regexps in the program. So if raw
|
|---|
| 785 | performance is a goal of your application, they should be avoided.
|
|---|
| 786 | If you need them, use C<@-> and C<@+> instead:
|
|---|
| 787 |
|
|---|
| 788 | $` is the same as substr( $x, 0, $-[0] )
|
|---|
| 789 | $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
|
|---|
| 790 | $' is the same as substr( $x, $+[0] )
|
|---|
| 791 |
|
|---|
| 792 | =head2 Matching repetitions
|
|---|
| 793 |
|
|---|
| 794 | The examples in the previous section display an annoying weakness. We
|
|---|
| 795 | were only matching 3-letter words, or syllables of 4 letters or
|
|---|
| 796 | less. We'd like to be able to match words or syllables of any length,
|
|---|
| 797 | without writing out tedious alternatives like
|
|---|
| 798 | C<\w\w\w\w|\w\w\w|\w\w|\w>.
|
|---|
| 799 |
|
|---|
| 800 | This is exactly the problem the B<quantifier> metacharacters C<?>,
|
|---|
| 801 | C<*>, C<+>, and C<{}> were created for. They allow us to determine the
|
|---|
| 802 | number of repeats of a portion of a regexp we consider to be a
|
|---|
| 803 | match. Quantifiers are put immediately after the character, character
|
|---|
| 804 | class, or grouping that we want to specify. They have the following
|
|---|
| 805 | meanings:
|
|---|
| 806 |
|
|---|
| 807 | =over 4
|
|---|
| 808 |
|
|---|
| 809 | =item *
|
|---|
| 810 |
|
|---|
| 811 | C<a?> = match 'a' 1 or 0 times
|
|---|
| 812 |
|
|---|
| 813 | =item *
|
|---|
| 814 |
|
|---|
| 815 | C<a*> = match 'a' 0 or more times, i.e., any number of times
|
|---|
| 816 |
|
|---|
| 817 | =item *
|
|---|
| 818 |
|
|---|
| 819 | C<a+> = match 'a' 1 or more times, i.e., at least once
|
|---|
| 820 |
|
|---|
| 821 | =item *
|
|---|
| 822 |
|
|---|
| 823 | C<a{n,m}> = match at least C<n> times, but not more than C<m>
|
|---|
| 824 | times.
|
|---|
| 825 |
|
|---|
| 826 | =item *
|
|---|
| 827 |
|
|---|
| 828 | C<a{n,}> = match at least C<n> or more times
|
|---|
| 829 |
|
|---|
| 830 | =item *
|
|---|
| 831 |
|
|---|
| 832 | C<a{n}> = match exactly C<n> times
|
|---|
| 833 |
|
|---|
| 834 | =back
|
|---|
| 835 |
|
|---|
| 836 | Here are some examples:
|
|---|
| 837 |
|
|---|
| 838 | /[a-z]+\s+\d*/; # match a lowercase word, at least some space, and
|
|---|
| 839 | # any number of digits
|
|---|
| 840 | /(\w+)\s+\1/; # match doubled words of arbitrary length
|
|---|
| 841 | /y(es)?/i; # matches 'y', 'Y', or a case-insensitive 'yes'
|
|---|
| 842 | $year =~ /\d{2,4}/; # make sure year is at least 2 but not more
|
|---|
| 843 | # than 4 digits
|
|---|
| 844 | $year =~ /\d{4}|\d{2}/; # better match; throw out 3 digit dates
|
|---|
| 845 | $year =~ /\d{2}(\d{2})?/; # same thing written differently. However,
|
|---|
| 846 | # this produces $1 and the other does not.
|
|---|
| 847 |
|
|---|
| 848 | % simple_grep '^(\w+)\1$' /usr/dict/words # isn't this easier?
|
|---|
| 849 | beriberi
|
|---|
| 850 | booboo
|
|---|
| 851 | coco
|
|---|
| 852 | mama
|
|---|
| 853 | murmur
|
|---|
| 854 | papa
|
|---|
| 855 |
|
|---|
| 856 | For all of these quantifiers, perl will try to match as much of the
|
|---|
| 857 | string as possible, while still allowing the regexp to succeed. Thus
|
|---|
| 858 | with C</a?.../>, perl will first try to match the regexp with the C<a>
|
|---|
| 859 | present; if that fails, perl will try to match the regexp without the
|
|---|
| 860 | C<a> present. For the quantifier C<*>, we get the following:
|
|---|
| 861 |
|
|---|
| 862 | $x = "the cat in the hat";
|
|---|
| 863 | $x =~ /^(.*)(cat)(.*)$/; # matches,
|
|---|
| 864 | # $1 = 'the '
|
|---|
| 865 | # $2 = 'cat'
|
|---|
| 866 | # $3 = ' in the hat'
|
|---|
| 867 |
|
|---|
| 868 | Which is what we might expect, the match finds the only C<cat> in the
|
|---|
| 869 | string and locks onto it. Consider, however, this regexp:
|
|---|
| 870 |
|
|---|
| 871 | $x =~ /^(.*)(at)(.*)$/; # matches,
|
|---|
| 872 | # $1 = 'the cat in the h'
|
|---|
| 873 | # $2 = 'at'
|
|---|
| 874 | # $3 = '' (0 matches)
|
|---|
| 875 |
|
|---|
| 876 | One might initially guess that perl would find the C<at> in C<cat> and
|
|---|
| 877 | stop there, but that wouldn't give the longest possible string to the
|
|---|
| 878 | first quantifier C<.*>. Instead, the first quantifier C<.*> grabs as
|
|---|
| 879 | much of the string as possible while still having the regexp match. In
|
|---|
| 880 | this example, that means having the C<at> sequence with the final C<at>
|
|---|
| 881 | in the string. The other important principle illustrated here is that
|
|---|
| 882 | when there are two or more elements in a regexp, the I<leftmost>
|
|---|
| 883 | quantifier, if there is one, gets to grab as much the string as
|
|---|
| 884 | possible, leaving the rest of the regexp to fight over scraps. Thus in
|
|---|
| 885 | our example, the first quantifier C<.*> grabs most of the string, while
|
|---|
| 886 | the second quantifier C<.*> gets the empty string. Quantifiers that
|
|---|
| 887 | grab as much of the string as possible are called B<maximal match> or
|
|---|
| 888 | B<greedy> quantifiers.
|
|---|
| 889 |
|
|---|
| 890 | When a regexp can match a string in several different ways, we can use
|
|---|
| 891 | the principles above to predict which way the regexp will match:
|
|---|
| 892 |
|
|---|
| 893 | =over 4
|
|---|
| 894 |
|
|---|
| 895 | =item *
|
|---|
| 896 |
|
|---|
| 897 | Principle 0: Taken as a whole, any regexp will be matched at the
|
|---|
| 898 | earliest possible position in the string.
|
|---|
| 899 |
|
|---|
| 900 | =item *
|
|---|
| 901 |
|
|---|
| 902 | Principle 1: In an alternation C<a|b|c...>, the leftmost alternative
|
|---|
| 903 | that allows a match for the whole regexp will be the one used.
|
|---|
| 904 |
|
|---|
| 905 | =item *
|
|---|
| 906 |
|
|---|
| 907 | Principle 2: The maximal matching quantifiers C<?>, C<*>, C<+> and
|
|---|
| 908 | C<{n,m}> will in general match as much of the string as possible while
|
|---|
| 909 | still allowing the whole regexp to match.
|
|---|
| 910 |
|
|---|
| 911 | =item *
|
|---|
| 912 |
|
|---|
| 913 | Principle 3: If there are two or more elements in a regexp, the
|
|---|
| 914 | leftmost greedy quantifier, if any, will match as much of the string
|
|---|
| 915 | as possible while still allowing the whole regexp to match. The next
|
|---|
| 916 | leftmost greedy quantifier, if any, will try to match as much of the
|
|---|
| 917 | string remaining available to it as possible, while still allowing the
|
|---|
| 918 | whole regexp to match. And so on, until all the regexp elements are
|
|---|
| 919 | satisfied.
|
|---|
| 920 |
|
|---|
| 921 | =back
|
|---|
| 922 |
|
|---|
| 923 | As we have seen above, Principle 0 overrides the others - the regexp
|
|---|
| 924 | will be matched as early as possible, with the other principles
|
|---|
| 925 | determining how the regexp matches at that earliest character
|
|---|
| 926 | position.
|
|---|
| 927 |
|
|---|
| 928 | Here is an example of these principles in action:
|
|---|
| 929 |
|
|---|
| 930 | $x = "The programming republic of Perl";
|
|---|
| 931 | $x =~ /^(.+)(e|r)(.*)$/; # matches,
|
|---|
| 932 | # $1 = 'The programming republic of Pe'
|
|---|
| 933 | # $2 = 'r'
|
|---|
| 934 | # $3 = 'l'
|
|---|
| 935 |
|
|---|
| 936 | This regexp matches at the earliest string position, C<'T'>. One
|
|---|
| 937 | might think that C<e>, being leftmost in the alternation, would be
|
|---|
| 938 | matched, but C<r> produces the longest string in the first quantifier.
|
|---|
| 939 |
|
|---|
| 940 | $x =~ /(m{1,2})(.*)$/; # matches,
|
|---|
| 941 | # $1 = 'mm'
|
|---|
| 942 | # $2 = 'ing republic of Perl'
|
|---|
| 943 |
|
|---|
| 944 | Here, The earliest possible match is at the first C<'m'> in
|
|---|
| 945 | C<programming>. C<m{1,2}> is the first quantifier, so it gets to match
|
|---|
| 946 | a maximal C<mm>.
|
|---|
| 947 |
|
|---|
| 948 | $x =~ /.*(m{1,2})(.*)$/; # matches,
|
|---|
| 949 | # $1 = 'm'
|
|---|
| 950 | # $2 = 'ing republic of Perl'
|
|---|
| 951 |
|
|---|
| 952 | Here, the regexp matches at the start of the string. The first
|
|---|
| 953 | quantifier C<.*> grabs as much as possible, leaving just a single
|
|---|
| 954 | C<'m'> for the second quantifier C<m{1,2}>.
|
|---|
| 955 |
|
|---|
| 956 | $x =~ /(.?)(m{1,2})(.*)$/; # matches,
|
|---|
| 957 | # $1 = 'a'
|
|---|
| 958 | # $2 = 'mm'
|
|---|
| 959 | # $3 = 'ing republic of Perl'
|
|---|
| 960 |
|
|---|
| 961 | Here, C<.?> eats its maximal one character at the earliest possible
|
|---|
| 962 | position in the string, C<'a'> in C<programming>, leaving C<m{1,2}>
|
|---|
| 963 | the opportunity to match both C<m>'s. Finally,
|
|---|
| 964 |
|
|---|
| 965 | "aXXXb" =~ /(X*)/; # matches with $1 = ''
|
|---|
| 966 |
|
|---|
| 967 | because it can match zero copies of C<'X'> at the beginning of the
|
|---|
| 968 | string. If you definitely want to match at least one C<'X'>, use
|
|---|
| 969 | C<X+>, not C<X*>.
|
|---|
| 970 |
|
|---|
| 971 | Sometimes greed is not good. At times, we would like quantifiers to
|
|---|
| 972 | match a I<minimal> piece of string, rather than a maximal piece. For
|
|---|
| 973 | this purpose, Larry Wall created the S<B<minimal match> > or
|
|---|
| 974 | B<non-greedy> quantifiers C<??>,C<*?>, C<+?>, and C<{}?>. These are
|
|---|
| 975 | the usual quantifiers with a C<?> appended to them. They have the
|
|---|
| 976 | following meanings:
|
|---|
| 977 |
|
|---|
| 978 | =over 4
|
|---|
| 979 |
|
|---|
| 980 | =item *
|
|---|
| 981 |
|
|---|
| 982 | C<a??> = match 'a' 0 or 1 times. Try 0 first, then 1.
|
|---|
| 983 |
|
|---|
| 984 | =item *
|
|---|
| 985 |
|
|---|
| 986 | C<a*?> = match 'a' 0 or more times, i.e., any number of times,
|
|---|
| 987 | but as few times as possible
|
|---|
| 988 |
|
|---|
| 989 | =item *
|
|---|
| 990 |
|
|---|
| 991 | C<a+?> = match 'a' 1 or more times, i.e., at least once, but
|
|---|
| 992 | as few times as possible
|
|---|
| 993 |
|
|---|
| 994 | =item *
|
|---|
| 995 |
|
|---|
| 996 | C<a{n,m}?> = match at least C<n> times, not more than C<m>
|
|---|
| 997 | times, as few times as possible
|
|---|
| 998 |
|
|---|
| 999 | =item *
|
|---|
| 1000 |
|
|---|
| 1001 | C<a{n,}?> = match at least C<n> times, but as few times as
|
|---|
| 1002 | possible
|
|---|
| 1003 |
|
|---|
| 1004 | =item *
|
|---|
| 1005 |
|
|---|
| 1006 | C<a{n}?> = match exactly C<n> times. Because we match exactly
|
|---|
| 1007 | C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for
|
|---|
| 1008 | notational consistency.
|
|---|
| 1009 |
|
|---|
| 1010 | =back
|
|---|
| 1011 |
|
|---|
| 1012 | Let's look at the example above, but with minimal quantifiers:
|
|---|
| 1013 |
|
|---|
| 1014 | $x = "The programming republic of Perl";
|
|---|
| 1015 | $x =~ /^(.+?)(e|r)(.*)$/; # matches,
|
|---|
| 1016 | # $1 = 'Th'
|
|---|
| 1017 | # $2 = 'e'
|
|---|
| 1018 | # $3 = ' programming republic of Perl'
|
|---|
| 1019 |
|
|---|
| 1020 | The minimal string that will allow both the start of the string C<^>
|
|---|
| 1021 | and the alternation to match is C<Th>, with the alternation C<e|r>
|
|---|
| 1022 | matching C<e>. The second quantifier C<.*> is free to gobble up the
|
|---|
| 1023 | rest of the string.
|
|---|
| 1024 |
|
|---|
| 1025 | $x =~ /(m{1,2}?)(.*?)$/; # matches,
|
|---|
| 1026 | # $1 = 'm'
|
|---|
| 1027 | # $2 = 'ming republic of Perl'
|
|---|
| 1028 |
|
|---|
| 1029 | The first string position that this regexp can match is at the first
|
|---|
| 1030 | C<'m'> in C<programming>. At this position, the minimal C<m{1,2}?>
|
|---|
| 1031 | matches just one C<'m'>. Although the second quantifier C<.*?> would
|
|---|
| 1032 | prefer to match no characters, it is constrained by the end-of-string
|
|---|
| 1033 | anchor C<$> to match the rest of the string.
|
|---|
| 1034 |
|
|---|
| 1035 | $x =~ /(.*?)(m{1,2}?)(.*)$/; # matches,
|
|---|
| 1036 | # $1 = 'The progra'
|
|---|
| 1037 | # $2 = 'm'
|
|---|
| 1038 | # $3 = 'ming republic of Perl'
|
|---|
| 1039 |
|
|---|
| 1040 | In this regexp, you might expect the first minimal quantifier C<.*?>
|
|---|
| 1041 | to match the empty string, because it is not constrained by a C<^>
|
|---|
| 1042 | anchor to match the beginning of the word. Principle 0 applies here,
|
|---|
| 1043 | however. Because it is possible for the whole regexp to match at the
|
|---|
| 1044 | start of the string, it I<will> match at the start of the string. Thus
|
|---|
| 1045 | the first quantifier has to match everything up to the first C<m>. The
|
|---|
| 1046 | second minimal quantifier matches just one C<m> and the third
|
|---|
| 1047 | quantifier matches the rest of the string.
|
|---|
| 1048 |
|
|---|
| 1049 | $x =~ /(.??)(m{1,2})(.*)$/; # matches,
|
|---|
| 1050 | # $1 = 'a'
|
|---|
| 1051 | # $2 = 'mm'
|
|---|
| 1052 | # $3 = 'ing republic of Perl'
|
|---|
| 1053 |
|
|---|
| 1054 | Just as in the previous regexp, the first quantifier C<.??> can match
|
|---|
| 1055 | earliest at position C<'a'>, so it does. The second quantifier is
|
|---|
| 1056 | greedy, so it matches C<mm>, and the third matches the rest of the
|
|---|
| 1057 | string.
|
|---|
| 1058 |
|
|---|
| 1059 | We can modify principle 3 above to take into account non-greedy
|
|---|
| 1060 | quantifiers:
|
|---|
| 1061 |
|
|---|
| 1062 | =over 4
|
|---|
| 1063 |
|
|---|
| 1064 | =item *
|
|---|
| 1065 |
|
|---|
| 1066 | Principle 3: If there are two or more elements in a regexp, the
|
|---|
| 1067 | leftmost greedy (non-greedy) quantifier, if any, will match as much
|
|---|
| 1068 | (little) of the string as possible while still allowing the whole
|
|---|
| 1069 | regexp to match. The next leftmost greedy (non-greedy) quantifier, if
|
|---|
| 1070 | any, will try to match as much (little) of the string remaining
|
|---|
| 1071 | available to it as possible, while still allowing the whole regexp to
|
|---|
| 1072 | match. And so on, until all the regexp elements are satisfied.
|
|---|
| 1073 |
|
|---|
| 1074 | =back
|
|---|
| 1075 |
|
|---|
| 1076 | Just like alternation, quantifiers are also susceptible to
|
|---|
| 1077 | backtracking. Here is a step-by-step analysis of the example
|
|---|
| 1078 |
|
|---|
| 1079 | $x = "the cat in the hat";
|
|---|
| 1080 | $x =~ /^(.*)(at)(.*)$/; # matches,
|
|---|
| 1081 | # $1 = 'the cat in the h'
|
|---|
| 1082 | # $2 = 'at'
|
|---|
| 1083 | # $3 = '' (0 matches)
|
|---|
| 1084 |
|
|---|
| 1085 | =over 4
|
|---|
| 1086 |
|
|---|
| 1087 | =item 0
|
|---|
| 1088 |
|
|---|
| 1089 | Start with the first letter in the string 't'.
|
|---|
| 1090 |
|
|---|
| 1091 | =item 1
|
|---|
| 1092 |
|
|---|
| 1093 | The first quantifier '.*' starts out by matching the whole
|
|---|
| 1094 | string 'the cat in the hat'.
|
|---|
| 1095 |
|
|---|
| 1096 | =item 2
|
|---|
| 1097 |
|
|---|
| 1098 | 'a' in the regexp element 'at' doesn't match the end of the
|
|---|
| 1099 | string. Backtrack one character.
|
|---|
| 1100 |
|
|---|
| 1101 | =item 3
|
|---|
| 1102 |
|
|---|
| 1103 | 'a' in the regexp element 'at' still doesn't match the last
|
|---|
| 1104 | letter of the string 't', so backtrack one more character.
|
|---|
| 1105 |
|
|---|
| 1106 | =item 4
|
|---|
| 1107 |
|
|---|
| 1108 | Now we can match the 'a' and the 't'.
|
|---|
| 1109 |
|
|---|
| 1110 | =item 5
|
|---|
| 1111 |
|
|---|
| 1112 | Move on to the third element '.*'. Since we are at the end of
|
|---|
| 1113 | the string and '.*' can match 0 times, assign it the empty string.
|
|---|
| 1114 |
|
|---|
| 1115 | =item 6
|
|---|
| 1116 |
|
|---|
| 1117 | We are done!
|
|---|
| 1118 |
|
|---|
| 1119 | =back
|
|---|
| 1120 |
|
|---|
| 1121 | Most of the time, all this moving forward and backtracking happens
|
|---|
| 1122 | quickly and searching is fast. There are some pathological regexps,
|
|---|
| 1123 | however, whose execution time exponentially grows with the size of the
|
|---|
| 1124 | string. A typical structure that blows up in your face is of the form
|
|---|
| 1125 |
|
|---|
| 1126 | /(a|b+)*/;
|
|---|
| 1127 |
|
|---|
| 1128 | The problem is the nested indeterminate quantifiers. There are many
|
|---|
| 1129 | different ways of partitioning a string of length n between the C<+>
|
|---|
| 1130 | and C<*>: one repetition with C<b+> of length n, two repetitions with
|
|---|
| 1131 | the first C<b+> length k and the second with length n-k, m repetitions
|
|---|
| 1132 | whose bits add up to length n, etc. In fact there are an exponential
|
|---|
| 1133 | number of ways to partition a string as a function of length. A
|
|---|
| 1134 | regexp may get lucky and match early in the process, but if there is
|
|---|
| 1135 | no match, perl will try I<every> possibility before giving up. So be
|
|---|
| 1136 | careful with nested C<*>'s, C<{n,m}>'s, and C<+>'s. The book
|
|---|
| 1137 | I<Mastering regular expressions> by Jeffrey Friedl gives a wonderful
|
|---|
| 1138 | discussion of this and other efficiency issues.
|
|---|
| 1139 |
|
|---|
| 1140 | =head2 Building a regexp
|
|---|
| 1141 |
|
|---|
| 1142 | At this point, we have all the basic regexp concepts covered, so let's
|
|---|
| 1143 | give a more involved example of a regular expression. We will build a
|
|---|
| 1144 | regexp that matches numbers.
|
|---|
| 1145 |
|
|---|
| 1146 | The first task in building a regexp is to decide what we want to match
|
|---|
| 1147 | and what we want to exclude. In our case, we want to match both
|
|---|
| 1148 | integers and floating point numbers and we want to reject any string
|
|---|
| 1149 | that isn't a number.
|
|---|
| 1150 |
|
|---|
| 1151 | The next task is to break the problem down into smaller problems that
|
|---|
| 1152 | are easily converted into a regexp.
|
|---|
| 1153 |
|
|---|
| 1154 | The simplest case is integers. These consist of a sequence of digits,
|
|---|
| 1155 | with an optional sign in front. The digits we can represent with
|
|---|
| 1156 | C<\d+> and the sign can be matched with C<[+-]>. Thus the integer
|
|---|
| 1157 | regexp is
|
|---|
| 1158 |
|
|---|
| 1159 | /[+-]?\d+/; # matches integers
|
|---|
| 1160 |
|
|---|
| 1161 | A floating point number potentially has a sign, an integral part, a
|
|---|
| 1162 | decimal point, a fractional part, and an exponent. One or more of these
|
|---|
| 1163 | parts is optional, so we need to check out the different
|
|---|
| 1164 | possibilities. Floating point numbers which are in proper form include
|
|---|
| 1165 | 123., 0.345, .34, -1e6, and 25.4E-72. As with integers, the sign out
|
|---|
| 1166 | front is completely optional and can be matched by C<[+-]?>. We can
|
|---|
| 1167 | see that if there is no exponent, floating point numbers must have a
|
|---|
| 1168 | decimal point, otherwise they are integers. We might be tempted to
|
|---|
| 1169 | model these with C<\d*\.\d*>, but this would also match just a single
|
|---|
| 1170 | decimal point, which is not a number. So the three cases of floating
|
|---|
| 1171 | point number sans exponent are
|
|---|
| 1172 |
|
|---|
| 1173 | /[+-]?\d+\./; # 1., 321., etc.
|
|---|
| 1174 | /[+-]?\.\d+/; # .1, .234, etc.
|
|---|
| 1175 | /[+-]?\d+\.\d+/; # 1.0, 30.56, etc.
|
|---|
| 1176 |
|
|---|
| 1177 | These can be combined into a single regexp with a three-way alternation:
|
|---|
| 1178 |
|
|---|
| 1179 | /[+-]?(\d+\.\d+|\d+\.|\.\d+)/; # floating point, no exponent
|
|---|
| 1180 |
|
|---|
|
|---|