=head1 NAME

perlretut - Perl regular expressions tutorial

=head1 DESCRIPTION

This page provides a basic tutorial on understanding, creating and
using regular expressions in Perl. It serves as a complement to the
reference page on regular expressions L<perlre>. Regular expressions
are an integral part of the C<m//>, C<s///>, C<qr//> and C<split>
operators and so this tutorial also overlaps with
L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>.

Perl is widely renowned for excellence in text processing, and regular
expressions are one of the big factors behind this fame. Perl regular
expressions display an efficiency and flexibility unknown in most
other computer languages. Mastering even the basics of regular
expressions will allow you to manipulate text with surprising ease.

What is a regular expression? A regular expression is simply a string
that describes a pattern. Patterns are in common use these days;
examples are the patterns typed into a search engine to find web pages
and the patterns used to list files in a directory, e.g., C<ls *.txt>
or C<dir *.*>. In Perl, the patterns described by regular expressions
are used to search strings, extract desired parts of strings, and to
do search and replace operations.

Regular expressions have the undeserved reputation of being abstract
and difficult to understand. Regular expressions are constructed using
simple concepts like conditionals and loops and are no more difficult
to understand than the corresponding C<if> conditionals and C<while>
loops in the Perl language itself. In fact, the main challenge in
learning regular expressions is just getting used to the terse
notation used to express these concepts.

This tutorial flattens the learning curve by discussing regular
expression concepts, along with their notation, one at a time and with
many examples. The first part of the tutorial will progress from the
simplest word searches to the basic regular expression concepts. If
you master the first part, you will have all the tools needed to solve
about 98% of your needs. The second part of the tutorial is for those
comfortable with the basics and hungry for more power tools. It
discusses the more advanced regular expression operators and
introduces the latest cutting edge innovations in 5.6.0.

A note: to save time, 'regular expression' is often abbreviated as
regexp or regex. Regexp is a more natural abbreviation than regex, but
is harder to pronounce. The Perl pod documentation is evenly split on
regexp vs regex; in Perl, there is more than one way to abbreviate it.
We'll use regexp in this tutorial.

=head1 Part 1: The basics

=head2 Simple word matching

The simplest regexp is simply a word, or more generally, a string of
characters. A regexp consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

What is this perl statement all about? C<"Hello World"> is a simple
double quoted string. C<World> is the regular expression and the
C<//> enclosing C</World/> tells perl to search a string for a match.
The operator C<=~> associates the string with the regexp match and
produces a true value if the regexp matched, or false if the regexp
did not match. In our case, C<World> matches the second word in
C<"Hello World">, so the expression is true. Expressions like this
are useful in conditionals:

    if ("Hello World" =~ /World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

There are useful variations on this theme. The sense of the match can
be reversed by using the C<!~> operator:

    if ("Hello World" !~ /World/) {
        print "It doesn't match\n";
    }
    else {
        print "It matches\n";
    }

The literal string in the regexp can be replaced by a variable:

    $greeting = "World";
    if ("Hello World" =~ /$greeting/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

If you're matching against the special default variable C<$_>, the
C<$_ =~> part can be omitted:

    $_ = "Hello World";
    if (/World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

And finally, the C<//> default delimiters for a match can be changed
to arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

C</World/>, C<m!World!>, and C<m{World}> all represent the
same thing. When, e.g., C<""> is used as a delimiter, the forward
slash C<'/'> becomes an ordinary character and can be used in a regexp
without trouble.

Let's consider how different regexps would match C<"Hello World">:

    "Hello World" =~ /world/;  # doesn't match
    "Hello World" =~ /o W/;    # matches
    "Hello World" =~ /oW/;     # doesn't match
    "Hello World" =~ /World /; # doesn't match

The first regexp C<world> doesn't match because regexps are
case-sensitive. The second regexp matches because the substring
S<C<'o W'> > occurs in the string S<C<"Hello World"> >. The space
character ' ' is treated like any other character in a regexp and is
needed to match in this case. The lack of a space character is the
reason the third regexp C<'oW'> doesn't match. The fourth regexp
C<'World '> doesn't match because there is a space at the end of the
regexp, but not at the end of the string. The lesson here is that
regexps must match a part of the string I<exactly> in order for the
statement to be true.

If a regexp matches in more than one place in the string, perl will
always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'
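
One way to check the "earliest possible point" rule is to look at where
a match started. The sketch below relies on Perl's built-in C<@-> array,
whose first element C<$-[0]> holds the offset of the start of the last
successful match. The helper name C<match_start> is made up for this
illustration and is not part of the tutorial.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset at which $regexp first matches
# in $string, or -1 if it does not match at all.
sub match_start {
    my ($string, $regexp) = @_;
    return $string =~ /$regexp/ ? $-[0] : -1;
}

print match_start("Hello World", 'o'), "\n";        # the 'o' in 'Hello'
print match_start("That hat is red", 'hat'), "\n";  # the 'hat' inside 'That'
```

Offsets count from zero, so the second call reports 1 (inside C<That>)
rather than 5 (the separate word C<hat>).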

With respect to character matching, there are a few more points you
need to know about. First of all, not all characters can be used 'as
is' in a match. Some characters, called B<metacharacters>, are reserved
for use in regexp notation. The metacharacters are

    {}[]()^$.|*+?\

The significance of each of these will be explained
in the rest of the tutorial, but for now, it is important only to know
that a metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
    "The interval is [0,1)." =~ /\[0,1\)\./; # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;   # matches

In the last regexp, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regexp. This can lead to LTS
(leaning toothpick syndrome), however, and it is often more readable
to change delimiters.

    "/usr/bin/perl" =~ m!/usr/bin/perl!;    # easier to read

The backslash character C<'\'> is a metacharacter itself and needs to
be backslashed:

    'C:\WIN32' =~ /C:\\WIN/;   # matches

In addition to the metacharacters, there are some ASCII characters
which don't have printable character equivalents and are instead
represented by B<escape sequences>. Common examples are C<\t> for a
tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a
bell. If your string is better thought of as a sequence of arbitrary
bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape
sequence, e.g., C<\x1B> may be a more natural representation for your
bytes. Here are some examples of escapes:

    "1000\t2000" =~ m(0\t2);    # matches
    "1000\n2000" =~ /0\n20/;    # matches
    "1000\t2000" =~ /\000\t2/;  # doesn't match, "0" ne "\000"
    "cat" =~ /\143\x61\x74/;    # matches, but a weird way to spell cat
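
As a small illustration of the octal and hexadecimal escapes mentioned
above, the sketch below writes the ASCII escape character (decimal 27)
three ways and checks that they agree. The variable names are chosen
only for this example.

```perl
use strict;
use warnings;

# The ASCII escape character (decimal 27) spelled three ways.
my $octal = "\033";
my $hex   = "\x1B";
my $chr   = chr(27);

print "octal and hex strings are equal\n" if $octal eq $hex;
print "hex escape matches in a regexp\n"   if $chr =~ /\x1B/;
print "octal escape matches in a regexp\n" if $chr =~ /\033/;
```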

If you've been around Perl a while, all this talk of escape sequences
may seem familiar. Similar escape sequences are used in double-quoted
strings and in fact the regexps in Perl are mostly treated as
double-quoted strings. This means that variables can be used in
regexps as well. Just like double-quoted strings, the values of the
variables in the regexp will be substituted in before the regexp is
evaluated for matching purposes. So we have:

    $foo = 'house';
    'housecat' =~ /$foo/;      # matches
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches
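
Because the variable's value is substituted in before matching, any
metacharacters it contains are active in the regexp. The sketch below
contrasts a plain interpolation with one wrapped in C<\Q...\E>, which
backslashes the metacharacters in the interpolated text. C<\Q...\E> has
not been introduced in this tutorial so far; it is shown here only as
one possible way to get a literal match, and the variable names are
illustrative.

```perl
use strict;
use warnings;

my $sum = '2+2';    # the value contains the metacharacter '+'

# Plain interpolation: the regexp is /2+2/, so '+' is a metacharacter.
my $raw    = "2+2=4" =~ /$sum/     ? 1 : 0;

# Quoted interpolation: the regexp behaves like /2\+2/.
my $quoted = "2+2=4" =~ /\Q$sum\E/ ? 1 : 0;

print "raw: $raw, quoted: $quoted\n";
```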

So far, so good. With the knowledge above you can already perform
searches with just about any literal string regexp you can dream up.
Here is a I<very simple> emulation of the Unix grep program:

    % cat > simple_grep
    #!/usr/bin/perl
    $regexp = shift;
    while (<>) {
        print if /$regexp/;
    }
    ^D

    % chmod +x simple_grep

    % simple_grep abba /usr/dict/words
    Babbage
    cabbage
    cabbages
    sabbath
    Sabbathize
    Sabbathizes
    sabbatical
    scabbard
    scabbards

This program is easy to understand. C<#!/usr/bin/perl> is the standard
way to invoke a perl program from the shell.
S<C<$regexp = shift;> > saves the first command line argument as the
regexp to be used, leaving the rest of the command line arguments to
be treated as files. S<C<< while (<>) >> > loops over all the lines in
all the files. For each line, S<C<print if /$regexp/;> > prints the
line if the regexp matches the line. In this line, both C<print> and
C</$regexp/> use the default variable C<$_> implicitly.
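
The same loop can be tried without a word list on disk. The following
is a minimal sketch that applies the idea to an in-memory list of lines;
the subroutine name C<grep_lines> and the sample words are invented for
this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that match $regexp, in order.
sub grep_lines {
    my ($regexp, @lines) = @_;
    my @hits;
    for (@lines) {                    # each line is aliased to $_ in turn
        push @hits, $_ if /$regexp/;  # the match uses $_ implicitly
    }
    return @hits;
}

my @words = ("Babbage\n", "cabbage\n", "carrot\n", "sabbath\n");
print grep_lines('abba', @words);     # prints every line containing 'abba'
```

As in C<simple_grep>, the match inside the loop tests C<$_>, so the
pattern never has to name the line explicitly.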

With all of the regexps above, if the regexp matched anywhere in the
string, it was considered a match. Sometimes, however, we'd like to
specify I<where> in the string the regexp should try to match. To do
this, we would use the B<anchor> metacharacters C<^> and C<$>. The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string. Here is how they are used:

    "housekeeper" =~ /keeper/;    # matches
    "housekeeper" =~ /^keeper/;   # doesn't match
    "housekeeper" =~ /keeper$/;   # matches
    "housekeeper\n" =~ /keeper$/; # matches

The second regexp doesn't match because C<^> constrains C<keeper> to
match only at the beginning of the string, but C<"housekeeper"> has
keeper starting in the middle. The third regexp does match, since the
C<$> constrains C<keeper> to match only at the end of the string.

When both C<^> and C<$> are used at the same time, the regexp has to
match both the beginning and the end of the string, i.e., the regexp
matches the whole string. Consider

    "keeper" =~ /^keep$/;      # doesn't match
    "keeper" =~ /^keeper$/;    # matches
    ""       =~ /^$/;          # ^$ matches an empty string

The first regexp doesn't match because the string has more to it than
C<keep>. Since the second regexp is exactly the string, it
matches. Using both C<^> and C<$> in a regexp forces the complete
string to match, so it gives you complete control over which strings
match and which don't. Suppose you are looking for a fellow named
bert, off in a string by himself:

    "dogbert" =~ /bert/;   # matches, but not what you want

    "dilbert" =~ /^bert/;  # doesn't match, but ..
    "bertram" =~ /^bert/;  # matches, so still not good enough

    "bertram" =~ /^bert$/; # doesn't match, good
    "dilbert" =~ /^bert$/; # doesn't match, good
    "bert"    =~ /^bert$/; # matches, perfect

Of course, in the case of a literal string, one could just as easily
use the string equivalence S<C<$string eq 'bert'> > and it would be
more efficient. The C<^...$> regexp really becomes useful when we
add in the more powerful regexp tools below.
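
The regexp and C<eq> tests are close but not identical: because C<$>
also matches before a newline at the end of the string, C</^bert$/>
accepts C<"bert\n"> while C<eq> does not. The sketch below compares the
two; the subroutine names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers: test a string with the anchored regexp and with eq.
sub is_bert_regexp { my ($s) = @_; return $s =~ /^bert$/ ? 1 : 0 }
sub is_bert_eq     { my ($s) = @_; return $s eq 'bert'   ? 1 : 0 }

for my $string ("bert", "bertram", "dilbert") {
    print "$string: regexp ", is_bert_regexp($string),
          ", eq ", is_bert_eq($string), "\n";
}

# The one case where the two disagree: a trailing newline.
print "with newline: regexp ", is_bert_regexp("bert\n"),
      ", eq ", is_bert_eq("bert\n"), "\n";
```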

=head2 Using character classes

Although one can already do quite a lot with the literal string
regexps above, we've only scratched the surface of regular expression
technology. In this and subsequent sections we will introduce regexp
concepts (and associated metacharacter notations) that will allow a
regexp to not just represent a single character sequence, but a I<whole
class> of them.

One such concept is that of a B<character class>. A character class
allows a set of possible characters, rather than just a single
character, to match at a particular point in a regexp. Character
classes are denoted by brackets C<[...]>, with the set of characters
to be possibly matched inside. Here are some examples:

    /cat/;              # matches 'cat'
    /[bcr]at/;          # matches 'bat', 'cat', or 'rat'
    /item[0123456789]/; # matches 'item0' or ... or 'item9'
    "abc" =~ /[cab]/;   # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, C<'a'> matches because the first character position in the
string is the earliest point at which the regexp can match.

    /[yY][eE][sS]/;     # match 'yes' in a case-insensitive way
                        # 'yes', 'Yes', 'YES', etc.

This regexp displays a common task: perform a case-insensitive
match. Perl provides a way of avoiding all those brackets by simply
appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;>
can be rewritten as C</yes/i;>. The C<'i'> stands for
case-insensitive and is an example of a B<modifier> of the matching
operation. We will meet other modifiers later in the tutorial.
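
To confirm that the two spellings behave alike, the sketch below runs
both over a handful of sample answers and counts any disagreement. The
sample strings and variable names are chosen just for this illustration.

```perl
use strict;
use warnings;

my $mismatches = 0;
for my $answer ('yes', 'Yes', 'YES', 'yEs', 'no', 'eyes') {
    my $with_classes  = $answer =~ /[yY][eE][sS]/ ? 1 : 0;
    my $with_modifier = $answer =~ /yes/i         ? 1 : 0;
    $mismatches++ if $with_classes != $with_modifier;
    print "$answer: classes=$with_classes modifier=$with_modifier\n";
}
print "mismatches: $mismatches\n";
```

Note that C<'eyes'> matches under both forms, since neither regexp is
anchored.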

We saw in the section above that there were ordinary characters, which
represented themselves, and special characters, which needed a
backslash C<\> to represent themselves. The same is true in a
character class, but the sets of ordinary and special characters
inside a character class are different than those outside a character
class. The special characters for a character class are C<-]\^$>. C<]>
is special because it denotes the end of a character class. C<$> is
special because it denotes a scalar variable. C<\> is special because
it is used in escape sequences, just like above. Here is how the
special characters C<]$\> are handled:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The last two are a little tricky. In C<[\$x]>, the backslash protects
the dollar sign, so the character class has two members C<$> and C<x>.
In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a
variable and substituted in double quote fashion.

The special character C<'-'> acts as a range operator within character
classes, so that a contiguous set of characters can be written as a
range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>. Some examples are

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9bx-z]aa/;  # matches '0aa', ..., '9aa',
                    # 'baa', 'xaa', 'yaa', or 'zaa'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit
    /[0-9a-zA-Z_]/; # matches a "word" character,
                    # like those in a perl variable name

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are
all equivalent.
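
The equivalence of the three spellings can be checked directly. This is
a small sketch that tries each class against a few sample characters and
counts disagreements; the sample characters and variable names are
arbitrary choices for this example.

```perl
use strict;
use warnings;

my $mismatches = 0;
for my $char ('a', 'b', '-', 'c', '0') {
    my $first  = $char =~ /[-ab]/  ? 1 : 0;   # '-' in first position
    my $last   = $char =~ /[ab-]/  ? 1 : 0;   # '-' in last position
    my $escape = $char =~ /[a\-b]/ ? 1 : 0;   # '-' backslashed
    $mismatches++ unless $first == $last && $last == $escape;
    print "'$char': $first $last $escape\n";
}
print "mismatches: $mismatches\n";
```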

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Now, even C<[0-9]> can be a bother to write multiple times, so in the
interest of saving keystrokes and making regexps more readable, Perl
has several abbreviations for common character classes:

=over 4

=item *

\d is a digit and represents [0-9]

=item *

\s is a whitespace character and represents [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character [^\s]

=item *

\W is a negated \w; it represents any non-word character [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes. Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

Because a period is a metacharacter, it needs to be escaped to match
as an ordinary period. Because, for example, C<\d> and C<\w> are sets
of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in
fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as
C<[\W]>. Think DeMorgan's laws.
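
The difference can be verified by brute force. The sketch below, which
assumes it is enough to look at the 128 ASCII characters, compares
C<[^\d\w]> with C<\W> and C<[\D\W]> with C<\D> character by character.
The variable names are illustrative only.

```perl
use strict;
use warnings;

my ($same_as_W, $same_as_D) = (1, 1);
for my $code (0 .. 127) {                      # every ASCII character
    my $char      = chr $code;
    my $neg_union = $char =~ /[^\d\w]/ ? 1 : 0; # not (digit or word)
    my $union_neg = $char =~ /[\D\W]/  ? 1 : 0; # non-digit or non-word
    $same_as_W = 0 if $neg_union != ($char =~ /\W/ ? 1 : 0);
    $same_as_D = 0 if $union_neg != ($char =~ /\D/ ? 1 : 0);
}

print "[^\\d\\w] behaves like \\W\n" if $same_as_W;
print "[\\D\\W] behaves like \\D\n"  if $same_as_D;
print "'a' matches [\\D\\W] but not [^\\d\\w]\n"
    if 'a' =~ /[\D\W]/ && 'a' !~ /[^\d\w]/;
```

Since every digit is also a word character, the union C<[\D\W]> adds
nothing to C<\D>, which is why a letter such as C<'a'> matches it.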

An anchor useful in basic regexps is the S<B<word anchor> >
C<\b>. This matches a boundary between a word character and a non-word
character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /cat/;     # matches cat in 'Housecat'
    $x =~ /\bcat/;   # matches cat in 'catenates'
    $x =~ /cat\b/;   # matches cat in 'Housecat'
    $x =~ /\bcat\b/; # matches 'cat' at end of string

Note in the last example, the end of the string is considered a word
boundary.
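
To see which C<cat> each regexp picks, one can record where the match
starts. The sketch below uses the built-in C<@-> array (C<$-[0]> is the
start offset of the last successful match); the C<%offset> hash is just
a name chosen for this example.

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";

my %offset;
for my $regexp ('cat', '\bcat', 'cat\b', '\bcat\b') {
    # Single quotes keep the backslash, so \b reaches the regexp intact.
    $offset{$regexp} = $x =~ /$regexp/ ? $-[0] : -1;
    print "$regexp starts at offset $offset{$regexp}\n";
}
```

Counting from zero, offset 5 is the C<cat> in C<Housecat>, offset 9 is
the start of C<catenates>, and offset 29 is the final word C<cat>.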

You might wonder why C<'.'> matches everything but C<"\n"> - why not
every character? The reason is that often one is matching against
lines and would like to ignore the newline characters. For instance,
while the string C<"\n"> represents one line, we would like to think
of it as empty. Then

    ""   =~ /^$/;    # matches
    "\n" =~ /^$/;    # matches, "\n" is ignored

    ""   =~ /./;      # doesn't match; it needs a char
    ""   =~ /^.$/;    # doesn't match; it needs a char
    "\n" =~ /^.$/;    # doesn't match; it needs a char other than "\n"
    "a"  =~ /^.$/;    # matches
    "a\n"  =~ /^.$/;  # matches, ignores the "\n"

This behavior is convenient, because we usually want to ignore
newlines when we count and match characters in a line. Sometimes,
however, we want to keep track of newlines. We might even want C<^>
and C<$> to anchor at the beginning and end of lines within the
string, rather than just the beginning and end of the string. Perl
allows us to choose between ignoring and paying attention to newlines
by using the C<//s> and C<//m> modifiers. C<//s> and C<//m> stand for
single line and multi-line and they determine whether a string is to
be treated as one continuous string, or as a set of lines. The two
modifiers affect two aspects of how the regexp is interpreted: 1) how
the C<'.'> character class is defined, and 2) where the anchors C<^>
and C<$> are able to match. Here are the four possible combinations:

=over 4

=item *

no modifiers (//): Default behavior. C<'.'> matches any character
except C<"\n">. C<^> matches only at the beginning of the string and
C<$> matches only at the end or before a newline at the end.

=item *

s modifier (//s): Treat string as a single long line. C<'.'> matches
any character, even C<"\n">. C<^> matches only at the beginning of
the string and C<$> matches only at the end or before a newline at the
end.

=item *

m modifier (//m): Treat string as a set of multiple lines. C<'.'>
matches any character except C<"\n">. C<^> and C<$> are able to match
at the start or end of I<any> line within the string.

=item *

both s and m modifiers (//sm): Treat string as a single long line, but
detect multiple lines. C<'.'> matches any character, even
C<"\n">. C<^> and C<$>, however, are able to match at the start or end
of I<any> line within the string.

=back

Here are examples of C<//s> and C<//m> in action:

    $x = "There once was a girl\nWho programmed in Perl\n";

    $x =~ /^Who/;   # doesn't match, "Who" not at start of string
    $x =~ /^Who/s;  # doesn't match, "Who" not at start of string
    $x =~ /^Who/m;  # matches, "Who" at start of second line
    $x =~ /^Who/sm; # matches, "Who" at start of second line

    $x =~ /girl.Who/;   # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/s;  # matches, "." matches "\n"