=head1 NAME

perlretut - Perl regular expressions tutorial

=head1 DESCRIPTION

This page provides a basic tutorial on understanding, creating and
using regular expressions in Perl. It serves as a complement to the
reference page on regular expressions L<perlre>. Regular expressions
are an integral part of the C<m//>, C<s///>, C<qr//> and C<split>
operators and so this tutorial also overlaps with
L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>.

Perl is widely renowned for excellence in text processing, and regular
expressions are one of the big factors behind this fame. Perl regular
expressions display an efficiency and flexibility unknown in most
other computer languages. Mastering even the basics of regular
expressions will allow you to manipulate text with surprising ease.

What is a regular expression? A regular expression is simply a string
that describes a pattern. Patterns are in common use these days;
examples are the patterns typed into a search engine to find web pages
and the patterns used to list files in a directory, e.g., C<ls *.txt>
or C<dir *.*>. In Perl, the patterns described by regular expressions
are used to search strings, extract desired parts of strings, and to
do search and replace operations.

Regular expressions have the undeserved reputation of being abstract
and difficult to understand. Regular expressions are constructed using
simple concepts like conditionals and loops and are no more difficult
to understand than the corresponding C<if> conditionals and C<while>
loops in the Perl language itself. In fact, the main challenge in
learning regular expressions is just getting used to the terse
notation used to express these concepts.

This tutorial flattens the learning curve by discussing regular
expression concepts, along with their notation, one at a time and with
many examples. The first part of the tutorial will progress from the
simplest word searches to the basic regular expression concepts. If
you master the first part, you will have all the tools needed to solve
about 98% of your needs. The second part of the tutorial is for those
comfortable with the basics and hungry for more power tools. It
discusses the more advanced regular expression operators and
introduces the latest cutting edge innovations in 5.6.0.

A note: to save time, 'regular expression' is often abbreviated as
regexp or regex. Regexp is a more natural abbreviation than regex, but
is harder to pronounce. The Perl pod documentation is evenly split on
regexp vs regex; in Perl, there is more than one way to abbreviate it.
We'll use regexp in this tutorial.

=head1 Part 1: The basics

=head2 Simple word matching

The simplest regexp is simply a word, or more generally, a string of
characters. A regexp consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

What is this perl statement all about? C<"Hello World"> is a simple
double quoted string. C<World> is the regular expression and the
C<//> enclosing C</World/> tells perl to search a string for a match.
The operator C<=~> associates the string with the regexp match and
produces a true value if the regexp matched, or false if the regexp
did not match. In our case, C<World> matches the second word in
C<"Hello World">, so the expression is true. Expressions like this
are useful in conditionals:

    if ("Hello World" =~ /World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

There are useful variations on this theme. The sense of the match can
be reversed by using the C<!~> operator:

    if ("Hello World" !~ /World/) {
        print "It doesn't match\n";
    }
    else {
        print "It matches\n";
    }

The literal string in the regexp can be replaced by a variable:

    $greeting = "World";
    if ("Hello World" =~ /$greeting/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

If you're matching against the special default variable C<$_>, the
C<$_ =~> part can be omitted:

    $_ = "Hello World";
    if (/World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

And finally, the C<//> default delimiters for a match can be changed
to arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

C</World/>, C<m!World!>, and C<m{World}> all represent the
same thing. When, e.g., C<""> is used as a delimiter, the forward
slash C<'/'> becomes an ordinary character and can be used in a regexp
without trouble.

Let's consider how different regexps would match C<"Hello World">:

    "Hello World" =~ /world/;  # doesn't match
    "Hello World" =~ /o W/;    # matches
    "Hello World" =~ /oW/;     # doesn't match
    "Hello World" =~ /World /; # doesn't match

The first regexp C<world> doesn't match because regexps are
case-sensitive. The second regexp matches because the substring
S<C<'o W'> > occurs in the string S<C<"Hello World"> >. The space
character ' ' is treated like any other character in a regexp and is
needed to match in this case. The lack of a space character is the
reason the third regexp C<'oW'> doesn't match. The fourth regexp
C<'World '> doesn't match because there is a space at the end of the
regexp, but not at the end of the string. The lesson here is that
regexps must match a part of the string I<exactly> in order for the
statement to be true.

If a regexp matches in more than one place in the string, perl will
always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

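The earliest-match rule can be checked by asking perl where a match
started. The sketch below assumes the built-in array C<@->, whose
first element C<$-[0]> holds the offset of the start of the last
successful match; the helper name C<match_offset> is made up for this
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset at which $regexp first
# matches in $string, or -1 if it does not match at all.
sub match_offset {
    my ($string, $regexp) = @_;
    return $string =~ /$regexp/ ? $-[0] : -1;
}

print match_offset("Hello World", "o"), "\n";        # 4, the 'o' in 'Hello'
print match_offset("That hat is red", "hat"), "\n";  # 1, the 'hat' in 'That'
print match_offset("Hello World", "world"), "\n";    # -1, no match
```

Offsets count from zero, so the C<'o'> of C<'Hello'> is reported at 4
rather than the later C<'o'> of C<'World'> at 7.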
With respect to character matching, there are a few more points you
need to know about. First of all, not all characters can be used 'as
is' in a match. Some characters, called B<metacharacters>, are reserved
for use in regexp notation. The metacharacters are

    {}[]()^$.|*+?\

The significance of each of these will be explained
in the rest of the tutorial, but for now, it is important only to know
that a metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
    "The interval is [0,1)." =~ /\[0,1\)\./  # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regexp, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regexp. This can lead to LTS
(leaning toothpick syndrome), however, and it is often more readable
to change delimiters.

    "/usr/bin/perl" =~ m!/usr/bin/perl!;    # easier to read

The backslash character C<'\'> is a metacharacter itself and needs to
be backslashed:

    'C:\WIN32' =~ /C:\\WIN/;   # matches

In addition to the metacharacters, there are some ASCII characters
which don't have printable character equivalents and are instead
represented by B<escape sequences>. Common examples are C<\t> for a
tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a
bell. If your string is better thought of as a sequence of arbitrary
bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape
sequence, e.g., C<\x1B> may be a more natural representation for your
bytes. Here are some examples of escapes:

    "1000\t2000" =~ m(0\t2)   # matches
    "1000\n2000" =~ /0\n20/   # matches
    "1000\t2000" =~ /\000\t2/ # doesn't match, "0" ne "\000"
    "cat" =~ /\143\x61\x74/   # matches, but a weird way to spell cat

If you've been around Perl a while, all this talk of escape sequences
may seem familiar. Similar escape sequences are used in double-quoted
strings and in fact the regexps in Perl are mostly treated as
double-quoted strings. This means that variables can be used in
regexps as well. Just like double-quoted strings, the values of the
variables in the regexp will be substituted in before the regexp is
evaluated for matching purposes. So we have:

    $foo = 'house';
    'housecat' =~ /$foo/;      # matches
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

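Because an interpolated value becomes part of the regexp, any
metacharacters it contains keep their special meaning. The sketch
below shows one way to match such a value literally, assuming Perl's
C<\Q...\E> quoting escape, which backslashes the interpolated text in
the same way as the built-in C<quotemeta> function. Both helper names
are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: interpolate $text as-is, so that any
# metacharacters in it are interpreted as regexp notation.
sub matches_as_regexp {
    my ($string, $text) = @_;
    return $string =~ /$text/ ? 1 : 0;
}

# Hypothetical helper: interpolate $text between \Q and \E, so that
# its metacharacters are matched as ordinary characters.
sub matches_literally {
    my ($string, $text) = @_;
    return $string =~ /\Q$text\E/ ? 1 : 0;
}

print matches_as_regexp("2+2=4", "2+2"), "\n";  # 0, + is a metacharacter
print matches_literally("2+2=4", "2+2"), "\n";  # 1, + is an ordinary +
```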
So far, so good. With the knowledge above you can already perform
searches with just about any literal string regexp you can dream up.
Here is a I<very simple> emulation of the Unix grep program:

    % cat > simple_grep
    #!/usr/bin/perl
    $regexp = shift;
    while (<>) {
        print if /$regexp/;
    }
    ^D

    % chmod +x simple_grep

    % simple_grep abba /usr/dict/words
    Babbage
    cabbage
    cabbages
    sabbath
    Sabbathize
    Sabbathizes
    sabbatical
    scabbard
    scabbards

This program is easy to understand. C<#!/usr/bin/perl> is the standard
way to invoke a perl program from the shell.
S<C<$regexp = shift;> > saves the first command line argument as the
regexp to be used, leaving the rest of the command line arguments to
be treated as files. S<C<< while (<>) >> > loops over all the lines in
all the files. For each line, S<C<print if /$regexp/;> > prints the
line if the regexp matches the line. In this line, both C<print> and
C</$regexp/> use the default variable C<$_> implicitly.

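The same filtering logic can be tried without any files by applying it
to a list held in memory. This is only a sketch: C<grep_lines> is a
hypothetical helper and the word list is invented, but it uses Perl's
built-in C<grep> function, which sets C<$_> to each element in turn
just as the C<while> loop does for each input line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that match $regexp, i.e. the
# lines that simple_grep would print.
sub grep_lines {
    my ($regexp, @lines) = @_;
    return grep { /$regexp/ } @lines;
}

my @words = ("Babbage\n", "cabbage\n", "rabbit\n", "sabbath\n");
print grep_lines("abba", @words);   # prints Babbage, cabbage and sabbath
```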
With all of the regexps above, if the regexp matched anywhere in the
string, it was considered a match. Sometimes, however, we'd like to
specify I<where> in the string the regexp should try to match. To do
this, we would use the B<anchor> metacharacters C<^> and C<$>. The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string. Here is how they are used:

    "housekeeper" =~ /keeper/;    # matches
    "housekeeper" =~ /^keeper/;   # doesn't match
    "housekeeper" =~ /keeper$/;   # matches
    "housekeeper\n" =~ /keeper$/; # matches

The second regexp doesn't match because C<^> constrains C<keeper> to
match only at the beginning of the string, but C<"housekeeper"> has
keeper starting in the middle. The third regexp does match, since the
C<$> constrains C<keeper> to match only at the end of the string.

When both C<^> and C<$> are used at the same time, the regexp has to
match both the beginning and the end of the string, i.e., the regexp
matches the whole string. Consider

    "keeper" =~ /^keep$/;   # doesn't match
    "keeper" =~ /^keeper$/; # matches
    ""       =~ /^$/;       # ^$ matches an empty string

The first regexp doesn't match because the string has more to it than
C<keep>. Since the second regexp is exactly the string, it
matches. Using both C<^> and C<$> in a regexp forces the complete
string to match, so it gives you complete control over which strings
match and which don't. Suppose you are looking for a fellow named
bert, off in a string by himself:

    "dogbert" =~ /bert/;   # matches, but not what you want

    "dilbert" =~ /^bert/;  # doesn't match, but ..
    "bertram" =~ /^bert/;  # matches, so still not good enough

    "bertram" =~ /^bert$/; # doesn't match, good
    "dilbert" =~ /^bert$/; # doesn't match, good
    "bert"    =~ /^bert$/; # matches, perfect

Of course, in the case of a literal string, one could just as easily
use the string equivalence S<C<$string eq 'bert'> > and it would be
more efficient. The C<^...$> regexp really becomes useful when we
add in the more powerful regexp tools below.

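One difference between C</^bert$/> and C<eq> is worth seeing in
running code: because C<$> may also match before a newline at the end
of the string, the regexp accepts C<"bert\n"> while C<eq> does not.
The helper C<only_bert> below is a hypothetical name used for this
sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: return the candidates that consist of 'bert'
# and nothing else, apart from an optional trailing newline.
sub only_bert {
    return grep { /^bert$/ } @_;
}

my @found = only_bert("dogbert", "bertram", "bert", "bert\n");
print scalar(@found), "\n";                  # 2: "bert" and "bert\n"
print "bert\n" eq 'bert' ? "eq\n" : "ne\n";  # ne
```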
=head2 Using character classes

Although one can already do quite a lot with the literal string
regexps above, we've only scratched the surface of regular expression
technology. In this and subsequent sections we will introduce regexp
concepts (and associated metacharacter notations) that will allow a
regexp to not just represent a single character sequence, but a I<whole
class> of them.

One such concept is that of a B<character class>. A character class
allows a set of possible characters, rather than just a single
character, to match at a particular point in a regexp. Character
classes are denoted by brackets C<[...]>, with the set of characters
to be possibly matched inside. Here are some examples:

    /cat/;              # matches 'cat'
    /[bcr]at/;          # matches 'bat', 'cat', or 'rat'
    /item[0123456789]/; # matches 'item0' or ... or 'item9'
    "abc" =~ /[cab]/;   # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, C<'a'> matches because the first character position in the
string is the earliest point at which the regexp can match.

    /[yY][eE][sS]/;     # match 'yes' in a case-insensitive way
                        # 'yes', 'Yes', 'YES', etc.

This regexp displays a common task: perform a case-insensitive
match. Perl provides a way of avoiding all those brackets by simply
appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;>
can be rewritten as C</yes/i;>. The C<'i'> stands for
case-insensitive and is an example of a B<modifier> of the matching
operation. We will meet other modifiers later in the tutorial.

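As a quick check that the two spellings behave alike, the sketch below
runs both against a few sample answers. The helper name
C<yes_both_ways> and the sample strings are invented for this
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: try both spellings of the case-insensitive
# match and return the two results as a pair of 0/1 flags.
sub yes_both_ways {
    my ($answer) = @_;
    my $brackets = $answer =~ /[yY][eE][sS]/ ? 1 : 0;
    my $modifier = $answer =~ /yes/i         ? 1 : 0;
    return ($brackets, $modifier);
}

for my $answer ("yes", "Yes", "YES", "no") {
    my ($brackets, $modifier) = yes_both_ways($answer);
    print "$answer: $brackets $modifier\n";
}
```

Each of the first three answers prints C<1 1> and C<no> prints
C<0 0>.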
We saw in the section above that there were ordinary characters, which
represented themselves, and special characters, which needed a
backslash C<\> to represent themselves. The same is true in a
character class, but the sets of ordinary and special characters
inside a character class are different than those outside a character
class. The special characters for a character class are C<-]\^$>. C<]>
is special because it denotes the end of a character class. C<$> is
special because it denotes a scalar variable. C<\> is special because
it is used in escape sequences, just like above. Here is how the
special characters C<]$\> are handled:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The last two are a little tricky. In C<[\$x]>, the backslash protects
the dollar sign, so the character class has two members C<$> and C<x>.
In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a
variable and substituted in double quote fashion.

The special character C<'-'> acts as a range operator within character
classes, so that a contiguous set of characters can be written as a
range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>. Some examples are

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9bx-z]aa/;  # matches '0aa', ..., '9aa',
                    # 'baa', 'xaa', 'yaa', or 'zaa'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit
    /[0-9a-zA-Z_]/; # matches a "word" character,
                    # like those in a perl variable name

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are
all equivalent.

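Ranges combine naturally with the anchors from the previous section.
The following sketch, with the hypothetical helper C<is_hex_byte>,
accepts a string only if it consists of exactly two hexadecimal
digits (apart from an optional trailing newline, which C<$> permits).

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if $text is two hexadecimal digits,
# 0 otherwise.  The class is written out twice, once per digit.
sub is_hex_byte {
    my ($text) = @_;
    return $text =~ /^[0-9a-fA-F][0-9a-fA-F]$/ ? 1 : 0;
}

print is_hex_byte("1B"), "\n";  # 1
print is_hex_byte("ff"), "\n";  # 1
print is_hex_byte("g0"), "\n";  # 0, 'g' is outside the range a-f
print is_hex_byte("7"),  "\n";  # 0, only one digit
```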
The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Now, even C<[0-9]> can be a bother to write multiple times, so in the
interest of saving keystrokes and making regexps more readable, Perl
has several abbreviations for common character classes:

=over 4

=item *

\d is a digit and represents [0-9]

=item *

\s is a whitespace character and represents [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character [^\s]

=item *

\W is a negated \w; it represents any non-word character [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes. Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

Because a period is a metacharacter, it needs to be escaped to match
as an ordinary period. Because, for example, C<\d> and C<\w> are sets
of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in
fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as
C<[\W]>. Think DeMorgan's laws.

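The claim about C<[^\d\w]> can be verified by brute force over the 128
ASCII characters. In this sketch the helper name C<disagreements> is
hypothetical; it counts the characters that one class matches and the
other does not.

```perl
use strict;
use warnings;

# Hypothetical helper: count the ASCII characters on which the two
# character classes, passed in as strings, give different answers.
sub disagreements {
    my ($class_a, $class_b) = @_;
    my $count = 0;
    for my $code (0 .. 127) {
        my $char = chr $code;
        my $in_a = $char =~ /$class_a/ ? 1 : 0;
        my $in_b = $char =~ /$class_b/ ? 1 : 0;
        $count++ if $in_a != $in_b;
    }
    return $count;
}

print disagreements('[^\d\w]', '[\W]'),   "\n";  # 0, the classes agree
print disagreements('[^\d\w]', '[\D\W]'), "\n";  # 53, they differ
```

The 53 disagreements are the 52 ASCII letters and the underscore:
each is a non-digit, so C<[\D\W]> matches it, but it is a word
character, so C<[^\d\w]> does not.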
An anchor useful in basic regexps is the S<B<word anchor> >
C<\b>. This matches a boundary between a word character and a non-word
character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /cat/;     # matches cat in 'Housecat'
    $x =~ /\bcat/;   # matches cat in 'catenates'
    $x =~ /cat\b/;   # matches cat in 'Housecat'
    $x =~ /\bcat\b/; # matches 'cat' at end of string

Note in the last example, the end of the string is considered a word
boundary.

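To confirm which C<cat> each regexp picks, the sketch below reports
the offset where the match starts, assuming the built-in C<@-> array
(C<$-[0]> is the start of the last successful match). The helper name
C<cat_offset> is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset at which $regexp matches in
# the sample sentence, or -1 if it does not match.
sub cat_offset {
    my ($regexp) = @_;
    my $x = "Housecat catenates house and cat";
    return $x =~ /$regexp/ ? $-[0] : -1;
}

print cat_offset('cat'),     "\n";  # 5,  inside 'Housecat'
print cat_offset('\bcat'),   "\n";  # 9,  start of 'catenates'
print cat_offset('cat\b'),   "\n";  # 5,  end of 'Housecat'
print cat_offset('\bcat\b'), "\n";  # 29, the final word 'cat'
```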
You might wonder why C<'.'> matches everything but C<"\n"> - why not
every character? The reason is that often one is matching against
lines and would like to ignore the newline characters. For instance,
while the string C<"\n"> represents one line, we would like to think
of it as empty. Then

    ""   =~ /^$/;    # matches
    "\n" =~ /^$/;    # matches, "\n" is ignored

    ""   =~ /./;      # doesn't match; it needs a char
    ""   =~ /^.$/;    # doesn't match; it needs a char
    "\n" =~ /^.$/;    # doesn't match; it needs a char other than "\n"
    "a"  =~ /^.$/;    # matches
    "a\n" =~ /^.$/;   # matches, ignores the "\n"

This behavior is convenient, because we usually want to ignore
newlines when we count and match characters in a line. Sometimes,
however, we want to keep track of newlines. We might even want C<^>
and C<$> to anchor at the beginning and end of lines within the
string, rather than just the beginning and end of the string. Perl
allows us to choose between ignoring and paying attention to newlines
by using the C<//s> and C<//m> modifiers. C<//s> and C<//m> stand for
single line and multi-line and they determine whether a string is to
be treated as one continuous string, or as a set of lines. The two
modifiers affect two aspects of how the regexp is interpreted: 1) how
the C<'.'> character class is defined, and 2) where the anchors C<^>
and C<$> are able to match. Here are the four possible combinations:

=over 4

=item *

no modifiers (//): Default behavior. C<'.'> matches any character
except C<"\n">. C<^> matches only at the beginning of the string and
C<$> matches only at the end or before a newline at the end.

=item *

s modifier (//s): Treat string as a single long line. C<'.'> matches
any character, even C<"\n">. C<^> matches only at the beginning of
the string and C<$> matches only at the end or before a newline at the
end.

=item *

m modifier (//m): Treat string as a set of multiple lines. C<'.'>
matches any character except C<"\n">. C<^> and C<$> are able to match
at the start or end of I<any> line within the string.

=item *

both s and m modifiers (//sm): Treat string as a single long line, but
detect multiple lines. C<'.'> matches any character, even
C<"\n">. C<^> and C<$>, however, are able to match at the start or end
of I<any> line within the string.

=back

Here are examples of C<//s> and C<//m> in action:

    $x = "There once was a girl\nWho programmed in Perl\n";

    $x =~ /^Who/;   # doesn't match, "Who" not at start of string
    $x =~ /^Who/s;  # doesn't match, "Who" not at start of string
    $x =~ /^Who/m;  # matches, "Who" at start of second line
    $x =~ /^Who/sm; # matches, "Who" at start of second line

    $x =~ /girl.Who/;   # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/s;  # matches, "." matches "\n"
492
493 $x =~ /^Who/; # doesn't match, "Who" not at start of string
494 $x =~ /^Who/s; # doesn't match, "Who" not at start of string
495 $x =~ /^Who/m; # matches, "Who" at start of second line
496 $x =~ /^Who/sm; # matches, "Who" at start of second line
497
498 $x =~ /girl.Who/; # doesn't match, "." doesn't match "\n"
499 $x =~ /girl.Who/s; # matches, "." matches "\n"