perlretut.pod@ 3368

Last change on this file since 3368 was 3181, checked in by bird, 19 years ago
perl 5.8.8

1	=head1 NAME
2
3	perlretut - Perl regular expressions tutorial
4
5	=head1 DESCRIPTION
6
7	This page provides a basic tutorial on understanding, creating and
8	using regular expressions in Perl. It serves as a complement to the
9	reference page on regular expressions L<perlre>. Regular expressions
10	are an integral part of the C<m//>, C<s///>, C<qr//> and C<split>
11	operators and so this tutorial also overlaps with
12	L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>.
13
14	Perl is widely renowned for excellence in text processing, and regular
15	expressions are one of the big factors behind this fame. Perl regular
16	expressions display an efficiency and flexibility unknown in most
17	other computer languages. Mastering even the basics of regular
18	expressions will allow you to manipulate text with surprising ease.
19
20	What is a regular expression? A regular expression is simply a string
21	that describes a pattern. Patterns are in common use these days;
22	examples are the patterns typed into a search engine to find web pages
23	and the patterns used to list files in a directory, e.g., C<ls *.txt>
24	or C<dir *.*>. In Perl, the patterns described by regular expressions
25	are used to search strings, extract desired parts of strings, and to
26	do search and replace operations.
27
28	Regular expressions have the undeserved reputation of being abstract
29	and difficult to understand. Regular expressions are constructed using
30	simple concepts like conditionals and loops and are no more difficult
31	to understand than the corresponding C<if> conditionals and C<while>
32	loops in the Perl language itself. In fact, the main challenge in
33	learning regular expressions is just getting used to the terse
34	notation used to express these concepts.
35
36	This tutorial flattens the learning curve by discussing regular
37	expression concepts, along with their notation, one at a time and with
38	many examples. The first part of the tutorial will progress from the
39	simplest word searches to the basic regular expression concepts. If
40	you master the first part, you will have all the tools needed to solve
41	about 98% of your needs. The second part of the tutorial is for those
42	comfortable with the basics and hungry for more power tools. It
43	discusses the more advanced regular expression operators and
44	introduces the latest cutting-edge innovations in 5.6.0.
45
46	A note: to save time, 'regular expression' is often abbreviated as
47	regexp or regex. Regexp is a more natural abbreviation than regex, but
48	is harder to pronounce. The Perl pod documentation is evenly split on
49	regexp vs regex; in Perl, there is more than one way to abbreviate it.
50	We'll use regexp in this tutorial.
51
52	=head1 Part 1: The basics
53
54	=head2 Simple word matching
55
56	The simplest regexp is simply a word, or more generally, a string of
57	characters. A regexp consisting of a word matches any string that
58	contains that word:
59
60	"Hello World" =~ /World/; # matches
61
62	What is this perl statement all about? C<"Hello World"> is a simple
63	double quoted string. C<World> is the regular expression and the
64	C<//> enclosing C</World/> tells perl to search a string for a match.
65	The operator C<=~> associates the string with the regexp match and
66	produces a true value if the regexp matched, or false if the regexp
67	did not match. In our case, C<World> matches the second word in
68	C<"Hello World">, so the expression is true. Expressions like this
69	are useful in conditionals:
70
71	if ("Hello World" =~ /World/) {
72	print "It matches\n";
73	}
74	else {
75	print "It doesn't match\n";
76	}
77
78	There are useful variations on this theme. The sense of the match can
79	be reversed by using the C<!~> operator:
80
81	if ("Hello World" !~ /World/) {
82	print "It doesn't match\n";
83	}
84	else {
85	print "It matches\n";
86	}
87
88	The literal string in the regexp can be replaced by a variable:
89
90	$greeting = "World";
91	if ("Hello World" =~ /$greeting/) {
92	print "It matches\n";
93	}
94	else {
95	print "It doesn't match\n";
96	}
97
98	If you're matching against the special default variable C<$_>, the
99	C<$_ =~> part can be omitted:
100
101	$_ = "Hello World";
102	if (/World/) {
103	print "It matches\n";
104	}
105	else {
106	print "It doesn't match\n";
107	}
108
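The same shorthand is convenient inside a loop, where C<foreach> sets
C<$_> to each element in turn. Here is a minimal sketch; the list of
strings and the C<@hits> array are made up for illustration:

```perl
use strict;
use warnings;

my @hits;
foreach ("Hello World", "Hello Perl", "World Cup") {
    # /World/ is matched against $_, the current list element
    push @hits, $_ if /World/;
}
print "$_\n" foreach @hits;
```

Only the strings containing C<World> should end up in C<@hits>.
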
109	And finally, the C<//> default delimiters for a match can be changed
110	to arbitrary delimiters by putting an C<'m'> out front:
111
112	"Hello World" =~ m!World!; # matches, delimited by '!'
113	"Hello World" =~ m{World}; # matches, note the matching '{}'
114	"/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
115	# '/' becomes an ordinary char
116
117	C</World/>, C<m!World!>, and C<m{World}> all represent the
118	same thing. When, e.g., C<""> is used as a delimiter, the forward
119	slash C<'/'> becomes an ordinary character and can be used in a regexp
120	without trouble.
121
122	Let's consider how different regexps would match C<"Hello World">:
123
124	"Hello World" =~ /world/; # doesn't match
125	"Hello World" =~ /o W/; # matches
126	"Hello World" =~ /oW/; # doesn't match
127	"Hello World" =~ /World /; # doesn't match
128
129	The first regexp C<world> doesn't match because regexps are
130	case-sensitive. The second regexp matches because the substring
131	S<C<'o W'> > occurs in the string S<C<"Hello World"> >. The space
132	character ' ' is treated like any other character in a regexp and is
133	needed to match in this case. The lack of a space character is the
134	reason the third regexp C<'oW'> doesn't match. The fourth regexp
135	C<'World '> doesn't match because there is a space at the end of the
136	regexp, but not at the end of the string. The lesson here is that
137	regexps must match a part of the string I<exactly> in order for the
138	statement to be true.
139
140	If a regexp matches in more than one place in the string, perl will
141	always match at the earliest possible point in the string:
142
143	"Hello World" =~ /o/; # matches 'o' in 'Hello'
144	"That hat is red" =~ /hat/; # matches 'hat' in 'That'
145
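One way to see where a match began is the special array C<@->,
documented in L<perlvar>: after a successful match, C<$-[0]> holds the
offset of the start of the match. The sketch below is only an
illustration, and the variable names are our own:

```perl
use strict;
use warnings;

my @offsets;
foreach my $pair (["Hello World", "o"], ["That hat is red", "hat"]) {
    my ($string, $regexp) = @$pair;
    # $-[0] is the offset where the last successful match started
    push @offsets, ($string =~ /$regexp/) ? $-[0] : -1;
}
print "@offsets\n";    # prints "4 1"
```

Offsets count from zero, so C<4> is the C<'o'> in C<'Hello'> and C<1>
is the C<'hat'> inside C<'That'>.
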
146	With respect to character matching, there are a few more points you
147	need to know about. First of all, not all characters can be used 'as
148	is' in a match. Some characters, called B<metacharacters>, are reserved
149	for use in regexp notation. The metacharacters are
150
151	{}[]()^$.|*+?\
152
153	The significance of each of these will be explained
154	in the rest of the tutorial, but for now, it is important only to know
155	that a metacharacter can be matched by putting a backslash before it:
156
157	"2+2=4" =~ /2+2/; # doesn't match, + is a metacharacter
158	"2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary +
159	"The interval is [0,1)." =~ /[0,1)./ # is a syntax error!
160	"The interval is [0,1)." =~ /\[0,1\)\./ # matches
161	"/usr/bin/perl" =~ /\/usr\/bin\/perl/; # matches
162
163	In the last regexp, the forward slash C<'/'> is also backslashed,
164	because it is used to delimit the regexp. This can lead to LTS
165	(leaning toothpick syndrome), however, and it is often more readable
166	to change delimiters.
167
168	"/usr/bin/perl" =~ m!/usr/bin/perl!; # easier to read
169
170	The backslash character C<'\'> is a metacharacter itself and needs to
171	be backslashed:
172
173	'C:\WIN32' =~ /C:\\WIN/; # matches
174
175	In addition to the metacharacters, there are some ASCII characters
176	which don't have printable character equivalents and are instead
177	represented by B<escape sequences>. Common examples are C<\t> for a
178	tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a
179	bell. If your string is better thought of as a sequence of arbitrary
180	bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape
181	sequence, e.g., C<\x1B> may be a more natural representation for your
182	bytes. Here are some examples of escapes:
183
184	"1000\t2000" =~ m(0\t2) # matches
185	"1000\n2000" =~ /0\n20/ # matches
186	"1000\t2000" =~ /\000\t2/ # doesn't match, "0" ne "\000"
187	"cat" =~ /\143\x61\x74/ # matches, but a weird way to spell cat
188
189	If you've been around Perl a while, all this talk of escape sequences
190	may seem familiar. Similar escape sequences are used in double-quoted
191	strings and in fact the regexps in Perl are mostly treated as
192	double-quoted strings. This means that variables can be used in
193	regexps as well. Just like double-quoted strings, the values of the
194	variables in the regexp will be substituted in before the regexp is
195	evaluated for matching purposes. So we have:
196
197	$foo = 'house';
198	'housecat' =~ /$foo/; # matches
199	'cathouse' =~ /cat$foo/; # matches
200	'housecat' =~ /${foo}cat/; # matches
201
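One consequence of this substitution is that any metacharacters stored
in the variable are still treated as regexp notation. When the variable
holds text that should match literally, the C<\Q...\E> escape or the
built-in C<quotemeta> function backslashes the non-word characters for
you. A small sketch, with variable names chosen just for this example:

```perl
use strict;
use warnings;

my $sum = '2+2';

# Interpolated as is, the '+' in $sum acts as a metacharacter
my $plain = ("2+2=4" =~ /$sum/) ? 1 : 0;

# \Q...\E backslashes the metacharacters in $sum before matching
my $quoted = ("2+2=4" =~ /\Q$sum\E/) ? 1 : 0;

# quotemeta does the same job as a function call
my $escaped = quotemeta($sum);

print "plain=$plain quoted=$quoted escaped=$escaped\n";
```

The unquoted form fails for the same reason C</2+2/> fails above, while
the quoted form behaves like C</2\+2/>.
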
202	So far, so good. With the knowledge above you can already perform
203	searches with just about any literal string regexp you can dream up.
204	Here is a I<very simple> emulation of the Unix grep program:
205
206	% cat > simple_grep
207	#!/usr/bin/perl
208	$regexp = shift;
209	while (<>) {
210	print if /$regexp/;
211	}
212	^D
213
214	% chmod +x simple_grep
215
216	% simple_grep abba /usr/dict/words
217	Babbage
218	cabbage
219	cabbages
220	sabbath
221	Sabbathize
222	Sabbathizes
223	sabbatical
224	scabbard
225	scabbards
226
227	This program is easy to understand. C<#!/usr/bin/perl> is the standard
228	way to invoke a perl program from the shell.
229	S<C<$regexp = shift;> > saves the first command line argument as the
230	regexp to be used, leaving the rest of the command line arguments to
231	be treated as files. S<C<< while (<>) >> > loops over all the lines in
232	all the files. For each line, S<C<print if /$regexp/;> > prints the
233	line if the regexp matches the line. In this line, both C<print> and
234	C</$regexp/> use the default variable C<$_> implicitly.
235
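The same matching logic can be tried without files or a command line.
The following sketch wraps it in a subroutine; the name
C<simple_grep_lines> and the word list are invented for this
illustration and are not part of the program above:

```perl
use strict;
use warnings;

# Return the lines that match $regexp, as simple_grep does for files
sub simple_grep_lines {
    my ($regexp, @lines) = @_;
    my @matched;
    foreach (@lines) {
        # /$regexp/ is tested against $_, the current line
        push @matched, $_ if /$regexp/;
    }
    return @matched;
}

my @found = simple_grep_lines('abba',
    qw(cabbage abacus sabbath scabbard rabbit));
print "$_\n" foreach @found;
```

Of the five words, only the three containing C<abba> are returned.
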
236	With all of the regexps above, if the regexp matched anywhere in the
237	string, it was considered a match. Sometimes, however, we'd like to
238	specify I<where> in the string the regexp should try to match. To do
239	this, we would use the B<anchor> metacharacters C<^> and C<$>. The
240	anchor C<^> means match at the beginning of the string and the anchor
241	C<$> means match at the end of the string, or before a newline at the
242	end of the string. Here is how they are used:
243
244	"housekeeper" =~ /keeper/; # matches
245	"housekeeper" =~ /^keeper/; # doesn't match
246	"housekeeper" =~ /keeper$/; # matches
247	"housekeeper\n" =~ /keeper$/; # matches
248
249	The second regexp doesn't match because C<^> constrains C<keeper> to
250	match only at the beginning of the string, but C<"housekeeper"> has
251	keeper starting in the middle. The third regexp does match, since the
252	C<$> constrains C<keeper> to match only at the end of the string.
253
254	When both C<^> and C<$> are used at the same time, the regexp has to
255	match both the beginning and the end of the string, i.e., the regexp
256	matches the whole string. Consider
257
258	"keeper" =~ /^keep$/; # doesn't match
259	"keeper" =~ /^keeper$/; # matches
260	"" =~ /^$/; # ^$ matches an empty string
261
262	The first regexp doesn't match because the string has more to it than
263	C<keep>. Since the second regexp is exactly the string, it
264	matches. Using both C<^> and C<$> in a regexp forces the complete
265	string to match, so it gives you complete control over which strings
266	match and which don't. Suppose you are looking for a fellow named
267	bert, off in a string by himself:
268
269	"dogbert" =~ /bert/; # matches, but not what you want
270
271	"dilbert" =~ /^bert/; # doesn't match, but ..
272	"bertram" =~ /^bert/; # matches, so still not good enough
273
274	"bertram" =~ /^bert$/; # doesn't match, good
275	"dilbert" =~ /^bert$/; # doesn't match, good
276	"bert" =~ /^bert$/; # matches, perfect
277
278	Of course, in the case of a literal string, one could just as easily
279	use the string equivalence S<C<$string eq 'bert'> > and it would be
280	more efficient. The C<^...$> regexp really becomes useful when we
281	add in the more powerful regexp tools below.
282
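The two tests are not quite interchangeable, though. Because C<$> also
matches before a newline at the end of the string, C</^bert$/> accepts
a string with a trailing newline that C<eq> rejects. A brief sketch,
where the two result hashes exist only for illustration:

```perl
use strict;
use warnings;

my (%by_regexp, %by_eq);
foreach my $string ("bert", "bert\n", "bertram") {
    $by_regexp{$string} = ($string =~ /^bert$/) ? 1 : 0;
    $by_eq{$string}     = ($string eq 'bert')   ? 1 : 0;
}

# "bert\n" matches the regexp but is not eq 'bert'
printf "regexp=%d eq=%d\n", $by_regexp{"bert\n"}, $by_eq{"bert\n"};
```
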
283	=head2 Using character classes
284
285	Although one can already do quite a lot with the literal string
286	regexps above, we've only scratched the surface of regular expression
287	technology. In this and subsequent sections we will introduce regexp
288	concepts (and associated metacharacter notations) that will allow a
289	regexp to not just represent a single character sequence, but a I<whole
290	class> of them.
291
292	One such concept is that of a B<character class>. A character class
293	allows a set of possible characters, rather than just a single
294	character, to match at a particular point in a regexp. Character
295	classes are denoted by brackets C<[...]>, with the set of characters
296	to be possibly matched inside. Here are some examples:
297
298	/cat/; # matches 'cat'
299	/[bcr]at/; # matches 'bat', 'cat', or 'rat'
300	/item[0123456789]/; # matches 'item0' or ... or 'item9'
301	"abc" =~ /[cab]/; # matches 'a'
302
303	In the last statement, even though C<'c'> is the first character in
304	the class, C<'a'> matches because the first character position in the
305	string is the earliest point at which the regexp can match.
306
307	/[yY][eE][sS]/; # match 'yes' in a case-insensitive way
308	# 'yes', 'Yes', 'YES', etc.
309
310	This regexp displays a common task: perform a case-insensitive
311	match. Perl provides a way of avoiding all those brackets by simply
312	appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;>
313	can be rewritten as C</yes/i;>. The C<'i'> stands for
314	case-insensitive and is an example of a B<modifier> of the matching
315	operation. We will meet other modifiers later in the tutorial.
316
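As a quick check that the two spellings agree, the sketch below filters
a made-up list of answers with each form, using Perl's built-in
C<grep> function:

```perl
use strict;
use warnings;

my @answers = ("yes", "Yes", "YES", "yEs", "no");

# Spell out both cases of every letter by hand ...
my @with_classes = grep { /[yY][eE][sS]/ } @answers;

# ... or let the 'i' modifier handle the case for us
my @with_modifier = grep { /yes/i } @answers;

print "classes:  @with_classes\n";
print "modifier: @with_modifier\n";
```

Both lists should contain every answer except C<"no">.
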
317	We saw in the section above that there were ordinary characters, which
318	represented themselves, and special characters, which needed a
319	backslash C<\> to represent themselves. The same is true in a
320	character class, but the sets of ordinary and special characters
321	inside a character class are different than those outside a character
322	class. The special characters for a character class are C<-]\^$>. C<]>
323	is special because it denotes the end of a character class. C<$> is
324	special because it denotes a scalar variable. C<\> is special because
325	it is used in escape sequences, just like above. Here is how the
326	special characters C<]$\> are handled:
327
328	/[\]c]def/; # matches ']def' or 'cdef'
329	$x = 'bcr';
330	/[$x]at/; # matches 'bat', 'cat', or 'rat'
331	/[\$x]at/; # matches '$at' or 'xat'
332	/[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'
333
334	The last two are a little tricky. In C<[\$x]>, the backslash protects
335	the dollar sign, so the character class has two members C<$> and C<x>.
336	In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a
337	variable and substituted in double quote fashion.
338
339	The special character C<'-'> acts as a range operator within character
340	classes, so that a contiguous set of characters can be written as a
341	range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]>
342	become the svelte C<[0-9]> and C<[a-z]>. Some examples are
343
344	/item[0-9]/; # matches 'item0' or ... or 'item9'
345	/[0-9bx-z]aa/; # matches '0aa', ..., '9aa',
346	# 'baa', 'xaa', 'yaa', or 'zaa'
347	/[0-9a-fA-F]/; # matches a hexadecimal digit
348	/[0-9a-zA-Z_]/; # matches a "word" character,
349	# like those in a perl variable name
350
351	If C<'-'> is the first or last character in a character class, it is
352	treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are
353	all equivalent.
354
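The equivalence of these three spellings can be checked by trying each
class against a handful of single characters. This is only a sketch,
and the test characters are an arbitrary choice:

```perl
use strict;
use warnings;

my @chars = ('a', 'b', '-', 'c', '\\');
my (@leading, @trailing, @escaped);

foreach my $char (@chars) {
    push @leading,  $char if $char =~ /[-ab]/;    # '-' first
    push @trailing, $char if $char =~ /[ab-]/;    # '-' last
    push @escaped,  $char if $char =~ /[a\-b]/;   # '-' backslashed
}

print "leading:  @leading\n";
print "trailing: @trailing\n";
print "escaped:  @escaped\n";
```

Each class accepts C<'a'>, C<'b'> and C<'-'>, and rejects both C<'c'>
and the backslash.
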
355	The special character C<^> in the first position of a character class
356	denotes a B<negated character class>, which matches any character but
357	those in the brackets. Both C<[...]> and C<[^...]> must match a
358	character, or the match fails. Then
359
360	/[^a]at/; # doesn't match 'aat' or 'at', but matches
361	# all other 'bat', 'cat', '0at', '%at', etc.
362	/[^0-9]/; # matches a non-numeric character
363	/[a^]at/; # matches 'aat' or '^at'; here '^' is ordinary
364
365	Now, even C<[0-9]> can be a bother to write multiple times, so in the
366	interest of saving keystrokes and making regexps more readable, Perl
367	has several abbreviations for common character classes:
368
369	=over 4
370
371	=item *
372
373	\d is a digit and represents [0-9]
374
375	=item *
376
377	\s is a whitespace character and represents [\ \t\r\n\f]
378
379	=item *
380
381	\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]
382
383	=item *
384
385	\D is a negated \d; it represents any character but a digit [^0-9]
386
387	=item *
388
389	\S is a negated \s; it represents any non-whitespace character [^\s]
390
391	=item *
392
393	\W is a negated \w; it represents any non-word character [^\w]
394
395	=item *
396
397	The period '.' matches any character but "\n"
398
399	=back
400
401	The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
402	of character classes. Here are some in use:
403
404	/\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
405	/[\d\s]/; # matches any digit or whitespace character
406	/\w\W\w/; # matches a word char, followed by a
407	# non-word char, followed by a word char
408	/..rt/; # matches any two chars, followed by 'rt'
409	/end\./; # matches 'end.'
410	/end[.]/; # same thing, matches 'end.'
411
412	Because a period is a metacharacter, it needs to be escaped to match
413	as an ordinary period. Because, for example, C<\d> and C<\w> are sets
414	of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in
415	fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as
416	C<[\W]>. Think De Morgan's laws.
417
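The set reasoning above can be confirmed by brute force. The sketch
below, restricted to the ASCII characters as an assumption to keep it
simple, compares C<[^\d\w]> with C<\W> on every character and then
shows that C<[\D\W]> is a different, much larger class:

```perl
use strict;
use warnings;

# Count the ASCII characters on which [^\d\w] and \W disagree
my $mismatches = 0;
foreach my $code (0 .. 127) {
    my $char = chr($code);
    my $in_negated  = ($char =~ /[^\d\w]/) ? 1 : 0;
    my $in_non_word = ($char =~ /\W/)      ? 1 : 0;
    $mismatches++ if $in_negated != $in_non_word;
}

# 'a' is not a digit, so it belongs to [\D\W] but not to [^\d\w]
my $a_in_union   = ('a' =~ /[\D\W]/)  ? 1 : 0;
my $a_in_negated = ('a' =~ /[^\d\w]/) ? 1 : 0;

print "mismatches=$mismatches union=$a_in_union negated=$a_in_negated\n";
```
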
418	An anchor useful in basic regexps is the S<B<word anchor> >
419	C<\b>. This matches a boundary between a word character and a non-word
420	character C<\w\W> or C<\W\w>:
421
422	$x = "Housecat catenates house and cat";
423	$x =~ /cat/; # matches cat in 'Housecat'
424	$x =~ /\bcat/; # matches cat in 'catenates'
425	$x =~ /cat\b/; # matches cat in 'Housecat'
426	$x =~ /\bcat\b/; # matches 'cat' at end of string
427
428	Note that in the last example, the end of the string is considered a word
429	boundary.
430
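To see exactly which C<cat> each of these regexps picks, one can record
the offset at which every match starts. The sketch below uses
C<$-[0]>, which L<perlvar> documents as the start offset of the last
successful match; the C<%offset> hash is just our own bookkeeping:

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";
my %offset;

# The condition runs first, so $-[0] refers to the match just made
$offset{'cat'}     = $-[0] if $x =~ /cat/;       # in 'Housecat'
$offset{'\bcat'}   = $-[0] if $x =~ /\bcat/;     # in 'catenates'
$offset{'cat\b'}   = $-[0] if $x =~ /cat\b/;     # in 'Housecat'
$offset{'\bcat\b'} = $-[0] if $x =~ /\bcat\b/;   # at end of string

print "$_ starts at $offset{$_}\n" foreach sort keys %offset;
```

Offsets count from zero, so C<5> points into C<'Housecat'>, C<9> is the
start of C<'catenates'>, and C<29> is the final C<'cat'>.
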
431	You might wonder why C<'.'> matches everything but C<"\n"> - why not
432	every character? The reason is that often one is matching against
433	lines and would like to ignore the newline characters. For instance,
434	while the string C<"\n"> represents one line, we would like to think
435	of it as empty. Then
436
437	"" =~ /^$/; # matches
438	"\n" =~ /^$/; # matches, "\n" is ignored
439
440	"" =~ /./; # doesn't match; it needs a char
441	"" =~ /^.$/; # doesn't match; it needs a char
442	"\n" =~ /^.$/; # doesn't match; it needs a char other than "\n"
443	"a" =~ /^.$/; # matches
444	"a\n" =~ /^.$/; # matches, ignores the "\n"
445
446	This behavior is convenient, because we usually want to ignore
447	newlines when we count and match characters in a line. Sometimes,
448	however, we want to keep track of newlines. We might even want C<^>
449	and C<$> to anchor at the beginning and end of lines within the
450	string, rather than just the beginning and end of the string. Perl
451	allows us to choose between ignoring and paying attention to newlines
452	by using the C<//s> and C<//m> modifiers. C<//s> and C<//m> stand for
453	single line and multi-line and they determine whether a string is to
454	be treated as one continuous string, or as a set of lines. The two
455	modifiers affect two aspects of how the regexp is interpreted: 1) how
456	the C<'.'> character class is defined, and 2) where the anchors C<^>
457	and C<$> are able to match. Here are the four possible combinations:
458
459	=over 4
460
461	=item *
462
463	no modifiers (//): Default behavior. C<'.'> matches any character
464	except C<"\n">. C<^> matches only at the beginning of the string and
465	C<$> matches only at the end or before a newline at the end.
466
467	=item *
468
469	s modifier (//s): Treat string as a single long line. C<'.'> matches
470	any character, even C<"\n">. C<^> matches only at the beginning of
471	the string and C<$> matches only at the end or before a newline at the
472	end.
473
474	=item *
475
476	m modifier (//m): Treat string as a set of multiple lines. C<'.'>
477	matches any character except C<"\n">. C<^> and C<$> are able to match
478	at the start or end of I<any> line within the string.
479
480	=item *
481
482	both s and m modifiers (//sm): Treat string as a single long line, but
483	detect multiple lines. C<'.'> matches any character, even
484	C<"\n">. C<^> and C<$>, however, are able to match at the start or end
485	of I<any> line within the string.
486
487	=back
488
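The combinations listed above can be tried out on a short two-line
string. This is a minimal sketch; the string and the C<%matched> hash
are made up for the purpose:

```perl
use strict;
use warnings;

my $text = "first\nsecond\n";
my %matched;

# Anchors: without //m, ^ and $ see only the ends of the whole string
$matched{'^second'}    = ($text =~ /^second/)   ? 1 : 0;
$matched{'^second /m'} = ($text =~ /^second/m)  ? 1 : 0;
$matched{'first$'}     = ($text =~ /first$/)    ? 1 : 0;
$matched{'first$ /m'}  = ($text =~ /first$/m)   ? 1 : 0;

# The period: without //s, '.' refuses to match the newline
$matched{'first.second'}    = ($text =~ /first.second/)  ? 1 : 0;
$matched{'first.second /s'} = ($text =~ /first.second/s) ? 1 : 0;

printf "%-16s %s\n", $_, $matched{$_} ? "matches" : "doesn't match"
    foreach sort keys %matched;
```

Only the forms carrying the relevant modifier match: C<//m> for the
anchors and C<//s> for the period.
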
489	Here are examples of C<//s> and C<//m> in action:
490
491	$x = "There once was a girl\nWho programmed in Perl\n";
492
493	$x =~ /^Who/; # doesn't match, "Who" not at start of string
494	$x =~ /^Who/s; # doesn't match, "Who" not at start of string
495	$x =~ /^Who/m; # matches, "Who" at start of second line
496	$x =~ /^Who/sm; # matches, "Who" at start of second line
497
498	$x =~ /girl.Who/; # doesn't match, "." doesn't match "\n"
499	$x =~ /girl.Who/s; # matches, "." matches "\n"