libre.tex@ 3398

Last change on this file since 3398 was 3225, checked in by bird, 19 years ago
Python 2.5
File size: 41.2 KB

Line
1	\section{\module{re} ---
2	Regular expression operations}
3	\declaremodule{standard}{re}
4	\moduleauthor{Fredrik Lundh}{[email protected]}
5	\sectionauthor{Andrew M. Kuchling}{[email protected]}
6
7
8	\modulesynopsis{Regular expression search and match operations with a
9	Perl-style expression syntax.}
10
11
12	This module provides regular expression matching operations similar to
13	those found in Perl. Regular expression pattern strings may not
14	contain null bytes, but can specify the null byte using the
15	\code{\e\var{number}} notation. Both patterns and strings to be
16	searched can be Unicode strings as well as 8-bit strings. The
17	\module{re} module is always available.
18
19	Regular expressions use the backslash character (\character{\e}) to
20	indicate special forms or to allow special characters to be used
21	without invoking their special meaning. This collides with Python's
22	usage of the same character for the same purpose in string literals;
23	for example, to match a literal backslash, one might have to write
24	\code{'\e\e\e\e'} as the pattern string, because the regular expression
25	must be \samp{\e\e}, and each backslash must be expressed as
26	\samp{\e\e} inside a regular Python string literal.
27
28	The solution is to use Python's raw string notation for regular
29	expression patterns; backslashes are not handled in any special way in
30	a string literal prefixed with \character{r}. So \code{r"\e n"} is a
31	two-character string containing \character{\e} and \character{n},
32	while \code{"\e n"} is a one-character string containing a newline.
33	Usually patterns will be expressed in Python code using this raw
34	string notation.
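
As a small illustration of the point above (a sketch only; the variable
names are arbitrary and not part of the module), the two notations can
be compared directly:

\begin{verbatim}
import re

# Two spellings of the same two-character pattern string.
plain = '\\\\'   # ordinary literal: each backslash is doubled again
raw = r'\\'      # raw literal: written exactly as the RE needs it
same = (plain == raw)                          # True

# The subject string 'a\\b' holds three characters: a, backslash, b.
found = re.search(raw, 'a\\b') is not None     # True

# Raw notation keeps the backslash instead of forming a newline.
lengths = (len("\n"), len(r"\n"))              # (1, 2)
\end{verbatim}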
35
36	\begin{seealso}
37	\seetitle{Mastering Regular Expressions}{Book on regular expressions
38	by Jeffrey Friedl, published by O'Reilly. The second
39	edition of the book no longer covers Python at all,
40	but the first edition covered writing good regular expression
41	patterns in great detail.}
42	\end{seealso}
43
44
45	\subsection{Regular Expression Syntax \label{re-syntax}}
46
47	A regular expression (or RE) specifies a set of strings that matches
48	it; the functions in this module let you check if a particular string
49	matches a given regular expression (or if a given regular expression
50	matches a particular string, which comes down to the same thing).
51
52	Regular expressions can be concatenated to form new regular
53	expressions; if \emph{A} and \emph{B} are both regular expressions,
54	then \emph{AB} is also a regular expression. In general, if a string
55	\emph{p} matches \emph{A} and another string \emph{q} matches \emph{B},
56	the string \emph{pq} will match \emph{AB}. This holds unless \emph{A} or
57	\emph{B} contains low-precedence operations, boundary conditions between
58	\emph{A} and \emph{B}, or numbered group references. Thus, complex
59	expressions can easily be constructed from simpler primitive
60	expressions like the ones described here. For details of the theory
61	and implementation of regular expressions, consult the Friedl book
62	referenced above, or almost any textbook about compiler construction.
63
64	A brief explanation of the format of regular expressions follows. For
65	further information and a gentler presentation, consult the Regular
66	Expression HOWTO, accessible from \url{http://www.python.org/doc/howto/}.
67
68	Regular expressions can contain both special and ordinary characters.
69	Most ordinary characters, like \character{A}, \character{a}, or
70	\character{0}, are the simplest regular expressions; they simply match
71	themselves. You can concatenate ordinary characters, so \regexp{last}
72	matches the string \code{'last'}. (In the rest of this section, we'll
73	write REs in \regexp{this special style}, usually without quotes, and
74	strings to be matched \code{'in single quotes'}.)
75
76	Some characters, like \character{\|} or \character{(}, are special.
77	Special characters either stand for classes of ordinary characters, or
78	affect how the regular expressions around them are interpreted.
79
80	The special characters are:
81	%
82	\begin{description}
83
84	\item[\character{.}] (Dot.) In the default mode, this matches any
85	character except a newline. If the \constant{DOTALL} flag has been
86	specified, this matches any character including a newline.
87
88	\item[\character{\textasciicircum}] (Caret.) Matches the start of the
89	string, and in \constant{MULTILINE} mode also matches immediately
90	after each newline.
91
92	\item[\character{\$}] Matches the end of the string or just before the
93	newline at the end of the string, and in \constant{MULTILINE} mode
94	also matches before a newline. \regexp{foo} matches both 'foo' and
95	'foobar', while the regular expression \regexp{foo\$} matches only
96	'foo'. More interestingly, searching for \regexp{foo.\$} in
97	'foo1\textbackslash nfoo2\textbackslash n' matches 'foo2' normally,
98	but 'foo1' in \constant{MULTILINE} mode.
99
100	\item[\character{*}] Causes the resulting RE to
101	match 0 or more repetitions of the preceding RE, as many repetitions
102	as are possible. \regexp{ab*} will
103	match 'a', 'ab', or 'a' followed by any number of 'b's.
104
105	\item[\character{+}] Causes the
106	resulting RE to match 1 or more repetitions of the preceding RE.
107	\regexp{ab+} will match 'a' followed by any non-zero number of 'b's; it
108	will not match just 'a'.
109
110	\item[\character{?}] Causes the resulting RE to
111	match 0 or 1 repetitions of the preceding RE. \regexp{ab?} will
112	match either 'a' or 'ab'.
113
114	\item[\code{*?}, \code{+?}, \code{??}] The \character{*},
115	\character{+}, and \character{?} qualifiers are all \dfn{greedy}; they
116	match as much text as possible. Sometimes this behaviour isn't
117	desired; if the RE \regexp{<.*>} is matched against
118	\code{'<H1>title</H1>'}, it will match the entire string, and not just
119	\code{'<H1>'}. Adding \character{?} after the qualifier makes it
120	perform the match in \dfn{non-greedy} or \dfn{minimal} fashion; as
121	\emph{few} characters as possible will be matched. Using \regexp{.*?}
122	in the previous expression will match only \code{'<H1>'}.
123
124	\item[\code{\{\var{m}\}}]
125	Specifies that exactly \var{m} copies of the previous RE should be
126	matched; fewer matches cause the entire RE not to match. For example,
127	\regexp{a\{6\}} will match exactly six \character{a} characters, but
128	not five.
129
130	\item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from
131	\var{m} to \var{n} repetitions of the preceding RE, attempting to
132	match as many repetitions as possible. For example, \regexp{a\{3,5\}}
133	will match from 3 to 5 \character{a} characters. Omitting \var{m}
134	specifies a lower bound of zero,
135	and omitting \var{n} specifies an infinite upper bound. As an
136	example, \regexp{a\{4,\}b} will match \code{aaaab} or a thousand
137	\character{a} characters followed by a \code{b}, but not \code{aaab}.
138	The comma may not be omitted or the modifier would be confused with
139	the previously described form.
140
141	\item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to
142	match from \var{m} to \var{n} repetitions of the preceding RE,
143	attempting to match as \emph{few} repetitions as possible. This is
144	the non-greedy version of the previous qualifier. For example, on the
145	6-character string \code{'aaaaaa'}, \regexp{a\{3,5\}} will match 5
146	\character{a} characters, while \regexp{a\{3,5\}?} will only match 3
147	characters.
148
149	\item[\character{\e}] Either escapes special characters (permitting
150	you to match characters like \character{*}, \character{?}, and so
151	forth), or signals a special sequence; special sequences are discussed
152	below.
153
154	If you're not using a raw string to
155	express the pattern, remember that Python also uses the
156	backslash as an escape sequence in string literals; if the escape
157	sequence isn't recognized by Python's parser, the backslash and
158	subsequent character are included in the resulting string. However,
159	if Python would recognize the resulting sequence, the backslash should
160	be repeated twice. This is complicated and hard to understand, so
161	it's highly recommended that you use raw strings for all but the
162	simplest expressions.
163
164	\item[\code{[]}] Used to indicate a set of characters. Characters can
165	be listed individually, or a range of characters can be indicated by
166	giving two characters and separating them by a \character{-}. Special
167	characters are not active inside sets. For example, \regexp{[akm\$]}
168	will match any of the characters \character{a}, \character{k},
169	\character{m}, or \character{\$}; \regexp{[a-z]}
170	will match any lowercase letter, and \code{[a-zA-Z0-9]} matches any
171	letter or digit. Character classes such as \code{\e w} or \code{\e S}
172	(defined below) are also acceptable inside a range. If you want to
173	include a \character{]} or a \character{-} inside a set, precede it with a
174	backslash, or place it as the first character. The
175	pattern \regexp{[]]} will match \code{']'}, for example.
176
177	You can match the characters not within a range by \dfn{complementing}
178	the set. This is indicated by including a
179	\character{\textasciicircum} as the first character of the set;
180	\character{\textasciicircum} elsewhere will simply match the
181	\character{\textasciicircum} character. For example,
182	\regexp{[{\textasciicircum}5]} will match
183	any character except \character{5}, and
184	\regexp{[\textasciicircum\code{\textasciicircum}]} will match any character
185	except \character{\textasciicircum}.
186
187	\item[\character{\|}]\code{A\|B}, where A and B can be arbitrary REs,
188	creates a regular expression that will match either A or B. An
189	arbitrary number of REs can be separated by the \character{\|} in this
190	way. This can be used inside groups (see below) as well. As the target
191	string is scanned, REs separated by \character{\|} are tried from left to
192	right. When one pattern completely matches, that branch is accepted.
193	This means that once \code{A} matches, \code{B} will not be tested further,
194	even if it would produce a longer overall match. In other words, the
195	\character{\|} operator is never greedy. To match a literal \character{\|},
196	use \regexp{\e\|}, or enclose it inside a character class, as in \regexp{[\|]}.
197
198	\item[\code{(...)}] Matches whatever regular expression is inside the
199	parentheses, and indicates the start and end of a group; the contents
200	of a group can be retrieved after a match has been performed, and can
201	be matched later in the string with the \regexp{\e \var{number}} special
202	sequence, described below. To match the literals \character{(} or
203	\character{)}, use \regexp{\e(} or \regexp{\e)}, or enclose them
204	inside a character class: \regexp{[(] [)]}.
205
206	\item[\code{(?...)}] This is an extension notation (a \character{?}
207	following a \character{(} is not meaningful otherwise). The first
208	character after the \character{?}
209	determines what the meaning and further syntax of the construct is.
210	Extensions usually do not create a new group;
211	\regexp{(?P<\var{name}>...)} is the only exception to this rule.
212	Following are the currently supported extensions.
213
214	\item[\code{(?iLmsux)}] (One or more letters from the set \character{i},
215	\character{L}, \character{m}, \character{s}, \character{u},
216	\character{x}.) The group matches the empty string; the letters set
217	the corresponding flags (\constant{re.I}, \constant{re.L},
218	\constant{re.M}, \constant{re.S}, \constant{re.U}, \constant{re.X})
219	for the entire regular expression. This is useful if you wish to
220	include the flags as part of the regular expression, instead of
221	passing a \var{flag} argument to the \function{compile()} function.
222
223	Note that the \regexp{(?x)} flag changes how the expression is parsed.
224	It should be used first in the expression string, or after one or more
225	whitespace characters. If there are non-whitespace characters before
226	the flag, the results are undefined.
227
228	\item[\code{(?:...)}] A non-grouping version of regular parentheses.
229	Matches whatever regular expression is inside the parentheses, but the
230	substring matched by the
231	group \emph{cannot} be retrieved after performing a match or
232	referenced later in the pattern.
233
234	\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
235	the substring matched by the group is accessible via the symbolic group
236	name \var{name}. Group names must be valid Python identifiers, and
237	each group name must be defined only once within a regular expression. A
238	symbolic group is also a numbered group, just as if the group were not
239	named. So the group named 'id' in the example below can also be
240	referenced as the numbered group 1.
241
242	For example, if the pattern is
243	\regexp{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its
244	name in arguments to methods of match objects, such as
245	\code{m.group('id')} or \code{m.end('id')}, and also by name in
246	pattern text (for example, \regexp{(?P=id)}) and replacement text
247	(such as \code{\e g<id>}).
248
249	\item[\code{(?P=\var{name})}] Matches whatever text was matched by the
250	earlier group named \var{name}.
251
252	\item[\code{(?\#...)}] A comment; the contents of the parentheses are
253	simply ignored.
254
255	\item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't
256	consume any of the string. This is called a lookahead assertion. For
257	example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's
258	followed by \code{'Asimov'}.
259
260	\item[\code{(?!...)}] Matches if \regexp{...} doesn't match next. This
261	is a negative lookahead assertion. For example,
262	\regexp{Isaac (?!Asimov)} will match \code{'Isaac~'} only if it's \emph{not}
263	followed by \code{'Asimov'}.
264
265	\item[\code{(?<=...)}] Matches if the current position in the string
266	is preceded by a match for \regexp{...} that ends at the current
267	position. This is called a \dfn{positive lookbehind assertion}.
268	\regexp{(?<=abc)def} will find a match in \samp{abcdef}, since the
269	lookbehind will back up 3 characters and check if the contained
270	pattern matches. The contained pattern must only match strings of
271	some fixed length, meaning that \regexp{abc} or \regexp{a\|b} are
272	allowed, but \regexp{a*} and \regexp{a\{3,4\}} are not. Note that
273	patterns which start with positive lookbehind assertions will never
274	match at the beginning of the string being searched; you will most
275	likely want to use the \function{search()} function rather than the
276	\function{match()} function:
277
278	\begin{verbatim}
279	>>> import re
280	>>> m = re.search('(?<=abc)def', 'abcdef')
281	>>> m.group(0)
282	'def'
283	\end{verbatim}
284
285	This example looks for a word following a hyphen:
286
287	\begin{verbatim}
288	>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
289	>>> m.group(0)
290	'egg'
291	\end{verbatim}
292
293	\item[\code{(?<!...)}] Matches if the current position in the string
294	is not preceded by a match for \regexp{...}. This is called a
295	\dfn{negative lookbehind assertion}. Similar to positive lookbehind
296	assertions, the contained pattern must only match strings of some
297	fixed length. Patterns which start with negative lookbehind
298	assertions may match at the beginning of the string being searched.
299
300	\item[\code{(?(\var{id/name})yes-pattern\|no-pattern)}] Will try to match
301	with \regexp{yes-pattern} if the group with given \var{id} or \var{name}
302	exists, and with \regexp{no-pattern} if it doesn't. \regexp{\|no-pattern}
303	is optional and can be omitted. For example,
304	\regexp{(<)?(\e w+@\e w+(?:\e .\e w+)+)(?(1)>)} is a poor email matching
305	pattern, which will match with \code{'<user@host.com>'} as well as
306	\code{'user@host.com'}, but not with \code{'<user@host.com'}.
307	\versionadded{2.4}
308
309	\end{description}
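
A few of the special characters described above can be tried out as
follows. This is only a sketch: the variable names are illustrative,
and the values in the comments follow from the descriptions in the list.

\begin{verbatim}
import re

# Greedy versus non-greedy repetition.
greedy = re.match(r'<.*>', '<H1>title</H1>').group(0)     # '<H1>title</H1>'
minimal = re.match(r'<.*?>', '<H1>title</H1>').group(0)   # '<H1>'
bounded = re.match(r'a{3,5}', 'aaaaaa').group(0)          # 'aaaaa'
bounded_min = re.match(r'a{3,5}?', 'aaaaaa').group(0)     # 'aaa'

# '$' matches before the final newline, or before any newline with re.M.
text = 'foo1\nfoo2\n'
last_line = re.search(r'foo.$', text).group(0)            # 'foo2'
first_line = re.search(r'foo.$', text, re.M).group(0)     # 'foo1'

# Lookahead assertions test the following text without consuming it.
ahead = re.match(r'Isaac (?=Asimov)', 'Isaac Asimov').group(0)   # 'Isaac '
not_ahead = re.match(r'Isaac (?!Asimov)', 'Isaac Asimov')        # None
\end{verbatim}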
310
311	The special sequences consist of \character{\e} and a character from the
312	list below. If the ordinary character is not on the list, then the
313	resulting RE will match the second character. For example,
314	\regexp{\e\$} matches the character \character{\$}.
315	%
316	\begin{description}
317
318	\item[\code{\e \var{number}}] Matches the contents of the group of the
319	same number. Groups are numbered starting from 1. For example,
320	\regexp{(.+) \e 1} matches \code{'the the'} or \code{'55 55'}, but not
321	\code{'the end'} (note
322	the space after the group). This special sequence can only be used to
323	match one of the first 99 groups. If the first digit of \var{number}
324	is 0, or \var{number} is 3 octal digits long, it will not be interpreted
325	as a group match, but as the character with octal value \var{number}.
326	Inside the \character{[} and \character{]} of a character class, all numeric
327	escapes are treated as characters.
328
329	\item[\code{\e A}] Matches only at the start of the string.
330
331	\item[\code{\e b}] Matches the empty string, but only at the
332	beginning or end of a word. A word is defined as a sequence of
333	alphanumeric or underscore characters, so the end of a word is indicated by
334	whitespace or a non-alphanumeric, non-underscore character. Note that
335	{}\code{\e b} is defined as the boundary between \code{\e w} and \code{\e
336	W}, so the precise set of characters deemed to be alphanumeric depends on the
337	values of the \code{UNICODE} and \code{LOCALE} flags. Inside a character
338	range, \regexp{\e b} represents the backspace character, for compatibility
339	with Python's string literals.
340
341	\item[\code{\e B}] Matches the empty string, but only when it is \emph{not}
342	at the beginning or end of a word. This is just the opposite of {}\code{\e
343	b}, so is also subject to the settings of \code{LOCALE} and \code{UNICODE}.
344
345	\item[\code{\e d}]When the \constant{UNICODE} flag is not specified, matches
346	any decimal digit; this is equivalent to the set \regexp{[0-9]}.
347	With \constant{UNICODE}, it will match whatever is classified as a digit
348	in the Unicode character properties database.
349
350	\item[\code{\e D}]When the \constant{UNICODE} flag is not specified, matches
351	any non-digit character; this is equivalent to the set
352	\regexp{[{\textasciicircum}0-9]}. With \constant{UNICODE}, it will match
353	anything other than characters marked as digits in the Unicode character
354	properties database.
355
356	\item[\code{\e s}]When the \constant{LOCALE} and \constant{UNICODE}
357	flags are not specified, matches any whitespace character; this is
358	equivalent to the set \regexp{[ \e t\e n\e r\e f\e v]}.
359	With \constant{LOCALE}, it will match this set plus whatever characters
360	are defined as space for the current locale. If \constant{UNICODE} is set,
361	this will match the characters \regexp{[ \e t\e n\e r\e f\e v]} plus
362	whatever is classified as space in the Unicode character properties
363	database.
364
365	\item[\code{\e S}]When the \constant{LOCALE} and \constant{UNICODE}
366	flags are not specified, matches any non-whitespace character; this is
367	equivalent to the set \regexp{[\textasciicircum\ \e t\e n\e r\e f\e v]}.
368	With \constant{LOCALE}, it will match any character not in this set,
369	and not defined as space in the current locale. If \constant{UNICODE}
370	is set, this will match anything other than \regexp{[ \e t\e n\e r\e f\e v]}
371	and characters marked as space in the Unicode character properties database.
372
373	\item[\code{\e w}]When the \constant{LOCALE} and \constant{UNICODE}
374	flags are not specified, matches any alphanumeric character and the
375	underscore; this is equivalent to the set
376	\regexp{[a-zA-Z0-9_]}. With \constant{LOCALE}, it will match the set
377	\regexp{[0-9_]} plus whatever characters are defined as alphanumeric for
378	the current locale. If \constant{UNICODE} is set, this will match the
379	characters \regexp{[0-9_]} plus whatever is classified as alphanumeric
380	in the Unicode character properties database.
381
382	\item[\code{\e W}]When the \constant{LOCALE} and \constant{UNICODE}
383	flags are not specified, matches any non-alphanumeric character; this
384	is equivalent to the set \regexp{[{\textasciicircum}a-zA-Z0-9_]}. With
385	\constant{LOCALE}, it will match any character not in the set
386	\regexp{[0-9_]}, and not defined as alphanumeric for the current locale.
387	If \constant{UNICODE} is set, this will match anything other than
388	\regexp{[0-9_]} and characters marked as alphanumeric in the Unicode
389	character properties database.
390
391	\item[\code{\e Z}]Matches only at the end of the string.
392
393	\end{description}
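
The special sequences can be exercised in the same way. The sketch
below assumes that no flags are set and that the input is plain ASCII;
the variable names are illustrative. Note that \function{match()} is
used for the backreference so that the test is anchored at the start
of the string.

\begin{verbatim}
import re

# A backreference must repeat the text captured by group 1.
repeated = re.match(r'(.+) \1', 'the the')     # a match object
different = re.match(r'(.+) \1', 'the end')    # None

# \b only succeeds where a word starts or ends.
words = re.findall(r'\bfoo\b', 'foo foobar (foo)')   # ['foo', 'foo']

# \d and \w with no flags set.
digits = re.findall(r'\d+', 'a1b22c333')       # ['1', '22', '333']
names = re.findall(r'\w+', 'spam_1, eggs-2')   # ['spam_1', 'eggs', '2']
\end{verbatim}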
394
395	Most of the standard escapes supported by Python string literals are
396	also accepted by the regular expression parser:
397
398	\begin{verbatim}
399	\a \b \f \n
400	\r \t \v \x
401	\\
402	\end{verbatim}
403
404	Octal escapes are included in a limited form: If the first digit is a
405	0, or if there are three octal digits, it is considered an octal
406	escape. Otherwise, it is a group reference. As for string literals,
407	octal escapes are always at most three digits in length.
408
409
410	% Note the lack of a period in the section title; it causes problems
411	% with readers of the GNU info version. See http://www.python.org/sf/581414.
412	\subsection{Matching vs Searching \label{matching-searching}}
413	\sectionauthor{Fred L. Drake, Jr.}{[email protected]}
414
415	Python offers two different primitive operations based on regular
416	expressions: match and search. If you are accustomed to Perl's
417	semantics, the search operation is what you're looking for. See the
418	\function{search()} function and corresponding method of compiled
419	regular expression objects.
420
421	Note that match may differ from search using a regular expression
422	beginning with \character{\textasciicircum}:
423	\character{\textasciicircum} matches only at the
424	start of the string, or in \constant{MULTILINE} mode also immediately
425	following a newline. The ``match'' operation succeeds only if the
426	pattern matches at the start of the string regardless of mode, or at
427	the starting position given by the optional \var{pos} argument
428	regardless of whether a newline precedes it.
429
430	% Examples from Tim Peters:
431	\begin{verbatim}
432	re.compile("a").match("ba", 1) # succeeds
433	re.compile("^a").search("ba", 1) # fails; 'a' not at start
434	re.compile("^a").search("\na", 1) # fails; 'a' not at start
435	re.compile("^a", re.M).search("\na", 1) # succeeds
436	re.compile("^a", re.M).search("ba", 1) # fails; no preceding \n
437	\end{verbatim}
438
439
440	\subsection{Module Contents}
441	\nodename{Contents of Module re}
442
443	The module defines several functions, constants, and an exception. Some of the
444	functions are simplified versions of the full-featured methods for compiled
445	regular expressions. Most non-trivial applications use the compiled
446	form.
447
448	\begin{funcdesc}{compile}{pattern\optional{, flags}}
449	Compile a regular expression pattern into a regular expression
450	object, which can be used for matching using its \function{match()} and
451	\function{search()} methods, described below.
452
453	The expression's behaviour can be modified by specifying a
454	\var{flags} value. Values can be any of the following variables,
455	combined using bitwise OR (the \code{\|} operator).
456
457	The sequence
458
459	\begin{verbatim}
460	prog = re.compile(pat)
461	result = prog.match(str)
462	\end{verbatim}
463
464	is equivalent to
465
466	\begin{verbatim}
467	result = re.match(pat, str)
468	\end{verbatim}
469
470	but the version using \function{compile()} is more efficient when the
471	expression will be used several times in a single program.
472	%(The compiled version of the last pattern passed to
473	%\function{re.match()} or \function{re.search()} is cached, so
474	%programs that use only a single regular expression at a time needn't
475	%worry about compiling regular expressions.)
476	\end{funcdesc}
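
As a sketch of how flag values combine (the pattern and variable names
here are made up for illustration), the following compiles one
expression with two flags joined by bitwise OR, and then writes the same
flags inside the pattern using the \regexp{(?...)} notation:

\begin{verbatim}
import re

# Two flags combined with bitwise OR.
prog = re.compile(r'^hello.world$', re.IGNORECASE | re.DOTALL)
combined = prog.match('HELLO\nWORLD') is not None        # True

# Without DOTALL the dot refuses to match the newline.
no_dotall = re.compile(r'^hello.world$', re.IGNORECASE)
rejected = no_dotall.match('HELLO\nWORLD') is None       # True

# The same two flags written at the start of the pattern itself.
inline = re.compile(r'(?is)^hello.world$')
same_result = inline.match('HELLO\nWORLD') is not None   # True
\end{verbatim}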
477
478	\begin{datadesc}{I}
479	\dataline{IGNORECASE}
480	Perform case-insensitive matching; expressions like \regexp{[A-Z]}
481	will match lowercase letters, too. This is not affected by the
482	current locale.
483	\end{datadesc}
484
485	\begin{datadesc}{L}
486	\dataline{LOCALE}
487	Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, \regexp{\e B},
488	\regexp{\e s} and \regexp{\e S} dependent on the current locale.
489	\end{datadesc}
490
491	\begin{datadesc}{M}
492	\dataline{MULTILINE}
493	When specified, the pattern character \character{\textasciicircum}
494	matches at the beginning of the string and at the beginning of each
495	line (immediately following each newline); and the pattern character
496	\character{\$} matches at the end of the string and at the end of each
497	line (immediately preceding each newline). By default,
498	\character{\textasciicircum} matches only at the beginning of the
499	string, and \character{\$} only at the end of the string and
500	immediately before the newline (if any) at the end of the string.
501	\end{datadesc}
502
503	\begin{datadesc}{S}
504	\dataline{DOTALL}
505	Make the \character{.} special character match any character at all,
506	including a newline; without this flag, \character{.} will match
507	anything \emph{except} a newline.
508	\end{datadesc}
509
510	\begin{datadesc}{U}
511	\dataline{UNICODE}
512	Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, \regexp{\e B},
513	\regexp{\e d}, \regexp{\e D}, \regexp{\e s} and \regexp{\e S}
514	dependent on the Unicode character properties database.
515	\versionadded{2.0}
516	\end{datadesc}
517
518	\begin{datadesc}{X}
519	\dataline{VERBOSE}
520	This flag allows you to write regular expressions that are easier to read.
521	Whitespace within the pattern is ignored,
522	except when in a character class or preceded by an unescaped
523	backslash, and, when a line contains a \character{\#} neither in a
524	character class nor preceded by an unescaped backslash, all characters
525	from the leftmost such \character{\#} through the end of the line are
526	ignored.
527	% XXX should add an example here
528	\end{datadesc}
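
A minimal sketch of the \constant{VERBOSE} flag follows. The same
expression is written twice, once spread over several commented lines
and once in compact form; both names are illustrative only.

\begin{verbatim}
import re

# The same pattern, first with layout and comments, then compact.
verbose = re.compile(r"""
    \d+     # the integral part
    \.      # the decimal point
    \d*     # some fractional digits
    """, re.VERBOSE)
compact = re.compile(r"\d+\.\d*")

text = 'pi is roughly 3.14159'
verbose_hit = verbose.search(text).group(0)    # '3.14159'
compact_hit = compact.search(text).group(0)    # '3.14159'
\end{verbatim}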
529
530
531	\begin{funcdesc}{search}{pattern, string\optional{, flags}}
532	Scan through \var{string} looking for a location where the regular
533	expression \var{pattern} produces a match, and return a
534	corresponding \class{MatchObject} instance.
535	Return \code{None} if no
536	position in the string matches the pattern; note that this is
537	different from finding a zero-length match at some point in the string.
538	\end{funcdesc}
539
540	\begin{funcdesc}{match}{pattern, string\optional{, flags}}
541	If zero or more characters at the beginning of \var{string} match
542	the regular expression \var{pattern}, return a corresponding
543	\class{MatchObject} instance. Return \code{None} if the string does not
544	match the pattern; note that this is different from a zero-length
545	match.
546
547	\note{If you want to locate a match anywhere in
548	\var{string}, use \method{search()} instead.}
549	\end{funcdesc}
550
551	\begin{funcdesc}{split}{pattern, string\optional{, maxsplit\code{ = 0}}}
552	Split \var{string} by the occurrences of \var{pattern}. If
553	capturing parentheses are used in \var{pattern}, then the text of all
554	groups in the pattern is also returned as part of the resulting list.
555	If \var{maxsplit} is nonzero, at most \var{maxsplit} splits
556	occur, and the remainder of the string is returned as the final
557	element of the list. (Incompatibility note: in the original Python
558	1.5 release, \var{maxsplit} was ignored. This has been fixed in
559	later releases.)
560
561	\begin{verbatim}
562	>>> re.split(r'\W+', 'Words, words, words.')
563	['Words', 'words', 'words', '']
564	>>> re.split(r'(\W+)', 'Words, words, words.')
565	['Words', ', ', 'words', ', ', 'words', '.', '']
566	>>> re.split(r'\W+', 'Words, words, words.', 1)
567	['Words', 'words, words.']
568	\end{verbatim}
569	\end{funcdesc}
570
571	\begin{funcdesc}{findall}{pattern, string\optional{, flags}}
572	Return a list of all non-overlapping matches of \var{pattern} in
573	\var{string}. If one or more groups are present in the pattern,
574	return a list of groups; this will be a list of tuples if the
575	pattern has more than one group. Empty matches are included in the
576	result unless they touch the beginning of another match.
577	\versionadded{1.5.2}
578	\versionchanged[Added the optional flags argument]{2.4}
579	\end{funcdesc}
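
The effect of groups on the result of \function{findall()} can be
sketched as follows; the sample text and variable names are
illustrative only.

\begin{verbatim}
import re

text = 'width=20 height=10'

# No groups: a list of the matched substrings.
pairs = re.findall(r'\w+=\d+', text)       # ['width=20', 'height=10']

# One group: a list of that group's text.
keys = re.findall(r'(\w+)=\d+', text)      # ['width', 'height']

# Two groups: a list of tuples, one tuple per match.
items = re.findall(r'(\w+)=(\d+)', text)
# items == [('width', '20'), ('height', '10')]
\end{verbatim}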
580
581	\begin{funcdesc}{finditer}{pattern, string\optional{, flags}}
582	Return an iterator over all non-overlapping matches for the RE
583	\var{pattern} in \var{string}. For each match, the iterator returns
584	a match object. Empty matches are included in the result unless they
585	touch the beginning of another match.
586	\versionadded{2.2}
587	\versionchanged[Added the optional flags argument]{2.4}
588	\end{funcdesc}
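
Because \function{finditer()} yields match objects rather than strings,
the position of each match is available along with its text. A short
sketch, with an illustrative variable name:

\begin{verbatim}
import re

# Collect the start index and text of every run of digits.
found = [(m.start(), m.group(0))
         for m in re.finditer(r'\d+', 'a1b22c333')]
# found == [(1, '1'), (3, '22'), (6, '333')]
\end{verbatim}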
589
590	\begin{funcdesc}{sub}{pattern, repl, string\optional{, count}}
591	Return the string obtained by replacing the leftmost non-overlapping
592	occurrences of \var{pattern} in \var{string} by the replacement
593	\var{repl}. If the pattern isn't found, \var{string} is returned
594	unchanged. \var{repl} can be a string or a function; if it is a
595	string, any backslash escapes in it are processed. That is,
596	\samp{\e n} is converted to a single newline character, \samp{\e r}
597	is converted to a carriage return, and so forth. Unknown escapes such as
598	\samp{\e j} are left alone. Backreferences, such as \samp{\e6}, are
599	replaced with the substring matched by group 6 in the pattern. For
600	example:
601
602	\begin{verbatim}
603	>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
604	...        r'static PyObject*\npy_\1(void)\n{',
605	...        'def myfunc():')
606	'static PyObject*\npy_myfunc(void)\n{'
607	\end{verbatim}
608
609	If \var{repl} is a function, it is called for every non-overlapping
610	occurrence of \var{pattern}. The function takes a single match
611	object argument, and returns the replacement string. For example:
612
613	\begin{verbatim}
614	>>> def dashrepl(matchobj):
615	...     if matchobj.group(0) == '-': return ' '
616	...     else: return '-'
617	>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
618	'pro--gram files'
619	\end{verbatim}
620
621	The pattern may be a string or an RE object; if you need to specify
622	regular expression flags, you must use an RE object, or use embedded
623	modifiers in a pattern; for example, \samp{sub("(?i)b+", "x", "bbbb
624	BBBB")} returns \code{'x x'}.
625
626	The optional argument \var{count} is the maximum number of pattern
627	occurrences to be replaced; \var{count} must be a non-negative
628	integer. If omitted or zero, all occurrences will be replaced.
629	Empty matches for the pattern are replaced only when not adjacent to
630	a previous match, so \samp{sub('x*', '-', 'abc')} returns
631	\code{'-a-b-c-'}.
632
633	In addition to character escapes and backreferences as described
634	above, \samp{\e g<name>} will use the substring matched by the group
635	named \samp{name}, as defined by the \regexp{(?P<name>...)} syntax.
636	\samp{\e g<number>} uses the corresponding group number;
637	\samp{\e g<2>} is therefore equivalent to \samp{\e 2}, but isn't
638	ambiguous in a replacement such as \samp{\e g<2>0}. \samp{\e 20}
639	would be interpreted as a reference to group 20, not a reference to
640	group 2 followed by the literal character \character{0}. The
641	backreference \samp{\e g<0>} substitutes in the entire substring
642	matched by the RE.
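
The following sketch exercises the group reference forms described
above. The patterns and sample strings are made up for illustration,
and the exact error message raised for an undefined group is not
relied upon:

\begin{verbatim}
import re

# Named groups can be referenced with \g<name>.
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)',
                 r'\g<second> \g<first>',
                 'hello world')                     # 'world hello'

# \g<2>0 means "group 2, then a literal 0".
unambiguous = re.sub(r'(a)(b)', r'\g<2>0', 'ab')    # 'b0'

# \20 refers to group 20, which this pattern does not define,
# so the replacement template is rejected.
try:
    re.sub(r'(a)(b)', r'\20', 'ab')
    rejected = False
except re.error:
    rejected = True

# \g<0> stands for the entire match.
bracketed = re.sub(r'\d+', r'[\g<0>]', 'a1b22')     # 'a[1]b[22]'
\end{verbatim}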
643	\end{funcdesc}
644
645	\begin{funcdesc}{subn}{pattern, repl, string\optional{, count}}
646	Perform the same operation as \function{sub()}, but return a tuple
647	\code{(\var{new_string}, \var{number_of_subs_made})}.
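
A minimal sketch, with an invented sample string, of unpacking the
returned tuple:

\begin{verbatim}
import re

new_string, number_of_subs_made = re.subn('-', ' ', 'a-b-c')
# new_string is 'a b c' and number_of_subs_made is 2.
\end{verbatim}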
648	\end{funcdesc}
649
650	\begin{funcdesc}{escape}{string}
651	Return \var{string} with all non-alphanumerics backslashed; this is
652	useful if you want to match an arbitrary literal string that may have
653	regular expression metacharacters in it.
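
The sketch below contrasts an unescaped and an escaped pattern. The
sample strings are invented, and the escaped pattern is only used for
matching because its exact text may differ between Python versions:

\begin{verbatim}
import re

literal = 'a.b*c'   # '.' and '*' are metacharacters

# Unescaped, '.' matches any character and '*' is a repetition
# operator, so the pattern also matches unrelated text.
loose = re.search(literal, 'aXbbbc')

# Escaped, the pattern matches only the literal text itself.
strict_pattern = re.escape(literal)
strict_hit = re.search(strict_pattern, 'see a.b*c here')
strict_miss = re.search(strict_pattern, 'aXbbbc')
# loose and strict_hit are match objects; strict_miss is None.
\end{verbatim}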
654	\end{funcdesc}
655
656	\begin{excdesc}{error}
657	Exception raised when a string passed to one of the functions here
658	is not a valid regular expression (for example, it might contain
659	unmatched parentheses) or when some other error occurs during
660	compilation or matching. It is never an error if a string contains
661	no match for a pattern.
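
One possible way to use the exception is sketched below;
\code{is_valid_pattern} is a hypothetical helper written for this
example, not part of the module:

\begin{verbatim}
import re

def is_valid_pattern(pattern):
    """Return True if pattern compiles (hypothetical helper)."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

ok = is_valid_pattern('(abc)')       # True
bad = is_valid_pattern('(abc')       # False: unmatched parenthesis

# A failed search is not an error; it simply returns None.
no_match = re.search('xyz', 'abc')
\end{verbatim}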
662	\end{excdesc}
663
664
665	\subsection{Regular Expression Objects \label{re-objects}}
666
667	Compiled regular expression objects support the following methods and
668	attributes:
669
670	\begin{methoddesc}[RegexObject]{match}{string\optional{, pos\optional{,
671	endpos}}}
672	If zero or more characters at the beginning of \var{string} match
673	this regular expression, return a corresponding
674	\class{MatchObject} instance. Return \code{None} if the string does not
675	match the pattern; note that this is different from a zero-length
676	match.
677
678	\note{If you want to locate a match anywhere in
679	\var{string}, use \method{search()} instead.}
680
681	The optional second parameter \var{pos} gives an index in the string
682	where the search is to start; it defaults to \code{0}. This is not
683	completely equivalent to slicing the string; the
684	\code{'\textasciicircum'} pattern
685	character matches at the real beginning of the string and at positions
686	just after a newline, but not necessarily at the index where the search
687	is to start.
688
689	The optional parameter \var{endpos} limits how far the string will
690	be searched; it will be as if the string is \var{endpos} characters
691	long, so only the characters from \var{pos} to \code{\var{endpos} -
692	1} will be searched for a match. If \var{endpos} is less than
\var{pos}, no match will be found; otherwise, if \var{rx} is a
694	compiled regular expression object,
695	\code{\var{rx}.match(\var{string}, 0, 50)} is equivalent to
696	\code{\var{rx}.match(\var{string}[:50], 0)}.
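
The difference between \var{pos} and slicing, and the equivalence
stated for \var{endpos}, can be sketched like this (the patterns and
strings are chosen only for illustration):

\begin{verbatim}
import re

rx = re.compile('^b')

# '^' is anchored to the real start of the string, so starting the
# match at index 1 does not make it match there.
with_pos = rx.match('ab', 1)          # None
with_slice = rx.match('ab'[1:])       # matches 'b'

# endpos behaves as if the string were truncated at that index.
digits = re.compile(r'\d+')
limited = digits.match('1234567', 0, 3).group()     # '123'
sliced = digits.match('1234567'[:3], 0).group()     # '123'
\end{verbatim}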
697	\end{methoddesc}
698
699	\begin{methoddesc}[RegexObject]{search}{string\optional{, pos\optional{,
700	endpos}}}
701	Scan through \var{string} looking for a location where this regular
702	expression produces a match, and return a
703	corresponding \class{MatchObject} instance. Return \code{None} if no
704	position in the string matches the pattern; note that this is
705	different from finding a zero-length match at some point in the string.
706
707	The optional \var{pos} and \var{endpos} parameters have the same
708	meaning as for the \method{match()} method.
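
A short sketch of how \method{search()} differs from
\method{match()}, and of a zero-length match, using invented sample
data:

\begin{verbatim}
import re

rx = re.compile('b+')

at_start = rx.match('aabbb')       # None: the string does not start with 'b'
anywhere = rx.search('aabbb')      # matches 'bbb'
found_at = anywhere.start()        # 2

# A zero-length match is still a match object, not None.
zero_length = re.compile('x*').search('abc')
# zero_length.group() is '' and zero_length.span() is (0, 0).
\end{verbatim}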
709	\end{methoddesc}
710
711	\begin{methoddesc}[RegexObject]{split}{string\optional{,
712	maxsplit\code{ = 0}}}
713	Identical to the \function{split()} function, using the compiled pattern.
714	\end{methoddesc}
715
716	\begin{methoddesc}[RegexObject]{findall}{string\optional{, pos\optional{,
717	endpos}}}
718	Identical to the \function{findall()} function, using the compiled pattern.
719	\end{methoddesc}
720
721	\begin{methoddesc}[RegexObject]{finditer}{string\optional{, pos\optional{,
722	endpos}}}
723	Identical to the \function{finditer()} function, using the compiled pattern.
724	\end{methoddesc}
725
726	\begin{methoddesc}[RegexObject]{sub}{repl, string\optional{, count\code{ = 0}}}
727	Identical to the \function{sub()} function, using the compiled pattern.
728	\end{methoddesc}
729
730	\begin{methoddesc}[RegexObject]{subn}{repl, string\optional{,
731	count\code{ = 0}}}
732	Identical to the \function{subn()} function, using the compiled pattern.
733	\end{methoddesc}
734
735
736	\begin{memberdesc}[RegexObject]{flags}
737	The flags argument used when the RE object was compiled, or
738	\code{0} if no flags were provided.
739	\end{memberdesc}
740
741	\begin{memberdesc}[RegexObject]{groupindex}
742	A dictionary mapping any symbolic group names defined by
743	\regexp{(?P<\var{id}>)} to group numbers. The dictionary is empty if no
744	symbolic groups were used in the pattern.
745	\end{memberdesc}
746
747	\begin{memberdesc}[RegexObject]{pattern}
748	The pattern string from which the RE object was compiled.
749	\end{memberdesc}
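
The attributes above might be inspected as in the following sketch.
The pattern is invented for the example, and the flag is tested with
a bit mask because the full value of \member{flags} may include
other bits:

\begin{verbatim}
import re

rx = re.compile(r'(?P<first>\w+) (?P<last>\w+)', re.IGNORECASE)

source = rx.pattern                              # the pattern string
ignores_case = bool(rx.flags & re.IGNORECASE)    # True
names = dict(rx.groupindex)                      # {'first': 1, 'last': 2}

# No named groups: the mapping is empty.
no_names = dict(re.compile(r'(\w+)').groupindex)
\end{verbatim}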
750
751
752	\subsection{Match Objects \label{match-objects}}
753
754	\class{MatchObject} instances support the following methods and
755	attributes:
756
757	\begin{methoddesc}[MatchObject]{expand}{template}
758	Return the string obtained by doing backslash substitution on the
759	template string \var{template}, as done by the \method{sub()} method.
760	Escapes such as \samp{\e n} are converted to the appropriate
761	characters, and numeric backreferences (\samp{\e 1}, \samp{\e 2}) and
762	named backreferences (\samp{\e g<1>}, \samp{\e g<name>}) are replaced
763	by the contents of the corresponding group.
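
For instance, assuming a match against an invented
\code{key=value} string:

\begin{verbatim}
import re

m = re.match(r'(?P<key>\w+)=(\w+)', 'colour=blue')

swapped = m.expand(r'\2=\1')              # 'blue=colour'
named = m.expand(r'\g<key> is \g<2>')     # 'colour is blue'
with_newline = m.expand(r'\1\n\2')        # 'colour', a newline, 'blue'
\end{verbatim}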
764	\end{methoddesc}
765
766	\begin{methoddesc}[MatchObject]{group}{\optional{group1, \moreargs}}
767	Returns one or more subgroups of the match. If there is a single
768	argument, the result is a single string; if there are
769	multiple arguments, the result is a tuple with one item per argument.
770	Without arguments, \var{group1} defaults to zero (the whole match
771	is returned).
772	If a \var{groupN} argument is zero, the corresponding return value is the
773	entire matching string; if it is in the inclusive range [1..99], it is
774	the string matching the corresponding parenthesized group. If a
775	group number is negative or larger than the number of groups defined
776	in the pattern, an \exception{IndexError} exception is raised.
777	If a group is contained in a part of the pattern that did not match,
778	the corresponding result is \code{None}. If a group is contained in a
779	part of the pattern that matched multiple times, the last match is
780	returned.
781
782	If the regular expression uses the \regexp{(?P<\var{name}>...)} syntax,
783	the \var{groupN} arguments may also be strings identifying groups by
784	their group name. If a string argument is not used as a group name in
785	the pattern, an \exception{IndexError} exception is raised.
786
787	A moderately complicated example:
788
789	\begin{verbatim}
790	m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
791	\end{verbatim}
792
793	After performing this match, \code{m.group(1)} is \code{'3'}, as is
794	\code{m.group('int')}, and \code{m.group(2)} is \code{'14'}.
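
The remaining cases described above can be sketched with the same
match object; the two extra patterns at the end are invented for
illustration:

\begin{verbatim}
import re

m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')

whole = m.group()            # '3.14' (same as m.group(0))
both = m.group(1, 2)         # ('3', '14')
mixed = m.group('int', 0)    # ('3', '3.14')

# An undefined group number raises IndexError.
try:
    m.group(3)
    raised = False
except IndexError:
    raised = True

# A group that matched several times reports its last match.
last = re.match(r'(..)+', 'a1b2c3').group(1)    # 'c3'

# A group in a branch that did not match yields None.
skipped = re.match(r'(a)|(b)', 'b').group(1)    # None
\end{verbatim}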
795	\end{methoddesc}
796
797	\begin{methoddesc}[MatchObject]{groups}{\optional{default}}
798	Return a tuple containing all the subgroups of the match, from 1 up to
799	however many groups are in the pattern. The \var{default} argument is
800	used for groups that did not participate in the match; it defaults to
801	\code{None}. (Incompatibility note: in the original Python 1.5
802	release, if the tuple was one element long, a string would be returned
803	instead. In later versions (from 1.5.1 on), a singleton tuple is
804	returned in such cases.)
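
A small sketch of the \var{default} argument, using a pattern whose
second group is optional (the pattern is an assumption made for the
example):

\begin{verbatim}
import re

m = re.match(r'(\d+)\.?(\d+)?', '24')

plain = m.groups()           # ('24', None)
filled = m.groups('0')       # ('24', '0')

# A pattern with one group still yields a tuple.
single = re.match(r'(\d+)', '24').groups()    # ('24',)
\end{verbatim}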
805	\end{methoddesc}
806
807	\begin{methoddesc}[MatchObject]{groupdict}{\optional{default}}
808	Return a dictionary containing all the \emph{named} subgroups of the
809	match, keyed by the subgroup name. The \var{default} argument is
810	used for groups that did not participate in the match; it defaults to
811	\code{None}.
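
For example, with an invented pattern whose second named group is
optional:

\begin{verbatim}
import re

m = re.match(r'(?P<first>\w+)(?: (?P<last>\w+))?', 'Malcolm')

names = m.groupdict()              # {'first': 'Malcolm', 'last': None}
with_default = m.groupdict('')     # {'first': 'Malcolm', 'last': ''}
\end{verbatim}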
812	\end{methoddesc}
813
814	\begin{methoddesc}[MatchObject]{start}{\optional{group}}
815	\methodline{end}{\optional{group}}
816	Return the indices of the start and end of the substring
817	matched by \var{group}; \var{group} defaults to zero (meaning the whole
818	matched substring).
819	Return \code{-1} if \var{group} exists but
820	did not contribute to the match. For a match object
821	\var{m}, and a group \var{g} that did contribute to the match, the
822	substring matched by group \var{g} (equivalent to
823	\code{\var{m}.group(\var{g})}) is
824
825	\begin{verbatim}
826	m.string[m.start(g):m.end(g)]
827	\end{verbatim}
828
829	Note that
830	\code{m.start(\var{group})} will equal \code{m.end(\var{group})} if
831	\var{group} matched a null string. For example, after \code{\var{m} =
832	re.search('b(c?)', 'cba')}, \code{\var{m}.start(0)} is 1,
833	\code{\var{m}.end(0)} is 2, \code{\var{m}.start(1)} and
834	\code{\var{m}.end(1)} are both 2, and \code{\var{m}.start(2)} raises
835	an \exception{IndexError} exception.
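
One common use of these indices is cutting the matched text out of
the original string. The sketch below uses an invented address and
also shows the \code{-1} result:

\begin{verbatim}
import re

email = 'tony@tiremove_thisger.net'
m = re.search('remove_this', email)

# Keep everything before and after the matched span.
cleaned = email[:m.start()] + email[m.end():]     # 'tony@tiger.net'

# A group that took no part in the match reports -1.
m2 = re.match(r'(a)|(b)', 'b')
unused_start = m2.start(1)    # -1
used_start = m2.start(2)      # 0
\end{verbatim}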
836	\end{methoddesc}
837
838	\begin{methoddesc}[MatchObject]{span}{\optional{group}}
839	For \class{MatchObject} \var{m}, return the 2-tuple
840	\code{(\var{m}.start(\var{group}), \var{m}.end(\var{group}))}.
841	Note that if \var{group} did not contribute to the match, this is
842	\code{(-1, -1)}. Again, \var{group} defaults to zero.
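
A brief sketch with an invented pattern in which only one of two
alternative groups can contribute:

\begin{verbatim}
import re

m = re.search(r'(\d+)|([a-z]+)', '... 42 ...')

whole = m.span()        # (4, 6)
digits = m.span(1)      # (4, 6)
letters = m.span(2)     # (-1, -1): group 2 did not contribute
\end{verbatim}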
843	\end{methoddesc}
844
845	\begin{memberdesc}[MatchObject]{pos}
846	The value of \var{pos} which was passed to the \function{search()} or
847	\function{match()} method of the \class{RegexObject}. This is the
848	index into the string at which the RE engine started looking for a
849	match.
850	\end{memberdesc}
851
852	\begin{memberdesc}[MatchObject]{endpos}
853	The value of \var{endpos} which was passed to the \function{search()}
854	or \function{match()} method of the \class{RegexObject}. This is the
855	index into the string beyond which the RE engine will not go.
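
The \member{pos} and \member{endpos} attributes simply record the
search bounds, as this sketch with invented data suggests:

\begin{verbatim}
import re

rx = re.compile(r'\w+')
m = rx.search('hello world', 6, 9)

matched = m.group()     # 'wor'
start_index = m.pos     # 6
stop_index = m.endpos   # 9
\end{verbatim}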
856	\end{memberdesc}
857
858	\begin{memberdesc}[MatchObject]{lastindex}
859	The integer index of the last matched capturing group, or \code{None}
860	if no group was matched at all. For example, the expressions
861	\regexp{(a)b}, \regexp{((a)(b))}, and \regexp{((ab))} will have
862	\code{lastindex == 1} if applied to the string \code{'ab'},
863	while the expression \regexp{(a)(b)} will have \code{lastindex == 2},
864	if applied to the same string.
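
The cases listed above can be checked directly; the last line adds a
pattern without groups as a further illustration:

\begin{verbatim}
import re

patterns = ['(a)b', '((a)(b))', '((ab))', '(a)(b)']
last_indexes = [re.match(p, 'ab').lastindex for p in patterns]
# last_indexes is [1, 1, 1, 2]

# No capturing group at all: lastindex is None.
none_matched = re.match('ab', 'ab').lastindex
\end{verbatim}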
865	\end{memberdesc}
866
867	\begin{memberdesc}[MatchObject]{lastgroup}
868	The name of the last matched capturing group, or \code{None} if the
869	group didn't have a name, or if no group was matched at all.
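
The three situations can be sketched as follows, with patterns
invented for the example:

\begin{verbatim}
import re

# The last matched group is named.
named_last = re.match(r'(\d+)-(?P<word>\w+)', '12-abc').lastgroup

# The last matched group has no name.
unnamed_last = re.match(r'(?P<word>\w+)-(\d+)', 'abc-12').lastgroup

# The pattern has no groups at all.
no_groups = re.match(r'\w+', 'abc').lastgroup

# named_last is 'word'; unnamed_last and no_groups are None.
\end{verbatim}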
870	\end{memberdesc}
871
872	\begin{memberdesc}[MatchObject]{re}
873	The regular expression object whose \method{match()} or
874	\method{search()} method produced this \class{MatchObject} instance.
875	\end{memberdesc}
876
877	\begin{memberdesc}[MatchObject]{string}
878	The string passed to \function{match()} or \function{search()}.
879	\end{memberdesc}
880
881	\subsection{Examples}
882
883	\leftline{\strong{Simulating \cfunction{scanf()}}}
884
885	Python does not currently have an equivalent to \cfunction{scanf()}.
886	\ttindex{scanf()}
887	Regular expressions are generally more powerful, though also more
888	verbose, than \cfunction{scanf()} format strings. The table below
889	offers some more-or-less equivalent mappings between
890	\cfunction{scanf()} format tokens and regular expressions.
891
\begin{tableii}{l|l}{textrm}{\cfunction{scanf()} Token}{Regular Expression}
893	\lineii{\code{\%c}}
894	{\regexp{.}}
895	\lineii{\code{\%5c}}
896	{\regexp{.\{5\}}}
897	\lineii{\code{\%d}}
898	{\regexp{[-+]?\e d+}}
899	\lineii{\code{\%e}, \code{\%E}, \code{\%f}, \code{\%g}}
  {\regexp{[-+]?(\e d+(\e.\e d*)?|\e.\e d+)([eE][-+]?\e d+)?}}
901	\lineii{\code{\%i}}
  {\regexp{[-+]?(0[xX][\e dA-Fa-f]+|0[0-7]*|\e d+)}}
903	\lineii{\code{\%o}}
904	{\regexp{0[0-7]*}}
905	\lineii{\code{\%s}}
906	{\regexp{\e S+}}
907	\lineii{\code{\%u}}
908	{\regexp{\e d+}}
909	\lineii{\code{\%x}, \code{\%X}}
910	{\regexp{0[xX][\e dA-Fa-f]+}}
911	\end{tableii}
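
As a rough sketch of how one table entry might be used, the
floating-point expression can be compiled and applied with
\method{finditer()}; the sample text is invented:

\begin{verbatim}
import re

# The %e/%E/%f/%g entry from the table, written as a raw string.
float_rx = re.compile(r'[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?')

text = 'values: 1.5e3 -.25 and 42'
numbers = [float(m.group(0)) for m in float_rx.finditer(text)]
# numbers is [1500.0, -0.25, 42.0]
\end{verbatim}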
912
913	To extract the filename and numbers from a string like
914
915	\begin{verbatim}
916	/usr/sbin/sendmail - 0 errors, 4 warnings
917	\end{verbatim}
918
919	you would use a \cfunction{scanf()} format like
920
921	\begin{verbatim}
922	%s - %d errors, %d warnings
923	\end{verbatim}
924
925	The equivalent regular expression would be
926
927	\begin{verbatim}
928	(\S+) - (\d+) errors, (\d+) warnings
929	\end{verbatim}
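
Applied in Python, the expression could be used roughly as follows;
the variable names are arbitrary:

\begin{verbatim}
import re

line = '/usr/sbin/sendmail - 0 errors, 4 warnings'
m = re.match(r'(\S+) - (\d+) errors, (\d+) warnings', line)

filename = m.group(1)              # '/usr/sbin/sendmail'
error_count = int(m.group(2))      # 0
warning_count = int(m.group(3))    # 4
\end{verbatim}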
930
931	\leftline{\strong{Avoiding recursion}}
932
933	If you create regular expressions that require the engine to perform a
934	lot of recursion, you may encounter a \exception{RuntimeError} exception with
the message \code{maximum recursion limit exceeded}. For example,
936
937	\begin{verbatim}
>>> import re
>>> s = 'Begin ' + 1000*'a very long string ' + 'end'
>>> re.match('Begin (\w| )*? end', s).end()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.5/re.py", line 132, in match
    return _compile(pattern, flags).match(string)
RuntimeError: maximum recursion limit exceeded
946	\end{verbatim}
947
948	You can often restructure your regular expression to avoid recursion.
949
950	Starting with Python 2.3, simple uses of the \regexp{*?} pattern are
951	special-cased to avoid recursion. Thus, the above regular expression
952	can avoid recursion by being recast as
953	\regexp{Begin [a-zA-Z0-9_ ]*?end}. As a further benefit, such regular
954	expressions will run faster than their recursive equivalents.
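
A sketch of the recast expression applied to the same test string
follows. Whether the original pattern actually raises the exception
depends on the interpreter version, so only the recast form is run
here:

\begin{verbatim}
import re

s = 'Begin ' + 1000*'a very long string ' + 'end'

# A character class with a simple non-greedy repeat replaces the
# repeated group of the original pattern.
end_index = re.match('Begin [a-zA-Z0-9_ ]*?end', s).end()
# end_index equals len(s): the match runs to the end of the string.
\end{verbatim}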
