ref2.tex@ 3400

Last change on this file since 3400 was 3225, checked in by bird, 19 years ago
Python 2.5
File size: 28.1 KB

1	\chapter{Lexical analysis\label{lexical}}
2
3	A Python program is read by a \emph{parser}. Input to the parser is a
4	stream of \emph{tokens}, generated by the \emph{lexical analyzer}. This
5	chapter describes how the lexical analyzer breaks a file into tokens.
6	\index{lexical analysis}
7	\index{parser}
8	\index{token}
9
10	Python uses the 7-bit \ASCII{} character set for program text.
11	\versionadded[An encoding declaration can be used to indicate that
12	string literals and comments use an encoding different from ASCII]{2.3}
13	For compatibility with older versions, Python only warns if it finds
14	8-bit characters; those warnings should be corrected by either declaring
15	an explicit encoding, or using escape sequences if those bytes are binary
16	data, instead of characters.
17
18
19	The run-time character set depends on the I/O devices connected to the
20	program but is generally a superset of \ASCII.
21
22	\strong{Future compatibility note:} It may be tempting to assume that the
23	character set for 8-bit characters is ISO Latin-1 (an \ASCII{}
24	superset that covers most western languages that use the Latin
25	alphabet), but it is possible that in the future Unicode text editors
26	will become common. These generally use the UTF-8 encoding, which is
27	also an \ASCII{} superset, but with very different use for the
28	characters with ordinals 128--255. While there is no consensus on this
29	subject yet, it is unwise to assume either Latin-1 or UTF-8, even
30	though the current implementation appears to favor Latin-1. This
31	applies both to the source character set and the run-time character
32	set.
33
34
35	\section{Line structure\label{line-structure}}
36
37	A Python program is divided into a number of \emph{logical lines}.
38	\index{line structure}
39
40
41	\subsection{Logical lines\label{logical}}
42
43	The end of
44	a logical line is represented by the token NEWLINE. Statements cannot
45	cross logical line boundaries except where NEWLINE is allowed by the
46	syntax (e.g., between statements in compound statements).
47	A logical line is constructed from one or more \emph{physical lines}
48	by following the explicit or implicit \emph{line joining} rules.
49	\index{logical line}
50	\index{physical line}
51	\index{line joining}
52	\index{NEWLINE token}
53
54
55	\subsection{Physical lines\label{physical}}
56
57	A physical line is a sequence of characters terminated by an end-of-line
58	sequence. In source files, any of the standard platform line
59	termination sequences can be used: the \UNIX{} form using \ASCII{} LF
60	(linefeed), the Windows form using the \ASCII{} sequence CR LF (return
61	followed by linefeed), or the Macintosh form using the \ASCII{} CR
62	(return) character. All of these forms can be used equally, regardless
63	of platform.
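
As a rough illustration of the three termination conventions, the sketch below splits one string that mixes all of them. It uses \code{str.splitlines()}, which is not the lexical analyzer but happens to recognize the same end-of-line sequences; the sample source text is made up for the example.

```python
# Illustrative only: str.splitlines() is not the tokenizer, but it
# treats LF, CR LF and a lone CR as line terminators.
source = "a = 1\nb = 2\r\nc = 3\rd = 4"

physical_lines = source.splitlines()
print(physical_lines)
```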
64
65	When embedding Python, source code strings should be passed to Python
66	APIs using the standard C conventions for newline characters (the
67	\code{\e n} character, representing \ASCII{} LF, is the line
68	terminator).
69
70
71	\subsection{Comments\label{comments}}
72
73	A comment starts with a hash character (\code{\#}) that is not part of
74	a string literal, and ends at the end of the physical line. A comment
75	signifies the end of the logical line unless the implicit line joining
76	rules are invoked.
77	Comments are ignored by the syntax; they are not tokens.
78	\index{comment}
79	\index{hash character}
80
81
82	\subsection{Encoding declarations\label{encodings}}
83	\index{source character set}
84	\index{encodings}
85
86	If a comment in the first or second line of the Python script matches
87	the regular expression \regexp{coding[=:]\e s*([-\e w.]+)}, this comment is
88	processed as an encoding declaration; the first group of this
89	expression names the encoding of the source code file. The recommended
90	forms of this expression are
91
92	\begin{verbatim}
93	# -*- coding: <encoding-name> -*-
94	\end{verbatim}
95
96	which is also recognized by GNU Emacs, and
97
98	\begin{verbatim}
99	# vim:fileencoding=<encoding-name>
100	\end{verbatim}
101
102	which is recognized by Bram Moolenaar's VIM. In addition, if the first
103	bytes of the file are the UTF-8 byte-order mark
104	(\code{'\e xef\e xbb\e xbf'}), the declared file encoding is UTF-8
105	(this is supported, among others, by Microsoft's \program{notepad}).
106
107	If an encoding is declared, the encoding name must be recognized by
108	Python. % XXX there should be a list of supported encodings.
109	The encoding is used for all lexical analysis, in particular to find
110	the end of a string, and to interpret the contents of Unicode literals.
111	String literals are converted to Unicode for syntactical analysis,
112	then converted back to their original encoding before interpretation
113	starts. The encoding declaration must appear on a line of its own.
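
The matching rule can be tried out with the \code{re} module. This is a minimal sketch, not the interpreter's own code: the helper name is hypothetical, and it only applies the regular expression quoted above to a single line, without checking that the line is a comment or that it is the first or second line of the file.

```python
import re

# The pattern quoted in the text, written as a Python raw string.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")


def declared_encoding(comment_line):
    """Return the encoding named in a comment line, or None (hypothetical helper)."""
    match = CODING_RE.search(comment_line)
    return match.group(1) if match else None


print(declared_encoding("# -*- coding: utf-8 -*-"))
print(declared_encoding("# vim:fileencoding=latin-1"))
print(declared_encoding("# an ordinary comment"))
```

Both recommended forms match because the pattern is searched for anywhere in the line, so the \code{coding=} inside \code{fileencoding=} is enough.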
114
115	\subsection{Explicit line joining\label{explicit-joining}}
116
117	Two or more physical lines may be joined into logical lines using
118	backslash characters (\code{\e}), as follows: when a physical line ends
119	in a backslash that is not part of a string literal or comment, it is
120	joined with the following line to form a single logical line, deleting the
121	backslash and the following end-of-line character. For example:
122	\index{physical line}
123	\index{line joining}
124	\index{line continuation}
125	\index{backslash character}
126	%
127	\begin{verbatim}
128	if 1900 < year < 2100 and 1 <= month <= 12 \
129	   and 1 <= day <= 31 and 0 <= hour < 24 \
130	   and 0 <= minute < 60 and 0 <= second < 60:   # Looks like a valid date
131	        return 1
132	\end{verbatim}
133
134	A line ending in a backslash cannot carry a comment. A backslash does
135	not continue a comment. A backslash does not continue a token except
136	for string literals (i.e., tokens other than string literals cannot be
137	split across physical lines using a backslash). A backslash is
138	illegal elsewhere on a line outside a string literal.
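
The effect of a backslash on the token stream can be observed with the standard \code{tokenize} module. The sketch below assumes a current Python 3 interpreter (it relies on \code{io.StringIO}, which this version of the manual predates); the helper and the sample statement are made up for the example.

```python
import io
import tokenize


def token_names(source):
    """Return the token type names produced for a source string (hypothetical helper)."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type]
            for tok in tokenize.generate_tokens(readline)]


# Two physical lines joined by a trailing backslash.
joined = "total = 1 + \\\n        2\n"
names = token_names(joined)
print(names)
```

Only one NEWLINE token appears, at the end of the second physical line, so the two physical lines form a single logical line.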
139
140
141	\subsection{Implicit line joining\label{implicit-joining}}
142
143	Expressions in parentheses, square brackets or curly braces can be
144	split over more than one physical line without using backslashes.
145	For example:
146
147	\begin{verbatim}
148	month_names = ['Januari', 'Februari', 'Maart',      # These are the
149	               'April',   'Mei',      'Juni',       # Dutch names
150	               'Juli',    'Augustus', 'September',  # for the months
151	               'Oktober', 'November', 'December']   # of the year
152	\end{verbatim}
153
154	Implicitly continued lines can carry comments. The indentation of the
155	continuation lines is not important. Blank continuation lines are
156	allowed. There is no NEWLINE token between implicit continuation
157	lines. Implicitly continued lines can also occur within triple-quoted
158	strings (see below); in that case they cannot carry comments.
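
A similar experiment shows that a line break inside brackets does not end the logical line. This is a sketch assuming a current Python 3 interpreter. Its \code{tokenize} module reports the break inside the brackets as a separate NL token, which is a detail of that module and not part of the description above.

```python
import io
import tokenize

# A list display split over two physical lines, with a comment on the first.
source = "names = ['a',  # first\n         'b']\n"

readline = io.StringIO(source).readline
kinds = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(readline)]
print(kinds)
```

The comment is accepted on the continued line, and the only NEWLINE token follows the closing bracket.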
159
160
161	\subsection{Blank lines \label{blank-lines}}
162
163	\index{blank line}
164	A logical line that contains only spaces, tabs, formfeeds and possibly
165	a comment is ignored (i.e., no NEWLINE token is generated). During
166	interactive input of statements, handling of a blank line may differ
167	depending on the implementation of the read-eval-print loop. In the
168	standard implementation, an entirely blank logical line (i.e.\ one
169	containing not even whitespace or a comment) terminates a multi-line
170	statement.
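
The rule for non-interactive input can be checked by counting NEWLINE tokens. This sketch again assumes a current Python 3 interpreter and its \code{tokenize} module, which marks ignored lines with NL tokens; the sample program is invented for the example.

```python
import io
import tokenize

# Two statements separated by an empty line and a comment-only line.
program = "a = 1\n\n   # only a comment\nb = 2\n"

readline = io.StringIO(program).readline
newline_count = sum(
    1 for tok in tokenize.generate_tokens(readline)
    if tok.type == tokenize.NEWLINE
)
print(newline_count)
```

Four physical lines yield two NEWLINE tokens, one per statement; the blank line and the comment-only line contribute none.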
171
172
173	\subsection{Indentation\label{indentation}}
174
175	Leading whitespace (spaces and tabs) at the beginning of a logical
176	line is used to compute the indentation level of the line, which in
177	turn is used to determine the grouping of statements.
178	\index{indentation}
179	\index{whitespace}
180	\index{leading whitespace}
181	\index{space}
182	\index{tab}
183	\index{grouping}
184	\index{statement grouping}
185
186	First, tabs are replaced (from left to right) by one to eight spaces
187	such that the total number of characters up to and including the
188	replacement is a multiple of
189	eight (this is intended to be the same rule as used by \UNIX). The
190	total number of spaces preceding the first non-blank character then
191	determines the line's indentation. Indentation cannot be split over
192	multiple physical lines using backslashes; the whitespace up to the
193	first backslash determines the indentation.
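
The tab rule above can be sketched as a small function. The function name and the \code{tabsize} parameter are hypothetical, and the sketch covers only spaces and tabs: it does not model formfeeds or backslash continuation.

```python
def indentation_width(line, tabsize=8):
    """Compute the indentation of a line under the tab rule described above.

    Each tab advances the column to the next multiple of `tabsize`;
    each space advances it by one. Scanning stops at the first
    character that is neither a space nor a tab.
    """
    column = 0
    for ch in line:
        if ch == " ":
            column += 1
        elif ch == "\t":
            column = (column // tabsize + 1) * tabsize
        else:
            break
    return column


print(indentation_width("    x = 1"))     # four spaces
print(indentation_width("\tx = 1"))       # one tab
print(indentation_width("  \tx = 1"))     # two spaces, then a tab
print(indentation_width("\t  x = 1"))     # a tab, then two spaces
```

Two spaces followed by a tab give the same indentation as a single tab, which is one reason mixing the two makes indentation hard to judge by eye.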
194
195	\strong{Cross-platform compatibility note:} because of the nature of
196	text editors on non-UNIX platforms, it is unwise to use a mixture of
197	spaces and tabs for the indentation in a single source file. It
198	should also be noted that different platforms may explicitly limit the
199	maximum indentation level.
200
201	A formfeed character may be present at the start of the line; it will
202	be ignored for the indentation calculations above. Formfeed
203	characters occurring elsewhere in the leading whitespace have an
204	undefined effect (for instance, they may reset the space count to
205	zero).
206
207	The indentation levels of consecutive lines are used to generate
208	INDENT and DEDENT tokens, using a stack, as follows.
209	\index{INDENT token}