% source: trunk/essentials/dev-lang/python/Doc/ref/ref2.tex@3400 (last changed in r3225 by bird; Python 2.5)
\chapter{Lexical analysis\label{lexical}}

A Python program is read by a \emph{parser}. Input to the parser is a
stream of \emph{tokens}, generated by the \emph{lexical analyzer}. This
chapter describes how the lexical analyzer breaks a file into tokens.
\index{lexical analysis}
\index{parser}
\index{token}
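
As a rough illustration of what a token stream looks like, the sketch
below uses the standard \module{tokenize} module. It assumes a Python~3
interpreter, which is newer than the version this chapter documents, and
the helper name \code{token_names} is made up for the example.
\module{tokenize} is a tooling interface, so its output only approximates
what the parser's own lexical analyzer produces.

```python
import io
import tokenize


def token_names(source):
    """Return the token type names that tokenize reports for source.

    token_names is a hypothetical helper, not part of any library.
    """
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type]
            for tok in tokenize.generate_tokens(readline)]


# One short logical line: a name, an operator, a number, and so on,
# followed by the NEWLINE that ends the logical line.
names = token_names("x = 1 + 2\n")
print(names)
```

The list ends with NEWLINE and ENDMARKER, the latter marking the end of
the input.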

Python uses the 7-bit \ASCII{} character set for program text.
\versionadded[An encoding declaration can be used to indicate that
string literals and comments use an encoding different from ASCII]{2.3}
For compatibility with older versions, Python only warns if it finds
8-bit characters; those warnings should be resolved either by declaring
an explicit encoding or, if the bytes are binary data rather than
characters, by using escape sequences.

The run-time character set depends on the I/O devices connected to the
program but is generally a superset of \ASCII.

\strong{Future compatibility note:} It may be tempting to assume that the
character set for 8-bit characters is ISO Latin-1 (an \ASCII{}
superset that covers most western languages that use the Latin
alphabet), but it is possible that in the future Unicode text editors
will become common. These generally use the UTF-8 encoding, which is
also an \ASCII{} superset, but with a very different use for the
characters with ordinals 128-255. As long as there is no consensus on
this subject, it is unwise to assume either Latin-1 or UTF-8, even
though the current implementation appears to favor Latin-1. This
applies both to the source character set and the run-time character
set.


\section{Line structure\label{line-structure}}

A Python program is divided into a number of \emph{logical lines}.
\index{line structure}


\subsection{Logical lines\label{logical}}

The end of a logical line is represented by the token NEWLINE.
Statements cannot cross logical line boundaries except where NEWLINE is
allowed by the syntax (e.g., between statements in compound statements).
A logical line is constructed from one or more \emph{physical lines}
by following the explicit or implicit \emph{line joining} rules.
\index{logical line}
\index{physical line}
\index{line joining}
\index{NEWLINE token}


\subsection{Physical lines\label{physical}}

A physical line is a sequence of characters terminated by an end-of-line
sequence. In source files, any of the standard platform line
termination sequences can be used: the \UNIX{} form using \ASCII{} LF
(linefeed), the Windows form using the \ASCII{} sequence CR LF (return
followed by linefeed), or the Macintosh form using the \ASCII{} CR
(return) character. All of these forms can be used equally, regardless
of platform.

When embedding Python, source code strings should be passed to Python
APIs using the standard C conventions for newline characters (the
\code{\e n} character, representing \ASCII{} LF, is the line
terminator).
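
The three termination sequences can be sketched with a small regular
expression. This is only an illustration written for Python~3, not the
interpreter's own reader; the names \code{EOL} and
\code{physical_lines} are invented for the example.

```python
import re

# CR LF must be tried before a lone CR so that the Windows sequence
# counts as one terminator, not two.
EOL = re.compile(r"\r\n|\r|\n")


def physical_lines(text):
    """Split text into physical lines on LF, CR LF, or CR.

    physical_lines is a hypothetical helper for illustration only.
    """
    lines = EOL.split(text)
    # A terminator at the very end leaves an empty final piece; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


mixed = "a = 1\nb = 2\r\nc = 3\rd = 4\n"
result = physical_lines(mixed)
print(result)
```

All three forms yield the same four physical lines, which mirrors the
statement that they can be used equally regardless of platform.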


\subsection{Comments\label{comments}}

A comment starts with a hash character (\code{\#}) that is not part of
a string literal, and ends at the end of the physical line. A comment
signifies the end of the logical line unless the implicit line joining
rules are invoked.
Comments are ignored by the syntax; they are not tokens.
\index{comment}
\index{hash character}
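
The difference between a hash character inside a string literal and one
that starts a comment can be checked with the \module{tokenize} module.
The sketch assumes Python~3, and \code{find_comments} is a name made up
for the example. Note that \module{tokenize} reports comments with its
own COMMENT token type for the benefit of tools; the parser itself never
sees them.

```python
import io
import tokenize


def find_comments(source):
    """Return the text of every comment that tokenize finds in source.

    find_comments is a hypothetical helper for illustration only.
    """
    readline = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.COMMENT]


# The first hash is part of a string literal; only the second one
# starts a comment, which runs to the end of the physical line.
source = 's = "# not a comment"  # a real comment\n'
found = find_comments(source)
print(found)
```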


\subsection{Encoding declarations\label{encodings}}
\index{source character set}
\index{encodings}

If a comment in the first or second line of the Python script matches
the regular expression \regexp{coding[=:]\e s*([-\e w.]+)}, this comment is
processed as an encoding declaration; the first group of this
expression names the encoding of the source code file. The recommended
forms of this expression are

\begin{verbatim}
# -*- coding: <encoding-name> -*-
\end{verbatim}

which is also recognized by GNU Emacs, and

\begin{verbatim}
# vim:fileencoding=<encoding-name>
\end{verbatim}

which is recognized by Bram Moolenaar's VIM. In addition, if the first
bytes of the file are the UTF-8 byte-order mark
(\code{'\e xef\e xbb\e xbf'}), the declared file encoding is UTF-8
(this is supported, among others, by Microsoft's \program{notepad}).

If an encoding is declared, the encoding name must be recognized by
Python. % XXX there should be a list of supported encodings.
The encoding is used for all lexical analysis, in particular to find
the end of a string, and to interpret the contents of Unicode literals.
String literals are converted to Unicode for syntactical analysis,
then converted back to their original encoding before interpretation
starts. The encoding declaration must appear on a line of its own.
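
The detection rules above can be sketched as follows. This is a
simplified illustration for Python~3 and not the interpreter's actual
code: \code{declared_encoding} and \code{CODING_RE} are names invented
here, the comment is located by a naive search for the first hash
character, and the rule that the declaration must be on a line of its
own is not checked.

```python
import codecs
import re

# The regular expression quoted in the text above.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")


def declared_encoding(source_bytes):
    """Return the encoding named by a BOM or coding comment, else None.

    declared_encoding is a hypothetical helper for illustration only.
    """
    # A UTF-8 byte-order mark at the start declares UTF-8.
    if source_bytes.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Only the first two physical lines are examined.
    for line in source_bytes.splitlines()[:2]:
        text = line.decode("ascii", "replace")
        hash_pos = text.find("#")  # naive: ignores string literals
        if hash_pos == -1:
            continue
        match = CODING_RE.search(text, hash_pos)
        if match:
            return match.group(1)
    return None


emacs_style = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nx = 1\n"
vim_style = b"# vim:fileencoding=utf-8\nx = 1\n"
print(declared_encoding(emacs_style), declared_encoding(vim_style))
```

A declaration on the third line or later is ignored by this sketch, in
line with the first-or-second-line rule.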

\subsection{Explicit line joining\label{explicit-joining}}

Two or more physical lines may be joined into logical lines using
backslash characters (\code{\e}), as follows: when a physical line ends
in a backslash that is not part of a string literal or comment, it is
joined with the following line to form a single logical line, deleting
the backslash and the following end-of-line character. For example:
\index{physical line}
\index{line joining}
\index{line continuation}
\index{backslash character}
%
\begin{verbatim}
if 1900 < year < 2100 and 1 <= month <= 12 \
   and 1 <= day <= 31 and 0 <= hour < 24 \
   and 0 <= minute < 60 and 0 <= second < 60:   # Looks like a valid date
        return 1
\end{verbatim}

A line ending in a backslash cannot carry a comment. A backslash does
not continue a comment. A backslash does not continue a token except
for string literals (i.e., tokens other than string literals cannot be
split across physical lines using a backslash). A backslash is
illegal elsewhere on a line outside a string literal.
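
Two of these rules can be checked directly with the built-in
\function{compile()}. The sketch assumes Python~3, and
\code{is_valid} is a helper name made up for the example.

```python
def is_valid(source):
    """Return True if source compiles, False on a SyntaxError.

    is_valid is a hypothetical helper for illustration only.
    """
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# The backslash is the last character of the first physical line, so
# the two physical lines form one logical line.
joined = "total = 1 + \\\n        2\n"

# Here a comment follows the backslash, so the line no longer ends in
# a backslash and the stray backslash is illegal.
comment_after_backslash = "total = 1 + \\ # comment\n        2\n"

namespace = {}
exec(compile(joined, "<example>", "exec"), namespace)
print(namespace["total"], is_valid(comment_after_backslash))
```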


\subsection{Implicit line joining\label{implicit-joining}}

Expressions in parentheses, square brackets or curly braces can be
split over more than one physical line without using backslashes.
For example:

\begin{verbatim}
month_names = ['Januari', 'Februari', 'Maart',      # These are the
               'April',   'Mei',      'Juni',       # Dutch names
               'Juli',    'Augustus', 'September',  # for the months
               'Oktober', 'November', 'December']   # of the year
\end{verbatim}

Implicitly continued lines can carry comments. The indentation of the
continuation lines is not important. Blank continuation lines are
allowed. There is no NEWLINE token between implicit continuation
lines. Implicitly continued lines can also occur within triple-quoted
strings (see below); in that case they cannot carry comments.
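
The absence of NEWLINE tokens inside brackets can be observed with the
\module{tokenize} module. The sketch assumes Python~3;
\code{count_tokens} is a name invented for the example, and
\module{tokenize} only approximates the parser's own lexical analyzer.

```python
import io
import tokenize


def count_tokens(source, token_type):
    """Count the tokens of one type that tokenize reports for source.

    count_tokens is a hypothetical helper for illustration only.
    """
    readline = io.StringIO(source).readline
    return sum(1 for tok in tokenize.generate_tokens(readline)
               if tok.type == token_type)


# Three physical lines, one of them blank, inside square brackets.
# Both continued lines carry a comment.
source = (
    "month_names = ['Januari', 'Februari',  # first two\n"
    "\n"
    "        'Maart']  # third\n"
)

newline_count = count_tokens(source, tokenize.NEWLINE)
comment_count = count_tokens(source, tokenize.COMMENT)
print(newline_count, comment_count)
```

Only the end of the last physical line produces a NEWLINE token, so the
three physical lines make up a single logical line.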


\subsection{Blank lines\label{blank-lines}}

\index{blank line}
A logical line that contains only spaces, tabs, formfeeds and possibly
a comment, is ignored (i.e., no NEWLINE token is generated). During
interactive input of statements, handling of a blank line may differ
depending on the implementation of the read-eval-print loop. In the
standard implementation, an entirely blank logical line (i.e.\ one
containing not even whitespace or a comment) terminates a multi-line
statement.
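
That blank lines generate no NEWLINE token can be checked with the
\module{tokenize} module, here under the assumption of a Python~3
interpreter. \code{newline_token_count} is a name made up for the
example.

```python
import io
import tokenize


def newline_token_count(source):
    """Count the NEWLINE tokens that tokenize reports for source.

    newline_token_count is a hypothetical helper for illustration only.
    """
    readline = io.StringIO(source).readline
    return sum(1 for tok in tokenize.generate_tokens(readline)
               if tok.type == tokenize.NEWLINE)


# Five physical lines: two statements separated by an empty line, a
# line holding only spaces, and a line holding only a comment.
source = "x = 1\n\n   \n# only a comment\ny = 2\n"
count = newline_token_count(source)
print(count)
```

Only the two statements end in NEWLINE; the three lines in between are
ignored in that respect.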


\subsection{Indentation\label{indentation}}

Leading whitespace (spaces and tabs) at the beginning of a logical
line is used to compute the indentation level of the line, which in
turn is used to determine the grouping of statements.
\index{indentation}
\index{whitespace}
\index{leading whitespace}
\index{space}
\index{tab}
\index{grouping}
\index{statement grouping}

First, tabs are replaced (from left to right) by one to eight spaces
such that the total number of characters up to and including the
replacement is a multiple of eight (this is intended to be the same
rule as used by \UNIX). The total number of spaces preceding the first
non-blank character then determines the line's indentation.
Indentation cannot be split over multiple physical lines using
backslashes; the whitespace up to the first backslash determines the
indentation.
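
The tab replacement rule can be written out as a short function. This
is a minimal sketch of the rule as stated, not the interpreter's code:
\code{indentation_width} and its \code{tabsize} parameter are invented
for the example, and formfeed characters are not handled.

```python
def indentation_width(line, tabsize=8):
    """Return the indentation of line, counted in spaces.

    indentation_width is a hypothetical helper for illustration only.
    """
    width = 0
    for ch in line:
        if ch == " ":
            width += 1
        elif ch == "\t":
            # A tab stands for one to eight spaces: advance to the
            # next multiple of tabsize.
            width = (width // tabsize + 1) * tabsize
        else:
            # First non-blank character: the indentation is complete.
            break
    return width


print(indentation_width("\tx"))          # a tab alone reaches column 8
print(indentation_width("  \tx"))        # two spaces plus a tab: still 8
print(indentation_width("        \tx"))  # eight spaces plus a tab: 16
print(indentation_width("\t  x"))        # a tab plus two spaces: 10
```

For leading whitespace made of spaces and tabs only, the result agrees
with the length of the prefix after \code{str.expandtabs(8)}.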

\strong{Cross-platform compatibility note:} Because of the nature of
text editors on non-UNIX platforms, it is unwise to use a mixture of
spaces and tabs for the indentation in a single source file. It
should also be noted that different platforms may explicitly limit the
maximum indentation level.

A formfeed character may be present at the start of the line; it will
be ignored for the indentation calculations above. Formfeed
characters occurring elsewhere in the leading whitespace have an
undefined effect (for instance, they may reset the space count to
zero).

The indentation levels of consecutive lines are used to generate
INDENT and DEDENT tokens, using a stack, as follows.
\index{INDENT token}