\chapter{Lexical analysis\label{lexical}}

A Python program is read by a \emph{parser}. Input to the parser is a
stream of \emph{tokens}, generated by the \emph{lexical analyzer}. This
chapter describes how the lexical analyzer breaks a file into tokens.
\index{lexical analysis}
\index{parser}
\index{token}

Python uses the 7-bit \ASCII{} character set for program text.
\versionadded[An encoding declaration can be used to indicate that
string literals and comments use an encoding different from \ASCII]{2.3}
For compatibility with older versions, Python only warns if it finds
8-bit characters. Such warnings should be resolved either by declaring
an explicit encoding or, if the bytes are binary data rather than
characters, by using escape sequences.


The run-time character set depends on the I/O devices connected to the
program but is generally a superset of \ASCII.

\strong{Future compatibility note:} It may be tempting to assume that the
character set for 8-bit characters is ISO Latin-1 (an \ASCII{}
superset that covers most western languages that use the Latin
alphabet), but it is possible that in the future Unicode text editors
will become common. These generally use the UTF-8 encoding, which is
also an \ASCII{} superset, but with very different use for the
characters with ordinals 128--255. While there is no consensus on this
subject yet, it is unwise to assume either Latin-1 or UTF-8, even
though the current implementation appears to favor Latin-1. This
applies both to the source character set and the run-time character
set.


\section{Line structure\label{line-structure}}

A Python program is divided into a number of \emph{logical lines}.
\index{line structure}


\subsection{Logical lines\label{logical}}

The end of a logical line is represented by the token NEWLINE. Statements
cannot cross logical line boundaries except where NEWLINE is allowed by the
syntax (e.g., between statements in compound statements).
A logical line is constructed from one or more \emph{physical lines}
by following the explicit or implicit \emph{line joining} rules.
\index{logical line}
\index{physical line}
\index{line joining}
\index{NEWLINE token}


\subsection{Physical lines\label{physical}}

A physical line is a sequence of characters terminated by an end-of-line
sequence. In source files, any of the standard platform line
termination sequences can be used: the \UNIX{} form using \ASCII{} LF
(linefeed), the Windows form using the \ASCII{} sequence CR LF (return
followed by linefeed), or the Macintosh form using the \ASCII{} CR
(return) character. All of these forms can be used equally, regardless
of platform.

When embedding Python, source code strings should be passed to Python
APIs using the standard C conventions for newline characters (the
\code{\e n} character, representing \ASCII{} LF, is the line
terminator).

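The three line-termination conventions described in this subsection can
be illustrated with a short sketch. The helper name
\code{split\_physical\_lines} is hypothetical and not part of Python, and
the code assumes a current Python 3 interpreter; it only approximates
what the lexical analyzer does when it reads a source file.

\begin{verbatim}
import re

# CR LF must be tried before a bare CR so the pair counts as one terminator.
_EOL = re.compile(r"\r\n|\r|\n")

def split_physical_lines(source):
    """Return the physical lines of source, without their terminators."""
    lines = _EOL.split(source)
    # A terminator at the very end closes the last line; it does not
    # start a new, empty one.
    if lines and lines[-1] == "":
        lines.pop()
    return lines

# One string mixing the UNIX, Windows and Macintosh forms.
print(split_physical_lines("a = 1\nb = 2\r\nc = 3\rd = 4"))
\end{verbatim}
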

\subsection{Comments\label{comments}}

A comment starts with a hash character (\code{\#}) that is not part of
a string literal, and ends at the end of the physical line. A comment
signifies the end of the logical line unless the implicit line joining
rules are invoked.
Comments are ignored by the syntax; they are not tokens.
\index{comment}
\index{hash character}


\subsection{Encoding declarations\label{encodings}}
\index{source character set}
\index{encodings}

If a comment in the first or second line of the Python script matches
the regular expression \regexp{coding[=:]\e s*([-\e w.]+)}, this comment is
processed as an encoding declaration; the first group of this
expression names the encoding of the source code file. The recommended
forms of this expression are

\begin{verbatim}
# -*- coding: <encoding-name> -*-
\end{verbatim}

which is also recognized by GNU Emacs, and

\begin{verbatim}
# vim:fileencoding=<encoding-name>
\end{verbatim}

which is recognized by Bram Moolenaar's VIM. In addition, if the first
bytes of the file are the UTF-8 byte-order mark
(\code{'\e xef\e xbb\e xbf'}), the declared file encoding is UTF-8
(this is supported, among others, by Microsoft's \program{notepad}).

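The matching rule can be tried out with the following sketch, which
assumes a current Python 3 interpreter. The names \code{CODING\_RE} and
\code{declared\_encoding} are hypothetical. The sketch also simplifies
matters by treating everything after the first hash character as the
comment, so a hash inside a string literal would mislead it; the real
lexical analyzer does not have this limitation.

\begin{verbatim}
import codecs
import re

# The pattern quoted above, written as a Python raw string.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")

def declared_encoding(first_two_lines, leading_bytes=b""):
    """Return the declared encoding name, or None if there is none."""
    if leading_bytes.startswith(codecs.BOM_UTF8):  # b'\xef\xbb\xbf'
        return "utf-8"
    for line in first_two_lines[:2]:
        # Only text inside a comment counts, so look after the hash.
        _, hash_char, comment = line.partition("#")
        match = CODING_RE.search(comment) if hash_char else None
        if match:
            return match.group(1)
    return None

print(declared_encoding(["#!/usr/bin/python", "# -*- coding: latin-1 -*-"]))
print(declared_encoding(["# vim:fileencoding=utf-8"]))
\end{verbatim}
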
If an encoding is declared, the encoding name must be recognized by
Python. % XXX there should be a list of supported encodings.
The encoding is used for all lexical analysis, in particular to find
the end of a string, and to interpret the contents of Unicode literals.
String literals are converted to Unicode for syntactical analysis,
then converted back to their original encoding before interpretation
starts. The encoding declaration must appear on a line of its own.

\subsection{Explicit line joining\label{explicit-joining}}

Two or more physical lines may be joined into logical lines using
backslash characters (\code{\e}), as follows: when a physical line ends
in a backslash that is not part of a string literal or comment, it is
joined with the following line to form a single logical line, deleting
the backslash and the following end-of-line character. For example:
\index{physical line}
\index{line joining}
\index{line continuation}
\index{backslash character}
%
\begin{verbatim}
if 1900 < year < 2100 and 1 <= month <= 12 \
   and 1 <= day <= 31 and 0 <= hour < 24 \
   and 0 <= minute < 60 and 0 <= second < 60:   # Looks like a valid date
        return 1
\end{verbatim}

A line ending in a backslash cannot carry a comment. A backslash does
not continue a comment. A backslash does not continue a token except
for string literals (i.e., tokens other than string literals cannot be
split across physical lines using a backslash). A backslash is
illegal elsewhere on a line outside a string literal.

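These restrictions can be checked with the built-in \function{compile()}
function. The sketch below assumes a current Python 3 interpreter, and
\code{compiles} is a hypothetical helper written for this illustration.
The first source string is expected to be accepted and the other two
rejected.

\begin{verbatim}
def compiles(source):
    """Return True if source is accepted by the compiler."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True

# A backslash immediately before the end of line joins the two lines.
print(compiles("x = 1 + \\\n    2\n"))
# Anything after the backslash, even a comment, is rejected.
print(compiles("x = 1 + \\  # comment\n    2\n"))
# The backslash does not glue the two halves of a number back together.
print(compiles("x = 12\\\n34\n"))
\end{verbatim}
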

\subsection{Implicit line joining\label{implicit-joining}}

Expressions in parentheses, square brackets or curly braces can be
split over more than one physical line without using backslashes.
For example:

\begin{verbatim}
month_names = ['Januari', 'Februari', 'Maart',      # These are the
               'April',   'Mei',      'Juni',       # Dutch names
               'Juli',    'Augustus', 'September',  # for the months
               'Oktober', 'November', 'December']   # of the year
\end{verbatim}

Implicitly continued lines can carry comments. The indentation of the
continuation lines is not important. Blank continuation lines are
allowed. There is no NEWLINE token between implicit continuation
lines. Implicitly continued lines can also occur within triple-quoted
strings (see below); in that case they cannot carry comments.

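The absence of a NEWLINE token inside brackets can be observed with the
standard \module{tokenize} module. This is a sketch assuming a current
Python 3 interpreter, and \code{token\_names} is a hypothetical helper.
Note that \module{tokenize} also reports \code{COMMENT} and \code{NL}
entries for the benefit of source-processing tools; these are not tokens
of the grammar described in this chapter.

\begin{verbatim}
import io
import tokenize

def token_names(source):
    """Return the names of the tokens that tokenize reports for source."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type]
            for tok in tokenize.generate_tokens(readline)]

# Two physical lines, one logical line: the break falls inside brackets.
names = token_names("x = [1,  # first\n     2]\n")
print(names)
print(names.count("NEWLINE"))
\end{verbatim}
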

\subsection{Blank lines\label{blank-lines}}

\index{blank line}
A logical line that contains only spaces, tabs, formfeeds and possibly
a comment is ignored (i.e., no NEWLINE token is generated). During
interactive input of statements, handling of a blank line may differ
depending on the implementation of the read-eval-print loop. In the
standard implementation, an entirely blank logical line (i.e.\ one
containing not even whitespace or a comment) terminates a multi-line
statement.

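As a rough check of this rule, the following sketch counts the NEWLINE
tokens reported by the standard \module{tokenize} module for a source
string that contains an empty line and a comment-only line between two
statements. It assumes a current Python 3 interpreter, and
\code{count\_newline\_tokens} is a hypothetical helper. Only the two
statements should contribute a NEWLINE token.

\begin{verbatim}
import io
import tokenize

def count_newline_tokens(source):
    """Count the NEWLINE tokens that tokenize reports for source."""
    readline = io.StringIO(source).readline
    return sum(1 for tok in tokenize.generate_tokens(readline)
               if tok.type == tokenize.NEWLINE)

# Four physical lines, but only two of them hold a statement.
print(count_newline_tokens("x = 1\n\n   # just a comment\ny = 2\n"))
\end{verbatim}
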

\subsection{Indentation\label{indentation}}

Leading whitespace (spaces and tabs) at the beginning of a logical
line is used to compute the indentation level of the line, which in
turn is used to determine the grouping of statements.
\index{indentation}
\index{whitespace}
\index{leading whitespace}
\index{space}
\index{tab}
\index{grouping}
\index{statement grouping}

First, tabs are replaced (from left to right) by one to eight spaces
such that the total number of characters up to and including the
replacement is a multiple of eight (this is intended to be the same
rule as used by \UNIX). The total number of spaces preceding the first
non-blank character then determines the line's indentation. Indentation
cannot be split over multiple physical lines using backslashes; the
whitespace up to the first backslash determines the indentation.

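The tab replacement rule amounts to the small computation sketched
below. The function \code{indentation} and its \code{tabsize} parameter
are hypothetical names chosen for this illustration, the code assumes a
current Python 3 interpreter, and formfeed characters are deliberately
left out.

\begin{verbatim}
def indentation(line, tabsize=8):
    """Return the indentation column of line, expanding tabs as above."""
    column = 0
    for char in line:
        if char == " ":
            column += 1
        elif char == "\t":
            # Advance to the next multiple of tabsize (one to eight spaces).
            column = (column // tabsize + 1) * tabsize
        else:
            break  # first non-blank character
    return column

for sample in ["    x", "\tx", "   \tx", "        \tx"]:
    print(repr(sample), indentation(sample))
\end{verbatim}
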
\strong{Cross-platform compatibility note:} because of the nature of
text editors on non-\UNIX{} platforms, it is unwise to use a mixture of
spaces and tabs for the indentation in a single source file. It
should also be noted that different platforms may explicitly limit the
maximum indentation level.

A formfeed character may be present at the start of the line; it will
be ignored for the indentation calculations above. Formfeed
characters occurring elsewhere in the leading whitespace have an
undefined effect (for instance, they may reset the space count to
zero).

The indentation levels of consecutive lines are used to generate
INDENT and DEDENT tokens, using a stack, as follows.
\index{INDENT token}