\section{\module{htmllib} ---
         A parser for HTML documents}

\declaremodule{standard}{htmllib}
\modulesynopsis{A parser for HTML documents.}

\index{HTML}
\index{hypertext}


This module defines a class which can serve as a base for parsing text
files formatted in the HyperText Markup Language (HTML).  The class
is not directly concerned with I/O --- it must be provided with input
in string form via a method, and makes calls to methods of a
``formatter'' object in order to produce output.  The
\class{HTMLParser} class is designed to be used as a base class for
other classes in order to add functionality, and allows most of its
methods to be extended or overridden.  In turn, this class is derived
from and extends the \class{SGMLParser} class defined in module
\refmodule{sgmllib}\refstmodindex{sgmllib}.  The \class{HTMLParser}
implementation supports the HTML 2.0 language as described in
\rfc{1866}.  Two implementations of formatter objects are provided in
the \refmodule{formatter}\refstmodindex{formatter}\ module; refer to the
documentation for that module for information on the formatter
interface.
\withsubitem{(in module sgmllib)}{\ttindex{SGMLParser}}

The following is a summary of the interface defined by
\class{sgmllib.SGMLParser}:

\begin{itemize}

\item
The interface to feed data to an instance is through the \method{feed()}
method, which takes a string argument.  This can be called with as
little or as much text at a time as desired; \samp{p.feed(a);
p.feed(b)} has the same effect as \samp{p.feed(a+b)}.  When the data
contains complete HTML markup constructs, these are processed immediately;
incomplete constructs are saved in a buffer.  To force processing of all
unprocessed data, call the \method{close()} method.

For example, to parse the entire contents of a file, use:
\begin{verbatim}
parser.feed(open('myfile.html').read())
parser.close()
\end{verbatim}

\item
The interface to define semantics for HTML tags is very simple: derive
a class and define methods called \method{start_\var{tag}()},
\method{end_\var{tag}()}, or \method{do_\var{tag}()}.  The parser will
call these at appropriate moments: \method{start_\var{tag}()} or
\method{do_\var{tag}()} is called when an opening tag of the form
\code{<\var{tag} ...>} is encountered; \method{end_\var{tag}()} is called
when a closing tag of the form \code{</\var{tag}>} is encountered.  If
an opening tag requires a corresponding closing tag, like \code{<H1>}
... \code{</H1>}, the class should define the \method{start_\var{tag}()}
method; if a tag requires no closing tag, like \code{<P>}, the class
should define the \method{do_\var{tag}()} method.

\end{itemize}

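As an illustrative sketch of the handler convention summarized above
(assuming a Python 2 interpreter, where \module{htmllib} and
\refmodule{formatter} are available), a subclass might count headings
and horizontal rules.  The class name \class{HeadingCounter} and its
\member{headings} and \member{rules} attributes are hypothetical and
exist only for this example:

\begin{verbatim}
import formatter
import htmllib

class HeadingCounter(htmllib.HTMLParser):
    """Hypothetical subclass: counts <H1> elements and <HR> rules."""

    def __init__(self):
        # NullFormatter discards all output; only the counts matter here.
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
        self.headings = 0
        self.rules = 0

    def start_h1(self, attrs):
        # Called for <H1 ...>; attrs is a list of (name, value) pairs.
        self.headings = self.headings + 1
        htmllib.HTMLParser.start_h1(self, attrs)

    def do_hr(self, attrs):
        # <HR> has no closing tag, so a do_ handler is used.
        self.rules = self.rules + 1
        htmllib.HTMLParser.do_hr(self, attrs)

parser = HeadingCounter()
parser.feed('<h1>First</h1><hr>')   # data may arrive in several pieces
parser.feed('<h1>Second</h1>')
parser.close()                      # force processing of buffered data
print('%d headings, %d rules' % (parser.headings, parser.rules))
\end{verbatim}

Both overrides delegate to the base-class handler so that the default
formatter calls are preserved; a subclass that wants to replace the
default behavior entirely could omit those calls.
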
The module defines a parser class and an exception:

\begin{classdesc}{HTMLParser}{formatter}
This is the basic HTML parser class.  It supports all entity names
required by the XHTML 1.0 Recommendation (\url{http://www.w3.org/TR/xhtml1}).
It also defines handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
\end{classdesc}

\begin{excdesc}{HTMLParseError}
Exception raised by the \class{HTMLParser} class when it encounters an
error while parsing.
\versionadded{2.4}
\end{excdesc}


\begin{seealso}
  \seemodule{formatter}{Interface definition for transforming an
                        abstract flow of formatting events into
                        specific output events on writer objects.}
  \seemodule{HTMLParser}{Alternate HTML parser that offers a slightly
                         lower-level view of the input, but is
                         designed to work with XHTML, and does not
                         implement some of the SGML syntax not used in
                         ``HTML as deployed'' and which isn't legal
                         for XHTML.}
  \seemodule{htmlentitydefs}{Definition of replacement text for XHTML 1.0 
                             entities.}
  \seemodule{sgmllib}{Base class for \class{HTMLParser}.}
\end{seealso}


\subsection{HTMLParser Objects \label{html-parser-objects}}

In addition to tag methods, the \class{HTMLParser} class provides some
additional methods and instance variables for use within tag methods.

\begin{memberdesc}{formatter}
This is the formatter instance associated with the parser.
\end{memberdesc}

\begin{memberdesc}{nofill}
Boolean flag which should be true when whitespace should not be
collapsed, or false when it should be.  In general, this should only
be true when character data is to be treated as ``preformatted'' text,
as within a \code{<PRE>} element.  The default value is false.  This
affects the operation of \method{handle_data()} and \method{save_end()}.
\end{memberdesc}


\begin{methoddesc}{anchor_bgn}{href, name, type}
This method is called at the start of an anchor region.  The arguments
correspond to the attributes of the \code{<A>} tag with the same
names.  The default implementation maintains a list of hyperlinks
(defined by the \code{HREF} attribute for \code{<A>} tags) within the
document.  The list of hyperlinks is available as the data attribute
\member{anchorlist}.
\end{methoddesc}

\begin{methoddesc}{anchor_end}{}
This method is called at the end of an anchor region.  The default
implementation adds a textual footnote marker using an index into the
list of hyperlinks created by \method{anchor_bgn()}.
\end{methoddesc}

\begin{methoddesc}{handle_image}{source, alt\optional{, ismap\optional{,
                                 align\optional{, width\optional{, height}}}}}
This method is called to handle images.  The default implementation
simply passes the \var{alt} value to the \method{handle_data()}
method.
\end{methoddesc}

\begin{methoddesc}{save_bgn}{}
Begins saving character data in a buffer instead of sending it to the
formatter object.  Retrieve the stored data via \method{save_end()}.
Calls to the \method{save_bgn()} / \method{save_end()} pair may not be
nested.
\end{methoddesc}

\begin{methoddesc}{save_end}{}
Ends buffering character data and returns all data saved since the
preceding call to \method{save_bgn()}.  If the \member{nofill} flag is
false, whitespace is collapsed to single spaces.  A call to this
method without a preceding call to \method{save_bgn()} will raise a
\exception{TypeError} exception.
\end{methoddesc}



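The buffering methods and the \member{anchorlist} attribute described
above could be combined as in the following sketch, which assumes a
Python 2 interpreter with \module{htmllib} and \refmodule{formatter}
available.  The class name \class{HeadingCollector} and its
\member{headings} attribute are hypothetical, introduced only for
illustration:

\begin{verbatim}
import formatter
import htmllib

class HeadingCollector(htmllib.HTMLParser):
    """Hypothetical subclass: records the text of each <H2> element."""

    def __init__(self):
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
        self.headings = []

    def start_h2(self, attrs):
        # Divert character data into the buffer instead of the formatter.
        self.save_bgn()

    def end_h2(self):
        # save_end() returns the buffered text; because nofill is false
        # by default, runs of whitespace are collapsed to single spaces.
        self.headings.append(self.save_end())

parser = HeadingCollector()
parser.feed('<h2>Release\n   notes</h2>')
parser.feed('<p>See <a href="http://www.python.org/">the site</a>.')
parser.close()
print(parser.headings)     # heading text collected by the subclass
print(parser.anchorlist)   # hyperlinks recorded by anchor_bgn()
\end{verbatim}

Because \method{save_bgn()} and \method{save_end()} may not be nested,
this sketch assumes that \code{<H2>} elements do not contain other
elements whose handlers also buffer data.
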
\section{\module{htmlentitydefs} ---
         Definitions of HTML general entities}

\declaremodule{standard}{htmlentitydefs}
\modulesynopsis{Definitions of HTML general entities.}
\sectionauthor{Fred L. Drake, Jr.}{[email protected]}

This module defines three dictionaries, \code{name2codepoint},
\code{codepoint2name}, and \code{entitydefs}.  \code{entitydefs} is
used by the \refmodule{htmllib} module to provide the
\member{entitydefs} member of the \class{HTMLParser} class.  The
definition provided here contains all the entities defined by XHTML 1.0 
that can be handled using simple textual substitution in the Latin-1
character set (ISO-8859-1).


\begin{datadesc}{entitydefs}
  A dictionary mapping XHTML 1.0 entity definitions to their
  replacement text in ISO Latin-1.
\end{datadesc}

\begin{datadesc}{name2codepoint}
  A dictionary that maps HTML entity names to Unicode code points.
  \versionadded{2.3}
\end{datadesc}

\begin{datadesc}{codepoint2name}
  A dictionary that maps Unicode code points to HTML entity names.
  \versionadded{2.3}
\end{datadesc}
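
The two code point dictionaries can be used together, as in the
following minimal sketch.  The helper \function{entity_reference()} is
hypothetical and not part of this module, and the fallback import from
\module{html.entities} is an assumption that lets the example run on
newer interpreters where \module{htmlentitydefs} is not available under
that name:

\begin{verbatim}
try:
    # Module name used in this documentation.
    from htmlentitydefs import codepoint2name, name2codepoint
except ImportError:
    # Assumption: newer interpreters provide the same tables here.
    from html.entities import codepoint2name, name2codepoint


def entity_reference(codepoint):
    """Return a named reference if one is defined, else a numeric one."""
    name = codepoint2name.get(codepoint)
    if name is None:
        return '&#%d;' % codepoint
    return '&%s;' % name


print(name2codepoint['copy'])                    # entity name -> code point
print(entity_reference(name2codepoint['copy']))  # code point -> &copy;
print(entity_reference(0x263A))                  # no named entity defined
\end{verbatim}

Falling back to a numeric character reference is one possible design
choice for code points without an entity name; an application could
equally leave such characters unchanged.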