| 1 | \section{\module{HTMLParser} ---
|
|---|
| 2 | Simple HTML and XHTML parser}
|
|---|
| 3 |
|
|---|
| 4 | \declaremodule{standard}{HTMLParser}
|
|---|
| 5 | \modulesynopsis{A simple parser that can handle HTML and XHTML.}
|
|---|
| 6 |
|
|---|
| 7 | \versionadded{2.2}
|
|---|
| 8 |
|
|---|
| 9 | This module defines a class \class{HTMLParser} which serves as the
|
|---|
| 10 | basis for parsing text files formatted in HTML\index{HTML} (HyperText
|
|---|
| 11 | Mark-up Language) and XHTML.\index{XHTML} Unlike the parser in
|
|---|
| 12 | \refmodule{htmllib}, this parser is not based on the SGML parser in
|
|---|
| 13 | \refmodule{sgmllib}.
|
|---|
| 14 |
|
|---|
| 15 |
|
|---|
| 16 | \begin{classdesc}{HTMLParser}{}
|
|---|
| 17 | The \class{HTMLParser} class is instantiated without arguments.
|
|---|
| 18 |
|
|---|
| 19 | An HTMLParser instance is fed HTML data and calls handler functions
|
|---|
| 20 | when tags begin and end. The \class{HTMLParser} class is meant to be
|
|---|
| 21 | overridden by the user to provide a desired behavior.
|
|---|
| 22 |
|
|---|
| 23 | Unlike the parser in \refmodule{htmllib}, this parser does not check
|
|---|
| 24 | that end tags match start tags or call the end-tag handler for
|
|---|
| 25 | elements which are closed implicitly by closing an outer element.
|
|---|
| 26 | \end{classdesc}
|
|---|
| 27 |
|
|---|
| 28 | An exception is defined as well:
|
|---|
| 29 |
|
|---|
| 30 | \begin{excdesc}{HTMLParseError}
|
|---|
| 31 | Exception raised by the \class{HTMLParser} class when it encounters an
|
|---|
| 32 | error while parsing. This exception provides three attributes:
|
|---|
| 33 | \member{msg} is a brief message explaining the error, \member{lineno}
|
|---|
| 34 | is the number of the line on which the broken construct was detected,
|
|---|
| 35 | and \member{offset} is the number of characters into the line at which
|
|---|
| 36 | the construct starts.
|
|---|
| 37 | \end{excdesc}
|
|---|
| 38 |
|
|---|
| 39 |
|
|---|
| 40 | \class{HTMLParser} instances have the following methods:
|
|---|
| 41 |
|
|---|
| 42 | \begin{methoddesc}{reset}{}
|
|---|
| 43 | Reset the instance. Loses all unprocessed data. This is called
|
|---|
| 44 | implicitly at instantiation time.
|
|---|
| 45 | \end{methoddesc}
|
|---|
| 46 |
|
|---|
| 47 | \begin{methoddesc}{feed}{data}
|
|---|
| 48 | Feed some text to the parser. It is processed insofar as it consists
|
|---|
| 49 | of complete elements; incomplete data is buffered until more data is
|
|---|
| 50 | fed or \method{close()} is called.
|
|---|
| 51 | \end{methoddesc}
|
|---|
| 52 |
|
|---|
| 53 | \begin{methoddesc}{close}{}
|
|---|
| 54 | Force processing of all buffered data as if it were followed by an
|
|---|
| 55 | end-of-file mark. This method may be redefined by a derived class to
|
|---|
| 56 | define additional processing at the end of the input, but the
|
|---|
| 57 | redefined version should always call the \class{HTMLParser} base class
|
|---|
| 58 | method \method{close()}.
|
|---|
| 59 | \end{methoddesc}
|
|---|
| 60 |
|
|---|
| 61 | \begin{methoddesc}{getpos}{}
|
|---|
| 62 | Return current line number and offset.
|
|---|
| 63 | \end{methoddesc}
|
|---|
| 64 |
|
|---|
| 65 | \begin{methoddesc}{get_starttag_text}{}
|
|---|
| 66 | Return the text of the most recently opened start tag. This should
|
|---|
| 67 | not normally be needed for structured processing, but may be useful in
|
|---|
| 68 | dealing with HTML ``as deployed'' or for re-generating input with
|
|---|
| 69 | minimal changes (whitespace between attributes can be preserved,
|
|---|
| 70 | etc.).
|
|---|
| 71 | \end{methoddesc}
|
|---|
| 72 |
|
|---|
| 73 | \begin{methoddesc}{handle_starttag}{tag, attrs}
|
|---|
| 74 | This method is called to handle the start of a tag. It is intended to
|
|---|
| 75 | be overridden by a derived class; the base class implementation does
|
|---|
| 76 | nothing.
|
|---|
| 77 |
|
|---|
| 78 | The \var{tag} argument is the name of the tag converted to
|
|---|
| 79 | lower case. The \var{attrs} argument is a list of \code{(\var{name},
|
|---|
| 80 | \var{value})} pairs containing the attributes found inside the tag's
|
|---|
| 81 | \code{<>} brackets. The \var{name} will be translated to lower case
|
|---|
| 82 | and double quotes and backslashes in the \var{value} have been
|
|---|
| 83 | interpreted. For instance, for the tag \code{<A
|
|---|
| 84 | HREF="http://www.cwi.nl/">}, this method would be called as
|
|---|
| 85 | \samp{handle_starttag('a', [('href', 'http://www.cwi.nl/')])}.
|
|---|
| 86 | \end{methoddesc}
|
|---|
| 87 |
|
|---|
| 88 | \begin{methoddesc}{handle_startendtag}{tag, attrs}
|
|---|
| 89 | Similar to \method{handle_starttag()}, but called when the parser
|
|---|
| 90 | encounters an XHTML-style empty tag (\code{<a .../>}). This method
|
|---|
| 91 | may be overridden by subclasses which require this particular lexical
|
|---|
| 92 | information; the default implementation simple calls
|
|---|
| 93 | \method{handle_starttag()} and \method{handle_endtag()}.
|
|---|
| 94 | \end{methoddesc}
|
|---|
| 95 |
|
|---|
| 96 | \begin{methoddesc}{handle_endtag}{tag}
|
|---|
| 97 | This method is called to handle the end tag of an element. It is
|
|---|
| 98 | intended to be overridden by a derived class; the base class
|
|---|
| 99 | implementation does nothing. The \var{tag} argument is the name of
|
|---|
| 100 | the tag converted to lower case.
|
|---|
| 101 | \end{methoddesc}
|
|---|
| 102 |
|
|---|
| 103 | \begin{methoddesc}{handle_data}{data}
|
|---|
| 104 | This method is called to process arbitrary data. It is intended to be
|
|---|
| 105 | overridden by a derived class; the base class implementation does
|
|---|
| 106 | nothing.
|
|---|
| 107 | \end{methoddesc}
|
|---|
| 108 |
|
|---|
| 109 | \begin{methoddesc}{handle_charref}{name} This method is called to
|
|---|
| 110 | process a character reference of the form \samp{\&\#\var{ref};}. It
|
|---|
| 111 | is intended to be overridden by a derived class; the base class
|
|---|
| 112 | implementation does nothing.
|
|---|
| 113 | \end{methoddesc}
|
|---|
| 114 |
|
|---|
| 115 | \begin{methoddesc}{handle_entityref}{name}
|
|---|
| 116 | This method is called to process a general entity reference of the
|
|---|
| 117 | form \samp{\&\var{name};} where \var{name} is an general entity
|
|---|
| 118 | reference. It is intended to be overridden by a derived class; the
|
|---|
| 119 | base class implementation does nothing.
|
|---|
| 120 | \end{methoddesc}
|
|---|
| 121 |
|
|---|
| 122 | \begin{methoddesc}{handle_comment}{data}
|
|---|
| 123 | This method is called when a comment is encountered. The
|
|---|
| 124 | \var{comment} argument is a string containing the text between the
|
|---|
| 125 | \samp{--} and \samp{--} delimiters, but not the delimiters
|
|---|
| 126 | themselves. For example, the comment \samp{<!--text-->} will
|
|---|
| 127 | cause this method to be called with the argument \code{'text'}. It is
|
|---|
| 128 | intended to be overridden by a derived class; the base class
|
|---|
| 129 | implementation does nothing.
|
|---|
| 130 | \end{methoddesc}
|
|---|
| 131 |
|
|---|
| 132 | \begin{methoddesc}{handle_decl}{decl}
|
|---|
| 133 | Method called when an SGML declaration is read by the parser. The
|
|---|
| 134 | \var{decl} parameter will be the entire contents of the declaration
|
|---|
| 135 | inside the \code{<!}...\code{>} markup. It is intended to be overridden
|
|---|
| 136 | by a derived class; the base class implementation does nothing.
|
|---|
| 137 | \end{methoddesc}
|
|---|
| 138 |
|
|---|
| 139 | \begin{methoddesc}{handle_pi}{data}
|
|---|
| 140 | Method called when a processing instruction is encountered. The
|
|---|
| 141 | \var{data} parameter will contain the entire processing instruction.
|
|---|
| 142 | For example, for the processing instruction \code{<?proc color='red'>},
|
|---|
| 143 | this method would be called as \code{handle_pi("proc color='red'")}. It
|
|---|
| 144 | is intended to be overridden by a derived class; the base class
|
|---|
| 145 | implementation does nothing.
|
|---|
| 146 |
|
|---|
| 147 | \note{The \class{HTMLParser} class uses the SGML syntactic rules for
|
|---|
| 148 | processing instructions. An XHTML processing instruction using the
|
|---|
| 149 | trailing \character{?} will cause the \character{?} to be included in
|
|---|
| 150 | \var{data}.}
|
|---|
| 151 | \end{methoddesc}
|
|---|
| 152 |
|
|---|
| 153 |
|
|---|
| 154 | \subsection{Example HTML Parser Application \label{htmlparser-example}}
|
|---|
| 155 |
|
|---|
| 156 | As a basic example, below is a very basic HTML parser that uses the
|
|---|
| 157 | \class{HTMLParser} class to print out tags as they are encountered:
|
|---|
| 158 |
|
|---|
| 159 | \begin{verbatim}
|
|---|
| 160 | from HTMLParser import HTMLParser
|
|---|
| 161 |
|
|---|
| 162 | class MyHTMLParser(HTMLParser):
|
|---|
| 163 |
|
|---|
| 164 | def handle_starttag(self, tag, attrs):
|
|---|
| 165 | print "Encountered the beginning of a %s tag" % tag
|
|---|
| 166 |
|
|---|
| 167 | def handle_endtag(self, tag):
|
|---|
| 168 | print "Encountered the end of a %s tag" % tag
|
|---|
| 169 | \end{verbatim}
|
|---|