\section{\module{urlparse} ---
         Parse URLs into components}
\declaremodule{standard}{urlparse}

\modulesynopsis{Parse URLs into components.}

\index{WWW}
\index{World Wide Web}
\index{URL}
\indexii{URL}{parsing}
\indexii{relative}{URL}


This module defines a standard interface to break Uniform Resource
Locator (URL) strings into components (addressing scheme, network
location, path, etc.), to combine the components back into a URL
string, and to convert a ``relative URL'' to an absolute URL given a
``base URL.''

The module has been designed to match the Internet RFC on Relative
Uniform Resource Locators (and discovered a bug in an earlier
draft!).  It supports the following URL schemes:
\code{file}, \code{ftp}, \code{gopher}, \code{hdl}, \code{http},
\code{https}, \code{imap}, \code{mailto}, \code{mms}, \code{news},
\code{nntp}, \code{prospero}, \code{rsync}, \code{rtsp}, \code{rtspu},
\code{sftp}, \code{shttp}, \code{sip}, \code{sips}, \code{snews}, \code{svn},
\code{svn+ssh}, \code{telnet}, \code{wais}.

\versionadded[Support for the \code{sftp} and \code{sips} schemes]{2.5}

The \module{urlparse} module defines the following functions:

\begin{funcdesc}{urlparse}{urlstring\optional{,
                           default_scheme\optional{, allow_fragments}}}
Parse a URL into six components, returning a 6-tuple.  This
corresponds to the general structure of a URL:
\code{\var{scheme}://\var{netloc}/\var{path};\var{parameters}?\var{query}\#\var{fragment}}.
Each tuple item is a string, possibly empty.
The components are not broken up into smaller parts (for example, the network
location is a single string), and \% escapes are not expanded.
The delimiters shown above are not part of the result,
except for a leading slash in the \var{path} component, which is
retained if present.  For example:

\begin{verbatim}
>>> from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o
('http', 'www.cwi.nl:80', '/%7Eguido/Python.html', '', '', '')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
\end{verbatim}

If the \var{default_scheme} argument is specified, it gives the
default addressing scheme, to be used only if the URL does not
specify one.  The default value for this argument is the empty string.
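
As a minimal sketch of the default scheme in use (the host
\code{www.example.com} is a placeholder, and the fallback import from
\code{urllib.parse} is an assumption for interpreters that no longer
ship \module{urlparse}):

```python
try:
    from urlparse import urlparse        # the module documented here
except ImportError:
    from urllib.parse import urlparse    # assumption: newer home of the same function

# The URL carries no scheme, so the default scheme is filled in.
o = urlparse('//www.example.com/index.html', 'http')
print(o[0])   # http
print(o[1])   # www.example.com

# The URL names its own scheme, so the default is ignored.
p = urlparse('ftp://www.example.com/index.html', 'http')
print(p[0])   # ftp
```

The default is passed positionally here so that the sketch does not
depend on the keyword name of the second parameter.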

If the \var{allow_fragments} argument is false, fragment identifiers
are not allowed, even if the URL's addressing scheme normally does
support them.  The default value for this argument is \constant{True}.
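
The effect of \var{allow_fragments} can be sketched as follows; the
URL is a placeholder and the fallback import is an assumption for
interpreters without \module{urlparse}.  With fragments disallowed,
the \code{\#} and what follows it are expected to stay in the
preceding component:

```python
try:
    from urlparse import urlparse        # the module documented here
except ImportError:
    from urllib.parse import urlparse    # assumption: newer home of the same function

url = 'http://www.example.com/doc/index.html#intro'

# Default behaviour: the fragment is split off into item 5.
with_frag = urlparse(url)
print(with_frag[2])   # /doc/index.html
print(with_frag[5])   # intro

# Fragments disallowed: '#intro' remains part of the path.
no_frag = urlparse(url, '', False)
print(no_frag[2])     # /doc/index.html#intro
print(no_frag[5])     # (empty string)
```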

The return value is actually an instance of a subclass of
\pytype{tuple}.  This class has the following additional read-only
convenience attributes:

\begin{tableiv}{l|c|l|c}{member}{Attribute}{Index}{Value}{Value if not present}
  \lineiv{scheme}  {0} {URL scheme specifier}               {empty string}
  \lineiv{netloc}  {1} {Network location part}              {empty string}
  \lineiv{path}    {2} {Hierarchical path}                  {empty string}
  \lineiv{params}  {3} {Parameters for last path element}   {empty string}
  \lineiv{query}   {4} {Query component}                    {empty string}
  \lineiv{fragment}{5} {Fragment identifier}                {empty string}
  \lineiv{username}{ } {User name}                          {\constant{None}}
  \lineiv{password}{ } {Password}                           {\constant{None}}
  \lineiv{hostname}{ } {Host name (lower case)}             {\constant{None}}
  \lineiv{port}    { } {Port number as integer, if present} {\constant{None}}
\end{tableiv}
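
The attributes without an index are derived from the network location.
The following sketch illustrates them; the user name, password and
host are made up for the example, and the fallback import is an
assumption for interpreters without \module{urlparse}:

```python
try:
    from urlparse import urlparse        # the module documented here
except ImportError:
    from urllib.parse import urlparse    # assumption: newer home of the same function

# Hypothetical credentials and host, chosen only for illustration.
o = urlparse('http://guido:secret@WWW.Example.COM:8080/doc/')
print(o.netloc)     # guido:secret@WWW.Example.COM:8080
print(o.username)   # guido
print(o.password)   # secret
print(o.hostname)   # www.example.com  (lower case)
print(o.port)       # 8080             (an integer)

# When the parts are absent, the attributes are None.
q = urlparse('http://www.example.com/doc/')
print(q.username)   # None
print(q.port)       # None
```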

See section~\ref{urlparse-result-object}, ``Results of
\function{urlparse()} and \function{urlsplit()},'' for more
information on the result object.

\versionchanged[Added attributes to return value]{2.5}
\end{funcdesc}

\begin{funcdesc}{urlunparse}{parts}
Construct a URL from a tuple as returned by \code{urlparse()}.
The \var{parts} argument can be any six-item iterable.
This may result in a slightly different, but equivalent URL, if the
URL that was parsed originally had unnecessary delimiters (for example,
a ? with an empty query; the RFC states that these are equivalent).
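
One common use is to change a single component and reassemble the URL.
This is only a sketch: the URL is a placeholder, and the fallback
import is an assumption for interpreters without \module{urlparse}.

```python
try:
    from urlparse import urlparse, urlunparse        # the module documented here
except ImportError:
    from urllib.parse import urlparse, urlunparse    # assumption: newer home

# Copy the six components into a list so one of them can be replaced.
parts = list(urlparse('http://www.example.com/search;v=1?page=1#results'))
print(parts)   # ['http', 'www.example.com', '/search', 'v=1', 'page=1', 'results']

parts[4] = 'page=2'            # index 4 is the query component
new_url = urlunparse(parts)    # any six-item iterable is accepted
print(new_url)                 # http://www.example.com/search;v=1?page=2#results
```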
\end{funcdesc}

\begin{funcdesc}{urlsplit}{urlstring\optional{,
                           default_scheme\optional{, allow_fragments}}}
This is similar to \function{urlparse()}, but does not split the
params from the URL.  Use it instead of \function{urlparse()} when
the more recent URL syntax, which allows parameters to be applied to
each segment of the \var{path} portion of the URL (see \rfc{2396}),
is wanted.  Separating the path segments from their parameters then
requires a separate function.  This function returns a
5-tuple: (addressing scheme, network location, path, query, fragment
identifier).
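
The difference from \function{urlparse()} can be sketched with a URL
that carries parameters on more than one path segment.  The URL is a
placeholder, the fallback import is an assumption for interpreters
without \module{urlparse}, and the final list comprehension is a
hypothetical stand-in for the ``separate function'' mentioned above,
not something this module provides:

```python
try:
    from urlparse import urlparse, urlsplit        # the module documented here
except ImportError:
    from urllib.parse import urlparse, urlsplit    # assumption: newer home

url = 'http://www.example.com/a;x=1/b;y=2?q=1#top'

# urlsplit() leaves every parameter inside the path.
s = urlsplit(url)
print(tuple(s))   # ('http', 'www.example.com', '/a;x=1/b;y=2', 'q=1', 'top')

# urlparse() splits off only the parameters of the last segment.
p = urlparse(url)
print(p[2])       # /a;x=1/b
print(p[3])       # y=2

# Hypothetical helper logic: pair each path segment with its parameters.
segments = [tuple(seg.split(';', 1)) for seg in s[2].split('/') if seg]
print(segments)   # [('a', 'x=1'), ('b', 'y=2')]
```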

The return value is actually an instance of a subclass of
\pytype{tuple}.  This class has the following additional read-only
convenience attributes:

\begin{tableiv}{l|c|l|c}{member}{Attribute}{Index}{Value}{Value if not present}
  \lineiv{scheme}   {0} {URL scheme specifier}               {empty string}
  \lineiv{netloc}   {1} {Network location part}              {empty string}
  \lineiv{path}     {2} {Hierarchical path}                  {empty string}
  \lineiv{query}    {3} {Query component}                    {empty string}
  \lineiv{fragment} {4} {Fragment identifier}                {empty string}
  \lineiv{username} { } {User name}                          {\constant{None}}
  \lineiv{password} { } {Password}                           {\constant{None}}
  \lineiv{hostname} { } {Host name (lower case)}             {\constant{None}}
  \lineiv{port}     { } {Port number as integer, if present} {\constant{None}}
\end{tableiv}

See section~\ref{urlparse-result-object}, ``Results of
\function{urlparse()} and \function{urlsplit()},'' for more
information on the result object.

\versionadded{2.2}
\versionchanged[Added attributes to return value]{2.5}
\end{funcdesc}

\begin{funcdesc}{urlunsplit}{parts}
Combine the elements of a tuple as returned by \function{urlsplit()}
into a complete URL as a string.
The \var{parts} argument can be any five-item iterable.
This may result in a slightly different, but equivalent URL, if the
URL that was parsed originally had unnecessary delimiters (for example,
a ? with an empty query; the RFC states that these are equivalent).
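
The dropping of unnecessary delimiters can be sketched with a URL that
ends in an empty query and an empty fragment.  The URL is a
placeholder, and the fallback import is an assumption for interpreters
without \module{urlparse}.

```python
try:
    from urlparse import urlsplit, urlunsplit        # the module documented here
except ImportError:
    from urllib.parse import urlsplit, urlunsplit    # assumption: newer home

# Both '?' and '#' are present but introduce empty components.
r = urlsplit('http://www.example.com/doc/?#')
print(tuple(r))         # ('http', 'www.example.com', '/doc/', '', '')

# Recombining omits the delimiters of the empty components.
rebuilt = urlunsplit(r)
print(rebuilt)          # http://www.example.com/doc/

# Any five-item iterable works, for example a plain list.
built = urlunsplit(['http', 'www.example.com', '/doc/', 'lang=en', ''])
print(built)            # http://www.example.com/doc/?lang=en
```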
\versionadded{2.2}
\end{funcdesc}

\begin{funcdesc}{urljoin}{base, url\optional{, allow_fragments}}
Construct a full (``absolute'') URL by combining a ``base URL''
(\var{base}) with a ``relative URL'' (\var{url}).  Informally, this
uses components of the base URL, in particular the addressing scheme,
the network location and (part of) the path, to provide missing
components in the relative URL.  For example:

\begin{verbatim}
>>> from urlparse import urljoin
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
'http://www.cwi.nl/%7Eguido/FAQ.html'
\end{verbatim}

The \var{allow_fragments} argument has the same meaning and default as
for \function{urlparse()}.
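
A few further cases, sketched here with the same base URL, show which
components of the base are kept depending on how much the relative URL
supplies.  The relative URLs are chosen only for illustration, and the
fallback import is an assumption for interpreters without
\module{urlparse}.

```python
try:
    from urlparse import urljoin        # the module documented here
except ImportError:
    from urllib.parse import urljoin    # assumption: newer home of the same function

base = 'http://www.cwi.nl/%7Eguido/Python.html'

# A path starting with '/' replaces the whole base path.
rooted = urljoin(base, '/index.html')
print(rooted)      # http://www.cwi.nl/index.html

# '..' climbs one directory above the base document's directory.
parent = urljoin(base, '../images/logo.png')
print(parent)      # http://www.cwi.nl/images/logo.png

# A network location without a scheme keeps only the base scheme.
other_host = urljoin(base, '//www.python.org/doc/')
print(other_host)  # http://www.python.org/doc/

# A URL that is already absolute is returned as given.
absolute = urljoin(base, 'http://www.python.org/')
print(absolute)    # http://www.python.org/
```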
\end{funcdesc}

\begin{funcdesc}{urldefrag}{url}
If \var{url} contains a fragment identifier, return a modified
version of \var{url} with no fragment identifier, and the fragment
identifier as a separate string.  If there is no fragment identifier
in \var{url}, return \var{url} unmodified and an empty string.
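
Both cases can be sketched as follows; the fragment name \code{intro}
is made up for the example, and the fallback import is an assumption
for interpreters without \module{urlparse}.

```python
try:
    from urlparse import urldefrag        # the module documented here
except ImportError:
    from urllib.parse import urldefrag    # assumption: newer home of the same function

# With a fragment: the two parts come back separately.
url, frag = urldefrag('http://www.python.org/doc/#intro')
print(url)     # http://www.python.org/doc/
print(frag)    # intro

# Without a fragment: the URL is unchanged and the fragment is empty.
url2, frag2 = urldefrag('http://www.python.org/doc/')
print(url2)    # http://www.python.org/doc/
print(frag2)   # (empty string)
```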
\end{funcdesc}


\begin{seealso}
  \seerfc{1738}{Uniform Resource Locators (URL)}{
        This specifies the formal syntax and semantics of absolute
        URLs.}
  \seerfc{1808}{Relative Uniform Resource Locators}{
        This Request For Comments includes the rules for joining an
        absolute and a relative URL, including a fair number of
        ``Abnormal Examples'' which govern the treatment of border
        cases.}
  \seerfc{2396}{Uniform Resource Identifiers (URI): Generic Syntax}{
        Document describing the generic syntactic requirements for
        both Uniform Resource Names (URNs) and Uniform Resource
        Locators (URLs).}
\end{seealso}


\subsection{Results of \function{urlparse()} and \function{urlsplit()}
            \label{urlparse-result-object}}

The result objects from the \function{urlparse()} and
\function{urlsplit()} functions are subclasses of the \pytype{tuple}
type.  These subclasses add the attributes described for those
functions, and provide an additional method:

\begin{methoddesc}[ParseResult]{geturl}{}
  Return the re-combined version of the original URL as a string.
  This may differ from the original URL in that the scheme will always
  be normalized to lower case and empty components may be dropped.
  Specifically, empty parameters, queries, and fragment identifiers
  will be removed.

  The result of this method is a fixpoint if passed back through the
  original parsing function:

\begin{verbatim}
>>> import urlparse
>>> url = 'HTTP://www.Python.org/doc/#'

>>> r1 = urlparse.urlsplit(url)
>>> r1.geturl()
'http://www.Python.org/doc/'

>>> r2 = urlparse.urlsplit(r1.geturl())
>>> r2.geturl()
'http://www.Python.org/doc/'
\end{verbatim}

\versionadded{2.5}
\end{methoddesc}

The following classes provide the implementations of the parse results:

\begin{classdesc*}{BaseResult}
  Base class for the concrete result classes.  This provides most of
  the attribute definitions.  It does not provide a \method{geturl()}
  method.  It is derived from \class{tuple}, but does not override the
  \method{__init__()} or \method{__new__()} methods.
\end{classdesc*}


\begin{classdesc}{ParseResult}{scheme, netloc, path, params, query, fragment}
  Concrete class for \function{urlparse()} results.  The
  \method{__new__()} method is overridden to support checking that the
  right number of arguments is passed.
\end{classdesc}


\begin{classdesc}{SplitResult}{scheme, netloc, path, query, fragment}
  Concrete class for \function{urlsplit()} results.  The
  \method{__new__()} method is overridden to support checking that the
  right number of arguments is passed.
\end{classdesc}
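
The concrete classes can also be instantiated directly from their
components, which may be convenient when a URL is assembled piece by
piece.  The following is only a sketch: the component values are
placeholders, and the fallback import is an assumption for
interpreters without \module{urlparse}.

```python
try:
    from urlparse import ParseResult, SplitResult        # the module documented here
except ImportError:
    from urllib.parse import ParseResult, SplitResult    # assumption: newer home

# Six components for ParseResult, in the documented order.
r = ParseResult('http', 'www.python.org', '/doc/', '', '', '')
print(isinstance(r, tuple))   # True
print(r.hostname)             # www.python.org
print(r.geturl())             # http://www.python.org/doc/

# Five components for SplitResult (no params item).
s = SplitResult('http', 'www.python.org', '/doc/', 'q=1', '')
print(s.geturl())             # http://www.python.org/doc/?q=1

# Passing too few components is expected to be rejected with TypeError.
rejected = False
try:
    SplitResult('http', 'www.python.org')
except TypeError:
    rejected = True
print(rejected)
```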