\section{\module{robotparser} ---
         Parser for robots.txt}

\declaremodule{standard}{robotparser}
\modulesynopsis{Loads a \protect\file{robots.txt} file and
                answers questions about fetchability of other URLs.}
\sectionauthor{Skip Montanaro}{[email protected]}

\index{WWW}
\index{World Wide Web}
\index{URL}
\index{robots.txt}

This module provides a single class, \class{RobotFileParser}, which answers
questions about whether or not a particular user agent can fetch a URL on
the Web site that published the \file{robots.txt} file.  For more details on
the structure of \file{robots.txt} files, see
\url{http://www.robotstxt.org/wc/norobots.html}.

\begin{classdesc}{RobotFileParser}{}

This class provides a set of methods to read, parse and answer questions
about a single \file{robots.txt} file.

\begin{methoddesc}{set_url}{url}
Sets the URL referring to a \file{robots.txt} file.
\end{methoddesc}

\begin{methoddesc}{read}{}
Reads the \file{robots.txt} URL and feeds it to the parser.
\end{methoddesc}

\begin{methoddesc}{parse}{lines}
Parses the \var{lines} argument, a list of lines from a \file{robots.txt}
file.
\end{methoddesc}

\begin{methoddesc}{can_fetch}{useragent, url}
Returns \code{True} if the \var{useragent} is allowed to fetch the \var{url}
according to the rules contained in the parsed \file{robots.txt} file.
\end{methoddesc}

\begin{methoddesc}{mtime}{}
Returns the time the \file{robots.txt} file was last fetched.  This is
useful for long-running web spiders that need to check for new
\file{robots.txt} files periodically.
\end{methoddesc}

\begin{methoddesc}{modified}{}
Sets the time the \file{robots.txt} file was last fetched to the current
time.
\end{methoddesc}

\end{classdesc}

The following example demonstrates basic use of the \class{RobotFileParser}
class.

\begin{verbatim}
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True
\end{verbatim}
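
The example above does not exercise \method{parse()}, \method{mtime()} or
\method{modified()}.  The following sketch shows one way a long-running
spider might feed the file to the parser itself and decide when to re-fetch
it; the use of \module{urllib} to retrieve the file and the one-day refresh
interval are illustrative choices, not requirements of the module.

\begin{verbatim}
>>> import robotparser, time, urllib
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.parse(urllib.urlopen("http://www.musi-cal.com/robots.txt").readlines())
>>> rp.modified()      # record the fetch time so mtime() is meaningful
>>> if time.time() - rp.mtime() > 24 * 60 * 60:
...     rp.read()      # rules are more than a day old; fetch them again
...
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True
\end{verbatim}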