An extensible system for using urllib like a state machine. Functions behave much as they would in a browser, with features such as a back() method and caching. A plugin system allows a prototype-based, multiple-inheritance-like system at runtime; it's like a highly structured form of monkey-patching.
Creates a new Browser object, loaded with the specified set of plugins and using the specified default parser. Both values can be changed after instantiation (though plugins can only be added, never removed).
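The constructor contract above can be sketched as follows. This is a minimal illustration, not the real implementation; the class, attribute, and method names here are assumptions.

```python
class Browser:
    """Minimal sketch of the constructor contract described above
    (class, attribute, and method names are assumptions, not the real API)."""

    def __init__(self, plugins=(), default_parser=None):
        self._plugins = list(plugins)          # plugins may be added later...
        self.default_parser = default_parser   # ...and the parser swapped freely

    def add_plugin(self, plugin):
        # Plugins can only ever be added, never removed.
        self._plugins.append(plugin)
```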
Requests, loads, and parses a web page using the internal urllib-based opener. Beyond the first url argument, it is recommended (but not required) that you use keyword arguments: some poorly written plugins that override this function may not handle a large number of positional arguments gracefully, and their order is subject to change.
Keyword arguments:
If None, maps to the default parser specified upon construction. (If one was not specified, parsers.passthrough() is used, which does nothing to the page data and simply returns the page source.) A parser should follow the format:
parser(source, url)
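A minimal function conforming to this signature might look like the following; the behavior mirrors the passthrough parser described above, though the exact name and return convention are assumptions.

```python
def passthrough(source, url):
    # A parser receives the raw page source and the page's URL and returns
    # whatever object should represent the parsed page; this passthrough
    # variant returns the source unchanged.
    return source
```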
Takes form values, encodes them, and sends them back to the server, eventually loading a new page as a result. It satisfies the open_http argument of lxml.html.submit_form(). Many of the keyword arguments map directly to load_page().
Keyword arguments:
If None, maps to the default parser specified upon construction. (If one was not specified, parsers.passthrough() is used, which does nothing to the page data and simply returns the page source.) A parser should follow the format:
parser(source, url)
Reloads the previous page (if there is one) and returns it. Additional arguments (positional and keyword) will be passed through to load_page().
Reloads the next page (if there is one) and returns it.
A list containing information on previously visited web pages, as (url, post_data) tuples. HTTP POST data is stored explicitly, while HTTP GET data is carried within the URL itself.
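The shape of that list can be illustrated with hypothetical entries (the URLs and data below are invented for illustration):

```python
# Hypothetical contents of the history list described above:
history = [
    ("http://example.com/search?q=cats", None),            # GET: data lives in the URL
    ("http://example.com/login", b"user=me&token=abc123"), # POST: data stored explicitly
]

url, post_data = history[-1]  # the most recently visited page
```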
Reloads the current web page, and returns it.
Loads a page at a position in history relative to the current page. For example, going back one page could be done with:
self._load_relative(-1, *args, **kwargs)
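The index arithmetic behind this can be sketched with a standalone history cursor; the class and method names here are hypothetical, not the library's own.

```python
class History:
    """Minimal sketch of relative history navigation (names hypothetical)."""

    def __init__(self, pages):
        self._pages = list(pages)
        self._index = len(self._pages) - 1  # start at the most recent page

    def load_relative(self, offset):
        # back() would use -1, a forward/next() would use +1, reload() 0.
        target = self._index + offset
        if not 0 <= target < len(self._pages):
            raise IndexError("no page at that history offset")
        self._index = target
        return self._pages[target]
```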
This method should not be confused with the similarly named expand_relative_url(), which instead turns relative URLs into absolute ones.
If passed a relative URL, finds its absolute URL in relation to the current page's URL.
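This kind of expansion is what the standard library's urllib.parse.urljoin() performs, which illustrates the expected behavior (whether the method delegates to urljoin internally is an assumption):

```python
from urllib.parse import urljoin

# Resolving relative URLs against a current page's URL:
current = "http://example.com/a/page.html"
print(urljoin(current, "other.html"))   # sibling of the current page
print(urljoin(current, "/root.html"))   # absolute path on the same host
print(urljoin(current, "../up.html"))   # one directory up
```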
Takes a page and parses it with a given parser, or with the default parser if the given parser is None. There are two ways to call this method:
- self._parse_page(parser, source, headers, url)
- self._parse_page(parser, response)
Where source is the byte string obtained from response.read(), headers is the result of calling response.info(), and response is the value returned by urllib.request.urlopen().
Removes #fragments from URLs, removes a trailing / from the path if there is one (without destroying ?parameters), and discards any unnecessary elements, such as an unused ?. This makes URLs easier to compare, which is useful for caching, among other things.
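The described normalization can be approximated with urllib.parse; this is a sketch of the stated rules, not the library's actual implementation.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    # Sketch of the normalization described above: drop the #fragment,
    # strip a single trailing slash from the path (keeping ?parameters),
    # and let urlunsplit discard an unused, empty ?query.
    scheme, netloc, path, query, _fragment = urlsplit(url)
    if path.endswith("/") and path != "/":
        path = path[:-1]
    return urlunsplit((scheme, netloc, path, query, ""))
```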
Returns a new Browser object with the set of recommended plugins.