1. Web-Browser Services

1.1. browser – Web-Browser-like Extensible State Machine

class lib.browser.Browser(default_parser, *plugins)

An extensible system for using urllib like a state machine. Methods behave much as they would in a browser, providing features such as a back() method and caching. A plugin system allows a prototypical, multiple-inheritance-like system at runtime; it is effectively a highly structured form of monkey-patching.

__init__(default_parser, *plugins)

Creates a new Browser object, loaded with the specified set of plugins and using the specified default parser. Both values can be changed after instantiation (however, plugins can only be added, never removed).

load_page(url, parser=None, data=None, record_history=True)

Requests, loads, and parses a web page using the internal urllib-based opener. It is recommended, but not required, that beyond the first url argument you use keyword arguments: poorly written plugins overriding this function may not handle the large number of positional arguments correctly, and their order is subject to change.

Keyword arguments:

url
The http:// or https:// page to load. It can alternatively be a relative address, which is then treated as relative to the current page.
parser

If None, maps to the default parser specified upon construction. (If one was not specified, parsers.passthrough() is used, which does nothing to the page data and simply returns the page source.) A parser should follow the format:

parser(source, url)
data
Maps to the data parameter of urllib.request.urlopen(). This should contain pre-encoded HTTP POST data. GET data should be encoded into the page url.
record_history
If False, the page will not be entered into the internal list of pages loaded, so the only record of the load will be the cache (and possibly side effects such as cookies, if cookies.CookieBrowserPlugin is enabled). This is used internally by the back(), forward() and refresh() functions.
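As an illustration, a minimal parser following the parser(source, url) format described above might look like the following. The parser's name and behaviour are hypothetical, not part of the library:

```python
import re

def title_parser(source, url):
    """Example parser matching the parser(source, url) convention.

    Decodes the raw byte-string and returns the page title, or the
    decoded text if no <title> element is present. Purely illustrative.
    """
    text = source.decode("utf-8", errors="replace")
    match = re.search(r"<title>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else text

# Could then be used as the default parser, or per page load:
#   browser.load_page("http://example.com/", parser=title_parser)
```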
submit(method, url, values, *args, **kwargs)

Takes values and encodes them to send back to the server, eventually loading a new page as a result. It satisfies the open_http argument of lxml.html.submit_form(). Many of the keyword arguments map directly to those of load_page().

Keyword arguments:

method
A string consisting of "GET" or "POST".
url
The http:// or https:// page to load.
values
Either a dictionary or a list of two-item tuples containing the data to encode.
parser

If None, maps to the default parser specified upon construction. (If one was not specified, parsers.passthrough() is used, which does nothing to the page data and simply returns the page source.) A parser should follow the format:

parser(source, url)
record_history
If False, the page will not be entered into the internal list of pages loaded, so the only record of the load will be the cache (and possibly side effects such as cookies, if cookies.CookieBrowserPlugin is enabled). This is used internally by the back(), forward() and refresh() functions.
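The encoding step submit() performs can be sketched with the standard library: GET data is appended to the url, while POST data is pre-encoded into bytes suitable for load_page()'s data parameter. This is a sketch of the common urlencode behaviour, not the library's exact internals:

```python
from urllib.parse import urlencode

def encode_submission(method, url, values):
    """Sketch of how form values might be routed for GET vs. POST.

    Returns a (url, data) pair usable with a load_page()-style call:
    GET data lives in the url, POST data becomes pre-encoded body bytes.
    Illustrative only; not lib.browser's actual implementation.
    """
    query = urlencode(values)          # accepts a dict or list of 2-tuples
    if method.upper() == "GET":
        return url + "?" + query, None
    return url, query.encode("ascii")  # POST: body bytes, url untouched
```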
back(*args, **kwargs)

Reloads the previous page (if there is one) and returns it. Additional arguments (positional and keyword) will be passed through to load_page().

forward(*args, **kwargs)

Reloads the next page (if there is one) and returns it.

history

A list containing information on previously visited web pages, in the form of tuples, (url, post_data). While HTTP POST data is included explicitly, HTTP GET data is included within the URL.
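The bookkeeping behind history, back() and forward() can be modelled as a list of (url, post_data) tuples plus a current index. This is one plausible implementation, sketched for illustration; the library's actual internals may differ:

```python
class HistorySketch:
    """Illustrative model of (url, post_data) history with back/forward."""

    def __init__(self):
        self.history = []   # list of (url, post_data) tuples
        self.index = -1     # position of the current page

    def visit(self, url, post_data=None):
        # A new visit discards any "forward" entries past the current page.
        del self.history[self.index + 1:]
        self.history.append((url, post_data))
        self.index += 1

    def back(self):
        if self.index > 0:
            self.index -= 1
        return self.history[self.index]

    def forward(self):
        if self.index < len(self.history) - 1:
            self.index += 1
        return self.history[self.index]
```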

current_url

The url of the currently loaded page.
refresh(*args, **kwargs)

Reloads the current web page, and returns it.

_load_relative(relative_index, *args, **kwargs)

Loads a page relative in history to the current page. For example, going back one page could be done with:

self._load_relative(-1, *args, **kwargs)

This method is not to be confused with the similarly named expand_relative_url(), which instead works to turn relative urls into absolute ones.

expand_relative_url(url, relative_to=None)

If passed a relative url, returns its absolute url in relation to the current page’s url.
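The expansion expand_relative_url() describes is essentially what the standard library's urllib.parse.urljoin() provides. A sketch of the behaviour (not the method's actual implementation):

```python
from urllib.parse import urljoin

# Resolving relative urls against a current page's url, as
# expand_relative_url() describes. urljoin() follows RFC 3986 resolution.
current = "http://example.com/articles/index.html"

print(urljoin(current, "page2.html"))  # http://example.com/articles/page2.html
print(urljoin(current, "/about"))      # http://example.com/about
```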

_parse_page(parser, *args, **kwargs)

Takes a page and parses it with a given parser, or with the default parser if the given parser is None. There are two ways to call this method:

  • self._parse_page(parser, source, headers, url)
  • self._parse_page(parser, response)

Where source is the byte-string obtained from response.read(), headers is the result of calling response.info(), and response is the result of calling urllib.request.urlopen().

_simplify_url(url)

Removes #fragments from urls, removes a trailing / from the path if there is one (without destroying ?parameters), and removes unnecessary elements such as an unused ?. This makes urls easier to compare, for caching and for other purposes.
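The kind of normalisation _simplify_url() describes can be sketched with urllib.parse; the exact rules the library applies are assumptions here:

```python
from urllib.parse import urlsplit, urlunsplit

def simplify_url_sketch(url):
    """Drop the #fragment, a trailing / on the path, and an unused ?.

    Illustrative sketch, not lib.browser's actual implementation.
    """
    scheme, netloc, path, query, _fragment = urlsplit(url)
    if path.endswith("/") and path != "/":
        path = path.rstrip("/")
    # urlunsplit() omits the "?" when the query is empty, and the
    # fragment is discarded entirely by passing "" in its place.
    return urlunsplit((scheme, netloc, path, query, ""))
```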

lib.browser.get_new_uf_browser()

Returns a new Browser object with the set of recommended plugins.