The mechanicalsoup package: API documentation
StatefulBrowser
-
class mechanicalsoup.StatefulBrowser(*args, **kwargs)
Bases: mechanicalsoup.browser.Browser
An extension of Browser that stores the browser’s state and provides many convenient functions for interacting with HTML elements. It is the primary tool in MechanicalSoup for interfacing with websites.
Parameters:
- session – Attach a pre-existing requests Session instead of constructing a new one.
- soup_config – Configuration passed to BeautifulSoup to affect the way HTML is parsed. Defaults to {'features': 'lxml'}. If overridden, it is highly recommended to specify a parser. Otherwise, BeautifulSoup will issue a warning and pick one for you, but the parser it chooses may be different on different machines.
- requests_adapters – Configuration passed to requests, to affect the way HTTP requests are performed.
- raise_on_404 – If True, raise LinkNotFoundError when visiting a page triggers a 404 Not Found error.
- user_agent – Set the user agent header to this value.
All arguments are forwarded to Browser().
Examples
    import mechanicalsoup

    browser = mechanicalsoup.StatefulBrowser(
        soup_config={'features': 'lxml'},  # Use the lxml HTML parser
        raise_on_404=True,
        user_agent='MyBot/0.1: mysite.example.com/bot_info',
    )
    browser.open(url)
    # ...
    browser.close()
Once it is no longer needed, the browser can be closed using close().
-
__setitem__(name, value)
Call item assignment on the currently selected form. See Form.__setitem__().
-
absolute_url(url)
Return the absolute URL made from the current URL and url. The current URL is only used to provide any missing components of url, as in the .urljoin() method of urllib.parse.
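Since absolute_url() is described as filling in missing components the way urljoin() does, the standard-library function gives a feel for the results to expect. The following is a minimal sketch using urljoin() directly; the URLs are hypothetical and stand in for the browser's current page.

```python
from urllib.parse import urljoin

# Hypothetical URL standing in for the browser's current page.
current_url = "https://example.com/docs/index.html"

# A relative path replaces the last path segment of the current URL.
relative = urljoin(current_url, "api.html")

# A root-relative path keeps only the scheme and host of the current URL.
rooted = urljoin(current_url, "/about")

# A complete URL needs nothing from the current URL and comes back unchanged.
complete = urljoin(current_url, "https://other.example.org/page")

print(relative)
print(rooted)
print(complete)
```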
-
download_link(link=None, file=None, *bs4_args, bs4_kwargs={}, requests_kwargs={}, **kwargs)
Downloads the contents of a link to a file. This function behaves similarly to follow_link(), but the browser state will not change when calling this function.
Parameters: file – Filesystem path where the page contents will be downloaded. If the file already exists, it will be overwritten.
Other arguments are the same as for follow_link(): link can be either a bs4.element.Tag or a URL regex, and bs4_kwargs are forwarded to find_link(), as are any excess keyword arguments (aka **kwargs) for backwards compatibility.
Returns: requests.Response object.
-
find_link(*args, **kwargs)
Find and return a link, as a bs4.element.Tag object.
The search can be refined by specifying any argument that is accepted by links(). If several links match, return the first one found.
If no link is found, raise LinkNotFoundError.
-
follow_link(link=None, *bs4_args, bs4_kwargs={}, requests_kwargs={}, **kwargs)
Follow a link.
If link is a bs4.element.Tag (i.e. from a previous call to links() or find_link()), then follow the link.
If link doesn’t have an href-attribute or is None, treat link as a url_regex and look it up with find_link(). bs4_kwargs are forwarded to find_link(). For backward compatibility, any excess keyword arguments (aka **kwargs) are also forwarded to find_link().
If the link is not found, raise LinkNotFoundError. Before raising, if debug is activated, list available links in the page and launch a browser.
requests_kwargs are forwarded to open_relative().
Returns: Forwarded from open_relative().
-
form
Get the currently selected form as a Form object. See select_form().
-
get_debug()
Get the debug mode (off by default).
-
get_verbose()
Get the verbosity level. See set_verbose().
-
launch_browser(soup=None)
Launch a browser to display a page, for debugging purposes.
Parameters: soup – Page contents to display, supplied as a bs4 soup object. Defaults to the current page of the StatefulBrowser instance.
-
links(url_regex=None, link_text=None, *args, **kwargs)
Return links in the page, as a list of bs4.element.Tag objects.
To return links matching specific criteria, specify url_regex to match the href-attribute, or link_text to match the text-attribute of the Tag. All other arguments are forwarded to the .find_all() method in BeautifulSoup.
-
list_links(*args, **kwargs)
Display the list of links in the current page. Arguments are forwarded to links().
-
new_control(type, name, value, **kwargs)
Call Form.new_control() on the currently selected form.
-
open(url, *args, **kwargs)
Open the URL and store the Browser’s state in this object. All arguments are forwarded to Browser.get().
Returns: Forwarded from Browser.get().
-
open_fake_page(page_text, url=None, soup_config=None)
Mock version of open().
Behave as if opening a page whose text is page_text, but do not perform any network access. If url is set, pretend it is the page’s URL. Useful mainly for testing.
-
open_relative(url, *args, **kwargs)
Like open(), but url can be relative to the currently visited page.
-
page
Get the current page as a soup object.
-
refresh()
Reload the current page with the same request as originally done. Any change (select_form, or any value filled-in in the form) made to the current page before refresh is discarded.
Raises: ValueError – Raised if no refreshable page is loaded, e.g., when using the shallow Browser wrapper functions.
Returns: Response of the request.
-
select_form(selector='form', nr=0)
Select a form in the current page.
Parameters:
- selector – CSS selector or a bs4.element.Tag object to identify the form to select. If not specified, selector defaults to “form”, which is useful if, e.g., there is only one form on the page. For selector syntax, see the .select() method in BeautifulSoup.
- nr – A zero-based index specifying which form among those that match selector will be selected. Useful when one or more forms have the same attributes as the form you want to select, and its position on the page is the only way to uniquely identify it. Default is the first matching form (nr=0).
Returns: The selected form as a soup object. It can also be retrieved later with the form attribute.
-
set_debug(debug)
Set the debug mode (off by default).
Set to True to enable debug mode. When active, some actions will launch a browser on the current page on failure to let you inspect the page content.
-
set_verbose(verbose)
Set the verbosity level (an integer).
- 0 means no verbose output.
- 1 shows one dot per visited page (looks like a progress bar).
- >= 2 shows each visited URL.
-
submit_selected(btnName=None, update_state=True, **kwargs)
Submit the form that was selected with select_form().
Parameters:
- btnName – Passed to Form.choose_submit() to choose the element of the current form to use for submission. If None, will choose the first valid submit element in the form, if one exists. If False, will not use any submit element; this is useful for simulating AJAX requests, for example.
- update_state – If False, the form will be submitted but the browser state will remain unchanged; this is useful for forms that result in a download of a file, for example.
All other arguments are forwarded to Browser.submit().
Returns: Forwarded from Browser.submit().
-
url
Get the URL of the currently visited page.
Browser
-
class mechanicalsoup.Browser(session=None, soup_config={'features': 'lxml'}, requests_adapters=None, raise_on_404=False, user_agent=None)
Builds a low-level Browser.
It is recommended to use StatefulBrowser for most applications, since it offers more advanced features and conveniences than Browser.
Parameters:
- session – Attach a pre-existing requests Session instead of constructing a new one.
- soup_config – Configuration passed to BeautifulSoup to affect the way HTML is parsed. Defaults to {'features': 'lxml'}. If overridden, it is highly recommended to specify a parser. Otherwise, BeautifulSoup will issue a warning and pick one for you, but the parser it chooses may be different on different machines.
- requests_adapters – Configuration passed to requests, to affect the way HTTP requests are performed.
- raise_on_404 – If True, raise LinkNotFoundError when visiting a page triggers a 404 Not Found error.
- user_agent – Set the user agent header to this value.
-
static add_soup(response, soup_config)
Attaches a soup object to a requests response.
-
close()
Close the current session, if still open.
-
get(*args, **kwargs)
Straightforward wrapper around requests.Session.get.
Returns: requests.Response object with a soup-attribute added by add_soup().
-
get_cookiejar()
Gets the cookiejar from the requests session.
-
classmethod get_request_kwargs(form, url=None, **kwargs)
Extract input data from the form.
-
launch_browser(soup)
Launch a browser to display a page, for debugging purposes.
Parameters: soup – Page contents to display, supplied as a bs4 soup object.
-
post(*args, **kwargs)
Straightforward wrapper around requests.Session.post.
Returns: requests.Response object with a soup-attribute added by add_soup().
-
put(*args, **kwargs)
Straightforward wrapper around requests.Session.put.
Returns: requests.Response object with a soup-attribute added by add_soup().
-
request(*args, **kwargs)
Straightforward wrapper around requests.Session.request.
Returns: requests.Response object with a soup-attribute added by add_soup().
This is a low-level function that should not be called for basic usage (use get() or post() instead). Use it if you need an HTTP verb that MechanicalSoup doesn’t manage (e.g. MKCOL).
-
set_cookiejar(cookiejar)
Replaces the current cookiejar in the requests session. Since the session handles cookies automatically without calling this function, only use this when default cookie handling is insufficient.
Parameters: cookiejar – Any http.cookiejar.CookieJar compatible object.
-
set_user_agent(user_agent)
Replaces the current user agent in the requests session headers.
-
submit(form, url=None, **kwargs)
Prepares and sends a form request.
NOTE: To submit a form with a StatefulBrowser instance, it is recommended to use StatefulBrowser.submit_selected() instead of this method so that the browser state is correctly updated.
Parameters:
- form – The filled-out form.
- url – URL of the page the form is on. If the form action is a relative path, then this must be specified.
- **kwargs – Arguments forwarded to requests.Session.request. If files, params (with GET), or data (with POST) are specified, the contents of form will be appended to them.
Returns: requests.Response object with a soup-attribute added by add_soup().
Form
-
class mechanicalsoup.Form(form)
Build a fillable form.
Parameters: form – A bs4.element.Tag corresponding to an HTML form element.
The Form class is responsible for preparing HTML forms for submission. It handles the following types of elements: input (text, checkbox, radio), select, and textarea.
Each type is set by a method named after the type (e.g. set_select()), and then there are convenience methods (e.g. set()) that do type-deduction and set the value using the appropriate method.
It also handles submit-type elements using choose_submit().
-
__setitem__(name, value)
Forwards arguments to set(). For example, form["name"] = "value" calls form.set("name", "value").
-
check(data)
For backwards compatibility, this method handles checkboxes and radio buttons in a single call. It will not uncheck any checkboxes unless explicitly specified by data, in contrast with the default behavior of set_checkbox().
-
choose_submit(submit)
Selects the input (or button) element to use for form submission.
Parameters: submit – The bs4.element.Tag (or just its name-attribute) that identifies the submit element to use. If None, will choose the first valid submit element in the form, if one exists. If False, will not use any submit element; this is useful for simulating AJAX requests, for example.
To simulate a normal web browser, only one submit element must be sent. Therefore, this does not need to be called if there is only one submit element in the form.
If the element is not found or if multiple elements match, raise a LinkNotFoundError exception.
Example:
    import mechanicalsoup

    browser = mechanicalsoup.StatefulBrowser()
    browser.open(url)
    form = browser.select_form()
    form.choose_submit('form_name_attr')
    browser.submit_selected()
-
new_control(type, name, value, **kwargs)
Add a new input element to the form.
The arguments set the attributes of the new element.
-
print_summary()
Print a summary of the form.
May help to find which fields need to be filled in.
-
set(name, value, force=False)
Set a form element identified by name to a specified value. The type of element (input, textarea, select, …) does not need to be given; it is inferred by trying the following methods: set_checkbox(), set_radio(), set_input(), set_textarea(), set_select(). If none of these methods find a matching element, then if force is True, a new element (<input type="text" ...>) will be added using new_control().
Example: filling in a login/password form with EULA checkbox
    form.set("login", username)
    form.set("password", password)
    form.set("eula-checkbox", True)
Example: uploading a file through a <input type="file" name="tagname"> field (provide an open file object, and its content will be uploaded):
    form.set("tagname", open(path_to_local_file, "rb"))
-
set_checkbox(data, uncheck_other_boxes=True)
Set the checked-attribute of input elements of type “checkbox” specified by data (i.e. check boxes).
Parameters:
- data – Dict of {name: value, ...}. In the family of checkboxes whose name-attribute is name, check the box whose value-attribute is value. All boxes in the family can be checked (unchecked) if value is True (False). To check multiple specific boxes, let value be a tuple or list.
- uncheck_other_boxes – If True (default), before checking any boxes specified by data, uncheck the entire checkbox family. Consider setting to False if some boxes are checked by default when the HTML is served.
-
set_input(data)
Fill-in a set of fields in a form.
Example: filling-in a login/password form
    form.set_input({"login": username, "password": password})
This will find the input element named “login” and give it the value username, and the input element named “password” and give it the value password.
-
set_radio(data)
Set the checked-attribute of input elements of type “radio” specified by data (i.e. select radio buttons).
Parameters: data – Dict of {name: value, ...}. In the family of radio buttons whose name-attribute is name, check the radio button whose value-attribute is value. Only one radio button in the family can be checked.
-
set_select(data)
Set the selected-attribute of the first option element specified by data (i.e. select an option from a dropdown).
Parameters: data – Dict of {name: value, ...}. Find the select element whose name-attribute is name. Then select from among its children the option element whose value-attribute is value. If no matching value-attribute is found, this will search for an option whose text matches value. If the select element’s multiple-attribute is set, then value can be a list or tuple to select multiple options.
-
set_textarea(data)
Set the string-attribute of the first textarea element specified by data (i.e. set the text of a textarea).
Parameters: data – Dict of {name: value, ...}. The textarea whose name-attribute is name will have its string-attribute set to value.
-
uncheck_all(name)
Remove the checked-attribute of all input elements with a name-attribute given by name.
Exceptions
-
exception mechanicalsoup.LinkNotFoundError
Bases: Exception
Exception raised when mechanicalsoup fails to find something.
This happens in situations like (non-exhaustive list):
- find_link() is called, but no link is found.
- The browser was configured with raise_on_404=True and a 404 error is triggered while browsing.
- The user tried to fill-in a field which doesn’t exist in a form (e.g. browser["name"] = "val" with browser being a StatefulBrowser).
-
exception mechanicalsoup.InvalidFormMethod
Bases: mechanicalsoup.utils.LinkNotFoundError
This exception is raised when a method of Form is used for an HTML element that is of the wrong type (or is malformed). It is caught within Form.set() to perform element type deduction.
It is derived from LinkNotFoundError so that a single base class can be used to catch all exceptions specific to this module.