Scrapy framework – Request and FormRequest

Directory

Request object

Principle

Parameters

Pass additional data to the callback function

Principle

Sample code

FormRequest

Concept

Parameters

Request usage example

Response object

Parameters

Request object

Principle

Requests and responses are the most common objects in a crawler. A Request object is generated in the spider and passed to the downloader, which executes the request and returns a Response object.

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback ])

A Request object represents an HTTP request, which is usually generated by a crawler and executed by a downloader to generate a Response.

Parameters

  • url (string) – URL for this request

  • callback (callable) – the function that will be called with the response of this request (once downloaded) as its first argument. See Pass additional data to the callback function below for more information. If the request does not specify a callback, the spider’s parse() method will be used. Note that if an exception is raised during processing, errback is called instead.

  • method (string) – the HTTP method of this request. Defaults to ‘GET’. It can be set to “GET”, “POST”, “PUT”, etc.; make sure the string is uppercase

  • meta (dict) – the initial value of the attribute Request.meta, used to pass data between different requests

  • body (str or unicode) – request body. If unicode is passed, then it is encoded as str using the passed encoding (defaults to utf-8). If body is not given, an empty string is stored. Regardless of the type of this parameter, the final value stored will be a str (not unicode or None).

  • headers (dict) – headers for this request. The dict values can be strings (for single-value headers) or lists (for multi-value headers). If None is passed as the value, no HTTP header will be sent. Generally not needed

  • encoding (string) – the encoding of this request (defaults to ‘utf-8’); the default is usually fine

  • dont_filter (boolean) – whether to skip the duplicate-URL filter for this request. Defaults to False, i.e. duplicate URLs are filtered out

  • cookies (dict or list) – request cookies. These can be sent in two forms.

    • Using a dict:

request_with_cookies = Request(url="http://www.sxt.cn/index/login/login.html",
                cookies={'currency': 'USD', 'country': 'UY'})

    • Using a list:

request_with_cookies = Request(url="http://www.example.com",
                cookies=[{'name': 'currency',
                          'value': 'USD',
                          'domain': 'example.com',
                          'path': '/currency'}])

The latter form allows customizing the cookie’s domain and path attributes; this is only useful if the cookies are saved for later requests. By default, cookies returned by a site are stored and sent again on subsequent requests to that domain. To keep a request’s cookies from being merged with those stored ones, set the dont_merge_cookies key to True in Request.meta:

request_with_cookies = Request(url="http://www.example.com",
                cookies={'currency': 'USD', 'country': 'UY'},
                meta={'dont_merge_cookies': True})

Pass additional data to the callback function

Principle

A request’s callback is the function that will be called when the response for that request is downloaded. The callback function will be called with the downloaded Response object as its first argument

Sample Code

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item

Analyzing the code:

The code defines two callback functions, parse_page1 and parse_page2, which Scrapy uses to process responses. parse_page1 receives the response for the first page. It creates a MyItem object and assigns response.url to its main_url field. It then creates a new Request for “http://www.example.com/some_page.html” with parse_page2 as the callback, saves the item object in the request’s metadata (meta) so that parse_page2 can access it, and returns the request so that Scrapy can continue processing it.

parse_page2 receives the response for the target page. It first retrieves the item object saved in the metadata via response.meta['item'], then assigns response.url to the item’s other_url field, and finally returns the item.

FormRequest

Concept

FormRequest is a subclass of Request. Its most commonly used features are:

  • When requesting, carry parameters, such as form data

  • Get form data from Response

The FormRequest class can carry form parameters because its constructor adds a new formdata parameter; the remaining parameters are the same as in the Request class.

  • The formdata parameter is of type dict

class scrapy.http.FormRequest(url[, formdata, …])

class method from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])

Returns a new FormRequest object whose form field values are pre-populated from the HTML <form> element found in the given response.

Parameters

  • response (Response object) – the response containing the HTML form that will be used to pre-populate the form fields
  • formname (string) – if given, the form whose name attribute is set to this value will be used
  • formid (string) – if given, the form whose id attribute is set to this value will be used
  • formxpath (string) – if given, the first form matching the XPath will be used
  • formcss (string) – if given, the first form matching the CSS selector will be used
  • formnumber (integer) – the index of the form to use when the response contains multiple forms. The first form (the default) is 0
  • formdata (dict) – fields to override in the form data. If a field already exists in the response’s form element, its value is overridden by the value passed in this parameter
  • clickdata (dict) – attributes used to look up the clicked control. If not given, the form data is submitted simulating a click on the first clickable element. Besides HTML attributes, the control can also be identified by its zero-based index relative to the other submittable inputs in the form, via the nr attribute
  • dont_click (boolean) – if True, the form data will be submitted without clicking any element

Request usage example

Send data via HTTP POST

FormRequest(
    url="http://www.example.com/post/action",
    formdata={'name': 'John Doe', 'age': '27'},
    callback=self.after_post
)

Send data via FormRequest.from_response()

FormRequest.from_response(
    response,
    formdata={'username': 'john', 'password': 'secret'},
    callback=self.after_login
)

Response object

class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None])

A Response object represents an HTTP response; it is usually produced by the downloader and fed to the spider for processing.

Parameters

  • url (string) – URL for this response
  • status (integer) – HTTP status of the response. The default is 200
  • headers (dict) – headers for this response. dict values can be strings (for single-value headers) or lists (for multi-value headers)
  • body (bytes) – the response body. It must be bytes, not str, unless you use an encoding-aware response subclass such as TextResponse
  • flags (list) – the initial value of the Response.flags attribute. If given, the list is shallow-copied
  • request (Request object) – the initial value of the Response.request attribute. This is the Request that generated this response
  • text – the response body decoded as text (available on TextResponse and its subclasses)