Directory
Request object
  Principle
  Parameter
Pass additional data to the callback function
  Principle
  Sample code
FormRequest
  Concept
  Parameter
  Request usage example
Response object
  Parameter
Request object
Principle
Requests and responses are the most common operations in a crawler. A Request object is generated in the crawler program and passed to the downloader, which executes the request and returns a Response object.
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback ])
A Request object represents an HTTP request, which is usually generated by a crawler and executed by a downloader to generate a Response.
Parameter
-
url (string) – URL for this request
-
callback (callable) – The function that will be called with the response of this request (once downloaded) as its first argument. See Passing Additional Data to Callback Functions below for more information. If the request does not specify a callback, the spider’s parse() method will be used. Note that errback is called instead if an exception is raised during processing.
-
method (string) – The HTTP method of this request. Defaults to ‘GET’. It can be set to “GET”, “POST”, “PUT”, etc. Make sure the string is uppercase.
-
meta (dict) – the initial value of the attribute Request.meta, used to pass data between different requests
-
body (str or unicode) – request body. If unicode is passed, then it is encoded as str using the passed encoding (defaults to utf-8). If body is not given, an empty string is stored. Regardless of the type of this parameter, the final value stored will be a str (not unicode or None).
-
headers (dict) – headers for this request. The dict values can be strings (for single-value headers) or lists (for multi-value headers). If None is passed as the value, no HTTP header will be sent. Generally not needed
-
encoding (string) – the encoding of this request. In practice, just use the default ‘utf-8’
-
dont_filter (boolean) – whether to skip the duplicate-request filter for this request. Defaults to False, meaning duplicate URLs are filtered out
-
cookies (dict or list) – request cookies. These can be sent in two forms.
- Use dict:
request_with_cookies = Request(url="http://www.sxt.cn/index/login/login.html", cookies={'currency': 'USD', 'country': 'UY'})
- Use the list:
request_with_cookies = Request(url="http://www.example.com", cookies=[{'name': 'currency', 'value': 'USD', 'domain': 'example.com', 'path': '/currency'}])
The latter form allows customizing the cookie’s domain and path attributes. This is only useful if the cookies are saved for later requests. When a site returns cookies in a response, they are stored and merged into subsequent requests for that domain; to disable this merging, set the dont_merge_cookies key in Request.meta:
request_with_cookies = Request(url="http://www.example.com", cookies={'currency': 'USD', 'country': 'UY'}, meta={'dont_merge_cookies': True})
Pass additional data to callback function
Principle
A request’s callback is the function that will be called when the response for that request is downloaded. The callback function will be called with the downloaded Response object as its first argument
Sample Code
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
Analyzing the code:
The code defines two callback functions, parse_page1 and parse_page2, which Scrapy uses to process responses. The parse_page1 function receives a response parameter representing the response of the first page. It creates a MyItem object and assigns response.url to its main_url field. It then creates a new Scrapy request for the target page “http://www.example.com/some_page.html”, specifying parse_page2 as the callback, and stores the item object in the request’s metadata (meta) so that it can be accessed in parse_page2. Finally, parse_page1 returns the request object so that Scrapy can continue processing it. The parse_page2 function receives the response of the target page. It retrieves the item saved in the metadata via response.meta['item'], assigns response.url to the item’s other_url field, and finally returns the item object.
FormRequest
Concept
FormRequest is a subclass of Request. Its specific, commonly used capabilities are:
-
When requesting, carry parameters, such as form data
-
Get form data from Response
The FormRequest class can carry parameters mainly because its constructor adds a new formdata parameter; the remaining parameters are the same as in the Request class.
- The formdata parameter type is: dict
class scrapy.http.FormRequest(url[, formdata, …])
class method from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])
Returns a new FormRequest object with its form field values pre-populated.