Scrapy crawls asynchronously loaded data

Use of Scrapy middleware

  • Foreword
  • Scrapy middleware
    • 1 Classification and function of scrapy middleware
      • 1.1 Classification of scrapy middleware
      • 1.2 The role of scrapy middleware
    • 2 The methods of downloader middleware:
      • process_request(request, spider):
      • process_response(request, response, spider):
      • process_exception(request, exception, spider):
    • 3 Grab some news
      • 3.1 Pre-crawl analysis
      • 3.2 Code configuration
      • 3.3 Print results
  • Summary

Foreword

What should we do when our crawler runs into lazily loaded data? The first idea is usually to use selenium to simulate a human scrolling so the page keeps loading until we can finally read the data out, but selenium on its own is too slow. Combining selenium with scrapy solves exactly this problem. Below is a small case of using scrapy to grab lazily loaded data. The code and approach certainly have shortcomings, so please give me some pointers!!!

Scrapy middleware

1 Classification and function of scrapy middleware

1.1 Classification of scrapy middleware

Scrapy middleware can be divided into the following two categories:

  • Downloader Middleware:
    Downloader middleware sits between the engine and the downloader. It acts on every request sent to the server and on every response received, so it can be used to modify requests, add request headers, set a proxy, tamper with responses, handle exceptions raised during downloading, and so on. Its main methods are process_request(request, spider), process_response(request, response, spider) and process_exception(request, exception, spider), covered in detail in section 2. If process_request returns a Response object, Scrapy does not continue to process the request but hands that Response straight back to the crawler.
  • Spider Middleware:
    Spider middleware sits between the engine and the spiders. It acts on the responses handed to the spiders and on the items and requests the spiders produce, and it can also handle exceptions raised inside spider callbacks.

1.2 The role of scrapy middleware

  • The main job of Scrapy middleware is to intercept and process requests and responses, giving you a global point of control and intervention over the crawl. Middleware can be used to implement request and response processing, proxy settings, random switching of the User-Agent, cookie handling, error handling, custom retry logic, and more.
  • By default, both kinds of middleware live in the project's middlewares.py file.
  • Spider middleware is used in much the same way as downloader middleware; in practice, downloader middleware is the one most commonly used.

2 The methods of downloader middleware:

process_request(request, spider):

  • Function:
    • Preprocess the request before sending it, allowing you to modify the request or add custom request header information, etc.
  • Parameters:
    • request: The request object to be sent.
    • spider: crawler instance, you can access the properties and methods of the crawler in the middleware.
  • Return value:
    • Return None: the request continues through the remaining middleware as usual and is sent to the server.
    • Return a Response object: the middleware has handled the request itself; that response is handed back towards the crawler and the request is never sent.
    • Return a Request object: the original request is replaced with the new one, which Scrapy schedules and processes instead (see the sketch below).
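As a quick illustration (a minimal sketch, independent of the project below; the user-agent pool and proxy address are placeholders), a process_request that sets a random User-Agent and a proxy and then lets the request go on:

import random

UA_POOL = ['Mozilla/5.0 ... A', 'Mozilla/5.0 ... B']  # placeholder user agents
PROXY = 'http://127.0.0.1:8888'  # placeholder proxy address

class RandomUaProxyMiddleware:
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(UA_POOL)  # swap the User-Agent
        request.meta['proxy'] = PROXY  # Scrapy's HttpProxyMiddleware reads this meta key
        return None  # None -> Scrapy keeps processing and sends the request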

process_response(request, response, spider):

  • Function:
    • Post-process the response after it has been received, allowing you to modify the response content, deal with bad responses, etc.
  • Parameters:
    • request: The corresponding request object.
    • response: The received response object.
    • spider: crawler instance, you can access the properties and methods of the crawler in the middleware.
  • Return value (unlike process_request, returning None is not allowed here; this method must return a Response or a Request, or raise IgnoreRequest):
    • Return a Response object (the original one, a modified one, or a brand-new one such as the tampered HtmlResponse in the case below): it continues through the remaining middleware and is eventually handed to the crawler.
    • Return a Request object: the current response is discarded and Scrapy schedules the new request instead, for example to retry a blocked page (see the sketch below).
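For example, a minimal sketch that treats an HTTP 403 as a sign of being blocked (an assumption for illustration) and reschedules the request instead of handing the response to the spider:

class RetryBlockedMiddleware:
    def process_response(self, request, response, spider):
        if response.status == 403:  # assumption: 403 means we were blocked
            spider.logger.warning('Blocked, rescheduling: %s', request.url)
            # Returning a Request discards this response and reschedules the request;
            # dont_filter=True keeps the duplicate filter from dropping it
            return request.replace(dont_filter=True)
        return response  # any other response is passed on towards the spider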

process_exception(request, exception, spider):

  • Function:

    • Handle when an exception occurs during request or response processing, allowing you to customize exception handling or perform error retry operations.
  • Parameters:

    • request: The request object that caused the exception.
    • exception: The exception object generated.
    • spider: crawler instance, you can access the properties and methods of the crawler in the middleware.
  • Return value:

    • Return a Request object: the middleware has handled the exception and Scrapy schedules the new request instead; this is how custom retry logic is usually implemented (see the sketch after this list).
    • Return a Response object: the exception is treated as handled and the response is fed into the process_response chain.
    • Return None: the exception keeps propagating, and other exception handlers (and finally Scrapy's default handling) deal with it.
  • In these methods, you can modify the request and response according to your needs, handle exceptions, implement custom retry logic, etc. At the same time, you can also use the spider parameter to access the properties and methods of the crawler for better interaction and control with the crawler. Such as: spider.name

  • It should be noted that the execution order of these methods is determined by the middleware's priority in the DOWNLOADER_MIDDLEWARES setting. The smaller the value, the closer the middleware sits to the engine, so its process_request runs earlier and its process_response runs later. When implementing custom downloader middleware, set the priorities according to the functional needs and dependencies.
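Putting the pieces together, here is a minimal sketch of a process_exception that retries a failed request a limited number of times (the retry limit, middleware name and meta key are assumptions for illustration), plus the priority entry that would enable it:

class SimpleRetryMiddleware:
    MAX_RETRIES = 3  # assumed limit, purely for illustration

    def process_exception(self, request, exception, spider):
        retries = request.meta.get('custom_retry_times', 0)
        if retries < self.MAX_RETRIES:
            spider.logger.info('%s raised %r, retry %d', request.url, exception, retries + 1)
            new_meta = dict(request.meta, custom_retry_times=retries + 1)
            # Returning a Request stops the exception chain and reschedules the request
            return request.replace(meta=new_meta, dont_filter=True)
        return None  # give up: let Scrapy's default exception handling take over

# settings.py -- smaller numbers sit closer to the engine, so their
# process_request runs earlier (and their process_response runs later)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SimpleRetryMiddleware': 550,
}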

3 Grab some news

3.1 Pre-crawl analysis

Grab the data of the domestic and international news sections

  • Analysis
    The data of the domestic and international sections is loaded dynamically; it is not returned together with the initial page request.

  • Solution
    1. Use selenium together with Scrapy to render the page and then extract the data
    2. Find the url that actually loads the dynamic data and crawl that url directly (see the sketch below)
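Approach 1 is what the rest of this section implements. For approach 2 the idea would be to open the browser DevTools Network panel, find the request that actually returns the section data (often a JSON url), and request that url directly; a rough sketch, where the url and field names are purely hypothetical:

import json
import scrapy

class ApiDemoSpider(scrapy.Spider):
    name = 'api_demo'
    # hypothetical endpoint found in the Network panel -- not a real address
    start_urls = ['https://example.com/api/news?page=1']

    def parse(self, response):
        data = json.loads(response.text)  # dynamically loaded data usually comes back as JSON
        for article in data.get('articles', []):  # 'articles' is an assumed field name
            yield {'title': article.get('title'), 'url': article.get('url')}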

3.2 Code Configuration

  • Configure the settings.py file
# Scrapy settings for wangyi project
BOT_NAME = 'wangyi'

SPIDER_MODULES = ['wangyi.spiders']
NEWSPIDER_MODULE = 'wangyi.spiders'

# Default request header
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'

# Used to replace the random request header
USER_AGENTS_LIST = [
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527 + (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5" ]

LOG_LEVEL = 'ERROR'
ROBOTSTXT_OBEY = False

DOWNLOADER_MIDDLEWARES = {
   'wangyi.middlewares.WangyiDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
   'wangyi.pipelines.WangyiPipeline': 300,
}
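The crawler code below imports WangyiItem from wangyi/items.py, which is not shown in the original; a minimal definition with the two fields the spider fills in would look like this:

# wangyi/items.py
import scrapy

class WangyiItem(scrapy.Item):
    title = scrapy.Field()  # article title
    con = scrapy.Field()    # article body paragraphs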
  • Crawler code wy.py
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from wangyi.items import WangyiItem

class WySpider(scrapy.Spider):
    name = "wy"
    # allowed_domains = ["Crawled website domains"]
    start_urls = ["The starting url of crawling website"]
    li_index = [1, 2]  # indexes 1 and 2 are the domestic and international sections
    page_url = []  # stores the urls of the required sections
    # Hide the browser interface (headless mode)
    chrome_option = Options()
    chrome_option.add_argument('--headless')
    chrome_option.add_argument('--disable-gpu')
    # Reduce the chance of being detected as an automated browser
    chrome_option.add_experimental_option('excludeSwitches', ['enable-automation'])
    chrome_option.add_experimental_option('useAutomationExtension', False)
    # Create the shared driver with this configuration
    driver = webdriver.Chrome(options=chrome_option)
    
    def parse(self, response, **kwargs):
        # Grab the data of domestic and international sectors
        # Returns the urls of all sections
        url_list = response.xpath('/html/body/div/div[3]/div[2]/div[2]/div/ul/li/a/@href').extract()
        # print(url_list)
        for i in range(len(url_list)):
            if i in self.li_index:
                url = url_list[i]
                # Store the url of the corresponding section
                self.page_url.append(url)
                yield scrapy.Request(url, callback=self.parse_detail)

    # Handle the response from each section page
    def parse_detail(self, response):
        url_list = response.xpath('/html/body/div/div[3]/div[3]/div[1]/div[1]/div/ul/li/div/div/div/div[1]/h3/a/@href').extract()
        for url in url_list:
            yield scrapy.Request(url, callback=self.parse_detail_con)

    def parse_detail_con(self, response):
        items = WangyiItem()
        title = response.xpath('//*[@id="container"]/div[1]/h1/text()').extract_first()
        con = response.xpath('//*[@id="content"]/div[2]/p/text()').extract()[1:]
        items['title'] = title
        items['con'] = con
        print(items)
        yield items
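One detail the spider above leaves out is shutting down the headless browser. Scrapy calls a spider's closed(reason) method, if one is defined, when the spider finishes, so a small cleanup hook could be added to WySpider:

    # add inside WySpider: quit the shared selenium driver when the spider closes
    def closed(self, reason):
        self.driver.quit()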
  • middlewares.py
import time
import random
from scrapy import signals  # needed for the spider_opened signal below
from scrapy.http import HtmlResponse
from wangyi.settings import USER_AGENTS_LIST

class WangyiSpiderMiddleware:
    ...  # spider middleware left as generated by the scrapy template (omitted)

class WangyiDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        ua = random.choice(USER_AGENTS_LIST) # One random request header each time
        request.headers['User-Agent'] = ua # set the request header
        return None

    def process_response(self, request, response, spider):
        driver = spider.driver  # reuse the driver attribute defined on the spider
        # Only the section pages need selenium; everything else passes through untouched
        if request.url in spider.page_url:
            driver.get(request.url)  # re-fetch the request with selenium
            # Scroll to the bottom so the lazily loaded content is rendered
            driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            time.sleep(1)
            # Scroll a second time to trigger another batch
            driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            time.sleep(1)
            text = driver.page_source
            # Build an HtmlResponse from the rendered page (response tampering)
            return HtmlResponse(url=request.url, body=text, request=request, encoding='utf-8')
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        print('process_exception')


    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
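settings.py above enables wangyi.pipelines.WangyiPipeline, and the summary leaves the saving step to the reader; a minimal sketch of such a pipeline, assuming the items are simply appended to a JSON Lines file:

# wangyi/pipelines.py -- minimal sketch, output format is an assumption
import json

class WangyiPipeline:
    def open_spider(self, spider):
        self.file = open('wangyi_news.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        self.file.close()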

3.3 Print results

Summary

This article introduces how to use Scrapy middleware, focusing on downloader middleware. Downloader middleware is the type of Scrapy middleware used to intercept and process requests and responses as they pass through the downloader. Its three main methods are process_request, process_response, and process_exception.

  1. process_request(request, spider) preprocesses a request before it is sent, allowing you to modify it or add custom request headers. Returning None lets the request continue to the server; returning a Response hands that custom response back to the crawler without sending the request; returning a Request makes Scrapy schedule the new request instead.

  2. process_response(request, response, spider) post-processes a response after it has been received, allowing you to modify its content or decide to retry. Returning a Response passes it on towards the crawler; returning a Request discards the response and makes Scrapy schedule the new request instead.

  3. process_exception(request, exception, spider) handles exceptions raised while a request is being processed, allowing custom exception handling or retry logic. Returning a Request makes Scrapy schedule the new request; returning a Response treats the exception as handled; returning None lets Scrapy fall back to its default exception handling.

This article also provides a small case of using Scrapy to grab lazily loaded data, combining Selenium with Scrapy to capture dynamically loaded content. In the case, the spider collects the section urls, the downloader middleware intercepts the requests for those sections and uses Selenium to scroll the page so the dynamic content loads, and the rendered page is handed back to the spider as a tampered response. Interested readers can configure the item pipeline to save the results themselves (see the pipeline sketch after the middleware code above).

In general, downloader middleware is one of Scrapy's most powerful and flexible features. By customizing middleware you can process requests and responses globally, improving the flexibility and reliability of a crawler, and you can also cooperate with tools such as Selenium to capture dynamically loaded data.