Scrapy crawler framework

Getting started with Scrapy

    • 1. Scrapy overview
      • 1.1. Introduction to Scrapy
      • 1.2. Scrapy architecture principle
    • 2. Scrapy environment construction
      • 2.1. CMD to build Scrapy environment
      • 2.2. PyCharm builds Scrapy environment
      • 2.3. Scrapy project structure
    • 3. Scrapy uses four steps
    • 4. Scrapy entry case
      • 4.1. Clear goals
      • 4.2. Making a crawler
      • 4.3. Storing data
      • 4.4. Run the crawler

1. Scrapy overview

1.1. Introduction to Scrapy

Scrapy is a Python-based open-source web crawler framework for extracting data from web pages. It provides a set of efficient, flexible and scalable tools that help developers quickly build and deploy crawler programs.

Scrapy is an application framework written in Python that is suited to crawling website data and extracting structured data. It is mainly used for data mining, information processing, data storage and automated testing. With the Scrapy framework, a working crawler can be implemented and pages scraped quickly with only a small amount of code.

Scrapy is built on Twisted, an asynchronous networking framework, which it uses to handle network communication and to speed up downloads. The framework has a clear architecture and exposes various middleware interfaces, so it can be adapted flexibly to different needs; request handling is asynchronous and non-blocking.

The Scrapy framework has the following features:

  • High performance: Scrapy uses an asynchronous network request and processing mechanism to efficiently handle large-scale web crawling tasks.

  • Configurability: Scrapy provides a wealth of configuration options; the behavior of the crawler can be set flexibly through configuration files or code, including request headers, request intervals, concurrency limits, etc. (a settings sketch follows this list).

  • XPath and CSS selectors: Scrapy has built-in powerful selectors that support the use of XPath and CSS selectors to locate and extract data from web pages.

  • Middleware and extensions: Scrapy provides middleware and extension mechanisms. Developers can customize and extend the functions of the framework by writing middleware and extensions, such as custom request processing, data processing, error handling, etc.

  • Distributed support: Scrapy can be used in conjunction with distributed task queues (such as Celery) to achieve distributed crawling and data processing.

  • Data storage: Scrapy supports storing crawled data in various data storage systems, including files, databases (such as MySQL, PostgreSQL) and NoSQL databases (such as MongoDB), etc.

  • Logging and debugging: Scrapy provides powerful logging and debugging functions, which can help developers debug and troubleshoot crawlers.
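As an illustration of the configuration options mentioned above, below is a minimal sketch of what a project's settings.py might contain; the values shown are arbitrary examples, not recommended defaults.

# settings.py (fragment) -- example values only
BOT_NAME = "ScrapyDemo"

# Default headers attached to every request
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en",
}

# Politeness / throughput settings
DOWNLOAD_DELAY = 1                   # seconds to wait between requests
CONCURRENT_REQUESTS = 16             # global concurrency limit
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-domain concurrency limit

# Whether to obey robots.txt
ROBOTSTXT_OBEY = True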

1.2. Scrapy architecture principle

The Scrapy framework consists of five major components (its architecture):

  • Scrapy Engine: The engine is the core of the entire framework and is responsible for the communication and data transfer between the Spider, Item Pipeline, Downloader and Scheduler.
  • Scheduler: A priority queue of page URLs. It accepts the requests sent by the engine, arranges and enqueues them in a certain order, and hands them back to the engine when the engine asks for them.
  • Downloader: Downloads all Requests sent by the engine and returns the obtained Responses to the engine, which passes them to the Spider for processing.
  • Spider: A user-defined crawler that extracts information (Items) from specific web pages. It processes all Responses, extracts data from them, and submits the URLs that need to be followed to the engine, which puts them back into the Scheduler.
  • Item Pipeline: Processes the entities (Items) produced by the Spider and performs post-processing such as detailed analysis, filtering and persistent storage.
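As a concrete, purely illustrative example of an Item Pipeline, the sketch below drops items that lack a field and cleans up the rest; the field name "title" and the class name are assumptions, not part of a specific project.

# pipelines.py -- minimal, hypothetical post-processing pipeline
from scrapy.exceptions import DropItem

class CleanAndFilterPipeline:
    def process_item(self, item, spider):
        # Filtering: discard items that are missing a required field
        if not item.get("title"):
            raise DropItem("missing title")
        # Detailed analysis / cleanup: normalize whitespace
        item["title"] = item["title"].strip()
        return item

A pipeline like this is activated through the ITEM_PIPELINES setting in settings.py.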

Other components:

  • Downloader Middlewares: Customizable components that extend the download functionality (a sketch follows this list)
  • Spider Middlewares: Customizable extension components that operate on the communication between the engine and the Spider
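As an example, a custom downloader middleware might look like the sketch below; the class name and the User-Agent string are made up for illustration.

# middlewares.py -- hypothetical downloader middleware
class CustomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Called for every request before it reaches the downloader
        request.headers["User-Agent"] = "ScrapyDemo/1.0"
        return None  # returning None lets processing continue normally

It would be enabled through the DOWNLOADER_MIDDLEWARES setting.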

Scrapy’s crawling process is:

  • The engine takes out a link (URL) from the scheduler for subsequent crawling
  • The engine encapsulates the URL into a request and passes it to the downloader
  • The downloader downloads the resources and encapsulates them into a response package (Response)
  • The Spider parses the Response
  • The parsed entities (Items) are handed over to the Item Pipeline for further processing
  • The parsed links (URLs) are handed over to the Scheduler to wait for crawling
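This flow maps directly onto spider code. The minimal sketch below follows the example from Scrapy's official tutorial (quotes.toscrape.com); the selectors and field names apply to that demo site only and are shown purely for illustration.

# spiders/quotes_spider.py -- minimal sketch of the crawling cycle
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]  # initial URLs handed to the Scheduler

    def parse(self, response):
        # The engine delivers each downloaded Response here
        for quote in response.css("div.quote"):
            # Extracted data (Item) is sent on to the Item Pipeline
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Newly discovered links go back to the Scheduler to wait for crawling
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)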

Scrapy official website: https://docs.scrapy.org

Getting started documentation: https://doc.scrapy.org/en/latest/intro/tutorial.html

Scrapy Chinese documentation: https://www.osgeo.cn/scrapy/

2. Scrapy environment construction

2.1. CMD to build Scrapy environment

1) Install Scrapy from the CMD command line:

pip install scrapy

After the installation is complete, enter the scrapy command to verify:

scrapy
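If the installation succeeded, this should print the installed Scrapy version together with a list of available sub-commands. The version alone can also be checked with:

scrapy version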

2) Create a crawler project in the directory where your crawler projects are stored:

scrapy startproject ScrapyDemo

Switching drives and directories in CMD:

F:             # switch to drive F:
cd A/B/...     # change into the target directory

This command will generate a Scrapy project in the current directory.

3) Use PyCharm to open the created project. The initial project structure is as follows:
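Shown here for reference is the typical layout produced by scrapy startproject ScrapyDemo; the exact files may vary slightly between Scrapy versions.

ScrapyDemo/
    scrapy.cfg            # deployment / project configuration file
    ScrapyDemo/           # the project's Python module
        __init__.py
        items.py          # Item definitions (structured data models)
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where spider files go
            __init__.py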

2.2. PyCharm builds Scrapy environment

1) Create a new crawler project directory

Method 1: Use PyCharm to open the local folder where the crawler project will be stored (delete the auto-generated main.py)

Method 2: Create a new directory for the crawler project inside an existing PyCharm project

2) Open Terminal and install Scrapy:

pip install scrapy

3) Create the crawler project in the Terminal (for method 2, first cd into the new directory):

scrapy startproject ScrapyDemo

This command will generate a Scrapy project in the current directory. The initial structure of the project is the same as in CMD mode.

2.3. Scrapy project structure

No matter which method is used to set up the project, the crawler itself is the most important part, so the spider files are indispensable.

4) Create the core crawler file (customizable) in the spiders folder: