Crawling data with the Scrapy framework (create a Scrapy project + parse data with XPath + save data through pipelines + middleware)

Table of Contents

1. Create a Scrapy project

2. Parse data with XPath

3. Save data through pipelines

4. Middleware


1. Create a Scrapy project

1. Create a folder: C06

Run the following commands in the terminal:

2. Install scrapy: pip install scrapy

3. Go to the folder: cd C06

4. Create project: scrapy startproject C06L02 (project name)

5. Switch to C06L02: cd C06L02/C06L02

Switch to spiders: cd spiders

6. Create the crawler, giving it a name and the link to crawl: scrapy genspider app https://product.cheshi.com/rank/2-0-0-0-1/

(If CrawlSpider is used to crawl the entire site: scrapy genspider -t crawl app http://seller.cheshi.com/jinan/)

7. Check that the start URL in the newly generated crawler file (app.py) matches the link you want to crawl

8. Run the crawler file: scrapy crawl app

9. To show only errors and suppress the rest of the log output, add this line to settings.py: LOG_LEVEL = "ERROR"

To ignore the robots.txt protocol, set this in settings.py: ROBOTSTXT_OBEY = False (a sample settings.py fragment is shown after step 11)

10. The app.py file code of a simple scrapy project is as follows:

import scrapy

class AppSpider(scrapy.Spider):
    name = "app"
    allowed_domains = ["product.cheshi.com"]
    started_urls = ["http://product.cheshi.com/rank/2-0-0-0-1/"]

    def parse(self, response):
        print(response.text)

If CrawlSpider is used to crawl the entire site:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class AppSpider(CrawlSpider):
    name = "app"
    allowed_domains = ["product.cheshi.com"]
    started_urls = ["http://product.cheshi.com/jinan"]

    rules = (
        Rule(LinkExtractor(allow=r"seller.cheshi.com/\d+", deny=r"seller.cheshi.com/\d+/.+"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):  # CrawlSpider uses parse internally, so the callback needs another name
        print(response.url)

11. User-Agent configuration: uncomment the USER_AGENT entry in settings.py and fill in the desired value (a sample settings.py fragment covering steps 9 and 11 is shown below)
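For reference, a minimal settings.py fragment covering steps 9 and 11 might look like the sketch below (the User-Agent string is a placeholder; leave the rest of the generated file unchanged):

# settings.py (only the lines touched in steps 9 and 11)
LOG_LEVEL = "ERROR"        # show errors only, hide the verbose crawl log
ROBOTSTXT_OBEY = False     # do not obey robots.txt
# replace the placeholder below with a real browser User-Agent value
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."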

2. Parse data with XPath

Modify the parse function in the app.py file

import scrapy

class AppSpider(scrapy.Spider):
    name = "app"
    allowed_domains = ["product.cheshi.com"]
    started_urls = ["http://product.cheshi.com/rank/2-0-0-0-1/"]

    def parse(self, response):
        cars = response.xpath('//ul[@class="condition_list_con"]/li')
        for car in cars:
            title = car.xpath('./div[@class="m_detail"]//a/text()').get()
            price = car.xpath('./div[@class="m_detail"]//b/text()').get()
            print(title, price)
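Note that .get() returns only the first matching result (or None when nothing matches); use .getall() instead when you need a list of every match.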

To crawl page by page (pagination), the code is as follows; the C06L10Item class it imports is sketched after the code

import scrapy
from ..items import C06L10Item

class AppSpider(scrapy.Spider):
    name = "app"
    allowed_domains = ["book.douban.com"]
    started_urls = ["http://book.douban.com/latest"]

    def parse(self, response):
        books = response.xpath('//ul[@class="chart-dashed-list"]/li')
        for book in books:
            link = book.xpath('.//h2/a/@href').get()
            yield scrapy.Request(url=link,callback=self.parse_details)
        
        next_url = response.xpath('//*[@id="content"]/div/div[1]/div[4]/span[4]/a/@href').get()
    
        if next_url is not None:
            next_url = response.urljoin(next_url)
            print(next_url)
            yield scrapy.Request(url=next_url, callback=self.parse)
        else:
            next_url = response.xpath('//*[@id="content"]/div/div[1]/div[4]/span[3]/a/@href').get()
            next_url = response.urljoin(next_url)
            print(next_url)
            yield scrapy.Request(url=next_url, callback=self.parse)

    def parse_details(self, response):
        item = C06L10Item()
        item["title"] = response.xpath('//*[id="wrapper"]/h1/span/text()').get()
        item["publisher"] = response.xpath('//*[id="info"]/a[1]/text()').get()
        yield item
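The spider above imports C06L10Item, which is not shown in the original text; a minimal items.py sketch covering the two fields used in parse_details (the class name comes from the import, the rest is an assumption) would be:

import scrapy

class C06L10Item(scrapy.Item):
    title = scrapy.Field()      # book title
    publisher = scrapy.Field()  # publisher name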

3. Save data through pipelines

1. Define the data model in the items.py file

import scrapy

class C06L04Item(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()

2. Add the following code to the app.py file

import scrapy
from ..items import C06L04Item

class AppSpider(scrapy.Spider):
    name = "app"
    allowed_domains = ["product.cheshi.com"]
    start_urls = ["http://product.cheshi.com/rank/2-0-0-0-1/"]

    def parse(self, response):
        item = C06L04Item()
        cars = response.xpath('//ul[@class="condition_list_con"]/li')
        for car in cars:
            item["title"] = car.xpath('./div[@class="m_detail"]//a/text()').get()
            item["price"] = car.xpath('./div[@class="m_detail"]//b/text()').get()
            yield item

3. Uncomment the ITEM_PIPELINES block in settings.py; no other changes are needed (see the snippet below).
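For reference, the uncommented block should look roughly like this (assuming the project is named C06L04 to match the pipeline class used here; adjust the module path to your own project name):

ITEM_PIPELINES = {
    "C06L04.pipelines.C06L04Pipeline": 300,
}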

4. Modify the pipelines.py file code

from itemadapter import ItemAdapter

class C06L04Pipeline:
    def process_item(self, item, spider):
        # print(item["title"],item["price"])
        return item

If you want to save the data to a file, use the following code

from itemadapter import ItemAdapter

class C06L04Pipeline:
    def __init__(self):
        self.f = open("data.tet", "w")
    def process_item(self, item, spider):
        self.f.write(item["title"] + item["price"] + "\
")
        return item
    def __del__(self):
        self.f.close()
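Closing the file in __del__ works, but Scrapy pipelines also provide dedicated hooks; a sketch of the same pipeline written with the standard open_spider/close_spider methods would be:

from itemadapter import ItemAdapter

class C06L04Pipeline:
    def open_spider(self, spider):
        # called once when the spider is opened
        self.f = open("data.txt", "w")

    def process_item(self, item, spider):
        # write one line per item
        self.f.write(item["title"] + item["price"] + "\n")
        return item

    def close_spider(self, spider):
        # called once when the spider is closed
        self.f.close()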

To store the data in MongoDB instead, use the following code

from itemadapter import ItemAdapter
import pymongo

class C06L04Pipeline:
    def __init__(self):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.db = self.client["cheshi"]
        self.col = self.db["cars"]
    def process_item(self, item, spider):
        res = self.col.insert_one(dict(item))
        print(res.inserted_id)
        return item
    def __del__(self):
        print("end")

4. Middleware

1. Middleware applications: random User-Agent, proxy IP, using Selenium, adding cookies

2. Dynamic User-Agent

Uncomment the DOWNLOADER_MIDDLEWARES block in settings.py (see the snippet below)
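A rough sketch of the uncommented block (the middleware class name here is an assumption; use whatever DownloaderMiddleware class Scrapy generated in your project's middlewares.py):

DOWNLOADER_MIDDLEWARES = {
    "C06L02.middlewares.C06L02DownloaderMiddleware": 543,
}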

Then modify the process_request method of the downloader middleware class in middlewares.py (only the modified part is shown; import random goes at the top of the file):

import random

def process_request(self, request, spider):
    uas = [
        # replace these placeholders with real User-Agent strings;
        # only the value goes here, without the "User-Agent:" prefix
        "Mxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "Mxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "Mxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "Mxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    ]
    request.headers["User-Agent"] = random.choice(uas)

3. Proxy IP

The specific steps are omitted here; for example, the documentation center of the "fast proxy" (Kuaidaili) service has concrete examples under tunnel proxy → Python → Scrapy. A minimal sketch of the idea follows.
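As an illustration only (the proxy address and credentials below are placeholders, not values from the original), a downloader middleware usually assigns the proxy through request.meta in process_request:

def process_request(self, request, spider):
    # route this request through a proxy server;
    # the endpoint and credentials are placeholders
    request.meta["proxy"] = "http://username:password@proxy.example.com:8000"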
