[scrapy_redis distributed crawling case] — 17K Novel Network

Note: These are only my personal case notes!!! They are not meant as general guidance.

1. Requirement Description, Web Page Analysis, and Preparation

1.1 Requirements Description

Website: [Boys_Novel Classification_Complete Novel Classification_Free Novel Classification-17K Novel Network]

Requirements: open the homepage -> select Category -> select Completed, only view free

Get each book's name, its chapter names, and the chapter URLs

Below, this page is referred to as the list page:

Book introduction page:

Book Page:

1.2 Web Page Analysis

Regular expression for book pages: r'//www.17k.com/book/\d+.html'

Book information location in the list:

Book title and url location on the introduction page:

Chapter information location in the book:

Full URL of a novel = f'https://www.17k.com{a_tag.attrs["href"]}'

Chapter URL = f'https://www.17k.com{href attribute}'
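
To sanity-check the regular expression and the URL templates above before wiring them into the spider, a quick standalone sketch like the following can be used (the book id and href are made-up placeholder values):

import re

# The same pattern that is later passed to the spider's LinkExtractor
book_pattern = re.compile(r'//www.17k.com/book/\d+.html')

# Made-up sample values, only to illustrate matching and URL building
sample_link = 'https://www.17k.com/book/123456.html'
print(bool(book_pattern.search(sample_link)))   # True -> this link would be extracted

sample_href = '/book/123456.html'
book_url = f'https://www.17k.com{sample_href}'
print(book_url)                                 # https://www.17k.com/book/123456.html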

1.3 Prepare a new crawler and modify settings.py

>scrapy startproject scrapy_redis1
>scrapy genspider -t crawl XiaoShuo aaa.com

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

2. Writing the crawler project

2.1 Define the data to be stored in items.py

import scrapy

class XSItem(scrapy.Item):
    # Book name
    title = scrapy.Field()
    # Chapter names
    chapter = scrapy.Field()
    # Chapter URLs
    url = scrapy.Field()

2.2 Modify the XiaoShuo.py template

Modify the generated crawler module as follows:

  • 1. Import the RedisCrawlSpider class from the scrapy_redis module; it is used to implement automated, rule-based crawling.
  • 2. Change the parent class of XiaoshuoSpider to RedisCrawlSpider.
  • 3. Let Redis allocate the URLs: comment out the original start_urls, add a redis_key class attribute, and set it to 'XiaoShuo'.
  • 4. rules defines the link-extraction rules for matching the novel links on each page.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from scrapy_redis.spiders import RedisCrawlSpider
from bs4 import BeautifulSoup
from ..items import XSItem
import logging

logger = logging.getLogger(__name__)
class XiaoshuoSpider(RedisCrawlSpider):
    name = "XiaoShuo"
    # allowed_domains = ["aaa.com"]
    # start_urls = ["https://www.17k.com/all/book/2_0_0_0_3_0_1_0_2.html"]
    redis_key = 'XiaoShuo' # This is the key name in the redis database.
    rules = (
        Rule(LinkExtractor(allow=r'//www.17k.com/book/\d+.html'),
             callback="parse_item",
             follow=True),
    )

    def parse_item(self, response):
        pass

2.3 Write parse_item() and parse_detail() methods

    def parse_item(self, response):
        item = XSItem()
        soup = BeautifulSoup(response.text, 'lxml')
        # <a> directly under the element with both the "Info" and "Sign" classes (the book title block)
        a_tags = soup.select('.Info.Sign > a')
        for a_tag in a_tags:
            url = f'https://www.17k.com{a_tag.attrs["href"]}'
            item['title'] = a_tag.text
            # yield (not return) so that every matched book gets a detail request
            yield scrapy.Request(url=url,
                                 callback=self.parse_detail,
                                 meta={'item': item})

    def parse_detail(self, response):
        item = response.meta['item']
        soup = BeautifulSoup(response.text, 'lxml')
        a_tags = soup.select('dl[class="Volume"] > dd > a')
        chapters = [] # Chapter name
        urls = [] # url
        for a_tag in a_tags:
            url = "https://www.17k.com" + a_tag.attrs["href"]
            urls.append(url)
            chapters.append(a_tag.attrs['title'])
        item['url'] = urls
        item['chapter'] = chapters
        logger.warning(item["title"])
        yield item

2.4 Modify configuration file

# Necessary configuration for using scrapy_redis
# 1. Use scrapy_redis as the scheduler
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# 2. Use the deduplication class of scrapy_redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# 3. Enable the pipeline that stores the scraped items in Redis
ITEM_PIPELINES = {
   "scrapy_redis.pipelines.RedisPipeline": 300,
}
# 4. Specify the IP and port used to connect to Redis
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
# Whether to keep the Redis queue so crawling can resume after a pause (optional)
SCHEDULER_PERSIST = False
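
Before starting either project, it is worth confirming that a Redis server is actually reachable at the host and port configured above. A minimal check, assuming the redis-py package is installed:

import redis

# Same host/port as REDIS_HOST / REDIS_PORT in settings.py
r = redis.Redis(host="127.0.0.1", port=6379)

try:
    r.ping()  # raises ConnectionError if no Redis server is listening
    print("Redis is reachable")
except redis.exceptions.ConnectionError as exc:
    print("Cannot connect to Redis:", exc)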

3. Create a second identical project

Note: Since we only have one computer, two projects are used here to simulate two machines. Of course, it is also fine to crawl with just a single project.

>scrapy startproject scrapy_redis2
>cd scrapy_redis2
>scrapy genspider -t crawl XiaoShuo aaa.com

1. Modify the settings file with the same changes as in the first project:
UA, robots protocol, and the necessary scrapy_redis configuration
(Don't copy everything over directly; just change what needs to be changed.)

2. In the items file, add the same item class as in the first project.

3. Write the crawler file in the same way.

4. Create startup file

Both projects need a startup file, and both startup files need to be run:

from scrapy import cmdline
cmdline.execute('scrapy crawl XiaoShuo -o xs.json'.split())

Log output like the following indicates that the spider has successfully connected to the Redis database:

Start the crawl by letting the Redis database dispatch URLs to the crawler projects:

Syntax: lpush <key name> <target URL>

Key name: the same as the redis_key set in the crawler file

Target URL: pushed page by page, i.e. the URL of each list page
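
For example, with the redis_key 'XiaoShuo' set in the crawler file and the completed/free list page from section 1 as the target (the same URL that was commented out as start_urls in the spider), the push looks like this in redis-cli:

lpush XiaoShuo https://www.17k.com/all/book/2_0_0_0_3_0_1_0_2.html

One of the waiting spiders pops this URL from Redis, and the requests it extracts are then scheduled through the shared Redis queue, so both projects take part in the crawl.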

Crawling log display:

Storage result display:
