Note: these are only my personal case notes, not official guidance!
1. Requirements Description, Web Page Analysis, and Preparation
1.1 Requirements Description
Website: [Boys_Novel Classification_Complete Novel Classification_Free Novel Classification-17K Novel Network]
Requirements: open the homepage -> select "Category" -> select "Completed", "Only view free".
Get each book's name, chapter names, and chapter URLs.
This is temporarily called the List page:
Book introduction page:
Book Page:
1.2 Web Page Analysis
Regular expression for book links: r'//www.17k.com/book/\d+\.html'
Book information location in the list:
Book title and url location on the introduction page:
Chapter information location in the book:
Link to a specific novel: f'https://www.17k.com{a_tag.attrs["href"]}'. Chapter link: f'https://www.17k.com{href}' (href is the chapter's href attribute).
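The pattern and the URL-building f-string above can be checked with a small sketch (the href value below is hypothetical, for illustration only):

```python
import re

# The link pattern from the analysis above; note "\d+" with no spaces
# and the escaped dot, so it matches e.g. /book/12345.html
book_pattern = re.compile(r'//www.17k.com/book/\d+\.html')

def build_url(href):
    """Turn a site-relative href into an absolute 17k URL."""
    return f'https://www.17k.com{href}'

# Hypothetical href, only to illustrate the matching
url = build_url('/book/12345.html')
print(book_pattern.search(url) is not None)  # True for book-detail URLs
```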
1.3 Prepare a new crawler and modify settings.py
>scrapy startproject scrapy_redis1
>scrapy genspider -t crawl XiaoShuo aaa.com
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
2. Preparation of crawler project
2.1 Set the data to be stored in items.py
class XSItem(scrapy.Item):
    # Book name
    title = scrapy.Field()
    # Chapter names
    chapter = scrapy.Field()
    # Chapter links
    url = scrapy.Field()
2.2 Change XiaoShuo.py template
Modify the generated spider module as follows:
- 1. Import the RedisCrawlSpider class from the scrapy_redis module, which implements distributed crawling.
- 2. Change the parent class of XiaoshuoSpider to RedisCrawlSpider.
- 3. Let Redis hand out URLs: comment out the original start_urls and add a redis_key class attribute with the value 'XiaoShuo'.
- 4. rules holds the matching rules for the novel links on each page.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from bs4 import BeautifulSoup
from ..items import XSItem
import logging

logger = logging.getLogger(__name__)


class XiaoshuoSpider(RedisCrawlSpider):
    name = "XiaoShuo"
    # allowed_domains = ["aaa.com"]
    # start_urls = ["https://www.17k.com/all/book/2_0_0_0_3_0_1_0_2.html"]
    redis_key = 'XiaoShuo'  # key name in the Redis database
    rules = (
        Rule(LinkExtractor(allow=r'//www.17k.com/book/\d+\.html'),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        pass
2.3 Write parse_item() and parse_detail() methods
def parse_item(self, response):
    item = XSItem()
    soup = BeautifulSoup(response.text, 'lxml')
    a_tags = soup.select('.Info Sign > a')
    for a_tag in a_tags:
        url = f'https://www.17k.com{a_tag.attrs["href"]}'
        item['title'] = a_tag.text
        # yield (not return), so every matched link is requested
        yield scrapy.Request(url=url, callback=self.parse_detail, meta={'item': item})

def parse_detail(self, response):
    item = response.meta['item']
    soup = BeautifulSoup(response.text, 'lxml')
    a_tags = soup.select('dl[class="Volume"] > dd > a')
    chapters = []  # chapter names
    urls = []      # chapter URLs
    for a_tag in a_tags:
        url = "https://www.17k.com" + a_tag.attrs["href"]
        urls.append(url)
        chapters.append(a_tag.attrs['title'])
    item['url'] = urls
    item['chapter'] = chapters
    logger.warning(item["title"])
    yield item
2.4 Modify configuration file
# Necessary configuration for using scrapy_redis
# 1. Use scrapy_redis as the scheduler
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# 2. Use the deduplication class of scrapy_redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# 3. Open the pipeline that writes items to Redis
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}
# 4. IP and port used to connect to Redis
REDIS_HOST = "127.0.0.1"
REDIS_PORT = "6379"
# Whether to keep the queue after pausing, so crawling can resume (optional)
SCHEDULER_PERSIST = False
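Before launching the spiders, it can help to confirm that the server at REDIS_HOST/REDIS_PORT is actually reachable. A minimal stdlib-only sketch (it only tests the TCP connection, not the Redis protocol):

```python
import socket

def redis_reachable(host="127.0.0.1", port=6379, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Matches REDIS_HOST / REDIS_PORT from settings.py
    print("Redis reachable:", redis_reachable("127.0.0.1", 6379))
```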
3. Create a second identical project
Note: since we only have one computer, two projects are used here to simulate two machines. Of course, crawling with a single project also works.
>scrapy startproject scrapy_redis2
>cd scrapy_redis2
>scrapy genspider -t crawl XiaoShuo aaa.com
- 1. Modify settings.py with the same changes as in the first project: the UA, the robots protocol, and the scrapy_redis configuration. (Don't copy the whole file; change only what needs changing.)
- 2. In items.py, add the same item class as in the first project.
- 3. Write the spider file in the same way.
4. Create startup file
Each project needs its own startup file; run both.
from scrapy import cmdline

cmdline.execute('scrapy crawl XiaoShuo -o xs.json'.split())
A log message like this indicates that the spider has connected to the database successfully:
Push a start URL into Redis so the database can schedule the crawler projects:
Syntax: lpush <key name> <target URL>
Key name: the same as the redis_key set in the spider file.
Target URL: one per listing page, i.e. the URL of each page to crawl.
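Seeding the queue from redis-cli might look like this (a sketch, assuming Redis runs locally; the URL is the listing page commented out in the spider):

```shell
# Key name matches redis_key = 'XiaoShuo' in the spider file
redis-cli lpush XiaoShuo "https://www.17k.com/all/book/2_0_0_0_3_0_1_0_2.html"
```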
Crawling log display:
Storage result display: