I briefly covered the Scrapy framework in an earlier post; today we'll use it to crawl some good-looking desktop wallpapers. I hope this article helps, and I also hope everyone crawls in a civilized way: don't crawl in large quantities, and don't use the data in illegal ways!
01
–
Project preparation
1.1 Create project
Create a folder, then open a terminal (cmd) in it and run the command:
scrapy startproject deskpicture
#deskpicture is the name of the crawler project
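The command generates the standard Scrapy project skeleton (abbreviated):

deskpicture/
    scrapy.cfg            # deploy/config file
    deskpicture/
        __init__.py
        items.py          # item definitions go here
        middlewares.py
        pipelines.py      # item pipelines go here
        settings.py       # project settings
        spiders/          # spider files live here
            __init__.py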
1.2 Write the project
Step 1: Open the project in PyCharm
Step 2: Generate the crawler file
In the terminal, switch to the project directory. (I am already in the project directory, so I can create the crawler file directly.)
scrapy genspider pretty_picture desk.zol.com.cn
# pretty_picture is the spider name; desk.zol.com.cn is the domain the spider may crawl
Step 3: Write the crawler file
First, the initial definitions:
import scrapy


class PrettyPictureSpider(scrapy.Spider):
    # Spider name
    name = "pretty_picture"
    # Domains the spider is allowed to crawl; requests outside them are filtered out
    allowed_domains = ["desk.zol.com.cn"]
    # URLs the spider starts from
    start_urls = ["https://desk.zol.com.cn/meinv/"]
Next, define the content to be crawled.
First analyze the web page to see where the content we want lives: press F12 to open the developer tools and switch to the Elements panel, or right-click an image and choose Inspect.
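Before writing any spider code, scrapy shell is a quick way to test candidate XPath expressions against the live page (using this tutorial's start URL):

scrapy shell https://desk.zol.com.cn/meinv/
# At the interactive prompt, try the selector we settle on below:
>>> response.xpath('//div[@class="main"]/ul/li')

If the list comes back non-empty, the selector matches.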
We find that the image resources all sit in the li list under the div with class="main", so we select those li elements and traverse them one by one. For each li we want two things: the image's name and the image's address. So we first create the item class; open items.py:
import scrapy


class DeskpictureItem(scrapy.Item):
    picture = scrapy.Field()
    name = scrapy.Field()
Now we can officially start extracting the resources in the spider's parse method:
    def parse(self, response):
        li_list = response.xpath('//div[@class="main"]/ul/li')
        for li in li_list:
            picture = li.xpath('./a/img/@src').extract_first()
            name = li.xpath('./a/img/@alt').extract_first()
            print(picture, name)
            item = DeskpictureItem()
            item['picture'] = picture
            item['name'] = name
            yield item  # Hand the item over to the pipelines
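Once the settings below are in place, you can sanity-check parse on its own by exporting the yielded items to a file (check.json is just an illustrative filename):

scrapy crawl pretty_picture -o check.json

Each item then shows up as one JSON object, so you can confirm the names and URLs look right before writing a pipeline.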
After getting the item, we customize the pipeline in pipelines.py:
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline
import scrapy


class DeskpicturePipeline:
    def process_item(self, item, spider):
        return item


class DownloadPicture(ImagesPipeline):
    # Subclass the built-in image pipeline to download the pictures

    def get_media_requests(self, item, info):
        # Request each image URL for download
        yield scrapy.Request(item['picture'])

    def file_path(self, request, response=None, info=None, *, item=None):
        # Decide where the image is saved (relative to IMAGES_STORE)
        file_name = item['name']
        return f'image/{file_name}.jpg'

    def item_completed(self, results, item, info):
        # results holds the download details, i.e. whether each download succeeded;
        # the item must be returned so later pipelines still receive it
        print(results)
        return item

After defining the pipeline, we configure settings.py:

BOT_NAME = "deskpicture"

SPIDER_MODULES = ["deskpicture.spiders"]
NEWSPIDER_MODULE = "deskpicture.spiders"

# Whether to obey robots.txt (the "gentleman's agreement"); usually False for this kind of crawl
ROBOTSTXT_OBEY = False

# Log level; ERROR also works
LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself via the user agent, otherwise the crawl may fail
USER_AGENT = "deskpicture (+http://www.yourdomain.com)"

# Enable the item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "deskpicture.pipelines.DeskpicturePipeline": 300,
    "deskpicture.pipelines.DownloadPicture": 301,
}

# Where downloaded images are stored
IMAGES_STORE = "./DownloadPicture"

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
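One refinement worth mentioning before moving on (a sketch following the item_completed pattern in the Scrapy docs, not part of the original tutorial): drop items whose image failed to download instead of just printing the results:

from scrapy.exceptions import DropItem

    # Replacement for item_completed inside DownloadPicture:
    def item_completed(self, results, item, info):
        # Each entry in results is a (success, detail) tuple
        image_paths = [detail['path'] for ok, detail in results if ok]
        if not image_paths:
            raise DropItem(f"Image download failed for {item['name']}")
        return item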
settings.py contains many more options; you can explore them yourself.
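In the spirit of civilized crawling, a few throttling settings are worth knowing (illustrative values, not from the original post):

DOWNLOAD_DELAY = 1           # Wait 1 second between requests
CONCURRENT_REQUESTS = 8      # Lower the default concurrency (16)
AUTOTHROTTLE_ENABLED = True  # Let Scrapy adapt its speed to the server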
Okay, the main work has been done
02
–
Run the crawler
Open the terminal and run the command below (note that you must be inside the project directory, so it is usually easiest to use PyCharm's built-in terminal):
scrapy crawl pretty_picture
# pretty_picture is the spider name
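If you prefer a Run button to the terminal, a small runner script at the project root also works (a sketch; run.py is a hypothetical filename):

# run.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("pretty_picture")  # The spider is looked up by name via SPIDER_MODULES
process.start()                  # Blocks until the crawl finishes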
If you want to crawl multiple pages of images, you can define one more method in the spider that requests the next page.
For example, locating the next-page link through the Elements panel:
    def next_page(self, response):
        next_url = response.xpath('//div[@class="page"]/a[@id="pageNext"]/@href').extract_first()
        if next_url:
            yield scrapy.Request(
                url=response.urljoin(next_url),  # The href may be relative, so join it to the page URL
                callback=self.parse  # Callback: repeat parse() on the next page
            )

Note that next_page only runs if parse actually yields its requests; the complete file below wires this up.
The complete crawler file code is shown below
import scrapy
from deskpicture.items import DeskpictureItem


class PrettyPictureSpider(scrapy.Spider):
    name = 'pretty_picture'
    allowed_domains = ['desk.zol.com.cn']
    start_urls = ['https://desk.zol.com.cn/meinv/']

    def parse(self, response):
        li_list = response.xpath('//div[@class="main"]/ul/li')
        for li in li_list:
            picture = li.xpath('./a/img/@src').extract_first()
            name = li.xpath('./a/img/@alt').extract_first()
            print(picture, name)
            item = DeskpictureItem()
            item['picture'] = picture
            item['name'] = name
            yield item
        # Keep crawling: follow the next-page link, if any
        yield from self.next_page(response)

    def next_page(self, response):
        next_url = response.xpath('//div[@class="page"]/a[@id="pageNext"]/@href').extract_first()
        if next_url:
            yield scrapy.Request(
                url=response.urljoin(next_url),
                callback=self.parse
            )
That's all for this issue. These notes are only for your reference and study (and for my own review). Please be a civilized crawler: crawl as little as possible, and don't interfere with other people's operations or their servers!
Historical Article Index
Introduction to Python-Scrapy Framework