I briefly covered the Scrapy framework in an earlier post; today we'll use it to crawl some good-looking desktop wallpapers. I hope this article helps, and I also hope everyone crawls in a civilized way: don't crawl in large quantities, and don't use the data in illegal ways!
01
–
Project preparation
1.1 Create project
Create a folder, then open a terminal (cmd) in it and run the command:
scrapy startproject deskpicture
#deskpicture is the name of the crawler project
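The command generates the standard Scrapy project skeleton (abbreviated):

deskpicture/
    scrapy.cfg            # deploy/config file
    deskpicture/
        __init__.py
        items.py          # item definitions go here
        middlewares.py
        pipelines.py      # item pipelines go here
        settings.py       # project settings
        spiders/          # spider files live here
            __init__.py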
1.2 Write the project
Step 1: Open the project in PyCharm
Step 2: Generate the crawler file
In the terminal, switch to the project directory. (I am already in the project directory, so I can create the crawler file directly.)
scrapy genspider pretty_picture desk.zol.com.cn
# pretty_picture is the spider name; desk.zol.com.cn is the domain the spider may crawl
Step 3: Write the crawler file
First, the initial definitions:
import scrapy


class PrettyPictureSpider(scrapy.Spider):
    # Spider name
    name = "pretty_picture"
    # Domains the spider is allowed to crawl; requests outside them are filtered out
    allowed_domains = ["desk.zol.com.cn"]
    # URLs the spider starts from
    start_urls = ["https://desk.zol.com.cn/meinv/"]
Next, define the content to be crawled.
First analyze the web page to see where the content we want lives: press F12 to open the developer tools and switch to the Elements panel, or right-click an image and choose Inspect.
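Before writing any spider code, scrapy shell is a quick way to test candidate XPath expressions against the live page (using this tutorial's start URL):

scrapy shell https://desk.zol.com.cn/meinv/
# At the interactive prompt, try the selector we settle on below:
>>> response.xpath('//div[@class="main"]/ul/li')

If the list comes back non-empty, the selector matches.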
We find that the image resources all sit in the li list under the div with class="main", so we select those li elements and traverse them one by one. For each li we want two things: the image's name and the image's address. So we first create the item class; open items.py:
import scrapy


class DeskpictureItem(scrapy.Item):
    picture = scrapy.Field()
    name = scrapy.Field()
Now we can officially start extracting the resources in the spider's parse method:
    def parse(self, response):
        li_list = response.xpath('//div[@class="main"]/ul/li')
        for li in li_list:
            picture = li.xpath('./a/img/@src').extract_first()
            name = li.xpath('./a/img/@alt').extract_first()
            print(picture, name)
            item = DeskpictureItem()
            item['picture'] = picture
            item['name'] = name
            yield item  # Hand the item over to the pipelines
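Once the settings below are in place, you can sanity-check parse on its own by exporting the yielded items to a file (check.json is just an illustrative filename):

scrapy crawl pretty_picture -o check.json

Each item then shows up as one JSON object, so you can confirm the names and URLs look right before writing a pipeline.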
After getting the item, we customize the pipeline in pipelines.py:
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline
import scrapy


class DeskpicturePipeline:
    def process_item(self, item, spider):
        return item


class DownloadPicture(ImagesPipeline):
    # Subclass the built-in image pipeline to download the pictures

    def get_media_requests(self, item, info):
        # Request each image URL for download
        yield scrapy.Request(item['picture'])

    def file_path(self, request, response=None, info=None, *, item=None):
        # Decide where the image is saved (relative to IMAGES_STORE)
        file_name = item['name']
        return f'image/{file_name}.jpg'

    def item_completed(self, results, item, info):
        # results holds the download details, i.e. whether each download succeeded;
        # the item must be returned so later pipelines still receive it
        print(results)
        return item

After defining the pipeline, we configure settings.py:

BOT_NAME = "deskpicture"

SPIDER_MODULES = ["deskpicture.spiders"]
NEWSPIDER_MODULE = "deskpicture.spiders"

# Whether to obey robots.txt (the "gentleman's agreement"); usually False for this kind of crawl
ROBOTSTXT_OBEY = False

# Log level; ERROR also works
LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself via the user agent, otherwise the crawl may fail
USER_AGENT = "deskpicture (+http://www.yourdomain.com)"

# Enable the item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "deskpicture.pipelines.DeskpicturePipeline": 300,
    "deskpicture.pipelines.DownloadPicture": 301,
}

# Where downloaded images are stored
IMAGES_STORE = "./DownloadPicture"

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
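One refinement worth mentioning before moving on (a sketch following the item_completed pattern in the Scrapy docs, not part of the original tutorial): drop items whose image failed to download instead of just printing the results:

from scrapy.exceptions import DropItem

    # Replacement for item_completed inside DownloadPicture:
    def item_completed(self, results, item, info):
        # Each entry in results is a (success, detail) tuple
        image_paths = [detail['path'] for ok, detail in results if ok]
        if not image_paths:
            raise DropItem(f"Image download failed for {item['name']}")
        return item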
settings.py contains many more options; you can explore them yourself.
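In the spirit of civilized crawling, a few throttling settings are worth knowing (illustrative values, not from the original post):

DOWNLOAD_DELAY = 1           # Wait 1 second between requests
CONCURRENT_REQUESTS = 8      # Lower the default concurrency (16)
AUTOTHROTTLE_ENABLED = True  # Let Scrapy adapt its speed to the server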
Okay, the main work has been done
02
–
Run the crawler
Open the terminal and run the command below (note that you must be inside the project directory, so it is usually easiest to use PyCharm's built-in terminal):
scrapy crawl pretty_picture
# pretty_picture is the spider name
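If you prefer a Run button to the terminal, a small runner script at the project root also works (a sketch; run.py is a hypothetical filename):

# run.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("pretty_picture")  # The spider is looked up by name via SPIDER_MODULES
process.start()                  # Blocks until the crawl finishes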
If you want to crawl multiple pages of images, you can define one more method in the spider that requests the next page.
For example, locating the next-page link through the Elements panel:
    def next_page(self, response):
        next_url = response.xpath('//div[@class="page"]/a[@id="pageNext"]/@href').extract_first()
        if next_url:
            yield scrapy.Request(
                url=response.urljoin(next_url),  # The href may be relative, so join it to the page URL
                callback=self.parse  # Callback: repeat parse() on the next page
            )

Note that next_page only runs if parse actually yields its requests; the complete file below wires this up.
The complete crawler file code is shown below
import scrapy
from deskpicture.items import DeskpictureItem


class PrettyPictureSpider(scrapy.Spider):
    name = 'pretty_picture'
    allowed_domains = ['desk.zol.com.cn']
    start_urls = ['https://desk.zol.com.cn/meinv/']

    def parse(self, response):
        li_list = response.xpath('//div[@class="main"]/ul/li')
        for li in li_list:
            picture = li.xpath('./a/img/@src').extract_first()
            name = li.xpath('./a/img/@alt').extract_first()
            print(picture, name)
            item = DeskpictureItem()
            item['picture'] = picture
            item['name'] = name
            yield item
        # Keep crawling: follow the next-page link, if any
        yield from self.next_page(response)

    def next_page(self, response):
        next_url = response.xpath('//div[@class="page"]/a[@id="pageNext"]/@href').extract_first()
        if next_url:
            yield scrapy.Request(
                url=response.urljoin(next_url),
                callback=self.parse
            )
That's all for this issue. These notes are only for your reference and study (and for my own review). Please be a civilized crawler: crawl as little as possible, and don't interfere with other people's operations or their servers!
Historical Article Index
Introduction to Python-Scrapy Framework