Amazon Image Downloader: Use the Scrapy library to complete image download tasks


Overview

This article introduces how to use Python's Scrapy library to write a simple crawler program that downloads product images from the Amazon website. Scrapy is a powerful crawling framework that provides many convenient features, such as selectors, item pipelines, middlewares, and proxy support. This article focuses on how to use Scrapy's image pipeline and proxy middleware to improve the efficiency and stability of the crawler.

Text

1. Create Scrapy project
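
If Scrapy is not already installed in your Python environment, it can usually be installed with pip first:

pip install scrapy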

First, we need to create a Scrapy project named amazon_image_downloader. Enter the following command at the command line:

scrapy startproject amazon_image_downloader

This will generate a folder named amazon_image_downloader in the current directory containing the following files and subfolders:

amazon_image_downloader/
    scrapy.cfg                    # Configuration file
    amazon_image_downloader/      # Python module of the project
        __init__.py
        items.py                  # Item definitions for the project
        middlewares.py            # Middlewares for the project
        pipelines.py              # Pipelines for the project
        settings.py               # Project settings file
        spiders/                  # Directory for spider code
            __init__.py
2. Define Item class

Next, we need to define an Item class in the items.py file to store the data we want to crawl. In this example, we only need to crawl the URL and name of the product image, so we can define it as follows:

import scrapy


class AmazonImageItem(scrapy.Item):
    # Define an Item class to store the URL and name of the image
    image_urls = scrapy.Field()     # List of URLs for the images
    image_name = scrapy.Field()     # The name of the image
    image_results = scrapy.Field()  # Filled in by the image pipeline with download results
3. Write crawler code

Then, we need to create a file called amazon_spider.py in the spiders folder and write our crawler code. We can use the CrawlSpider class provided by Scrapy to automatically follow links. We need to specify the following:

  • name: The name of the crawler, used when running the crawler.
  • allowed_domains: A list of domain names that may be crawled, which keeps the crawler from wandering onto other websites.
  • start_urls: A list of starting URLs from which the crawler will start crawling data.
  • rules: A list of rules that specify how to extract links from responses and follow them.
  • parse_item: The parsing callback, used to extract data from a response and generate an Item object.

We can refer to the structure and URL rules of the Amazon website and write the following code:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from amazon_image_downloader.items import AmazonImageItem


class AmazonSpider(CrawlSpider):
    # A CrawlSpider that automatically follows links
    name = 'amazon_spider'  # The name of the crawler
    allowed_domains = ['amazon.com']  # List of domains allowed to be crawled
    start_urls = ['https://www.amazon.com/s?k=book']  # List of starting URLs

    rules = (
        # Rules that specify how to extract links from responses and follow them
        Rule(LinkExtractor(allow=r'/s\?k=book&page=\d+'), follow=True),  # Match product list pages and follow them
        Rule(LinkExtractor(allow=r'/dp/\w+'), callback='parse_item'),  # Match product detail pages and call parse_item
    )

    def parse_item(self, response):
        # Parsing callback: extract data from the response and generate an Item object
        item = AmazonImageItem()  # Create an Item object
        # Extract the image URL from the response and store it in the image_urls field
        item['image_urls'] = [response.xpath('//img[@id="imgBlkFront"]/@src').get()]
        # Extract the product title from the response and store it in the image_name field
        item['image_name'] = response.xpath('//span[@id="productTitle"]/text()').get().strip()
        return item  # Return the Item object
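
Before moving on to configuration, the XPath selectors used above can be checked interactively with Scrapy's shell. The product URL below is only a placeholder, and Amazon may block plain requests that arrive without suitable headers or proxies, so treat this as a quick sanity check rather than a guaranteed result:

scrapy shell 'https://www.amazon.com/dp/XXXXXXXXXX'
>>> response.xpath('//span[@id="productTitle"]/text()').get()
>>> response.xpath('//img[@id="imgBlkFront"]/@src').get()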
4. Configure image pipeline and proxy middleware

Finally, we need to configure the image pipeline and proxy middleware in the settings.py file to enable image downloading and proxy usage. We need to modify the following settings:

  • ITEM_PIPELINES: A dictionary of pipeline classes enabled in the project and their priorities. We need to enable the ImagesPipeline class provided by Scrapy and specify a suitable priority, such as 300.
  • IMAGES_STORE: The local storage path used by the image pipeline. We can specify a folder named images to store downloaded images.
  • IMAGES_URLS_FIELD: Item field used by the image pipeline. The value of this field is a list containing image URLs. We need to specify image_urls, consistent with the Item class we defined.
  • IMAGES_RESULT_FIELD: Item field used by the image pipeline. The value of this field is a list containing information about each downloaded image. We can specify it as image_results to store each image's URL, path, checksum, and other download information (a sketch of this field's contents follows this list).
  • DOWNLOADER_MIDDLEWARES: A dictionary of downloader middleware classes enabled in the project and their priorities. We need to enable the HttpProxyMiddleware class provided by Scrapy and specify an appropriate priority, such as 100.
  • PROXY_POOL: A custom setting that holds the proxy pool, i.e. the proxy addresses, ports, usernames, and passwords. We can fill it with the domain name, port, username, and password provided by the Yiniu Cloud crawler proxy service; a minimal middleware sketch that applies these proxies is shown after the settings file below.
  • CONCURRENT_REQUESTS: The maximum number of concurrent requests performed by the Scrapy downloader. We can set an appropriate value, such as 16, based on the quality of our network and proxies.
  • CONCURRENT_REQUESTS_PER_DOMAIN: The maximum number of concurrent requests made to a single website. We can set an appropriate value, such as 8, based on the anti-crawling strategy of the target website.
  • DOWNLOAD_DELAY: The time to wait between downloading two pages, which limits the crawl speed and reduces the load on the server. We can set an appropriate value, such as 0.5 seconds, based on the anti-crawling strategy of the target website.
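
For reference, once the ImagesPipeline has finished downloading the files for an item, it fills the field named by IMAGES_RESULT_FIELD (declared as image_results on our Item class) with one dict per image. A rough sketch of what that field might contain, with purely illustrative values:

# Illustrative values only; the real contents are produced by the pipeline.
item['image_results'] = [
    {
        'url': 'https://.../example.jpg',       # the original image URL
        'path': 'full/<sha1 of the url>.jpg',   # storage path relative to IMAGES_STORE
        'checksum': '<md5 of the image file>',  # checksum of the downloaded file
        'status': 'downloaded',                 # reported by recent Scrapy versions
    },
]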

The modified settings.py file is as follows:

# Scrapy settings for amazon_image_downloader project

BOT_NAME = 'amazon_image_downloader'

SPIDER_MODULES = ['amazon_image_downloader.spiders']
NEWSPIDER_MODULE = 'amazon_image_downloader.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'amazon_image_downloader (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'scrapy.pipelines.images.ImagesPipeline': 300, # Enable the image pipeline and specify a priority of 300
}

# Configure images pipeline
# See https://docs.scrapy.org/en/latest/topics/images.html
IMAGES_STORE = 'images' # Specify the local storage path used by the image pipeline as the images folder
IMAGES_URLS_FIELD = 'image_urls' # Specify the Item field used by the image pipeline as image_urls
IMAGES_RESULT_FIELD = 'image_results' # Specify the Item field used by the image pipeline as image_results

# Configure downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 100, # Enable proxy middleware and specify a priority of 100
}

# Configure proxy pool
# Yiniu Cloud crawler proxy https://www.16yun.cn
PROXY_POOL = [
    'http://username:password@domain:port', # Use the domain name, port, username, and password provided by Yiniu Cloud crawler agent
    'http://username:password@domain:port',
    ...
]

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item exporters
#

# Configure concurrent requests and download delay
# See https://docs.scrapy.org/en/latest/topics/settings.html
CONCURRENT_REQUESTS = 16  # Maximum number of concurrent requests performed by the Scrapy downloader
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # Maximum number of concurrent requests to a single website
DOWNLOAD_DELAY = 0.5  # Wait 0.5 seconds between downloading two pages
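
One caveat: PROXY_POOL is not a setting that Scrapy reads on its own, and the built-in HttpProxyMiddleware only picks up a proxy from environment variables or from each request's meta['proxy']. A small custom downloader middleware, added to middlewares.py, can bridge the gap by choosing a proxy from PROXY_POOL for every request. The sketch below is one possible implementation under these assumptions; the class name and priority are arbitrary:

import random


class RandomProxyMiddleware:
    # Pick a random proxy from the custom PROXY_POOL setting for every outgoing request.

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # Read the PROXY_POOL list defined in settings.py
        return cls(crawler.settings.getlist('PROXY_POOL'))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)

To use it, register the class in DOWNLOADER_MIDDLEWARES with a priority that runs before the built-in proxy middleware, for example:

DOWNLOADER_MIDDLEWARES = {
    'amazon_image_downloader.middlewares.RandomProxyMiddleware': 90,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 100,
}

With everything configured, the crawler can be started from the project root, and the downloaded images will appear under the folder specified by IMAGES_STORE:

scrapy crawl amazon_spider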

Conclusion

This article introduced how to use Python's Scrapy library to write a simple crawler program that downloads product images from the Amazon website. We used Scrapy's image pipeline and proxy middleware to improve the efficiency and stability of the crawler, and tuned its concurrency and download-delay settings to balance collection speed against the load on the target site. This crawler program is just an example; you can modify and optimize it according to your specific needs. Thank you for reading.