Crawling a series of pictures from the webmaster materials website with Python

Use Python to crawl images (customize the request path, match resources)

Article directory

    • 1. Learning purpose:
    • 2. Code part
      • 1. Create a customized request object
      • 2. Locate the target resources
    • 3. Writing code
    • 4. Summary and shortcomings

1. Learning purpose:

  1. Learn how to customize Python requests based on the website's links
  2. Learn to use XPath to find the name, path, and other attributes of the target images

Picture material link (the "sexy beauty" category of the picture section on the webmaster materials website):
https://sc.chinaz.com/tupian/xingganmeinvtupian.html

2. Code part

First, clarify the overall idea of the code: we want to download image resources from multiple pages. The first step is:

1. Create a customized request object

Because each page has a different request link, the code must generate the link for each page automatically. First, observe the pattern in the links.

Open the first page, press F12, and look at the request in the Network tab.

The request link for the first page is:

https://sc.chinaz.com/tupian/xingganmeinvtupian.html

Looking at the second and third pages reveals the pattern. The first page ends directly with xingganmeinvtupian.html. From the second page on, **_n** is added after the file name for page n, so page 2 ends with xingganmeinvtupian_2.html.
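The rule observed above can be sketched as a small function. This is only an illustration: the name build_page_url and the BASE_URL constant are hypothetical, and the sketch assumes every page after the first follows the _n pattern.

```python
# Common prefix of every page link (taken from the first-page link, without ".html")
BASE_URL = 'https://sc.chinaz.com/tupian/xingganmeinvtupian'


def build_page_url(page):
    """Return the link for the given 1-based page number."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    if page == 1:
        # The first page has no numeric suffix
        return BASE_URL + '.html'
    # Assumption: page n (n > 1) appends _n before the .html extension
    return BASE_URL + '_' + str(page) + '.html'


for page in range(1, 4):
    print(build_page_url(page))
```

Running it prints the links for pages 1 to 3, which can be compared against the links shown in the browser.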


2. Locate the target resources

After sending the customized request, the response must be parsed to obtain the download path, name, and other details of each image resource. The code is as follows:
(Readers whose browser does not have the XPath Helper plug-in must install it first. For the installation steps, please refer to articles by other authors; I have not written one yet.)

# Get the names of the images
name_list = tree.xpath('//div[@class="item"]/img/@alt')
print(len(name_list))
# Get the resource paths of the images
src_list = tree.xpath('//div[@class="item"]/img/@data-original')
print(len(src_list))

Beginners often find XPath expressions hard to write. With the browser plug-in XPath Helper, you can locate the target resource easily and see at once whether your expression is correct. A demonstration follows.

In the Network tab, click the response to view the returned data. Scroll down and you will see the returned image entries: each div tag contains an img tag, and the img tag has an alt attribute and a data-original attribute. On inspection, these two attributes hold the image name and the image resource path, so these are the two values we want.

Open XPath Helper (there is a keyboard shortcut starting with Ctrl + Shift, but opening it manually is more reliable)


Now enter the matching string on the right to match the corresponding resource. As noted above, the img tag sits inside a div tag and has the alt and data-original attributes. So we build the expression step by step.

//div selects every div on the page; 124 results are displayed.

On inspection, the divs that hold the pictures we want have the class item, so we filter for divs whose class is item: //div[@class="item"]

Next we want the img child of those divs, so we append /img, which matches every img that is a direct child of such a div. There are 40 results in total.

Finally we need the alt and data-original attributes of the img. Attributes are selected with /@.

The complete expression therefore means: take each div whose class is item, then take the alt attribute of the img child under that div. The data-original attribute is obtained the same way.
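The narrowing steps above can be tried offline on a small hand-written page. The sketch below is an illustration only: the SAMPLE markup is invented, so its counts differ from the 124 and 40 seen on the real page, and it uses the standard library's xml.etree.ElementTree as a stand-in for lxml. ElementTree supports only a subset of XPath (it has no /@attr step), so the attributes are read with .get() instead.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample that imitates the structure described above
SAMPLE = """
<html><body>
  <div class="header"><img alt="logo" src="//example.com/logo.png" /></div>
  <div class="item"><img alt="picture one" data-original="//example.com/img/1.jpg" /></div>
  <div class="item"><img alt="picture two" data-original="//example.com/img/2.jpg" /></div>
  <div class="footer">no image here</div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Step 1: every div on the page
all_divs = root.findall(".//div")
# Step 2: only the divs whose class is item
item_divs = root.findall(".//div[@class='item']")
# Step 3: the img children of those divs
item_imgs = root.findall(".//div[@class='item']/img")
# Step 4: read the two attributes (lxml would use /@alt and /@data-original)
names = [img.get("alt") for img in item_imgs]
paths = [img.get("data-original") for img in item_imgs]

print(len(all_divs), len(item_divs), len(item_imgs))
print(names)
print(paths)
```

With lxml, as used in this article, steps 3 and 4 collapse into the single expressions //div[@class="item"]/img/@alt and //div[@class="item"]/img/@data-original.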

3. Writing code

With the request path and resource matching settled, we can write the code. I will not explain it line by line here. You can study it through the comments, or leave a question in the comment area.

Writing one function per task keeps the code well organized, follows common development conventions, and makes it easier to maintain.

import urllib.request
from lxml import etree

def create_request(page):
    # Build the request path: page 1 has no suffix, page n (n > 1) ends with _n.html
    if page == 1:
        url = 'https://sc.chinaz.com/tupian/xingganmeinvtupian.html'
    else:
        url = 'https://sc.chinaz.com/tupian/xingganmeinvtupian_' + str(page) + '.html'

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
    }

    request = urllib.request.Request(url = url, headers = headers)
    return request

# Get response content
def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content


# Download image
def down_load(content):
    # urllib.request.urlretrieve('image address','file name')
    tree = etree.HTML(content)
    #print(content)
    name_list = tree.xpath('//div[@class="item"]/img/@alt')
    print(len(name_list))
    # Image-heavy websites usually lazy-load pictures, so the real path is in data-original rather than src
    src_list = tree.xpath('//div[@class="item"]/img/@data-original')
    print(len(src_list))
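    # Pair each name with its path; the ./beautiful/ directory must already exist
    for name, src in zip(name_list, src_list):
        # src is expected to start with //, so prepend the scheme
        url = 'https:' + src
        print(url)
        urllib.request.urlretrieve(url=url, filename='./beautiful/' + name + '.jpg')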




if __name__ == '__main__':
    start_page = int(input('Please enter the starting page number: '))
    end_page = int(input('Please enter the end page number: '))

    for page in range(start_page,end_page + 1):
        # (1) Customization of request object
        request = create_request(page)
        # (2) Get the source code of the web page
        content = get_content(request)
        # (3) Download
        down_load(content)
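Two details of down_load are easy to trip over: the image paths have no scheme, and the image name is used directly as a file name inside a directory that may not exist. The sketch below shows one possible way to handle both. The helper names absolute_url and build_local_path are hypothetical and not part of the original script, and the set of replaced characters is an assumption based on the characters Windows forbids in file names.

```python
import os
import re
from urllib.parse import urljoin

# Page the images were found on (the first-page link from this article)
PAGE_URL = 'https://sc.chinaz.com/tupian/xingganmeinvtupian.html'


def absolute_url(src, page_url=PAGE_URL):
    """Resolve an image path such as //host/a.jpg against the page link."""
    # urljoin borrows the scheme from page_url, so no string concatenation is needed
    return urljoin(page_url, src)


def build_local_path(name, directory='./beautiful', extension='.jpg'):
    """Return a file path for the image, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)
    # Replace characters that are not allowed in Windows file names
    safe_name = re.sub(r'[\\/:*?"<>|]', '_', name).strip() or 'unnamed'
    return os.path.join(directory, safe_name + extension)


print(absolute_url('//example.com/img/1.jpg'))
```

Inside down_load, the pair could be used as urllib.request.urlretrieve(url=absolute_url(src), filename=build_local_path(name)).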

4. Summary and shortcomings

  • This article shows how to download pictures from multiple pages with Python. It covers two core topics: how to customize the request link and how to write the XPath expression.
  • This article does not explain the basics in detail. Readers starting from zero can refer to the earlier articles in this column.
  • The XPath walkthrough has one weakness. After you type an expression into the XPath Helper input box, the matches are highlighted in the first tab of the developer tools, while the HTML that the code actually parses is the response shown in the Network tab. Readers should keep this difference in mind. To learn more about writing matching rules, see the corresponding article in my column or articles by other excellent developers.

If you are interested in my content, please follow me. I will keep updating this series. My other columns cover network security, JavaFX development, and more. Thank you very much.