Selenium in Practice: Crawling Product Data with Python + Selenium

Practical goal: crawl product information, including each product's title, link, price, and number of reviews.

The code has two core parts:

  • First: use element locators to find the specified data that needs to be crawled on the page;
  • Second: persist the data located on the page to a local file.

Let's walk through what happens at each step of the process, from visiting the URL to saving the crawled data.

Step-by-step analysis of crawling Jingdong product data

1. Prepare the target URL

# Jingdong Mall URL
url = 'https://www.jd.com/'

2. Create a browser instance object

# driver = webdriver.Firefox() # Create Firefox browser instance object
# driver = webdriver.Ie() # Create IE browser instance object
# driver = webdriver.Edge() # Create an Edge browser instance object
# driver = webdriver.Safari() # Create Safari browser instance object
# driver = webdriver.Opera() # Create an Opera browser instance object
driver = webdriver.Chrome() # Create a Chrome browser instance object

Calling webdriver.Chrome() creates the browser instance object and launches Chrome. When no arguments are passed, the default executable_path="chromedriver" is used. executable_path is the location of the browser driver; with the default value, Selenium looks for an executable named chromedriver on the system PATH (a common setup is to place it in the Python installation directory, provided that directory is on PATH). If the driver lives somewhere else, pass its actual location through executable_path. Note that executable_path belongs to the Selenium 3 API used throughout this article; Selenium 4 deprecated it in favor of a Service object. For example:

driver = webdriver.Chrome(executable_path="D:/driver/chromedriver.exe")
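
For readers on Selenium 4, where executable_path is deprecated, the driver location is passed through a Service object instead. The following is a minimal sketch under that assumption; make_chrome_driver and its driver_path parameter are hypothetical names for illustration and are not part of the original code.

```python
def make_chrome_driver(driver_path=None):
    """Create a Chrome driver using the Selenium 4 style of configuration.

    driver_path is a hypothetical parameter: None leaves it to Selenium to
    find the driver, otherwise the given path is handed to a Service object.
    """
    # Imported here so the sketch can be loaded without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    if driver_path is None:
        # Let Selenium locate chromedriver on its own
        return webdriver.Chrome()
    # Explicit driver location, replacing the old executable_path argument
    return webdriver.Chrome(service=Service(executable_path=driver_path))
```

With this helper, the call above would become make_chrome_driver("D:/driver/chromedriver.exe").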

2. Access URL

# Open the URL in the browser
driver.get(url)

3. Implicit waiting, maximizing the browser window

# Implicit wait: element lookups keep retrying for up to 3 seconds before failing;
# the call itself returns immediately, so no delay is felt
driver.implicitly_wait(3)
# Maximize the browser window, mainly to keep content from being hidden
driver.maximize_window()

First, implicitly_wait(3) sets an implicit wait: every subsequent element lookup keeps retrying for up to 3 seconds until the element appears, which gives dynamically loaded content time to arrive. Then maximize_window() maximizes the browser window, so that page elements are not hidden or left unloaded, which would make lookups fail.
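
The idea behind waiting for dynamic content can be sketched in plain Python: poll a condition until it succeeds or a deadline passes. This is only an illustration of the polling idea, not Selenium's implementation; wait_until and its parameters are hypothetical names. In real Selenium code, WebDriverWait serves this purpose.

```python
import time


def wait_until(condition, timeout=3.0, interval=0.1):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # The awaited content is available: stop waiting right away
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Pause briefly before the next attempt
        time.sleep(interval)
```

Because the function returns as soon as the condition holds, a fast page costs almost no waiting time, while a slow page gets up to the full timeout.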

3. Locate the search box

# Locate the search box by id=key
input_search = driver.find_element_by_id('key')
# Type the search keyword into the input box
input_search.send_keys(keyword)
# Simulate pressing Enter to run the search
input_search.send_keys(Keys.ENTER)
# Forced wait of 3 seconds
sleep(3)

The driver first calls find_element_by_id('key') to locate the search box by its ID, then calls send_keys() with the search keyword, and then calls send_keys() again with Keys.ENTER to simulate pressing the Enter key. Finally, sleep(3) waits for the search results to load. This completes the search flow: locate the search box -> type the keyword -> press Enter. (The find_element_by_* helpers used in this article belong to the Selenium 3 API; Selenium 4.3 removed them in favor of find_element(By.ID, 'key') and its siblings.)
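
As a hedged sketch, the same search flow can be written with the generic find_element(by, value) call, which is available in both Selenium 3 and Selenium 4. To keep the block loadable without Selenium, the locator strategy and the Enter key are given as their literal string values; in real code the named constants By.ID and Keys.ENTER are preferable. search_products and box_id are hypothetical names.

```python
BY_ID = "id"          # same value as selenium.webdriver.common.by.By.ID
ENTER_KEY = "\ue007"  # same value as selenium.webdriver.common.keys.Keys.ENTER


def search_products(driver, keyword, box_id="key"):
    """Type `keyword` into the search box and press Enter."""
    # Locate the search box by its id attribute
    box = driver.find_element(BY_ID, box_id)
    # Type the keyword, then send the Enter key to submit the search
    box.send_keys(keyword)
    box.send_keys(ENTER_KEY)
    return box
```

The function only relies on find_element() and send_keys(), so any object offering those two methods can stand in for the driver when trying it out.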

4. Locate the elements (product title, link, price, number of reviews)

# Get the li tags of all products on the first results page
goods = driver.find_elements_by_class_name('gl-item')
for good in goods:
    # Get the product title, with line breaks removed
    title = good.find_element_by_css_selector('.p-name em').text.replace('\n', '')
    # Get the product link
    link = good.find_element_by_tag_name('a').get_attribute('href')
    # Get the product price, with line breaks removed
    price = good.find_element_by_css_selector('.p-price strong').text.replace('\n', '')
    # Get the number of product reviews
    commit = good.find_element_by_css_selector('.p-commit a').text

The driver calls find_elements_by_class_name('gl-item') to locate the <li> tags (class="gl-item") of all products on the page, and then iterates over these <li> tags, using a different locator for each field to find the product link, title, price, and number of reviews.

  • Locate the product title: use a CSS selector to locate the element, read its text property to get the text inside the tag (that is, the product title), and finally call replace() to swap line breaks for an empty string, so that the complete title ends up on a single line.
    # Get the product title
    title = good.find_element_by_css_selector('.p-name em').text.replace('\n', '')

  • Locate the product link: locate the first <a> tag by tag name, then call get_attribute('href') to read its link address.
    # Get the product link
    link = good.find_element_by_tag_name('a').get_attribute('href')

  • Locate the product price: use a CSS selector to locate the element, then read its text property to get the text inside the tag (that is, the price of the product), again removing line breaks.
    # Get the product price
    price = good.find_element_by_css_selector('.p-price strong').text.replace('\n', '')

  • Locate the number of product reviews: use a CSS selector to locate the element, then read its text property to get the text inside the tag (that is, the number of reviews).
    # Get the number of product reviews
    commit = good.find_element_by_css_selector('.p-commit a').text
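
The four lookups above can be gathered into one helper that turns a single product <li> into a dictionary. This is a sketch rather than the original code: extract_product and the dictionary keys are hypothetical, and the locator strategies are written as the literal string values of By.CSS_SELECTOR and By.TAG_NAME so the block loads without Selenium installed.

```python
BY_CSS = "css selector"  # same value as By.CSS_SELECTOR
BY_TAG = "tag name"      # same value as By.TAG_NAME


def extract_product(item):
    """Collect title, link, price and review count from one product element."""
    # Title and price may contain line breaks, which are stripped out
    title = item.find_element(BY_CSS, ".p-name em").text.replace("\n", "")
    # The first <a> tag inside the item carries the product link
    link = item.find_element(BY_TAG, "a").get_attribute("href")
    price = item.find_element(BY_CSS, ".p-price strong").text.replace("\n", "")
    reviews = item.find_element(BY_CSS, ".p-commit a").text
    return {"title": title, "link": link, "price": price, "reviews": reviews}
```

Returning a dictionary per product keeps the extraction step separate from the storage step, so the same result can be written to a txt file or a CSV file.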
    

5. Store product data in a file

① Store in a txt file

# Directory containing the current script
paths = path.dirname(__file__)
# Join the directory and the file name to form the storage path of the product data
file = path.join(paths, 'good.txt')
# Append the product data to the file (mode 'a' creates the file if it does not exist)
with open(file, 'a', encoding='utf-8', newline='') as wf:
    wf.write(msg)
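
A self-contained sketch of the append step, separated into formatting a record and writing it. format_record and append_record are hypothetical helper names; the field labels follow the msg template used in the complete sample code.

```python
def format_record(title, link, price, reviews):
    """Format one product as a small block of text ending with a blank line."""
    return (
        f"Commodity: {title}\n"
        f"Link: {link}\n"
        f"Price: {price}\n"
        f"Comment: {reviews}\n\n"
    )


def append_record(file, record):
    """Append one formatted record to `file`, creating the file if needed."""
    # Mode 'a' always writes at the end, so earlier records are preserved
    with open(file, "a", encoding="utf-8", newline="") as wf:
        wf.write(record)
```

Calling append_record() once per product yields one text block per product, in crawl order.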
    

② Store in a CSV file

# Header
header = ['Product title', 'Product price', 'Product link', 'Comment volume']
# Directory containing the current script
paths = path.dirname(__file__)
# Join the directory and the file name to form the storage path of the product data
file = path.join(paths, 'good_data.csv')
# Append the product data to the file;
# data is a list of dictionaries whose keys match the header
with open(file, 'a', encoding='utf-8', newline='') as wf:
    f_csv = csv.DictWriter(wf, header)
    f_csv.writeheader()
    f_csv.writerows(data)
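
Two details of csv.DictWriter matter here: by default it raises ValueError when a row contains a key that is not in the header, and in append mode calling writeheader() on every run repeats the header row. The following sketch, with the hypothetical helper save_csv, writes the header only when the file is new or empty.

```python
import csv
import os

HEADER = ["Product title", "Product price", "Product link", "Comment volume"]


def save_csv(file, rows, header=HEADER):
    """Append `rows` (dicts keyed by `header`) to `file`, writing the header once."""
    # Decide before opening: opening in append mode would create the file
    is_new = not os.path.exists(file) or os.path.getsize(file) == 0
    with open(file, "a", encoding="utf-8", newline="") as wf:
        writer = csv.DictWriter(wf, fieldnames=header)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
```

Running the crawler several times then produces a single header row followed by all collected products.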
    

6. Exit the browser

# Quit and close the browser
driver.quit()


After the product data has been captured, close the browser to release its resources. That completes the walkthrough of the crawling process.
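
If an exception is raised while crawling, a quit() call at the end of the script is never reached and the browser stays open. One way to guard against that, sketched here with the hypothetical helper browser_session, is a context manager that always quits the driver.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(create_driver):
    """Yield a driver and always call quit() on it, even if crawling fails.

    create_driver is any zero-argument callable returning a driver,
    for example webdriver.Chrome.
    """
    driver = create_driver()
    try:
        yield driver
    finally:
        # Runs on normal exit and on exceptions alike
        driver.quit()
```

Typical use would be `with browser_session(webdriver.Chrome) as driver:` followed by the crawling steps inside the block.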

Complete sample code

① Save to a txt file

# -*- coding: utf-8 -*-
# @Time : 2021/10/26 17:35
# @Author : Jane
# @Software: PyCharm
# Written against the Selenium 3 API (find_element_by_* helpers)


# Import libraries
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # keyboard key operations
from os import path


# Jingdong Mall URL
url = 'https://www.jd.com/'
# Create the browser object
driver = webdriver.Chrome()
# Open the URL in the browser
driver.get(url)
# Implicit wait: element lookups keep retrying for up to 3 seconds before failing
driver.implicitly_wait(3)
# Maximize the browser window, mainly to keep content from being hidden
driver.maximize_window()
# Locate the search box by id=key
input_search = driver.find_element_by_id('key')
# Type the search keyword into the input box
input_search.send_keys('Ladies bag')
# Simulate pressing Enter to run the search
input_search.send_keys(Keys.ENTER)
# Forced wait of 3 seconds
sleep(3)
# Directory containing the current script
paths = path.dirname(__file__)
# Join the directory and the file name to form the storage path of the product data
file = path.join(paths, 'good.txt')
# Get the li tags of all products on the first results page
goods = driver.find_elements_by_class_name('gl-item')
for good in goods:
    # Get the product link
    link = good.find_element_by_tag_name('a').get_attribute('href')
    # Get the product title, with line breaks removed
    title = good.find_element_by_css_selector('.p-name em').text.replace('\n', '')
    # Get the product price, with line breaks removed
    price = good.find_element_by_css_selector('.p-price strong').text.replace('\n', '')
    # Get the number of product reviews
    commit = good.find_element_by_css_selector('.p-commit a').text
    msg = '''
        Commodity: %s
        Link: %s
        Price: %s
        Comment: %s
    ''' % (title, link, price, commit)
    # Append the product data to the file
    with open(file, 'a', encoding='utf-8', newline='') as wf:
        wf.write(msg)
# Quit and close the browser
driver.quit()
    

② Save to a CSV file

# -*- coding: utf-8 -*-
# @Time : 2021/10/26 17:35
# @Author : Jane
# @Software: PyCharm
# Written against the Selenium 3 API (find_element_by_* helpers)


# Import libraries
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # keyboard key operations
from os import path
import csv


# Jingdong Mall URL
url = 'https://www.jd.com/'
# Create the browser object
driver = webdriver.Chrome()
# Open the URL in the browser
driver.get(url)
# Implicit wait: element lookups keep retrying for up to 3 seconds before failing
driver.implicitly_wait(3)
# Maximize the browser window, mainly to keep content from being hidden
driver.maximize_window()
# Locate the search box by id=key
input_search = driver.find_element_by_id('key')
# Type the search keyword into the input box
input_search.send_keys('Ladies bag')
# Simulate pressing Enter to run the search
input_search.send_keys(Keys.ENTER)
# Forced wait of 3 seconds
sleep(3)
# Get the li tags of all products on the first results page
goods = driver.find_elements_by_class_name('gl-item')
# Collected rows, one dictionary per product
data = []
for good in goods:
    # Get the product link
    link = good.find_element_by_tag_name('a').get_attribute('href')
    # Get the product title, with line breaks removed
    title = good.find_element_by_css_selector('.p-name em').text.replace('\n', '')
    # Get the product price, with line breaks removed
    price = good.find_element_by_css_selector('.p-price strong').text.replace('\n', '')
    # Get the number of product reviews
    commit = good.find_element_by_css_selector('.p-commit a').text
    # Store the product data in a dictionary whose keys match the CSV header
    data.append({
        'Product title': title,
        'Product price': price,
        'Product link': link,
        'Comment volume': commit
    })

# Header
header = ['Product title', 'Product price', 'Product link', 'Comment volume']
# Directory containing the current script
paths = path.dirname(__file__)
# Join the directory and the file name to form the storage path of the product data
file = path.join(paths, 'good_data.csv')
# Append the product data to the file
with open(file, 'a', encoding='utf-8', newline='') as wf:
    f_csv = csv.DictWriter(wf, header)
    f_csv.writeheader()
    f_csv.writerows(data)

# Quit and close the browser
driver.quit()
    

③ Encapsulate the code

# -*- coding: utf-8 -*-
# @Time : 2021/10/26 17:35
# @Author : Jane
# @Software: PyCharm
# Written against the Selenium 3 API (find_element_by_* helpers)


# Import libraries
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # keyboard key operations
from os import path
import csv


def spider(url, keyword):
    # Create the browser object
    driver = webdriver.Chrome()
    # Open the URL in the browser
    driver.get(url)
    # Implicit wait: element lookups keep retrying for up to 3 seconds before failing
    driver.implicitly_wait(3)
    # Maximize the browser window, mainly to keep content from being hidden
    driver.maximize_window()
    # Locate the search box by id=key
    input_search = driver.find_element_by_id('key')
    # Type the search keyword into the input box
    input_search.send_keys(keyword)
    # Simulate pressing Enter to run the search
    input_search.send_keys(Keys.ENTER)
    # Forced wait of 3 seconds
    sleep(3)
    # Fetch the product data
    get_good(driver)
    # Quit and close the browser
    driver.quit()


# Fetch the product data
def get_good(driver):
    # Get the li tags of all products on the first results page
    goods = driver.find_elements_by_class_name('gl-item')
    data = []
    for good in goods:
        # Get the product link
        link = good.find_element_by_tag_name('a').get_attribute('href')
        # Get the product title, with line breaks removed
        title = good.find_element_by_css_selector('.p-name em').text.replace('\n', '')
        # Get the product price, with line breaks removed
        price = good.find_element_by_css_selector('.p-price strong').text.replace('\n', '')
        # Get the number of product reviews
        commit = good.find_element_by_css_selector('.p-commit a').text
        # Store the product data in a dictionary whose keys match the CSV header
        good_data = {
            'Product title': title,
            'Product price': price,
            'Product link': link,
            'Comment volume': commit
        }
        data.append(good_data)
    saveCSV(data)


# Save the product data to a CSV file
def saveCSV(data):
    # Header
    header = ['Product title', 'Product price', 'Product link', 'Comment volume']
    # Directory containing the current script
    paths = path.dirname(__file__)
    # Join the directory and the file name to form the storage path of the product data
    file = path.join(paths, 'good_data.csv')
    # Append the product data to the file
    with open(file, 'a', encoding='utf-8', newline='') as wf:
        f_csv = csv.DictWriter(wf, header)
        f_csv.writeheader()
        f_csv.writerows(data)


# Program entry point
if __name__ == '__main__':
    # Jingdong Mall URL
    url = 'https://www.jd.com/'
    # Search keyword
    keyword = 'ladies bag'
    # Crawl the data
    spider(url, keyword)
    
