Practical goal: crawl product information, including product title, link, price, and number of reviews.
The core of the code lies in two parts:
- First: use element positioning to locate the target fields to be crawled on the page;
- Second: persist the data located on the page to a local file.
Next, let's walk through what happens at each step of the process, from visiting the URL to extracting the data.
Analysis of the process of crawling JD.com product data
1. Prepare the target URL
```python
# JD.com homepage URL
url = 'https://www.jd.com/'
```
2. Create a browser instance object
```python
# driver = webdriver.Firefox()   # Create a Firefox browser instance
# driver = webdriver.Ie()        # Create an IE browser instance
# driver = webdriver.Edge()      # Create an Edge browser instance
# driver = webdriver.Safari()    # Create a Safari browser instance
# driver = webdriver.Opera()     # Create an Opera browser instance
driver = webdriver.Chrome()      # Create a Chrome browser instance
```
After the browser instance is created with webdriver.Chrome(), the Chrome browser starts. When Chrome() is called with no arguments, the default executable_path="chromedriver" is used. executable_path indicates the location of the browser driver; by default, Selenium looks for a driver named chromedriver on the system PATH (for example, in the Python installation directory if that directory is on PATH). If the driver sits somewhere else, pass its actual location via the executable_path parameter, for example:
```python
driver = webdriver.Chrome(executable_path="D:/driver/chromedriver.exe")
```
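Note: in Selenium 4, executable_path is deprecated in favor of a Service object. A minimal sketch, assuming Selenium 4 and the same hypothetical driver path as above:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 style: wrap the driver path in a Service object instead of executable_path
service = Service("D:/driver/chromedriver.exe")  # hypothetical path reused from the example above
driver = webdriver.Chrome(service=service)
```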
3. Access the URL
```python
# The browser visits the address
driver.get(url)
```
4. Implicitly wait and maximize the browser window
```python
# Implicitly wait so that dynamically loaded nodes are fully rendered - the wait is imperceptible when elements load quickly
driver.implicitly_wait(3)
# Maximize the browser window, mainly to prevent page content from being blocked
driver.maximize_window()
```
First use the implicitly_wait() method to implicitly wait for the browser to fully load the page, then call maximize_window() to maximize the browser window so that page elements are neither left unloaded nor hidden and the search does not fail.
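If a fixed implicit wait turns out to be unreliable, an explicit wait on a specific element is an alternative. A minimal sketch (not part of the original code), assuming the driver instance from above and the search box with id="key" as the element to wait for:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the search box is present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'key'))
)
```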
5. Locate the search box
```python
# Locate the search box by id=key
input_search = driver.find_element_by_id('key')
# Enter the search keyword in the input box
input_search.send_keys(keyword)
# Simulate pressing the Enter key to search
input_search.send_keys(Keys.ENTER)
# Force a 3-second wait for the results to load
sleep(3)
```
The driver first calls find_element_by_id('key') to locate the search box by its ID, then calls send_keys() with the search keyword, then calls send_keys(Keys.ENTER) to simulate pressing the Enter key, and finally calls sleep(3) to wait for the search results to load. At this point the whole search flow is complete: locate the search box, type the keyword, and press Enter to search.
6. Locate elements (product title, link, price, number of reviews)
```python
# Get the <li> tags of all products on the current (first) page
goods = driver.find_elements_by_class_name('gl-item')
for good in goods:
    # Get the product title
    title = good.find_element_by_css_selector('.p-name em').text.replace('\n', '')
    # Get the product link
    link = good.find_element_by_tag_name('a').get_attribute('href')
    # Get the product price
    price = good.find_element_by_css_selector('.p-price strong').text.replace('\n', '')
    # Get the number of product reviews
    commit = good.find_element_by_css_selector('.p-commit a').text
```
The driver uses class="gl-item" to locate the <li> tag of every product on the current page, and then extracts each field from within it:
- Locate the product title: use a CSS selector to find the element, read its text attribute to get the tag's text content (the product title), and finally call replace() to strip the newline character, so the complete single-line product title is obtained.
```python
# Get the product title
title = good.find_element_by_css_selector('.p-name em').text.replace('\n', '')
```
- Locate the product link: first locate the <a> tag, then read the value of its href attribute to obtain the product link.
```python
# Get the product link
link = good.find_element_by_tag_name('a').get_attribute('href')
```
- Locate the number of product reviews: use a CSS selector to find the element, then read its text attribute to get the tag's text content (the number of reviews).
```python
# Get the number of product reviews
commit = good.find_element_by_css_selector('.p-commit a').text
```
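Note: the find_element_by_* / find_elements_by_* helpers used above were removed in newer Selenium releases (4.x). A sketch of the equivalent By-based calls, assuming the same driver instance and page state as above:

```python
from selenium.webdriver.common.by import By

# Selenium 4 equivalents of the locator calls used in this step
goods = driver.find_elements(By.CLASS_NAME, 'gl-item')
for good in goods:
    title = good.find_element(By.CSS_SELECTOR, '.p-name em').text.replace('\n', '')
    link = good.find_element(By.TAG_NAME, 'a').get_attribute('href')
    price = good.find_element(By.CSS_SELECTOR, '.p-price strong').text.replace('\n', '')
    commit = good.find_element(By.CSS_SELECTOR, '.p-commit a').text
```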
7. Store product data in a file
① Store in txt file
```python
# Get the directory of the current file
paths = path.dirname(__file__)
# Join the current directory and the file name to form the storage path for the product data
file = path.join(paths, 'good.txt')
# Append the product data to the file
with open(file, 'a+', encoding='utf-8', newline='') as wf:
    wf.write(msg)
```
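For reference, msg in the snippet above is the formatted string assembled from the fields extracted in the previous step (it appears in full in the complete sample below), roughly:

```python
# Assemble one product's fields into a text block before writing it to the txt file
msg = '''
Commodity: %s
Link: %s
Price: %s
Comment: %s
''' % (title, link, price, commit)
```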
② Store in CSV file
```python
# Header
header = ['Product title', 'Product price', 'Product link', 'Comment volume']
# Get the directory of the current file
paths = path.dirname(__file__)
# Join the current directory and the file name to form the storage path for the product data
file = path.join(paths, 'good_data.csv')
# Append the product data to the file
with open(file, 'a+', encoding='utf-8', newline='') as wf:
    f_csv = csv.DictWriter(wf, header)
    f_csv.writeheader()
    f_csv.writerows(data)
```
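Because the file is opened in append mode, writeheader() adds a new header row on every run. A minimal sketch of one way to write the header only when the file is first created (the helper name save_csv_append is mine, not from the original code):

```python
from os import path
import csv

def save_csv_append(file, header, data):
    # Only write the header row when the file does not exist yet
    write_header = not path.exists(file)
    with open(file, 'a+', encoding='utf-8', newline='') as wf:
        f_csv = csv.DictWriter(wf, header)
        if write_header:
            f_csv.writeheader()
        f_csv.writerows(data)
```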
8. Exit the browser
```python
# Quit and close the browser
driver.quit()
```
After the product data has been captured, the browser can be closed directly to release resources. This completes the whole crawling flow; the complete sample code is given below.
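To make sure the browser is closed even if an exception is raised midway, quit() can be placed in a finally block. A minimal sketch (not in the original code), with the crawl steps reduced to a single get() call:

```python
from selenium import webdriver

url = 'https://www.jd.com/'
driver = webdriver.Chrome()
try:
    # the crawl steps above would go here: visit the URL, search, locate elements, save the data
    driver.get(url)
finally:
    # always release the browser, even if an exception was raised while crawling
    driver.quit()
```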
Complete sample code
① Save to a txt file
```python
# -*- coding: utf-8 -*-
# @Time    : 2021/10/26 17:35
# @Author  : Jane
# @Software: PyCharm

# Import libraries
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # keyboard key operations
from os import path

# JD.com homepage URL
url = 'https://www.jd.com/'
# Create the browser object
driver = webdriver.Chrome()
# The browser visits the address
driver.get(url)
# Implicitly wait so that dynamically loaded nodes are fully rendered
driver.implicitly_wait(3)
# Maximize the browser window, mainly to prevent page content from being blocked
driver.maximize_window()

# Locate the search box by id=key
input_search = driver.find_element_by_id('key')
# Enter the search keyword in the input box
input_search.send_keys('Ladies bag')
# Simulate pressing the Enter key to search
input_search.send_keys(Keys.ENTER)
# Force a 3-second wait for the results to load
sleep(3)

# Get the <li> tags of all products on the current (first) page
goods = driver.find_elements_by_class_name('gl-item')
for good in goods:
    # Get the product link
    link = good.find_element_by_tag_name('a').get_attribute('href')
    # Get the product title
    title = good.find_element_by_css_selector('.p-name em').text.replace('\n', '')
    # Get the product price
    price = good.find_element_by_css_selector('.p-price strong').text.replace('\n', '')
    # Get the number of product reviews
    commit = good.find_element_by_css_selector('.p-commit a').text
    msg = '''
    Commodity: %s
    Link: %s
    Price: %s
    Comment: %s
    ''' % (title, link, price, commit)
    # Get the directory of the current file
    paths = path.dirname(__file__)
    # Join the current directory and the file name to form the storage path for the product data
    file = path.join(paths, 'good.txt')
    # Append the product data to the file
    with open(file, 'a+', encoding='utf-8', newline='') as wf:
        wf.write(msg)

# Quit and close the browser
driver.quit()
```
② Save to a CSV file
```python
# -*- coding: utf-8 -*-
# @Time    : 2021/10/26 17:35
# @Author  : Jane
# @Software: PyCharm

# Import libraries
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # keyboard key operations
from os import path
import csv

# JD.com homepage URL
url = 'https://www.jd.com/'
# Create the browser object
driver = webdriver.Chrome()
# The browser visits the address
driver.get(url)
# Implicitly wait so that dynamically loaded nodes are fully rendered
driver.implicitly_wait(3)
# Maximize the browser window, mainly to prevent page content from being blocked
driver.maximize_window()

# Locate the search box by id=key
input_search = driver.find_element_by_id('key')
# Enter the search keyword in the input box
input_search.send_keys('Ladies bag')
# Simulate pressing the Enter key to search
input_search.send_keys(Keys.ENTER)
# Force a 3-second wait for the results to load
sleep(3)

# Get the <li> tags of all products on the current (first) page
goods = driver.find_elements_by_class_name('gl-item')
data = []
for good in goods:
    # Get the product link
    link = good.find_element_by_tag_name('a').get_attribute('href')
    # Get the product title
    title = good.find_element_by_css_selector('.p-name em').text.replace('\n', '')
    # Get the product price
    price = good.find_element_by_css_selector('.p-price strong').text.replace('\n', '')
    # Get the number of product reviews
    commit = good.find_element_by_css_selector('.p-commit a').text
    # Collect the product data as a dictionary whose keys match the CSV header
    data.append({
        'Product title': title,
        'Product price': price,
        'Product link': link,
        'Comment volume': commit
    })

# Header
header = ['Product title', 'Product price', 'Product link', 'Comment volume']
# Get the directory of the current file
paths = path.dirname(__file__)
# Join the current directory and the file name to form the storage path for the product data
file = path.join(paths, 'good_data.csv')
# Append the product data to the file
with open(file, 'a+', encoding='utf-8', newline='') as wf:
    f_csv = csv.DictWriter(wf, header)
    f_csv.writeheader()
    f_csv.writerows(data)

# Quit and close the browser
driver.quit()
```
③ Encapsulate the code
```python
# -*- coding: utf-8 -*-
# @Time    : 2021/10/26 17:35
# @Author  : Jane
# @Software: PyCharm

# Import libraries
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # keyboard key operations
from os import path
import csv


def spider(url, keyword):
    # Create the browser object
    driver = webdriver.Chrome()
    # The browser visits the address
    driver.get(url)
    # Implicitly wait so that dynamically loaded nodes are fully rendered
    driver.implicitly_wait(3)
    # Maximize the browser window, mainly to prevent page content from being blocked
    driver.maximize_window()
    # Locate the search box by id=key
    input_search = driver.find_element_by_id('key')
    # Enter the search keyword in the input box
    input_search.send_keys(keyword)
    # Simulate pressing the Enter key to search
    input_search.send_keys(Keys.ENTER)
    # Force a 3-second wait for the results to load
    sleep(3)
    # Fetch the product data
    get_good(driver)
    # Quit and close the browser
    driver.quit()


# Fetch the product data
def get_good(driver):
    # Get the <li> tags of all products on the current (first) page
    goods = driver.find_elements_by_class_name('gl-item')
    data = []
    for good in goods:
        # Get the product link
        link = good.find_element_by_tag_name('a').get_attribute('href')
        # Get the product title
        title = good.find_element_by_css_selector('.p-name em').text.replace('\n', '')
        # Get the product price
        price = good.find_element_by_css_selector('.p-price strong').text.replace('\n', '')
        # Get the number of product reviews
        commit = good.find_element_by_css_selector('.p-commit a').text
        # Store the product data in a dictionary whose keys match the CSV header
        good_data = {
            'Product title': title,
            'Product price': price,
            'Product link': link,
            'Comment volume': commit
        }
        data.append(good_data)
    saveCSV(data)


# Save the product data to a CSV file
def saveCSV(data):
    # Header
    header = ['Product title', 'Product price', 'Product link', 'Comment volume']
    # Get the directory of the current file
    paths = path.dirname(__file__)
    # Join the current directory and the file name to form the storage path for the product data
    file = path.join(paths, 'good_data.csv')
    # Append the product data to the file
    with open(file, 'a+', encoding='utf-8', newline='') as wf:
        f_csv = csv.DictWriter(wf, header)
        f_csv.writeheader()
        f_csv.writerows(data)


# Program entry point
if __name__ == '__main__':
    # JD.com homepage URL
    url = 'https://www.jd.com/'
    # Search keyword
    keyword = 'ladies bag'
    # Crawl the data
    spider(url, keyword)
```