Practical goal: crawl product information, including product title, link, price, and number of reviews.
The core of the code lies in two parts:
- First: use element positioning to locate the target fields to be crawled on the page;
- Second: persist the data located on the page to a local file.
Next, let's walk through what happens at each step of the process, from visiting the URL to extracting the data.
Analysis of the process of crawling JD.com product data
1. Prepare the target URL
```python
# JD.com homepage URL
url = 'https://www.jd.com/'
```
2. Create a browser instance object
```python
# driver = webdriver.Firefox()   # Create a Firefox browser instance
# driver = webdriver.Ie()        # Create an IE browser instance
# driver = webdriver.Edge()      # Create an Edge browser instance
# driver = webdriver.Safari()    # Create a Safari browser instance
# driver = webdriver.Opera()     # Create an Opera browser instance
driver = webdriver.Chrome()      # Create a Chrome browser instance
```
After the browser instance is created with webdriver.Chrome(), the Chrome browser starts. When Chrome() is called with no arguments, the default executable_path="chromedriver" is used. executable_path indicates the location of the browser driver; by default, Selenium looks for a driver named chromedriver on the system PATH (for example, in the Python installation directory if that directory is on PATH). If the driver sits somewhere else, pass its actual location via the executable_path parameter, for example:
```python
driver = webdriver.Chrome(executable_path="D:/driver/chromedriver.exe")
```
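Note: in Selenium 4, executable_path is deprecated in favor of a Service object. A minimal sketch, assuming Selenium 4 and the same hypothetical driver path as above:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 style: wrap the driver path in a Service object instead of executable_path
service = Service("D:/driver/chromedriver.exe")  # hypothetical path reused from the example above
driver = webdriver.Chrome(service=service)
```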
3. Access the URL
```python
# The browser visits the address
driver.get(url)
```
4. Implicitly wait and maximize the browser window
```python
# Implicitly wait so that dynamically loaded nodes are fully rendered - the wait is imperceptible when elements load quickly
driver.implicitly_wait(3)
# Maximize the browser window, mainly to prevent page content from being blocked
driver.maximize_window()
```
First use the implicitly_wait() method to implicitly wait for the browser to fully load the page, then call maximize_window() to maximize the browser window so that page elements are neither left unloaded nor hidden and the search does not fail.
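If a fixed implicit wait turns out to be unreliable, an explicit wait on a specific element is an alternative. A minimal sketch (not part of the original code), assuming the driver instance from above and the search box with id="key" as the element to wait for:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the search box is present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'key'))
)
```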
5. Locate the search box
```python
# Locate the search box by id=key
input_search = driver.find_element_by_id('key')
# Enter the search keyword in the input box
input_search.send_keys(keyword)
# Simulate pressing the Enter key to search
input_search.send_keys(Keys.ENTER)
# Force a 3-second wait for the results to load
sleep(3)
```
The driver first calls find_element_by_id('key') to locate the search box by its ID, then calls send_keys() with the search keyword, then calls send_keys(Keys.ENTER) to simulate pressing the Enter key, and finally calls sleep(3) to wait for the search results to load. At this point the whole search flow is complete: locate the search box, type the keyword, and press Enter to search.
6. Locate elements (product title, link, price, number of reviews)
```python
# Get the <li> tags of all products on the current (first) page
goods = driver.find_elements_by_class_name('gl-item')
for good in goods:
    # Get the product title
    title = good.find_element_by_css_selector('.p-name em').text.replace('\n', '')
    # Get the product link
    link = good.find_element_by_tag_name('a').get_attribute('href')
    # Get the product price
    price = good.find_element_by_css_selector('.p-price strong').text.replace('\n', '')
    # Get the number of product reviews
    commit = good.find_element_by_css_selector('.p-commit a').text
```
The driver uses class="gl-item" to locate the <li> tag of every product on the current page, and then extracts each field from within it:
- Locate the product title: use a CSS selector to find the element, read its text attribute to get the tag's text content (the product title), and finally call replace() to strip the newline character, so the complete single-line product title is obtained.
```python
# Get the product title
title = good.find_element_by_css_selector('.p-name em').text.replace('\n', '')
```
- Locate the product link: first locate the <a> tag, then read the value of its href attribute to obtain the product link.
```python
# Get the product link
link = good.find_element_by_tag_name('a').get_attribute('href')
```
- Locate the number of product reviews: use a CSS selector to find the element, then read its text attribute to get the tag's text content (the number of reviews).
```python
# Get the number of product reviews
commit = good.find_element_by_css_selector('.p-commit a').text
```
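Note: the find_element_by_* / find_elements_by_* helpers used above were removed in newer Selenium releases (4.x). A sketch of the equivalent By-based calls, assuming the same driver instance and page state as above:

```python
from selenium.webdriver.common.by import By

# Selenium 4 equivalents of the locator calls used in this step
goods = driver.find_elements(By.CLASS_NAME, 'gl-item')
for good in goods:
    title = good.find_element(By.CSS_SELECTOR, '.p-name em').text.replace('\n', '')
    link = good.find_element(By.TAG_NAME, 'a').get_attribute('href')
    price = good.find_element(By.CSS_SELECTOR, '.p-price strong').text.replace('\n', '')
    commit = good.find_element(By.CSS_SELECTOR, '.p-commit a').text
```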
7. Store product data in a file
① Store in txt file
```python
# Get the directory of the current file
paths = path.dirname(__file__)
# Join the current directory and the file name to form the storage path for the product data
file = path.join(paths, 'good.txt')
# Append the product data to the file
with open(file, 'a+', encoding='utf-8', newline='') as wf:
    wf.write(msg)
```
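For reference, msg in the snippet above is the formatted string assembled from the fields extracted in the previous step (it appears in full in the complete sample below), roughly:

```python
# Assemble one product's fields into a text block before writing it to the txt file
msg = '''
Commodity: %s
Link: %s
Price: %s
Comment: %s
''' % (title, link, price, commit)
```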
② Store in CSV file
```python
# Header
header = ['Product title', 'Product price', 'Product link', 'Comment volume']
# Get the directory of the current file
paths = path.dirname(__file__)
# Join the current directory and the file name to form the storage path for the product data
file = path.join(paths, 'good_data.csv')
# Append the product data to the file
with open(file, 'a+', encoding='utf-8', newline='') as wf:
    f_csv = csv.DictWriter(wf, header)
    f_csv.writeheader()
    f_csv.writerows(data)
```
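Because the file is opened in append mode, writeheader() adds a new header row on every run. A minimal sketch of one way to write the header only when the file is first created (the helper name save_csv_append is mine, not from the original code):

```python
from os import path
import csv

def save_csv_append(file, header, data):
    # Only write the header row when the file does not exist yet
    write_header = not path.exists(file)
    with open(file, 'a+', encoding='utf-8', newline='') as wf:
        f_csv = csv.DictWriter(wf, header)
        if write_header:
            f_csv.writeheader()
        f_csv.writerows(data)
```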
8. Exit the browser
```python
# Quit and close the browser
driver.quit()
```
After the product data has been captured, the browser can be closed directly to release resources. This completes the whole crawling flow; the complete sample code is given below.
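To make sure the browser is closed even if an exception is raised midway, quit() can be placed in a finally block. A minimal sketch (not in the original code), with the crawl steps reduced to a single get() call:

```python
from selenium import webdriver

url = 'https://www.jd.com/'
driver = webdriver.Chrome()
try:
    # the crawl steps above would go here: visit the URL, search, locate elements, save the data
    driver.get(url)
finally:
    # always release the browser, even if an exception was raised while crawling
    driver.quit()
```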
Complete sample code
① Save to a txt file
```python
# -*- coding: utf-8 -*-
# @Time    : 2021/10/26 17:35
# @Author  : Jane
# @Software: PyCharm

# Import libraries
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # keyboard key operations
from os import path

# JD.com homepage URL
url = 'https://www.jd.com/'
# Create the browser object
driver = webdriver.Chrome()
# The browser visits the address
driver.get(url)
# Implicitly wait so that dynamically loaded nodes are fully rendered
driver.implicitly_wait(3)
# Maximize the browser window, mainly to prevent page content from being blocked
driver.maximize_window()

# Locate the search box by id=key
input_search = driver.find_element_by_id('key')
# Enter the search keyword in the input box
input_search.send_keys('Ladies bag')
# Simulate pressing the Enter key to search
input_search.send_keys(Keys.ENTER)
# Force a 3-second wait for the results to load
sleep(3)

# Get the <li> tags of all products on the current (first) page
goods = driver.find_elements_by_class_name('gl-item')
for good in goods:
    # Get the product link
    link = good.find_element_by_tag_name('a').get_attribute('href')
    # Get the product title
    title = good.find_element_by_css_selector('.p-name em').text.replace('\n', '')
    # Get the product price
    price = good.find_element_by_css_selector('.p-price strong').text.replace('\n', '')
    # Get the number of product reviews
    commit = good.find_element_by_css_selector('.p-commit a').text
    msg = '''
    Commodity: %s
    Link: %s
    Price: %s
    Comment: %s
    ''' % (title, link, price, commit)
    # Get the directory of the current file
    paths = path.dirname(__file__)
    # Join the current directory and the file name to form the storage path for the product data
    file = path.join(paths, 'good.txt')
    # Append the product data to the file
    with open(file, 'a+', encoding='utf-8', newline='') as wf:
        wf.write(msg)

# Quit and close the browser
driver.quit()
```
② Save to a CSV file
```python
# -*- coding: utf-8 -*-
# @Time    : 2021/10/26 17:35
# @Author  : Jane
# @Software: PyCharm

# Import libraries
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # keyboard key operations
from os import path
import csv

# JD.com homepage URL
url = 'https://www.jd.com/'
# Create the browser object
driver = webdriver.Chrome()
# The browser visits the address
driver.get(url)
# Implicitly wait so that dynamically loaded nodes are fully rendered
driver.implicitly_wait(3)
# Maximize the browser window, mainly to prevent page content from being blocked
driver.maximize_window()

# Locate the search box by id=key
input_search = driver.find_element_by_id('key')
# Enter the search keyword in the input box
input_search.send_keys('Ladies bag')
# Simulate pressing the Enter key to search
input_search.send_keys(Keys.ENTER)
# Force a 3-second wait for the results to load
sleep(3)

# Get the <li> tags of all products on the current (first) page
goods = driver.find_elements_by_class_name('gl-item')
data = []
for good in goods:
    # Get the product link
    link = good.find_element_by_tag_name('a').get_attribute('href')
    # Get the product title
    title = good.find_element_by_css_selector('.p-name em').text.replace('\n', '')
    # Get the product price
    price = good.find_element_by_css_selector('.p-price strong').text.replace('\n', '')
    # Get the number of product reviews
    commit = good.find_element_by_css_selector('.p-commit a').text
    # Collect the product data as a dictionary whose keys match the CSV header
    data.append({
        'Product title': title,
        'Product price': price,
        'Product link': link,
        'Comment volume': commit
    })

# Header
header = ['Product title', 'Product price', 'Product link', 'Comment volume']
# Get the directory of the current file
paths = path.dirname(__file__)
# Join the current directory and the file name to form the storage path for the product data
file = path.join(paths, 'good_data.csv')
# Append the product data to the file
with open(file, 'a+', encoding='utf-8', newline='') as wf:
    f_csv = csv.DictWriter(wf, header)
    f_csv.writeheader()
    f_csv.writerows(data)

# Quit and close the browser
driver.quit()
```
③ Encapsulate the code
```python
# -*- coding: utf-8 -*-
# @Time    : 2021/10/26 17:35
# @Author  : Jane
# @Software: PyCharm

# Import libraries
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # keyboard key operations
from os import path
import csv


def spider(url, keyword):
    # Create the browser object
    driver = webdriver.Chrome()
    # The browser visits the address
    driver.get(url)
    # Implicitly wait so that dynamically loaded nodes are fully rendered
    driver.implicitly_wait(3)
    # Maximize the browser window, mainly to prevent page content from being blocked
    driver.maximize_window()
    # Locate the search box by id=key
    input_search = driver.find_element_by_id('key')
    # Enter the search keyword in the input box
    input_search.send_keys(keyword)
    # Simulate pressing the Enter key to search
    input_search.send_keys(Keys.ENTER)
    # Force a 3-second wait for the results to load
    sleep(3)
    # Fetch the product data
    get_good(driver)
    # Quit and close the browser
    driver.quit()


# Fetch the product data
def get_good(driver):
    # Get the <li> tags of all products on the current (first) page
    goods = driver.find_elements_by_class_name('gl-item')
    data = []
    for good in goods:
        # Get the product link
        link = good.find_element_by_tag_name('a').get_attribute('href')
        # Get the product title
        title = good.find_element_by_css_selector('.p-name em').text.replace('\n', '')
        # Get the product price
        price = good.find_element_by_css_selector('.p-price strong').text.replace('\n', '')
        # Get the number of product reviews
        commit = good.find_element_by_css_selector('.p-commit a').text
        # Store the product data in a dictionary whose keys match the CSV header
        good_data = {
            'Product title': title,
            'Product price': price,
            'Product link': link,
            'Comment volume': commit
        }
        data.append(good_data)
    saveCSV(data)


# Save the product data to a CSV file
def saveCSV(data):
    # Header
    header = ['Product title', 'Product price', 'Product link', 'Comment volume']
    # Get the directory of the current file
    paths = path.dirname(__file__)
    # Join the current directory and the file name to form the storage path for the product data
    file = path.join(paths, 'good_data.csv')
    # Append the product data to the file
    with open(file, 'a+', encoding='utf-8', newline='') as wf:
        f_csv = csv.DictWriter(wf, header)
        f_csv.writeheader()
        f_csv.writerows(data)


# Program entry point
if __name__ == '__main__':
    # JD.com homepage URL
    url = 'https://www.jd.com/'
    # Search keyword
    keyword = 'ladies bag'
    # Crawl the data
    spider(url, keyword)
```