A web crawler is a program that automatically retrieves web content. It is used for data collection, information analysis, website monitoring, and similar tasks. However, some page content is not static; it is generated dynamically by JavaScript, as is typical for charts, maps, and other complex elements. Such elements often appear only after user interaction or after some loading time. Traditional HTTP-based tools such as requests or urllib cannot obtain this content, because they only download the source code of the page and do not execute JavaScript.
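The limitation can be reproduced without any network access. The sketch below parses a made-up page with Python's standard html.parser, which, like requests or urllib, never runs the page's script. The HTML string, the "chart" id, and the ContainerText class are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page source: the chart container is empty until a
# browser runs the script that fills it in.
PAGE_SOURCE = """
<html><body>
  <div id="chart"></div>
  <script>
    document.getElementById("chart").textContent = "Sales: 42";
  </script>
</body></html>
"""


class ContainerText(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # greater than 0 while inside the target element
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def static_text(source, target_id):
    """Return the text a non-JavaScript client sees inside target_id."""
    parser = ContainerText(target_id)
    parser.feed(source)
    return "".join(parser.text).strip()


print(repr(static_text(PAGE_SOURCE, "chart")))  # prints ''
```

The container comes back empty because the text "Sales: 42" only exists after a browser has executed the script.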
To solve this problem, we can use scrapy_selenium, which combines two powerful libraries, Scrapy and Selenium, to crawl dynamic web pages. Scrapy is a Python crawling framework that organizes crawlers into projects and provides a rich set of middleware and pipeline components. Selenium is a browser automation tool, widely used for testing, that can simulate user behavior such as opening a web page, clicking a button, or entering text, and can return the rendered result. With Selenium plugged in as a Scrapy downloader middleware, Scrapy can have pages rendered in a real browser and then parse the dynamically generated content.
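A downloader middleware sits between Scrapy's engine and its HTTP downloader. In Scrapy, when a middleware's process_request returns a response, the normal download is skipped, which is how a browser-backed middleware can hand back rendered HTML. The toy model below illustrates only that control flow. It imports neither Scrapy nor Selenium, and RenderingMiddleware, download, and the two fake fetchers are hypothetical names.

```python
class RenderingMiddleware:
    """Toy stand-in for a Selenium-backed downloader middleware."""

    def __init__(self, render):
        self.render = render  # callable returning rendered HTML for a URL

    def process_request(self, request):
        # Only handle requests that explicitly ask for browser rendering.
        if not request.get("use_browser"):
            return None  # fall through to the plain HTTP download
        return {"url": request["url"], "body": self.render(request["url"])}


def download(request, middlewares, fetch):
    """Return the first middleware response, else the plain download."""
    for middleware in middlewares:
        response = middleware.process_request(request)
        if response is not None:
            return response
    return {"url": request["url"], "body": fetch(request["url"])}


# Fake fetchers so the sketch runs without a network or a browser.
def plain_fetch(url):
    return "<div id='map'></div>"


def browser_render(url):
    return "<div id='map'>12 markers</div>"


middlewares = [RenderingMiddleware(browser_render)]
static = download({"url": "https://example.com"}, middlewares, plain_fetch)
rendered = download(
    {"url": "https://example.com", "use_browser": True}, middlewares, plain_fetch
)
print(static["body"])
print(rendered["body"])
```

The same request URL yields an empty container through the plain path and a populated one through the rendering path, which is the difference scrapy_selenium is meant to provide.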
Overview
This article shows how to use scrapy_selenium to crawl web pages containing complex elements such as charts and maps, using Baidu Maps as an example of how to obtain the annotation (marker) information on a map. It assumes that readers are already familiar with the basic usage of Scrapy and Selenium and have installed the relevant dependency packages and browser drivers.
Text
Install scrapy_selenium
scrapy_selenium is an open source Python package, which can be installed through the pip command:
# Install scrapy_selenium
pip install scrapy_selenium
Create scrapy projects and crawlers
Use the scrapy command to create a project called mapspider:
# Create mapspider project
scrapy startproject mapspider
Enter the project directory and use the genspider command to create a crawler named baidumap:
# Enter the project directory
cd mapspider
# Create baidumap crawler
scrapy genspider baidumap baidu.com
Configuring the settings.py file
Open the settings.py file in the project directory and modify the following content:
# Used to locate the browser driver on PATH
from shutil import which

# Enable the scrapy_selenium downloader middleware alongside Scrapy's defaults
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}

# Selenium settings read by scrapy_selenium
SELENIUM_DRIVER_NAME = 'chrome'  # Use the Chrome browser
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')  # Path to chromedriver
SELENIUM_DRIVER_ARGUMENTS = ['--window-size=1920,1080']  # Set the window size to 1920x1080
# Timeouts are set per request: SeleniumRequest(wait_time=30, wait_until=...)

# Yiniu Cloud crawler proxy information
# (custom names; scrapy_selenium does not read them by itself)
PROXY_HOST = "www.16yun.cn"  # Proxy server address
PROXY_PORT = "3111"  # Proxy server port number
PROXY_USER = "16YUN"  # Proxy user name
PROXY_PASS = "16IP"  # Proxy password

# Set the log level to INFO to make the run status easier to follow
LOG_LEVEL = 'INFO'
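The PROXY_* names are not among the settings that the scrapy_selenium README documents, so the proxy values still have to reach the browser somehow. One option, sketched below on the assumption that the proxy accepts unauthenticated or IP-whitelisted connections, is to turn them into Chromium's --proxy-server command-line switch. The helper name chrome_proxy_argument is hypothetical.

```python
def chrome_proxy_argument(host, port, scheme="http"):
    """Build a Chromium --proxy-server switch from a host and port."""
    if not host or not str(port).isdigit():
        raise ValueError("a proxy host and a numeric port are required")
    return f"--proxy-server={scheme}://{host}:{port}"


# Values copied from the settings above.
PROXY_HOST = "www.16yun.cn"
PROXY_PORT = "3111"

browser_arguments = ["--headless", chrome_proxy_argument(PROXY_HOST, PROXY_PORT)]
print(browser_arguments[-1])  # prints --proxy-server=http://www.16yun.cn:3111
```

A list like browser_arguments could be assigned to scrapy_selenium's SELENIUM_DRIVER_ARGUMENTS setting. The switch carries no user name or password, so PROXY_USER and PROXY_PASS would need a separate mechanism.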
Write baidumap.py file
Open the spiders folder in the project directory, find the baidumap.py file, and modify the following content:
# Import the Scrapy and Selenium modules
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


# Define the baidumap crawler class, inheriting from scrapy.Spider
class BaidumapSpider(scrapy.Spider):
    # Set the crawler name
    name = 'baidumap'
    # Set the starting URL, taking Beijing as an example
    start_urls = ['https://map.baidu.com/?newmap=1&ie=utf-8&s=s&wd=Beijing']

    # scrapy_selenium only renders SeleniumRequest objects, so issue those
    # instead of the plain requests Scrapy would create from start_urls
    def start_requests(self):
        for url in self.start_urls:
            yield SeleniumRequest(url=url, callback=self.parse)

    # Define the parsing method, which receives the response
    def parse(self, response):
        # Get the Selenium driver object, used to operate the browser
        driver = response.meta['driver']
        # Wait up to 10 seconds for the map layer to become visible
        WebDriverWait(driver, 10).until(
            EC.visibility_of_element_located((By.CLASS_NAME, 'BMap_mask')))
        # Get all marker elements on the map as a list
        # (find_elements_by_class_name was removed in Selenium 4)
        markers = driver.find_elements(By.CLASS_NAME, 'BMap_Marker')
        # Traverse the list of marker elements
        for marker in markers:
            # Get the marker text, such as hotel, restaurant, etc.
            text = marker.get_attribute('textContent')
            # Get the coordinate position (x and y) of the marker
            position = marker.get_attribute('position')
            # Print the text and coordinate information of the marker
            print(text, position)
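WebDriverWait works by calling its condition repeatedly until the condition returns a truthy value or the timeout expires, in which case Selenium raises TimeoutException. The stand-alone sketch below imitates that polling loop with the standard library only, so it runs without a browser. The names wait_until and fake_map_loaded are hypothetical, and the built-in TimeoutError stands in for Selenium's exception.

```python
import time


def wait_until(condition, timeout, poll_interval=0.5):
    """Poll condition() until it is truthy; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Simulate a map layer that becomes visible on the third check.
checks = {"count": 0}


def fake_map_loaded():
    checks["count"] += 1
    return "BMap_mask" if checks["count"] >= 3 else None


element = wait_until(fake_map_loaded, timeout=2, poll_interval=0.01)
print(element, checks["count"])  # prints BMap_mask 3
```

Polling like this is why an explicit wait is usually preferable to a fixed sleep: the crawler continues as soon as the map layer appears and only pays the full timeout when the page fails to load.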
Run the crawler
In the project directory, use the scrapy command to run the crawler:
# Run baidumap crawler
scrapy crawl baidumap
Case
After running the crawler, you can see the following output on the console:
Hotel {'x': '116.403119', 'y': '39.914714'}
Restaurant {'x': '116.403119', 'y': '39.914714'}
Bank {'x': '116.403119', 'y': '39.914714'}
Supermarket {'x': '116.403119', 'y': '39.914714'}
Hospital {'x': '116.403119', 'y': '39.914714'}
School {'x': '116.403119', 'y': '39.914714'}
Bus station {'x': '116.403119', 'y': '39.914714'}
Subway station {'x': '116.403119', 'y': '39.914714'}
Parking lot {'x': '116.403119', 'y': '39.914714'}
Gas station {'x': '116.403119', 'y': '39.914714'}
...
These outputs are the annotation information on the crawled map, including text and coordinates. We can perform further analysis or applications based on this information.
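As one example of further processing, the sketch below turns lines in the printed format above into records with numeric coordinates. The parse_marker_line helper is hypothetical, and it assumes each line is a label followed by a dictionary literal with x and y keys, as in the sample output.

```python
import ast


def parse_marker_line(line):
    """Split 'Label {...}' into a record with float coordinates."""
    label, brace, rest = line.partition("{")
    if not brace:
        raise ValueError(f"no coordinate dictionary in line: {line!r}")
    # literal_eval only accepts Python literals, so it is safe on text input.
    position = ast.literal_eval(brace + rest)
    return {
        "label": label.strip(),
        "x": float(position["x"]),
        "y": float(position["y"]),
    }


sample = [
    "Hotel {'x': '116.403119', 'y': '39.914714'}",
    "Subway station {'x': '116.403119', 'y': '39.914714'}",
]
records = [parse_marker_line(line) for line in sample]
print(records[1])
```

In a real project the spider would more likely yield such records as Scrapy items than print and re-parse them, which would let Scrapy's feed exports write them to JSON or CSV directly.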
Conclusion
This article introduced how to use scrapy_selenium to crawl web pages containing complex elements such as charts and maps, using Baidu Maps as an example of how to obtain annotation information on a map. scrapy_selenium is a flexible tool for pages whose content is generated by JavaScript, and it makes data collection from such pages more convenient. We hope this article is helpful to you.