Use scrapy_selenium to obtain map information

A web crawler is a program that automatically retrieves web content. It can be used in scenarios such as data collection, information analysis, and website monitoring. However, the content of some web pages is not static but is generated dynamically by JavaScript, such as charts, maps, and other complex elements. These elements often require user interaction before they are displayed, or take some time to load. Traditional tools such as requests or urllib cannot obtain the content of these elements, because they only fetch the source code of the web page and cannot execute JavaScript.
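To make this limitation concrete, here is a minimal sketch that uses only the Python standard library. The page source, the "map" id, and the helper names are made up for illustration. The point is that a static client sees the empty container and the script text, but never the content that the script would produce in a browser.

```python
from html.parser import HTMLParser

# Hypothetical page source, as a static HTTP client would receive it.
RAW_HTML = """
<html><body>
<div id="map"></div>
<script>document.getElementById("map").textContent = "Hotel";</script>
</body></html>
"""


class MapTextCollector(HTMLParser):
    """Collect the text found inside <div id="map">."""

    def __init__(self):
        super().__init__()
        self.inside_map = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "map":
            self.inside_map = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside_map = False

    def handle_data(self, data):
        # Script text arrives here too, but it is outside the map div.
        if self.inside_map:
            self.text.append(data)


def map_text(html):
    """Return the text inside the map container of the given HTML."""
    parser = MapTextCollector()
    parser.feed(html)
    return "".join(parser.text).strip()


# The script was never executed, so the container is still empty.
print(repr(map_text(RAW_HTML)))
```

A browser would run the script and fill the container, which is the gap that selenium closes.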

To solve this problem, we can use scrapy_selenium, which combines two powerful libraries, scrapy and selenium, to crawl dynamic web pages. scrapy is a Python crawling framework that makes it easy to organize crawlers in a project and provides a rich set of middleware and pipeline components. selenium is a browser automation tool, widely used for automated testing, that can simulate browser behavior, such as opening a web page, clicking a button, or entering text, and return the rendered result of the page. By plugging selenium into scrapy as a downloader middleware, we can let scrapy load pages in a real browser and obtain the dynamically generated content.
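The idea of a downloader middleware can be illustrated with a toy model that uses neither scrapy nor selenium. Every class and name below is hypothetical and only mimics the overall flow as we understand it: ordinary requests pass through untouched, while requests marked for rendering are loaded in the browser, and the driver is attached to the request's meta.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical stand-in for a crawler request."""
    url: str
    render: bool = False  # True means "load this page in a browser"
    meta: dict = field(default_factory=dict)


class FakeDriver:
    """Hypothetical stand-in for a selenium driver; returns canned HTML."""

    def __init__(self):
        self.page_source = ""

    def get(self, url):
        # A real browser would execute the page's JavaScript here.
        self.page_source = f"<html><body>rendered {url}</body></html>"


class RenderingMiddleware:
    """Toy downloader middleware that renders only marked requests."""

    def __init__(self, driver):
        self.driver = driver

    def process_request(self, request):
        if not request.render:
            return None  # fall through to the normal downloader
        self.driver.get(request.url)
        request.meta["driver"] = self.driver
        return self.driver.page_source


middleware = RenderingMiddleware(FakeDriver())
plain = Request("https://example.com/static")
dynamic = Request("https://example.com/map", render=True)
print(middleware.process_request(plain))
print(middleware.process_request(dynamic))
```

In scrapy_selenium the marked requests are SeleniumRequest objects, which is why a callback can read the browser from response.meta['driver'].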

Overview

This article introduces how to use scrapy_selenium to crawl web pages containing complex elements such as charts and maps, and takes Baidu Map as an example to show how to obtain marker information on the map. It assumes that readers are already familiar with the basic usage of scrapy and selenium, and have installed the relevant dependency packages and browser drivers.

Main text

Install scrapy_selenium

scrapy_selenium is an open source Python package that can be installed with the pip command:

# Install scrapy_selenium
pip install scrapy_selenium

Create scrapy projects and crawlers

Use the scrapy command to create a project called mapspider:

# Create the mapspider project
scrapy startproject mapspider

Enter the project directory and use the genspider command to create a crawler named baidumap:

# Enter the project directory
cd mapspider
# Create the baidumap crawler
scrapy genspider baidumap baidu.com

Configure the settings.py file

Open the settings.py file in the project directory and modify the following content:

# Used to locate the chromedriver executable on PATH
from shutil import which

# Enable the scrapy_selenium downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}

# Set selenium related parameters such as the browser type and window size.
# scrapy_selenium reads the SELENIUM_DRIVER_* settings; wait timeouts are set
# per request (wait_time on SeleniumRequest) rather than in settings.py.
SELENIUM_DRIVER_NAME = 'chrome'  # Use the chrome browser
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')  # None if chromedriver is not on PATH
SELENIUM_DRIVER_ARGUMENTS = ['--window-size=1920,1080']  # Set the window size to 1920x1080

# Yiniu Cloud crawler proxy information
# (custom values; scrapy_selenium does not read them by itself)
PROXY_HOST = "www.16yun.cn"  # Proxy server address
PROXY_PORT = "3111"  # Proxy server port number
PROXY_USER = "16YUN"  # Proxy user name
PROXY_PASS = "16IP"  # Proxy password

# Set the log level to INFO to make the run status easier to follow
LOG_LEVEL = 'INFO'
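The PROXY_* values are plain settings, so something still has to hand them to the browser. One possible approach, sketched below, is to turn the host and port into Chrome's --proxy-server command-line switch and add it to the list of driver arguments (SELENIUM_DRIVER_ARGUMENTS in scrapy_selenium). The helper name chrome_proxy_argument is hypothetical. The sketch deliberately leaves out PROXY_USER and PROXY_PASS: as far as we know, the switch only takes a server address, so an authenticated proxy needs a separate mechanism that is not covered here.

```python
def chrome_proxy_argument(host, port, scheme="http"):
    """Build a --proxy-server switch for Chrome from a host and a port."""
    return f"--proxy-server={scheme}://{host}:{port}"


# Values copied from the settings shown in this article
PROXY_HOST = "www.16yun.cn"
PROXY_PORT = "3111"

# Candidate value for SELENIUM_DRIVER_ARGUMENTS in settings.py
driver_arguments = [
    "--window-size=1920,1080",
    chrome_proxy_argument(PROXY_HOST, PROXY_PORT),
]
print(driver_arguments)
```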

Write the baidumap.py file

Open the spiders folder in the project directory, find the baidumap.py file, and modify the following content:

# Import the scrapy and selenium modules used by the crawler
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


# Define the baidumap crawler class, which inherits from scrapy.Spider
class BaidumapSpider(scrapy.Spider):
    # Set the crawler name
    name = 'baidumap'
    # Set the starting URL, taking Beijing as an example
    start_urls = ['https://map.baidu.com/?newmap=1&ie=utf-8&s=s&wd=Beijing']

    def start_requests(self):
        # Only SeleniumRequest objects are loaded in the browser by the
        # middleware, so the start URLs must not be sent as plain requests
        for url in self.start_urls:
            yield SeleniumRequest(url=url, callback=self.parse)

    # Define the parsing method, which receives the response
    def parse(self, response):
        # Get the selenium driver object, used to operate the browser
        driver = response.meta['driver']
        # Wait up to 10 seconds for the map layer to become visible
        WebDriverWait(driver, 10).until(
            EC.visibility_of_element_located((By.CLASS_NAME, 'BMap_mask')))
        # Get all marker elements on the map as a list
        # (find_elements with By works in both Selenium 3 and Selenium 4)
        markers = driver.find_elements(By.CLASS_NAME, 'BMap_Marker')
        # Traverse the list of marker elements
        for marker in markers:
            # Get the marker text, such as hotel, restaurant, etc.
            text = marker.get_attribute('textContent')
            # Get the coordinate position of the marker from its position attribute
            position = marker.get_attribute('position')
            # Print the text and coordinate information of the marker
            print(text, position)
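The start URL hard-codes Beijing in its wd parameter. To crawl other places, the query string could be built with urllib.parse.urlencode, as in this small sketch. The function name build_map_url is hypothetical, and the parameter set is simply the one that appears in the start URL; we have not verified what each parameter means to Baidu Map.

```python
from urllib.parse import urlencode

BASE_URL = "https://map.baidu.com/"


def build_map_url(keyword):
    """Return a Baidu Map search URL for the given keyword."""
    params = {"newmap": 1, "ie": "utf-8", "s": "s", "wd": keyword}
    # urlencode percent-encodes the values, so non-ASCII keywords are safe
    return BASE_URL + "?" + urlencode(params)


print(build_map_url("Beijing"))
print(build_map_url("北京"))
```

The returned strings could then be used in start_urls in place of the literal URL.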

Run the crawler

In the project directory, use the scrapy command to run the crawler:

# Run the baidumap crawler
scrapy crawl baidumap

Example

After running the crawler, you can see the following output on the console:

Hotel {'x': '116.403119', 'y': '39.914714'}
Restaurant {'x': '116.403119', 'y': '39.914714'}
Bank {'x': '116.403119', 'y': '39.914714'}
Supermarket {'x': '116.403119', 'y': '39.914714'}
Hospital {'x': '116.403119', 'y': '39.914714'}
School {'x': '116.403119', 'y': '39.914714'}
Bus station {'x': '116.403119', 'y': '39.914714'}
Subway station {'x': '116.403119', 'y': '39.914714'}
Parking lot {'x': '116.403119', 'y': '39.914714'}
Gas station {'x': '116.403119', 'y': '39.914714'}
...

This output is the marker information crawled from the map, including the text and coordinates of each marker. We can perform further analysis or build applications based on this information.
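As one example of such further processing, the printed lines can be turned into structured records. The sketch below assumes each line has exactly the shape shown in the output above: a label followed by a Python-style dictionary with string x and y values. The function name parse_marker_line is hypothetical.

```python
import ast


def parse_marker_line(line):
    """Split 'Label {...}' into a record with float coordinates."""
    label, brace, rest = line.partition("{")
    if not brace:
        raise ValueError(f"no coordinate dictionary in line: {line!r}")
    # literal_eval safely parses the dictionary text without running code
    position = ast.literal_eval(brace + rest)
    return {
        "name": label.strip(),
        "x": float(position["x"]),
        "y": float(position["y"]),
    }


record = parse_marker_line("Bus station {'x': '116.403119', 'y': '39.914714'}")
print(record)
```

Records in this form can be written to CSV or a database, or fed to a scrapy item pipeline.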

Conclusion

This article introduced how to use scrapy_selenium to crawl web pages containing complex elements such as charts and maps, taking Baidu Map as an example to show how to obtain marker information on the map. scrapy_selenium is a flexible tool that can handle the crawling needs of many dynamic web pages and makes data collection more convenient. We hope this article is helpful to you.
