Data collection: using Selenium to obtain CDN merchant ranking information from a website

A few words up front

  • I ran into this at work and organized it here for future reference
  • If anything is inaccurate or incomplete, please help me correct it

For everyone, there is only one true responsibility: to find yourself. Then hold to it in your heart for the rest of your life, wholeheartedly, and never stop. All other roads are incomplete; they are ways of escape, a cowardly return to the ideals of the masses, drifting with the tide, and fear of one's own heart. – Hermann Hesse, Demian

Collection process:

  1. Log in automatically (a sketch of saving the login cookies follows this list)
  2. Collect the data on the current page of the merchant ranking page
  3. Get the total number of pages and the element for the "next page" link
  4. Loop over the total number of pages, simulating a click on "next page" to collect each page of data
  5. Summarize the data
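The script below logs in by loading cookies from cookie.txt, saved from an earlier manual login. A minimal sketch of how such a file might be produced (the output path and the 60-second pause for logging in by hand are assumptions, not part of the original script):

from selenium import webdriver
import json
import time

# Open the site, log in manually in the browser window, then dump the session cookies
driver = webdriver.Chrome()
driver.get('https://cdn.chinaz.com/')
time.sleep(60)  # assumption: enough time to complete the login by hand

with open('cookie.txt', 'w', encoding='utf-8') as f:
    json.dump(driver.get_cookies(), f)

driver.quit()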
from seleniumwire import webdriver
import json
import time
from selenium.webdriver.common.by import By
import pandas as pd


# Automatic login
driver = webdriver.Chrome()
with open(r'C:\Users\The mountains and rivers are safe\Documents\GitHub\reptile_demo\demo\cookie.txt', 'r', encoding='utf-8') as f:
    cookies = json.load(f)

driver.get('https://cdn.chinaz.com/')
for cookie in cookies:
    driver.add_cookie(cookie)

driver.get('https://cdn.chinaz.com/')

time.sleep(6)
# CDN merchant ranking collection: https://cdn.chinaz.com/
CDN_Manufacturer = []
new_div_element = driver.find_element(By.CSS_SELECTOR, ".toplist-main")
div_elements = new_div_element.find_elements(By.CSS_SELECTOR, ".ullist")
for mdn_ms in div_elements:
    a_target = mdn_ms.find_element(By.CSS_SELECTOR, ".tohome")
    home_url = a_target.get_attribute('href')
    print(mdn_ms.text)
    text_temp = str(mdn_ms.text).split("\n")
    CDN_Manufacturer.append({
        "Company Name": text_temp[0],
        "Official website address": home_url,
        "Business qualification": text_temp[1],
        "Number of CDN websites": text_temp[2],
        "Site proportion": text_temp[3],
        "IP node": text_temp[4],
        "IP proportion": text_temp[5],
    })
sum_page = driver.find_element(By.XPATH, "//a[contains(@title, 'last page')]")
attribute_value = sum_page.get_attribute('val')

print(attribute_value)
for page in range(1, int(attribute_value)):
    next_page = driver.find_element(By.XPATH, "//a[contains(@title, 'next page')]")
    next_page.click()
    time.sleep(5)
    new_div_element = driver.find_element(By.CSS_SELECTOR, ".toplist-main")
    div_elements = new_div_element.find_elements(By.CSS_SELECTOR, ".ullist")
    for mdn_ms in div_elements:
        a_target = mdn_ms.find_element(By.CSS_SELECTOR, ".tohome")
        home_url = a_target.get_attribute('href')
        print(mdn_ms.text)
        text_temp = str(mdn_ms.text).split("\n")
        CDN_Manufacturer.append({
            "Company Name": text_temp[0],
            "Official website address": home_url,
            "Business qualification": text_temp[1],
            "Number of CDN websites": text_temp[2],
            "Site proportion": text_temp[3],
            "IP node": text_temp[4],
            "IP proportion": text_temp[5],
        })

df = pd.DataFrame(CDN_Manufacturer)

# Save the data as a CSV file
df.to_csv('CDN_Manufacturer.csv', index=False)

print("The data has been saved as a CSV file")


Printing the resulting DataFrame with pandas gives:

The data has been saved as a CSV file
       Company name Official website address... IP node IP proportion
0 Baidu Cloud Acceleration https://cloud.baidu.com/product/cdn.html... 92100 4.7%
1 Alibaba Cloud https://www.aliyun.com/... 238994 12.3%
2 Tencent Cloud https://cloud.tencent.com/... 57212 2.9%
3 Know Chuangyu Cloud Defense https://www.yunaq.com/jsl/... 16333 0.8%
4 Wangsu http://www.chinanetcenter.com/ ... 67683 3.5%
.. ... ... ... ... ... ...
67 Ruijiang CDN http://www.efly.cc/ ... 1 <0.1
68 Linking Cloud Painting Department http://www.linkingcloud.com/ ... 6 <0.1
69 Zhengzhou Longling http://www.lonlife.cn/ ... 1 <0.1
70 China United Network http://www.wocloud.cn/ ... 2 <0.1
71 Jituyun CDN https://www.jitucdn.com/ ... 9 <0.1

Data visualization

A simple visualization of the data with pyecharts:

def to_echarts(CDN_Manufacturer):
    from pyecharts.charts import Bar
    from pyecharts import options as opts
    # Built-in theme types can be viewed in pyecharts.globals.ThemeType
    from pyecharts.globals import ThemeType
    xaxis = [cdn["Company Name"] for cdn in CDN_Manufacturer][:10]
    yaxis1 = [cdn["Number of CDN websites"] for cdn in CDN_Manufacturer][:10]
    yaxis2 = [cdn["IP node"] for cdn in CDN_Manufacturer][:10]
    bar = (
        Bar(init_opts=opts.InitOpts(theme=ThemeType.LIGHT))
        .add_xaxis(xaxis)
        .add_yaxis("Number of CDN websites", yaxis1)
        .add_yaxis("IP node", yaxis2)
        .set_global_opts(title_opts=opts.TitleOpts(title="main title", subtitle="subtitle"))
    )
    bar.render()
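Note that the values scraped into CDN_Manufacturer are strings. A hedged usage sketch that converts the two count fields to integers before rendering (the int() conversion assumes those fields contain plain digits, which may not hold for every row):

# Convert the scraped string counts to integers before plotting
# (assumption: the fields contain plain digits such as "92100")
for cdn in CDN_Manufacturer:
    cdn["Number of CDN websites"] = int(cdn["Number of CDN websites"])
    cdn["IP node"] = int(cdn["IP node"])

to_echarts(CDN_Manufacturer)  # pyecharts writes render.html in the current directory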

You can also consider some other visualization tools

Matplotlib: Matplotlib is one of the most commonly used data visualization libraries in Python, providing a wide range of drawing functions, including line charts, scatter plots, bar charts, pie charts, etc. It can be used to create static charts and interactive graphs, and is highly customizable.

Seaborn: Seaborn is a statistical data visualization library based on Matplotlib, focusing on statistical charts and information visualization. Seaborn provides more advanced statistical chart types with better default styles and color themes.

Plotly: Plotly is an interactive visualization library for creating highly customizable charts and visualizations. Plotly provides a variety of chart types, including line charts, scatter charts, bar charts, heat maps, etc., and supports the creation of interactive dashboards and visualization applications.

Bokeh: Bokeh is a library for creating interactive charts and visualizations, with powerful drawing capabilities and cross-platform support. Bokeh can generate HTML, JavaScript, and WebGL, enabling cross-browser and cross-device visualizations.

Altair: Altair is a declarative data visualization library that uses simple Python syntax to generate visual charts. Altair is based on the Vega-Lite specification, with a clear syntax and a concise API.
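For example, a minimal Matplotlib sketch of the same top-10 comparison, reading the CSV produced above (the figure size, label rotation, and the astype(int) conversion are assumptions for illustration):

import matplotlib.pyplot as plt
import pandas as pd

# Load the CSV produced earlier and plot the top 10 merchants by number of CDN websites
df = pd.read_csv('CDN_Manufacturer.csv')
top10 = df.head(10)

plt.figure(figsize=(10, 5))
plt.bar(top10["Company Name"], top10["Number of CDN websites"].astype(int))
plt.xticks(rotation=45, ha='right')
plt.ylabel("Number of CDN websites")
plt.title("Top 10 CDN merchants")
plt.tight_layout()
plt.show()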

Parts of this article reference other blog posts.

The copyright of the content at the referenced links belongs to the original authors; if there is any infringement, please report it.

© 2018-2023 [email protected], All rights reserved. Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 4.0)
