Data collection: using Selenium to obtain CDN merchant ranking information from a website

A few words up front

  • I ran into this at work and organized it here for future reference
  • If anything is inaccurate or incomplete, please help me correct it

For everyone, there is only one true responsibility: to find yourself. Then hold to it in your heart for the rest of your life, wholeheartedly, and never stop. All other roads are incomplete; they are ways of escape, a cowardly return to the ideals of the masses, drifting with the tide, and fear of one's own heart. – Hermann Hesse, Demian

Collection process:

  1. Log in automatically (a sketch of saving the login cookies follows this list)
  2. Collect the data on the current page of the merchant ranking page
  3. Get the total number of pages and the element for the "next page" link
  4. Loop over the total number of pages, simulating a click on "next page" to collect each page of data
  5. Summarize the data
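The script below logs in by loading cookies from cookie.txt, saved from an earlier manual login. A minimal sketch of how such a file might be produced (the output path and the 60-second pause for logging in by hand are assumptions, not part of the original script):

from selenium import webdriver
import json
import time

# Open the site, log in manually in the browser window, then dump the session cookies
driver = webdriver.Chrome()
driver.get('https://cdn.chinaz.com/')
time.sleep(60)  # assumption: enough time to complete the login by hand

with open('cookie.txt', 'w', encoding='utf-8') as f:
    json.dump(driver.get_cookies(), f)

driver.quit()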
from seleniumwire import webdriver
import json
import time
from selenium.webdriver.common.by import By
import pandas as pd


# Automatic login
driver = webdriver.Chrome()
with open(r'C:\Users\The mountains and rivers are safe\Documents\GitHub\reptile_demo\demo\cookie.txt', 'r', encoding='utf-8') as f:
    cookies = json.load(f)

driver.get('https://cdn.chinaz.com/')
for cookie in cookies:
    driver.add_cookie(cookie)

driver.get('https://cdn.chinaz.com/')

time.sleep(6)
# CDN merchant ranking collection: https://cdn.chinaz.com/
CDN_Manufacturer = []
new_div_element = driver.find_element(By.CSS_SELECTOR, ".toplist-main")
div_elements = new_div_element.find_elements(By.CSS_SELECTOR, ".ullist")
for mdn_ms in div_elements:
    a_target = mdn_ms.find_element(By.CSS_SELECTOR, ".tohome")
    home_url = a_target.get_attribute('href')
    print(mdn_ms.text)
    text_temp = str(mdn_ms.text).split("\n")
    CDN_Manufacturer.append({
        "Company Name": text_temp[0],
        "Official website address": home_url,
        "Business qualification": text_temp[1],
        "Number of CDN websites": text_temp[2],
        "Site proportion": text_temp[3],
        "IP node": text_temp[4],
        "IP proportion": text_temp[5],
    })
sum_page = driver.find_element(By.XPATH, "//a[contains(@title, 'last page')]")
attribute_value = sum_page.get_attribute('val')

print(attribute_value)
for page in range(1, int(attribute_value)):
    next_page = driver.find_element(By.XPATH, "//a[contains(@title, 'next page')]")
    next_page.click()
    time.sleep(5)
    new_div_element = driver.find_element(By.CSS_SELECTOR, ".toplist-main")
    div_elements = new_div_element.find_elements(By.CSS_SELECTOR, ".ullist")
    for mdn_ms in div_elements:
        a_target = mdn_ms.find_element(By.CSS_SELECTOR, ".tohome")
        home_url = a_target.get_attribute('href')
        print(mdn_ms.text)
        text_temp = str(mdn_ms.text).split("\n")
        CDN_Manufacturer.append({
            "Company Name": text_temp[0],
            "Official website address": home_url,
            "Business qualification": text_temp[1],
            "Number of CDN websites": text_temp[2],
            "Site proportion": text_temp[3],
            "IP node": text_temp[4],
            "IP proportion": text_temp[5],
        })

df = pd.DataFrame(CDN_Manufacturer)

# Save the data as a CSV file
df.to_csv('CDN_Manufacturer.csv', index=False)

print("The data has been saved as a CSV file")


Printing the resulting DataFrame with pandas gives:

The data has been saved as a CSV file
       Company name Official website address... IP node IP proportion
0 Baidu Cloud Acceleration https://cloud.baidu.com/product/cdn.html... 92100 4.7%
1 Alibaba Cloud https://www.aliyun.com/... 238994 12.3%
2 Tencent Cloud https://cloud.tencent.com/... 57212 2.9%
3 Know Chuangyu Cloud Defense https://www.yunaq.com/jsl/... 16333 0.8%
4 Wangsu http://www.chinanetcenter.com/ ... 67683 3.5%
.. ... ... ... ... ... ...
67 Ruijiang CDN http://www.efly.cc/ ... 1 <0.1
68 Linking Cloud Painting Department http://www.linkingcloud.com/ ... 6 <0.1
69 Zhengzhou Longling http://www.lonlife.cn/ ... 1 <0.1
70 China United Network http://www.wocloud.cn/ ... 2 <0.1
71 Jituyun CDN https://www.jitucdn.com/ ... 9 <0.1

Data visualization

A simple visualization of the data with pyecharts:

def to_echarts(CDN_Manufacturer):
    from pyecharts.charts import Bar
    from pyecharts import options as opts
    # Built-in theme types can be viewed in pyecharts.globals.ThemeType
    from pyecharts.globals import ThemeType
    xaxis = [cdn["Company Name"] for cdn in CDN_Manufacturer][:10]
    yaxis1 = [cdn["Number of CDN websites"] for cdn in CDN_Manufacturer][:10]
    yaxis2 = [cdn["IP node"] for cdn in CDN_Manufacturer][:10]
    bar = (
        Bar(init_opts=opts.InitOpts(theme=ThemeType.LIGHT))
        .add_xaxis(xaxis)
        .add_yaxis("Number of CDN websites", yaxis1)
        .add_yaxis("IP node", yaxis2)
        .set_global_opts(title_opts=opts.TitleOpts(title="main title", subtitle="subtitle"))
    )
    bar.render()
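Note that the values scraped into CDN_Manufacturer are strings. A hedged usage sketch that converts the two count fields to integers before rendering (the int() conversion assumes those fields contain plain digits, which may not hold for every row):

# Convert the scraped string counts to integers before plotting
# (assumption: the fields contain plain digits such as "92100")
for cdn in CDN_Manufacturer:
    cdn["Number of CDN websites"] = int(cdn["Number of CDN websites"])
    cdn["IP node"] = int(cdn["IP node"])

to_echarts(CDN_Manufacturer)  # pyecharts writes render.html in the current directory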

You can also consider some other visualization tools

Matplotlib: Matplotlib is one of the most commonly used data visualization libraries in Python, providing a wide range of drawing functions, including line charts, scatter plots, bar charts, pie charts, etc. It can be used to create static charts and interactive graphs, and is highly customizable.

Seaborn: Seaborn is a statistical data visualization library based on Matplotlib, focusing on statistical charts and information visualization. Seaborn provides more advanced statistical chart types with better default styles and color themes.

Plotly: Plotly is an interactive visualization library for creating highly customizable charts and visualizations. Plotly provides a variety of chart types, including line charts, scatter charts, bar charts, heat maps, etc., and supports the creation of interactive dashboards and visualization applications.

Bokeh: Bokeh is a library for creating interactive charts and visualizations, with powerful drawing capabilities and cross-platform support. Bokeh can generate HTML, JavaScript, and WebGL, enabling cross-browser and cross-device visualizations.

Altair: Altair is a declarative data visualization library that uses simple Python syntax to generate visual charts. Altair is based on the Vega-Lite specification, with a clear syntax and a concise API.
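For example, a minimal Matplotlib sketch of the same top-10 comparison, reading the CSV produced above (the figure size, label rotation, and the astype(int) conversion are assumptions for illustration):

import matplotlib.pyplot as plt
import pandas as pd

# Load the CSV produced earlier and plot the top 10 merchants by number of CDN websites
df = pd.read_csv('CDN_Manufacturer.csv')
top10 = df.head(10)

plt.figure(figsize=(10, 5))
plt.bar(top10["Company Name"], top10["Number of CDN websites"].astype(int))
plt.xticks(rotation=45, ha='right')
plt.ylabel("Number of CDN websites")
plt.title("Top 10 CDN merchants")
plt.tight_layout()
plt.show()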

Parts of this article reference other blog posts.

The copyright of the content at the referenced links belongs to the original authors; if there is any infringement, please report it.

© 2018-2023 [email protected], All rights reserved. Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 4.0)
