Using Scrapy + Selenium to automatically obtain the quality scores of your CSDN articles

Foreword

This article introduces how to use two powerful Python tools, Scrapy and Selenium, to automatically obtain the quality scores of your CSDN articles. We will walk through the Scrapy crawler framework in detail and show how to combine it with the Selenium browser-automation tool. Instead of manually checking each post, you can fetch and record every post's quality score and get a clearer picture of how your blog is performing.
CSDN article quality score query link

Basic knowledge about Scrapy:
Crawler framework Scrapy study notes-1

Crawler framework Scrapy study notes-2

Article directory

    • Foreword
    • 1. Installation of Scrapy
    • 2. Scrapy's workflow
      • 2.1 Create project
      • 2.2 Enter the project directory
      • 2.3 Generate Spider
      • 2.4 Adjust Spider
      • 2.5 Adjust Settings configuration
      • 2.6 Run Scrapy program
      • 2.7 Find URL
      • 2.8 View and process the response and hand it over to the pipeline
    • 3. Use Selenium to obtain the quality score
    • Summary

1. Installation of Scrapy

First, we need to install Scrapy. It is recommended to install it in a separate virtual environment; you can use either a virtualenv or a conda environment. Execute the following commands to install Scrapy and verify the installation:

# Install Scrapy (pinned versions, using the Tsinghua PyPI mirror)
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple scrapy==2.5.1
pip install pyopenssl==22.0.0
pip install cryptography==36.0.2
# Verify the installation
scrapy version
scrapy version --verbose

2. Scrapy’s workflow

Scrapy is a powerful Python crawler framework. We can use it by following these steps:

2.1 Create project

Create a Scrapy project using the following command:

scrapy startproject csdn

2.2 Enter the project directory

cd csdn

2.3 Generate Spider

Generate a Spider to define crawling rules:

scrapy genspider cs csdn.net

This will generate a spider file in which you can define your crawling rules.

2.4 Adjust Spider

In the generated spider file, you need to define the starting URLs (start_urls) and how to parse the response (usually in the parse method).

For example:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'cs'
    allowed_domains = ['csdn.net']
    start_urls = ['http://csdn.net/']
    
    def parse(self, response):
        pass

2.5 Adjust Settings configuration

In the settings.py file in the project directory, you can adjust various configuration options.

Here are some configuration options that need to be adjusted:

  • LOG_LEVEL: Set the log level. You can set it to “WARNING” to reduce log output.
LOG_LEVEL = "WARNING"
  • ROBOTSTXT_OBEY: Set whether to comply with the robots.txt protocol. If you do not want the crawler to obey it, set this to False.
ROBOTSTXT_OBEY = False
  • ITEM_PIPELINES: Enable pipelines to process the crawled data. You can configure different pipelines to process different types of data.
ITEM_PIPELINES = {
    'csdn.pipelines.CsdnPipeline': 300,
}
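
Put together, the relevant part of settings.py looks like this (a minimal sketch; the pipeline path matches the project generated above):

# settings.py: the options adjusted for this project
LOG_LEVEL = "WARNING"      # only show warnings and above
ROBOTSTXT_OBEY = False     # do not obey robots.txt for this crawl
ITEM_PIPELINES = {
    "csdn.pipelines.CsdnPipeline": 300,  # lower number = higher priority
}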

2.6 Run Scrapy program

Use the following command to run the Scrapy program:

scrapy crawl cs

Note that the name after crawl is the spider's name attribute (cs, as defined in the spider file), not the project name.

2.7 Find URL

In this step, we need to find the URL used to get the article data. The URL can be found by following these steps:

Open the browser's developer tools. First click the search button so that a request with a known search value is issued.
Then click through the requests under Fetch/XHR or JS one by one; once a request has loaded, you can search its contents in the panel on the left.

Eventually we find the request URL in the Headers tab. It turns out this link is universal, so you can skip the previous steps and use it directly; you only need to substitute your own username.

https://blog.csdn.net/community/home-api/v1/get-business-list?page=1&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username=qq_42531954

In addition, note the parameters page=1&size=20: when you have more than 20 articles, you can increase size. I have exactly 50 articles, so I set size=50.
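
If you prefer not to edit the query string by hand, a small helper can assemble it. This is just a sketch, with the endpoint and parameter names taken from the request observed above:

from urllib.parse import urlencode

def build_list_url(username, page=1, size=20):
    """Build the article-list URL for a given CSDN username."""
    params = {
        "page": page,
        "size": size,
        "businessType": "blog",
        "orderby": "",
        "noMore": "false",
        "year": "",
        "month": "",
        "username": username,
    }
    return ("https://blog.csdn.net/community/home-api/v1/get-business-list?"
            + urlencode(params))

print(build_list_url("qq_42531954", size=50))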

2.8 View and process the response and hand it over to the pipeline

import scrapy


class CsSpider(scrapy.Spider):
    name = 'cs'
    allowed_domains = ['csdn.net']
    start_urls = ['https://blog.csdn.net/community/home-api/v1/get-business-list?page=1&size=50&businessType=blog&orderby=&noMore=false&year=&month=&username=qq_42531954']

    def parse(self, response):
        print(response.text)

Inspecting response.text shows that the endpoint returns JSON whose top level is a dictionary.
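
Based on the fields the spider reads below, the payload looks roughly like this (illustrative only; the real response contains more keys):

# Rough shape of the JSON response; only data.list and the url/title
# fields of each entry are used by the spider
example_payload = {
    "data": {
        "list": [
            {"url": "https://blog.csdn.net/...", "title": "Some article"},
            # ... one entry per article
        ]
    }
}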
After a little processing, each article's url and title are yielded to the pipeline:

import scrapy


class CsSpider(scrapy.Spider):
    name = 'cs'
    allowed_domains = ['csdn.net']
    start_urls = ['https://blog.csdn.net/community/home-api/v1/get-business-list?page=1&size=50&businessType=blog&orderby=&noMore=false&year=&month=&username=qq_42531954']

    def parse(self, response):
        # print(response.text)
        data_list = response.json()["data"]["list"]
        for data in data_list:
            url = data["url"]
            title = data["title"]
            yield {  # a plain dict can serve as the item
                "url": url,
                "title": title
            }



csdn/csdn/pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class CsdnPipeline:
    def process_item(self, item, spider):
        print("I am a pipe, what I see is", item)
        with open("data.csv", mode="a", encoding="utf-8") as f:
            f.write(f"{item['url']},{item['title']}\\
")

        return item
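
Note that writing the line by hand produces a malformed CSV if a title contains a comma. A slightly more robust variant, sketched here using the csv module and Scrapy's open_spider/close_spider hooks, could look like this:

import csv


class CsdnPipeline:
    def open_spider(self, spider):
        # Open the file once per crawl rather than once per item
        self.file = open("data.csv", mode="a", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        # csv.writer quotes fields as needed, so commas in titles stay intact
        self.writer.writerow([item["url"], item["title"]])
        return item

    def close_spider(self, spider):
        self.file.close()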

3. Use Selenium to obtain the quality score

If you have a problem with your chromedriver, you can find the solution here:
Automated management of chromedriver-perfect solution to version mismatch problem

Here is a code example using Selenium:

import csv
import time
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

option = Options()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_argument('--disable-blink-features=AutomationControlled')
option.add_argument('--headless') # Enable headless mode

driver = webdriver.Chrome(options=option)
# List used to store CSV data
data = []
# Open the CSV file and read the content
with open('data.csv','r', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        # Each line includes two fields: link and title
        link, title = row
        # Add links and titles as tuples to the data list
        data.append((link, title))

# Use the browser to open the quality score query page
driver.get("https://www.csdn.net/qc")
for link, title in data:
    # Clear the search box, enter the article link, and click the query button
    search_box = driver.find_element(By.CSS_SELECTOR, ".el-input__inner")
    search_box.clear()
    search_box.send_keys(link)
    driver.find_element(By.CSS_SELECTOR, ".trends-input-box-btn").click()
    time.sleep(0.5)  # wait for the result to load
    score = driver.find_element(By.XPATH, '//*[@id="floor-csdn-index_850"]/div/div[1]/div/div[2]/p[1]').text
    print(title, score)
    time.sleep(1)  # avoid hitting the site too frequently
driver.quit()

This code is a Python script that uses the csv module and the Selenium library to automatically obtain the quality scores of CSDN articles. Below I will explain what each part of the code does:

  1. Import necessary libraries:
import csv
import time
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

This imports the csv module (used to read CSV files), the time module (used to add delays), and the Selenium-related modules and classes.

  2. Configure Selenium options:
option = Options()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_argument('--disable-blink-features=AutomationControlled')
option.add_argument('--headless') # Enable headless mode

This part of the code configures the Selenium options. Options is the class used to configure the Chrome browser; add_experimental_option sets an experimental Chrome option that helps avoid the browser being detected as an automated program. --disable-blink-features=AutomationControlled disables certain automation markers, while --headless enables headless mode, allowing the browser to run in the background without displaying a window.

  3. Create a Chrome WebDriver instance:
driver = webdriver.Chrome(options=option)

Here a Chrome WebDriver instance is created, using the above configuration options. WebDriver will be used to simulate browser operations.

  4. Create an empty list to store data:
data = []

This list will be used to store the data read from the CSV file.

  5. Open the CSV file and read the contents:
with open('data.csv', 'r', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        link, title = row
        data.append((link, title))

This part of the code uses the open function to open the CSV file named “data.csv” and uses csv.reader to read its contents. Each row contains two fields, link and title, and each row is appended to the data list created earlier.

  6. Use a browser to access the web page:
driver.get("https://www.csdn.net/qc")

This line of code uses WebDriver to open CSDN's quality score query page, ready to start looking up article quality scores.

  7. Iterate through each row of data from the CSV file:
for link, title in data:
    search_box = driver.find_element(By.CSS_SELECTOR, ".el-input__inner")
    search_box.clear()
    search_box.send_keys(link)
    driver.find_element(By.CSS_SELECTOR, ".trends-input-box-btn").click()
    time.sleep(0.5)
    score = driver.find_element(By.XPATH, '//*[@id="floor-csdn-index_850"]/div/div[1]/div/div[2]/p[1]').text
    print(title, score)
    time.sleep(1)

In this loop, we perform the following operations on each row of data in the CSV file:

  • Use driver.find_element to locate the search box via a CSS selector, clear it, and enter the article link.
  • Use driver.find_element to locate the search button and simulate a click.
  • Sleep 0.5 seconds to give the page time to load (a more robust explicit-wait alternative is sketched after this list).
  • Use an XPath expression to locate the quality-score element and extract its text content.
  • Print the article title and quality score.
  • Sleep 1 second so the site is not accessed too frequently.
  8. Finally, use driver.quit() to close the browser window and release resources.
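
As a more robust alternative to the fixed time.sleep delays, an explicit wait can poll until the score element is visible. This is a sketch that reuses the driver and the same XPath as in the script above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the score element instead of sleeping a fixed interval
score_element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located(
        (By.XPATH, '//*[@id="floor-csdn-index_850"]/div/div[1]/div/div[2]/p[1]')
    )
)
print(score_element.text)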

The main function of this code is to automatically access the CSDN website, search for article links, extract the quality score of the article, and print the results. This is useful for getting article quality scores in batches without having to manually view them one by one.

Summary

By studying this article, you have mastered the skills of using Scrapy and Selenium to automatically obtain the quality scores of your CSDN articles. This will help you better understand how your blog is performing and which articles are receiving the most attention and feedback.

At the same time, you also learned how to set up a Scrapy crawler project, configure crawler rules, and how to process data. These skills are very useful for performing various network data collection tasks.

I hope this article is helpful for your data collection and analysis work, so that you can better understand your readers. If you have any questions, feel free to ask; I will be happy to help.