Using selenium to search CNKI literature by [keyword]

Hello everyone, I am Xianyu

Xianyu has written several articles about CNKI crawlers before, and the feedback has been very good. Still, Xianyu can't help wanting to complain a little.

Some friends didn't finish reading the article, or even the code, before asking me "Why can I only crawl this many pieces of literature information?" (anyone who has read the code will find that the papers_need variable sets the number of articles to crawl) or "Why can't I crawl other documents? I want to crawl XXX documents" (because the code is written to search through [Document Sources] in CNKI Advanced Search). Others simply paste the error message to me and ask what happened.

I think that when you find other people's code on the Internet, you shouldn't just copy and paste it. You should adapt it to your own local environment. For example, when locating an element by XPath, the path of the same element may differ between computers or browsers: a path that runs fine for me locally may raise an error on your machine.

When looking at other people’s code, it’s a good idea to first figure out:

What is the author's approach?
Why did they write it this way?
What is the logic behind the code?

Take these CNKI crawler articles of mine as examples:

Why use selenium to crawl?
How to analyze web pages? How to locate elements? (XPath, CSS selectors, etc.)
How to simulate human operation of the browser (mouse movement, clicks, scrolling, etc.) through selenium?

Back to the topic: yesterday Xianyu received a private message from a fan asking whether it is possible to search for literature by [keyword].

Today's article focuses on how to analyze the page structure and then use selenium to search CNKI for documents by keyword. Crawling the documents returned by the search is not covered in detail here, because previous articles have already described it.

Requirements analysis

Let's first look at how you would search for documents by keyword manually.

CNKI: China National Knowledge Infrastructure (cnki.net)

First, we open the website and click [Advanced Search] (you can also click the [Theme] drop-down in the search box directly).

Then we click [Theme] and select [Keywords].

Enter the keywords you want to search for (for example: digital inclusive finance) and click [Search]

Web page analysis & element positioning

Following the requirements analysis above, we can analyze the web page and locate the corresponding elements.

The first is [Advanced Search]. The advanced search page has its own link (https://kns.cnki.net/kns8/AdvSearch, the address opened in the code below), so opening it directly saves a step.

Then we need to click [Theme] before the drop-down box appears. When analyzing the page, I found that when the drop-down box appears, the style attribute of its tag (the div with class sort-list that the code below locates) changes from "display: none;" to "display: block;".

After the drop-down box appears, we need to locate the [Keywords] option.

# Keyword Xpath path or CSS selector
//*[@id="gradetxt"]/dd[1]/div[2]/div[1]/div[2]/ul/li[3]

li[data-val="KY"]

Then find the XPath of the [Search Box]. This is an input element that receives the text entered by the user.

# Input box
//*[@id="gradetxt"]/dd[1]/div[2]/input

After passing the data into the input box, we need to click the [Search] button below

# Search
/html/body/div[2]/div/div[2]/div/div[1]/div[1]/div[2]/div[2]/input

After clicking search, we read the [number of documents].

# Number of documents
/html/body/div[3]/div[2]/div[2]/div[2]/form/div/div[1]/div[1]/span[1]/em
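
Since these paths may differ from one machine or browser to another, it can help to keep them in one place so they are easy to adjust. The following is only a sketch: LOCATORS, its keys, and the locator helper are hypothetical names that do not appear in the original code, and the strategy strings "xpath" and "css selector" are the values behind selenium's By.XPATH and By.CSS_SELECTOR.

```python
# Hypothetical registry of the locators found above; adjust the values if
# the page structure differs in your browser.
XPATH = "xpath"                # same value as selenium's By.XPATH
CSS_SELECTOR = "css selector"  # same value as selenium's By.CSS_SELECTOR

LOCATORS = {
    "keyword_option": (CSS_SELECTOR, 'li[data-val="KY"]'),
    "search_input": (XPATH, '//*[@id="gradetxt"]/dd[1]/div[2]/input'),
    "search_button": (
        XPATH,
        "/html/body/div[2]/div/div[2]/div/div[1]/div[1]/div[2]/div[2]/input",
    ),
    "result_count": (
        XPATH,
        "/html/body/div[3]/div[2]/div[2]/div[2]/form/div/div[1]/div[1]/span[1]/em",
    ),
}


def locator(name):
    """Return the (strategy, value) pair for a named element."""
    try:
        return LOCATORS[name]
    except KeyError:
        raise KeyError(f"unknown locator: {name!r}") from None
```

With a selenium driver, a pair can be unpacked directly, for example driver.find_element(*locator("search_input")), or passed as-is to an expected condition such as EC.presence_of_element_located(locator("result_count")).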

Code implementation

Selenium is a tool for automated web testing. It works by driving a real browser and simulating the operations a user would perform (navigating, typing, clicking, scrolling, and so on), so the page is rendered just as it would be for a person. It supports multiple browsers.

In crawlers, Selenium is mainly used to solve problems such as the requests library being unable to execute JavaScript.

Import related libraries

import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.action_chains import ActionChains

Create browser object

Here I am using the Edge browser

def webserver():
    # Make get() return immediately instead of waiting for the page to finish loading
    desired_capabilities = DesiredCapabilities.EDGE
    desired_capabilities["pageLoadStrategy"] = "none"

    # Set up the Edge driver options
    options = webdriver.EdgeOptions()

    # Do not load images, to speed up page loading
    options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})

    # Create the Edge driver
    driver = webdriver.Edge(options=options)
    return driver

Crawl the web

The logic is not difficult: locate each element first, then use selenium to simulate the manual clicks in the browser.

First open the page and wait a second or two for the page to fully load.

    driver.get("https://kns.cnki.net/kns8/AdvSearch")
    time.sleep(2)

Next, display the drop-down box. As mentioned earlier, the drop-down box appears when the style attribute of its tag changes from "display: none;" to "display: block;".

Here we change that style attribute by executing a JS script.

    # Modify the style attribute to display the drop-down box
    opt = driver.find_element(By.CSS_SELECTOR, 'div.sort-list')  # Locate the drop-down box
    # Execute a JS script to modify the attribute; arguments[0] is the first argument passed after the script (opt)
    driver.execute_script("arguments[0].setAttribute('style', 'display: block;')", opt)
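
A slightly narrower variant is to set only the display property instead of replacing the whole style attribute, so that any other inline styles on the element survive. This is a sketch rather than what the code in this article does: show_element is a hypothetical helper name, and it only assumes that driver offers selenium's execute_script method.

```python
def show_element(driver, element):
    """Make a hidden element visible by setting its inline display property.

    `driver` is expected to be a selenium WebDriver (anything with an
    execute_script method works); `element` is the WebElement to reveal.
    """
    # arguments[0] is the first extra argument passed to execute_script,
    # which here is `element`.
    driver.execute_script("arguments[0].style.display = 'block';", element)
    return element
```

In the snippet above it would be called as show_element(driver, opt) once opt has been located.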

After the drop-down box is displayed, we need to click [Keywords] to switch to keyword search.

One thing to note: during testing I found that the drop-down box sometimes did not load completely. When that happened, the code reported an error of the form "Element ... is not clickable at point (189, 249)", and the program could not click [Keywords].

And I also found that if the loading is incomplete, you need to move the mouse to the drop-down box to let the drop-down box load completely. So here I used ActionChains in selenium to simulate mouse operations.

When automating with Selenium, you sometimes need to simulate mouse operations such as clicking, double-clicking, right-clicking, and dragging.

Selenium provides a class for handling such events: ActionChains.

Another thing to note: if the mouse is moved only to [Keywords], the drop-down box still does not load correctly. It is best to move to the bottom of the drop-down box or to an element after [Keywords]. Here I move to [Corresponding author].

# [Corresponding author] Positioning
/html/body/div[2]/div/div[2]/div/div[2]/div[1]/div[1]/div[2]/ul/li[8]

li[data-val="RP"]

After the drop-down box is loaded, locate [Keywords] and click

    # Move the mouse to [Corresponding author] in the drop-down box
    ActionChains(driver).move_to_element(driver.find_element(By.CSS_SELECTOR, 'li[data-val="RP"]')).perform()

    # Find the [Keywords] option and click it
    WebDriverWait(driver, 100).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, 'li[data-val="KY"]'))).click()
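
If the "not clickable" error still shows up now and then, one extra guard is to retry the hover-and-click step a few times. The helper below is a generic sketch and not part of the original code: retry and its parameters are hypothetical names, and it is plain Python with no selenium dependency.

```python
import time


def retry(action, exceptions=(Exception,), attempts=3, delay=0.5):
    """Call `action()` until it succeeds or `attempts` calls have failed.

    Only the exception types listed in `exceptions` trigger a retry; the
    last failure is re-raised once the attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)  # give the page a moment before trying again
```

Here action would be a small function that performs the mouse move and the click shown above. It is better to narrow exceptions to the error actually observed, for example ElementClickInterceptedException from selenium.common.exceptions, so that unrelated failures are not silently retried.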

Locate the search box and enter the keywords we want to search for.

    # Type the keyword into the search box
    WebDriverWait(driver, 100).until(
        EC.presence_of_element_located((By.XPATH, '''//*[@id="gradetxt"]/dd[1]/div[2]/input'''))
    ).send_keys(keyword)

    # Click [Search]
    WebDriverWait(driver, 100).until(
        EC.presence_of_element_located((By.XPATH, "/html/body/div[2]/div/div[2]/div/div[1]/div[1]/div[2]/div[2]/input"))
    ).click()

After the search results come out, locate the [number of documents] element and read its text to get the count.

    # Get the total number of documents
    res_unm = WebDriverWait(driver, 100).until(EC.presence_of_element_located(
        (By.XPATH, "/html/body/div[3]/div[2]/div[2]/div[2]/form/div/div[1]/div[1]/span[1]/em"))
    ).text
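
The text read here may contain thousands separators (for example "1,234"), so it needs cleaning before it can be used as a number. The sketch below shows one way to do that and to derive the number of result pages, assuming 20 results per page as in the complete code. parse_count and page_count are hypothetical helper names.

```python
def parse_count(text):
    """Convert a count such as '1,234' to an int, dropping thousands separators."""
    return int(text.strip().replace(",", ""))


def page_count(total, per_page=20):
    """Number of pages needed to show `total` results (ceiling division)."""
    if total < 0 or per_page < 1:
        raise ValueError("total must be >= 0 and per_page must be >= 1")
    return (total + per_page - 1) // per_page
```

Ceiling division avoids counting an extra page when the total is an exact multiple of the page size: 40 results fit on 2 pages, while 41 need 3.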

The complete code is as follows:

import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.action_chains import ActionChains


def webserver():
    # Make get() return immediately instead of waiting for the page to finish loading
    desired_capabilities = DesiredCapabilities.EDGE
    desired_capabilities["pageLoadStrategy"] = "none"

    # Set up the Microsoft driver environment
    options = webdriver.EdgeOptions()
    # Do not load images, to speed up page loading
    options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})

    # Create a Microsoft driver
    driver = webdriver.Edge(options=options)

    return driver


def open_page(driver, keyword):
    # Open the page and wait two seconds
    driver.get("https://kns.cnki.net/kns8/AdvSearch")
    time.sleep(2)

    # Modify the properties so that the drop-down box displays
    opt = driver.find_element(By.CSS_SELECTOR, 'div.sort-list')  # Locate the drop-down box
    # Execute a JS script to modify the attribute; arguments[0] is the first argument passed after the script (opt)
    driver.execute_script("arguments[0].setAttribute('style', 'display: block;')", opt)

    # Move the mouse to [Corresponding Author] in the drop-down box
    ActionChains(driver).move_to_element(driver.find_element(By.CSS_SELECTOR, 'li[data-val="RP"]')).perform()

    # Find the [Keyword] option and click
    WebDriverWait(driver, 100).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, 'li[data-val="KY"]'))).click()

    # Pass in keywords
    WebDriverWait(driver, 100).until(
        EC.presence_of_element_located((By.XPATH, '''//*[@id="gradetxt"]/dd[1]/div[2]/input'''))
    ).send_keys(keyword)

    # Click to search
    WebDriverWait(driver, 100).until(
        EC.presence_of_element_located((By.XPATH, "/html/body/div[2]/div/div[2]/div/div[1]/div[1]/div[2]/div[2]/input"))
    ).click()

    # Click to switch to Chinese documents
    WebDriverWait(driver, 100).until(
        EC.presence_of_element_located((By.XPATH, "/html/body/div[3]/div[1]/div/div/div/a[1]"))
    ).click()

    # Get the total number of documents and pages
    res_unm = WebDriverWait(driver, 100).until(EC.presence_of_element_located(
        (By.XPATH, "/html/body/div[3]/div[2]/div[2]/div[2]/form/div/div[1]/div[1]/span[1]/em"))
    ).text

    # Remove thousands separators
    res_unm = int(res_unm.replace(",", ''))
    # 20 results per page; ceiling division avoids an extra page on exact multiples
    page_unm = (res_unm + 19) // 20
    print(f"A total of {res_unm} results were found, {page_unm} pages.")


if __name__ == '__main__':
    keyword = "digital inclusive finance"
    driver = webserver()
    open_page(driver, keyword)

The result is as follows: