Xianyu has written several articles about CNKI crawlers before, and the feedback has been very good. Still, Xianyu can't help complaining a little.
Some friends didn't finish reading the article, or even the code, before asking me "Why can I only crawl so few papers?" (anyone who has read the code will see that the papers_need variable sets the number of papers to crawl) or "Why can't I crawl other documents? I want to crawl XXX documents" (because the code is written to search through [Document Sources] in CNKI Advanced Search). Others simply pasted their error message to me and asked what went wrong.
When you find other people's code on the Internet, don't just copy and paste it. Adapt it to your own local environment. For example, the XPath of the same element may differ between computers or between browsers, so a path that runs fine on my machine may raise an error on yours.
Back to the topic: yesterday Xianyu received a private message from a fan asking whether literature could be searched by [Keywords].
Today's article focuses on how to analyze the structure of the web page and then use Selenium to search CNKI for documents by keyword. Crawling the documents returned by the search is not covered in detail here, because previous articles have already described it.
Let's first look at how to search for documents by keyword manually.
First, we open the website and click [Advanced Search] (you can also click the [Topic] drop-down directly in the search box).
Then we click [Topic] and select [Keywords].
Enter the keywords you want to search for (for example: digital inclusive finance) and click [Search].
Combined with the previous demand analysis, we can analyze the web page and locate the corresponding elements.
The first is [Advanced Search]. It has its own link, https://kns.cnki.net/kns8/AdvSearch, so opening that page directly saves a step.
Then we need to click [Topic] before the drop-down box appears. When analyzing the web page, I found that when the drop-down box appears, the style attribute of its tag changes from "display: none;" to "display: block;".
After the drop-down box appears, we need to locate the [Keywords] option.
# [Keywords] XPath or CSS selector
//*[@id="gradetxt"]/dd[1]/div[2]/div[1]/div[2]/ul/li[3]
li[data-val="KY"]
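Since two locators are listed for the same option, and a path that works on one machine may not match on another, one possible approach is to try them in order and use the first that matches. The sketch below is only an illustration: find_first and FakeDriver are made-up names, FakeDriver stands in for a real Selenium driver so the example runs without a browser, and the strings "xpath" and "css selector" are the values of Selenium's By.XPATH and By.CSS_SELECTOR.

```python
def find_first(driver, locators):
    """Try each (by, value) locator in order and return the first element found."""
    errors = []
    for by, value in locators:
        try:
            return driver.find_element(by, value)
        except Exception as error:  # with Selenium this would typically be NoSuchElementException
            errors.append(f"{by}={value}: {error}")
    raise LookupError("no locator matched: " + "; ".join(errors))


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver, used only for this demo."""

    def __init__(self, known):
        self.known = known  # maps (by, value) to a fake element

    def find_element(self, by, value):
        if (by, value) not in self.known:
            raise LookupError(f"no such element: {value}")
        return self.known[(by, value)]


keyword_locators = [
    ("xpath", '//*[@id="gradetxt"]/dd[1]/div[2]/div[1]/div[2]/ul/li[3]'),
    ("css selector", 'li[data-val="KY"]'),
]
# Pretend the XPath does not match in this environment, but the CSS selector does.
driver = FakeDriver({("css selector", 'li[data-val="KY"]'): "keyword option"})
print(find_first(driver, keyword_locators))  # keyword option
```

With a real driver the same helper could be called with By constants instead of the plain strings.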
Then find the XPath of the [Search Box]. It is an input element that receives the text typed by the user.
# Input box
//*[@id="gradetxt"]/dd[1]/div[2]/input
After passing the data into the input box, we need to click the [Search] button below
# Search
/html/body/div[2]/div/div[2]/div/div[1]/div[1]/div[2]/div[2]/input
After clicking [Search], we read the [number of documents] from the results page.
# Number of documents
/html/body/div[3]/div[2]/div[2]/div[2]/form/div/div[1]/div[1]/span[1]/em
Code implementation
Selenium is a browser automation tool commonly used for automated web testing. It drives a real browser and simulates user operations (navigating, typing, clicking, scrolling and so on), so the page is rendered just as it would be for a user. It supports multiple browsers.
In crawlers, Selenium is mainly used to solve the problem that requests cannot execute JavaScript directly.
Import related libraries
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.action_chains import ActionChains
Create browser object
Here I am using the Edge browser
def webserver():
    # get() returns immediately without waiting for the page to finish loading
    desired_capabilities = DesiredCapabilities.EDGE
    desired_capabilities["pageLoadStrategy"] = "none"
    # Set up the Microsoft Edge driver options
    options = webdriver.EdgeOptions()
    # Tell the browser not to load images, to speed up page loading
    options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})
    # Create the Microsoft Edge driver
    driver = webdriver.Edge(options=options)
    return driver
Crawl the web
The logic is not difficult: locate each element first, then use Selenium to simulate the clicks you would make manually in the browser.
First open the page and wait a second or two for the page to fully load.
driver.get("https://kns.cnki.net/kns8/AdvSearch")
time.sleep(2)
Next, make the drop-down box display. As mentioned earlier, the drop-down box appears when the style attribute of its tag changes from "display: none;" to "display: block;".
Here we change the style attribute by executing a JS script.
# Modify the style attribute to display the drop-down box
opt = driver.find_element(By.CSS_SELECTOR, 'div.sort-list')  # Locate the drop-down box
# Execute a JS script to modify the attribute; arguments[0] is the first argument passed after the script (here, opt)
driver.execute_script("arguments[0].setAttribute('style', 'display: block;')", opt)
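To check that the script took effect, one option is to read the style attribute back (for example with Selenium's get_attribute("style")) and inspect its display value. The helper names below, parse_inline_style and is_hidden_by_style, are made up for illustration, and the sketch only looks at the inline style string, not at rules coming from stylesheets.

```python
def parse_inline_style(style: str) -> dict:
    """Turn an inline style string such as 'display: none;' into a dict."""
    declarations = {}
    for part in style.split(";"):
        if ":" not in part:
            continue  # skip empty fragments such as the one after the last ';'
        name, value = part.split(":", 1)
        declarations[name.strip().lower()] = value.strip().lower()
    return declarations


def is_hidden_by_style(style: str) -> bool:
    """Return True when the inline style sets display to none."""
    return parse_inline_style(style).get("display") == "none"


print(is_hidden_by_style("display: none;"))   # True
print(is_hidden_by_style("display: block;"))  # False
```

In the crawler this could be called as is_hidden_by_style(opt.get_attribute("style")) right after the script runs.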
After the drop-down box is displayed, we need to click [Keywords] to switch to keyword search.
Note that during testing I found the drop-down box sometimes did not load completely. When that happened, the code reported an error saying "Element ... is not clickable at point (189, 249)", and the program could not click [Keywords].
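A generic way to cope with this kind of intermittent failure is to retry the action a few times with a short pause. This is not what the code in this article does (it moves the mouse instead, as described next); the retry helper below is a hypothetical sketch, and flaky_click only imitates a click that fails twice before succeeding. Which exception type to catch in a real run depends on your Selenium and driver versions, so it is left as a parameter.

```python
import time


def retry(action, attempts=3, delay=0.5, exceptions=(Exception,)):
    """Call action() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay)  # give the page a moment before trying again
    raise last_error


calls = {"count": 0}


def flaky_click():
    """Imitates a click that fails on the first two tries."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("Element is not clickable")
    return "clicked"


print(retry(flaky_click, attempts=5, delay=0))  # clicked
```

With Selenium, action could be a lambda that locates the element and calls click() on it.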
I also found that when loading is incomplete, moving the mouse over the drop-down box makes it load completely, so here I use ActionChains in Selenium to simulate the mouse.
When automating with Selenium, you sometimes need to simulate mouse operations such as clicking, double-clicking, right-clicking and dragging. Selenium provides the ActionChains class to handle these.
Note also that moving the mouse only to [Keywords] still does not load the drop-down box correctly. It is best to move to the bottom of the drop-down box or to an element after [Keywords]. Here I move to [Corresponding Author].
# [Corresponding Author] XPath or CSS selector
/html/body/div[2]/div/div[2]/div/div[2]/div[1]/div[1]/div[2]/ul/li[8]
li[data-val="RP"]
After the drop-down box is loaded, locate [Keywords] and click
# Move the mouse to the drop-down box
ActionChains(driver).move_to_element(driver.find_element(By.CSS_SELECTOR, 'li[data-val="RP"]')).perform()
# Find the [Keyword] option and click
WebDriverWait(driver, 100).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, 'li[data-val="KY"]'))
).click()
Locate the search box and enter the keywords we want to search for.
# Pass in keywords
WebDriverWait(driver, 100).until(
    EC.presence_of_element_located((By.XPATH, '''//*[@id="gradetxt"]/dd[1]/div[2]/input'''))
).send_keys(keyword)
# Click to search
WebDriverWait(driver, 100).until(
    EC.presence_of_element_located((By.XPATH, "/html/body/div[2]/div/div[2]/div/div[1]/div[1]/div[2]/div[2]/input"))
).click()
After the search results come out, locate the [number of documents] and read the number from the element's text.
# Get the total number of documents and pages
res_unm = WebDriverWait(driver, 100).until(EC.presence_of_element_located(
    (By.XPATH, "/html/body/div[3]/div[2]/div[2]/div[2]/form/div/div[1]/div[1]/span[1]/em"))
).text
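The value read here is a string and may contain thousands separators, so it has to be converted before the number of pages can be worked out. The sketch below shows one way to do that. The function names parse_count and page_count are made up for illustration, and the page size of 20 is taken from the division in the complete code. Ceiling division is used so that a total that is an exact multiple of the page size does not count an extra page.

```python
def parse_count(text: str) -> int:
    """Convert a count such as '1,234' to an int by dropping thousands separators."""
    return int(text.strip().replace(",", ""))


def page_count(total: int, per_page: int = 20) -> int:
    """Number of result pages, rounded up only when there is a remainder."""
    return (total + per_page - 1) // per_page


total = parse_count("1,234")
print(total, page_count(total))  # 1234 62
print(page_count(40))            # 2
```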
The complete code is as follows:
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.action_chains import ActionChains
def webserver():
    # get() returns immediately without waiting for the page to finish loading
    desired_capabilities = DesiredCapabilities.EDGE
    desired_capabilities["pageLoadStrategy"] = "none"
    # Set up the Microsoft Edge driver options
    options = webdriver.EdgeOptions()
    # Tell the browser not to load images, to speed up page loading
    options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})
    # Create the Microsoft Edge driver
    driver = webdriver.Edge(options=options)
    return driver
def open_page(driver, keyword):
    # Open the page and wait two seconds
    driver.get("https://kns.cnki.net/kns8/AdvSearch")
    time.sleep(2)
    # Modify the style attribute so that the drop-down box is displayed
    opt = driver.find_element(By.CSS_SELECTOR, 'div.sort-list')  # Locate the drop-down box
    # Execute a JS script to modify the attribute; arguments[0] is the first argument passed after the script (opt)
    driver.execute_script("arguments[0].setAttribute('style', 'display: block;')", opt)
    # Move the mouse to [Corresponding Author] in the drop-down box
    ActionChains(driver).move_to_element(driver.find_element(By.CSS_SELECTOR, 'li[data-val="RP"]')).perform()
    # Find the [Keywords] option and click it
    WebDriverWait(driver, 100).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, 'li[data-val="KY"]'))
    ).click()
    # Pass in the keywords
    WebDriverWait(driver, 100).until(
        EC.presence_of_element_located((By.XPATH, '''//*[@id="gradetxt"]/dd[1]/div[2]/input'''))
    ).send_keys(keyword)
    # Click to search
    WebDriverWait(driver, 100).until(
        EC.presence_of_element_located((By.XPATH, "/html/body/div[2]/div/div[2]/div/div[1]/div[1]/div[2]/div[2]/input"))
    ).click()
    # Click to switch to Chinese documents
    WebDriverWait(driver, 100).until(
        EC.presence_of_element_located((By.XPATH, "/html/body/div[3]/div[1]/div/div/div/a[1]"))
    ).click()
    # Get the total number of documents and pages
    res_unm = WebDriverWait(driver, 100).until(EC.presence_of_element_located(
        (By.XPATH, "/html/body/div[3]/div[2]/div[2]/div[2]/form/div/div[1]/div[1]/span[1]/em"))
    ).text
    # Remove the thousands separators
    res_unm = int(res_unm.replace(",", ''))
    # 20 results per page; round up without counting an extra page on exact multiples of 20
    page_unm = (res_unm + 19) // 20
    print(f"A total of {res_unm} results were found, {page_unm} pages.")
if __name__ == '__main__':
    keyword = "digital inclusive finance"
    driver = webserver()
    open_page(driver, keyword)
The result is as follows: