Using selenium to get CNKI document information by [keyword]

Hello everyone, I am Xianyu

Xianyu has written several articles about CNKI crawlers before, and the feedback in the backend has been very good. Still, Xianyu can’t help but want to complain a little.

Some friends didn’t finish reading the article, or even the code, before asking me “Why can I only crawl this many pieces of literature information?” (those who have read the code will find that it defines a papers_need variable that sets the number of articles to crawl) or “Why can’t I crawl other documents? I want to crawl XXX documents” (because the code is written to search for articles through [Document Sources] in CNKI Advanced Search). Other friends simply post the error message to me and ask what happened.

I think that when you see other people’s code on the Internet, you shouldn’t just copy and paste it. You should adapt the code to your own local environment. For example, when locating an element by XPath, the XPath of the same element may differ between computers or between browsers. A path that runs fine for me locally may raise an error on your machine.
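To make the XPath point concrete, here is a minimal sketch of one way to cope with locators that differ between environments: keep several candidate locators for the same element and use the first one that matches. This is only an illustration, not the code from my earlier articles. The function name find_first and both candidate locators are hypothetical, and the driver is assumed to be a selenium WebDriver (or anything with a compatible find_elements method).

```python
# Strategy strings with the same values as selenium's By.XPATH and
# By.CSS_SELECTOR constants, so selenium itself is not needed to read this.
XPATH = "xpath"
CSS = "css selector"

# Hypothetical candidates for one and the same search box. Replace them
# with locators copied from your own browser's developer tools.
SEARCH_BOX_CANDIDATES = [
    (XPATH, '//*[@id="hypothetical-search-box"]/input'),
    (CSS, "input.hypothetical-search-input"),
]


def find_first(driver, candidates):
    """Return the first element matched by any (by, value) candidate.

    `driver` is anything exposing find_elements(by, value), such as a
    selenium WebDriver. find_elements returns an empty list when nothing
    matches, so a candidate that fails does not raise an exception.
    """
    for by, value in candidates:
        matches = driver.find_elements(by, value)
        if matches:
            return matches[0]
    raise LookupError(f"no candidate locator matched: {candidates}")
```

With a real browser you would call find_first(driver, SEARCH_BOX_CANDIDATES) after the page has loaded. If every candidate fails, the LookupError tells you it is time to open the developer tools and copy a fresh locator for your own environment.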

When looking at other people’s code, it’s a good idea to first figure out:

  1. What the author was thinking
  2. Why the author wrote it this way
  3. What logic lies behind the code

Take my CNKI crawler articles as an example:

  1. Why use selenium to crawl?
  2. How to analyze the web page, and how to locate elements (XPath, CSS selectors, etc.)?
  3. How to simulate human operation of the browser (mouse movement, clicks, scrolling the window, etc.) through selenium?

Back to the topic: yesterday Xianyu received a private message from a fan asking whether literature could be searched by [keywords].

Today’s article focuses on how to analyze the structure of the web page and then use selenium to search CNKI for documents by keyword. As for crawling the documents that the search returns, this article will not go into much detail, because that has been covered in previous articles.

Requirements analysis

Let’s first look at how you would search for documents by keyword manually.

CNKI: China National Knowledge Infrastructure (cnki.net)

First, we open the website and click [Advanced Search] (you can also click the [Theme] drop-down directly in the search box)

Then we click [Theme] and select [Keywords]


Enter the keywords you want to search for (for example: digital inclusive finance) and click [Search]
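The manual steps above can be sketched in code roughly as follows. This is a sketch under stated assumptions, not the final script: every XPath below is a hypothetical placeholder that you must replace after inspecting the live page, the page URL is passed in by the caller, and the helper names wait_for and search_by_keyword are made up for illustration. The driver is assumed to be a selenium WebDriver, but only get and find_elements are used on it.

```python
import time

XPATH = "xpath"  # same value as selenium's By.XPATH constant

# Hypothetical placeholder locators, NOT taken from the real CNKI page.
THEME_DROPDOWN = '//div[@id="hypothetical-theme-dropdown"]'
KEYWORD_OPTION = '//li[@id="hypothetical-keyword-option"]'
SEARCH_INPUT = '//input[@id="hypothetical-search-input"]'
SEARCH_BUTTON = '//input[@id="hypothetical-search-button"]'


def wait_for(driver, xpath, timeout=10.0, poll=0.5):
    """Poll until an element matching xpath exists, then return it."""
    deadline = time.monotonic() + timeout
    while True:
        matches = driver.find_elements(XPATH, xpath)
        if matches:
            return matches[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"element not found: {xpath}")
        time.sleep(poll)


def search_by_keyword(driver, page_url, keyword, timeout=10.0):
    """Replay the manual steps: open page, pick [Keywords], type, search."""
    driver.get(page_url)                                # open the search page
    wait_for(driver, THEME_DROPDOWN, timeout).click()   # click [Theme]
    wait_for(driver, KEYWORD_OPTION, timeout).click()   # select [Keywords]
    box = wait_for(driver, SEARCH_INPUT, timeout)
    box.clear()
    box.send_keys(keyword)                              # enter the keyword
    wait_for(driver, SEARCH_BUTTON, timeout).click()    # click [Search]


# Usage with a real browser (requires selenium and a browser driver):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   search_by_keyword(driver, "<search page URL>", "digital inclusive finance")
```

The polling helper only checks that the element exists. In a real script you would probably prefer selenium’s WebDriverWait together with expected_conditions.element_to_be_clickable, which also waits until the element can actually be clicked.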

Web page analysis & element locating

Building on the requirements analysis above, we can analyze the web page and locate the corresponding elements.

The first is [Advanced Search]. CNKI has a direct link to the Advanced Search page, Advanced Search-China National Knowledge Infrastructure (cnki.net), so opening it directly saves a step.

Next, we need to click [Theme] so that the drop-down box appears. When analyzing the web page, I found that when the drop-down box appeared, the in the tag