Use Jsoup to scrape web page data and convert it to Excel (Java version)

Background

In the post-epidemic era, China's economic performance in 2022 has been a matter of great concern to many scholars and researchers. The relevant data are published on the website of the National Bureau of Statistics, and analyzing them lets us verify and observe the current economic situation from one angle.

A total of 1,279 county-level units across the country have disclosed their GDP and general public budget revenue data for 2022. Based on these data, Enterprise Early Warning compiled a GDP ranking of the top 100 counties in China and a general public budget revenue ranking of the top 100 counties. Kunshan City topped the GDP list with 500.666 billion yuan; Jiangyin City and Jinjiang City ranked second and third; and Changsha County, in 7th place, was the only county in Hunan Province to enter the national top ten.

83a2bbd9ed74e7e42e46e3b02113ec70.png
fb44958df96aa295b20abca8204f5293.png

The first list is published as an image and the second as an HTML table, which makes the data inconvenient to analyze offline. For a programmer this should not be an obstacle: we can use web scraping to collect and organize the data.

This article uses Java to explain how to scrape web page data with Jsoup. It gives detailed sample code, which I hope will be helpful to everyone.

1. A first look at Jsoup

1. Web page structure analysis

Before scraping a page with Jsoup, analyze the structure of the web page so that you can choose a suitable scraping strategy. Open the target site in a browser, press F12 to open the developer tools, and locate the target elements on the page.

cd8645ff7ba9b40e55d70785a3e30f6e.png

Expand the table under the div that holds the top-100 GDP list above, and you will find the following data:

a0de3fcf5f8c4ba3dd573e6212a093d8.png

The general public budget revenue table is handled in the same way, so the steps are not repeated here.

2. Scraping with Jsoup in Java

1. Reference Jsoup-related dependencies

Here we use Maven to manage the jar dependencies. First define pom.xml; the key content is as follows:

<project xmlns="http://maven.apache.org/POM/4.0.0"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
 <modelVersion>4.0.0</modelVersion>
 <groupId>com.yelang</groupId>
 <artifactId>jsoupdemo</artifactId>
 <version>0.0.1-SNAPSHOT</version>
 
 <dependencies>
  <dependency>
   <groupId>org.jsoup</groupId>
   <artifactId>jsoup</artifactId>
   <version>1.11.3</version>
  </dependency>
 
  <dependency>
   <groupId>com.alibaba</groupId>
   <artifactId>easyexcel</artifactId>
   <version>3.0.5</version>
  </dependency>
 </dependencies>
 
</project>

2. Processing of information entity classes

Comparing the two tables shows that only the specific indicator differs; the leading columns (rank, county name, and province) are the same. We therefore use an object-oriented design for the entity classes. The corresponding class diagram is as follows:

e8ac9db0d0a449c559add7c22b4a8688.png

3. Data collection entity

package com.yelang.entity;
 
import java.io.Serializable;
import com.alibaba.excel.annotation.ExcelProperty;
public class CountyBase implements Serializable {
 private static final long serialVersionUID = -1760099890427975758L;
 
 @ExcelProperty(value= {"serial number"}, index = 1)
 private Integer index;
 
 @ExcelProperty(value= {"County-level area"}, index = 2)
 private String name;
 
 @ExcelProperty(value= {"Province"}, index = 3)
 private String province;
 
 public Integer getIndex() {
  return index;
 }
 
 public void setIndex(Integer index) {
  this.index = index;
 }
 
 public String getName() {
  return name;
 }
 
 public void setName(String name) {
  this.name = name;
 }
 
 public String getProvince() {
  return province;
 }
 
 public void setProvince(String province) {
  this.province = province;
 }
 
 public CountyBase(Integer index, String name, String province) {
  super();
  this.index = index;
  this.name = name;
  this.province = province;
 }
 
 public CountyBase() {
  super();
 }
 
}

In the code above, the rank, county-level area, and province are abstracted into a parent class, and two subclasses are designed: a GDP class and a general public budget revenue class. Because the collected data must be saved to a local Excel file, we use EasyExcel to generate it. The @ExcelProperty annotation defines the Excel header text for each field and, through index, the column it is written to. Note that index is zero-based, so starting at 1 as done here leaves the first column of the sheet empty.

package com.yelang.entity;
 
import java.io.Serializable;
import com.alibaba.excel.annotation.ExcelProperty;
public class Gdp extends CountyBase implements Serializable {
 
 private static final long serialVersionUID = 5265057372502768147L;
 
 @ExcelProperty(value= {"GDP (100 million yuan)"}, index = 4)
 private String gdp;
 
 public String getGdp() {
  return gdp;
 }
 
 public void setGdp(String gdp) {
  this.gdp = gdp;
 }
 
 public Gdp(Integer index, String name, String province, String gdp) {
  super(index,name,province);
  this.gdp = gdp;
 }
 
 public Gdp(Integer index, String name, String province) {
  super(index, name, province);
 }
 
}

package com.yelang.entity;
 
import java.io.Serializable;
 
import com.alibaba.excel.annotation.ExcelProperty;
 
public class Gpbr extends CountyBase implements Serializable {
 
 private static final long serialVersionUID = 8612514686737317620L;
 
 @ExcelProperty(value= {"General public budget revenue (100 million yuan)"}, index = 4)
 private String gpbr;// General public budget revenue
 
 public String getGpbr() {
  return gpbr;
 }
 
 public void setGpbr(String gpbr) {
  this.gpbr = gpbr;
 }
 
 public Gpbr(Integer index, String name, String province, String gpbr) {
  super(index, name, province);
  this.gpbr = gpbr;
 }
 
 public Gpbr(Integer index, String name, String province) {
  super(index, name, province);
 }
}

4. Actual crawling

The following code scrapes and converts the GDP data. If you are new to Jsoup, it helps to get familiar with its selector syntax first; if you have used jQuery, you will pick up Jsoup very quickly.

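import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.alibaba.excel.EasyExcel;
import com.yelang.entity.Gdp;

// Place this method inside the crawler class.
static void grabGdp() {
    String target = "https://www.maigoo.com/news/665462.html";
    // The original code read the user agent from FetchCsdnCookie.ua[1], a helper
    // class that this article does not show. Any browser user-agent string can be
    // used instead; the value below is only a placeholder.
    String userAgent = "Mozilla/5.0";
    try {
        Document doc = Jsoup.connect(target)
                .ignoreContentType(true)
                .userAgent(userAgent)
                .timeout(300000) // milliseconds
                .header("referer", "https://www.maigoo.com")
                .get();
        // Every tr of the table inside the div with sibling index 3 under #t_container
        Elements elements = doc.select("#t_container > div:eq(3) table tr");
        List<Gdp> list = new ArrayList<Gdp>();
        // Start at 1 to skip the header row
        for (int i = 1; i < elements.size(); i++) {
            Element tr = elements.get(i); // current data row
            Elements tds = tr.select("td");
            Integer index = Integer.valueOf(tds.get(0).text());
            String name = tds.get(1).text();
            String province = tds.get(2).text();
            String gdp = tds.get(3).text();
            Gdp county = new Gdp(index, name, province, gdp);
            list.add(county);
        }
        String fileName = "E:/gdptest/2023 National Top 100 Counties GDP Ranking List.xlsx";
        EasyExcel.write(fileName, Gdp.class).sheet("GDP Top 100 List").doWrite(list);
        System.out.println("Done...");
    } catch (Exception e) {
        // Any failure (network, parsing, or file writing) ends up here
        System.out.println("An exception occurred while scraping: " + e.getMessage());
    }
}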
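
The article says the general public budget revenue table is scraped the same way but does not show that code. One possible way to avoid copying the loop is to separate row extraction from object construction. The following is a minimal sketch under that assumption, using only the JDK: RowMapperSketch, RowFactory, mapRows, and the simplified Row class are hypothetical names that do not appear in the original project, and each row is represented as a list of cell strings such as those returned by tds.get(n).text(). Unlike the method above, it skips rows with too few cells or a non-numeric rank instead of aborting the whole run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original article.
class RowMapperSketch {

    // Builds one entity from the four cell texts of a table row.
    interface RowFactory<T> {
        T create(Integer index, String name, String province, String value);
    }

    // Simplified stand-in for the article's Gdp / Gpbr entities.
    static class Row {
        final Integer index;
        final String name;
        final String province;
        final String value;

        Row(Integer index, String name, String province, String value) {
            this.index = index;
            this.name = name;
            this.province = province;
            this.value = value;
        }
    }

    // Converts raw cell texts into entities, skipping malformed rows.
    static <T> List<T> mapRows(List<List<String>> rows, RowFactory<T> factory) {
        List<T> result = new ArrayList<T>();
        for (List<String> cells : rows) {
            if (cells.size() < 4) {
                continue; // too few cells to form a record
            }
            Integer index;
            try {
                index = Integer.valueOf(cells.get(0).trim());
            } catch (NumberFormatException e) {
                continue; // rank cell is not a number, e.g. a header row
            }
            result.add(factory.create(index, cells.get(1).trim(),
                    cells.get(2).trim(), cells.get(3).trim()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> rows = new ArrayList<List<String>>();
        rows.add(Arrays.asList("Rank", "County", "Province", "GDP")); // header row
        rows.add(Arrays.asList("1", "Kunshan City", "Jiangsu", "5006.66"));
        rows.add(Arrays.asList("broken")); // malformed row

        List<Row> parsed = mapRows(rows, Row::new);
        for (Row r : parsed) {
            System.out.println(r.index + " " + r.name + " " + r.province + " " + r.value);
        }
    }
}
```

With the article's entity classes, Gdp::new and Gpbr::new would fit the same factory shape, because both declare a constructor taking (Integer, String, String, String).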

The key point here is how Jsoup locates and extracts elements from the page. The code above uses a jQuery-like selector to query the DOM:

Elements elements = doc.select("#t_container > div:eq(3) table tr");

This line returns every tr under the table; the loop then reads the td cells of each row to get the corresponding data. In Jsoup, :eq(n) matches by zero-based sibling index, so div:eq(3) here is the fourth child element of #t_container.
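
To try the selector without touching the network, it can be run against an inline HTML string. The sketch below is only an illustration: the SelectorSketch class, the countyNames method, and the markup are hypothetical and merely imitate the structure described above (the real page's markup may differ). It relies on Jsoup.parse and Document.select from the jsoup dependency declared in pom.xml.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical demo class; the HTML only imitates the page structure
// described in the article and is not copied from the real site.
class SelectorSketch {

    static final String HTML =
          "<div id=\"t_container\">"
        + "<div>intro</div><div>note</div><div>picture</div>"
        + "<div><table>"
        + "<tr><td>Rank</td><td>County</td><td>Province</td><td>GDP</td></tr>"
        + "<tr><td>1</td><td>Kunshan City</td><td>Jiangsu</td><td>5006.66</td></tr>"
        + "</table></div>"
        + "</div>";

    // Returns the text of the second cell of every data row.
    static List<String> countyNames(String html) {
        Document doc = Jsoup.parse(html);
        // div:eq(3) selects the div with zero-based sibling index 3,
        // i.e. the fourth child of #t_container.
        Elements rows = doc.select("#t_container > div:eq(3) table tr");
        List<String> names = new ArrayList<String>();
        for (int i = 1; i < rows.size(); i++) { // skip the header row
            Elements tds = rows.get(i).select("td");
            names.add(tds.get(1).text());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(countyNames(HTML));
    }
}
```

If the site inserts or removes a block above the table, the sibling index changes and the selector silently matches nothing, so it is worth checking that the returned list is not empty before writing the Excel file.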

3. Process analysis and results

1. Analysis of collection process

Here we step through the program in a debugger to analyze the process. First, Jsoup is used to simulate a browser request to the page:

114a882aca2080d75ad97a80aa636145.png

Use the select(xxx) method to get the page elements:

eb4a07560737323a1292a84b0517e62f.png

Get the td cell data under each tr:

b2117a77101d11748886fcf583df7006.png

2. Running results

After the GDP code above and its counterpart for general public budget revenue have run, the following two files appear on the destination disk:

3d6242958fa12e3e45773761400fe4b4.png

Open the two Excel files and you can see that the target data has been collected, in exactly the same order as on the web page.

c33a9ea6d44d680283cfc870e6b88ad0.png
873737c0ace96c6cbd879c85fc936831.png
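
The ordering can also be checked in code before the file is written, instead of by opening the spreadsheet. The following is a small sketch that assumes the page numbers its rows 1 to N without gaps or ties, which the article does not state explicitly; RankOrderCheck and isConsecutiveFromOne are hypothetical names, and the input would be the index values of the scraped entities.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sanity check, not part of the original article.
class RankOrderCheck {

    // Returns true when the ranks are exactly 1, 2, ..., n in that order.
    // An empty list returns false, since it usually means the selector matched nothing.
    static boolean isConsecutiveFromOne(List<Integer> ranks) {
        if (ranks.isEmpty()) {
            return false;
        }
        for (int i = 0; i < ranks.size(); i++) {
            Integer rank = ranks.get(i);
            if (rank == null || rank.intValue() != i + 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isConsecutiveFromOne(Arrays.asList(1, 2, 3)));
        System.out.println(isConsecutiveFromOne(Arrays.asList(1, 3, 4)));
    }
}
```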

Summary

That is the main content of this article. Using Java, it explained in detail how to scrape web page data with Jsoup and, combined with EasyExcel, convert web page tables into Excel files, with detailed sample code. The article was written in a hurry, so mistakes are inevitable; criticism, corrections, and discussion are welcome.

Source: blog.csdn.net/yelangkingwuzuhu/article/details/130901172

