Parsing HTML tables with jsoup

About jsoup

A Java HTML parser
jsoup is a Java HTML parser that can parse HTML directly from a URL or from text content. It provides a convenient API for retrieving and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

Main features

  1. Parse HTML from a URL, file, or string;
  2. Use DOM traversal or CSS selectors to find and retrieve data;
  3. Manipulate HTML elements, attributes, and text.
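The three features listed above can be sketched in a few lines. The HTML string and the class name JsoupBasicsSketch are made up for illustration; Jsoup.parse(String), selectFirst, attr, and text are jsoup calls (selectFirst assumes jsoup 1.11.1 or later).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Illustrative sketch only; the class name and the input are hypothetical.
final class JsoupBasicsSketch {

    static final String HTML =
            "<html><body><a id='home' href='/index.html'>Home</a></body></html>";

    static String[] run() {
        // 1. Parse HTML from a string.
        Document doc = Jsoup.parse(HTML);

        // 2. Find data with a CSS selector (selectFirst returns null when nothing matches).
        Element link = doc.selectFirst("a#home");
        String textBefore = link.text();

        // 3. Manipulate an attribute and the element text.
        link.attr("href", "/start.html");
        link.text("Start");

        return new String[] {textBefore, link.attr("href"), link.text()};
    }

    public static void main(String[] args) {
        for (String value : run()) {
            System.out.println(value);
        }
    }
}
```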

Requirements

We receive a batch of data from an upstream system, apply some business logic after parsing it, and write it to the database in batches. The data arrives as Excel files, one file at a time. These files are not in standard xls or xlsx format, however: they are HTML documents that were saved with an xls extension. Parsing them with easypoi or easyexcel fails with the error java.io.IOException: Your InputStream was neither an OLE2 stream, nor an OOXML stream. In short, neither framework recognizes the files, because they are not standard xls or xlsx. One workaround is to open each file exported from upstream and save it again as standard xls or xlsx, after which parsing works without problems. Here, however, the files have to be handled by the program itself.
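The error message names the two containers the Excel libraries expect. One rough way to tell them apart from an HTML export, before choosing a parser, is to peek at the first bytes of the stream. The sketch below uses only the JDK (Java 9 or later for readNBytes). The class name SpreadsheetFormatSniffer and its Format enum are hypothetical, and the HTML check (first non-blank character is '<') is a heuristic assumption, not something easypoi or easyexcel perform.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of easypoi, easyexcel, or jsoup.
final class SpreadsheetFormatSniffer {

    enum Format { OLE2, ZIP_OOXML, HTML_LIKE, UNKNOWN }

    // Signature at the start of OLE2 compound files (binary .xls).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Signature at the start of ZIP archives (.xlsx is a ZIP container).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Peeks at up to 64 bytes of a mark-supporting stream, then rewinds it. */
    static Format sniff(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        byte[] head = new byte[64];
        int n;
        try {
            in.mark(head.length);
            n = in.readNBytes(head, 0, head.length);
            in.reset();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (startsWith(head, n, OLE2_MAGIC)) return Format.OLE2;
        if (startsWith(head, n, ZIP_MAGIC)) return Format.ZIP_OOXML;
        // Heuristic: strip a byte order mark and whitespace, then look for a tag.
        String text = new String(head, 0, n, StandardCharsets.UTF_8)
                .replace("\uFEFF", "")
                .trim();
        return text.startsWith("<") ? Format.HTML_LIKE : Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, int[] magic) {
        if (length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] html = "<html><body><table></table></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(new ByteArrayInputStream(html)));
    }
}
```

A caller could route HTML_LIKE files to jsoup and the other two formats to the Excel library already in use. A ZIP signature only shows that the file is a ZIP container, so ZIP_OOXML is a hint rather than proof of a valid xlsx file.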

Implementation

Core API

Jsoup: The core public access point to the jsoup functionality. Its parse methods parse HTML into a Document; the parser will make a sensible, balanced document tree out of any HTML.

Document: The document object. Each parsed HTML page is a Document, which is the top-level structure in jsoup.
Element: An element object. A Document can contain many Element objects; use them to traverse nodes, extract data, or manipulate the HTML directly.
Elements: A collection of Element objects that behaves like a List.

Core methods

eachText()

    /**
     * Get the text content of each of the matched elements. If an element has no text, then it is not included in the
     * result.
     * @return A list of each matched element's text content.
     * @see Element#text()
     * @see Element#hasText()
     * @see #text()
     */
    public List<String> eachText() {
        ArrayList<String> texts = new ArrayList<>(size());
        for (Element el: this) {
            if (el.hasText())
                texts.add(el.text());
        }
        return texts;
    }
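To make the behaviour of eachText() concrete, here is a minimal sketch on a made-up three-row table (the class name EachTextSketch and the HTML are assumptions for illustration). The row whose cells contain no text is left out of the result, and the cell texts of each remaining row are joined by a space.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

// Illustrative sketch only; the class name and the input are hypothetical.
final class EachTextSketch {

    static final String HTML =
            "<table>"
            + "<tr><th>Time</th><th>Region</th></tr>"
            + "<tr><td> </td><td></td></tr>" // blank row: no text, so it is skipped
            + "<tr><td>00:15</td><td>Shanghai</td></tr>"
            + "</table>";

    static List<String> rowTexts() {
        Elements rows = Jsoup.parse(HTML).select("tr");
        // One string per row that has text.
        return rows.eachText();
    }

    public static void main(String[] args) {
        // Prints "Time Region" and "00:15 Shanghai", one per line.
        rowTexts().forEach(System.out::println);
    }
}
```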

select()

    /**
     * Find matching elements within this element list.
     * @param query A {@link Selector} query
     * @return the filtered list of elements, or an empty list if none match.
     */
    public Elements select(String query) {
        return Selector.select(query, this);
    }

1. The select() method is available on Document, Element, and Elements objects. It is context-sensitive: it searches only within the object it is called on, so it can be used to filter for specific elements, and calls can be chained.
2. The select() method returns an Elements collection, which provides a set of methods for extracting and processing the results.
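The two points above can be illustrated with a small sketch. The two-table HTML and the class name SelectContextSketch are made up; the selector calls are standard jsoup API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Illustrative sketch only; the class name and the input are hypothetical.
final class SelectContextSketch {

    static final String HTML =
            "<table id='t1'><tr><td>a</td></tr><tr><td>b</td></tr></table>"
            + "<table id='t2'><tr><td>c</td></tr></table>";

    static int[] counts() {
        Document doc = Jsoup.parse(HTML);
        Element firstTable = doc.getElementById("t1");

        int allRows = doc.select("tr").size();                 // whole document
        int firstTableRows = firstTable.select("tr").size();   // only inside #t1
        int chained = doc.select("table").select("tr").size(); // chained calls
        int combined = doc.select("table tr").size();          // single descendant selector

        return new int[] {allRows, firstTableRows, chained, combined};
    }

    public static void main(String[] args) {
        for (int count : counts()) {
            System.out.println(count);
        }
    }
}
```

For this input the chained form and the descendant selector "table tr" match the same rows, so either style works; chaining is mainly convenient when the intermediate Elements is needed as well.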

parse()

    // Parse HTML from an input stream; baseUri is used to resolve relative links
    public static Document parse(InputStream in, String charsetName, String baseUri) throws IOException {
        return DataUtil.load(in, charsetName, baseUri);
    }

    // Parse HTML from a file; the file's absolute path serves as the base URI
    public static Document parse(File in, String charsetName) throws IOException {
        return DataUtil.load(in, charsetName, in.getAbsolutePath());
    }

    // Parse HTML from a file with an explicit base URI
    public static Document parse(File in, String charsetName, String baseUri) throws IOException {
        return DataUtil.load(in, charsetName, baseUri);
    }

    // Parse HTML from an input stream with a caller-supplied Parser
    public static Document parse(InputStream in, String charsetName, String baseUri, Parser parser) throws IOException {
        return DataUtil.load(in, charsetName, baseUri, parser);
    }
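As a rough illustration of the stream overload and of what the baseUri argument does, the sketch below parses the same bytes twice: once with a base URI and once with an empty string. The class name ParseStreamSketch, the HTML, and the example.com address are assumptions for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Illustrative sketch only; the class name, input, and base URI are hypothetical.
final class ParseStreamSketch {

    static final byte[] BYTES =
            "<table><tr><td><a href='report.html'>r</a></td></tr></table>"
                    .getBytes(StandardCharsets.UTF_8);

    /** Parses BYTES from a stream with the given base URI and returns the link. */
    static Element firstLink(String baseUri) {
        try (InputStream in = new ByteArrayInputStream(BYTES)) {
            Document doc = Jsoup.parse(in, StandardCharsets.UTF_8.name(), baseUri);
            return doc.selectFirst("a");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String[] run() {
        Element withBase = firstLink("https://example.com/data/");
        Element withoutBase = firstLink("");
        return new String[] {
            withBase.attr("href"),      // the attribute value as written
            withBase.absUrl("href"),    // resolved against the base URI
            withoutBase.absUrl("href")  // cannot be resolved without a base URI
        };
    }

    public static void main(String[] args) {
        for (String value : run()) {
            System.out.println("[" + value + "]");
        }
    }
}
```

When only the table text is needed, as in the utility in this article, an empty base URI is sufficient because no links are resolved.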

package com.geekmice.springbootselfexercise.utils;

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

import com.geekmice.springbootselfexercise.exception.UserDefinedException;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.collections4.CollectionUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

/**
 * @BelongsProject: spring-boot-self-exercise
 * @BelongsPackage: com.geekmice.springbootselfexercise.utils
 * @Author: pingmingbo
 * @CreateTime: 2023-08-13 17:16
 * @Description: parse html
 * @Version: 1.0
 */
@Slf4j
public class ParseHtmlUtil {

    public static final String ERROR_MSG = "error msg:【{}】";

    /**
     * Parses an Excel export that is actually an HTML table, read from a file stream.
     * The header row is removed; rows without any text are skipped by eachText().
     *
     * @param inputStream file stream
     * @return the text of each data row, with cell values joined by spaces
     * @throws UserDefinedException if the stream cannot be read or contains no table rows
     */
    public static List<String> parseHandle(InputStream inputStream) {
        Document document;
        try {
            document = Jsoup.parse(inputStream, StandardCharsets.UTF_8.name(), "");
        } catch (IOException e) {
            // The message fills the placeholder; the exception itself supplies the stack trace
            log.error(ERROR_MSG, e.getMessage(), e);
            throw new UserDefinedException(e.toString());
        }
        // Collect the text of every non-empty row of every table
        Elements trList = document.select("table").select("tr");
        List<String> rowTexts = trList.eachText();
        if (CollectionUtils.isEmpty(rowTexts)) {
            throw new UserDefinedException("Parse file: file content does not exist");
        }
        // Drop the header row
        rowTexts.remove(0);
        return rowTexts;
    }

}
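Because eachText() joins all cells of a row with spaces, a value that itself contains spaces can no longer be separated from its neighbours, and empty cells disappear from the string. If the individual columns are needed, one possible variant is to collect the cells per row instead. This is only a sketch under assumptions: the class name TableCellSketch, the headerRows parameter, and the sample column layout are hypothetical and not part of the utility above.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical variant of the utility above; all names are illustrative only.
final class TableCellSketch {

    /**
     * Returns one list of cell texts per non-empty row, skipping the first
     * headerRows non-empty rows. Assumes the tables are not nested.
     */
    static List<List<String>> parseRows(String html, int headerRows) {
        Document doc = Jsoup.parse(html);
        List<List<String>> rows = new ArrayList<>();
        int skipped = 0;
        for (Element tr : doc.select("table tr")) {
            if (!tr.hasText()) {
                continue; // blank row
            }
            if (skipped < headerRows) {
                skipped++; // header row
                continue;
            }
            List<String> cells = new ArrayList<>();
            for (Element cell : tr.select("td, th")) {
                cells.add(cell.text()); // an empty cell stays as "" and keeps its position
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table>"
                + "<tr><td>Time</td><td>Party</td><td>A</td><td>B</td></tr>"
                + "<tr><td></td><td></td><td></td><td></td></tr>"
                + "<tr><td>2023-07-28 00:15</td><td>Sichuan mainnet seller</td><td></td><td>225.94</td></tr>"
                + "</table>";
        System.out.println(parseRows(html, 1));
    }
}
```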

Result

Each string in data is the text of one table row, with the cell values joined by spaces:

{
  "msg": "Operation succeeded",
  "code": 200,
  "data": [
    "2023-07-28 00:15 Shanghai Buyer 0 0",
    "2023-07-28 00:30 Shanghai Buyer 0 0",
     ....
    "2023-07-28 23:00 Sichuan mainnet seller 333.25 225.94",
    "2023-07-28 23:15 Sichuan mainnet seller 463.25 224.16",
    "2023-07-28 23:30 Sichuan mainnet seller 463.25 224.16",
    "2023-07-28 23:45 Sichuan mainnet seller 463.25 224.16",
    "2023-07-28 24:00 Sichuan mainnet seller 587.79 213.53"
  ]
}