Use jsoup to crawl and parse web page data

If you think the content of this blog is helpful or inspiring to you, please follow my blog to get the latest technical articles and tutorials as soon as possible. At the same time, you are also welcome to leave a message in the comment area to share your thoughts and suggestions. Thank you for your support!

1. What is jsoup, its functions and advantages

Jsoup is a Java-based HTML parser that can easily grab and parse data from web pages. Its main role is to help developers process HTML documents and extract the required data or information.

The main advantages of Jsoup are as follows:

  1. Ease of use: Jsoup provides an API similar to jQuery, making processing HTML documents very simple and easy to understand.
  2. Good compatibility: Jsoup can parse various types of documents such as HTML and XML, and supports standard CSS selectors, making it easier to select elements.
  3. Supports modification: jsoup can not only parse HTML documents but also modify them. For example, you can use jsoup to change elements, attributes, and text in a page.
  4. Efficient performance: jsoup's code is compact, and it parses typical pages quickly with low memory consumption.
  5. Open source and free: jsoup is open source under the MIT license, so it can be used and modified for free. It also has an active community, so help is available if you run into problems.

Jsoup is a powerful and easy-to-use HTML parser, suitable for various scenarios, such as data crawling, data mining, data analysis, etc.

2. How to use jsoup to grab HTML pages

  1. Add the jsoup dependency: First, add jsoup to the project. With Maven, add the following to the project's pom.xml file:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>

Or download the jar package of jsoup and add it to the classpath of the project.

  2. Create a connection: Use the Jsoup.connect(url) method to create a Connection object.
  3. Get the HTML document: Call the get() or post() method of the Connection object. Both send the request and return the response already parsed into a Document object.
  4. Parse HTML you already have: If the HTML is available as a string (for example, read from a file), use the Jsoup.parse(html) method to parse it into a Document object. In either case, you can then use the API provided by jsoup to select, traverse, and modify elements.

The following is a simple Java code example that demonstrates how to use jsoup to grab a web page and print its title and links:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class JsoupDemo {
    public static void main(String[] args) {
        String url = "https://www.baidu.com/";
        try {
            Document doc = Jsoup.connect(url).get();
            String title = doc.title();
            System.out.println("The title of the page is: " + title);
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                String linkHref = link.attr("href");
                String linkText = link.text();
                System.out.println("Link address: " + linkHref);
                System.out.println("Link text: " + linkText);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
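
When the HTML is already in memory, no network request is needed. The following is a minimal sketch that parses a string with Jsoup.parse(html, baseUri) and resolves relative links with absUrl(). The class name LinkExtractor, the sample markup, and the example.com base URI are assumptions made for illustration and are not part of the example above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {

    // Parses an HTML string and returns every link as an absolute URL.
    // baseUri is the address that relative href values are resolved against.
    public static List<String> extractLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/about\">About</a> <a href=\"https://example.org/\">Other</a></p>";
        for (String url : extractLinks(html, "https://example.com/")) {
            System.out.println(url);
        }
    }
}
```

Passing a base URI matters because attr("href") returns the attribute exactly as written in the page, which is often a relative path.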

3. Introduction to common APIs for parsing HTML with jsoup

jsoup is a powerful Java library that can be used to parse HTML documents. It provides many common APIs for selecting, traversing, and modifying elements and attributes in HTML documents. Here are a few commonly used APIs:

  1. Selector API: Used to select HTML elements based on CSS selector syntax.
  2. Attribute API: Used to get, set, and remove attributes of HTML elements.
  3. Traversal API: Used to traverse the elements in the HTML document.
  4. Manipulation API: Used to modify elements and attributes in HTML documents.

Next, we will introduce these APIs one by one and give corresponding code examples.

Selector API

The selector API allows you to select HTML elements using CSS selector syntax. It provides methods such as Document.select() and Elements.select() for this purpose.

// Select all p elements
Elements paragraphs = doc.select("p");

// Select all elements with class "example"
Elements examples = doc.select(".example");

// Select the element with the id "example" (first() returns null if nothing matches)
Element element = doc.select("#example").first();

Attribute API

The Attribute API is used to get, set, and remove attributes of HTML elements. You can use the Element.attr() method to get or set the value of a single attribute, or use the Element.attributes() method to get all attributes.

// Get the attribute value of the element
String href = element.attr("href");

// Set the attribute value of the element
element.attr("href", "http://example.com");

// Remove the attribute from the element
element.removeAttr("href");

// Get all attributes of the element
Attributes attributes = element.attributes();

Traversal API

The Traversal API is used to traverse elements in an HTML document. You can use Element.parent(), Element.children(), Element.nextElementSibling(), and other methods to move between related elements.

// Get the parent element of the element
Element parentElement = element.parent();

// Get the child elements of the element
Elements childrenElements = element.children();

// Get the next sibling element of the element
Element nextSiblingElement = element.nextElementSibling();

Manipulation API

The Manipulation API is used to modify elements and attributes in HTML documents. You can use Element.html(), Element.text(), Element.append(), and other methods to read and change the content of elements.

// Get the HTML content of the element
String html = element.html();

// Get the text content of the element
String text = element.text();

// Append HTML content to the element
element.append("<p>this is a new paragraph</p>");

// Append text content to the element
element.appendText("This is a new text");
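
To see how the four API groups work together, here is a small sketch that runs entirely on an in-memory document. The class name ApiTour, the #nav markup, and the example.com URL are assumptions chosen for the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ApiTour {

    // Rewrites the first link inside #nav and returns the modified document.
    public static Document rewrite(String html) {
        Document doc = Jsoup.parse(html);

        // Selector API: first() returns null when nothing matches, so check it
        Element link = doc.select("#nav a").first();
        if (link == null) {
            return doc;
        }

        // Attribute API: change one attribute and remove another
        link.attr("href", "https://example.com/home");
        link.removeAttr("target");

        // Traversal API: move from the link to its parent element
        Element parent = link.parent();

        // Manipulation API: add markup to the parent and replace the link text
        parent.append("<span class=\"note\">updated</span>");
        link.text("Home");
        return doc;
    }

    public static void main(String[] args) {
        Document doc = rewrite("<div id=\"nav\"><a href=\"/old\" target=\"_blank\">Old</a></div>");
        System.out.println(doc.body().html());
    }
}
```

The null check after first() is worth keeping in real crawlers, because a page layout change otherwise shows up as a NullPointerException.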

4. Common selectors and usage examples of jsoup

Element selector

An element selector selects one or more elements by tag name, class, or id. Commonly used element selectors include tagname, tagname.class, tagname#id, etc. For example:

// Select all a elements
Elements links = doc.select("a");

// Select all div elements whose class attribute is news
Elements divs = doc.select("div.news");

// Select the div element whose id attribute is header
Element header = doc.select("div#header").first();

Attribute selector

An attribute selector selects elements by their attributes. Commonly used attribute selectors include [attr], [attr=value], [attr^=prefix], and [attr~=regex]. Note that in jsoup, [attr~=regex] matches the attribute value against a regular expression. For example:

// Select all a elements whose href attribute starts with "http"
Elements links = doc.select("a[href^=http]");

// Select all div elements whose class attribute matches the regular expression "news"
Elements divs = doc.select("div[class~=news]");

Combination selector

Combined selectors join multiple selectors together. Commonly used combinators include the space (descendant), the greater-than sign (direct child), and the plus sign (adjacent sibling). For example:

// Select all p elements that are descendants of a div element
Elements paragraphs = doc.select("div p");

// Select all p elements that are direct children of a div element
Elements directParagraphs = doc.select("div > p");

// Select the p element immediately after a div element
Elements adjacentParagraphs = doc.select("div + p");

Pseudo-class selector

A pseudo-class selector filters elements by a condition that tag names, classes, and attributes cannot express, such as their content or their relationship to other elements. Commonly used pseudo-class selectors include :contains(text), :empty, and :not(selector). Of these, :contains(text) is a jsoup extension rather than standard CSS. For example:

// Select all a elements containing "example" text
Elements links = doc.select("a:contains(example)");

// Select all div elements that have no children
Elements divs = doc.select("div:empty");

// Select all elements that are not div elements
Elements notDivs = doc.select(":not(div)");
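
The selectors above are easiest to compare when they are run against the same markup. The following sketch counts the matches of several selectors on a small sample document. The class name SelectorDemo and the sample HTML are invented for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {

    // Sample markup invented for this sketch
    static final String HTML =
            "<div class=\"news\"><p>first</p><section><p>nested</p></section></div>"
          + "<p>after <a href=\"https://example.com/\">example link</a></p>"
          + "<div></div>";

    // Returns how many elements in the sample markup match the selector.
    public static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "div.news", "a[href^=http]", "div p", "div > p", "div + p",
            "a:contains(example)", "div:empty"
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector));
        }
    }
}
```

In this sample, "div p" matches both paragraphs inside the first div, while "div > p" matches only the one that is a direct child, which is the difference between the descendant and child combinators.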

5. jsoup handles special characters and encoding issues in HTML pages

Special characters and encoding issues are common pain points when working with HTML pages. Special characters are characters with a special meaning in HTML documents, such as <, >, and &. They must be escaped (as &lt;, &gt;, and &amp;) to be displayed correctly. Encoding problems are caused by mismatched encoding formats: if a page is decoded with the wrong encoding, garbled characters appear. The following describes how to use jsoup to deal with special characters and encoding issues in HTML pages.

Escape special characters

In jsoup, you can use the text() method to get the text content of an element. This method returns decoded text: character entities in the HTML source are converted back into the characters they stand for. For example:

Element element = doc.select("p").first();
String text = element.text();

In this example, if the <p> tag contains escaped characters such as &lt;, &gt;, and &amp;, the text() method returns them as <, >, and &. To get the markup with the entities still escaped, use the html() method instead.

In addition, if we need to manually escape special characters, we can use the static escape() method of the Entities class. For example:

String escaped = Entities.escape("<p>Hello, world!</p>");
System.out.println(escaped); // output: &lt;p&gt;Hello, world!&lt;/p&gt;

In this example, the Entities.escape() method escapes the <p> tag into &lt;p&gt;.
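
The two directions can be checked side by side. This is a minimal sketch under the assumption that the default output settings are used. The class name EscapeDemo and the sample strings are illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Entities;

public class EscapeDemo {

    // Returns the decoded text of the first <p> in the given HTML,
    // or an empty string when the HTML has no <p> element.
    public static String firstParagraphText(String html) {
        Element p = Jsoup.parse(html).select("p").first();
        return p == null ? "" : p.text();
    }

    public static void main(String[] args) {
        String html = "<p>1 &lt; 2 &amp; 3</p>";

        // text() decodes entities into plain characters
        System.out.println(firstParagraphText(html));

        // html() returns markup, so reserved characters stay escaped
        System.out.println(Jsoup.parse(html).select("p").first().html());

        // Entities.escape() goes the other way: plain text to HTML-safe text
        System.out.println(Entities.escape("a < b & c"));
    }
}
```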

Handling encoding issues

When using jsoup to parse HTML pages, if the page is not UTF-8 and does not declare its encoding correctly, you need to specify the encoding yourself, otherwise garbled characters will appear. For input, you can override the encoding with the charset() method of the Connection.Response object, or pass the charset name to Jsoup.parse(InputStream, charsetName, baseUri). The outputSettings() method of the Document object controls the encoding used when the document is written back out as HTML. For example:

Document doc = Jsoup.connect(url)
                .timeout(5000)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .get();
doc.outputSettings().charset("GBK");

In this example, the outputSettings().charset() method sets the output encoding to GBK. It does not change how the fetched page was decoded. For ISO-8859-1 or other encodings, you can also pass a Charset object obtained from the Charset.forName() method. For example:

Document doc = Jsoup.connect(url)
                .timeout(5000)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .get();
doc.outputSettings().charset(Charset.forName("ISO-8859-1"));

It should be noted that if the <meta> tag is used in the page to specify the encoding format, jsoup will automatically recognize and use this encoding format. If no encoding format is specified in the response header or on the page, jsoup will use UTF-8 encoding by default.
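
When automatic detection is not enough, the input encoding can be stated explicitly. The sketch below decodes raw bytes with Jsoup.parse(InputStream, charsetName, baseUri). The class name CharsetDemo, the sample text, and the empty base URI are assumptions for the illustration, and ISO-8859-1 is used only because every Java runtime supports it.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Decodes raw HTML bytes with an explicit charset instead of relying on detection.
    public static String bodyText(byte[] rawHtml, String charsetName) {
        try (InputStream in = new ByteArrayInputStream(rawHtml)) {
            Document doc = Jsoup.parse(in, charsetName, "");
            return doc.body().text();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "caf\u00e9" stored as ISO-8859-1 bytes, with no <meta> tag to help detection
        byte[] latin1 = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(bodyText(latin1, "ISO-8859-1"));
    }
}
```

The same method works for bytes read from a file or from an HTTP client other than jsoup's own Connection.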

6. jsoup crawler in practice: Capturing Douban movie information as an example

1. Analyze the target website

First, we need to analyze the target website to determine the content and URL to crawl.

In this example, we want to grab the name, rating, director, lead actors, year, and poster image URL of each movie on the Douban Movie Top250 page.

URL: https://movie.douban.com/top250

2. Send a request and parse the HTML page

Use jsoup to send the request and get the HTML page. Here we use the Jsoup.connect() method to send a GET request, which returns the parsed page as a Document object.

String url = "https://movie.douban.com/top250";
Document doc = Jsoup.connect(url)
                .timeout(5000)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .get();

In this example, we use the timeout() method to set the request timeout in milliseconds, and the userAgent() method to set the User-Agent header so that the server treats the request like normal browser access.

3. Parse the HTML page

By analyzing the HTML page of the target website, we can use the selector and API provided by jsoup to obtain the required information.

Elements movieList = doc.select("ol.grid_view li");
for (Element movie : movieList) {
    String title = movie.select("div.hd a span.title").text();
    String rate = movie.select("div.star span.rating_num").text();
    String year = movie.select("div.bd p span.year").text();
    String directors = movie.select("div.bd p:first-child").text();
    String actors = movie.select("div.bd p:nth-child(2)").text();
    String imgUrl = movie.select("div.pic img").attr("src");

    System.out.println(title + " / " + rate + " / " + year + " / " + directors + " / " + actors + " / " + imgUrl);
}

In this example, we use the select() method to get the HTML element of each movie, and then use selectors within each element to pick out the required information. The text() method gets the text content of an element, and the attr() method gets the value of an attribute.

4. Store data

After capturing the required data, we can choose to store the data in a database, file or other storage media. In this example, we simply output the data to the console.
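
As one possible way to store the data in a file, the following sketch appends each record to a CSV file using only the JDK. The class name CsvStore, the file name movies.csv, and the sample values are assumptions for illustration. The original example only prints to the console.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Collections;

public class CsvStore {

    // Joins fields into one CSV line, quoting each field and doubling embedded quotes.
    public static String toCsvLine(String... fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append('"').append(fields[i].replace("\"", "\"\"")).append('"');
        }
        return line.toString();
    }

    // Appends one record to the file, creating the file on first use.
    public static void append(Path file, String... fields) {
        try {
            Files.write(file, Collections.singletonList(toCsvLine(fields)), StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder values; in the crawler these would be title, rate, year, and so on
        append(Paths.get("movies.csv"), "Example Title", "9.0", "1994");
    }
}
```

Quoting every field keeps titles that contain commas or quotation marks from breaking the column layout.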

5. Complete code

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class DoubanSpider {
    public static void main(String[] args) throws IOException {
        String url = "https://movie.douban.com/top250";
        Document doc = Jsoup.connect(url)
                .timeout(5000)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .get();
        Elements movieList = doc.select("ol.grid_view li");
        for (Element movie : movieList) {
            String title = movie.select("div.hd a span.title").text();
            String rate = movie.select("div.star span.rating_num").text();
            String year = movie.select("div.bd p span.year").text();
            String directors = movie.select("div.bd p:first-child").text();
            String actors = movie.select("div.bd p:nth-child(2)").text();
            String imgUrl = movie.select("div.pic img").attr("src");
            System.out.println(title + " / " + rate + " / " + year + " / " + directors + " / " + actors + " / " + imgUrl);
        }
    }
}

Taking the Top250 pages of Douban movies as an example, we introduce the basic process of using jsoup to implement crawlers, including steps such as sending requests, parsing HTML pages, and processing data. Using jsoup, we can quickly and easily implement crawlers and obtain the required data. However, it should be noted that crawler behavior needs to comply with relevant laws and regulations and website regulations, and must not be used for commercial or illegal purposes.

7. Using jsoup to automatically crawl and update website content

Suppose we need to regularly fetch the latest articles from a website and update them to our website. We can use jsoup to achieve the following steps:

  1. Send the request

Use jsoup to send a request to get the page with the latest articles. For example:

Document doc = Jsoup.connect("https://www.example.com/latest-articles").get();

  2. Parse the HTML page

Parse the HTML page to get the list of articles. For example:

Elements articles = doc.select("div.article-list ul li");

  3. Process the data

Traverse the list of articles, obtain information such as article titles, links, and release times, and process them. For example:

for (Element article : articles) {
    String title = article.select("h3.title").text();
    String url = article.select("a.link").attr("href");
    String date = article.select("span.date").text();

    // Perform data processing, such as removing HTML tags, date formatting, etc.
    //...
}

  4. Store the data

Store the processed data in our database or file system. For example:

// Inside the loop above: store the processed data in the database
// (db stands for your own data access object)
db.save(title, url, date);

  5. Schedule the task

Use scheduled tasks to perform the above steps regularly to realize automatic crawling and updating of website content. For example:

ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(() -> {
    // Send request, parse HTML, process data, store data
    //...
}, 0, 1, TimeUnit.HOURS);

Through the above steps, we can realize the automatic crawling and updating of website content. It should be noted that crawler behavior needs to comply with relevant laws and regulations and website regulations, and must not be used for commercial or illegal purposes. At the same time, in order to avoid excessive burden on the target website, we need to control the frequency and amount of crawlers to avoid affecting the normal operation of the target website.
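
One detail of scheduled tasks deserves attention: if a run throws an exception, ScheduledExecutorService cancels all later runs. The sketch below wraps the crawl in a try/catch so that a failed run does not stop the schedule, and uses scheduleWithFixedDelay so that the waiting time is measured from the end of each run. The class name CrawlScheduler and its methods are hypothetical helpers, not part of jsoup.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CrawlScheduler {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicInteger finishedRuns = new AtomicInteger();

    // Runs crawlTask repeatedly, waiting `delay` between the end of one run and the start of the next.
    public void start(Runnable crawlTask, long delay, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                crawlTask.run();
            } catch (RuntimeException e) {
                // An exception escaping the task would cancel all future runs, so log and continue
                System.err.println("Crawl failed: " + e.getMessage());
            } finally {
                finishedRuns.incrementAndGet();
            }
        }, 0, delay, unit);
    }

    // Number of runs that have finished, whether they succeeded or failed.
    public int finishedRuns() {
        return finishedRuns.get();
    }

    // Stops scheduling and waits briefly for a running crawl to finish.
    public void stop() {
        scheduler.shutdown();
        try {
            scheduler.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CrawlScheduler crawler = new CrawlScheduler();
        crawler.start(() -> System.out.println("crawling..."), 1, TimeUnit.HOURS);
        Thread.sleep(1000); // let the first run happen, then stop for this demo
        crawler.stop();
    }
}
```

A fixed delay between runs also helps limit the load on the target website, because a slow crawl never overlaps with the next one.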

8. jsoup custom extension

1. Create a custom element

In this example, we will create a custom element <custom>, which has an attribute named attribute1 and can contain text content.

import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

public class CustomElement extends Element {
    private static final String TAG_NAME = "custom";
    private static final String ATTRIBUTE_NAME = "attribute1";

    public CustomElement(String baseUri, String text) {
        super(Tag.valueOf(TAG_NAME), baseUri);
        this.text(text);
    }

    public String getAttribute1() {
        return this.attr(ATTRIBUTE_NAME);
    }

    public void setAttribute1(String value) {
        this.attr(ATTRIBUTE_NAME, value);
    }
}

In the above example, we defined a constructor that calls the constructor of org.jsoup.nodes.Element, passing the tag name and base URI. Additionally, we call the text() method to set the passed text as the text content of the element.

We also added the getAttribute1() and setAttribute1() methods to read and set the value of the attribute1 attribute.

Now, we can create a custom element with the following code:

CustomElement custom = new CustomElement("http://example.com", "This is custom text");
custom.setAttribute1("custom attribute value");

We can add custom elements to the document as follows:

Document doc = Jsoup.parse("<html><body><div></div></body></html>");
Element div = doc.select("div").first();
div.appendChild(custom);

In the above example, we first parsed a simple HTML document using jsoup, then selected a <div> element in the document, and added the custom element as its child element.

We can now output the document as a string and see that our custom element has been properly serialized:

System.out.println(doc.outerHtml());
// Output (exact whitespace may vary with the jsoup version):
// <html>
//  <head></head>
//  <body>
//   <div>
//    <custom attribute1="custom attribute value">
//     This is custom text
//    </custom>
//   </div>
//  </body>
// </html>

2. Create a custom node visitor

We'll create a custom node visitor that prints each element's tag name and depth as it traverses the HTML document.

import org.jsoup.nodes.*;
import org.jsoup.select.NodeVisitor;

public class CustomNodeVisitor implements NodeVisitor {

    @Override
    public void head(Node node, int depth) {
        if (node instanceof Element) {
            Element element = (Element) node;
            System.out.println("Start element: " + element.tagName() + ", depth: " + depth);
        }
    }

    @Override
    public void tail(Node node, int depth) {
        if (node instanceof Element) {
            Element element = (Element) node;
            System.out.println("End element: " + element.tagName() + ", depth: " + depth);
        }
    }
}

In the above example, we implemented the org.jsoup.select.NodeVisitor interface and overrode the head() and tail() methods. In the head() method, we print the tag name and depth when an element is entered, and in the tail() method, we print the tag name and depth when it is left.

We can now apply the custom node visitor to an HTML document:

Document doc = Jsoup.parse("<html><body><div><p>Paragraph 1</p><p>Paragraph 2</p></div></body></html>");
doc.traverse(new CustomNodeVisitor());

In the above example, we first parsed a simple HTML document using jsoup, and then passed the custom node visitor to the traverse() method to visit each node in the document.

When we run this code, it prints a "Start element" line and an "End element" line for each element, each with the tag name and depth. Nested elements are reported at increasing depths, so the output shows where each element starts and ends. This can help us better understand the structure of HTML documents.
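
A visitor can also collect results instead of printing them. The following sketch builds an indented outline of the tags below the body element. The class name OutlineVisitor and the sample markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeVisitor;

import java.util.ArrayList;
import java.util.List;

public class OutlineVisitor implements NodeVisitor {

    private final List<String> lines = new ArrayList<>();

    @Override
    public void head(Node node, int depth) {
        if (node instanceof Element) {
            // Indent by one space per level of nesting
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                line.append(' ');
            }
            lines.add(line.append(((Element) node).tagName()).toString());
        }
    }

    @Override
    public void tail(Node node, int depth) {
        // Nothing to do when leaving a node
    }

    // Returns one line per element, starting at <body>, which is reported at depth 0.
    public static List<String> outline(String html) {
        OutlineVisitor visitor = new OutlineVisitor();
        Jsoup.parse(html).body().traverse(visitor);
        return visitor.lines;
    }

    public static void main(String[] args) {
        for (String line : outline("<div><p>Paragraph 1</p><p>Paragraph 2</p></div>")) {
            System.out.println(line);
        }
    }
}
```

The depth passed to head() is relative to the node on which traverse() was called, so starting at body() keeps the outline free of the document root and the html element.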

9. Performance optimization of jsoup

jsoup is already well optimized and tolerant of malformed HTML, but in practical applications developers still need to follow certain practices to get the best performance and robust error handling. Here are some performance optimization and error handling tips for jsoup:

  1. Use connection pooling: Connection pooling reuses connections, which reduces the cost of creating and destroying them. jsoup's own Connection does not expose a pool, so a common approach is to fetch pages with an HTTP client that pools connections and pass the response body to Jsoup.parse().
  2. Keep selectors simple: Selectors are one of the important features of jsoup, but complex selectors take longer to evaluate. Where possible, prefer tag names and class names over attribute and pseudo-class selectors, and call select() on a sub-element instead of the whole document.
  3. Cache parsing results: When the same page needs to be parsed multiple times, the parsing results can be cached to avoid repeated parsing and improve performance.
  4. Error handling: jsoup provides several error handling options. For example, Connection.get() throws an IOException (such as HttpStatusException for an HTTP error status like 404) unless ignoreHttpErrors(true) is set, and the parser can record parse errors when tracking is enabled with Parser.setTrackErrors(). Choose the mechanism that fits your situation to avoid unexpected errors in the program.
  5. Avoid memory leaks: A parsed Document holds the whole page in memory. When parsing a large number of HTML pages, do not keep references to Document objects you no longer need, so that the garbage collector can reclaim them.
  6. Use asynchronous IO: jsoup's own requests are blocking. When processing a large number of HTML pages, you can download them with asynchronous IO or a thread pool and hand the content to jsoup for parsing, which separates IO operations from business logic and improves performance.
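
As an illustration of the caching tip, here is a minimal sketch that parses each page at most once and reuses the resulting Document. The class name ParseCache and the choice of a string key (for example the page URL) are assumptions. A production cache would also need a size limit or expiry.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class ParseCache {

    private final Map<String, Document> cache = new ConcurrentHashMap<>();
    private final AtomicInteger parseCount = new AtomicInteger();

    // Returns the cached Document for key, parsing html only on the first request.
    public Document get(String key, String html) {
        return cache.computeIfAbsent(key, k -> {
            parseCount.incrementAndGet();
            return Jsoup.parse(html);
        });
    }

    // Number of times Jsoup.parse() was actually called.
    public int parseCount() {
        return parseCount.get();
    }

    // Drops all entries so the cached Documents can be garbage collected.
    public void clear() {
        cache.clear();
    }

    public static void main(String[] args) {
        ParseCache cache = new ParseCache();
        cache.get("https://www.example.com/", "<p>hello</p>");
        cache.get("https://www.example.com/", "<p>hello</p>");
        System.out.println("parse calls: " + cache.parseCount());
    }
}
```

Because Document objects are mutable, callers that modify a cached document should work on a copy obtained with clone().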


syntaxbug.com © 2021 All Rights Reserved.