Teach you to use jsoup to crawl novel website resources from scratch

Preface: While working on an e-book project, manually searching for novel resources was too troublesome, so I decided to crawl them from other websites instead. Once you have learned jsoup, you will find that the real work is an exercise in front-end technology, so here is a recommended blog for friends whose front-end knowledge is not yet solid:

The basic use of jsoup and its API (jsoup Chinese documentation) - shijialeya's blog on CSDN

First experience

1. Import dependencies:

<!-- Maven coordinate address -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

2. Call Jsoup's static connect method to create a connection, passing the target website to crawl as the parameter:

public class Demo {
    public static void main(String[] args) {
        Connection connect = Jsoup.connect("https://www.cnblogs.com/hanzhe");
        // Create a connection to the target URL
    }
}

To reduce the chance of the crawler being blocked, a request header is set here to imitate a browser client; you can copy the values from a real browser request and add more headers as needed, for example:

public class Demo {
    public static void main(String[] args) {
        Connection connect = Jsoup.connect("https://www.cnblogs.com/hanzhe")
                // I only set one here, if you encounter problems with crawling, you can add header information at any time
                .header("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56") ;
    }
}

3. Then call the execute method to start crawling, and retrieve the crawled data through body:

public class Demo {
    public static void main(String[] args) throws IOException {
        Connection connect = Jsoup.connect("https://www.cnblogs.com/hanzhe")
                .header("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56") ;
        String body = connect. execute(). body();
        System.out.println(body);
    }
}

connect.execute().body(): returns the body as a string, i.e. the entire page source of the requested URL converted into a String.
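For example, since body() returns a plain String, the raw page source can be written straight to a file for inspection. A minimal sketch (assuming the usual jsoup and java.io imports; page.html is an arbitrary output path and the user-agent value is shortened here):

public class SaveRawDemo {
    public static void main(String[] args) throws IOException {
        String body = Jsoup.connect("https://www.cnblogs.com/hanzhe")
                .header("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...")
                .execute().body();
        // body is an ordinary String, so it can be written straight to disk
        try (FileWriter out = new FileWriter("page.html")) {
            out.write(body);
        }
    }
}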

4. But as mentioned before, jsoup can extract webpage content much like operating on it with JS, so we need to parse the response before we can pick out the content we want:

public class Demo {
    public static void main(String[] args) throws IOException {
        Connection connect = Jsoup.connect("https://www.cnblogs.com/hanzhe")
                .header("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56") ;
        // Use the parse function to parse the crawled content
        Element body = connect. execute(). parse(). body();
        System.out.println(body);
    }
}

connect.execute().parse().body(): gets the body of the parsed page as an Element object, on which you can perform DOM operations just like in JS.

It is obvious that the parsed HTML has been formatted and looks much neater, and the return value has changed from a String to an Element instance; content filtering can be done by operating on that instance.
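As a side note, jsoup can also fetch and parse in one step: connect(url).get() returns the parsed Document directly, and body() on that Document gives the same Element. A minimal sketch of the equivalent call (same sample URL, user-agent value shortened here):

public class Demo {
    public static void main(String[] args) throws IOException {
        // get() sends the request and returns the parsed Document in one step
        Document document = Jsoup.connect("https://www.cnblogs.com/hanzhe")
                .header("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...")
                .get();
        Element body = document.body();
        System.out.println(body);
    }
}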

5. Test: crawl the blog's post titles:

Open the F12 developer tools and look at the elements related to the post titles:

We can see that each post title carries the two classes postTitle2 and vertical-middle, so we can use a selector to find all the titles.

6. Now we use these two classes in jsoup to crawl all the post titles:

public class Demo {
    public static void main(String[] args) throws IOException {
        Connection connect = Jsoup.connect("https://www.cnblogs.com/hanzhe")
                .header("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56") ;
        Element body = connect. execute(). parse(). body();
        Elements elementsByClass = body.getElementsByClass("postTitle2 vertical-middle");
        elementsByClass.forEach(item->{
            System.out.println(item.text());
        });
    }
}

We can perform JS code operations on Element objects
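The class-based lookup above can also be written with jsoup's CSS selector API, which mirrors the selector we found in the developer tools. A minimal sketch (body is the Element from the previous example):

// Sketch: the same titles fetched with a CSS selector instead of getElementsByClass
Elements titles = body.select(".postTitle2.vertical-middle");
titles.forEach(item -> System.out.println(item.text()));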

So far, we have a preliminary understanding of jsoup:

To put it bluntly, once the page source has been grabbed, working with it is really a test of our front-end knowledge.

Actual Combat – Crawling Favorite Novels

Process:

① Obtain the URL of the search bar, to which we will append the search keyword

② Obtain the URL of the novel's page from the search results

③ Start crawling the novel

package org.claris.core.utils;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.internal.StringUtil;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;

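/**
 * Utility class: searches www.iktxt.com for a keyword, takes the first result,
 * then follows the "next chapter" links and writes every chapter into a local
 * txt file (D:/<book title>.txt).
 */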
public class HtmlParseUtil {

    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    static class novel {

        //cover path
        private String bookImage;

        //path
        private String bookUrl;

        //book title
        private String bookName;

        //author
        private String bookAuthor;

        //Introduction
        private String bookInfo;

        //type
        private String bookType;

        //reading volume
        private Long visitCount;
    }

    //Name of the target website to crawl
    private static String novel_name = "";


    //URL of the search endpoint; the search keyword is appended to it
    private static String searchUrl = "http://www.iktxt.com/search.html?keywords=";

    //Base URL used to build a novel's detail-page address
    private static String bookUrl = "http://www.iktxt.com";

    //Base URL used to build a chapter's reading address
    private static String readUrl = "http://www.iktxt.com";


    //Output file name (usually book title)
    private static String fileName = "";

    //Indentation placed before each chapter title
    private static String space = " ";

    // file output stream
    private static FileWriter writer;

    //counter
    private static int pageCount = 1;

    public static void main(String[] args) throws IOException {
        parse("Fights Break the Sky");
    }

    /**
     * Entry point: search for the keyword and crawl the first matching novel into a txt file
     *
     * @param keyword search keyword (book title)
     * @throws IOException
     */
    public static void parse(String keyword) throws IOException {
        //Enter keywords to get book information
        List<novel> novelList = getBookList(keyword);

        //Initialize the file output stream
        getWriter();
        long start = System.currentTimeMillis();
        if (novelList != null && !novelList.isEmpty()) {
            //Take the first search result and crawl its chapters in a loop
            String url = novelList.get(0).bookUrl;
            do {
                Element element = nextPage(url);
                outputToFile(element);
                url = hasNext(element);
            } while (url != null);
            writer.close();
            long time = (System.currentTimeMillis() - start) / 1000;
            System.out.println("\n\nSuccessfully crawled all chapters! Time taken: " + time + " seconds");
        }
    }


    /**
     * Get book information
     *
     * @param keyword search keyword
     * @throws IOException
     */
    public static List<novel> getBookList(String keyword) throws IOException {
        //Splice search bar URL address
        searchUrl = searchUrl + keyword;
        //Get the current search result page
        Elements elements = nextPage(searchUrl).getElementsByClass("rt_750 f-lt");

        List<Element> elementList = elements.stream().collect(Collectors.toList());

        List<novel> novelList = elementList.stream().map(
                element -> {
                    //Create a fresh novel instance for each search result
                    novel novel = new novel();
                    // get the cover
                    String imageUrl = element
                            .getElementsByTag("img")
                            .attr("src");
                    //Set the cover
                    novel.setBookImage(imageUrl);

                    // get the path
                    String path = element
                            .getElementsByClass("t")
                            .get(0).getElementsByTag("a")
                            .attr("href");
                    // set the path
                    try {
                        String readPath = getReadPath(bookUrl + path);
                        novel.setBookUrl(readPath);
                    } catch (IOException e) {
                        e.printStackTrace();
                    }

                    // get book title
                    String book_name = element
                            .getElementsByClass("t")
                            .get(0)
                            .getElementsByTag("span")
                            .text();
                    //Set book title
                    fileName = book_name;
                    novel.setBookName(book_name);

                    // Get the author
                    String author = element
                            .getElementsByClass("author")
                            .get(0)
                            .getElementsByTag("span")
                            .text();
                    // set the author
                    novel.setBookAuthor(author);

                    // Get profile
                    String info = element
                            .getElementsByClass("intro")
                            .text();
                    //Set profile
                    novel.setBookInfo(info);

                    //get type
                    String type = element
                            .getElementsByClass("author")
                            .get(0)
                            .getElementsByTag("a")
                            .text();
                    //set type
                    novel.setBookType(type);

                    //Get reading volume
                    Long visitCount = Long.valueOf(element
                            .getElementsByClass("update")
                            .get(0)
                            .getElementsByTag("small")
                            .text()
                            .replace("Read(", "")
                            .replace(")", ""));
                    //Set the reading volume
                    novel.setVisitCount(visitCount);

                    return novel;
                }
        ).collect(Collectors.toList());
        System.out.println("novelList = " + novelList);
        return novelList;
    }

    /**
     * Get the reading address URL
     *
     * @param url url after search
     * @return read address URL
     */
    public static String getReadPath(String url) throws IOException {
        String path = nextPage(url)
                .getElementsByClass("item")
                .get(0)
                .getElementsByTag("a")
                .attr("href");
        return readUrl + path;
    }


    //Get the output stream
    public static void getWriter() throws IOException {
        String path = "D:/" + fileName + ".txt";
        File file = new File(path);
        if (file.exists()) {
            System.out.println("The target book already exists! Please modify the file name or delete the original book " + path);
            System.exit(0);
        }
        writer = new FileWriter(file);
    }


    // Fetch a page and return its parsed body element
    private static Element nextPage(String url) throws IOException {
        // Obtain a connection instance and forge the identity of the browser
        Connection conn = Jsoup.connect(url)
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9")
                .header("Accept-Encoding", "gzip, deflate")
                .header("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6")
                .header("Cache-Control", "max-age=0")
                .header("Connection", "keep-alive")
                .header("Host", url)
                .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56") ;
        return conn.execute().parse().body();
    }

    // Get the title of the current chapter
    private static String getTitle(Element element) {
        //The chapter title sits in an <h1> inside the element with class "title"
        return element.select(".title h1").text();
    }

    // Get the specific content of the chapter
    public static String getContent(Element element) {
        String content = element
                .getElementsByClass("entry content_yh")
                .get(0)
                .getElementsByTag("p")
                .html();
        return content;
    }

    //Is there a next page? If yes, return the URL address of the next page, if not, return NULL
    private static String hasNext(Element element) {

        // Get the address of the next chapter
        String attr = element
                .getElementsByClass("page_box")
                .get(0)
                .getElementsByTag("li")
                .get(2)
                .getElementsByTag("a")
                .attr("onclick")
                .replace("window.location.href='", "")
                .replace("'", "");
        //When the next chapter exists its URL ends with .html; when it does not, the link points back to the home page.
        //Use that to decide: return null when there is no next chapter so the loop in parse() terminates.
        System.out.println("attr = " + attr);
        if (!attr.endsWith(".html")) {
            return null;
        }
        return readUrl + attr;
    }


    // output to file
    public static void outputToFile(Element element) throws IOException {
        String title = getTitle(element);
        String content = getContent(element);
        String text = space + title + "\r\n\r\n" + content;
        writer.write(text);
        writer.flush();
        System.out.printf("===>[%s] crawling completed, crawling the next chapter (chapter %d)%n", title, pageCount++);
    }


}
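One caveat: getBookList splices the raw keyword onto searchUrl as-is. If the keyword contains spaces or non-ASCII characters (such as a Chinese book title), it may need to be URL-encoded first, depending on what the target site expects. A minimal standalone sketch, assuming UTF-8 encoding:

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class KeywordEncodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Encode the keyword so spaces and non-ASCII characters survive being spliced into the URL
        String keyword = "Fights Break the Sky";
        String encoded = URLEncoder.encode(keyword, "UTF-8");
        System.out.println("http://www.iktxt.com/search.html?keywords=" + encoded);
    }
}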