Preface: While building an e-book project, I found it too troublesome to search for novel resources by hand, so I decided to crawl them from other websites. Once you learn jsoup, you will find that it is really a test of front-end skills, so here is a recommended blog for readers whose front-end fundamentals are shaky:
Basic usage of jsoup and its API (jsoup Chinese documentation) — shijialeya's blog, CSDN
First experience
1. Import the dependency:

```xml
<!-- Maven coordinates -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
```
2. Call Jsoup's static `connect` method to create a connection, passing the target website to crawl as the parameter:

```java
public class Demo {
    public static void main(String[] args) {
        // Create a connection to the target URL
        Connection connect = Jsoup.connect("https://www.cnblogs.com/hanzhe");
    }
}
```
To keep the crawler from being blocked, set request headers that imitate a browser client; you can copy them from a real request in the developer tools, for example:

```java
public class Demo {
    public static void main(String[] args) {
        Connection connect = Jsoup.connect("https://www.cnblogs.com/hanzhe")
                // Only one header is set here; add more at any time if the crawl runs into problems
                .header("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56");
    }
}
```
3. Then call the `execute` method to start crawling, and retrieve the crawled data through `body`:

```java
public class Demo {
    public static void main(String[] args) throws IOException {
        Connection connect = Jsoup.connect("https://www.cnblogs.com/hanzhe")
                .header("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56");
        String body = connect.execute().body();
        System.out.println(body);
    }
}
```
connect.execute().body() returns the entire page source of the target URL as a single string.
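As an aside, a raw HTML string like this can be handed to `Jsoup.parse` at any time to get a traversable `Document`, with no network call involved. A minimal sketch, using a hand-written HTML literal that stands in for a crawled page:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseStringDemo {
    public static void main(String[] args) {
        // A small HTML fragment standing in for a crawled page body
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>hello jsoup</p></body></html>";
        // Jsoup.parse turns the raw string into a traversable Document
        Document doc = Jsoup.parse(html);
        System.out.println(doc.title());            // Demo
        System.out.println(doc.select("p").text()); // hello jsoup
    }
}
```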
4. As mentioned earlier, Jsoup can manipulate page content much like JavaScript operates on the DOM, so we need to parse the crawled content before extracting from it:

```java
public class Demo {
    public static void main(String[] args) throws IOException {
        Connection connect = Jsoup.connect("https://www.cnblogs.com/hanzhe")
                .header("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56");
        // Use the parse method to parse the crawled content
        Element body = connect.execute().parse().body();
        System.out.println(body);
    }
}
```
connect.execute().parse().body() returns the Element object for the page's body, on which you can perform DOM-style operations much as you would in JavaScript.
Notice that the parsed HTML is formatted and looks very tidy, and the return value has changed from a String to an Element instance; content filtering is done by operating on that instance.
5. Test-crawl the blog's essay list:
Open the F12 developer tools and look at the markup around each essay title:
Each essay title is decorated with the two classes postTitle2 and vertical-middle, so we can use a selector to find all the titles:
6. Use this selector in Jsoup to crawl all the essay titles:

```java
public class Demo {
    public static void main(String[] args) throws IOException {
        Connection connect = Jsoup.connect("https://www.cnblogs.com/hanzhe")
                .header("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56");
        Element body = connect.execute().parse().body();
        Elements elementsByClass = body.getElementsByClass("postTitle2 vertical-middle");
        elementsByClass.forEach(item -> System.out.println(item.text()));
    }
}
```
Element objects support DOM-style operations much like JavaScript.
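To try these DOM-style operations without touching the network, here is a small sketch against a hand-written fragment that mimics the essay-title markup above (the class names come from the real page; the links and titles are made up):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorDemo {
    public static void main(String[] args) {
        // A fragment imitating the blog's essay-title markup
        String html = "<body>"
                + "<a class='postTitle2 vertical-middle' href='/p/1'>First post</a>"
                + "<a class='postTitle2 vertical-middle' href='/p/2'>Second post</a>"
                + "</body>";
        Element body = Jsoup.parse(html).body();
        // The CSS selector ".postTitle2.vertical-middle" matches elements
        // that carry both classes
        Elements titles = body.select(".postTitle2.vertical-middle");
        titles.forEach(item -> System.out.println(item.text()));
    }
}
```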
That completes a first look at jsoup. To put it bluntly: once we have grabbed the page source, working with it is really a test of our front-end knowledge.
In Practice: Crawling a Favorite Novel
Process:
① Get the URL of the search page and search it with a keyword
② From the search results, get the URL of the novel's page
③ Start crawling the novel
```java
package org.claris.core.utils;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;

public class HtmlParseUtil {

    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    static class novel {
        // cover path
        private String bookImage;
        // path
        private String bookUrl;
        // book title
        private String bookName;
        // author
        private String bookAuthor;
        // introduction
        private String bookInfo;
        // type
        private String bookType;
        // reading volume
        private Long visitCount;
    }

    // name of the crawler's target website
    private static String novel_name = "";
    // URL of the search bar; the keyword is appended for retrieval
    private static String searchUrl = "http://www.iktxt.com/search.html?keywords=";
    // URL of the novel web page
    private static String bookUrl = "http://www.iktxt.com";
    // URL of the reading address
    private static String readUrl = "http://www.iktxt.com";
    // output file name (usually the book title)
    private static String fileName = "";
    // indentation: four spaces
    private static String space = "    ";
    // file output stream
    private static FileWriter writer;
    // counter
    private static int pageCount = 1;

    public static void main(String[] args) throws IOException {
        parse("Fights Break the Sky");
    }

    /**
     * Run the crawler
     *
     * @param keyword search keyword
     */
    public static void parse(String keyword) throws IOException {
        // search by keyword to get the book information
        List<novel> novelList = getBookList(keyword);
        // initialize the writer
        getWriter();
        long l = System.currentTimeMillis();
        if (novelList != null && !novelList.isEmpty()) {
            String url = novelList.get(0).bookUrl;
            // crawl the chapters in a loop
            do {
                Element element = nextPage(url);
                outputToFile(element);
                url = hasNext(element);
            } while (url != null);
            writer.close();
            long time = (System.currentTimeMillis() - l) / 1000;
            System.out.println("\n\nSuccessfully crawled all chapters! Took " + time + " seconds");
        }
    }

    /**
     * Get book information
     *
     * @param keyword search keyword
     */
    public static List<novel> getBookList(String keyword) throws IOException {
        // splice the keyword onto the search-bar URL
        searchUrl = searchUrl + keyword;
        // fetch the current search-result page
        Elements elements = nextPage(searchUrl).getElementsByClass("rt_750 f-lt");
        List<Element> elementList = elements.stream().collect(Collectors.toList());
        List<novel> novelList = elementList.stream().map(element -> {
            // one novel object per search result
            novel novel = new novel();
            // get and set the cover
            String imageUrl = element.getElementsByTag("img").attr("src");
            novel.setBookImage(imageUrl);
            // get and set the path
            String path = element.getElementsByClass("t").get(0)
                    .getElementsByTag("a").attr("href");
            try {
                String readPath = getReadPath(bookUrl + path);
                novel.setBookUrl(readPath);
            } catch (IOException e) {
                e.printStackTrace();
            }
            // get and set the book title
            String book_name = element.getElementsByClass("t").get(0)
                    .getElementsByTag("span").text();
            fileName = book_name;
            novel.setBookName(book_name);
            // get and set the author
            String author = element.getElementsByClass("author").get(0)
                    .getElementsByTag("span").text();
            novel.setBookAuthor(author);
            // get and set the introduction
            String info = element.getElementsByClass("intro").text();
            novel.setBookInfo(info);
            // get and set the type
            String type = element.getElementsByClass("author").get(0)
                    .getElementsByTag("a").text();
            novel.setBookType(type);
            // get and set the reading volume
            Long visitCount = Long.valueOf(element.getElementsByClass("update").get(0)
                    .getElementsByTag("small").text()
                    .replace("Read(", "")
                    .replace(")", ""));
            novel.setVisitCount(visitCount);
            return novel;
        }).collect(Collectors.toList());
        System.out.println("novelList = " + novelList);
        return novelList;
    }

    /**
     * Get the reading-address URL
     *
     * @param url URL of the book page found by the search
     * @return URL of the reading address
     */
    public static String getReadPath(String url) throws IOException {
        String path = nextPage(url)
                .getElementsByClass("item").get(0)
                .getElementsByTag("a").attr("href");
        return readUrl + path;
    }

    // open the output stream
    public static void getWriter() throws IOException {
        String path = "D:/" + fileName + ".txt";
        File file = new File(path);
        if (file.exists()) {
            System.out.println("The target book already exists! Please change the file name or delete the original book: " + path);
            System.exit(0);
        }
        writer = new FileWriter(file);
    }

    // crawl one page
    private static Element nextPage(String url) throws IOException {
        // obtain a connection instance and imitate a browser client
        Connection conn = Jsoup.connect(url)
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9")
                .header("Accept-Encoding", "gzip, deflate")
                .header("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6")
                .header("Cache-Control", "max-age=0")
                .header("Connection", "keep-alive")
                .header("Host", url)
                .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56");
        return conn.execute().parse().body();
    }

    // get the title of the current chapter
    private static String getTitle(Element element) {
        return element.select(".title h1").text();
    }

    // get the chapter content
    public static String getContent(Element element) {
        return element.getElementsByClass("entry content_yh").get(0)
                .getElementsByTag("p").html();
    }

    // Is there a next page? Return its URL if so, otherwise null.
    private static String hasNext(Element element) {
        // get the address of the next chapter
        String attr = element.getElementsByClass("page_box").get(0)
                .getElementsByTag("li").get(2)
                .getElementsByTag("a").attr("onclick")
                .replace("window.location.href='", "")
                .replace("'", "");
        // Observation: when a next chapter exists the URL ends with .html;
        // otherwise the link jumps back to the home page. Use this to decide
        // whether there is a next chapter.
        System.out.println("attr = " + attr);
        if (!attr.endsWith(".html")) {
            return null;
        }
        return readUrl + attr;
    }

    // output to file
    public static void outputToFile(Element element) throws IOException {
        String title = getTitle(element);
        String content = getContent(element);
        String text = space + title + "\r\n\r\n" + content;
        writer.write(text);
        writer.flush();
        System.out.printf("===> [%s] crawled, moving on to the next chapter (count %d)%n", title, pageCount++);
    }
}
```
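The stopping rule in hasNext boils down to a string check on the onclick target: a next chapter ends with ".html", while anything else means the site redirected to its home page. A minimal, stdlib-only sketch of that rule (the method name and sample paths are hypothetical):

```java
public class NextPageCheck {
    // Returns the absolute URL of the next chapter, or null when the
    // onclick target does not end with ".html" (i.e. no next chapter)
    static String nextUrlOrNull(String attr, String baseUrl) {
        return attr.endsWith(".html") ? baseUrl + attr : null;
    }

    public static void main(String[] args) {
        System.out.println(nextUrlOrNull("/read/123/2.html", "http://www.iktxt.com"));
        // -> http://www.iktxt.com/read/123/2.html
        System.out.println(nextUrlOrNull("/", "http://www.iktxt.com"));
        // -> null
    }
}
```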