Crawling webpage data with Jsoup on Android, its limitations, and the idea of crawling data through an API

1. Jsoup

jsoup is a Java HTML parser that can parse a URL address or HTML text content directly. It provides a very convenient API for retrieving and manipulating data through DOM methods, CSS selectors, and jQuery-like operations.

The requirement: fetch the leaderboard data from a certain website and display it in an app, so the Jsoup framework came to mind.

There are actually plenty of Jsoup blog posts online, and many are quite good, but they differ in places and some are simply wrong. I still recommend learning from the official website; the material is short and very simple:

Load a Document from a URL: jsoup Java HTML parser

It covers loading a document from an HTML string, a URL, or a file, and extracting and manipulating data through DOM methods, CSS selectors, and jQuery-like operations. It is detailed and simple. It is all in English; if that is a problem, install a translation plug-in or let the browser translate it for you.
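As a quick taste of that API, here is a minimal sketch (it assumes the jsoup dependency is on the classpath, e.g. `org.jsoup:jsoup:1.17.2`; the HTML string is made up for illustration):

```kotlin
import org.jsoup.Jsoup

fun main() {
    // Parse an HTML string directly; no network access needed
    val doc = Jsoup.parse("<html><body><p class=\"msg\">Hello, jsoup</p></body></html>")

    // Select elements with a CSS selector, jQuery-style
    val text = doc.select("p.msg").text()
    println(text) // prints "Hello, jsoup"

    // Loading from a URL instead would look like:
    // val doc = Jsoup.connect("https://example.com/").get()
}
```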

I won't go over every detail here; most of it is basic. Let's look at an example: fetching the artist data (rank, cover, name) for the top 100 singers on the 2022 Billboard year-end chart.

Top Artists – Billboard

Implementation code (Kotlin):

1. Use the getListData method to get the singer information list;

2. Dispatchers.IO – This dispatcher is optimized for performing disk or network I/O outside of the main thread. Examples include using Room components, reading data from or writing data to files, and running any network operations.

3. Jsoup.connect(url).get() returns the Document object. Once you have it, you can use its methods to pull out the data you want.

 fun getListData(result: RequestCallBack<MutableList<ArtistInfo>>) {
        CoroutineScope(Dispatchers.IO).launch {
            val artistList = mutableListOf<ArtistInfo>()
            val url = "https://www.billboard.com/charts/year-end/top-artists/"
            try {
                val document = Jsoup.connect(url).get()
                Log.d("TTTT", "jsoup:$document")
                val listDoc: Elements =
                    document.getElementsByClass("o-chart-results-list-row-container")
                Log.d("TTTT", "jsoup:size ${listDoc.size}")
                val artistSize = if (listDoc.size > 50) 50 else listDoc.size
                for (i in 0 until artistSize) {
                    val sortNum = listDoc[i].select("span").text()
                    Log.d("TTTT", "jsoup:$sortNum")
                    val artistName = listDoc[i].select("h3").text()
                    Log.d("TTTT", "jsoup:$artistName")
                    val img = listDoc[i].select("img").attr("data-lazy-src")
                    Log.d("TTTT", "jsoup:$img")
                    val artistInfo = ArtistInfo().apply {
                        this.artist = artistName
                        this.rank = sortNum.toIntOrNull() ?: 0 // guard against non-numeric text
                        this.coverOnline = img
                    }
                    artistList.add(artistInfo) // add to the list
                    // Insert each artist into the database, using greenDAO
                    App.getInstance().getArtistDao().insertOrReplace(artistInfo)
                }
                result.success(artistList) // return the artist list to be displayed on the UI

            } catch (exception: IOException) {
                exception.printStackTrace()
            }
        }
    }

Of course, how do you know which tag and which class the page's data sits in? Open the browser's DevTools with F12 and look.

1. First, inspect the current page's source in the Elements tab. Click the picker arrow in the corner, then click an element on the page on the left, and DevTools jumps to where that element's data lives.

2. In the Network tab, click a request and select Response to see the data that request returned and observe its structure (there are JS, JSON and other data formats).

The tabs in this panel are very useful: you can see the request method, the parameters, a preview of the page, the returned data, and so on.

2. Limitations

Jsoup made me very happy at first. For most of the domestic websites I crawled, most of the data could be fetched.

However, on some webpages certain data could not be fetched, and the data was clearly not delivered in a single response. Pressing F12 showed the page firing many requests in parallel and then rendering the results. So can Jsoup really get the data of dynamic web pages? Searching for "Jsoup dynamic webpages" turned up a pile of articles, but in every one of them the crawled data arrived in a single response, which was really confusing. Does Jsoup have limitations?

So I personally think the Jsoup crawler has a real limitation: it crawls static web pages, which means that without extra work you can only get the part of the data shipped with the initial HTML of the current page. If you need all of a website's data, it may be easier to call the site's own API with the correct parameters and get the proper data that way.

Emmmmm. When crawling one site's music data, I did not stick with Jsoup for long. In the end I used F12 to find the API, and fetched the data through the API with the right parameters.

Fetching data through the API is more cumbersome, but as long as the website has no anti-crawler measures, getting the data is still no problem.

Tip: Jsoup may be able to fetch data from dynamic web pages if you configure the request body (data) and the headers, but personally that felt like a lot of trouble, and there were many pages to crawl, so I chose the API route.
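For reference, a hedged sketch of what that configuration could look like with Jsoup's connection builder. The URL, header values, cookie, and form field below are all made-up placeholders, not values any particular site requires:

```kotlin
import org.jsoup.Jsoup

fun fetchWithConfig() {
    // Sketch: sending custom headers and form data with Jsoup.
    // Every concrete value here is a placeholder for illustration.
    val doc = Jsoup.connect("https://example.com/search")
        .userAgent("Mozilla/5.0")                  // pretend to be a browser
        .header("Referer", "https://example.com/") // extra request header
        .cookie("session", "placeholder-value")    // session cookie, if needed
        .data("keyword", "billboard")              // form/query parameter
        .timeout(10_000)                           // 10-second timeout
        .post()                                    // or .get() for a GET request
    println(doc.title())
}
```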

3. The idea of crawling data through an API

Main process:

1. Open the target web page (see the picture below), observe the data in the request headers and so on, and work out the API endpoint and the required parameters.

2. You will find that the response returned by the API is JSON. How do you read it comfortably? Here is a super useful website:

Online JSON Viewer

3. Once you have a URL, don't rush: first use Postman to simulate the request and confirm that the parameters you configured really reach the URL and return the data you want. You can also paste the data Postman returns into the viewer above to inspect its structure, because API responses can be very large and Postman alone is not easy to read.

4. Create the request/response entity classes, then access the API with an Android networking framework.
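The steps above could be wired together roughly as follows. This is only a sketch: the endpoint path, query parameter, and field names are hypothetical and must be matched to what F12 actually shows; it assumes Retrofit and its Gson converter are added as dependencies:

```kotlin
import retrofit2.Retrofit
import retrofit2.converter.gson.GsonConverterFactory
import retrofit2.http.GET
import retrofit2.http.Query

// Entity classes mirroring the JSON structure observed in the response.
// Field names here are hypothetical; match them to the real JSON.
data class SongResponse(val code: Int, val data: List<Song>)
data class Song(val name: String, val artist: String, val url: String)

// Hypothetical endpoint and parameter name, taken from what F12 shows.
interface MusicApi {
    @GET("api/songs")
    suspend fun getSongs(@Query("page") page: Int): SongResponse
}

// Build the API client; baseUrl is a placeholder.
val api: MusicApi = Retrofit.Builder()
    .baseUrl("https://example.com/")
    .addConverterFactory(GsonConverterFactory.create())
    .build()
    .create(MusicApi::class.java)
```

Calling `api.getSongs(1)` from a coroutine would then return the parsed entity directly, with no HTML parsing involved.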

Here I only give the idea, not the concrete implementation. With the idea in hand, the rest goes quickly.

I crawl data only for personal learning. Veterans, remember to keep it that way too, hahaha.