Data crawling: concept and implementation in Java and Python

Data crawling

1. Crawler concept


A web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less commonly used names include ant, automatic indexer, emulator, and worm.

Crawlers in the broad sense: search engines, which acquire data from all websites across the wide-area network.

Crawlers in the narrow sense: tools such as 12306 ticket grabbers, which acquire data from one specific website or one type of website.

The three key steps of a crawler (a minimal end-to-end sketch follows the list):

1. Acquire the website's source code
2. Parse the source code
3. Persist the data
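
The sketch below walks through the three steps, using Jsoup for fetching and parsing and a FileWriter for persistence. The URL, the selector, and the output file name are illustrative placeholders, not taken from the original text.

import java.io.FileWriter;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ThreeStepsSketch {
    public static void main(String[] args) throws IOException {
        // 1. Acquire the source code of the page (placeholder URL)
        Document doc = Jsoup.connect("https://example.com/").get();
        // 2. Parse the source code: here we simply collect the text of every link
        Elements links = doc.select("a");
        // 3. Persist the data to a local file
        try (FileWriter fw = new FileWriter("links.txt")) {
            for (Element link : links) {
                fw.write(link.text() + "\n");
            }
        }
    }
}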

2. HTTP protocol



3. Common crawler techniques

  • Java
URLConnection: provided under the java.net package
HttpClient: provided by Apache (a minimal sketch follows this list)
Jsoup: HTML parser for Java
WebMagic: a scalable crawler framework modeled after Scrapy
Selenium: a web application testing tool; Selenium tests run directly in the browser, just like a real user
  • Python
urllib: basic crawling tool in the standard library
requests + bs4: fetching + parsing
Scrapy: crawler framework for website-level crawling
PySpider: crawler framework with a WebUI (developed in China)
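
Among the Java options, HttpClient is not demonstrated elsewhere in this article, so here is a minimal sketch assuming Apache HttpClient 4.x is on the classpath; the URL is a placeholder.

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpClientSketch {
    public static void main(String[] args) throws Exception {
        // Create a default HttpClient instance
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            // Build a GET request for the target page (placeholder URL)
            HttpGet get = new HttpGet("https://example.com/");
            // Execute the request and read the response body as a String
            try (CloseableHttpResponse response = client.execute(get)) {
                String html = EntityUtils.toString(response.getEntity(), "UTF-8");
                System.out.println(html);
            }
        }
    }
}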

4. The first crawler
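
A minimal sketch of fetching a page's source with java.net.URLConnection; the target URL (Baidu's home page, which also appears later in this article) is just an example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class FirstCrawler {
    public static void main(String[] args) throws Exception {
        // Open a connection to the target page
        URLConnection conn = new URL("http://www.baidu.com").openConnection();
        // Read the response body line by line as UTF-8 text
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}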


The sketch above uses the URLConnection class provided by the JDK to fetch the source code of a website.

5. Data extraction techniques

RE: regular expressions (a short sketch follows this list)
CSS: CSS selectors
XPath: path expressions for querying XML/HTML documents
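
CSS selectors are demonstrated with Jsoup in the next section; regular expressions are not shown elsewhere, so here is a minimal Java sketch. The HTML fragment and the pattern are illustrative only.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractSketch {
    public static void main(String[] args) {
        // Illustrative HTML fragment
        String html = "<html><head><title>Hello Crawler</title></head></html>";
        // Pattern with a capture group for the text between <title> and </title>
        Pattern p = Pattern.compile("<title>(.*?)</title>");
        Matcher m = p.matcher(html);
        if (m.find()) {
            // Prints: Hello Crawler
            System.out.println(m.group(1));
        }
    }
}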

6. Basic use of Jsoup

6.1 Get the source code of the web page according to the URL
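
A minimal sketch of fetching a page's source with Jsoup; the URL (the Discuz forum crawled later in this article) is just an example.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFetchSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the page and download it as a parsed Document
        Document doc = Jsoup.connect("https://www.discuz.net/").get();
        // The full HTML source of the page
        System.out.println(doc.html());
        // Just the page title
        System.out.println(doc.title());
    }
}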

6.2 Basic use of CSS selector

For the specific usage of CSS selectors, refer to the Chinese-language reference below:

https://www.open-open.com/jsoup/selector-syntax.html
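
A minimal sketch of Jsoup's CSS selector API; the list-page URL and the .icn > a selector are taken from the Discuz example later in this article.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupSelectorSketch {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://www.discuz.net/forum-2-1.html").get();
        // select() returns every element matching the selector
        Elements links = doc.select(".icn > a");
        for (Element link : links) {
            // attr() reads an attribute, text() reads the visible text
            System.out.println(link.attr("href") + " -> " + link.text());
        }
        // selectFirst() returns only the first match, or null if there is none
        Element first = doc.selectFirst("title");
        if (first != null) {
            System.out.println(first.text());
        }
    }
}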

7. Crawl Discuz forum data

Crawl all the posts in one section of the Discuz forum (https://www.discuz.net/).

The fields to be crawled are: post id | poster name | poster id | post title | post content.

7.1 Analyze the source code of a specific post

  • The id of the post

  • The poster's name and the poster's id

  • The title of the post

  • The content of the post

7.2 Write code to crawl the content of a single post page

7.3 Get URLs of all posts in the list page

Crawling idea: to collect the URLs of all posts on the current list page, select every a tag under the td elements whose class is icn, i.e. the CSS selector .icn > a.

7.4 Write the code to crawl the URLs of all posts on the list page

7.5 Java code

  • CrawlDiscuz.java
package cn.crawl.demo;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CrawlDiscuz implements Runnable {
    // start page
    private int startPage;
    // end page
    private int endPage;

    public CrawlDiscuz(int startPage, int endPage) {
        this.startPage = startPage;
        this.endPage = endPage;
    }

    /**
     * Crawl the posts of every list page within the given page range
     *
     * @param startPage
     * @param endPage
     * @throws Exception
     */
    public void getPage(int startPage, int endPage) throws Exception {
        // Create a file writer (append mode)
        FileWriter fw = new FileWriter("d:/discuz.txt", true);
        // Wrap it in a buffered writer
        BufferedWriter bw = new BufferedWriter(fw);
        // Create a print stream
        PrintWriter pw = new PrintWriter(bw);
        for (int i = startPage; i <= endPage; i++) {
            Document doc = Jsoup.connect(
                    "https://www.discuz.net/forum-2-" + i + ".html").get();
            // Get all post links on the current list page
            Elements aList = doc.select(".icn > a");
            System.out.println("total: " + aList.size());
            // Process each link
            for (Element e : aList) {
                String url = "https://www.discuz.net/" + e.attr("href");
                System.out.println("get ----> " + url);
                // Crawl the content of the post
                String content = getContent(url);
                // Write the crawled data to the file
                pw.print(content);
                // Write a newline
                pw.println();
            }
            // Flush the stream
            pw.flush();
        }
        // Close the stream
        pw.close();
    }

    /**
     * Crawl the content of a post according to its URL
     * @param url
     * @return
     * @throws IOException
     */
    public static String getContent(String url) throws IOException {
        Document doc = Jsoup.connect(url).get();
        Element a1 = doc.selectFirst(".authi > a");
        // If the author link is present, extract the data
        if (a1 != null) {
            // The author of the post
            String author = a1.text();
            // uid
            String uid = a1.attr("href").split("uid=")[1];
            // Get the a tag that stores the post id
            Element a2 = doc.selectFirst("[title=Print]");
            // Post id
            String tid = a2.attr("href").split("tid=")[1];
            // Post title
            String title = doc.selectFirst(".ts").text();
            // Post content
            String content = doc.selectFirst(".t_f").text();
            // Assemble the data
            StringBuilder sb = new StringBuilder().append(author).append("\t")
                    .append(uid).append("\t").append(tid).append("\t")
                    .append(title).append("\t").append(content).append("\t");
            return sb.toString();
        } else {
            // Otherwise skip the post: some posts require login to view
            return "";
        }
    }

    // The thread's run method
    @Override
    public void run() {
        try {
            // Crawl the list pages assigned to this thread
            getPage(startPage, endPage);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}
  • Test.java
package cn.crawl.demo;

public class Test {
    public static void main(String[] args) throws Exception {
        // Create the threads for crawling data: one thread per 10 list pages
        for (int i = 1; i <= 500; i++) {
            if (i % 10 == 0) {
                new Thread(new CrawlDiscuz(i - 9, i)).start();
            }
        }
    }
}

7.6 Python code

# -*- coding: utf-8 -*-
import time
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool as ThreadPool

# Get the content of the web page for the given url
def getHtml(url):
    text = ""
    try:
        r = requests.get(url)
        r.encoding = r.apparent_encoding
        text = r.text
        print("get url ----> %s" % (url))
    except:
        print("Failed to crawl %s" % (url))
    return text

# Collect the post URLs from the first `page` list pages
def getUrl(page):
    urls = []
    for i in range(1, page):
        url = "https://www.discuz.net/forum-2-%s.html" % (i)
        print("get page ----> %s" % (url))
        soup = BeautifulSoup(getHtml(url), "lxml")
        # Extract the post links
        aList = soup.select(".icn > a")
        for a in aList:
            urls.append("https://www.discuz.net/%s" % (a['href']))
    return urls

# Extract the data from a single post page
def getContent(url):
    # Build a soup object from the page source, using lxml as the parser
    soup = BeautifulSoup(getHtml(url), "lxml")

    # Get the a tag that holds the user name and user id
    a = soup.select_one(".authi > a")
    if a is not None:
        # Post title
        title = soup.select_one("#thread_subject").text
        uid = a['href'].split("uid=")[1]
        uname = a.text
        # Post id, taken from the URL (e.g. thread-<tid>-1-1.html)
        tid = url.split("-")[1]
        write_content("%s,%s,%s,%s" % (tid, title, uid, uname))

# Append a line of data to the output file
def write_content(text):
    f = open("D:\\res.txt", "a", encoding='utf-8')
    f.write(text)
    f.write("\n")
    f.close()


if __name__ == '__main__':
    # Start time
    start = time.time()
    urls = getUrl(500)
    # Create a worker pool with 20 processes (Pool imported as ThreadPool)
    pool = ThreadPool(20)
    pool.map(getContent, urls)
    pool.close()
    # End time
    end = time.time()
    print("Program time: %s" % (end - start))

8. Python crawler

8.1 Python’s first crawler

Crawling in Python mainly uses the Requests library. The official Requests documentation is linked below.

https://requests.readthedocs.io/zh_CN/latest/user/quickstart.html

import requests
 
# Use the get method to get the source code of the website and return a response
r = requests.get("http://www.baidu.com")
print(r.text)

8.2 Use of BeautifulSoup

Parsing web page source code mainly uses the BeautifulSoup class from the bs4 library; the official documentation is linked below.

https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

import requests
from bs4 import BeautifulSoup

# Use the get method to get the source code of the website and return a response
r = requests.get("http://125.221.38.2/")
html = r.text
print(html)
# Create a soup object for parsing html documents
soup = BeautifulSoup(html, "html.parser")
# Get the title tag in the html
title = soup.title
print(title.text)
# Get the body tag in html
body = soup.body
print(body)
  • Use of CSS selectors
import requests
from bs4 import BeautifulSoup
# Use the get method to get the source code of the website and return a response
r = requests.get("http://125.221.38.2/")
html = r.text

# Create a soup object from the html source
soup = BeautifulSoup(html, "html.parser")
# select_one returns only the first element matching the class selector
div = soup.select_one(".message")
# select returns all matching elements: here, every p tag under the div
pList = div.select("p")
# Traverse the p tags (all but the last)
for i in range(0, len(pList) - 1):
    print(pList[i].text)

9. Summary

This article briefly introduced the basic principles of crawlers and how to implement a web crawler in Java and in Python. As the Internet environment becomes more and more regulated, crawling has become a high-risk activity: the party running the crawler may bear legal liability if it harms the target website, while for the crawled website, heavy crawling adds excessive load to its servers and can disrupt its own business systems.