Data crawling
1. Crawler concept
A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less common names include ant, autoindexer, emulator, and worm.
Crawlers in the broad sense: search engines, which harvest data from every website on the web.
Crawlers in the narrow sense: tools such as 12306 ticket grabbers, which harvest data from one specific website or one type of website.
Three key steps of a crawler:
1. Fetch the site's source code
2. Parse the source code
3. Persist the extracted data
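The three steps can be sketched in Java like this (a minimal sketch: the method names, the regex, and the output file name are illustrative, not from the original article):

```java
import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlSteps {

    // Step 1: fetch the page source (here via the JDK's URL/URLConnection)
    public static String fetch(String url) throws Exception {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new URL(url).openConnection().getInputStream(), "UTF-8"))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Step 2: parse the source (here with a regular expression over href attributes)
    public static List<String> parseLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"(.*?)\"").matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Step 3: persist the extracted data, one item per line
    public static void persist(List<String> data, String file) throws Exception {
        try (PrintWriter pw = new PrintWriter(new FileWriter(file, true))) {
            for (String d : data) {
                pw.println(d);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // demonstrate steps 2 and 3 on a sample snippet (step 1 needs a live URL)
        String html = "<a href=\"/a.html\">a</a> <a href=\"/b.html\">b</a>";
        List<String> links = parseLinks(html);
        System.out.println(links); // [/a.html, /b.html]
        persist(links, "links.txt");
    }
}
```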
2. HTTP protocol
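The part of the HTTP protocol a crawler cares about is small: a request is plain text (a request line plus headers, ended by a blank line), and the response begins with a status line. A minimal sketch (the host, path, and helper names are illustrative):

```java
public class HttpSketch {

    // Build the raw text of an HTTP GET request for a given host and path
    public static String buildGet(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"
                + "User-Agent: Mozilla/5.0\r\n" // many sites reject clients without a UA
                + "Connection: close\r\n"
                + "\r\n";                        // blank line ends the header section
    }

    // Extract the numeric status code from a response status line
    public static int statusCode(String statusLine) {
        // e.g. "HTTP/1.1 200 OK" -> 200
        return Integer.parseInt(statusLine.split(" ")[1]);
    }

    public static void main(String[] args) {
        System.out.print(buildGet("www.example.com", "/index.html"));
        System.out.println(statusCode("HTTP/1.1 404 Not Found")); // prints 404
    }
}
```

Libraries such as URLConnection and Requests build and parse this text for you; understanding its shape helps when a site rejects a crawler (e.g. because of a missing User-Agent header).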
3. Common crawler techniques
- Java
| Tool | Description |
|---|---|
| URLConnection | provided by the JDK in the `java.net` package |
| HttpClient | HTTP client library provided by Apache |
| Jsoup | Java HTML parser with CSS-selector support |
| WebMagic | crawler framework whose design is inspired by Scrapy |
| Selenium | web application testing tool; Selenium tests run directly in the browser, just like a real user |
- Python
| Tool | Description |
|---|---|
| urllib | basic crawling toolkit in the standard library |
| requests + bs4 | fetching + parsing |
| Scrapy | full-featured crawler framework for website-level crawling |
| PySpider | crawler framework with a WebUI, developed in China |
4. The first crawler
This first crawler uses the URLConnection class provided by the JDK to fetch a website's source code.
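A minimal sketch of such a URLConnection crawler (the class name and demo URL are illustrative; the original snippet is not reproduced here):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class FirstCrawler {

    // Fetch the raw source behind a URL with the JDK's URLConnection
    public static String fetch(String url) throws Exception {
        URLConnection conn = new URL(url).openConnection();
        // some sites reject the default Java user agent, so set a browser-like one
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // demo on a local file URL so the example runs offline; for a real site,
        // pass an http(s) URL such as "https://www.discuz.net/" instead
        java.nio.file.Path p = java.nio.file.Files.createTempFile("page", ".html");
        java.nio.file.Files.write(p, "<html>hello crawler</html>".getBytes("UTF-8"));
        System.out.println(fetch(p.toUri().toString()));
    }
}
```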
5. Data extraction techniques
| Technique | Description |
|---|---|
| RE | regular expressions |
| CSS | CSS selectors |
| XPath | path expressions for navigating XML/HTML documents |
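As a small illustration of the first and third techniques (CSS selectors are covered with Jsoup in the next section), the sketch below extracts a title from the same snippet with a regular expression and with the JDK's built-in XPath support; the XML snippet itself is made up:

```java
import java.io.ByteArrayInputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;

public class ExtractDemo {

    static final String XML =
            "<post><title>hello</title><author uid=\"42\">tom</author></post>";

    // RE: capture the text between the title tags
    public static String titleByRegex(String xml) {
        Matcher m = Pattern.compile("<title>(.*?)</title>").matcher(xml);
        return m.find() ? m.group(1) : "";
    }

    // XPath: evaluate a path expression against the parsed document
    public static String titleByXpath(String xml) throws Exception {
        org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        return XPathFactory.newInstance().newXPath().evaluate("/post/title", doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titleByRegex(XML)); // hello
        System.out.println(titleByXpath(XML)); // hello
    }
}
```

Regexes work on any text but break easily on nested markup; XPath and CSS selectors operate on the parsed document tree and are more robust for HTML.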
6. Basic use of Jsoup
6.1 Get the source code of the web page according to the URL
6.2 Basic use of CSS selector
The detailed syntax of Jsoup's CSS selectors is documented (in Chinese) at:
https://www.open-open.com/jsoup/selector-syntax.html
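A minimal sketch of both steps with Jsoup (assumes the jsoup dependency is on the classpath; the HTML snippet is made up, and for a live page you would use `Jsoup.connect(url).get()` instead of `Jsoup.parse`):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupDemo {

    public static void main(String[] args) {
        // for a real page: Document doc = Jsoup.connect("https://www.discuz.net/").get();
        String html = "<div class=\"authi\"><a href=\"space-uid-42.html\">tom</a></div>"
                + "<h1 class=\"ts\">hello</h1>";
        Document doc = Jsoup.parse(html);

        // CSS selector: the first a tag directly under an element with class authi
        Element a = doc.selectFirst(".authi > a");
        System.out.println(a.text());       // the tag's text: tom
        System.out.println(a.attr("href")); // the attribute value: space-uid-42.html

        // CSS selector: the element with class ts
        System.out.println(doc.selectFirst(".ts").text()); // hello
    }
}
```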
7. Crawl Discuz forum data
Crawl all the posts in one section of the Discuz forum; the forum page is shown in the figure below.
The fields to crawl are: the post's id, the poster's name, the poster's id, the post's title, and the post's content.
7.1 Analyze the source code of a specific post
- Post id: taken from the `tid=` parameter in the href of the "Print" link (selector `[title=Print]`)
- Poster's name and poster's id: in the `a` tag under the element with class `authi`; the name is the tag's text and the id is the `uid=` parameter of its `href`
- Post title: in the element with class `ts`
- Post content: in the element with class `t_f`
7.2 Write code to crawl the content of a single post page
7.3 Get URLs of all posts in the list page
Crawling idea: to collect the URLs of all posts on the current list page, select every `a` tag under the `td` elements with class `icn`.
7.4 Write the code to crawl the url of all posts on the list page
7.5 Java code
- CrawlDiscuz.java
```java
package cn.crawl.demo;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CrawlDiscuz implements Runnable {

    // first list page to crawl
    private int startPage;
    // last list page to crawl
    private int endPage;

    public CrawlDiscuz(int startPage, int endPage) {
        this.startPage = startPage;
        this.endPage = endPage;
    }

    /**
     * Crawl the posts of every list page within the given page range.
     *
     * @param startPage first page
     * @param endPage   last page
     * @throws Exception
     */
    public void getPage(int startPage, int endPage) throws Exception {
        // open the output file in append mode
        FileWriter fw = new FileWriter("d:/discuz.txt", true);
        // wrap it in a buffered writer
        BufferedWriter bw = new BufferedWriter(fw);
        // wrap it in a print stream
        PrintWriter pw = new PrintWriter(bw);
        for (int i = startPage; i <= endPage; i++) {
            Document doc = Jsoup.connect(
                    "https://www.discuz.net/forum-2-" + i + ".html").get();
            // collect all post links on the current list page
            Elements aList = doc.select(".icn > a");
            System.out.println("total: " + aList.size());
            // visit each link
            for (Element e : aList) {
                String url = "https://www.discuz.net/" + e.attr("href");
                System.out.println("get ----> " + url);
                // crawl the content of the post
                String content = getContent(url);
                // write the crawled data to the file, one post per line
                pw.print(content);
                pw.println();
            }
            // flush the stream after each list page
            pw.flush();
        }
        // close the stream
        pw.close();
    }

    /**
     * Crawl a single post's fields from its URL.
     *
     * @param url the post's URL
     * @return the tab-separated fields, or an empty string if the post cannot be parsed
     * @throws IOException
     */
    public static String getContent(String url) throws IOException {
        Document doc = Jsoup.connect(url).get();
        Element a1 = doc.selectFirst(".authi > a");
        // only proceed if the author link is present
        if (a1 != null) {
            // the poster's name is the link text
            String author = a1.text();
            // the poster's id comes from the uid= parameter of the link
            String uid = a1.attr("href").split("uid=")[1];
            // the a tag holding the post id
            Element a2 = doc.selectFirst("[title=Print]");
            // the post id comes from the tid= parameter
            String tid = a2.attr("href").split("tid=")[1];
            // the post's title
            String title = doc.selectFirst(".ts").text();
            // the post's content
            String content = doc.selectFirst(".t_f").text();
            // assemble the fields, tab-separated
            StringBuilder sb = new StringBuilder().append(author).append("\t")
                    .append(uid).append("\t").append(tid).append("\t")
                    .append(title).append("\t").append(content).append("\t");
            return sb.toString();
        } else {
            // skip posts that cannot be parsed (some require login)
            return "";
        }
    }

    // the thread's run method
    @Override
    public void run() {
        try {
            // crawl the list pages assigned to this thread
            getPage(startPage, endPage);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```
- Test.java
```java
package cn.crawl.demo;

public class Test {
    public static void main(String[] args) throws Exception {
        // spawn one crawler thread per block of 10 pages (pages 1-500)
        for (int i = 1; i <= 500; i++) {
            if (i % 10 == 0) {
                new Thread(new CrawlDiscuz(i - 9, i)).start();
            }
        }
    }
}
```
7.6 Python code
```python
# -*- coding: utf-8 -*-
import time

import requests
from bs4 import BeautifulSoup
# multiprocessing.dummy provides a thread pool with the Pool interface
from multiprocessing.dummy import Pool as ThreadPool


# fetch the content of a web page by url
def getHtml(url):
    text = ""
    try:
        r = requests.get(url)
        r.encoding = r.apparent_encoding
        text = r.text
        print("get url ----> %s" % (url))
    except Exception:
        print("failed to fetch %s" % (url))
    return text


# collect the post urls from the first `page` list pages
def getUrl(page):
    urls = []
    for i in range(1, page + 1):
        url = "https://www.discuz.net/forum-2-%s.html" % (i)
        print("get page ----> %s" % (url))
        soup = BeautifulSoup(getHtml(url), "lxml")
        # extract the post links
        aList = soup.select(".icn > a")
        for a in aList:
            urls.append("https://www.discuz.net/%s" % (a['href']))
    return urls


# crawl the fields of a single post
def getContent(url):
    # build a soup object from the page source, using lxml as the parser
    soup = BeautifulSoup(getHtml(url), "lxml")
    # the a tag holding the poster's name and id
    a = soup.select_one(".authi > a")
    if a is not None:
        # the post's title
        title = soup.select_one("#thread_subject").text
        uid = a['href'].split("uid=")[1]
        uname = a.text
        # the post id is embedded in the url, e.g. thread-<tid>-1-1.html
        tid = url.split("-")[1]
        write_content("%s,%s,%s,%s" % (tid, title, uid, uname))


# append a line of data to the result file
def write_content(text):
    f = open("D:/res.txt", "a", encoding='utf-8')
    f.write(text)
    f.write("\n")
    f.close()


if __name__ == '__main__':
    # start time
    start = time.time()
    urls = getUrl(500)
    # create a pool of 20 worker threads
    pool = ThreadPool(20)
    pool.map(getContent, urls)
    pool.close()
    pool.join()
    # end time
    end = time.time()
    print("Program time: %s" % (end - start))
```
8. Python crawler
8.1 Python’s first crawler
Crawling in Python mainly uses the Requests library; the official Requests documentation is linked below.
https://requests.readthedocs.io/zh_CN/latest/user/quickstart.html
```python
import requests

# use the get method to fetch the site's source; it returns a Response object
r = requests.get("http://www.baidu.com")
print(r.text)
```
8.2 Use of BeautifulSoup
Parsing page source mainly uses the BeautifulSoup module of the bs4 library; the official documentation is linked below.
https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
```python
import requests
from bs4 import BeautifulSoup

# use the get method to fetch the site's source; it returns a Response object
r = requests.get("http://125.221.38.2/")
html = r.text
print(html)

# create a soup object for parsing the html document
soup = BeautifulSoup(html, "html.parser")

# get the title tag of the html
title = soup.title
print(title.text)

# get the body tag of the html
body = soup.body
print(body)
```
- Use of CSS selectors
```python
import requests
from bs4 import BeautifulSoup

# use the get method to fetch the site's source; it returns a Response object
r = requests.get("http://125.221.38.2/")
html = r.text

# build a soup object from the html source
soup = BeautifulSoup(html, "html.parser")

# select_one returns only the first element matching the CSS selector
div = soup.select_one(".message")

# select returns all matching elements: every p tag under the div
pList = div.select("p")

# print the text of the p tags (all but the last one)
for i in range(0, len(pList) - 1):
    print(pList[i].text)
```
9. Summary
This article briefly introduced the basic principles of web crawlers and how to implement them in Java and Python. As the Internet becomes more and more regulated, crawling has turned into a high-risk activity: a crawler that harms the target website may expose its operator to legal liability, while for the crawled website, aggressive crawling places excessive load on the server and disrupts its own business systems.