Data collection in Swift with the Embassy library: an automatic hot-news summary generator

Overview

A crawler is a program that automatically collects data from web pages. Crawlers are used for many purposes, such as search engines, data analysis, and content aggregation. This article introduces how to use the Swift language and the Embassy library to write a simple crawler that collects hot news from a news website and generates a short summary for each story.

Swift language and Embassy library

Swift is a modern, high-performance, safe, and expressive programming language, mainly used to develop applications for platforms such as iOS, macOS, watchOS, and tvOS. Swift can also be used to build server-side applications and command-line tools. It supports multiple programming paradigms, including object-oriented, functional, and protocol-oriented programming, and it provides a powerful error handling mechanism that lets developers deal with failures explicitly.
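
For example, Swift's error handling is based on throwing and catching errors with do/try/catch. The following minimal sketch illustrates the mechanism; the FetchError type and makeURL function are invented for this example:

import Foundation

// A custom error type, defined just for this illustration
enum FetchError: Error {
    case invalidURL
}

// A throwing function: callers must handle or propagate the error with try
func makeURL(from string: String) throws -> URL {
    guard let url = URL(string: string) else {
        throw FetchError.invalidURL
    }
    return url
}

do {
    let url = try makeURL(from: "https://news.sina.com.cn/")
    print("Parsed:", url)
} catch {
    print("Failed:", error)
}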

Embassy is a lightweight asynchronous networking library for Swift. It provides an event loop that can handle many network connections on a single thread, and it is best known as a small asynchronous HTTP server library. This article uses the same event loop to drive an HTTP client that sends requests and receives responses.
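
For orientation, here is a minimal server sketch that follows the example in Embassy's documentation; the port number and response text are arbitrary, and KqueueSelector is the macOS selector (SelectSelector is available on Linux):

import Embassy
import Foundation

// Create the event loop that will multiplex all connections on one thread
let loop = try! SelectorEventLoop(selector: try! KqueueSelector())

// A tiny HTTP server that answers every request with a fixed body
let server = DefaultHTTPServer(eventLoop: loop, port: 8080) {
    (environ: [String: Any],
     startResponse: ((String, [(String, String)]) -> Void),
     sendBody: ((Data) -> Void)) in
    // Send the status line and an empty header list
    startResponse("200 OK", [])
    // Send the body, then an empty Data to signal the end of the response
    sendBody(Data("Hello from Embassy".utf8))
    sendBody(Data())
}

// Start listening and run the event loop (this call blocks)
try! server.start()
loop.runForever()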

Design and implementation of crawler program

This article uses the Swift language and the Embassy library to write a crawler program that collects hot news from the Sina news website and generates a short summary for each story. The program is designed and implemented as follows:

  • First, create an event loop to handle network requests and responses.
  • Next, create an HTTP client for sending HTTP requests and receiving HTTP responses.
  • Then, configure a crawler proxy that draws a random IP address from a proxy IP pool, to avoid being blocked by the target website.
  • Then, create a URL queue to store the URL addresses to be crawled.
  • Next, create a parser that parses the HTML document and extracts the news title, link, time, and content.
  • Then, create a generator that produces a short news summary from the news content.
  • Finally, create a main function that schedules an HTTP request for each URL in the queue, processes the HTTP responses, and starts the event loop.

The following is the code implementation of the program, with explanatory comments:

// Import the libraries we need (Foundation for NSRegularExpression, Data, etc.)
import Embassy
import Foundation

// Create an event loop (KqueueSelector is the macOS selector; use SelectSelector on Linux)
let loop = try SelectorEventLoop(selector: try KqueueSelector())

// Create an HTTP client
// Note: Embassy is primarily an HTTP *server* library; the client interface used
// below (DefaultHTTPClient and its request/proxy parameters) follows the original
// article and should be treated as illustrative rather than a documented API
let httpClient = DefaultHTTPClient(eventLoop: loop)

// Create a URL queue
let urlQueue = [
    "https://news.sina.com.cn/",
    // ...
]

// Create a parser
func parse(html: String) -> (title: String, link: String, time: String, content: String)? {
    // Use a regular expression to extract the news title, link, time, and content
    // from the HTML document; return a tuple on success and nil on failure.
    // This is just an example: real pages usually need more robust parsing logic.
    let pattern = #"<h1><a href="(.*?)".*?>(.*?)</a></h1>.*?<span class="time">(.*?)</span>.*?<p class="content">(.*?)</p>"#
    // .dotMatchesLineSeparators lets .*? match across line breaks in the HTML
    let regex = try? NSRegularExpression(pattern: pattern, options: [.dotMatchesLineSeparators])
    // NSRegularExpression works on UTF-16 ranges, so use NSString's length
    let range = NSRange(location: 0, length: (html as NSString).length)
    if let match = regex?.firstMatch(in: html, options: [], range: range) {
        let link = (html as NSString).substring(with: match.range(at: 1))
        let title = (html as NSString).substring(with: match.range(at: 2))
        let time = (html as NSString).substring(with: match.range(at: 3))
        let content = (html as NSString).substring(with: match.range(at: 4))
        return (title, link, time, content)
    } else {
        return nil
    }
}
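
// A quick sanity check of parse(html:) on a hypothetical snippet (this markup
// is invented for illustration; real Sina pages are more complex):
//
//   let sample = """
//   <h1><a href="https://example.com/news/1">Example headline</a></h1>
//   <span class="time">2024-01-01 08:00</span>
//   <p class="content">First sentence. Second sentence. Third sentence. Fourth sentence.</p>
//   """
//   parse(html: sample)
//   // -> ("Example headline", "https://example.com/news/1", "2024-01-01 08:00",
//   //     "First sentence. Second sentence. Third sentence. Fourth sentence.")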

// Create a generator
func generate(content: String) -> String {
    // Produce a simple news summary from the news content.
    // This is just an example: a real summarizer might use natural language
    // processing. The simple rule used here takes the first three sentences
    // of the content as the summary.
    let sentences = content.components(separatedBy: ".")
    if sentences.count >= 3 {
        return sentences[0...2].joined(separator: ".") + "."
    } else {
        return content
    }
}
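
// For example (hypothetical input):
//   generate(content: "One. Two. Three. Four.") // -> "One. Two. Three."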

// Create a main function
func main() {
    // Take each URL address from the URL queue and schedule a request for it
    for url in urlQueue {
        // Use a proxy from a proxy IP pool. The values below are placeholders:
        // substitute the domain, port, username, and password issued by your
        // proxy provider (the original article uses Yiniu Cloud's crawler
        // proxy, which requires registering on its official website)
        let proxy = "http://<username>:<password>@<proxy-domain>:<port>"
        // Send the HTTP request and process the HTTP response
        httpClient.request(
            method: "GET",
            url: url,
            headers: ["User-Agent": "Mozilla/5.0"],
            proxyURLString: proxy,
            body: nil
        ) { response, error in
            if let error = error {
                print(error)
            } else if let response = response {
                print("Status code:", response.statusCode)
                print("Headers:", response.headers)
                var data = Data()
                // Drain the response body chunk by chunk; a nil chunk with no
                // error signals the end of the body
                response.body.drain { chunk, error in
                    if let chunk = chunk {
                        data.append(chunk)
                    } else if let error = error {
                        print(error)
                    } else {
                        // Convert the accumulated data to a string
                        if let html = String(data: data, encoding: .utf8) {
                            // Call the parser to extract the news information
                            if let news = parse(html: html) {
                                print("Title:", news.title)
                                print("Link:", news.link)
                                print("Time:", news.time)
                                print("Content:", news.content)
                                // Call the generator to produce a short summary
                                let summary = generate(content: news.content)
                                print("Summary:", summary)
                            } else {
                                print("Failed to parse HTML")
                            }
                        } else {
                            print("Failed to convert data to string")
                        }
                    }
                }
            } else {
                print("No response")
            }
        }
    }

    // Start the event loop last: runForever() blocks the current thread, so the
    // requests above must be scheduled before it is called
    loop.runForever()
}

// Call the main function
main()
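
To build and run the program with the Swift Package Manager, a manifest along the following lines can be used. This is a minimal sketch: the package name is arbitrary, and the Embassy version requirement is an assumption that should be adjusted to the current release:

// swift-tools-version:5.5
import PackageDescription

let package = Package(
    name: "NewsCrawler",  // arbitrary name for this example
    dependencies: [
        // Embassy's repository on GitHub; pin the version you actually test against
        .package(url: "https://github.com/envoy/Embassy.git", from: "4.0.0"),
    ],
    targets: [
        .executableTarget(
            name: "NewsCrawler",
            dependencies: ["Embassy"]
        ),
    ]
)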

Conclusion

This article introduced how to use the Swift language and the Embassy library to write a simple crawler that collects hot news from a news website and generates a short summary, and it walked through a commented implementation of the program. If you are interested in crawler technology, you can continue to study and explore it in depth.