Scrapy + ECharts: automatically crawl and generate a network security word cloud

For work reasons, I have recently started following some security news websites: partly to keep up with industry security news and improve my own security knowledge, and partly to collect vulnerability intelligence from the various security sites.
As a novice in security intelligence, I still feel somewhat overwhelmed by the sheer volume of security news. With a free weekend on my hands, I had an idea: build a crawler to pull down network security articles, analyze them with machine learning to automatically extract each article's main keywords, and then pick which articles to read based on actual needs. Reading only the relevant articles would save a lot of time.
And once the keywords are extracted, the keywords from recent articles can also be used to summarize the overall security situation and public sentiment, which seems quite workable.

Overall idea

As mentioned before, the idea is actually very simple:

  1. Use Scrapy to crawl article titles and content from the security news website.

  2. Segment the article content into words.

  3. Extract keywords with the TF-IDF algorithm.

  4. Save the keywords to the database.

  5. Visualize the keywords that have appeared most frequently recently.

It doesn't look too difficult. There is a link to the code at the end of the article.

Scrapy crawler

Scrapy is a very commonly used Python crawler framework; writing crawlers on top of it saves a lot of code and time. Its internals will not be covered here; interested readers can study Scrapy on their own. Only the architecture diagram is shown below.

Scrapy architecture

Install Scrapy

I installed Scrapy on Python 3.6, so the prerequisite is that your machine has a Python 3 environment. Installing Scrapy is very simple: it takes a single pip command.

pip3 install scrapy

After installation, readers who are not yet familiar with Scrapy can start with the official sample project. Run the following command in a terminal to generate it.

scrapy startproject tutorial

This creates a complete sample project named tutorial in the current directory, with the following directory structure:

tutorial/
    scrapy.cfg # deploy configuration file
    tutorial/ # project's Python module, you'll import your code from here
        __init__.py
        items.py # project items definition file
        pipelines.py # project pipelines file
        settings.py # project settings file
        spiders/ # a directory where you'll later put your spiders
            __init__.py

Analyze web pages

This example uses the “E Security” (easyaq.com) website. The security news it publishes is of decent quality and is updated every day. A quick look at the site structure shows more than ten security news categories in the navigation bar. Clicking through, each category URL is roughly https://www.easyaq.com/type/*.shtml, and each category page lists a number of related articles and their links. The idea is now clear: first traverse these article categories, then dynamically collect the article links under each category, and finally visit each article link and save its content. Let's walk through the main code.

Crawling web pages

The main spider code is shown below; a crawler built on the Scrapy framework ends up very compact.

import scrapy
from scrapy import Request, Selector
from sec_news_scrapy.items import SecNewsItem

class SecNewsSpider(scrapy.Spider):
    name = "security"
    allowed_domains = ["easyaq.com"]
    # enumerate the category pages https://www.easyaq.com/type/2.shtml ... /type/16.shtml
    start_urls = []
    for i in range(2, 17):
        req_url = 'https://www.easyaq.com/type/%s.shtml' % i
        start_urls.append(req_url)

    def parse(self, response):
        topics = []
        for sel in response.xpath('//*[@id="infocat"]/div[@class="listnews bt"]/div[@class="listdeteal"]/h3/a'):
            topic = {'title': sel.xpath('text()').extract(), 'link': sel.xpath('@href').extract()}
            topics.append(topic)

        for topic in topics:
            yield Request(url=topic['link'][0], meta={'topic': topic}, dont_filter=False, callback=self.parse_page)

    def parse_page(self, response):
        topic = response.meta['topic']
        selector = Selector(response)

        item = SecNewsItem()
        item['title'] = selector.xpath("//div[@class='article_tittle']/div[@class='inner']/h1/text()").extract()
        item['content'] = "".join(selector.xpath('//div[@class="content-text"]/p/text()').extract())
        item['uri'] = topic['link'][0]
        print('Finish scan title:' + item['title'][0])
        yield item

We enumerate the URLs of all the categories on the site and put them in start_urls. parse() is the entry point where the framework hands work to our crawler: Scrapy automatically requests every page listed in start_urls and returns a response object, from which the useful information can be extracted with XPath.
Here we need to pull each article's title and URI out of the HTML of every category page. Google Chrome provides a good XPath generation tool for quickly extracting the XPath of a target element: press F12 in the browser to view the page's HTML source, locate the content to be extracted, then right-click it and copy its XPath.
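
Before wiring an XPath into the spider, it can help to verify it interactively with Scrapy's built-in shell. A quick check against one of the category pages, reusing the same XPath as in parse() above, might look like this:

# start an interactive shell against one category page
scrapy shell "https://www.easyaq.com/type/2.shtml"

# at the >>> prompt, test the XPath and inspect the extracted article links
response.xpath('//*[@id="infocat"]/div[@class="listnews bt"]/div[@class="listdeteal"]/h3/a/@href').extract()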

Obtaining the article's URI is not the end of the job: we still need to visit that URI and record the article content for further analysis. The parse_page function handles this content extraction in the same way as above; Chrome's XPath tool again makes quick work of locating the article body.
Once extracted, the content is stored in an Item. Item is another component of the Scrapy framework; it behaves much like a dictionary and is mainly used to define the format of the data being passed along, in this case to the next step, data persistence.

Data persistence

items.py

import scrapy

class SecNewsItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    uri = scrapy.Field()

pipelines.py

import jieba
import jieba.analyse
import pymysql
import re

def dbHandle():
    conn = pymysql.connect(
        host="localhost",
        user="root",
        passwd="1234",
        charset="utf8",
        db='secnews',
        port=3306)
    return conn

def is_figure(s):
    # filter out keywords that are purely numeric
    value = re.compile(r'^\d+$')
    if value.match(s):
        return True
    else:
        return False

def save_key_word(item):
    words = jieba.analyse.extract_tags(item['content'], topK=50, withWeight=True)

    conn = dbHandle()
    cursor = conn.cursor()
    sql = "insert ignore into t_security_news_words(title, `key`, val) values (%s,%s,%s)"
    try:
        for word in words:
            if is_figure(word[0]):
                continue
            cursor.execute(sql, (item['title'][0], word[0], int(word[1] * 1000)))
        cursor.connection.commit()
    except BaseException as e:
        print("Storage error", e, "<<<<<<The reason is here")
        conn.rollback()

def save_article(item):
    conn = dbHandle()
    cursor = conn.cursor()
    sql = "insert ignore into t_security_news_article(title, content, uri) values (%s,%s,%s)"
    try:
        cursor.execute(sql, (item['title'][0], item['content'], item['uri']))
        cursor.connection.commit()
    except BaseException as e:
        print("Storage error", e, "<<<<<<The reason is here")
        conn.rollback()

class TutorialPipeline(object):
    def process_item(self, item, spider):
        save_key_word(item)
        save_article(item)
        return item
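
The two MySQL tables used by the pipeline are assumed to already exist. The article never shows their schema, so the following is only a sketch that matches the columns referenced above; the column types and the unique keys (which give insert ignore its deduplicating effect) are assumptions:

# Assumed schema sketch; reuses the dbHandle() helper defined above
DDL = [
    """
    CREATE TABLE IF NOT EXISTS t_security_news_words (
        title VARCHAR(255) NOT NULL,
        `key` VARCHAR(64)  NOT NULL,
        val   INT          NOT NULL,
        UNIQUE KEY uk_title_key (title(100), `key`)
    ) DEFAULT CHARSET=utf8
    """,
    """
    CREATE TABLE IF NOT EXISTS t_security_news_article (
        title   VARCHAR(255) NOT NULL,
        content TEXT,
        uri     VARCHAR(512) NOT NULL,
        UNIQUE KEY uk_title (title)
    ) DEFAULT CHARSET=utf8
    """,
]

def init_tables():
    conn = dbHandle()
    try:
        with conn.cursor() as cursor:
            for stmt in DDL:
                cursor.execute(stmt)
        conn.commit()
    finally:
        conn.close()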

settings.py

ITEM_PIPELINES = {
    'sec_news_scrapy.pipelines.TutorialPipeline': 300,
}
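
With the pipeline registered in settings.py, the spider can be started from the project root (the directory containing scrapy.cfg); security is the spider name defined earlier:

scrapy crawl security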

The items collected by the spider are handed to this pipeline. Two things happen here: save_key_word and save_article. The latter simply stores the article's title, content, and uri in a MySQL table; the former, save_key_word, is the function we focus on.
Our goal is to automatically extract the keywords related to each article's topic and compute a weight for each word. Specifically, this involves the following steps:

  1. Word segmentation: there are many Chinese word segmentation tools; here I use jieba.

  2. Keyword extraction: jieba already implements the TF-IDF algorithm. We use it to pick the top 50 words from each article along with their weights. Extracting keywords this way also filters out common filler words directly. Jieba additionally supports custom stop words (a short sketch appears below).

words = jieba.analyse.extract_tags(item['content'], topK=50, withWeight=True)

Extract keywords

  3. Data storage: once the required information has been extracted, the next step is to save it to MySQL. Under Python 3 you can use pymysql to operate MySQL.

Article list

Keyword list
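
As noted in step 2, jieba also supports a custom stop-word list. A minimal sketch, where stopwords.txt is a placeholder file with one stop word per line:

import jieba.analyse

# stopwords.txt is an assumed local file: one stop word per line, UTF-8 encoded
jieba.analyse.set_stop_words('stopwords.txt')

# later calls to extract_tags (as in the pipeline above) will skip those words
article_text = '...'  # stands in for the scraped article content (item['content'])
words = jieba.analyse.extract_tags(article_text, topK=50, withWeight=True)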

Keyword visualization: word cloud

With the program above, we can crawl every security news article on the site into the database and extract 50 keywords from each one. Next we want to display these keywords visually and highlight the high-frequency ones, so a word cloud is the natural choice.
Here we use the echarts-wordcloud component provided by ECharts. The approach is very simple: aggregate the statistics from the MySQL keyword table, generate the key-value string, and substitute it directly into the HTML page with a regular expression. A more elegant method would be to fetch the data from the DB via Ajax, but I took a shortcut here.

def get_key_word_from_db():
    words = {}
    conn = dbHandle()
    try:
        with conn.cursor() as cursor:
            cursor.execute(
                "select `key`, sum(val) as s from t_security_news_words group by `key` order by s desc limit 300")
            for res in cursor.fetchall():
                words[res[0]] = int(res[1])
        return words
    except BaseException as e:
        print("Storage error", e, "<<<<<<The reason is here")
        conn.rollback()
        return {}
    finally:
        conn.close()
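
The substitution into the HTML page is described but not shown. A minimal sketch of one way to do it, reusing get_key_word_from_db() and assuming a hypothetical template file wordcloud.html whose echarts-wordcloud option contains a /*WORD_DATA*/ placeholder (the actual template and regular expression used in the project may differ):

import json

def render_word_cloud(template_path='wordcloud.html', output_path='wordcloud_out.html'):
    # echarts-wordcloud expects its data as a list of {name, value} objects
    words = get_key_word_from_db()
    data = json.dumps([{'name': k, 'value': v} for k, v in words.items()], ensure_ascii=False)
    with open(template_path, encoding='utf-8') as f:
        html = f.read()
    # swap the assumed placeholder for the generated key-value data
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(html.replace('/*WORD_DATA*/', data))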

Click here to view the dynamic effect. The word cloud sizes each word according to its frequency or weight of occurrence: the higher the frequency, the larger the font. From this we can get a rough sense of some current security trends in the industry. Of course, this is just an example.

Word cloud visualization

Debugging tips

There are many IDE options for Python; I chose PyCharm. A Scrapy program has to be started through the Scrapy engine, so it cannot be debugged with PyCharm's default run configuration; some settings are needed, as shown in the figure below.
Run -> Edit Configurations
Script: the location of cmdline.py in the Scrapy installation directory; Script parameters: the arguments used when running scrapy, where security is the name of our spider; Working directory: the root directory of the crawler project.


After configuration, you can start debugging directly from PyCharm via Run -> Debug ‘xxx’.
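
An alternative that avoids editing the run configuration is a small launcher script at the project root (run.py is an arbitrary name); PyCharm can then debug it like any ordinary Python file:

# run.py -- place next to scrapy.cfg and debug it directly
from scrapy.cmdline import execute

# equivalent to running "scrapy crawl security" on the command line
execute(['scrapy', 'crawl', 'security'])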

For the complete code, including the ECharts part, please see GitHub.

Author: Huawei Cloud Enjoyment Expert Chrysanthemum Tea