The average query speed is increased by 700%. Qi’anxin upgraded the log security analysis system based on Apache Doris.

Introduction to this article:

The arrival of the digital intelligence era has made network security an important area that cannot be ignored. As a leading network security solution leader, Qi’anxin is committed to providing enterprises with advanced and comprehensive network security protection. Its log analysis system plays a key role in network security. Through in-depth analysis of operational log data, it can detect vulnerabilities and abnormal behaviors to generate key insights to help enterprises establish effective defense strategies. This article will deeply explore the key advantages of QiAnxin in network security and log analysis solutions, and understand how the new integrated log storage and analysis platform built on Apache Doris can monitor and analyze log events in real time, strengthen the tracking and response to suspicious activities, and improve the system. Security and responsiveness.

Author|Qi Anxin Server Technical Expert Shu Peng

QiAnXin is the leader in China’s enterprise-level network security market, focusing on providing next-generation network security products and services to government and enterprise users. At present, the core product Tianqing Terminal Security System has been deployed by 40 million government and enterprise users in China, has deployed more than 1 million servers nationwide, and serves more than 400,000 large institutions. As a national network security team, Qi’anxin is determined to build a safe cyberspace for the country and continues to lead in technology research and development in the fields of terminal security, cloud security, threat intelligence, and situational awareness.

As the digital transformation of modern enterprises continues to deepen, the widespread application of innovative technologies such as big data, the Internet of Things, and 5G has accelerated the pace of digital transformation of enterprises. This has caused the original network boundaries to be broken, and terminal devices from multiple sources have become a new Security boundaries.

The defense performance of network security systems is inseparable from log analysis. When network devices, operating systems, and applications are running, a large number of operating logs will be generated, which contain rich data value. Maximizing the use of operation log data can effectively detect security risks of internal systems, restore attack paths, trace back attack entrances, etc., which can further improve system security and ensure corporate network security. Therefore, the log analysis system plays an indispensable role in it. .

This article will introduce Qi Anxin’s practical experience in network security scenarios, carrying out architecture upgrade iterations and building a new integrated log storage and analysis platform based on Apache Doris.

Early architecture pain points and needs

The architecture of the security log platform is shown in the figure below. The original device and system logs first go through the business processing link, including operations such as normalization and dimension expansion. These processing steps are designed to convert logs from different devices and systems into secure logs in semi-structured JSON format and write them to a Kafka message queue.

The latest logs will be written to the real-time data warehouse. Security analysts can interactively query the latest data in the real-time data warehouse through the analysis platform to conduct security analysis such as attack analysis and traceability. In addition, the offline data warehouse is used to save historical data to support offline analysis of long-term data mining.

WechatIMG494.jpg

In the above log data platform, the writing speed and query analysis efficiency of log data are crucial for upper-level business personnel to monitor and analyze security events in real time. This is also the main pain point we currently face.

On the one hand, the security log data produced every day reaches hundreds of billions, which puts great pressure on writing. Initially, we chose to use a Fork version of Apache Doris to store log data. However, in actual applications, as the amount of new logs added every day continued to grow, the storage speed gradually decreased, the cluster writing pressure was too high, and the data backlog was serious during peak periods. , which has a great impact on the stability of the cluster, and when the data pressure is high, the query efficiency cannot be guaranteed to be effective. Subsequently, we expanded the cluster multiple times, gradually expanding from 3 nodes to 13 nodes. Although the machine cost has greatly exceeded expectations, the write efficiency has not substantially improved.

On the other hand, business personnel often need to perform keyword matching on text fields (such as URL, payload, etc.) when analyzing security logs. In the original system, full scanning and brute-force matching could only be performed through SQL LIKE. The overall query performance was poor. Querying hundreds of billions of data tables took close to minutes or even hundreds of seconds. Even after filtering a large amount of data according to time intervals, the query It still takes several seconds to tens of seconds. Once concurrent queries are encountered, the performance will further deteriorate, making it difficult to meet the needs of daily security analysis.

In addition to writing and query efficiency, operation and maintenance monitoring is also one of our pain points. The visual operation and maintenance system provided by this vendor requires commercial license authorization, which is not friendly to users in the open source community, and cluster maintenance is in an original manual state.

Thinking about architecture selection and upgrade

In order to solve the pain points of past versions and meet the requirements for more efficient and real-time log analysis, we urgently need to upgrade the early system. At the same time, facing security log analysis scenarios, we have also put forward higher requirements for the architecture of the new log analysis platform:

  • Writing performance: On the one hand, the system needs to support real-time writing and storage of massive virus killing events and other data to meet the timeliness requirements of analysis; on the other hand, the system needs to support rich data based on the log data Schema Free feature Type writing and changes.

  • Query performance: Since log query analysis involves full-text retrieval of text types, JSON data, and range queries of dates or ordinary values, the system needs to provide fuzzy query capabilities for strings and also needs to support flexible Create indexes with rich types to speed up the filtering of massive data and improve query efficiency.

  • Storage cost: The device generates a large amount of log data every day. In order to mine this valuable log information, business personnel also need to filter and analyze the data and trace back abnormal logs, which makes log storage The scale is large and the storage period is relatively long, so cost-effective storage cost is also one of the goals of system construction.

  • Operation and maintenance costs: The ease of operation and maintenance of the system itself and whether it has appropriate management and control tools can help us further improve efficiency.

In the process of continuing to pay attention to the industry’s OLAP database, we found that Apache Doris has developed very rapidly in the past year. The latest version 2.0 also takes log storage and retrieval analysis as a new focus, launching inverted index, NGram BloomFilter index, etc. Features, the performance of keyword retrieval and LIKE text matching has been greatly improved, which is very consistent with our pain point of slow text retrieval, so we started the upgrade journey of the new architecture.

Architecture upgrade journey

As mentioned above, in the overall architecture selection process, our main concerns include writing performance, query performance, data storage cost, operation and maintenance cost, etc. During the architecture upgrade process, we chose the latest version 2.0 of Apache Doris at that time. The specific upgrade benefits are as follows.

01 Writing performance improved by over 200%

In order to evaluate the extreme writing performance of Apache Doris, we initially used 3 servers with the same configuration as the online system to access real online writing traffic from Kafka. During the test, when the CPU writing efficiency reached 100%, the writing The throughput reached 1.08 million items/s and 1.15 GB/s, and the visibility delay of written data was maintained at the second level.

The original system cluster running online has a scale of 13 units. Under the same data writing situation, the CPU utilization is about 30% and the write throughput is only 300,000 items/s. In addition, the CPU load is high during peak periods and the system response is slow. The problem.

Based on the test results, we estimate that after the architectureis replaced with Apache Doris, the same 30% CPU usage will be maintained, only 3 servers will be needed to meet the writing requirements, and the machine resource cost will be saved by at least 70%. It is worth noting that in the test, the inverted index was enabled for half of the fields in the Apache Doris table. If the inverted index was not enabled, the write performance could be improved by about 50% based on the previous level.

02 Storage costs reduced by nearly 40%

After seeing the significant improvement in write performance, Apache Doris storage space usage also brought us surprises. With the inverted index turned on, the storage space is slightly lower than the original system without the inverted index, and the compression ratio is increased from 1: 4.3 to 1: 5.7.

By comparing the file sizes stored on disk by Apache Doris, the index file (.idx) and data file (.dat) sizes of the same data are almost the same. In other words, the data expansion rate of Doris is about 1 times after adding indexes. Compared with the 3-5 times expansion rate of many databases and retrieval engines, Doris’s data storage space usage is relatively low. After research, it was found that Apache Doris uses columnar storage and ZSTD compression algorithm to optimize storage space usage. Doris stores both the original data and the inverted index in the form of columns, so that data in the same column is stored in adjacent locations, thereby achieving a higher compression rate.

ZSTD is an excellent new compression algorithm that uses an intelligent optimization algorithm. Compared with the common GZIP algorithm, ZSTD has a higher compression rate and faster decompression speed, especially when processing log scenarios.

03 Query performance improved by an average of 690%

For the query performance that the business is most concerned about, we analyzed 79 SQL statements after deduplication from the online query logs, and compared and tested Apache Doris and The query of the original system takes time.

We found that compared with the original system, all query statements have been significantly improved, the overall query performance has increased by nearly 7 times, and the performance of 26 SQL query statements has increased by more than 10 times, of which 8 SQL queries The improvement is 10-20 times, 14 SQL queries are improved 20-50 times, and 4 SQL queries are improved more than 50 times. The SQL query statement with the largest difference is Q43. The execution time in the original system is close to one minute, but in Apache Doris it takes less than 1 second. The performance difference is as high as 88 times.

WechatIMG495.jpg

For queries with high performance improvement, we conducted a comparative analysis and found several common points:

Inverted index accelerates keyword search: Q23, Q24, Q30, Q31, Q42, Q43, Q50, etc.

1 -- For example, q43 increased by 88.2 times
2 
3 SELECT count() from table2
4 WHERE ( event_time >= 1693065600000 and event_time < 1693152000000)
5 AND (rule_hit_big MATCH 'xxxx');

This technology of keyword retrieval based on inverted index has significant advantages over basic text matching after violent scanning. On the one hand, it greatly reduces the amount of data that needs to be read; on the other hand, during the query process There is no need to perform text matching operations, so query efficiency is often improved by an order of magnitude or even higher.

WechatIMG496.jpg

NGram BloomFilter index acceleration of LIKE: Q75, Q76, Q77, Q78, etc.

1 -- For example, q75 increases by 44.4 times
2 
3 SELECT * FROM table1
4 WHERE ent_id = 'xxxxx'
5 AND event_date = '2023-08-27'
6 AND file_level = 70
7 AND rule_group_id LIKE 'adid:%'
8 ORDER BY event_time LIMIT 100;

For scenarios where the search is not a complete keyword, LIKE is still a useful query method. Apache Doris’s NGram BloomFilter index can speed up regular LIKE.

NGram BloomFilter index is different from ordinary BloomFilter index. Instead of putting the entire text into BloomFilter, it divides the text into consecutive substrings, each substring is of length n, and puts them into NGram BloomFilter. For the query of cola LIKE '%pattern%', divide 'pattern' into substrings of length n in the same way, and determine whether each substring exists in BloomFilter. , if a substring does not exist, it means that there is no data block matching 'pattern' in the data block corresponding to BloomFilter, so the query can be accelerated by skipping the data block scanning step.

Query optimization of the latest TopN logs that meet the conditions: Q19-Q29, etc.

1 -- For example, q22, increased by 50.3 times
2 
3 SELECT * FROM table1
4 where event_date = '2023-08-27' and file_level = 70
5 and ent_id = 'nnnnnnn' and file_name = 'xxx.exe'
6 order by event_time limit 100;

This kind of SELECT * FROM t WHERE xxx ORDER BY xx LIMIT n query is used very frequently when looking for the latest n logs that meet certain conditions. Apache Doris has carried out a SQL query mode for this kind of query. Special optimization determines the dynamic range of the sorting field based on the intermediate state of the query, and uses automatic dynamic predicate pushdown to avoid reading all the data to sort and get TopN, thereby reducing the amount of data that needs to be read (sometimes even by an order of magnitude) ), thereby improving query efficiency.

04 Visual operation and maintenance control and visual query WebUI, minimizing the cost of operation and maintenance and exploration and analysis

In order to improve the efficiency of daily cluster maintenance, we use Feilun Technology’s free and open visual cluster management tool Cluster Manager for Apache Doris (hereinafter referred to as Doris Manager). The functions provided by Doris Manager can meet the needs of cluster monitoring, inspection, configuration modification, expansion and contraction, upgrade and other operations in daily operation and maintenance, reducing the trouble of manual operation of logging in to the machine and the risk of misoperation.

WechatIMG497.jpg

In addition to managing and controlling the Apache Doris cluster, Doris Manager also integrates a visual log exploration and analysis WebUI similar to Kibana, which is very friendly to users accustomed to ELK log analysis. It supports keyword retrieval, trend chart display, trend chart drag and drop date range, and detailed log flattening. Interactive and convenient exploratory analysis such as paving and folding display, field value filtering, etc. is very consistent with the analysis needs of log scenario exploration and drill-down.

WechatIMG498.jpg

Summary and Planning

Following the release of Apache Doris 2.0-alpha, 2.0-beta, and 2.0 official versions, we conducted detailed evaluations based on business scenarios and provided feedback to the community on many optimization suggestions, which received positive responses and solutions. After a month of trial operation of the system, we officially used version 2.0.1 in the production environment, replaced the original system cluster, completed the architecture upgrade, and achieved various benefits such as writing performance, query performance, storage cost, operation and maintenance cost, etc. :

  • Writing performance increased by more than 3 times: Currently, QiAnXin’s log analysis platform has an average of hundreds of billions of new security log data every day. Through Doris’ Routine Load, the data can be stably written into the database in real time. , ensuring low-latency and high-throughput writing of data.

  • Query performance has improved by an average of 7 times: Query response time has been significantly reduced, with an average improvement of 7 times compared with previous query efficiency. Among them, the full-text retrieval speed that the business is particularly concerned about has increased by more than 20 times, helping logs Analytics and cybersecurity operational efficiency.

  • Efficient and convenient visual management: The Cluster Manager for Apache Doris tool provides a visual cluster monitoring and alarm platform to meet a series of operations such as daily cluster monitoring. At the same time, the various functions of WebUI provide analysts with simple operation and easy use. Convenient interactive analysis. All in all, Doris’ ease of use and flexibility greatly reduce the learning and usage costs for development, operation and maintenance, and analysts.

In the future, we will also explore more capabilities of Apache Doris in log analysis scenarios. We will expand the related applications of the JSON data type and strengthen the system’s ability to conduct in-depth analysis of semi-structured data. At the same time, we are alsolooking forward to the new Variant variable data type in Apache Doris 2.1, which supports the storage of JSON data of any structure, supports changes in the number and type of fields, and allows business personnel to flexibly define special characters to change Well implement the analysis needs of semi-structured data Schema Free.

We are very grateful to the SelectDB team for their continued technical support, which has helped QiAnxin move towards “systematic defense and digital operation” network log security management, helping customers accurately identify, protect and supervise network equipment and various systems, ensuring that business personnel can operate at any time. Safe, reliable, and stable access to data and business can be achieved at all times.

Finally, we will continue to participate in the construction of the Apache Doris community and contribute relevant results back to the community. We hope that Apache Doris will develop rapidly and get better and better!

The knowledge points of the article match the official knowledge files, and you can further learn relevant knowledge. MySQL entry skill treeMySQL logQuery log 77309 people are learning the system