Large-scale service governance with etcd in practice


Introduction: Service governance is receiving more and more attention in enterprise system construction, especially as technologies such as cloud native and microservices are adopted by more enterprises. This article summarizes the Baidu Mini Program Team's practical experience in large-scale service governance, combined with etcd, a popular distributed open-source KV product. It analyzes in depth the implementation principles of etcd's two core technologies, Raft and boltdb, and shares real hands-on experience in service governance, in the hope of helping readers on the road to service governance.

1. Introduction to the concept of service governance

Service governance is a part of IT governance. It focuses on relevant elements in the service life cycle. Its key links include service registration and discovery, smooth service upgrade, traffic monitoring, traffic control, fault location, security, etc.

Services need “governance”, but governance comes at a cost. If a service has simple business logic and a clear operating process, and problems can be located and rolled back promptly, the cost of governance may be very low — manual handling may even suffice. In complex businesses, however, service providers and consumers may run in different processes (even on different physical nodes) and be developed and maintained by different teams. Both team collaboration and service collaboration require a great deal of coordination, and the more coordination work there is, the higher the complexity — hence the demand for service governance. Building a unified service governance platform can effectively improve a business's governance capability: standardizing collaboration, monitoring in real time, continuously optimizing the efficiency of the call chain, and helping to reduce dependency complexity and avoid risk. In large-scale business systems, service governance is an indispensable part of the technical architecture and one of the most important pieces of infrastructure of the entire business system.

[Image: Netflix service topology]

The picture above is the Netflix service topology map that circulates on the Internet. The dense white dots are Netflix's service nodes, and a line between two nodes indicates calls between those services. Nodes and connections form a complex service call chain. Such a huge application system must be managed through a powerful service governance platform.

Service governance is essentially the management and control of the service life cycle, so the core requirement of the service governance platform is how to solve the pain points in the service life cycle, which includes the following aspects:

1. Registration and discovery

The service caller must obtain the address of the service provider before invoking the service; that is, the caller needs to “discover” the provider in some way — this is service discovery. To make discovery possible, the provider's information must be stored in some carrier. The act of storing it is “service registration”, and the carrier is called the “service registry”. In any service governance platform, the registry is an essential module. Service registration and discovery are the most basic functions in service governance, responsible for the initial link in the service life cycle.

2. Traffic monitoring
After services are registered and discovered, they are invoked, and a large number of invocations form traffic. Traffic monitoring aims to give a clear picture of the calling relationships and status of the many services; it mainly includes the call topology, call tracing, logs, and monitoring alarms. Through the call topology, service governance monitors the overall service call relationships; by building a monitoring system, it quickly finds and locates problems, so as to perceive the running status of the business system as a whole. In the service life cycle, traffic monitoring is responsible for awareness of the service's running state.

3. Traffic scheduling
During the operation of a business system, hot events such as promotions, flash sales (seckill), or celebrity gossip, and emergencies such as network outages, power failures, or large-scale system upgrades in a data center, often cause a sudden surge in traffic to some services. The traffic of these services then needs to be scheduled and managed. Traffic management covers two aspects. At the micro level — from the perspective of a single service — it is the management of the service invocation process: when to apply which load balancing strategy, routing strategy, and circuit-breaking / rate-limiting strategy; these are collectively called invocation strategies. At the macro level, it is the management of traffic distribution: according to certain traffic characteristics and proportions, grayscale releases, blue-green releases, and so on can be performed; these are called traffic distribution strategies. Both kinds of strategy need to be decided by analyzing the invocation data collected by traffic monitoring, and then implemented on the service governance platform. Traffic scheduling is responsible for managing the running state of services.
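As a small illustration of a traffic distribution strategy, the Go sketch below routes a fixed share of traffic to a grayscale (canary) group by hashing a stable request attribute. The `splitTraffic` function and its parameters are hypothetical, not part of any platform described here.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// splitTraffic decides whether a request goes to the canary (grayscale)
// group. Hashing a stable request attribute (user id, device id, ...)
// keeps the decision sticky per user, which a random draw would not.
// canaryPercent is the canary group's share of traffic, 0..100.
func splitTraffic(key string, canaryPercent uint32) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	if h.Sum32()%100 < canaryPercent {
		return "canary"
	}
	return "stable"
}

func main() {
	canary := 0
	for i := 0; i < 1000; i++ {
		if splitTraffic(fmt.Sprintf("user-%d", i), 10) == "canary" {
			canary++
		}
	}
	// With a reasonable hash, roughly 10% of keys land in the canary group.
	fmt.Printf("canary share: %d/1000\n", canary)
}
```

The same shape generalizes to blue-green releases: set the share to 0 or 100 and flip it atomically once the new version is verified.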

4. Service control
How does a traffic scheduling strategy take effect on the service provider and caller? It may take effect after a restart, or in real time while the service is running — this depends on the service governance platform's degree of control over the service. Once the platform has fully built up its governance capability, governance policies can be distributed to services in real time and take effect immediately.

5. Service security
Each service bears its own business responsibilities. Some business-sensitive services need to authenticate and authorize access from other services — this is a security issue.

This article refers to a system with thousands of services as a large-scale application system. Such a system is characterized by a large number of services, a large number of service instances, and a large volume of service calls. When a service governance platform manages the services of this kind of business system, it faces the following major challenges:

1. High reliability
Large-scale business systems, with massive service calls and intricate call relationships, place high demands on service reliability. Many foundational services require 99.99% availability, so the service governance platform that maintains them must itself be highly reliable. To achieve such high reliability, the platform needs multi-level deployment, multi-region hot standby, degradation and isolation, smooth rollout, and similar measures.

2. High performance
On the premise of ensuring reliability, service governance must also deliver high performance. For example, from monitoring data it should quickly and accurately detect a single point of failure in a service, so that traffic can be redistributed to the service's other processes. If the business system has few services and low call volume, the amount of monitoring data is small and a single point of failure is easy to find. In massive real-time call data, however, conventional query methods take far too long — by the time the failure is perceived, irreparable business losses may already have occurred. Performance is therefore an important indicator of a governance platform's capability; to ensure it, high-speed storage, multi-level caching, and linearly scalable deployment are all essential.

3. High scalability
High scalability covers two aspects. The services of a large-scale application system may be developed and maintained by multiple teams whose levels and technical capabilities vary, so the governance platform must provide compatibility and extension capabilities to manage as many different kinds of services as possible. At the same time, as the business system's service volume grows, the governance platform must be able to scale out in step, to preserve its high reliability and performance.

Facing the governance challenges of massive services, a service governance platform also needs a powerful and easy-to-use storage component, and etcd is a good choice.

2. Introduction to etcd

2.1 etcd's development background and related competing products

In 2013, while the CoreOS startup team was building Container Linux, an open-source, lightweight operating system, it developed a self-built high-availability distributed KV storage component for configuration sharing and service discovery — etcd — to handle coordination among multiple replicas of user services.

Below we compare etcd with ZooKeeper and Consul:

· ZooKeeper
ZooKeeper largely meets the requirements in terms of high availability, data consistency, and functionality, but CoreOS insisted on developing etcd for the following reasons:

1. ZooKeeper does not support changing membership safely through an API; node configuration must be modified manually and processes restarted, and a mistake may cause online failures such as split brain. Safe runtime reconfiguration was one of etcd's design goals, and ZooKeeper's maintenance cost in this regard is relatively high.

2. Under high read/write load, ZooKeeper does not perform well with large numbers of instance connections.

The name etcd combines the Unix “/etc” directory and “d” for “distributed”: the “/etc” directory stores the configuration data of a single system, while etcd stores the configuration data of a large-scale distributed system. An etcd cluster provides a highly stable, reliable, scalable, and high-performance distributed KV storage service. etcd is implemented as a replicated state machine, consisting of a Raft consensus module, a log module, and a state machine persisted with boltdb. It can be applied to configuration management, service discovery, and distributed coordination in distributed systems.

Like etcd, ZooKeeper can solve problems such as distributed consistency and metadata storage, but etcd has the following advantages over ZooKeeper:

1. Dynamic cluster membership reconfiguration

2. Stable reading and writing ability under high load

3. Multi-version concurrency control data model

4. Reliable key monitoring

5. The Lease primitive separates the connection from the session

6. Distributed locks ensure API security

7. ZooKeeper uses its own RPC protocol, which is limited in use; while the etcd client protocol is based on gRPC and can support multiple languages.

· Consul

Consul and etcd solve different problems: etcd is a distributed, consistent KV store, while Consul focuses on end-to-end service discovery, providing built-in health checks, failure detection, DNS services, and so on. Consul also provides KV storage through RESTful HTTP APIs, but when the KV data reaches the million level, problems such as high latency and memory pressure appear.

In terms of consensus algorithms, etcd and Consul replicate data based on the Raft algorithm, while ZooKeeper is based on the Zab protocol. The Raft algorithm consists of Leader election, log replication, and safety, while the Zab protocol consists of Leader election, discovery, synchronization, and broadcast.

In terms of distributed CAP, etcd, Consul, and ZooKeeper are all CP systems. When a network partition occurs, new data cannot be written.

The following table is a comparative analysis of the key capabilities of the three:

[Image: comparison table of etcd, ZooKeeper, and Consul]

2.2 etcd core technology introduction

Achieving high availability and strong consistency of data based on the Raft protocol

Early data storage services introduced multi-replica technology to solve the single-point problem, but both master-slave replication and decentralized replication have drawbacks: master-slave replication is hard to operate and maintain, and struggles to balance consistency and availability; decentralized replication produces write conflicts of various kinds that the business has to resolve. Distributed consensus algorithms are the key to solving multi-replica replication. The distributed consensus algorithm, also known as the consensus algorithm, was first proposed against the background of the replicated state machine. Paxos, the first consensus algorithm, is overly complex, hard to understand, and hard to implement in engineering. The Raft algorithm, proposed by Diego Ongaro of Stanford University, decomposes the problem into three sub-problems, making it easier to understand and reducing the difficulty of engineering implementation. The three sub-problems are Leader election, log replication, and safety.

Leader Election

The Raft protocol in etcd (version 3.4+) defines four states for cluster nodes: Leader, Follower, Candidate, and PreCandidate.

Under normal circumstances, the Leader periodically broadcasts heartbeat messages to the Followers at the heartbeat interval to maintain its Leader status, and each Follower replies with a heartbeat response. The Leader carries a term number (term): each term begins with an election, and the node that wins the election acts as Leader for that term. The term number increases monotonically and serves as a logical clock in the Raft algorithm, used to compare how up to date each node's data is, to identify stale Leaders, and so on.

When the Leader node fails, a Follower stops receiving the Leader's heartbeats. Once the silence exceeds the election timeout, the Follower enters the PreCandidate state: it does not increment the term number, but first initiates a pre-vote (to prevent a node whose data lags far behind the others from triggering an invalid election). Only after winning the approval of a majority of nodes does it enter the Candidate state. A node in the Candidate state waits a random interval and then starts the election: it increments the term number, votes for itself, and sends vote requests to the other nodes.

When node B receives the election message of node A, there are two situations:

1. If node B judges that node A's data is at least as new as its own, node A's term number is greater than node B's, and node B has not voted for another candidate, node B can vote for node A. Once node A is supported by a majority of the nodes in the cluster, it becomes the new Leader.

2. If node B has also initiated an election and voted for itself, it refuses to vote for node A. If at this point no node wins a majority of the votes, all nodes wait for the election to time out and start a new round.
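The two cases above can be sketched as a single vote-granting predicate. This is an illustrative simplification (real Raft implementations such as etcd's raft library track considerably more state); all type and function names here are invented.

```go
package main

import "fmt"

// VoteRequest carries the fields a candidate sends when asking for a vote.
type VoteRequest struct {
	Term         uint64 // candidate's term
	LastLogTerm  uint64 // term of candidate's last log entry
	LastLogIndex uint64 // index of candidate's last log entry
}

// NodeState is the voter's view of its own term and log.
type NodeState struct {
	Term         uint64
	VotedFor     string // "" if it has not voted in this term
	LastLogTerm  uint64
	LastLogIndex uint64
}

// grantVote applies the two cases described above: vote only if the
// candidate's term is not older, we have not already voted for someone
// else this term, and the candidate's log is at least as up to date.
func grantVote(s NodeState, candidate string, req VoteRequest) bool {
	if req.Term < s.Term {
		return false // stale candidate
	}
	if req.Term == s.Term && s.VotedFor != "" && s.VotedFor != candidate {
		return false // already voted for someone else this term
	}
	// Up-to-date check: compare last log term first, then index.
	if req.LastLogTerm != s.LastLogTerm {
		return req.LastLogTerm > s.LastLogTerm
	}
	return req.LastLogIndex >= s.LastLogIndex
}

func main() {
	me := NodeState{Term: 3, LastLogTerm: 3, LastLogIndex: 7}
	fmt.Println(grantVote(me, "A", VoteRequest{Term: 4, LastLogTerm: 3, LastLogIndex: 7}))
}
```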

[Image: Raft leader election]

Log Replication

The Raft log structure is shown in the following figure:

[Image: Raft log structure]

The Raft log is composed of entries in ordered index form; each log entry contains a term number and the proposal content. The Leader tracks each Follower's progress with two fields: NextIndex, the index of the next log entry the Leader will send to that Follower, and MatchIndex, the highest log entry index that the Follower has already replicated.

This article takes the whole process from submitting the “hello=world” proposal to receiving the response as an example to briefly introduce the etcd log replication process:

1. After the Leader receives the proposal submitted by the client, it generates a log entry, traverses each Follower's log progress, and generates a log-append RPC message for each Follower;

2. The log-append RPC messages are broadcast to the Followers through the network module;

3. After a Follower receives the append message and persists it, it replies to the Leader with the highest log entry index it has replicated, i.e. its MatchIndex;

4. After receiving a Follower's response, the Leader updates that Follower's MatchIndex;

5. Based on the MatchIndexes reported by the Followers, the Leader calculates the committed index position of the log, i.e. the position up to which entries have been persisted by more than half of the nodes;

6. The Leader informs each Follower of the committed log index position via heartbeat;

7. When the client's proposal is marked as committed, the Leader replies to the client that the proposal has passed.

Through the above process, the Leader synchronizes the log entries to each Follower to ensure the data consistency of the etcd cluster.
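Step 5 of the walkthrough — computing the committed index from the Followers' MatchIndex values — can be sketched as follows. This is a simplified illustration; among other things, real Raft only advances the commit index for entries from the Leader's current term.

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex computes the highest log index replicated on a majority
// of the cluster, from the Leader's own last index plus each Follower's
// MatchIndex. After sorting in descending order, the value at position
// quorum-1 is stored on at least quorum nodes.
func commitIndex(leaderLast uint64, matchIndex []uint64) uint64 {
	all := append([]uint64{leaderLast}, matchIndex...)
	sort.Slice(all, func(i, j int) bool { return all[i] > all[j] })
	quorum := len(all)/2 + 1
	return all[quorum-1]
}

func main() {
	// 5-node cluster: leader at index 9, followers at 9, 4, 8, 3.
	// Index 8 is persisted on 3 of 5 nodes, so it is committed.
	fmt.Println(commitIndex(9, []uint64{9, 4, 8, 3}))
}
```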

Safety

etcd ensures the safety of the Raft algorithm by adding a series of rules to election and log replication.

Election Rules:

1. For a given term number, at most one Leader can be elected, and election requires the support of more than half of the nodes in the cluster;

2. When a node receives a vote request, if the term number of the candidate's latest log entry is smaller than its own, it refuses to vote; if the term numbers are equal but the candidate's log is shorter than its own, it also refuses to vote.

Log Replication Rules:

1. Leader completeness: if a log entry has been committed in a given term, then that entry must appear in the logs of all Leaders with higher term numbers;

2. Append-only principle: the Leader can only append log entries; it cannot delete entries that have been persisted;

3. Log matching property: when the Leader sends a log-append message, it carries the index position (denoted P) and term number of the preceding log entry. After a Follower receives the message, it checks whether the term number at index position P matches the Leader's; only if they match does it append the entries.
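Rule 3's consistency check can be sketched as a Follower-side append function. The types and 1-based indexing here are illustrative simplifications, not etcd's actual implementation.

```go
package main

import "fmt"

// Entry is one replicated log entry.
type Entry struct {
	Term uint64
	Data string
}

// appendEntries sketches the Follower-side check: the append is
// accepted only if the Follower holds an entry at prevIndex (1-based;
// 0 means "before the first entry") with term prevTerm. On success,
// any conflicting suffix is truncated and the new entries appended.
func appendEntries(log []Entry, prevIndex, prevTerm uint64, entries []Entry) ([]Entry, bool) {
	if prevIndex > uint64(len(log)) {
		return log, false // gap: we don't have the preceding entry yet
	}
	if prevIndex > 0 && log[prevIndex-1].Term != prevTerm {
		return log, false // term mismatch at position P: reject
	}
	// Truncate anything after the match point, then append.
	return append(log[:prevIndex], entries...), true
}

func main() {
	log := []Entry{{Term: 1, Data: "a"}, {Term: 2, Data: "b"}}
	log, ok := appendEntries(log, 2, 2, []Entry{{Term: 3, Data: "c"}})
	fmt.Println(ok, len(log))
}
```

When the check fails, the real protocol has the Leader decrement that Follower's NextIndex and retry, walking back until the logs agree.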

boltdb storage technology

Another core technology of etcd is boltdb storage, which provides efficient B+ tree retrieval and supports transactions. It is one of the key capabilities supporting etcd's high-performance reads and writes.

boltdb's implementation draws on the design of LMDB (Lightning Memory-Mapped Database), an efficient and fast memory-mapped database solution built on a B+ tree structure. For its data files, bolt uses a single memory-mapped file and implements a copy-on-write B+ tree, which makes reads fast. boltdb's load time is also very short, especially when recovering from a crash, because it does not need to replay a log (in fact it has none at all) to find the last successful transaction — it only reads the IDs of the B+ tree root nodes from its two meta pages.

File Storage Design

Because storage is mapped from a single file, bolt divides the file into blocks (pages) of a specified length, and each block stores a different content type. By default a page size of 4096 bytes is used. Each block begins with a separate page id (a 64-bit integer).

There are several types of file blocks:

[Image: boltdb page types]

Overall structure of the data file

Description:

  • Meta pages are fixed at page 0 and page 1

  • Page size is fixed at 4096 bytes

  • bolt writes files in little-endian byte order; for example, the value 0x0827 (2087) is written as 2708000000000000 in hexadecimal

The advantage of the single-file scheme is that no file merging or deletion is needed; the file only needs to be extended in place.

Query Design

boltdb provides very efficient query capability; take a look at its object design:

[Image: boltdb object design]

In terms of object design, when boltdb loads, it first reads the meta data into memory, then locates the bucket's data block, then locates the branch node positions according to the key, and finally locates the leaf value node.

Let's take a query as an example. The following is a basic query code sample:

tx, err := db.Begin(false) // start a read-only transaction
if err != nil {
    return
}
defer tx.Rollback() // read-only transactions are closed with Rollback

b := tx.Bucket([]byte("MyBucket")) // look up the bucket by name
v := b.Get([]byte("answer20"))     // look up the value by key
fmt.Println(string(v))

Corresponding to the above code, the following sequence diagram can provide a more detailed understanding of the operation process of a query:

image

The most critical code above is the search method. The following is its main code, with comments added for easy reading.

func (c *Cursor) search(key []byte, pgid pgid) {
    p, n := c.bucket.pageNode(pgid)
    if p != nil && (p.flags&(branchPageFlag|leafPageFlag)) == 0 {
        panic(fmt.Sprintf("invalid page type: %d: %x", p.id, p.flags))
    }
    // Push the current query node (page, node) onto the stack
    e := elemRef{page: p, node: n}
    c.stack = append(c.stack, e)

    // If we're on a leaf page/node then find the specific node.
    if e.isLeaf() {
        c.nsearch(key)
        return
    }
    // If the node is cached, search by the node's inodes field
    if n != nil {
        c.searchNode(key, n)
        return
    }
    // Otherwise load the branch page and recursively search the child node
    c.searchPage(key, p)
}

3. Baidu's approach to building large-scale service governance based on etcd

3.1 Specific challenges

Tianlu is a solution developed by the Baidu Mini Program Team for the needs of large-scale business service governance; one of its goals is to become a standard model for service governance at Baidu. Tianlu consists of five parts: the registry, the visual management platform, the SDK framework, the unified gateway, and tianlu-mesher. It currently serves 150+ product lines, with instance counts in the hundreds of thousands. As the number of teams on the platform grows and service instances increase rapidly, enabling easy collaboration among many teams while keeping the large-scale governance platform highly available and performant has been a constant challenge for Tianlu.

3.2 Ideas and plans for overall architecture construction

As a service governance platform, Tianlu's core concept is to provide convenient calling for all services, unified service monitoring and management, and reduced service development and maintenance costs. We approached building a large-scale service governance platform on etcd from the following aspects: high availability, high performance, high scalability, and ease of use.

·High availability

  • Considering the high network latency of cross-datacenter etcd calls, we deploy within a single datacenter, and we also implemented a primary-standby cluster switchover scheme to handle the case where all instances in that datacenter go down and the etcd cluster becomes unavailable.

  • To reduce the platform's hard dependency on etcd, we built a degradation path from etcd to a cache. When monitoring shows that the etcd cluster can no longer support the current platform, instance data is served from the cache, so that operations staff can keep working normally until etcd is restored; once etcd recovers, we switch back to the etcd cluster.

· High performance

  • The KV query performance of an etcd cluster is very high — QPS can exceed 10,000. To handle performance problems under extreme concurrency, the registry uses multi-level caching to improve query efficiency and reduce access pressure on etcd.

  • Calls between services are made directly, without being forwarded through the registry, so invocation incurs essentially no extra latency.
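The multi-level caching idea can be sketched as a read-through cache in front of the backing store. This is an illustrative toy, not Tianlu's implementation: `cachedRegistry` and its fields are invented, a plain map stands in for etcd, and cache invalidation (e.g. via etcd's watch mechanism) is omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// cachedRegistry sketches a multi-level read path: a local in-process
// cache is consulted first, and only misses fall through to the
// backing store (etcd in a real deployment; a plain map here).
type cachedRegistry struct {
	mu    sync.RWMutex
	local map[string]string // L1: in-process cache
	store map[string]string // stands in for etcd
}

func (c *cachedRegistry) Get(key string) (string, bool) {
	c.mu.RLock()
	v, ok := c.local[key]
	c.mu.RUnlock()
	if ok {
		return v, true // cache hit: no pressure on the backing store
	}
	v, ok = c.store[key] // miss: read through to the backing store
	if ok {
		c.mu.Lock()
		c.local[key] = v // populate L1 for subsequent reads
		c.mu.Unlock()
	}
	return v, ok
}

func main() {
	r := &cachedRegistry{
		local: map[string]string{},
		store: map[string]string{"svc/user/inst1": "10.0.0.1:8080"},
	}
	v, _ := r.Get("svc/user/inst1")
	fmt.Println(v)
}
```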

· High scalability

  • Considering that the number of service instances may reach the millions in the future, the architecture must be highly scalable.

· Ease of use

  • Through the visual management platform, users can view registered services and update service governance policy configuration, adjusting governance strategies in real time.

  • Connect the call log to the trace platform, and users can find the records of the entire call chain on the trace platform through the traceId, which is convenient for quick problem location when errors occur.

  • Multilingual SDKs support multiple RPC technologies, including Baidu's self-developed RPC framework brpc as well as the HTTP JSON-RPC protocol.

[Image: Tianlu architecture diagram]

3.3 Key indicators and operation and maintenance goals

In addition, to better operate and maintain the service governance platform, the following key assessment indicators and operation-and-maintenance requirements are defined.

Key metrics:

· Availability of more than 99.99%;

· Average response time below 100 ms.

Operation and maintenance goals:

  • Early fault detection

· Configure monitoring alarms, including registry instance health checks, etcd ping, and memory and CPU monitoring.

  • Fast troubleshooting

· Automatic handling: through Noah's callback mechanism, some faults are handled automatically to improve processing speed.

· Manual handling: via the watch mechanism.

4. Summary

At present, service governance is receiving more and more attention from enterprises, especially as technologies such as cloud native and microservices are adopted more widely, but truly applying it and integrating it well still poses many challenges. Beyond a mature service governance product, the team's overall understanding of service governance, its accumulated technical experience, and its ability to follow service-oriented design will all affect the final result. This article only offers some inspiration on the selection of service governance products, in the hope of helping you walk the road of service governance better and more steadily.