Why does Lettuce lead to longer downtime?

Author: Yang Bodong (Fan Che)

This article describes how Alibaba Cloud Database Tair/Redis reduced recovery time in unexpected-downtime switchover scenarios with long-connection clients, from the initial 900s to 120s and then to 30s. The work spans both product optimization and fixes to open source software.

1. Background

Lettuce [1] is an excellent Redis [2] Java client that supports synchronous, asynchronous, reactive, and other programming interfaces, and is very popular among users. Starting in 2020, as its user base grew, many users reported that with the Lettuce client, after certain Redis failures and crashes, Lettuce would keep timing out for up to 15 minutes, leaving the business unavailable.

Alibaba Cloud database engineers received the same feedback from customers, so we began an in-depth investigation and tracked the problem continuously. It was finally resolved this September. Below we describe the problem using the Redis Standard Edition architecture (note that it also exists in non-cloud environments).

(Figure 1. Redis Standard Edition dual-copy switching process)

  1. In the Redis Standard Edition architecture, the open source SDK resolves the domain name to a VIP address and establishes a connection to Ali-LB, which in turn connects to the Redis master (connections 1’ and 1 in the figure correspond to each other).

  2. When the master crashes outright due to an unexpected failure, there is a chance that no RST is generated.

  3. The HA component detects that the master is down and calls Ali-LB’s switch_rs interface to switch the backend from the master to the replica.

  4. After the switchover completes, Ali-LB does not actively release the old client connections on the front end. Packets the client sends to Ali-LB are silently discarded because the backend is unavailable, so the client keeps timing out. A newly established connection (for example 4′, which connects to the new master) works fine, but Lettuce does not re-establish the connection after a timeout, so the old connection stays broken.

  5. Only when Ali-LB’s est_timeout (default 900s) is reached does Ali-LB reply with an RST to close the connection, after which the client recovers.

Note: With certain failures such as NIC outages or network partitions, no RST is generated. In most crashes, the operating system sends an RST to the client before the process exits, so this problem does not occur on every switchover or crash. In a planned switchover, since the master can still serve requests, in step 3 the HA component actively sends a client kill command to the old master, prompting the client to reconnect and recover.
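For reference, the "client kill" in the note is the standard Redis CLIENT KILL command. A hypothetical HA component could issue it roughly as follows (a sketch only; the host and client address are illustrative, and a real HA component would kill all normal client connections rather than a single address):

import io.lettuce.core.RedisClient;
import io.lettuce.core.api.sync.RedisCommands;

// Sketch of step 3 in a planned switchover: the HA component kills the client
// connections on the old master so that clients reconnect (and reach the new master).
RedisCommands<String, String> commands =
        RedisClient.create("redis://old-master:6379").connect().sync();
commands.clientKill("10.0.0.5:52341"); // kill the connection from this client address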

2. Problem Analysis

  1. First of all, this is a design flaw in the Lettuce client; the reasoning appears below in the comparative analysis with other clients.

  2. Secondly, this is an immature mechanism in Ali-LB (it stays silent after a switchover and does not close its connections with clients), so every database product behind Ali-LB is affected, including RDS MySQL and others.

  3. Since 900s of unavailability has a severe impact on Tair (for example, a user running 10,000 QPS would see roughly 9 million requests affected over those 900s), we were the first to push for a solution to this problem.

2.1 Why are there no problems with Jedis and Redisson clients?

Jedis uses a connection pool. When an operation on a connection times out, that connection is destroyed; the next request gets a fresh connection from the pool, which connects to the newly promoted node, and service recovers.

Jedis connection pool mode

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

JedisPool jedisPool = new JedisPool("redis-host", 6379);
Jedis jedis = null;
try {
    jedis = jedisPool.getResource(); // Get a connection before querying
    // jedis.xxx // Execute the operation/query
} catch (Exception e) {
    e.printStackTrace(); // Timeout, command error, etc.
} finally {
    if (jedis != null) {
        // close() returns a healthy connection to the pool;
        // a broken connection is destroyed instead
        jedis.close();
    }
}

Redisson can periodically send PING to the server on each connection to determine whether it is alive; if the check fails, it re-establishes the connection.

Redisson’s PingConnectionInterval parameter

// PingConnectionInterval: interval in ms between PING packets sent to the server
// on this connection; if the check fails, the connection is re-established. Default: 30000.
Config config = new Config();
config.useSingleServer().setAddress(uri).setPingConnectionInterval(1000);
RedissonClient connect = Redisson.create(config);

2.2 Can the connection be kept alive by configuring TCP KeepAlive?

The conclusion is that it does not work, because TCP retransmission takes priority over keepalive probes: on a connection with in-flight data, when this problem occurs, TCP retransmission kicks in first, governed by the tcp_retries2 [3] parameter (default 15 retries, which takes about 924.6s).
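As a quick sanity check on that 924.6s figure, the sketch below recomputes it. It assumes common Linux defaults: an initial RTO of 200ms that doubles after every retransmission and is capped at TCP_RTO_MAX (120s); these constants are assumptions, not values read from a live kernel:

// With tcp_retries2 = 15, there are 15 retransmissions plus a final wait,
// i.e. 16 timeout intervals before the kernel gives up on the connection.
public class RetransTime {
    public static void main(String[] args) {
        double rto = 0.2;   // assumed initial RTO in seconds
        double total = 0.0;
        for (int i = 0; i < 16; i++) {
            total += rto;
            rto = Math.min(rto * 2, 120.0); // cap at TCP_RTO_MAX = 120s
        }
        System.out.printf("total ≈ %.1f s%n", total); // prints: total ≈ 924.6 s
    }
}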

(Figure 2. Flow chart of active connection black hole problem)

  • T1: Client sends set key value to Ali-LB

  • T2: Ali-LB replies OK

  • T3: Client sends get key to Ali-LB, but the backend switches over at this moment; from then on Ali-LB gives no response at all and the client times out

  • T4: The first TCP retransmission starts

  • T5: The second TCP retransmission starts

  • T6: The connection is still in TCP retransmission, but Ali-LB’s est_timeout has been reached, so Ali-LB replies with an RST and the client recovers. Even if Ali-LB never sent an RST, TCP would tear down the connection itself once retransmission was exhausted, and the subsequent reconnect would also restore service.

Therefore, the client cannot solve this problem by relying on TCP KeepAlive alone. See also the Zhihu question “TCP already has the SO_KEEPALIVE option, why add a heartbeat packet mechanism at the application layer?” [4]. Lettuce has supported configuring KeepAlive since version 6.1.0 [5], but as analyzed above, this does not help an active connection. We therefore filed a detailed issue [6] with Lettuce describing the problem, how to reproduce it, the cause, and possible fixes; the maintainer agreed with the analysis.
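For completeness, here is a sketch of enabling KeepAlive on Lettuce 6.1.0+ [5]; the builder calls below reflect our reading of that version’s API (treat them as an assumption and verify against the version you use), and the fine-grained idle/interval/count options require a native transport such as epoll on Linux. As analyzed above, this protects only idle connections, not those with in-flight commands:

import java.time.Duration;

import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.SocketOptions;

RedisClient client = RedisClient.create("redis://redis-host:6379");
client.setOptions(ClientOptions.builder()
        .socketOptions(SocketOptions.builder()
                .keepAlive(SocketOptions.KeepAliveOptions.builder()
                        .enable()
                        .idle(Duration.ofSeconds(30))    // start probing after 30s idle
                        .interval(Duration.ofSeconds(5)) // probe every 5s
                        .count(3)                        // give up after 3 failed probes
                        .build())
                .build())
        .build());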

3. Problem Solving

3.1 Emergency mitigation (stopping the bleeding)

  1. Since there was no other effective method at the time, we could only lower est_timeout to 120s (it cannot go smaller, otherwise normal but silent connections would be disconnected). This caps user impact at about 135s (120s + 15s of detection; note that after a failure, detection must complete before the switchover can start).

  2. The documentation on the official website advised users not to use Lettuce.

3.2 Client side repair

Attempt 1: Add PingConnectionInterval to Lettuce

As analyzed above, solving this on the client side requires an application-layer liveness check: the client periodically injects a liveness probe into its connection to the server. Crucially, the probe must travel over the existing client-server connection, not a separate new one; the black hole is per-connection, so a probe on a fresh connection would get a normal reply and misjudge the situation.
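The shape of the idea is roughly the following (a hypothetical sketch, not Lettuce’s eventual implementation): periodically issue a PING on the existing connection and close the connection when the probe fails, which forces a reconnect to the new master:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;

// Hypothetical application-layer liveness check on an EXISTING connection.
RedisClient client = RedisClient.create("redis://redis-host:6379");
StatefulRedisConnection<String, String> conn = client.connect();
ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
timer.scheduleAtFixedRate(() -> {
    try {
        conn.sync().ping(); // travels over the same TCP connection as normal commands
    } catch (Exception timeoutOrError) {
        conn.close();       // black-holed connection: drop it so the app reconnects
    }
}, 1, 1, TimeUnit.SECONDS);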

After we submitted a commit [7], the maintainer was not very satisfied with this plan. His view:

  • This fix is relatively complicated.

  • Since Lettuce supports Command Listeners, he felt users could close the connection themselves after a command times out.

  • Redis has blocking commands such as XREAD and BRPOP; while one of these is in flight, the connection hangs and cannot be probed, so the ping approach could misjudge it.

After further discussion, we pushed back on having users close connections themselves via Command Listeners: it would be a complicated change that every user must make in order to use Lettuce safely, and the cost would be very high. However, the objection that this ping scheme cannot cover blocking commands is real, so the proposal was shelved for the time being.

Attempt 2: Use TCP_USER_TIMEOUT

TCP_USER_TIMEOUT is a TCP option specified in RFC 5482 [8]. It makes the “User Timeout” parameter of the original TCP specification (RFC 793 [9]) configurable, which the original protocol did not allow. It controls how long sent data may remain unacknowledged before the connection is forcibly closed, so it can solve the problem described above that KeepAlive cannot, where retransmission takes priority.

(Figure: how KeepAlive, TCP retransmission, and TCP_USER_TIMEOUT interact)

After confirming that TCP_USER_TIMEOUT could solve this problem, we talked with the maintainer again, and he agreed to this fix. We submitted PR [10], which was eventually merged, and we later verified that the fix behaved as expected. The black hole problem can be solved with the version below; note that it depends on netty-transport-native-epoll:4.1.65.Final:linux-x86_64. When EPOLL is available, enable the option with the following code. tcpUserTimeout can be tuned to the specific business; 30s is recommended.

Turn on TCP_USER_TIMEOUT

// Forcibly close the connection if sent data stays unacknowledged for tcpUserTimeout ms
bootstrap.option(EpollChannelOption.TCP_USER_TIMEOUT, tcpUserTimeout);

Lettuce SNAPSHOT version containing the fix

<dependency>
    <groupId>io.lettuce</groupId>
    <artifactId>lettuce-core</artifactId>
    <version>6.3.0.BUILD-SNAPSHOT</version>
</dependency>
<dependency>
    <groupId>io.netty</groupId>
    <artifactId>netty-transport-native-epoll</artifactId>
    <version>4.1.65.Final</version>
    <classifier>linux-x86_64</classifier>
</dependency>
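If you need to set the option yourself (for example on a version without built-in support), one possible wiring, sketched here assuming netty’s epoll transport is on the classpath, goes through Lettuce’s ClientResources and a NettyCustomizer:

import io.lettuce.core.RedisClient;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.NettyCustomizer;
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.epoll.EpollChannelOption;

// Sketch: apply TCP_USER_TIMEOUT (in ms) to every connection Lettuce creates.
// Requires the Linux epoll native transport; 30s follows the recommendation above.
ClientResources resources = ClientResources.builder()
        .nettyCustomizer(new NettyCustomizer() {
            @Override
            public void afterBootstrapInitialized(Bootstrap bootstrap) {
                bootstrap.option(EpollChannelOption.TCP_USER_TIMEOUT, 30_000);
            }
        })
        .build();
RedisClient client = RedisClient.create(resources, "redis://redis-host:6379");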

3.3 Ali-LB repair solution

To address this problem, the Ali-LB team launched the Connection Draining feature. Connection draining means draining existing connections gracefully, and is used for graceful shutdown.

Graceful shutdown assumes the backend server is still able to serve. As shown in the figure below, there are four servers behind an Ali-LB instance, and a scale-in operation removes Server 4. Requests 4 and 6, which arrive on connections already routed to Server 4, are still answered by Server 4 within the configured draining window (0-900s); only when the draining time is reached are the connections closed. Note that once draining begins, no new connections are dispatched to Server 4, so subsequent requests such as 7, 8, and 9 go elsewhere; this is also the precondition for draining to complete.

Therefore, with draining enabled, the client receives Ali-LB’s RST no later than the configured draining time.

(Figure 3. Connection Draining diagram)

Compared with the est_timeout mechanism, Connection Draining produces fewer misjudgments and delivers in-flight requests on a best-effort basis.

(Table 1. est_timeout vs. connection draining)

After the Ali-LB team launched Connection Draining, we helped verify it and shortened the failure time from 120s to 30s, which meets the Redis product SLA; it has since been rolled out across the entire network. This also solves the connection black hole problem for other long-connection Redis SDKs and for database products in general.

4. Summary

This article details the principles and solutions to the Lettuce client black hole problem:

  1. From the client side: upgrade Lettuce to the latest 6.3.0 version and turn on the TCP_USER_TIMEOUT parameter. On Alibaba Cloud there is no need to modify code: Ali-LB’s Connection Draining avoids the problem (no user upgrade required; Alibaba Cloud is rolling the change out gradually).

  2. A vicious bug in a widely used software package can do enormous harm. Lettuce is the default and most commonly used Redis SDK in Spring Boot, and partly due to the maintainer’s stubbornness, or perhaps his rigor (see [6]), many cloud users suffered serious failures over the past few years. While pushing the fix, we saw not only Azure, AWS, and Huawei consulting on and promoting the issue, but also countless developers waiting for a fix. Redis and Tair should also invest more in community SDKs, especially SDKs they develop and control themselves.

It took about two years from discovering this problem to fixing it, but it was finally solved. The road was long and difficult, but the end has been reached!

Reference reading

[01] Lettuce: https://github.com/lettuce-io/lettuce-core

[02] Redis: https://github.com/redis/redis

[03] tcp_retries2: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

[04] “TCP already has the SO_KEEPALIVE option, why add a heartbeat packet mechanism at the application layer?”: https://www.zhihu.com/question/40602902/answer/209148428

[05] Lettuce KeepAlive support: https://github.com/lettuce-io/lettuce-core/issues/1437

[06] Lettuce black hole issue: https://github.com/lettuce-io/lettuce-core/issues/2082

[07] PingConnectionInterval commit: https://github.com/yangbodong22011/lettuce-core/commit/23bafbb9255c87ed96a6476c260b299f852ee88a

[08] TCP_USER_TIMEOUT (RFC 5482): https://www.rfc-editor.org/rfc/rfc5482.html

[09] RFC 793: https://www.rfc-editor.org/rfc/rfc793

[10] TCP_USER_TIMEOUT PR: https://github.com/lettuce-io/lettuce-core/pull/2499