13 Case Study | How does TCP congestion control cause business performance jitter?

This lesson will share with you the relationship between TCP congestion control and business performance jitter.

TCP congestion control is the core of the TCP protocol, and it is a very complex process; if you don’t understand congestion control, you don’t really understand TCP. The goal of this lesson is to use a few cases to introduce the pitfalls of TCP congestion control that we need to avoid, as well as the points to pay attention to when tuning TCP performance.

Because many kinds of problems can arise during TCP transmission, we will not walk through each case step by step. Instead, we will abstract over these cases and tie them to specific knowledge points, which is more systematic; once you understand these knowledge points, analyzing the cases themselves becomes fairly simple.

In the previous two lessons (Lessons 11 and 12), we covered the problem areas that deserve attention on a single machine. Network transmission, however, is a more complex process that involves more problems and is harder to analyze. I believe many of you have had experiences like these:

While waiting for the elevator, I was chatting with someone on WeChat, but after stepping into the elevator my messages would no longer send;

Sharing a network with my roommate, my online game suddenly started lagging badly right when I was having the most fun; it turned out my roommate was downloading movies;

Uploading a file to a server over FTP took far longer than I expected.

TCP congestion control is at work in all of these problems.

How does TCP congestion control affect business network performance?

Let’s first look at the general principles of TCP congestion control.

TCP congestion control

The above figure is a simple diagram of TCP congestion control, which is roughly divided into four stages.

1. Slow start

After the TCP connection is established, the sender enters the slow start stage and then gradually increases the number of packets (TCP segments) it sends. During this stage, the number of packets sent doubles with every RTT (round-trip time) that passes, as shown below:

TCP Slow Start diagram

The number of packets sent initially is determined by init_cwnd (the initial congestion window). In the Linux kernel this value is set to 10 (TCP_INIT_CWND), an empirical value summarized by Google researchers and also written into RFC6928. The Linux kernel changed the default from 3 to the 10 recommended by Google in version 2.6.38; if you are interested, you can take a look at the commit: tcp: Increase the initial congestion window to 10.

Increasing init_cwnd can significantly improve network performance, because more TCP segments can be sent at once during the initial stage. For the detailed reasoning, please refer to the explanation in RFC6928.

If your kernel version is older (older than the CentOS-6 kernel), you may consider increasing init_cwnd to 10. Raising it even further is not impossible, but you need to run more experiments against your own network conditions to find a more ideal value, because an initial congestion window that is set too large may lead to a high TCP retransmission rate. Of course, you can also adjust this value more flexibly through ip route, or even expose it as a sysctl control item.
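For example, here is a rough sketch of raising initcwnd on a route with ip route; the gateway 192.168.1.1, device eth0, and the value 10 are placeholders for illustration, and the new value should only affect connections established after the route is changed:

$ # hypothetical: assume the default route goes via 192.168.1.1 on eth0
$ ip route change default via 192.168.1.1 dev eth0 initcwnd 10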

Increasing init_cwnd is especially effective for improving the network performance of short connections, particularly short connections whose data can be sent entirely within the slow start stage. For example, for services such as HTTP, the amount of data in a short HTTP request is generally not large and the transfer can usually be completed during the slow start phase, which you can observe with tcpdump.
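If you want to see this for yourself, a rough sketch (the interface eth0 and port 80 are assumptions for illustration) is to capture the short connection with tcpdump and count how many data segments the server sends back to back before the first acks arrive:

$ tcpdump -nn -i eth0 'tcp port 80'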

In the slow start phase, when the congestion window (cwnd) grows to a threshold (ssthresh, the slow start threshold), TCP congestion control enters the next phase: congestion avoidance.

2. Congestion avoidance

At this stage, cwnd no longer grows exponentially; instead it increases by 1 per RTT, so cwnd grows slowly to prevent network congestion. Network congestion is unavoidable, and because network links are complex, out-of-order packets may even occur. One cause of out-of-order packets is shown in the figure below:

TCP out-of-order messages

In the figure above, the sender sends 4 TCP segments at once, but the 2nd segment is dropped in transit, so the receiver never receives it. The 3rd and 4th segments, however, do arrive; at this point they are out-of-order segments, and they are added to the receiver’s ofo queue (out-of-order queue).

Problems such as packet loss are more likely to occur in mobile network environments, especially when network conditions are poor. In an elevator, for example, the packet loss rate can be very high, and a high packet loss rate makes the network respond extremely slowly. For services inside a data center, it is rare for packets to be dropped on the network links; the packet loss problems I am talking about here mainly concern services that face the external network, such as gateway services.

For our gateway services, we have also done some sender-side (unilateral) TCP optimization ourselves, mainly tuning the CUBIC congestion control algorithm to alleviate the performance degradation caused by packet loss. In addition, BBR, the congestion control algorithm Google open sourced a few years ago, can in theory also alleviate TCP packet loss problems; however, in our practice BBR did not perform well, so in the end we did not use it.

Let’s go back to the figure above. Because the receiver has not received the 2nd segment, every time it receives a later segment it acks the missing one, that is, it sends ack 17. As a result, the sender receives three identical acks (ack 17). After three such duplicate acks appear, the sender judges that the packet has been lost and enters the next stage: fast retransmission.

3. Fast retransmission and fast recovery

Fast retransmission and fast recovery work together; they are an optimization for the packet-loss case. In this situation the network is not truly congested, so the congestion window does not have to be reset to its initial value. The basis for judging that a packet has been lost is receiving 3 duplicate acks.

Google engineers also proposed an improvement to TCP fast retransmission, TCP Early Retransmit, which in some circumstances allows a TCP connection to trigger fast retransmission instead of waiting for the retransmission timeout (RTO). Kernels from version 3.6 onward support this feature, so if you are still on CentOS-6 you cannot enjoy the network performance improvement it brings; you can upgrade your operating system to CentOS-7 or the latest CentOS-8. As an aside, Google’s technical strength in networking is hard for other companies to match, and the maintainer of the Linux kernel TCP subsystem, Eric Dumazet, is also a Google engineer.
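Coming back to early retransmit itself: on kernels that expose it, the behavior is controlled by the net.ipv4.tcp_early_retrans sysctl (on newer kernels this knob may no longer exist, since later loss-detection work such as RACK subsumed it). A quick way to check whether your kernel has it:

$ sysctl net.ipv4.tcp_early_retrans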

Besides fast retransmission, there is another retransmission mechanism: timeout retransmission. This is a much worse situation: if the ack for a sent packet is not received within a certain time (the RTO, retransmission timeout), the network is considered congested. In that case cwnd has to be reset to its initial value and grown again starting from slow start.
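To get a feel for how often retransmissions (including RTO-triggered timeouts) are happening on a host, you can look at the kernel’s aggregate TCP counters; a minimal sketch using standard tools:

$ netstat -s | grep -i retrans
$ # or with iproute2's nstat (absolute values, zero counters included)
$ nstat -az | grep -i -E 'retrans|TCPTimeouts'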

An RTO generally occurs when the network link is congested. If one connection carries too much data, packets of other connections may be queued up, causing large delays; the example we mentioned at the beginning, where downloading movies made someone else’s online game lag, happens for exactly this reason.

The RTO itself is also a tuning point. If the RTO is too large, the business may be blocked for a long time, so kernel version 3.1 introduced an improvement that lowered the initial RTO from 3s to 1s, which can significantly reduce how long the business is blocked. However, RTO = 1s is still a bit too high in some scenarios, especially in environments such as data centers where network quality is relatively stable.

We had such a case in the production environment: the business team reported that their RT (response time) jittered severely. A preliminary investigation with strace showed that the process was blocked in packet-sending functions such as send(). We then captured packets with tcpdump and found that after sending data, the sender got no response from the peer until the RTO expired and the data was retransmitted. We also captured packets on the peer and found that it took a long time for the packets to arrive there. We therefore concluded that the network was congested, so the peer was not receiving the packets in time.

So, is there a way to keep the business from being blocked for too long when network congestion occurs? One option is, when creating the TCP connection, to use SO_SNDTIMEO to set a send timeout so that the application does not block for too long when sending packets, as shown below:

struct timeval timeout = { .tv_sec = 2, .tv_usec = 0 };  /* example value: a 2-second send timeout */
ret = setsockopt(sockfd, SOL_SOCKET, SO_SNDTIMEO, &timeout, sizeof(timeout));

When the business code finds that a send on the connection has timed out, it can proactively close the connection and try another one.

This approach sets a timeout for one particular TCP connection. So, is there a way to adjust the RTO globally (set it once and have it take effect for all TCP connections)? The answer is yes, but it requires modifying the kernel. For this kind of need, our practice in the production environment is to turn TCP RTO min, TCP RTO max, and TCP RTO init into variables that can be flexibly controlled with sysctl, so that they can be adjusted according to the actual situation; for servers inside the data center, for example, these values can be lowered appropriately to reduce how long the business is blocked.
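As an aside, even without such kernel modifications, a stock kernel already lets you lower the minimum RTO per route through ip route’s rto_min metric. This is a different mechanism from the sysctl variables described above, and the subnet, gateway, device, and 50ms value below are placeholders for illustration:

$ # hypothetical: lower the minimum RTO for traffic to an internal subnet
$ ip route change 10.0.0.0/8 via 10.0.0.1 dev eth0 rto_min 50ms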

These four stages are the basis of TCP congestion control. Generally speaking, congestion control dynamically adjusts the congestion window according to the state of TCP data transmission and thereby controls how the sender sends packets. In other words, the size of the congestion window reflects how congested the network transmission path is. The cwnd of a TCP connection can be viewed with the ss command:

$ ss -nipt
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 36 172.23.245.7:22 172.30.16.162:60490
users:(("sshd",pid=19256,fd=3))
   cubic wscale:5,7 rto:272 rtt:71.53/1.068 ato:40 mss:1248 rcvmss:1248 advmss:1448 cwnd:10 bytes_acked:19591 bytes_received:2817 segs_out:64 segs_in:80 data_segs_out:57 data_segs_in:28 send 1.4Mbps lastsnd:6 lastrcv:6 lastack:6 pacing_rate 2.8Mbps delivery_rate 1.5Mbps app_limited busy:2016ms unacked:1 rcv_space:14600 minrtt:69.402

From this output we can see that the cwnd of this TCP connection is 10.
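If you only need a rough view of how cwnd changes over time, one crude option is simply to poll ss periodically; the peer address 172.30.16.162 here is taken from the sample output above:

$ watch -n 1 "ss -nti dst 172.30.16.162 | grep -o 'cwnd:[0-9]*'"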

If you want to track how the congestion window changes in real time, there is a better way: tracing through the tcp_probe tracepoint:

/sys/kernel/debug/tracing/events/tcp/tcp_probe

However, this tracepoint is only available in kernel versions 4.16 and later. If your kernel is older, you can use the tcp_probe kernel module (net/ipv4/tcp_probe.c) for tracing instead.
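Here is a minimal sketch of using this tracepoint through the ftrace interface, assuming debugfs is mounted at /sys/kernel/debug and the commands are run as root:

$ # enable the tcp_probe tracepoint
$ echo 1 > /sys/kernel/debug/tracing/events/tcp/tcp_probe/enable
$ # read events as they are generated (Ctrl-C to stop)
$ cat /sys/kernel/debug/tracing/trace_pipe
$ # disable it again when done
$ echo 0 > /sys/kernel/debug/tracing/events/tcp/tcp_probe/enable

Each event carries fields such as snd_cwnd and ssthresh, so you can watch the congestion window evolve while data is being transferred.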

In addition to network conditions, the sender also needs to know the receiver’s processing capability. If the receiver processes data slowly, the sender has to slow down its sending rate, otherwise packets will pile up in the receiver’s buffer or even be dropped by the receiver. The receiver’s processing capability is expressed by another window: rwnd (the receiver window). So how does the receiver’s rwnd affect the sender’s behavior?

How does the receiver influence the sender’s sending behavior?

Again, here is a simple diagram showing how the receiver’s rwnd affects the sender:

rwnd and cwnd

As shown in the figure above, after receiving a data packet the receiver sends an ack back to the sender and writes its own rwnd into the win field of the TCP header, so that the sender can learn the receiver’s rwnd from this field. When the sender then sends its next TCP segments, it compares its own cwnd with the receiver’s rwnd, takes the smaller of the two, and ensures that the number of TCP segments it sends does not exceed that smaller value.

Regarding the impact of the receiver’s rwnd on the sender’s behavior, we once ran into this case: the business team reported that the server sent packets to the client very slowly, yet the server itself was not busy and the network seemed fine, so it was unclear what was causing it. We captured packets on the server with tcpdump and found that the acks returned by the client often carried win 0, meaning the client’s receive window was 0. We then investigated on the client side and finally found a bug in the client code that prevented the received data from being read in time.
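As a quick, hedged way to spot such acks on the wire (IPv4 traffic assumed; eth0 is a placeholder interface): bytes 14 and 15 of the TCP header hold the window field, so the following filter matches segments that advertise a window of 0:

$ tcpdump -nn -i eth0 'tcp[14:2] = 0'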

To monitor this kind of behavior, a patch was also written for the Linux kernel: tcp: add SNMP counter for zero-window drops. It adds a new SNMP counter, TCPZeroWindowDrop: whenever a packet has to be dropped because the receive window is too small to accept it, this event is counted, and it can then be viewed through the TCPZeroWindowDrop field in /proc/net/netstat.
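One way to read this counter, assuming your kernel carries the patch and iproute2’s nstat is installed:

$ nstat -az TcpExtTCPZeroWindowDrop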

Because the TCP header size is limited and the win field is only 16 bits, the maximum value win can represent is 65535 (64K). So if we want to support a larger receive window to meet the needs of high-performance networking, the following configuration item needs to be enabled; it is turned on by default in the system:

net.ipv4.tcp_window_scaling = 1

If you want to know more about the detailed design of this option, you can refer to RFC1323.
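To confirm that window scaling is enabled and actually negotiated on a connection, you can check the sysctl and look at the wscale field in the ss output (the sample output earlier showed wscale:5,7, i.e. the send and receive scale factors negotiated for that connection):

$ sysctl net.ipv4.tcp_window_scaling
$ ss -nti | grep -o 'wscale:[0-9]*,[0-9]*'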

Okay, that’s all for now on how TCP congestion control affects business network performance.

Summary

TCP congestion control is very complex behavior, and this lesson covers only some of the basics. I hope this foundational knowledge gives you a general understanding of TCP congestion control. To recap the key points of this lesson:

Network congestion is reflected in the TCP connection’s congestion window (cwnd), which affects the sender’s packet-sending behavior;

The receiver’s processing capability is also fed back to the sender; it is expressed by rwnd. rwnd and cwnd act on the sender together to determine the maximum number of TCP packets the sender can send;

The dynamic changes of TCP congestion control can be observed in real time through the tcp_probe tracepoint (kernel 4.16+) or the tcp_probe kernel module (kernels before 4.16); through tcp_probe, you can get a good view of a TCP connection’s data transfer status.