12 | Dead connections: detect them with Keep-Alive or an application heartbeat?

In the previous article, we talked about how to use close and shutdown to close a connection. In most cases, we prefer shutdown, which closes the connection in one direction; after the other end has finished processing, it closes the other direction.

In many cases, one end of the connection always needs to know the state of the connection. If the connection is dead, the application may need to report an error or reinitiate the connection.

In this article, I will walk you through how to detect connection state and share best practices for doing so.

Start with an example

Let’s start today’s topic with an example.

I have previously worked on a project based on the NATS messaging system. Multiple message providers (pub) and subscribers (sub) are connected to the NATS messaging system, and message delivery and subscription processing are completed through this system.

One day, a fault was reported in production: a business workflow could not complete. After investigation, we found that the message had been delivered correctly to the NATS server, but the message subscriber never received it and so never processed it, which caused the workflow to fail.

After observing the message subscriber, we found that although its connection to the NATS server appeared to be “normal”, the connection was in fact dead. Why? The NATS server had crashed, which interrupted the connection between the NATS server and the message subscriber, and the FIN packet failed to reach the subscriber because of the abnormal conditions. As a result, the message subscriber kept holding a “stale” connection on which it would never receive messages from the NATS server.

The root cause of this failure is that the message subscriber, as a client of the NATS server, did not detect the validity of the connection in time.

Continuously checking that a connection is still valid is something we must pay attention to in practice.

TCP Keep-Alive option

Many people who are new to TCP programming will be surprised to find that on a “silent” connection with no data reading or writing, there is no way to find out whether the TCP connection is valid or invalid. For example, if the client suddenly crashes, the server may maintain a useless TCP connection for several days. The example mentioned earlier is one such scenario.

So is there a way to enable a similar “polling” mechanism and let TCP tell us whether the connection is “alive”?

This is exactly the problem that TCP’s keep-alive mechanism, called Keep-Alive, is meant to solve.

The principle of this mechanism is as follows:

Define a time period. If there is no connection-related activity during this period, the TCP keep-alive mechanism kicks in: at every time interval, it sends a probe message that carries very little data. If several consecutive probes go unanswered, the current TCP connection is considered dead, and the kernel reports the error to the upper-layer application.

The three tunable values above are called the keep-alive time, the keep-alive interval and the number of keep-alive probes. On Linux, they correspond to the sysctl variables net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl and net.ipv4.tcp_keepalive_probes. The default settings are 7200 seconds (2 hours), 75 seconds, and 9 probes.

If TCP keep-alive is enabled, you need to consider the following situations:

First, the peer is working normally. When the TCP keep-alive probe reaches the peer, the peer responds normally, so the keep-alive timer is reset and TCP waits for the next keep-alive time to elapse.

Second, the peer has crashed and restarted (for example, the host rebooted). When the TCP keep-alive probe reaches the peer, the peer can respond, but because it no longer has any state for this connection, it answers with an RST packet, and the local end quickly discovers that the TCP connection has been reset.

Third, the peer has crashed and stays down, or its packets are unreachable for some other reason. When the TCP keep-alive probe is sent, there is no response. After this happens several times in a row and the number of keep-alive probes is reached, TCP reports that the connection is dead.

The TCP keep-alive mechanism is turned off by default. When we choose to turn it on, we can enable it in both directions of the connection, or in one direction only. If server-to-client detection is turned on, the “dirty data” retained on the server can be cleared when the client disconnects abnormally; if client-to-server detection is turned on, the client can reinitiate the connection when the server is unresponsive.
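As a rough illustration of turning keep-alive on for a single socket, the sketch below uses setsockopt with SO_KEEPALIVE and then overrides the three values for that socket only. TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux option names, and enable_keepalive is a hypothetical helper introduced here for illustration, not something from the article's code.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable TCP keep-alive on one socket and override
 * the system-wide defaults for that socket only (Linux option names).
 * Returns 0 on success, -1 on failure (errno is set by setsockopt). */
int enable_keepalive(int fd, int idle_secs, int interval_secs, int probes) {
    int on = 1;

    assert(fd >= 0);
    /* Keep-alive is off by default, so switch it on first. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    /* Keep-alive time: idle seconds before the first probe. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof(idle_secs)) < 0)
        return -1;
    /* Keep-alive interval: seconds between unanswered probes. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_secs, sizeof(interval_secs)) < 0)
        return -1;
    /* Number of unanswered probes before the connection is declared dead. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0)
        return -1;
    return 0;
}
```

Calling enable_keepalive(socket_fd, 60, 10, 3) right after connect or accept would, under these assumptions, enable detection in that direction only; the peer has to make the same call on its own socket to probe in the other direction.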

Why doesn’t TCP probe at a higher frequency by default? My understanding is that early network bandwidth was very limited, and a high-frequency keep-alive mechanism would have been a serious waste of it.

Application layer exploration

If we rely on TCP’s own Keep-Alive mechanism with the Linux defaults, it takes at least 2 hours, 11 minutes and 15 seconds before a “dead” connection is discovered. How is this time calculated? It is 2 hours plus 75 seconds times 9. For many latency-sensitive systems, this is unacceptable.
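The arithmetic can be written down as a small function. This is only a sketch; keepalive_detection_secs is a hypothetical name chosen for illustration.

```c
#include <assert.h>

/* Hypothetical helper: seconds before keep-alive declares a silent
 * connection dead, i.e. the idle time plus one interval for every
 * unanswered probe. */
long keepalive_detection_secs(long idle_secs, long interval_secs, long probes) {
    assert(idle_secs >= 0 && interval_secs >= 0 && probes >= 0);
    return idle_secs + interval_secs * probes;
}
```

With the Linux defaults, keepalive_detection_secs(7200, 75, 9) gives 7200 + 675 = 7875 seconds, which is 2 hours, 11 minutes and 15 seconds.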

Therefore, better solutions must be found at the application layer.

We can complete connection detection at the application layer by simulating the TCP Keep-Alive mechanism in the application.

We can design a PING-PONG mechanism. The party that needs to detect liveness, such as the client, sends a PING on the connection once the keep-alive time has elapsed. If the server answers the PING, the keep-alive time is reset. Otherwise, the unanswered probe is counted, and if the count reaches the preset number of keep-alive probes, the connection is considered dead.

There are two key points here:

The first is the need to use a timer, which can be achieved by using the mechanism of I/O multiplexing itself; the second is the need to design a PING-PONG protocol.
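Before touching sockets, the counting rules of this PING-PONG design can be sketched as a tiny state machine. All names here (heartbeat_state, heartbeat_on_timeout and so on) are hypothetical and exist only for illustration; the real program later in this article keeps the same state in plain local variables.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for an application-level heartbeat. */
typedef struct {
    int idle_secs;      /* quiet time before the first PING        */
    int interval_secs;  /* wait between unanswered PINGs           */
    int max_probes;     /* unanswered PINGs tolerated              */
    int probes_sent;    /* PINGs sent since the last reply         */
} heartbeat_state;

typedef enum { HB_SEND_PING, HB_DEAD } heartbeat_action;

void heartbeat_init(heartbeat_state *hb, int idle, int interval, int max_probes) {
    assert(hb != NULL && max_probes > 0);
    hb->idle_secs = idle;
    hb->interval_secs = interval;
    hb->max_probes = max_probes;
    hb->probes_sent = 0;
}

/* The timer fired with no traffic: send another PING or give up. */
heartbeat_action heartbeat_on_timeout(heartbeat_state *hb) {
    if (hb->probes_sent >= hb->max_probes)
        return HB_DEAD;
    hb->probes_sent++;
    return HB_SEND_PING;
}

/* A reply arrived, so the connection is alive: start over. */
void heartbeat_on_reply(heartbeat_state *hb) {
    hb->probes_sent = 0;
}

/* Timeout to arm next: the long idle time while the connection looks
 * healthy, the short interval while probes are outstanding. */
int heartbeat_next_timeout(const heartbeat_state *hb) {
    return hb->probes_sent == 0 ? hb->idle_secs : hb->interval_secs;
}
```

With an idle time of 10, an interval of 3 and 3 probes, the timer fires four times without a reply (after 10, 3, 3 and 3 seconds) before heartbeat_on_timeout returns HB_DEAD.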

Below we try to complete such a design.

Message format design

Our program is the client that initiates the keep-alive, and it defines a message object for this purpose. The message object is a structure whose first 4 bytes identify the message type. For simplicity, four message types are defined here: MSG_PING, MSG_PONG, MSG_TYPE1 and MSG_TYPE2.

typedef struct {
    u_int32_t type;
    char data[1024];
} messageObject;

#define MSG_PING 1
#define MSG_PONG 2
#define MSG_TYPE1 11
#define MSG_TYPE2 21
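Because the type field travels over the network, it is usually stored in network byte order. The sketch below shows one way to fill in and read back that field. wire_message mirrors the layout of messageObject but is a hypothetical stand-in (it uses uint32_t from stdint.h), and the two helper functions are likewise assumptions made for illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MSG_PING 1
#define MSG_PONG 2

/* Hypothetical stand-in with the same layout as messageObject. */
typedef struct {
    uint32_t type;      /* kept in network byte order on the wire */
    char data[1024];
} wire_message;

/* Zero the whole message, then store the type in network byte order. */
void wire_message_init(wire_message *msg, uint32_t type) {
    assert(msg != NULL);
    memset(msg, 0, sizeof(*msg));
    msg->type = htonl(type);
}

/* Convert the stored type back to host byte order. */
uint32_t wire_message_type(const wire_message *msg) {
    assert(msg != NULL);
    return ntohl(msg->type);
}
```

Zeroing the structure first also avoids sending uninitialized bytes from the data field. The structure is 4 + 1024 = 1028 bytes, which matches the message size the server reports in the experiments.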

Client programming

The client fully simulates the TCP Keep-Alive mechanism. Once the keep-alive time elapses with no traffic, it increments the probe counter and sends a PING message to the server. It then keeps sending PING messages at the preset keep-alive interval. If a response from the server arrives, probing ends and the probe counter is reset to 0.

Here we use the timer that comes with the select I/O multiplexing function. The select function will be introduced in detail later.
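As a minimal sketch of using select purely as a timer, the helper below waits for a descriptor to become readable and reports a timeout separately from a read result. read_with_timeout and READ_TIMED_OUT are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

#define READ_TIMED_OUT (-2)   /* hypothetical sentinel for "timer fired" */

/* Wait up to timeout_ms for fd to become readable, then read from it.
 * Returns READ_TIMED_OUT if the timer fired first, -1 on error,
 * otherwise whatever read() returns (0 means the peer closed). */
ssize_t read_with_timeout(int fd, void *buf, size_t len, long timeout_ms) {
    fd_set readmask;
    struct timeval tv;

    assert(fd >= 0 && fd < FD_SETSIZE && timeout_ms >= 0);
    FD_ZERO(&readmask);
    FD_SET(fd, &readmask);

    /* Fill in the timeout on every call: on Linux, select() may
     * overwrite it with the time that was left. */
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int rc = select(fd + 1, &readmask, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    if (rc == 0)
        return READ_TIMED_OUT;   /* nothing arrived in time */
    return read(fd, buf, len);
}
```

A return value of READ_TIMED_OUT plays the same role as select returning 0 in the program below: it is the signal to send the next PING.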

#include "lib/common.h"
#include "message_objecte.h"

#define MAXLINE 4096
#define KEEP_ALIVE_TIME 10
#define KEEP_ALIVE_INTERVAL 3
#define KEEP_ALIVE_PROBETIMES 3


int main(int argc, char **argv) {
    if (argc != 2) {
        error(1, 0, "usage: tcpclient <IPaddress>");
    }

    int socket_fd;
    socket_fd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in server_addr;
    bzero(&server_addr, sizeof(server_addr));
    server_addr.sin_family = AF_INET;
    server_addr.sin_port = htons(SERV_PORT);
    inet_pton(AF_INET, argv[1], &server_addr.sin_addr);

    socklen_t server_len = sizeof(server_addr);
    int connect_rt = connect(socket_fd, (struct sockaddr *) &server_addr, server_len);
    if (connect_rt < 0) {
        error(1, errno, "connect failed ");
    }

    char recv_line[MAXLINE + 1];
    int n;

    fd_set readmask;
    fd_set allreads;

    struct timeval tv;
    int heartbeats = 0;  /* PINGs sent since the last reply */

    tv.tv_sec = KEEP_ALIVE_TIME;
    tv.tv_usec = 0;

    messageObject messageObject;

    FD_ZERO(&allreads);
    FD_SET(socket_fd, &allreads);
    for (;;) {
        readmask = allreads;  /* select modifies the set, so copy it each time */
        int rc = select(socket_fd + 1, &readmask, NULL, NULL, &tv);
        if (rc < 0) {
            error(1, errno, "select failed");
        }
        if (rc == 0) {  /* timeout: nothing arrived from the server */
            if (++heartbeats > KEEP_ALIVE_PROBETIMES) {
                error(1, 0, "connection dead\n");
            }
            printf("sending heartbeat #%d\n", heartbeats);
            messageObject.type = htonl(MSG_PING);
            rc = send(socket_fd, (char *) &messageObject, sizeof(messageObject), 0);
            if (rc < 0) {
                error(1, errno, "send failure");
            }
            tv.tv_sec = KEEP_ALIVE_INTERVAL;  /* probe again after the shorter interval */
            continue;
        }
        if (FD_ISSET(socket_fd, &readmask)) {
            n = read(socket_fd, recv_line, MAXLINE);
            if (n < 0) {
                error(1, errno, "read error");
            } else if (n == 0) {
                error(1, 0, "server terminated \n");
            }
            printf("received heartbeat, make heartbeats to 0 \n");
            heartbeats = 0;
            tv.tv_sec = KEEP_ALIVE_TIME;  /* back to the long keep-alive time */
        }
    }
}

This program is mainly divided into three parts:

The first part creates the socket and establishes the connection:

Lines 15-16 create a TCP socket;

Lines 18-22 build the IPv4 target address, which is the server address. Note that the command-line argument is used as the server address;

Lines 24-28 initiate a connection to the server.

The second part prepares the select timer:

Lines 39-40 set the timeout to KEEP_ALIVE_TIME, which corresponds to the keep-alive time;

Lines 44-45 initialize the descriptor set passed to select.

The most important part is the third part, which handles the heartbeat messages:

Line 48 calls the select function to wait for I/O events. Besides read events on the socket, these include the timeout set in lines 39-40. When KEEP_ALIVE_TIME elapses, select returns 0 and execution enters lines 53-63;

In lines 53-63, the client has received nothing on the connection during KEEP_ALIVE_TIME, so it sends a PING message to ask the server: “Hey, are you still alive?” The PING is a message object whose type is MSG_PING; we will see shortly how the server-side program responds to it;

Lines 65-74 handle data received from the server. For simplicity, the message is not parsed here. In real code the packet must be parsed, and only a PONG response should count as the result of a PING probe. Here, any message from the server is taken as proof that the connection is healthy, so the probe counter is reset to zero and the timeout is set back to the keep-alive time to wait for the next probe cycle.
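For the parsing step that the client skips, a sketch might look like the following. It reads a complete message (TCP is a byte stream, so a single read may return only part of one) and accepts only a PONG type. read_full and is_pong are hypothetical helpers; the only assumption taken from the article is that the type occupies the first 4 bytes in network byte order.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define MSG_PONG 2

/* Read exactly len bytes from fd, looping over short reads.
 * Returns len on success, 0 if the peer closed before len bytes
 * arrived, -1 on error. */
ssize_t read_full(int fd, void *buf, size_t len) {
    size_t done = 0;

    assert(fd >= 0 && buf != NULL);
    while (done < len) {
        ssize_t n = read(fd, (char *) buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0)
            return 0;            /* peer closed the connection */
        done += (size_t) n;
    }
    return (ssize_t) done;
}

/* Nonzero only if the 4-byte type at the start of the message is MSG_PONG. */
int is_pong(const void *message) {
    uint32_t type;

    assert(message != NULL);
    memcpy(&type, message, sizeof(type));  /* avoids unaligned access */
    return ntohl(type) == MSG_PONG;
}
```

In the client, this would mean calling read_full with sizeof(messageObject) and resetting the probe counter only when is_pong returns nonzero. Note that the peer must send the type with htonl for this check to succeed.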

Server-side programming

The server-side program accepts one parameter, a sleep time in seconds. Setting it to a large value simulates a connection that does not respond. The server processes each message it receives from the client. If it sees a PING message, it sleeps for that period and then replies with a PONG message, telling the client: “Well, I am still alive.” Of course, if the sleep time is long, the client cannot learn quickly whether the server is alive. This is only our way of simulating an unresponsive connection; in a real situation the cause would be a system crash or a network failure.

#include "lib/common.h"
#include "message_objecte.h"

static int count;

int main(int argc, char **argv) {
    if (argc != 2) {
        error(1, 0, "usage: tcpserver <sleepingtime>");
    }

    int sleepingTime = atoi(argv[1]);

    int listenfd;
    listenfd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in server_addr;
    bzero(&server_addr, sizeof(server_addr));
    server_addr.sin_family = AF_INET;
    server_addr.sin_addr.s_addr = htonl(INADDR_ANY);
    server_addr.sin_port = htons(SERV_PORT);

    int rt1 = bind(listenfd, (struct sockaddr *) &server_addr, sizeof(server_addr));
    if (rt1 < 0) {
        error(1, errno, "bind failed ");
    }

    int rt2 = listen(listenfd, LISTENQ);
    if (rt2 < 0) {
        error(1, errno, "listen failed ");
    }

    int connfd;
    struct sockaddr_in client_addr;
    socklen_t client_len = sizeof(client_addr);

    if ((connfd = accept(listenfd, (struct sockaddr *) &client_addr, &client_len)) < 0) {
        error(1, errno, "accept failed ");
    }

    messageObject message;
    count = 0;

    for (;;) {
        int n = read(connfd, (char *) &message, sizeof(messageObject));
        if (n < 0) {
            error(1, errno, "error read");
        } else if (n == 0) {
            error(1, 0, "client closed \n");
        }

        printf("received %d bytes\n", n);
        count++;

        switch (ntohl(message.type)) {
            case MSG_TYPE1:
                printf("process MSG_TYPE1 \n");
                break;

            case MSG_TYPE2:
                printf("process MSG_TYPE2 \n");
                break;

            case MSG_PING: {
                messageObject pong_message;
                pong_message.type = htonl(MSG_PONG);  /* network byte order, like the client's PING */
                sleep(sleepingTime);  /* simulate a slow or unresponsive server */
                ssize_t rc = send(connfd, (char *) &pong_message, sizeof(pong_message), 0);
                if (rc < 0)
                    error(1, errno, "send failure");
                break;
            }

            default :
                error(1, 0, "unknown message type (%u)\n", ntohl(message.type));
        }

    }

}

The server-side program is mainly divided into two parts.

The first part sets up listening and spans lines 7-38. Lines 13-14 create a local TCP listening socket; lines 16-20 fill in the local port and the wildcard (ANY) address to bind to; lines 27-38 call listen and accept to turn the socket into a passive one and wait for a connection.

The second part runs from line 43 to line 77. It reads data from the established connection, parses the message, and handles it according to the message type.

Lines 55-57 handle MSG_TYPE1 messages;

Lines 59-61 handle MSG_TYPE2 messages;

The focus is on lines 64-72, which handle MSG_PING messages. They use sleep to control how promptly the server responds, and then call send to return a PONG message that tells the client “still alive”;

Line 74 is the error path: because the message type is not recognized, the program exits with an error.

Experiment

Based on the above program design, let us do two different experiments:

In the first experiment, the server-side sleep time is 60 seconds.

We see that after sending three heartbeat PING messages, the client decided that the connection was dead and exited. This happened because no PONG message arrived from the server during that time. A real program may of course need to handle this differently, for example by reinitiating the connection.

$ ./pingclient 127.0.0.1
sending heartbeat #1
sending heartbeat #2
sending heartbeat #3
connection dead

$ ./pingserver 60
received 1028 bytes
received 1028 bytes
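In this experiment the client simply exits once the connection is declared dead. If a program wanted to reinitiate the connection instead, one possible shape is sketched below. connect_with_retry is a hypothetical helper, and the fixed delay between attempts is an assumption made for brevity; a production client would more likely back off and cap the total wait.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: try to connect up to max_attempts times,
 * sleeping delay_secs between failed attempts.
 * Returns a connected socket, or -1 if every attempt failed. */
int connect_with_retry(const char *ip, unsigned short port,
                       int max_attempts, unsigned int delay_secs) {
    struct sockaddr_in addr;

    assert(ip != NULL && max_attempts > 0);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;               /* not a valid IPv4 address */

    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0)
            return fd;           /* connected */
        close(fd);               /* use a fresh socket for the next attempt */
        if (attempt < max_attempts)
            sleep(delay_secs);
    }
    return -1;
}
```

The old descriptor should be closed before calling this, and the heartbeat counter and timeout should be reset once the new connection is up.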

In the second experiment, we set the server-side sleep time to 5 seconds.

We see that since the server responded promptly during the heartbeat detection process this time, the client will always think that the connection is normal.

$ ./pingclient 127.0.0.1
sending heartbeat #1
sending heartbeat #2
received heartbeat, make heartbeats to 0
received heartbeat, make heartbeats to 0
sending heartbeat #1
sending heartbeat #2
received heartbeat, make heartbeats to 0
received heartbeat, make heartbeats to 0

$ ./pingserver 5
received 1028 bytes
received 1028 bytes
received 1028 bytes
received 1028 bytes

Summary

Through today’s article, we can see that although TCP’s built-in keep-alive does not let applications detect a dead connection conveniently and promptly, we can flexibly build such a mechanism in the application. Generally speaking, it relies on system timers and a suitable application-layer message protocol. Heartbeat packets are one such keep-alive mechanism.