Analysis of a large number of TCP connections stuck in TIME_WAIT, SYN_SENT, and CLOSE_WAIT states

Contents

  • 1. Count the number of tcp connections in various states
  • 2. TIME_WAIT
    • On the application server: connections from the reverse proxy
    • On the reverse proxy: connections to the application service
    • On the reverse proxy: connections from users
  • 3. SYN_SENT
    • On the reverse proxy: connections to targets behind a firewall
    • On the reverse proxy: connections to targets not blocked by a firewall
  • 4. CLOSE_WAIT
    • On the application server: connections from the reverse proxy
    • On the application server: connections to external services

This article records some solutions for handling abnormal TCP connections on nginx and tomcat servers.

1. Count the number of tcp connections in various states

Both the ss and netstat tools can produce per-state counts:

ss -ant | awk '{print $1}' | sort | uniq -c

netstat -ant | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'

2. TIME_WAIT

TIME_WAIT is a normal state at the end of a connection. Even so, try to leave it on the side that initiated the request (the client), and use long-lived (keep-alive) connections to reduce how many connections are opened and closed in the first place.
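A quick way to see which peer and port the TIME_WAIT sockets point to (and therefore which hop should be tuned) is to group them by peer address. A minimal sketch, assuming the default ss output format where the peer address is the fourth column when a state filter is used:

# Group TIME_WAIT sockets by peer address:port, most frequent first (the first line of ss output is a header)
ss -ant state time-wait | awk 'NR>1 {print $4}' | sort | uniq -c | sort -rn | head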

On the application server: connections from the reverse proxy

Cause: the request initiated by nginx declares HTTP/1.0 (or carries Connection: Close in the request header), so tomcat actively closes the TCP connection after responding to the request.

Solution: the proxy_http_version directive of the nginx proxy module defaults to HTTP/1.0 when talking to upstream instances and needs to be changed to 1.1:

proxy_connect_timeout 3s;
proxy_http_version 1.1;
# Tell the client the connection will be kept alive for 60s; the server actually closes it after 75s of idleness.
# If the second parameter (the idle-timeout hint returned to clients) is not set,
# some clients will not keep idle connections in their HTTP connection pool for long,
# and others will fall back to their own default idle-disconnect time.
keepalive_timeout 75s 60s;
# To keep HTTP/1.1 semantics with the requesting side, the chunked mechanism must not be turned off;
# otherwise nginx actively closes the TCP connection after finishing the response, effectively degrading to HTTP/1.0.
chunked_transfer_encoding on;

upstream myapp {
    # Maximum number of idle keepalive connections to the upstream servers kept per worker process (idle upstream connections are closed after 60s by default)
    keepalive 20;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

server {
    listen 80 default;
 
    location / {
        proxy_pass http://myapp;
        # When the request declares HTTP/1.1 and the Connection header is not Close,
        # the upstream will not actively close the TCP connection after finishing the response.
        proxy_http_version 1.1;
        proxy_set_header Connection '';
        proxy_set_header Cookie $http_cookie;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

On the reverse proxy: connections to the application service

Cause: even after nginx talks to the upstream instances over HTTP/1.1, if idle connection reuse is not enabled, nginx still actively closes the TCP connection after each request.

Solution: the keepalive directive of the nginx upstream module is not enabled by default; a value must be set explicitly (see the keepalive 20; line in the upstream block above).

On the reverse proxy: connections from users

Cause 1: the requesting side (a browser or an HTTP client framework) uses a connection pool and HTTP/1.1 by default, but nginx has turned off the chunked_transfer_encoding mechanism of HTTP/1.1; nginx then actively disconnects from the requester after completing each response.

Solution: do not turn off chunked_transfer_encoding.

Cause 2: the recommended keep-alive time is not returned to the client (Keep-Alive: timeout=<seconds> in the response header), so the user's client never closes the idle connection on its own; eventually nginx closes it first and TIME_WAIT piles up on the nginx server.

Solution: use keepalive_timeout to configure both the server-side maximum idle time and the keep-alive time suggested to the client, so the client knows to close the idle connection first.
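A quick way to verify that nginx is actually advertising the keep-alive hint to clients, assuming the proxy listens locally on port 80 (host and port are placeholders):

# Expect "Connection: keep-alive" and, with the second keepalive_timeout parameter set, "Keep-Alive: timeout=60"
curl -sI http://127.0.0.1/ | grep -i -E '^(connection|keep-alive)'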

3. SYN_SENT

On the reverse proxy: connections to targets behind a firewall

Cause: when telnetting to the target port, the command hangs (there is no immediate response indicating the target port is simply closed), which shows that the SYN packet is being dropped by the firewall.
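A minimal way to tell "port closed" apart from "SYN dropped by a firewall" (the IP and port below are placeholders):

# A closed port answers immediately with "Connection refused" (RST);
# a dropped SYN just hangs until the timeout expires.
timeout 5 telnet 10.0.0.3 8080
# nc gives the same signal with a built-in timeout
nc -vz -w 5 10.0.0.3 8080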

Solution: apply for a firewall policy that allows the traffic.

On the reverse proxy: connections to targets not blocked by a firewall

Cause 1: the target tomcat server has already reached its maximum number of HTTP connections (the server.tomcat.max-connections setting of a Spring Boot application, default 10000), and the connections queued at the service port waiting to be accepted have reached the operating system's net.core.somaxconn limit (default 128 or 1024). Once that backlog is full, subsequent SYN packets arriving at the service port are discarded and the requesting side stays in SYN_SENT.

Solution: when using reactive I/O frameworks such as WebFlux or WebSocket, the server.tomcat.max-connections setting can be increased.
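A few commands that help confirm this situation on the target server (the listening port is a placeholder):

# Kernel limit on the accept backlog
sysctl net.core.somaxconn
# For a listening socket, Send-Q is the configured backlog and Recv-Q is the number of connections currently waiting to be accepted
ss -lnt '( sport = :8080 )'
# Counters that grow when the accept queue overflows and incoming SYNs are dropped
netstat -s | grep -i -E 'listen|overflow'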

Cause 2: when telnetting to the target port, the command hangs (again, no immediate response indicating the port is simply closed), which shows that iptables on the target server is DROPping requests to the service port.

Solution: use an iptables rule to add the requester's IP to the allow list.
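A minimal sketch of such a rule on the target server (the source IP and port are placeholders, and the ACCEPT rule must sit before the DROP rule):

# Allow the reverse proxy, e.g. 10.0.0.100, to reach the service port
iptables -I INPUT -s 10.0.0.100 -p tcp --dport 8080 -j ACCEPT
# Check the resulting rule order
iptables -L INPUT -n --line-numbers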

Cause 3: the number of file handles (including TCP sockets) held by the application service process has exceeded its limit.

# View the per-process open file limit for the current user
ulimit -n

Solution: raise the limit in /etc/security/limits.conf and restart the application service process.
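A minimal sketch of the change, assuming the application runs under an account named appuser (the user name and values are illustrative):

# /etc/security/limits.conf
appuser  soft  nofile  65535
appuser  hard  nofile  65535

After restarting, the effective limit and the handles actually held by the process can be compared (the PID is a placeholder):

cat /proc/<pid>/limits | grep 'open files'
ls /proc/<pid>/fd | wc -l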

4. CLOSE_WAIT

On the application server: connections from the reverse proxy

Cause: the application opened its port, but later initialization failed (for example, the configuration center, service registry, or database could not be reached), so the logic that accepts sockets never runs.
Established connections sit in the service port's accept backlog (governed by the operating system's net.core.somaxconn), and the request data already received sits in the operating system's TCP buffer.
After the application fails to process or respond for a long time, the client sends a FIN; the server replies with an ACK and the server-side connection enters CLOSE_WAIT. Because the data in the TCP buffer is never consumed and the application never closes the socket, no FIN is sent back, and the connection stays in CLOSE_WAIT in the accept backlog.
Until the backlog fills up, the application service port is effectively in a zombie state: it can still be connected to, but it never responds.

Solution: perform regular readiness checks on deployed applications so that instances that never finished initializing are discovered promptly.
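A minimal sketch of such a check, assuming the application exposes a Spring Boot Actuator health endpoint (the URL is an assumption; any application-level health URL works the same way):

# The check must go through the HTTP layer; merely opening a TCP connection would still succeed against the half-dead instance described above
curl -sf -m 3 http://10.0.0.1:8080/actuator/health | grep -q '"status":"UP"' || echo "instance not ready"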

On the application server: connections to external services

Cause: when a connection pool is used to manage long-lived HTTP connections and the remote server disconnects first, the connection on the application server enters the CLOSE_WAIT state; if the pool is not configured properly, HttpClient may pick this dead connection to send the next request.

Solution:
The execution flow of Apache HttpClient is roughly as follows:

HttpClient.doExecute If no response is received because of a connection error or response timeout, the request is retried automatically by default (unless retries are disabled).
    AbstractConnPool.lease Keeps obtaining connections unless the maximum pool size has been reached.
        getPoolEntryBlocking
            entry.isExpired Prefers an existing connection; the expiry check is only performed if the last response received on that connection carried Keep-Alive: timeout=?.
            connFactory.create Establishes a new TCP connection if no existing connection is available.
        validate(leasedEntry) If the connection has been unused for longer than validateAfterInactivity, check whether it is still healthy.
    HttpRequestExecutor.execute

A CLOSE_WAIT connection is reused only when both the isExpired and the validate checks are skipped.
To keep the isExpired check from being skipped, the server's response must contain Keep-Alive: timeout=?, or a custom ConnectionKeepAliveStrategy must be configured on HttpClient to provide a default timeout.
To keep the validate check from being skipped, do not set the connection pool's validateAfterInactivity value too large.
Also use IdleConnectionEvictor to periodically scan for and actively close idle connections that have exceeded their timeout (HttpClientConnectionManager does not start a background thread to close idle connections on its own):

@Bean
@Primary
public HttpClientConnectionManager connectionManager() {
    PoolingHttpClientConnectionManager poolingConnManager = new PoolingHttpClientConnectionManager();
    poolingConnManager.setMaxTotal(400);
    poolingConnManager.setDefaultMaxPerRoute(200);
    // Connections idle for more than 1 second are validated before being handed out again
    poolingConnManager.setValidateAfterInactivity(1000);
    return poolingConnManager;
}

@Bean(initMethod = "start", destroyMethod = "shutdown")
@Primary
public IdleConnectionEvictor idleConnectionEvictor(HttpClientConnectionManager connectionManager) {
    // Every 2 seconds, close expired connections and connections that have been idle for more than 60 seconds
    return new IdleConnectionEvictor(connectionManager, 2, TimeUnit.SECONDS, 60, TimeUnit.SECONDS);
}

@Bean("httpClient")
@Primary
public CloseableHttpClient httpClientRequestConfig(HttpClientConnectionManager connectionManager) {
    // Fail if no pooled connection is obtained within 5 seconds;
    // time out connection establishment after 3 seconds;
    // time out if no response headers arrive within 30 seconds after the request is sent.
    RequestConfig defaultRequestConfig = RequestConfig.custom().setConnectionRequestTimeout(5000)
            .setConnectTimeout(3000).setSocketTimeout(30000).build();
    HttpClientBuilder builder = HttpClientBuilder.create().disableAutomaticRetries()
            .setConnectionManager(connectionManager)
            .setDefaultRequestConfig(defaultRequestConfig)
            .setKeepAliveStrategy(new ConnectionKeepAliveStrategy() {
                @Override
                public long getKeepAliveDuration(final HttpResponse response, final HttpContext context) {
                    long timeoutMillis = DefaultConnectionKeepAliveStrategy.INSTANCE.getKeepAliveDuration(response, context);
                    // When the Keep-Alive response header carries no timeout value, close idle connections after 20 seconds
                    if (timeoutMillis < 0) {
                        return 20 * 1000;
                    }
                    return timeoutMillis;
                }
            });
    return builder.build();
}

When using HttpClient, close the CloseableHttpResponse promptly so that the connection is returned to the connection pool (the HttpGet below is a placeholder request):

@Autowired
@Qualifier("httpClient")
private CloseableHttpClient httpClient;

public String test() throws ClientProtocolException, IOException {
    // The target URL is a placeholder for whatever service is being called
    HttpGet httpGet = new HttpGet("http://10.0.0.1:8080/test");
    // try-with-resources closes the response, which returns the connection to the pool
    try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
        return EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
    }
}