Kubernetes health checks: livenessProbe, readinessProbe, and startupProbe

Our legacy project runs on Kubernetes, and we ran into the following problems.

During a release, the old pod is stopped before the new pod has finished starting, so some requests are routed to the new pod. Because the application in the new pod is not up yet, all of these requests fail. The new pod may also fail to start altogether, in which case it keeps restarting while the service stays unavailable.

A running pod can become temporarily unavailable because of a network problem or some other cause. Kubernetes still reports the pod's status as normal, so business traffic may still be routed to that pod, and those requests fail as well.

How can Kubernetes determine whether a pod is healthy and has started successfully?

Health checks

Kubernetes v1.19, the current version, provides three health checks.

Liveness probe (livenessProbe): decides whether to restart the container by checking whether it responds normally.

If this check fails, the container is restarted.

Readiness probe (readinessProbe): determines whether the container is ready to accept requests.

If a readiness probe is configured, the pod is considered reachable only after the check passes, and Kubernetes then adds its IP and port to the Service's endpoints. If the check fails, the pod is removed from the endpoints, so traffic is never routed to a container that is not ready.

Startup probe (startupProbe): detects whether the application in the container has started. Supported only in v1.18+.

livenessProbe and readinessProbe are disabled until the startup probe succeeds. This prevents the restart loop in which livenessProbe keeps failing on a slow-starting container and restarts it before it can finish starting.

Each of the three probes supports three detection methods:

exec: runs a command inside the container. If the exit code is 0, the container is considered healthy.

httpGet: sends an HTTP GET request. If the status code is at least 200 and less than 400, the container is considered healthy.

tcpSocket: attempts a TCP connection to a port. If the port is open, the container is considered healthy.
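
As a rough illustration of the three detection methods, the fragment below uses one method per probe. It is only a sketch: the command, file path, and port are placeholder assumptions, not values from this project, and any probe can use any of the three methods.

```yaml
# Illustrative fragment of a container spec; all values are placeholders.
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]   # healthy if the command exits with 0
readinessProbe:
  httpGet:
    path: /actuator/health             # healthy if the status code is 200-399
    port: 8080
startupProbe:
  tcpSocket:
    port: 8080                         # healthy if the port accepts a connection
```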

The project uses Spring Boot 2.2.6, so the spring-boot-starter-actuator package is the natural choice. Note that Boot versions differ here: 1.x, 2.2.x, and 2.3 each configure and expose the endpoints somewhat differently.

Spring Boot 2.3+ also provides a dedicated readiness endpoint, /actuator/health/readiness, and a liveness endpoint, /actuator/health/liveness, which can be used directly for the Kubernetes readiness and liveness probes respectively. Because this project is on 2.2.6, it uses the /actuator/health endpoint.
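
For comparison, on Spring Boot 2.3+ the two dedicated endpoints could be wired up roughly as follows. This is a sketch for that newer version only and does not apply to the 2.2.6 project described here; port 8080 is assumed to match the rest of this article.

```yaml
# Sketch for Spring Boot 2.3+ only (not used by this 2.2.6 project).
livenessProbe:
  httpGet:
    path: /actuator/health/liveness    # liveness state only
    port: 8080
readinessProbe:
  httpGet:
    path: /actuator/health/readiness   # readiness state only
    port: 8080
```

Spring Boot enables these two health groups automatically when it detects that it is running in Kubernetes; elsewhere they can be switched on with management.endpoint.health.probes.enabled=true.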

Add the Actuator dependency:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

Configure and expose the health endpoint:

management.endpoints.web.base-path=/actuator
management.endpoints.web.exposure.include=health

Configure the startup probe. The idea is that the application can serve external traffic only after it has started successfully. The httpGet probe is therefore configured to request /actuator/health on the container's ip:port; if it returns 200, startup is considered successful.

startupProbe: # startup probe
  httpGet:
    path: /actuator/health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 30

Several additional parameters are configured here. All three probes (startupProbe, livenessProbe, and readinessProbe) accept the same parameters, which is a nice aspect of how Kubernetes handles health check configuration.

periodSeconds: check period in seconds, that is, how often the check runs.

initialDelaySeconds: delay in seconds after the container starts before the first check.

timeoutSeconds: timeout in seconds for a single check. In the example above, if the request to ip:port/actuator/health does not return within 5 seconds, the check counts as a timed-out failure.

successThreshold: success threshold (number of checks). With the default of 1, a single successful check is enough to count as success. It must be 1 for liveness and startup probes.

failureThreshold: maximum number of consecutive failures. With the value of 30 configured above, the pod is considered to have failed to start if all 30 attempts fail. Combine it with periodSeconds to accommodate applications that take a long time to start: with the configuration above, startup is declared failed after 30 * 10 = 300 s of failed checks.
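
To make the failureThreshold * periodSeconds arithmetic concrete, here is a sketch of a variation that keeps the same 300 s startup budget but checks more often. The values 5 and 60 are illustrative assumptions, not this project's settings.

```yaml
startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5        # check every 5s instead of every 10s
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 60    # 60 * 5 = 300s, the same budget as 30 * 10
```

A shorter period does not change how long a slow application is allowed to take, but a successful start is noticed sooner, because the next check is at most 5 seconds away rather than 10.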

The liveness probe is configured to detect a hung application, one whose process is still running but no longer responds.

livenessProbe: # liveness probe
  httpGet:
    path: /actuator/health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 10

The readiness probe is configured to verify that the application is able to serve requests at any given time. If the readiness probe fails, the pod's IP and port are removed from the Service's endpoints, so traffic is routed only to available containers.

readinessProbe: # readiness probe
  httpGet:
    path: /actuator/health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 10

You do not need to configure all three probes; configure them according to your situation. If a probe is not configured, Kubernetes falls back to the pod status: as long as the container is running, it is considered alive and ready.

Problems encountered

At first, only the liveness and readiness probes were configured. As a result, the liveness probe kept failing during startup until it reached the restart threshold, so the pod was restarted over and over. A startup probe was added later so that the liveness probe is not triggered during the startup period. Set failureThreshold * periodSeconds high enough to cover the worst-case startup time.
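
Putting the pieces together, the setup after adding the startup probe might look roughly like the manifest below. It is a sketch, not the project's actual manifest: the name demo-app, the image, and the replica count are hypothetical placeholders, while the probe values are the ones used in this article.

```yaml
# Sketch only: "demo-app", the image, and replicas are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: example/demo-app:1.0.0
          ports:
            - containerPort: 8080
          startupProbe:              # the other two probes wait until this succeeds
            httpGet:
              path: /actuator/health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 30     # up to 30 * 10 = 300s to start
          livenessProbe:             # restarts the container if it hangs
            httpGet:
              path: /actuator/health
              port: 8080
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 10
          readinessProbe:            # removes the pod from the Service while failing
            httpGet:
              path: /actuator/health
              port: 8080
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 10
```

Because the startup probe already covers the slow start, the liveness and readiness probes in this sketch omit initialDelaySeconds; keeping the original value of 10 would also work.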

Some services' probes timed out:

context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Back-off restarting failed container

In that case, increase timeoutSeconds by a few seconds.