[kubernetes] pod life cycle

Article directory

  • 1. Overview
  • 2. pod life cycle
  • 3. pod phase
  • 4. Container status
  • 5. Container restart strategy
  • 6. pod status
    • 6.1 Pod readiness gate
    • 6.2 Pod readiness status
    • 6.3 Pod network readiness
  • 7. Container probes
    • 7.1 Check mechanisms
    • 7.2 Probe results
    • 7.3 Probe types
  • 8. Termination of Pods
    • 8.1 Forced Pod termination
    • 8.2 Pod garbage collection

1. Overview

A Pod follows a predefined life cycle: it starts in the Pending phase, moves to Running if at least one of its primary containers starts successfully, and then ends in either the Succeeded or Failed phase depending on whether any container in the Pod terminated in failure.

While the pod is running, kubelet can restart the container to handle some failure scenarios. Inside a pod, Kubernetes tracks the status of different containers and determines the actions needed to make the pod healthy again.

In the Kubernetes API, a Pod has both a specification and an actual status. The status of a Pod object includes a set of Pod conditions (Conditions). If the application needs it, custom readiness information can also be injected into the condition data.

A Pod is scheduled only once in its life cycle. Once a Pod is scheduled (assigned) to a node, it keeps running on that node until it stops or is terminated.

2. pod life cycle

Like individual application containers, Pods are considered relatively ephemeral entities. A Pod is created, assigned a unique ID (UID), scheduled to a node, and runs on that node until it is terminated or deleted.

If a node dies, the pods scheduled to that node will also be deleted after the given timeout period.

A Pod itself has no self-healing capability. If a Pod is scheduled to a node and that node later fails, the Pod is deleted; likewise, a Pod does not survive eviction due to node resource exhaustion or node maintenance.

Kubernetes uses a higher-level abstraction, called a controller, to manage these relatively disposable Pod instances.

Any given Pod (identified by its UID) is never rescheduled to a different node; instead, it can be replaced by a new, nearly identical Pod. If desired, the new Pod can keep the same name, but its UID will be different.

If something claims to have the same lifetime as a Pod, such as a storage volume, this means the object exists as long as that specific Pod (with that exact UID) exists. If the Pod is deleted for any reason, even if an identical replacement Pod is created, the related object is also deleted and recreated.

  • Pod structure diagram
    A Pod containing multiple containers: a file puller and a web server, both of which use a persistent volume as shared storage between the containers.

3. pod phase

The status field of a pod is a PodStatus object, which contains a phase field.

A Pod's phase (Phase) is a simple, high-level summary of where the Pod is in its life cycle. The phase is not intended to be a comprehensive rollup of container or Pod state, nor a complete state machine.

The number and meaning of Pod phase values are strictly defined. Do not assume that a Pod has any phase value other than those listed below.

The possible values of phase are:

  • Pending (pending): The Pod has been accepted by the Kubernetes system, but one or more of its containers have not yet been created or started. This phase includes the time spent waiting for the Pod to be scheduled and the time spent downloading images over the network.
  • Running (running): The Pod has been bound to a node and all containers in the Pod have been created. At least one container is still running, or is starting or restarting.
  • Succeeded (successful): All containers in the Pod have terminated successfully and will not be restarted.
  • Failed (failed): All containers in the Pod have terminated, and at least one container terminated in failure. That is, the container either exited with a non-zero status or was terminated by the system.
  • Unknown (unknown): The state of the Pod could not be obtained for some reason, usually because of a failure to communicate with the node where the Pod should be running.

Description:

  • When a Pod is being deleted, some kubectl commands show the Pod's status as Terminating. This Terminating status is not one of the Pod phases. A Pod is given a grace period in which to terminate gracefully, which defaults to 30 seconds. You can use the --force flag to forcibly terminate a Pod.
  • Starting with Kubernetes 1.27, except for static Pods and force-deleted Pods that have no finalizer, the kubelet transitions a deleted Pod to a terminal phase (Failed or Succeeded, depending on the exit status of the Pod's containers) before it is removed from the API server.
  • If a node dies or loses contact with the rest of the cluster, Kubernetes applies a policy that sets the phase of all Pods running on the lost node to Failed.
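
To check a Pod's current phase directly, you can query the status subfield; the pod name and namespace below are illustrative:

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.phase}'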

4. Container status

Kubernetes tracks the state of each container in a Pod, just as it tracks the Pod's overall phase. You can use container lifecycle callbacks to trigger events at specific points in a container's lifecycle.

Once the scheduler assigns a Pod to a node, the kubelet starts creating the Pod's containers through the container runtime. A container can be in one of three states: Waiting (waiting), Running (running), and Terminated (terminated).

To check the status of containers in a Pod, you can use the following command:

kubectl describe pod <pod-name> -n <namespace>

The output includes the state of each container in the Pod.
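
Alternatively, the raw container states can be read directly from the Pod's status; the pod name and namespace below are illustrative:

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].state}'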

Meaning of each status:

  • Waiting (waiting)
    If a container is not in the Running or Terminated state, it is Waiting. A container in the Waiting state is still performing the operations it needs in order to start: for example, pulling the container image from an image registry, or applying Secret data to the container. When you use kubectl to query a Pod that contains a container in the Waiting state, you will also see a Reason field that explains why the container is in that state.
  • Running (running)
    The Running status indicates that the container is executing without issues. If a postStart callback is configured, it has already completed. When you use kubectl to query a Pod that contains a container in the Running state, you will also see information about when the container entered the Running state.
  • Terminated (terminated)
    A container in the Terminated state began execution and then either ran to completion or failed for some reason. When you use kubectl to query a Pod that contains a container in the Terminated state, you will see the reason the container entered this state, the exit code, and the start and end times of the container's execution.
    If a container has a preStop callback configured, the callback runs before the container enters the Terminated state.

5. Container restart strategy

A Pod's spec contains a restartPolicy field, whose possible values are Always, OnFailure, and Never. The default value is Always.

restartPolicy applies to all containers in the Pod, and it only covers restarts of containers performed by the kubelet on the same node. When a container in a Pod exits, the kubelet restarts it with an exponentially increasing back-off delay (10s, 20s, 40s, …), capped at 5 minutes. Once a container has run for 10 minutes without problems, the kubelet resets the restart back-off timer for that container.
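
As a minimal sketch (the Pod name, image, and command here are illustrative), the restart policy is set at the Pod level and applies to every container in it:

apiVersion: v1
kind: Pod
metadata:
  name: restart-policy-demo
spec:
  restartPolicy: OnFailure        # Always (default) | OnFailure | Never
  containers:
    - name: app
      image: busybox:1.36
      # exits with a non-zero code, so the kubelet restarts it with exponential back-off
      command: ["sh", "-c", "echo working; sleep 5; exit 1"]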

6. pod status

A Pod has a PodStatus object, which contains an array of PodConditions. Each PodCondition has the following fields:

  • type: the name of this Pod condition
  • lastProbeTime: timestamp of when the Pod condition was last probed
  • lastTransitionTime: timestamp of when the Pod last transitioned from one status to another
  • reason: machine-readable, CamelCase text describing the reason for the condition's last transition
  • message: human-readable message giving details about the last status transition

The kubelet manages the following PodConditions:

  • PodScheduled: Pod has been scheduled to a node
  • PodReadyToStartContainers: The Pod sandbox was successfully created and the network was configured (Alpha feature, must be explicitly enabled)
  • ContainersReady: All containers in the Pod are ready
  • Initialized: All init containers have completed successfully
  • Ready: The Pod can serve requests and should be added to the Load Balancing Pool of the corresponding service
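
The current conditions of a Pod can be listed from its status; the pod name below is illustrative:

kubectl get pod <pod-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'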

6.1 Pod readiness gate

Your application can inject additional feedback or signals into PodStatus: Pod readiness (Pod Readiness). To use this feature, set the readinessGates list in the Pod's spec to give the kubelet a set of additional conditions to evaluate when determining Pod readiness.

Readiness gates are evaluated against the current state of the Pod's status.conditions field. If Kubernetes cannot find a matching condition in status.conditions, the status of that condition defaults to "False".

Example:

kind: Pod
...
spec:
  readinessGates:
    - conditionType: "www.example.com/feature-1"
status:
  conditions:
    - type: Ready # Built-in Pod status
      status: "False"
      lastProbeTime: null
      lastTransitionTime: 2018-01-01T00:00:00Z
    - type: "www.example.com/feature-1" # Additional Pod status
      status: "False"
      lastProbeTime: null
      lastTransitionTime: 2018-01-01T00:00:00Z
  containerStatuses:
    - containerID: docker://abcd...
      ready: true
...

6.2 Pod readiness status

The kubectl patch command does not support patching an object's status. To set these status.conditions for a Pod, an application or operator should use the PATCH action; you can use one of the Kubernetes client libraries to write code that sets custom conditions for Pod readiness.

For Pods that use custom conditions, the Pod is evaluated as ready only when both of the following apply:

  • All containers in the Pod are ready
  • All conditions in readinessGates are True

When a Pod's containers are Ready but at least one custom condition is missing or set to False, the kubelet sets the Pod's condition to ContainersReady.
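
As a rough sketch of such a status patch (assuming kubectl proxy is running on localhost:8001; the namespace, pod name, and condition type are illustrative), the custom condition can be merged into the Pod's status subresource:

curl -X PATCH http://localhost:8001/api/v1/namespaces/default/pods/mypod/status \
  -H "Content-Type: application/strategic-merge-patch+json" \
  -d '{"status":{"conditions":[{"type":"www.example.com/feature-1","status":"True"}]}}'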

6.3 Pod network readiness

After a Pod is scheduled to a node, it needs to be admitted by the kubelet and have the required storage volumes mounted. Once these phases are complete, the kubelet works with the container runtime (via the Container Runtime Interface, CRI) to set up a runtime sandbox for the Pod and configure its network.

If the PodReadyToStartContainersCondition feature gate is enabled, the kubelet reports whether the Pod has reached this initialization milestone through the PodReadyToStartContainers condition in the Pod's status.conditions field.

When the kubelet detects that a Pod does not have a runtime sandbox with networking configured, it sets the PodReadyToStartContainers condition to False. This occurs in the following scenarios:

  • Early in the Pod's life cycle, when the kubelet has not yet begun to set up a sandbox for the Pod using the container runtime
  • Late in the Pod's life cycle, when the Pod's sandbox has been destroyed for some reason:
    • On a node that reboots, without the Pod being evicted
    • For containers that use virtual machines for isolation, when the Pod sandbox virtual machine reboots and a new sandbox and new container network configuration must be created

After the runtime plugin successfully completes sandbox creation and network configuration for the Pod, the kubelet sets the PodReadyToStartContainers condition to True. The kubelet can then begin pulling container images and creating containers.

For Pods with init containers, the kubelet sets the Initialized condition to True after the init containers have successfully completed (which happens after the runtime has successfully created the sandbox and configured networking). For Pods without init containers, the kubelet sets the Initialized condition to True before sandbox creation and network configuration begin.

7. Container probes

A probe is a diagnostic performed periodically by the kubelet on a container. To perform the diagnostic, the kubelet either executes code within the container or makes a network request.

7.1 Check mechanisms

There are four different mechanisms for probing a container. Each probe must be defined as exactly one of these four mechanisms:

  • exec
    Execute the specified command within the container. If the return code is 0 when the command exits, the diagnosis is considered successful.
  • grpc
    Performs a remote procedure call using gRPC. The target should implement gRPC health checks. If the response status is SERVING, the diagnosis is considered successful.
  • httpGet
    Execute an HTTP GET request on the specified port and path on the container’s IP address. If the response status code is greater than or equal to 200 and less than 400, the diagnosis is considered successful.
  • tcpSocket
    Performs a TCP check on the specified port on the container’s IP address. If the port is open, the diagnosis is considered successful. This is considered healthy if the remote system (container) closes the connection immediately after opening it.

Note:

  • Unlike the other mechanisms, the exec probe involves creating/forking processes on every execution. Therefore, in clusters with high Pod density and low initialDelaySeconds and periodSeconds values, configuring probes that use the exec mechanism can noticeably increase the node's CPU load. In such scenarios, prefer one of the other probe mechanisms to avoid the extra overhead.
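
A minimal sketch of how one of these mechanisms is declared (the Pod name, image, port, and path are illustrative); here the httpGet mechanism is attached to a liveness probe:

apiVersion: v1
kind: Pod
metadata:
  name: liveness-http-demo
spec:
  containers:
    - name: web
      image: nginx:1.25
      ports:
        - containerPort: 80
      livenessProbe:
        httpGet:              # the diagnostic succeeds on any response code >= 200 and < 400
          path: /healthz
          port: 80
        initialDelaySeconds: 5
        periodSeconds: 10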

7.2 Probe results

Each probe has one of the following results:

  • Success (success): the container passed the diagnostic
  • Failure (failure): the container failed the diagnostic
  • Unknown (unknown): the diagnostic itself failed, so no action is taken

7.3 Probe types

For a running container, the kubelet can optionally perform three kinds of probes and react to their results:

  1. livenessProbe
    Indicates whether the container is running. If the liveness probe fails, the kubelet kills the container, and the container's next action is determined by its restart policy. If a container does not provide a liveness probe, the default result is Success.
  2. readinessProbe
    Indicates whether the container is ready to serve requests. If the readiness probe fails, the endpoints controller removes the Pod's IP address from the endpoints of all Services that match the Pod. The readiness state before the initial delay defaults to Failure. If a container does not provide a readiness probe, the default result is Success.
  3. startupProbe
    Indicates whether the application inside the container has started. If a startup probe is provided, all other probes are disabled until it succeeds. If the startup probe fails, the kubelet kills the container, and the container is restarted according to its restart policy. If a container does not provide a startup probe, the default result is Success.
  • When to use liveness probes
    If the process in a container is able to crash on its own whenever it encounters a problem or becomes unhealthy, a liveness probe is not strictly necessary; the kubelet automatically takes the correct action according to the Pod's restartPolicy.
    If you want a container to be killed and restarted when a probe fails, specify a liveness probe and set restartPolicy to Always or OnFailure.

  • When to use readiness probes
    1. If you want traffic to be sent to a Pod only after a probe succeeds, specify a readiness probe. In this case the readiness probe may be the same as the liveness probe, but the presence of the readiness probe in the spec means that the Pod starts without receiving any traffic and only begins to receive traffic after the probe succeeds.
    2. If you want a container to be able to take itself down for maintenance, you can specify a readiness probe that checks an endpoint specific to readiness, different from the one used by the liveness probe.
    3. If the application has a strict dependency on backend services, you can implement both a liveness and a readiness probe. The liveness probe passes when the application itself is healthy, while the readiness probe additionally checks that every required backend service is available. This helps you avoid directing traffic to Pods that can only respond with error messages.
    4. If you just want to be able to drain requests when the Pod is deleted, you do not necessarily need a readiness probe; when the Pod is deleted, it automatically puts itself into an unready state regardless of whether a readiness probe exists, and it remains unready while the containers in the Pod are stopping.

  • When to use startup probes
    Startup probes are useful for Pods whose containers need a long time to become ready. Instead of configuring a long liveness probe interval, you can configure a separate probe that runs while the container is starting, which allows a startup window much longer than the liveness interval would permit.
    In general, if a container's startup time exceeds initialDelaySeconds + failureThreshold * periodSeconds, you should define a startup probe that checks the same endpoint as the liveness probe. The default for periodSeconds is 10 seconds. Set failureThreshold high enough to give the container time to finish starting, while keeping the liveness probe's parameters at their default values. This helps protect against deadlocks (see the sketch after this list).
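
A minimal sketch of this pattern (the Pod name, image, port, and path are illustrative); both probes check the same endpoint, but the startup probe gives the application up to failureThreshold * periodSeconds to come up before the liveness probe takes over:

apiVersion: v1
kind: Pod
metadata:
  name: slow-start-demo
spec:
  containers:
    - name: app
      image: registry.example.com/slow-start-app:1.0   # illustrative image
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30   # 30 * 10s allows up to 300s for startup
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10      # only takes effect after the startup probe succeeds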

8. Termination of Pods

Because Pods represent processes running on nodes in the cluster, it is important to let those processes terminate gracefully when they are no longer needed. In general, they should not be killed abruptly with a KILL signal, since that gives the processes no chance to complete their cleanup work.

The design goal is that you can request deletion of a process and know when it has terminated, and also be able to ensure that the deletion eventually completes. When you request deletion of a Pod, the cluster records and tracks the Pod's graceful termination period rather than forcibly killing the Pod outright.

The typical flow of graceful Pod termination is:

  1. The kubelet first asks the container runtime to try to stop the containers in the Pod, sending a TERM (aka SIGTERM) signal, with a graceful shutdown deadline, to the main process of each container.
  2. These stop requests are handled asynchronously by the container runtime, and the order in which they are processed is not guaranteed.
  3. Many container runtimes respect the STOPSIGNAL value defined in the container image and, if it differs, send the STOPSIGNAL configured in the image instead of TERM.
  4. Once the graceful termination period is exceeded, the container runtime sends a KILL signal to any remaining processes, after which the Pod is removed from the API server.
  5. If the kubelet or the container runtime's management service is restarted while waiting for the processes to terminate, the cluster retries from the beginning, giving the Pod the full graceful termination period again.

Below is an example:

  1. You use the kubectl tool to manually delete a Pod; its default graceful termination period is 30 seconds.
  2. The Pod object in the API server is updated to record the time after which the Pod is considered dead, based on the graceful termination period. If you use kubectl describe to check the Pod being deleted, it is shown as Terminating (terminating). On the node where the Pod is running, as soon as the kubelet sees that the Pod has been marked as terminating (with a graceful termination period set), it starts the local Pod shutdown process.
    1. If one of the Pod's containers defines a preStop callback, the kubelet starts running that callback inside the container. If the preStop callback is still running when the graceful termination period expires, the kubelet grants the Pod a small, one-off extension of two seconds.
    If the preStop callback needs longer than the default graceful termination period to complete, you must increase the terminationGracePeriodSeconds value for it to work correctly (see the sketch after this list).
    2. The kubelet then asks the container runtime to send a TERM signal to process 1 in each container. The containers in the Pod receive the TERM signal at different times and in an arbitrary order; if the shutdown order matters, consider using preStop callbacks to coordinate it.
  3. While the kubelet starts the Pod's graceful shutdown, the control plane evaluates whether to remove the terminating Pod from the corresponding EndpointSlice objects, where the Pod is selected by a Service through a selector. ReplicaSets and other workload resources no longer treat a terminating Pod as a valid, serviceable replica.
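
A minimal sketch of a Pod that coordinates its own shutdown (the Pod name, image, and sleep duration are illustrative); the preStop hook must complete within terminationGracePeriodSeconds:

apiVersion: v1
kind: Pod
metadata:
  name: graceful-shutdown-demo
spec:
  terminationGracePeriodSeconds: 60   # overrides the 30-second default
  containers:
    - name: web
      image: nginx:1.25
      lifecycle:
        preStop:
          exec:
            # give load balancers time to stop routing traffic before TERM is delivered
            command: ["sh", "-c", "sleep 10"]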

Pods that are shutting down slowly should stop serving regular traffic and instead begin terminating and finish processing the connections that are already open. Some applications need to go beyond finishing open connections and require further graceful termination logic, such as draining and completing sessions.

The endpoints corresponding to terminating Pods are not immediately removed from EndpointSlices; the EndpointSlice API (as well as the legacy Endpoints API) exposes a status indicating the terminating state. Terminating endpoints always have their ready condition set to false (for backward compatibility with versions before 1.26), so load balancers do not use them for regular traffic.
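
The per-endpoint conditions, including the terminating state, can be inspected on the EndpointSlices that belong to a Service; the service name below is illustrative:

kubectl get endpointslices -l kubernetes.io/service-name=my-service -o yaml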

8.1 Forced Pod termination

By default, all deletions come with a 30-second grace period. The kubectl delete command supports the --grace-period=<seconds> option, which lets you override the default and set your own value.

Forcing the grace period to 0 means the pod is immediately removed from the API server. If the Pod is still running on a node, the forced deletion operation will trigger the kubelet to perform an immediate cleanup operation.

Notice:

  • You must set --grace-period=0 and additionally set the --force parameter to initiate a forced deletion request.
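
For example (the Pod name is illustrative):

kubectl delete pod <pod-name> --grace-period=0 --force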

When a force deletion is performed, the API server does not wait for confirmation from the kubelet that the Pod has terminated on the node where it was running. It removes the Pod object immediately so that a new Pod with the same name can be created. On the node, a Pod that is set to terminate immediately still gets a small grace period before being forcibly killed.

8.2 Pod garbage collection

For a failed Pod, the corresponding API object will still remain on the cluster's API server until the user or controller process explicitly deletes it.

The Pod garbage collector (PodGC) is a controller in the control plane that deletes terminated Pods (those in the Succeeded or Failed phase) when the number of Pods exceeds the configured threshold (determined by the terminated-pod-gc-threshold setting of kube-controller-manager). This avoids resource leaks as Pods are created and terminated over time.
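
As a sketch, this threshold corresponds to a kube-controller-manager flag; the value shown here is illustrative, not the default, and all other flags are omitted:

kube-controller-manager --terminated-pod-gc-threshold=1000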

Additionally, PodGC cleans any Pods that meet any of the following conditions:

  • Orphan Pods, bound to a node that no longer exists
  • Unscheduled Pods that are in the process of terminating
  • Pods in the process of terminating that are bound to a not-ready node tainted with node.kubernetes.io/out-of-service, when the NodeOutOfServiceVolumeDetach feature gate is enabled

If the PodDisruptionConditions feature gate is enabled, then while cleaning up Pods, PodGC also marks them as failed if they are in a non-terminal phase. In addition, PodGC adds a Pod disruption condition when cleaning up orphan Pods.
