Why does every Kubernetes Pod need a Pause container?

Introduction

While starting a Pod, Kubernetes reported the following error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "k8s.gcr.io/pause:3.5": failed to pull image "k8s.gcr.io/pause:3.5": failed to pull and unpack image "k8s.gcr.io/pause:3.5": failed to resolve reference "k8s.gcr.io/pause:3.5": failed to do request: Head "https://k8s.gcr.io/v2/pause/manifests/3.5": x509: certificate signed by unknown authority

Pulling from k8s.gcr.io requires access to the external network, which this node did not have, so the pause image could not be pulled and the Pod could not start. I had never paid attention to the pause container before: what is it, what is it used for, and why does it never show up in the Pod? This article answers those questions.

What is a Pause container

In Kubernetes, the Pod is the smallest scheduling unit, but a good deal of machinery sits inside it, and the Pause container is one such piece. It looks inconspicuous, yet every Pod in the cluster depends on it. Running docker ps on a Kubernetes node shows a pause container next to the application containers of each Pod, for example:

[root@localhost ~]# docker ps |grep traefik
66032431a20e 2ae1addee1b2 "/entrypoint.sh --gl…" 30 hours ago Up 30 hours k8s_traefik_traefik-68b9ccfc77-x8sqg_traefik_aa5b97bf-3db8-4b92-89a7-1fe551645e6a_0
10d393461904 registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 30 hours ago Up 30 hours k8s_POD_traefik-68b9ccfc77-x8sqg_traefik_aa5b97bf-3db8-4b92-89a7-1fe551645e6a_0

A node runs many pause containers, and their names all follow the same pattern, starting with k8s_POD_. Every time a Pod starts, a pause container is started along with it. So what exactly does it do? This is the Pause container, also called the Infra container. After deploying a Kubernetes cluster, you can see the matching parameter on the kubelet command line:

[root@localhost ~]# ps -ef|grep kubelet
root 8675 1 10 Sep18 ? 03:15:07 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-container-image=registry.aliyuncs.com/google_containers/pause:3.5

The pause container here uses the image registry.aliyuncs.com/google_containers/pause:3.5. The image is tiny, only 683kB. It is named pause because its process spends its whole life suspended in the pause() call.

[root@localhost ~]# docker images|grep pause
registry.aliyuncs.com/google_containers/pause 3.5 ed210e3e4a5b 2 years ago 683kB

The pause program is written in C. If you want to see how it is built, its source lives in the official repository: https://github.com/kubernetes/kubernetes/tree/master/build/pause

The role of Pause container

Network namespace isolation: A Pod is the smallest scheduling unit in Kubernetes and can contain one or more containers. Each Pod has its own network namespace, which isolates it from other Pods. The Pause container creates this namespace and keeps it alive, and the other containers in the Pod join it, so they can reach each other over localhost without conflicting with containers in other Pods.
Process isolation: The Pause container keeps one lightweight process running even when the other containers in the Pod have stopped. The process does no useful work itself, but while it runs the Pod's namespaces stay open, so an application container can stop and restart without the Pod's environment being torn down.
Resource isolation: The Pause container needs almost no CPU or memory. Because it stays running, the Pod still has a live container that Kubernetes can monitor and manage even when no application container is running.
IP address maintenance: A Pod's IP address is usually assigned dynamically and belongs to the network namespace held by the Pause container. Because the Pause container runs for as long as the Pod exists, the address stays the same throughout the Pod's lifetime, however often the other containers restart.
Life cycle management: The Pause container lives exactly as long as the Pod. It is created when the Pod is created and deleted when the Pod is deleted, so the whole life cycle of the Pod, from creation to deletion, is anchored to it.

How the Pause container works

A Pod is a group of containers that share storage and network resources. So how are the network resources shared? Here is an example.

Suppose a Pod contains container A and container B, and the two must share one Network Namespace. Kubernetes solves this by creating an extra, small Infra container in every Pod to hold the Network Namespace of the whole Pod. The Infra container uses a very small image, about 683kB, runs a program written in C, and is always in a “paused” state. All other containers then join the Network Namespace of the Infra container (Join Namespace), so every container in the Pod sees exactly the same network view.

That is, the network devices, IP addresses, MAC addresses and other network-related information they see are one and the same copy, and that copy comes from the Infra container that was created first in the Pod. A Pod has exactly one IP address: the address of the Pod's Network Namespace, which is also the IP address of the Infra container. All other network resources likewise exist once per Pod and are shared by all containers in it. This is how Pod networking is implemented.

Because the other containers depend on this intermediate container, the Infra container must be the first container started in the Pod. The life cycle of the whole Pod equals the life cycle of the Infra container and does not depend on containers A and B. This is a very important design decision. The pause container provides two core functions for the business containers:

First, it serves as the basis for Linux namespace sharing in the Pod.
Second, when PID namespace sharing is enabled, it runs as PID 1 in the Pod and reaps zombie processes.

Manual simulation of Pod

On the surface a Pod consists of at least one container, but in fact it always contains at least two: the application container and the pause container. To simulate this by hand, first run a pause container:

[root@localhost ~]# docker run -d --name pause --ipc=shareable -p 8080:80 registry.aliyuncs.com/google_containers/pause:3.5
fd315974f5d1a5f52ca47c5dc31aea3774cebf90c88ce065cc9e9ea2f52c103c

--name pause: Names the container pause.
--ipc=shareable: Gives the container its own IPC namespace and allows other containers to join it.
-p 8080:80: Maps port 8080 of the host to port 80 of the container.

Run an nginx container that proxies requests to the springboot application on 127.0.0.1:8888:

# Prepare the nginx configuration file
[root@localhost ~]# cat <<EOF > nginx.conf
error_log stderr;
events { worker_connections 1024; }
http {
    server {
        listen 80 default_server;
        server_name www.kubesre.com;
        location / {
            proxy_pass http://127.0.0.1:8888;
        }
    }
}
EOF

# Create the nginx container
[root@localhost ~]# docker run -d --name nginx -v `pwd`/nginx.conf:/etc/nginx/nginx.conf --net=container:pause --ipc=container:pause --pid=container:pause nginx
fa9f858adae826ad536178747e00fffc829c7baf98c3bc29e945230abbf2a5cb

--net=container:pause: Shares the network namespace of another container. Here the nginx container joins the network namespace of the container named pause, so both use the same network configuration and interfaces.
--ipc=container:pause: Shares the IPC namespace of another container, so that processes in both containers can use inter-process communication with each other. Here the nginx container joins the IPC namespace of the container named pause.
--pid=container:pause: Shares the PID namespace of another container, so that the containers can see and manage each other's processes.

Create the application container springboot:

[root@localhost ~]# docker run -d --name springboot --net=container:pause --ipc=container:pause --pid=container:pause registry.cn-shanghai.aliyuncs.com/kubesre02/springboot
e33cfa3cebd5aafa714ca6ef0f6a16be52a282c64b8d24b2d98890ccf02c436a

At this point we have manually simulated a “Pod” that follows the K8S Pod model, although it is not managed by K8S. To verify, list the running containers:

[root@localhost ~]# docker ps | grep -E "pause|nginx|springboot"
4f877cdcba5d registry.cn-shanghai.aliyuncs.com/kubesre02/springboot "java -jar /app.jar" 3 seconds ago Up 2 seconds springboot
e541dc010fb3 nginx "/docker-entrypoint.…" 19 hours ago Up 19 hours nginx
09f94a052d50 registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 19 hours ago Up 19 hours 0.0.0.0:8080->80/tcp, :::8080->80/tcp pause

Access port 8080 on the host:

[root@localhost ~]# curl http://localhost:8080
Hello Docker World

As can be seen from the above steps:

The pause container maps its internal port 80 to port 8080 on the host.
The pause container sets up the network namespace, and the nginx container joins it.
This is what --net=container:pause does when the nginx container is started.
The springboot container joins the same network namespace in the same way when it starts.
As a result, the three containers share one network and can talk to each other directly over localhost.
--ipc=container:pause and --pid=container:pause put the IPC and PID of the three containers in the same namespaces, and the init process (PID 1) is pause.

Here, we enter the springboot container to view:

[root@localhost ~]# docker exec -it springboot sh
/ # ps aux
PID USER TIME COMMAND
    1 65535 0:00 /pause
  205 root 0:22 java -jar /app.jar
  240 root 0:00 nginx: master process nginx -g daemon off;
  261 101 0:00 nginx: worker process
  263 root 0:00 sh
  269 root 0:00 ps aux

Inside the springboot container you can see the processes of the pause and nginx containers, and pause has PID 1. In Kubernetes, PID namespace sharing is disabled by default, so PID 1 in each container is normally that container's own business process. The layout above appears only when shareProcessNamespace: true is set in the Pod spec.

Without K8S Pods, starting one business service means creating three containers by hand, and destroying it means deleting all three. With a K8S Pod the three containers form one logical unit: creating the Pod creates them automatically, and deleting the Pod deletes them. That is far more convenient to manage.

This is a fundamental reason for the existence of the Pod.

How to recycle zombie processes

In Linux, the processes in a PID namespace form a tree, and every process has a parent. Only the process at the root of the tree has no real parent: this is the init process, and its PID is 1.

Zombie processes are processes that have stopped running but still have an entry in the process table. On UNIX systems, when a child process ends but its parent has not waited for it (by calling wait/waitpid), the child becomes a zombie process.

How are zombie processes generated?

A zombie process lingers when the parent is poorly written and omits the wait call, or when the parent dies before the child and the new parent does not call wait. When a process's parent dies before it does, the operating system reassigns the child to the init process, that is, the process with PID 1. Init adopts the child and becomes its parent. From then on, when the child exits, the new parent (init) must call wait to collect its exit code. Otherwise the process table entry stays around indefinitely as a zombie.

Within a Kubernetes Pod, containers run in essentially the same way as above, with a special pause container created for each Pod.

The pause container runs a very simple process. It does no work of its own and essentially sleeps forever. Its source code:

/*
Copyright 2016 The Kubernetes Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
 
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
 
static void sigdown(int signo) {
  psignal(signo, "Shutting down, got signal");
  exit(0);
}
 
static void sigreap(int signo) {
  while (waitpid(-1, NULL, WNOHANG) > 0);
}
 
int main() {
  if (getpid() != 1)
    /* Not an error because pause sees use outside of infra containers. */
    fprintf(stderr, "Warning: pause should be the first process\n");

  if (sigaction(SIGINT, &(struct sigaction){.sa_handler = sigdown}, NULL) < 0)
    return 1;
  if (sigaction(SIGTERM, &(struct sigaction){.sa_handler = sigdown}, NULL) < 0)
    return 2;
  if (sigaction(SIGCHLD, &(struct sigaction){.sa_handler = sigreap,
                                             .sa_flags = SA_NOCLDSTOP},
                NULL) < 0)
    return 3;

  for (;;)
    pause();
  fprintf(stderr, "Error: infinite loop terminated\n");
  return 42;
}

The code shows that the pause process does more than call pause() to sleep. It has another important job:

It assumes the role of PID 1. When a process is orphaned because its parent has died, it is adopted by the pause process, and once it exits, the SIGCHLD handler reaps it by calling waitpid. This way zombie processes do not pile up in the PID namespace of the Kubernetes Pod.

Why, then, do you not see the Pause container when you create a Pod with commands such as kubectl create or kubectl apply? Because Pause containers are created and managed automatically by Kubernetes and normally need no action or attention from the user. The Pause container is an implicit component of the Pod that holds the Pod's shared namespaces on behalf of its containers.

As the manual simulation shows, doing all of this by hand is complicated, and we have not even looked at how to monitor and manage the life cycle of these containers. Fortunately we do not have to manage containers this way, because Kubernetes already does it for us.

This article is reproduced from “Cloud Native Operation and Maintenance Circle”. Original text: https://url.hi-linux.com/dQUJn. The copyright belongs to the original author.