The three cornerstones of containers: Cgroups, Namespace, Rootfs

Introduction

Basic Concepts

Docker is built around three basic concepts:

Image: Equivalent to a root file system. For example, the official image ubuntu:16.04.
Container: The relationship between an image and a container is like that between a class and an instance in object-oriented programming. An image is a static definition, and a container is the running entity created from it. Containers can be created, started, stopped, deleted, paused, and so on.
Repository: A repository can be regarded as a code control center, used to store images.

Important mechanisms of containers

Advantages

Applications in a virtual machine must go through virtualization software that intercepts and processes their requests, which adds an extra layer of overhead. Containers have no such layer, so they are more agile and perform better.

Disadvantages

Compared with virtualization technology, the isolation mechanism based on Linux Namespace has a number of shortcomings. The main one is that the isolation is incomplete.
Many resources and objects in the Linux kernel cannot be namespaced. The most typical example is the system (wall-clock) time.
- If a program in your container uses the settimeofday(2) system call to change the time, the time of the entire host changes with it, which is obviously not what users expect. A virtual machine gives you the freedom to do whatever you like inside it; when deploying an application in a container, “what can and cannot be done” is a question users must consider.

1. Namespace: The Blindfold

Note that Linux Namespace should not be confused with the k8s Namespace concept:

Linux Namespace mechanism: used to isolate resources and views, so that a container cannot see the resources of the host or of other containers. This gives different applications isolated views and keeps them from interfering with each other.

k8s Namespace mechanism: isolates user resources in order to make k8s’ own resources easier to manage.

Linux Namespace is a kernel-level environment isolation mechanism provided by Linux. It is similar in spirit to chroot, which makes a directory the root directory of a process so that content outside it cannot be accessed. Building on this idea, Linux Namespace provides isolation for UTS, IPC, Mount, PID, Network, User, and so on, as follows:

Category	System call parameter	Isolates	Kernel version
Mount Namespaces	CLONE_NEWNS	Mount points	Linux 2.4.19
UTS Namespaces	CLONE_NEWUTS	Hostname and domain name	Linux 2.6.19
IPC Namespaces	CLONE_NEWIPC	System V IPC and POSIX message queues	Linux 2.6.19
PID Namespaces	CLONE_NEWPID	Process IDs	Linux 2.6.24
Network Namespaces	CLONE_NEWNET	Network devices, port numbers, etc.	Started in Linux 2.6.24, completed in Linux 2.6.29
User Namespaces	CLONE_NEWUSER	Users and user groups	Started in Linux 2.6.23, completed in Linux 3.8
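
One way to look at these Namespaces on a running system is through /proc: each process has a directory /proc/<pid>/ns whose entries are symbolic links of the form type:[inode], and two processes share a Namespace when the inode numbers match. The following is a minimal sketch, assuming a Linux host with /proc mounted. The helper names ns_link and ns_inode are made up for this illustration.

```shell
# Print the namespace link of a process, e.g. "pid:[4026531836]".
# $1 = namespace type (mnt, uts, ipc, pid, net, user)
# $2 = PID to inspect (defaults to the current shell)
ns_link() {
    readlink "/proc/${2:-$$}/ns/$1"
}

# Extract the inode number from a link text such as "pid:[4026531836]".
ns_inode() {
    printf '%s\n' "$1" | sed 's/^[a-z_]*:\[\([0-9]*\)\]$/\1/'
}

# List the namespaces of the current shell, if /proc exposes them.
for ns_type in mnt uts ipc pid net user; do
    if [ -L "/proc/$$/ns/$ns_type" ]; then
        printf '%-5s %s\n' "$ns_type" "$(ns_link "$ns_type")"
    fi
done
```

Comparing ns_link pid with ns_link pid 1 shows whether the current shell shares its PID Namespace with process 1. Reading the links of another process may require root.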

1.1 PID Namespace

A brand-new process space in which the process itself is No. 1

The newly created process will “see” a brand-new process space in which its PID is 1. I say “see” because this is only a “mask”: in the real process space of the host, the process still has its real PID, for example 100.

We can also call clone() multiple times to create multiple PID Namespaces. The application process in each Namespace believes it is process No. 1 of its own container; it can see neither the real process space of the host nor what is going on in other PID Namespaces.

In addition to the PID Namespace just described, the Linux operating system also provides Namespaces such as Mount, UTS, IPC, Network and User, which apply the same “blindfold” to other parts of the process context. For example, Mount Namespace lets the isolated process see only the mount point information of the current Namespace; Network Namespace lets the isolated process see only the network devices and configuration of the current Namespace. This is the most basic implementation principle of Linux containers.
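
The “PID 1” effect can be reproduced from a shell with the unshare tool from util-linux, which asks the kernel for new Namespaces in much the same way as clone() with CLONE_NEWPID. This is only a sketch: it assumes util-linux is installed and that the command is run as root, and the function name run_in_new_pid_ns is hypothetical.

```shell
# Run a command as the first process of a new PID Namespace.
#   --pid         create a new PID Namespace
#   --fork        fork first, so that the command itself becomes PID 1
#   --mount-proc  mount a fresh /proc so that tools like ps see only the new Namespace
run_in_new_pid_ns() {
    unshare --pid --fork --mount-proc "$@"
}

# Example (needs root), not executed here:
#   run_in_new_pid_ns sh -c 'echo "my PID is $$"'
```

When the call succeeds, the inner shell reports PID 1, while on the host the same process keeps an ordinary PID.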

1.2 Mount Namespace

Mount Namespace changes what the container process knows about the “mount points” of the file system. This also means that the view of the process only changes once a “mount” operation takes place. Until then, a newly created container simply inherits every mount point of the host.

This makes Mount Namespace slightly different from the other Namespaces: its changes to the view of the container process take effect only together with a mount operation (before any mount, the container still sees the host’s files). After a mount is performed inside the container’s Mount Namespace, running mount -l on the host does not show the new mount point; only mount -l inside the container shows it. That is the change of view.

1.3 UTS Namespace

UTS Namespace is mainly used to isolate the two system identifiers nodename and domainname. In UTS Namespace, each Namespace has an independent host name.

1.4 IPC Namespace

IPC Namespace is mainly used to isolate inter-process communication. PID Namespace and IPC Namespace are used together to enable processes in the same IPC Namespace to communicate with each other, but processes in different IPC Namespaces cannot communicate.

1.5 User Namespace

User Namespace is mainly used to isolate users and user groups. A typical application scenario is that a process running as a non-root user on the host is mapped to the root user in a separate User Namespace. With User Namespace, the process has root permissions inside the container while remaining an ordinary user on the host.
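
The mapping described above can be tried with unshare from util-linux. This is a sketch under two assumptions: util-linux is installed, and the kernel allows unprivileged users to create User Namespaces (some distributions disable this). The function name run_as_mapped_root is hypothetical.

```shell
# Run a command in a new User Namespace in which the caller is mapped to root.
#   --user           create a new User Namespace
#   --map-root-user  map the caller's UID and GID to 0 inside that Namespace
run_as_mapped_root() {
    unshare --user --map-root-user "$@"
}

# Example, not executed here:
#   id -u                      # the caller's real UID on the host
#   run_as_mapped_root id -u   # UID as seen inside the new User Namespace
```

Inside the new Namespace id -u reports 0, although the process still runs with the caller's ordinary UID from the host's point of view.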

1.6 Net Namespace

Net Namespace is used to isolate information such as network devices, IP addresses, and ports. Net Namespace allows each process to have its own independent IP address, port and network card information. For example, if the host IP address is 172.16.4.1, an independent IP address can be set in the container to 192.168.1.1.

2. Cgroups Resource Limitation

Prevent a single container from seizing all the resources of the host

2.1 Container Cgroups Limitations

Although process No. 1 in the container can only see what is inside the container because of the “blindfold”, on the host it is process No. 100 and still competes on equal terms with all other processes. This means that although process No. 100 appears to be isolated, the resources it uses (such as CPU and memory) can be taken at any time by other processes (or other containers) on the host. Of course, process No. 100 may also eat up all the resources itself. Neither is reasonable behavior for a “sandbox”.

Linux Cgroups is an important Linux kernel feature for setting resource limits on processes.

The full name of Linux Cgroups is Linux Control Groups. Its main function is to cap the resources that a group of processes can use, including CPU, memory, disk, network bandwidth, and so on. In addition, Cgroups can set priorities, perform auditing, and suspend and resume processes.

2.2 Understanding Pod Resource Limits

The essence of Pod resource limitation is to use the Cgroup mechanism to limit container resources.

For a process in the operating system to run, it needs CPU and memory. The same is true of a pod, so k8s divides the resources a pod needs into two major categories: compressible resources and incompressible resources.

Resource model of k8s:

Compressible resources: Resources such as CPU. When such a resource is insufficient, the pod only runs more slowly (it is “starved”), but it does not exit.

Incompressible resources: Resources such as memory. When such a resource is insufficient, the pod is killed by the kernel and forced to exit.

To describe this resource information, k8s attaches it to pods. Since a pod in k8s is composed of multiple containers, the resources of a pod are the sum of the resources of its containers. The two most important resources are CPU and memory.

CPU is a compressible resource: the unit used to describe CPU in K8S is millicpu. For example: 500m refers to 500 millicpu, which means 0.5 CPU.

Memory is an incompressible resource: K8S uses Ei, Pi, Ti, Gi, Mi, Ki (or E, P, T, G, M, k) as byte suffixes. Suffixes ending in i are powers of 2, the others are powers of 10. For example: 1Mi = 1024 × 1024 bytes; 1M = 1000 × 1000 bytes.
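
The difference between the two suffix families is easy to check with shell arithmetic. The following is a small sketch, not a full quantity parser: the function name to_bytes is made up, and it only handles whole numbers with the suffixes k, M, G, Ki, Mi and Gi.

```shell
# Convert a quantity such as 1Mi or 1M to a number of bytes.
to_bytes() {
    tb_num="${1%%[!0-9]*}"     # leading digits
    tb_suffix="${1#$tb_num}"   # whatever follows the digits
    if [ -z "$tb_num" ]; then
        echo "not a quantity: $1" >&2
        return 1
    fi
    case "$tb_suffix" in
        "") echo "$tb_num" ;;
        k)  echo $(( $tb_num * 1000 )) ;;
        M)  echo $(( $tb_num * 1000 * 1000 )) ;;
        G)  echo $(( $tb_num * 1000 * 1000 * 1000 )) ;;
        Ki) echo $(( $tb_num * 1024 )) ;;
        Mi) echo $(( $tb_num * 1024 * 1024 )) ;;
        Gi) echo $(( $tb_num * 1024 * 1024 * 1024 )) ;;
        *)  echo "unsupported suffix: $tb_suffix" >&2; return 1 ;;
    esac
}

to_bytes 1Mi   # 1048576
to_bytes 1M    # 1000000
```

A limit of 512Mi is therefore about 5% larger than a limit of 512M, which is worth keeping in mind when copying values between tools.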

During scheduling, kube-scheduler calculates only with the requests value, which represents the amount of resources reserved for the pod. In other words, requests is used for filtering and scoring during scheduling.

When actually setting Cgroups limits, kubelet uses the limits value, which represents the maximum amount of resources the pod may use. In other words, limits is used to set the cgroup limits.
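
As a rough sketch of how CPU values could end up in cgroup v1 files: a CPU limit in millicpu is commonly translated into cpu.cfs_quota_us relative to a 100 ms period, and, as far as I understand it, the CPU request also becomes a relative weight in cpu.shares (1024 shares per CPU). The function names, the floor of 2 shares and the default period are assumptions of this sketch, not something stated in the text above.

```shell
# CPU request in millicpu -> cpu.shares (assumed: 1 CPU = 1024 shares).
millicpu_to_shares() {
    mts_shares=$(( $1 * 1024 / 1000 ))
    if [ "$mts_shares" -lt 2 ]; then
        mts_shares=2   # assumed minimum weight
    fi
    echo "$mts_shares"
}

# CPU limit in millicpu -> cpu.cfs_quota_us.
# $2 = cpu.cfs_period_us, assumed to default to 100000 (100 ms).
millicpu_to_quota() {
    mtq_period="${2:-100000}"
    echo $(( $1 * $mtq_period / 1000 ))
}

millicpu_to_shares 500   # 512
millicpu_to_quota 500    # 50000
```

With these numbers, a container with a limit of 500m may run for 50 ms out of every 100 ms, which matches the statement that 500m means 0.5 CPU.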

3. Rootfs File System

For the Docker project, the core principle is to do the following for the user process that is about to be created:

Enable Linux Namespace configuration;
Set the specified Cgroups parameters;
Change the root directory of the process (Change Root).

In this way, a complete container is born. For the last step, the Docker project prefers the pivot_root system call and falls back to chroot if the system does not support it. The two calls serve a similar purpose but differ subtly: pivot_root changes the root mount of the calling process’s mount namespace, whereas chroot only changes the root directory that the calling process uses when resolving paths.

Summary

How does docker achieve isolation?

In fact, docker is a system that combines several components to deceive a process, relying mainly on three accomplices: namespace, chroot, and cgroup.

1. Namespace: isolating processes

Linux provides process isolation through namespaces. To give a process a new PID view, you only need to pass the CLONE_NEWPID flag when calling clone(). A process created this way still has an ordinary process number on the host, but inside its namespace its process number is 1.

A namespace isolates only the view of the process. Processes running in the namespace cannot see other processes, but the other resources of the host (CPU, memory, file system) are still system-wide, that is, they are still shared.

2. Use cgroup to limit the resources that a process can use

cgroup is used to limit the CPU, memory and other resources of the process.

cgroup follows Linux’s “everything is a file” philosophy: its interface is a directory tree under /sys/fs/cgroup.

First create a group in the cgroup directory. A group is simply a directory:

mkdir /sys/fs/cgroup/cpu/container

Next, check the group. cgroup has created the required files.

[root@es1 ~]# ls /sys/fs/cgroup/cpu/container
cgroup.clone_children cgroup.procs cpuacct.usage cpu.cfs_period_us cpu.rt_period_us cpu.shares notify_on_release
cgroup.event_control cpuacct.stat cpuacct.usage_percpu cpu.cfs_quota_us cpu.rt_runtime_us cpu.stat tasks

Next create a process

while : ; do : ; done &

It will continue to loop without doing anything, but it will consume a lot of CPU resources.

Next, use cgroup to limit the resource usage of the process. By default, the “group” just created is not associated with any process, so to make it take effect on the process just started, write the PID of that process (12890 in this example) to the tasks file.

echo '12890' >/sys/fs/cgroup/cpu/container/tasks

Next limit the CPU

[root@es1 ~]# cat /sys/fs/cgroup/cpu/container/cpu.cfs_quota_us
-1

The content of the cpu.cfs_quota_us file in the corresponding directory is “-1”, which means there is no restriction.

echo 20000 > /sys/fs/cgroup/cpu/container/cpu.cfs_quota_us

With the default cpu.cfs_period_us of 100000 (100 ms), a quota of 20000 means that the process may use 20 ms of CPU time in every 100 ms, that is, it is limited to 20% of one CPU.
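
The percentage is simply the quota divided by the period. A small helper makes the relationship explicit; the name cpu_limit_percent is made up for this sketch, and the file names assume the cgroup v1 layout used in this walkthrough.

```shell
# Express a CFS quota as a percentage of one CPU.
# $1 = value of cpu.cfs_quota_us
# $2 = value of cpu.cfs_period_us (defaults to 100000)
cpu_limit_percent() {
    clp_period="${2:-100000}"
    if [ "$1" -lt 0 ]; then
        echo "unlimited"   # -1 means no quota is set
    else
        echo $(( $1 * 100 / $clp_period ))
    fi
}

cpu_limit_percent 20000 100000   # 20
cpu_limit_percent -1             # unlimited
```

A result above 100 means the group may use more than one CPU, for example 200 for a quota of 200000 with the default period.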

cgroup cannot limit the CPU usage of a process with perfect precision, so the actual usage fluctuates around 20%. Docker does nothing more than create a cgroup when it creates the container and write the PID of the container’s main process into it, which is one reason a container is often described as following a “single process” model: the cgroup, like the container itself, is set up around one main process. Processes forked by that main process inherit its cgroup, so the limit applies to all of them together.

3. chroot

After isolating the processes and limiting the host resources, only the file system is left.

Each container in docker has its “own” root directory, but if you run a command like df -h on the host, you can still see the mounts of all containers. The “root directory” of a container therefore lives in the host’s file system: a directory on the host is simply “mounted” into the container, and only the processes in the container see it as the root.

First create a container, and then check the directory mounting of the host

[root@worker3 ~]# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
177db9617483 528909316/check:debian_11 "tail -f /etc/hosts" 2 weeks ago Up 2 weeks net2
[root@worker3 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 3.9G 0 3.9G 0% /dev
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 3.9G 401M 3.5G 11% /run
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/mapper/centos-root 48G 20G 28G 42% /
/dev/sda1 1014M 178M 837M 18% /boot
overlay 48G 20G 28G 42% /data/docker/overlay2/072a996f43e54fc20475a5c2df7c61856bfd30525c7ec955c157390b6ad78144/merged
tmpfs 796M 0 796M 0% /run/user/0

There is only one container, so only one mount is generated. Next, look at the contents of this mount directory of type overlay:

[root@worker3 ~]# ls /data/docker/overlay2/072a996f43e54fc20475a5c2df7c61856bfd30525c7ec955c157390b6ad78144/merged
bin boot dev etc home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var

Next, enter the container and view the root directory

[root@worker3 ~]# docker exec -it 177db9617483 /bin/bash
root@net2:/# ls
bin boot dev etc home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var

The two listings are exactly the same.

So the container “mounts” a directory on the host and uses this directory as the “root directory” of the container.

This is accomplished by chroot (or by pivot_root where it is available, as described above).
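
The idea can be tried by hand with the chroot command on any directory that contains a usable root file system, such as the merged directory shown above. This is only a sketch of the chroot variant, not Docker's actual code path: it assumes root privileges and a /bin/sh inside the directory, and the function name enter_rootfs is hypothetical.

```shell
# Start a shell whose root directory is the given rootfs directory.
# $1 = path to a directory containing a root file system (with /bin/sh)
enter_rootfs() {
    if [ ! -d "$1" ]; then
        echo "no such rootfs directory: $1" >&2
        return 1
    fi
    # Needs root. From now on "/" inside the new shell refers to "$1".
    chroot "$1" /bin/sh
}

# Example (needs root), not executed here:
#   enter_rootfs /path/to/merged
```

Running ls / in the resulting shell lists the contents of that directory, which is the same view the container process gets.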