How cgroup CPU throttling works in Kubernetes and Docker

CPU Cgroup

Cgroups limit the resources available to specified processes. The CPU cgroup is one of the cgroup subsystems; it limits the CPU usage of the processes in a control group.
The CPU usage of a process includes only two parts: user mode (us and ni in the top command) and kernel mode (sy in the top command).
The other states (wa, hi, si in the top command) represent I/O-related or interrupt-related CPU time and are not limited by the CPU cgroup.
Each cgroup subsystem is mounted at a default directory through a virtual file system mount point; on most Linux distributions the CPU cgroup is found under /sys/fs/cgroup/cpu.

# mount -t cgroup
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)

Each cgroup subsystem represents a resource:

  • systemd – not a resource controller: a named hierarchy maintained by systemd for its own process tracking;
  • cpu – schedules access to CPU resources for the tasks in a cgroup;
  • cpuacct – generates CPU usage reports for all tasks in a cgroup;
  • memory – limits memory utilization;
  • freezer – allows the tasks in a cgroup to be suspended/resumed as a group;
  • cpuset – assigns specific CPUs and memory nodes to the tasks in a cgroup;
  • net_cls – allows marking of network packets produced by the tasks in a cgroup;
  • net_prio – provides a way to dynamically set the priority of network traffic per network interface for a cgroup;
  • pids – limits the number of processes in a cgroup;
  • rdma – the Remote Direct Memory Access controller, which limits the use of RDMA/IB resources by the tasks in a cgroup;
  • hugetlb – limits hugepage usage by the tasks in a cgroup;
  • devices – restricts access to devices for the tasks in a cgroup;
  • perf_event – allows performance monitoring of the tasks in a cgroup;
  • blkio – limits the block device I/O rate for the tasks in a cgroup.
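
Besides the mount table, the kernel lists every controller it knows about in /proc/cgroups. The hierarchy and cgroup counts below are illustrative and the output is truncated:

# cat /proc/cgroups
#subsys_name    hierarchy       num_cgroups     enabled
cpuset          5               1               1
cpu             2               64              1
cpuacct         2               64              1
memory          7               89              1
...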

For example, let's create two control groups (that is, two directories), group1 and group2, at the top level of the subsystem:

# cd /sys/fs/cgroup/cpu
# mkdir group1 group2
# cd group2/
# ls
cgroup.clone_children cpuacct.stat cpuacct.usage_all cpuacct.usage_percpu_sys cpuacct.usage_sys cpu.cfs_period_us cpu.rt_period_us cpu.shares notify_on_release
cgroup.procs cpuacct.usage cpuacct.usage_percpu cpuacct.usage_percpu_user cpuacct.usage_user cpu.cfs_quota_us cpu.rt_runtime_us cpu.stat tasks

On cloud platforms, most programs are not real-time processes but normal (SCHED_NORMAL) processes. What does that mean for scheduling? On current Linux kernels, normal processes are handled by CFS (the Completely Fair Scheduler), so to understand throttling let's look directly at the CFS-related parameters of the CPU cgroup. There are three in total.

# cat /sys/fs/cgroup/cpu/group2/cpu.cfs_period_us
100000
# cat /sys/fs/cgroup/cpu/group2/cpu.cfs_quota_us
-1
# cat /sys/fs/cgroup/cpu/group2/cpu.shares
1024
  • The first parameter, cpu.cfs_period_us, is the length of one CFS scheduling period, in microseconds. Its value is typically 100000, i.e. 100ms.
  • The second parameter, cpu.cfs_quota_us, is the run time this control group is allowed within one scheduling period. A value of 50000 means 50ms; dividing it by the scheduling period (cpu.cfs_period_us) gives 50ms/100ms = 0.5, so the group's CPU cap is half a CPU (see the sketch after this list). Note that cpu.cfs_quota_us is an absolute value and may exceed the period: a value of 200000 (200ms) gives 200ms/100ms = 2 CPUs.
  • The third parameter, cpu.shares, sets the relative weight for CPU allocation between control groups; its default value is 1024.
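
Here is a minimal sketch of the quota at work, reusing the group2 directory created above: it caps the group at half a CPU, moves a busy loop into it, and reads the throttling counters. The loop and the counter values shown are illustrative; real output will vary.

# echo 50000 > /sys/fs/cgroup/cpu/group2/cpu.cfs_quota_us
# sh -c 'echo $$ > /sys/fs/cgroup/cpu/group2/cgroup.procs; exec sh -c "while :; do :; done"' &
# sleep 5
# cat /sys/fs/cgroup/cpu/group2/cpu.stat
nr_periods 50
nr_throttled 49
throttled_time 2450000000

In top, the loop should now sit at roughly 50% CPU, and nr_throttled keeps climbing because CFS suspends the group at the point in every 100ms period where it exhausts its 50ms quota.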

In Kubernetes

For each container, Kubernetes creates a control group in the CPU cgroup subsystem and writes the container's processes into it.

cpu.cfs_quota_us, cpu.cfs_period_us

Kubernetes implements CPU limits through the cpu.cfs_period_us and cpu.cfs_quota_us settings of the CPU cgroup controller. For each container's cgroup, Kubernetes configures two values:

cpu.cfs_period_us = 100000 (i.e. 100ms)
cpu.cfs_quota_us = quota = (cpu in millicores * 100000) / 1000

The container's CPU cap is cpu.cfs_quota_us divided by cpu.cfs_period_us; in practice, cpu.cfs_period_us is left at a fixed default.
With these two settings, the cgroup CPU subsystem strictly limits the processes in the cgroup: the CPU they use never exceeds cfs_quota_us/cfs_period_us, which is exactly the limit the container asked for.
If no CPU limit is specified, cfs_quota_us is set to -1, meaning no limit.
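
To make the mapping concrete, here is a hedged sketch of what this looks like on a node for a container whose limit is cpu: 500m, so quota = 500 * 100000 / 1000 = 50000. The paths assume cgroup v1 with the cgroupfs driver and a Burstable pod; <pod-uid> and <container-id> are placeholders:

# cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<pod-uid>/<container-id>/cpu.cfs_period_us
100000
# cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<pod-uid>/<container-id>/cpu.cfs_quota_us
50000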

cpu.shares

In the CPU cgroup, cpu.shares == 1024 represents the weight of one CPU, so a CPU request of n maps to a cpu.shares value of n * 1024.
The CPU request is implemented through the cpu.shares setting of the cgroup CPU subsystem. When you set a container's CPU request to x millicores, Kubernetes sets cpu.shares of the container's cgroup to x * 1024 / 1000. That is:

cpu.shares = (cpu in millicores * 1024) / 1000

For example, a container whose CPU request is 1 (i.e. 1000 millicores) gets a cpu.shares value of 1024.
The intended effect: even in the extreme case where every pod on the machine runs CPU-heavy jobs (each using as much CPU as it is allocated), this container is still guaranteed the computing power of one core. In other words, the CPU request is a floor: no matter how much CPU other containers ask for, and even when the node's CPUs are fully occupied at runtime, the container can still obtain the number of CPUs it requested.
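
The proportional nature of cpu.shares is easy to demonstrate with the group1 and group2 directories from earlier. This is only a sketch: the first line clears the quota set in the previous example so that only the weights matter, and both busy loops are pinned to CPU 0 with taskset to force them to compete. With weights of 1024 and 3072, top should show roughly a 25%/75% split between the two loops.

# echo -1 > /sys/fs/cgroup/cpu/group2/cpu.cfs_quota_us
# echo 1024 > /sys/fs/cgroup/cpu/group1/cpu.shares
# echo 3072 > /sys/fs/cgroup/cpu/group2/cpu.shares
# sh -c 'echo $$ > /sys/fs/cgroup/cpu/group1/cgroup.procs; exec taskset -c 0 sh -c "while :; do :; done"' &
# sh -c 'echo $$ > /sys/fs/cgroup/cpu/group2/cgroup.procs; exec taskset -c 0 sh -c "while :; do :; done"' &

Remember that shares only matter under contention: if the group1 loop exits, the group2 loop is free to take the whole CPU.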

In Docker

The following parameters can be used in docker:

  • --cpuset-cpus – restricts the container to specific CPUs or cores. With more than one CPU, this is a comma-separated list or hyphen-separated range of the CPUs the container may use. Numbering starts at 0, so a valid value might be 0-3 (use the 1st, 2nd, 3rd, and 4th CPU) or 1,3 (use the 2nd and 4th CPU).
  • --cpu-shares – set this flag to a value greater or less than the default of 1024 to give the container a greater or lesser share of the host's CPU cycles. The weight is only enforced when CPU cycles are contended; when plenty of cycles are available, every container uses as much CPU as it needs. It is therefore a soft limit.
  • --cpu-period – specifies the period of the CPU CFS scheduler, used together with --cpu-quota. Defaults to 100000 microseconds (100ms). Most users do not change this default; for most use cases --cpus is the more convenient choice.
  • --cpu-quota – imposes a CPU CFS quota on the container: the number of microseconds per --cpu-period the container may run before being throttled. It therefore acts as an effective upper bound. For most use cases --cpus is the more convenient choice.
  • --cpus – specifies how much of the available CPU resources the container may use. For example, if the host has two CPUs and you set --cpus="1.5", the container is guaranteed at most one and a half CPUs. This is equivalent to setting --cpu-period="100000" and --cpu-quota="150000". Supported since Docker 1.13 as a replacement for setting --cpu-period and --cpu-quota directly.

Example:
If you have 1 CPU, each of the following commands guarantees the container at most 50% of the CPU every second.

docker run -it --cpus=".5" ubuntu /bin/bash

This is equivalent to manually specifying --cpu-period and --cpu-quota:

docker run -it --cpu-period=100000 --cpu-quota=50000 ubuntu /bin/bash
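
One way to confirm the equivalence is to read the container's cgroup files directly. This is a sketch: the path assumes cgroup v1 with Docker's default cgroupfs layout, and cpu-demo is just a placeholder name:

# docker run -d --name cpu-demo --cpus=".5" ubuntu sh -c 'while :; do :; done'
# cat /sys/fs/cgroup/cpu/docker/$(docker inspect -f '{{.Id}}' cpu-demo)/cpu.cfs_quota_us
50000
# cat /sys/fs/cgroup/cpu/docker/$(docker inspect -f '{{.Id}}' cpu-demo)/cpu.cfs_period_us
100000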

Summary

  1. The CPU usage of a process includes only two parts: user mode (us and ni) and kernel mode (sy). Other system CPU overhead is not counted toward the process's CPU usage, and the CPU cgroup only limits the CPU usage of the process itself.
  2. The main CPU cgroup parameters are these three: cpu.cfs_quota_us, cpu.cfs_period_us, and cpu.shares.
    1. cpu.cfs_quota_us and cpu.cfs_period_us together determine the maximum CPU the processes in a control group can use: cpu.cfs_quota_us (the run time allowed per scheduling period) divided by cpu.cfs_period_us (the length of the scheduling period) is the group's CPU cap.
    2. cpu.shares determines the relative proportion of CPU available to each control group in the CPU cgroup subsystem; this proportion only takes effect between groups when the system's CPUs are fully occupied.