Kubernetes OOM and CPU throttling, explained in one article

Introduction

Out of memory (OOM) errors and CPU throttling are major pain points of resource handling in Kubernetes-based cloud applications.

Why is that?

CPU and memory requirements in cloud applications are becoming increasingly important because they are directly related to your cloud costs.

With limits and requests, you can configure how much memory and CPU your pods may consume, preventing resource starvation and keeping cloud costs under control.
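
As a reference, this is how requests and limits are declared on a container. The pod name, image, and values below are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: myapp                # illustrative name
spec:
  containers:
  - name: myapp
    image: myapp:1.0         # illustrative image
    resources:
      requests:
        memory: "64Mi"       # reserved for the container at scheduling time
        cpu: "250m"          # a quarter of a CPU core
      limits:
        memory: "128Mi"      # exceeding this gets the container OOM-killed
        cpu: "500m"          # exceeding this gets the container throttled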

Pods may be evicted through preemption or node pressure if the node does not have enough resources.

When a process runs out of memory (OOM), it is terminated because it does not have the resources it needs.

If CPU consumption goes higher than the limit, the process will be throttled.

But how do you proactively monitor how close a Kubernetes pod is to OOM and CPU throttling?

Kubernetes OOM

Each container in a Pod requires memory to run.

Kubernetes limits are set per container in a Pod definition or Deployment definition.

All modern Unix systems have a way to kill processes when they need to reclaim memory. In Kubernetes, this shows up as exit code 137 or the reason OOMKilled.

    State:          Running
      Started:      Thu, 10 Oct 2019 11:14:13 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 10 Oct 2019 11:04:03 +0200
      Finished:     Thu, 10 Oct 2019 11:14:11 +0200

Exit code 137 means that the process used more memory than allowed and had to be terminated.

This relies on a Linux kernel feature: the kernel computes an oom_score for every running process. Processes can also set oom_score_adj, which Kubernetes uses to implement its Quality of Service classes. When memory has to be reclaimed, the kernel's OOM Killer reviews the processes and kills the ones with the highest score, typically those using more memory than they should.
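
You can inspect these values on a node for any container process (the PID below is illustrative):

cat /proc/12345/oom_score       # score the kernel computed for this process
cat /proc/12345/oom_score_adj   # adjustment the kubelet sets based on the pod's QoS class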

Note that in Kubernetes, a process can reach any of the following limits:

  • The Kubernetes Limit set on the container.

  • The Kubernetes ResourceQuota set on the namespace (see the example below this list).

  • The actual memory size of the node.
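
For example, the namespace-level cap mentioned above is defined with a ResourceQuota such as the following sketch (names and values are illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-cpu-quota        # illustrative name
  namespace: myapp           # illustrative namespace
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi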

Memory overcommit

Limits can be higher than requests, so the sum of all limits can be higher than node capacity. This is called overcommit, and it's very common. If all containers then use more memory than they requested, the node can run out of memory. This usually results in some pods being killed to free up memory.
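
One way to spot overcommit, assuming kube-state-metrics v2 is exposing these metrics, is to compare the memory limits scheduled on each node with the node's capacity. This illustrative query flags nodes whose limits exceed what they actually have:

sum by (node) (kube_pod_container_resource_limits{resource="memory"})
  / sum by (node) (kube_node_status_capacity{resource="memory"}) > 1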

Monitoring Kubernetes OOM

When using the Prometheus node exporter, there is a metric called node_vmstat_oom_kill. Tracking when OOM kills occur is important, but you may want to know about such events before they happen.
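
To catch kills that already happened, an illustrative alerting expression on that metric could be:

increase(node_vmstat_oom_kill[10m]) > 0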

Instead, you can check how close a process is to Kubernetes limits:

(sum by (namespace, pod, container)
(container_memory_working_set_bytes{container!=""}) / sum by
(namespace, pod, container)
(kube_pod_container_resource_limits{resource="memory"})) > 0.8

Kubernetes CPU throttling

CPU throttling is the behavior of slowing down a process when it is about to reach certain resource limits.

Similar to the memory case, these limits may be:

  • The Kubernetes Limit set on the container.

  • The Kubernetes ResourceQuota set on the namespace.

  • The actual CPU capacity of the node.

Consider the following analogy. We have a highway with some traffic where:

  • The CPU is the road.

  • Vehicles represent processes, and each vehicle has a different size.

  • Multiple lanes represent multiple cores.

  • A request would be a dedicated lane, such as a bike lane.

Throttling here manifests itself as a traffic jam: eventually, all processes will run, but everything will be slower.

How CPU is handled in Kubernetes

CPUs are handled using shares in Kubernetes. Each CPU core is divided into 1024 shares, which are then distributed among all running processes using the cgroups (control groups) feature of the Linux kernel.

If the CPU can handle all current processes, no action is required. If processes together ask for more than 100% of the available CPU, shares come into play. Kubernetes relies on the Linux kernel's CFS (Completely Fair Scheduler) mechanism, so processes with more shares get more CPU time.
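
To make the conversion concrete: a request of one full core maps to 1024 shares, so a 250m request becomes roughly 256 shares. Inside a container you can inspect the value the runtime applied (cgroup v1 path; on cgroup v2 the equivalent knob is cpu.weight):

cat /sys/fs/cgroup/cpu/cpu.shares   # e.g. 256 for a 250m CPU request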

Unlike with memory, Kubernetes does not kill a pod because of throttling; it just slows it down.

CPU statistics can be viewed in /sys/fs/cgroup/cpu/cpu.stat
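
On cgroup v1 that file exposes three counters: nr_periods (enforcement intervals elapsed), nr_throttled (intervals in which the cgroup was throttled), and throttled_time (total throttled time in nanoseconds). An illustrative reading, with made-up numbers, looks like this:

nr_periods 18985
nr_throttled 1289
throttled_time 145903582302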

Excessive CPU usage

As we saw in the Limits and Requests article, setting limits or requests is important when we want to constrain the resource consumption of a process. However, be careful not to set the sum of all requests larger than the actual CPU size of the node, since each container needs its requested share of CPU guaranteed.
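
A quick sanity check, again assuming kube-state-metrics v2, is to compare the CPU requests scheduled on each node against what the node can allocate. This illustrative query flags nodes where requests exceed allocatable CPU:

sum by (node) (kube_pod_container_resource_requests{resource="cpu"})
  / sum by (node) (kube_node_status_allocatable{resource="cpu"}) > 1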

Monitoring Kubernetes CPU throttling

You can check how close a process is to the Kubernetes limit:

(sum by (namespace,pod,container)(rate(container_cpu_usage_seconds_total
{container!=""}[5m])) / sum by (namespace, pod, container)
(kube_pod_container_resource_limits{resource="cpu"})) > 0.8

If you want to track how much throttling is happening in the cluster, cAdvisor provides container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total. With these two, you can easily calculate the percentage of throttled CPU cycles.
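
For example, the percentage of CPU periods in which each container was throttled over the last five minutes can be computed like this:

(sum by (namespace, pod, container)
(rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) / sum by
(namespace, pod, container)
(rate(container_cpu_cfs_periods_total{container!=""}[5m]))) * 100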

Best practice

Pay attention to limits and requests

Limits are a way of setting a maximum resource cap on your nodes, but they need to be treated with care, as you may end up with processes being throttled or killed.

Prepare to be evicted

By setting a very low request, you might think you are guaranteeing a minimum amount of CPU or memory for your process. But the kubelet evicts first the pods that are using more resources than they requested, so a low request marks them as the first to be killed!

If you need to protect specific pods from preemption (which happens when kube-scheduler needs to schedule new pods), assign a priority class to your most important workloads.
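
A minimal sketch of how that looks, with an illustrative PriorityClass name and value:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-workloads    # illustrative name
value: 1000000                # higher values are scheduled first and preempted last
globalDefault: false
description: "For pods that should not be preempted."

Pods then opt in by setting priorityClassName: critical-workloads in their spec.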

Throttling is the silent enemy

By setting unrealistic limits or overcommitting, you may not realize that your process is being throttled and performance is suffering. Proactively monitor your CPU usage and understand your actual limits in containers and namespaces.

Summary

Here’s a Kubernetes resource management cheat sheet for CPU and memory, summarizing this article and the other articles in the same series.