08 Case | Shmem: The process does not consume memory, where does the memory go?

In the previous lesson, we talked about process heap memory leaks and the OOM problems that memory leaks can cause. In this lesson, we will continue with other types of memory leaks, so that when you find that system memory keeps shrinking, you know what may be consuming it.

Some memory leaks show up in a process's memory, which makes them relatively easy to observe; others cannot be judged from the memory consumed by any process, so they are easily overlooked. A Shmem memory leak is one of these easily overlooked cases, and it is what we will focus on in this lesson.

The process does not consume memory, where does the memory go?

I once encountered a real case of this in a production environment. Our operations staff found that the used memory on some machines kept growing, but with top and other commands they could not work out who was occupying the memory. As available memory shrank, business processes were killed by the OOM killer, which had a serious impact on the business, so they asked me to help find out what was going on.

As mentioned in the previous lessons, when the system is short of memory, the first thing to do is check /proc/meminfo to see which memory types are consuming the most, and then analyze those types specifically. But if you don't know what each item in /proc/meminfo means, then even when you can see which items look abnormal, you won't know how to continue the analysis. So it is best to remember what each item in /proc/meminfo stands for.

Back to our case: by checking /proc/meminfo on these servers, we found that the Shmem size was abnormally large:

$ cat /proc/meminfo
...
Shmem:          16777216 kB
...

So what does Shmem mean? How to further analyze who is using Shmem?

We mentioned in the earlier basics lessons that Shmem refers to anonymous shared memory, that is, memory a process requests with mmap (MAP_ANON | MAP_SHARED). You may wonder: shouldn't memory requested this way be counted in the process's RES (resident memory)? Take the following simple example:

#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

#define SIZE (1024*1024*1024)

int main()
{
        char *p;

        /* request 1GB of anonymous shared memory */
        p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED, -1, 0);
        if (p == MAP_FAILED)
                return -1;

        /* touch every page so the memory is actually allocated */
        memset(p, 1, SIZE);

        /* keep the process alive so we can observe it with top */
        while (1) {
                sleep(1);
        }

        return 0;
}

After running this program, you can see through top that the memory is indeed reflected in the process's RES, and also in its SHR. In other words, when a process requests memory with mmap like this, we can observe it through the memory consumption of the process.
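If you prefer to read the numbers directly from /proc instead of top, a quick check is shown below. This is only an illustration: replace <pid> with the pid of the test program, and note that the RssShmem field is only present on kernel 4.5 and later.

$ grep -E 'VmRSS|RssShmem' /proc/<pid>/status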

However, the problem we hit in our production environment was that the RES of every process was small and did not seem to correspond to the Shmem in /proc/meminfo at all. Why is this?

Let's start with the answer: this has to do with a special kind of Shmem. We know that disks are far slower than memory, so to improve performance some applications avoid writing data that does not need persistent storage to disk. Instead, they write this temporary data into memory, and then clear it to free the memory either periodically or once the data is no longer needed. To meet this need, a special kind of Shmem was born: tmpfs, shown in the figure below:

tmpfs is a file system that exists only in memory. Applications do not need to allocate and free the memory themselves; the operating system automatically sets aside part of the space, and the application only needs to write its data into it, which is very convenient. We can use the mount command or the df command to view the tmpfs mount points in the system:

$ df -h
Filesystem Size Used Avail Use% Mounted on
...
tmpfs 16G 15G 1G 94% /run
...

Just as with files written to disk: once a process has written a file and closed it, the file is no longer associated with the process, so the size of such files is not reflected in the process's memory. The same is true for files in tmpfs; they are not reflected in the memory footprint of any process. At this point, you may already have guessed why our Shmem took up so much memory: could it be that the tmpfs part of Shmem is large?
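To see this effect for yourself, here is a minimal sketch in the spirit of the earlier example. It assumes /dev/shm is a tmpfs mount (the default on most distributions) and simply writes 1GB into a file there; the file name tmpfs_demo is made up for the illustration:

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SIZE (1024*1024*1024)
#define CHUNK (1024*1024)

int main()
{
        /* /dev/shm is a tmpfs mount on most distributions */
        int fd = open("/dev/shm/tmpfs_demo", O_RDWR | O_CREAT, 0644);
        char *buf;
        size_t written;

        if (fd < 0)
                return -1;

        buf = malloc(CHUNK);
        if (!buf)
                return -1;
        memset(buf, 1, CHUNK);

        /* write 1GB into the tmpfs file, 1MB at a time */
        for (written = 0; written < SIZE; written += CHUNK) {
                if (write(fd, buf, CHUNK) != CHUNK)
                        return -1;
        }

        close(fd);

        /* After this process exits, Shmem in /proc/meminfo stays about
         * 1GB higher; it only drops after the file is deleted. */
        return 0;
}

After the program exits, Shmem in /proc/meminfo is still about 1GB higher than before, yet no process is holding that memory; it only drops once you run rm /dev/shm/tmpfs_demo. The memory belongs to the tmpfs file, not to any process.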

Since tmpfs is a file system, we can use df to view its usage, and therefore we can also use df to check whether tmpfs is taking up a lot of memory. It turned out that it was indeed consuming a lot of memory, and the problem then became very clear: we only needed to analyze which files were stored in tmpfs.
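A simple starting point is to see which entries under the tmpfs mount point are the biggest consumers, for example (this assumes GNU coreutils for sort -h):

$ du -sh /run/* 2>/dev/null | sort -rh | head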

We have encountered exactly this kind of problem in production: systemd kept writing logs to tmpfs without cleaning them up in time, and the configured tmpfs size was too large to begin with, so the logs generated by systemd kept growing and available memory kept shrinking.

The solution to this problem is to limit the size of the tmpfs used by systemd: when the log volume reaches the tmpfs size limit, clean up the temporary logs automatically, or clean them up periodically, which can be configured in systemd's configuration files. The size of the tmpfs itself can be adjusted with the following command (here, to 2G):

$ mount -o remount,size=2G /run
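If the logs are written by systemd-journald, which is one common setup, you can also cap how much of that tmpfs the volatile journal may use by adding a line like the following to /etc/systemd/journald.conf (200M is only an example value; journald rotates and removes old logs once the limit is reached):

RuntimeMaxUse=200M

Restart journald afterwards (systemctl restart systemd-journald) so the new limit takes effect.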

As a special kind of Shmem, the memory consumed by tmpfs is not reflected in any process's memory, which often makes troubleshooting difficult. To analyze this type of problem effectively, you must be familiar with the memory types in your system. Besides tmpfs, some other types of memory are also not reflected in process memory, such as memory consumed by the kernel: Slab (kernel caches), KernelStack (kernel stacks) and VmallocUsed (memory the kernel allocates through vmalloc) in /proc/meminfo. These are also things you need to check when you don't know who is occupying memory.
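These kernel-side items can be read straight out of /proc/meminfo, for example:

$ grep -E 'Slab|KernelStack|VmallocUsed' /proc/meminfo

Slab can be broken down further into the SReclaimable and SUnreclaim lines in the same file.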

If the memory consumed by tmpfs keeps accumulating and is never cleaned up, the end result is that the system runs out of available memory and OOM is triggered to kill processes. It may well kill important processes, or processes that you think should never be killed.

The dangers of OOM killing processes

The logic by which the OOM killer picks a process is roughly as shown in the figure below:

When the OOM killer needs to kill a process, it scans the system for killable processes, computes each process's final score (oom_score) from the memory the process occupies and its configured oom_score_adj, and then kills the process with the largest score. If several processes share the largest score, it kills the one that was scanned first.

A process's oom_score can be viewed through /proc/[pid]/oom_score. You can scan the oom_score of every process in your system; the one with the largest score is the one that will be killed first when an OOM occurs. Note, however, that since oom_score depends on the process's memory overhead, and that overhead changes dynamically, this value changes dynamically too.
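If you want to do such a scan, the small sketch below (just an illustration, not a tool from the course) walks /proc in C and prints each process's current oom_score along with its pid:

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

int main(void)
{
        DIR *proc = opendir("/proc");
        struct dirent *de;
        char path[64];
        FILE *f;
        int score;

        if (!proc)
                return -1;

        while ((de = readdir(proc)) != NULL) {
                /* only numeric directory names under /proc are processes */
                if (!isdigit((unsigned char)de->d_name[0]))
                        continue;

                snprintf(path, sizeof(path), "/proc/%s/oom_score", de->d_name);
                f = fopen(path, "r");
                if (!f)
                        continue;       /* the process may have already exited */
                if (fscanf(f, "%d", &score) == 1)
                        printf("%d %s\n", score, de->d_name);
                fclose(f);
        }

        closedir(proc);
        return 0;
}

Assuming you save the source as oom_scan.c (a name made up here), something like gcc -o oom_scan oom_scan.c && ./oom_scan | sort -rn | head lists the most likely OOM victims first.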

If you do not want a process to be killed first, you can adjust its oom_score_adj to change its oom_score; if your process must not be killed under any circumstances, you can set its oom_score_adj to -1000.

Generally speaking, we need to configure the oom_score_adj of some very important system services, such as sshd, to -1000, because once these system services are killed, it will be difficult for us to log in to the system.
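Setting it is just a write to procfs, and lowering the value requires root privileges. For example, replacing <pid> with the pid of the sshd instance you care about:

$ echo -1000 > /proc/<pid>/oom_score_adj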

However, apart from such system services, try not to set -1000 for your business programs, no matter how important they are. If a business program leaks memory and cannot be killed, the OOM killer will be woken up again and again as the program's memory footprint grows, and it will kill the other processes one by one. We have run into exactly this kind of case in production.

One of the jobs of the OOM killer is to find the process that keeps leaking memory and kill it. If it fails to pick the right process, other processes, possibly even more important business processes, will be killed by mistake.

Beyond killing innocent processes because of misconfiguration, the OOM killer's own strategy for choosing a victim may not be correct either. Now it is time to find fault with the kernel. This is also one purpose of this course series: to show you how to learn about the Linux kernel, while also reminding you to stay skeptical of it. The following case is a kernel bug.

On one of our servers, we found that when the OOM killer killed a process, it always killed the first process it scanned. Because that first-scanned process used very little memory, killing it freed hardly any memory, and OOM soon happened again.

This problem was triggered in a Kubernetes environment. Kubernetes configures important containers as Guaranteed (the corresponding oom_score_adj is -998) so that they are not killed when a system-wide OOM occurs. However, when an OOM occurs inside such a container, this kernel bug is triggered and the first process scanned is always the one killed.

To address this kernel bug, a patch (mm, oom: make the calculation of oom badness more accurate) was contributed to the community to fix the failure to select an appropriate process. The problem is described in detail in the commit log of that patch; you can take a look if you are interested.

Summary

In this lesson, we looked at memory leaks in tmpfs and how to observe them. The biggest difference between this type of leak and an ordinary process memory leak is that it is hard to locate from the memory consumed by processes, because this memory is not reflected in any process's RES. However, if you are familiar with the general methods for analyzing memory problems, you can still find it quickly.

When you don't know who is consuming memory, you can use /proc/meminfo to find out which type of memory is consuming the most, and then analyze that type of memory specifically.

You need to configure an appropriate OOM policy (oom_score_adj) to prevent important business processes from being killed too early (for example, by lowering their oom_score_adj to a negative value), while also considering whether doing so might cause other processes to be killed by mistake. You can compare the processes' /proc/[pid]/oom_score values to determine the order in which they will be killed.

Again, you need to learn the kernel, but you also need to be skeptical about the kernel.

In short, the better you understand the characteristics of the different memory types, the more efficiently you will analyze memory problems such as memory leaks. Being proficient with these memory types also helps you choose the appropriate one when your application needs to request memory.