20 Analysis | How to analyze the problem of high CPU utilization?

If you are an application developer, you should already know how to analyze application logic, and you probably have your own set of experience for optimizing application code to improve system performance. What we want to discuss in this lesson is how to expand that boundary, so that you can also analyze what lies outside your own code, including the Linux kernel, which is almost a black box to you.

In many cases, application performance problems can only be solved by analyzing kernel behavior, so the kernel exports a large number of metrics for applications to consult. When an application has a problem, we can check which of these metrics are abnormal and then analyze further. However, the metrics exported by the kernel cannot cover every scenario, and the problem we face may be harder: the application has a performance problem, yet every metric in the system looks normal. I believe many people have been driven crazy by this. When this happens, is something wrong with the kernel? What on earth is it doing? In this lesson I will walk you through how to analyze this kind of problem.

We know that for application developers, the boundary of the application is the system call; what lies beyond the system call is the Linux kernel. Therefore, to expand the boundary of the problems you can analyze, you first need to know how to analyze the system calls your application uses. For kernel developers the boundary is also the system call, except that what lies outside it is the application. If kernel developers want to expand their analysis boundary, they likewise need to know how to use system calls to trace application logic.

How to expand the boundaries of your analysis problems?

My understanding of application logic is not as deep as my understanding of the kernel. However, when application developers come to me for help, knowing nothing about their application's logic does not prevent me from analyzing the problem, because I know how to trace the application's logic with analysis tools. After a series of traces, you can build a rough picture of what the application is doing.

One of the tools commonly used to trace application logic is strace. strace analyzes the “boundary” between the application and the kernel: system calls. With strace, we can understand not only the logic executed by the application, but also some of the kernel's logic. As an application developer, you can therefore use this tool to expand the boundary of your analysis of application problems.

strace can trace all of a process's system calls or only specific ones, and it can report how long each system call takes. In many cases we can tell where a business delay occurs from the execution time of the system calls. For example, to trace the system calls of a multi-threaded program, we can use strace like this:

# -T: show the time spent inside each system call
# -tt: prefix each line with a wall-clock timestamp (microsecond resolution)
# -ff: follow every thread/child and write each task's trace to its own file
# -o: with -ff, output goes to strace.out.<tid> for each task instead of stderr
$ strace -T -tt -ff -p pid -o strace.out

However, before using strace to trace a process, I hope you first understand how strace works. This is also a goal of this lesson: you should know not only how to use a tool, but also how it works, so that when a problem arises you can judge whether the tool is applicable.

Understand how the tool works, not just how to use it

The principle of the strace tool is shown in the figure below (we use the above command as an example to illustrate):

Basic principles of strace

We can see from the figure that strace attaches to the running target process, which is achieved through the ptrace system call (the gdb tool works the same way). ptrace's PTRACE_SYSCALL request traces the target process's system calls: once the target process is being traced, every time it enters a system call it is stopped and a SIGTRAP-style event is raised. Through this event the tracer learns that the target process has entered a system call; the tracer then processes the system call, and the output we observe from the strace command is the result of that processing. After the tracer finishes processing, it resumes the target process, which continues to run until the next system call, and the cycle repeats.
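To make this mechanism concrete, here is a minimal sketch of an strace-like tracer, assuming an x86_64 Linux host. It only prints the raw system call number (taken from orig_rax) at every syscall stop, whereas the real strace also decodes arguments, return values and timing:

/* minimal_strace.c: attach to a pid and report its system calls.
 * Build with: gcc -o minimal_strace minimal_strace.c
 * Run with:   ./minimal_strace <pid>
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 1;
    pid_t pid = atoi(argv[1]);
    int status;

    /* Attach to the target process; it stops and we become its tracer. */
    ptrace(PTRACE_ATTACH, pid, NULL, NULL);
    waitpid(pid, &status, 0);

    while (1) {
        /* Resume the target until the next syscall entry/exit, where it is
         * stopped again and the stop is reported to us through waitpid().
         * Each system call therefore produces two stops: entry and exit. */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) < 0)
            break;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status))
            break;

        /* While the target is stopped, read its registers; on x86_64 the
         * system call number is in orig_rax. */
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, pid, NULL, &regs);
        printf("syscall %llu\n", (unsigned long long)regs.orig_rax);
    }

    ptrace(PTRACE_DETACH, pid, NULL, NULL);
    return 0;
}

Note that the target process cannot make any progress while it is stopped in this loop, which is exactly where the overhead described below comes from.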

You can see that the target process is interrupted at every system call and can only continue after strace has finished processing it, which introduces noticeable latency into the target process. Therefore I do not recommend using this command casually in a production environment; if you do want to use it to track down a problem in production, you must plan for that overhead first.

Suppose we use strace and find that a thread's latency jitter is caused by a certain system call taking too long; how do we continue the analysis? This is the moment when application developers and operations engineers need to expand their analysis boundary, and for kernel developers it is where the analysis actually begins.

Learn to use analysis tools commonly used by kernel developers

Let us illustrate with a real case. Once, business developers reported that strace showed their pread(2) system call taking a long time, often tens of milliseconds (ms) and sometimes even seconds, but they did not know how to analyze it further, so they asked us to help.

Since the problem had already been narrowed down to the pread(2) system call, the follow-up analysis is relatively easy for kernel developers. The most suitable tool for this type of problem is ftrace: we can use ftrace's function_graph tracer to find out where pread(2) spends all that time.

To track down where pread(2) takes a long time, we need to know what the kernel function corresponding to this system call is. We have two ways to easily obtain the kernel function corresponding to the system call:

The first is to check the declarations in the include/linux/syscalls.h file:
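In the 5.9-rc source, the pread-related declarations in that header look roughly like the excerpt below (the preadv() variants are omitted here):

asmlinkage long sys_pread64(unsigned int fd, char __user *buf,
                            size_t count, loff_t pos);
...
/* the ksys_ helpers are kernel-internal equivalents of the system calls */
ssize_t ksys_pread64(unsigned int fd, char __user *buf, size_t count,
                     loff_t pos);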

As you can see, there are several pread-related functions. Since our system is 64-bit, we only need to care about the 64-bit system calls, so we focus on the two functions ksys_pread64 and sys_pread64. From the comments in the header file we can tell that the former is meant to be called from inside the kernel, while the latter is the one exposed to user space, so in the kernel we need to trace the former. Also note that the corresponding functions may differ between kernel versions; here we take a recent kernel (5.9-rc) as the example.
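To see why tracing ksys_pread64 is enough, it helps to look at how the system call itself is defined. In the 5.9 sources (fs/read_write.c) the definition is just a thin wrapper, roughly:

SYSCALL_DEFINE4(pread64, unsigned int, fd, char __user *, buf,
                size_t, count, loff_t, pos)
{
        return ksys_pread64(fd, buf, count, pos);
}

The SYSCALL_DEFINE4 macro expands into the architecture-specific entry point (for example __x64_sys_pread64 on x86_64), so almost all of the real work, and therefore almost all of the latency, happens inside ksys_pread64.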

The second is to search the /proc/kallsyms file:

$ cat /proc/kallsyms | grep pread64

…ffffffffa02ef3d0 T ksys_pread64

Each line in /proc/kallsyms is one symbol: the first column is the symbol's address, the second column is the symbol's type, and the third column is the symbol's name. For example, the T in the output above means a global (external) text symbol, and symbols of this type can be traced. The meaning of each symbol type can be looked up with man nm.

Next, we will use ftrace's function_graph tracer to trace the ksys_pread64 function and see where in the kernel the time goes. function_graph is used as follows:

# First set the function to be traced
$ echo ksys_pread64 > /sys/kernel/debug/tracing/set_graph_function

# Secondly set the pid of the thread to be tracked. If there are multiple threads, you need to write each thread one by one.
$ echo 6577 > /sys/kernel/debug/tracing/set_ftrace_pid
$ echo 6589 >> /sys/kernel/debug/tracing/set_ftrace_pid

# Set function_graph to the current tracer to track function calls
$ echo function_graph > /sys/kernel/debug/tracing/current_tracer

Then we can read its output from /sys/kernel/debug/tracing/trace_pipe. Here is the latency breakdown I captured:

We can see that pread(2) was blocked in the io_schedule() function for 102 ms. io_schedule() means the thread was scheduled away because of I/O and had to wait for the I/O to complete before it could continue. From the function_graph output we can also see, step by step, how pread(2) reaches io_schedule. Since the whole call chain is rather long, only the key calls are shown here:

 21)               |  __lock_page_killable() {
 21)   0.073 us    |    page_waitqueue();
 21)               |    __wait_on_bit_lock() {
 21)               |      prepare_to_wait_exclusive() {
 21)   0.186 us    |        _raw_spin_lock_irqsave();
 21)   0.051 us    |        _raw_spin_unlock_irqrestore();
 21)   1.339 us    |      }
 21)               |      bit_wait_io() {
 21)               |        io_schedule() {

We can see that pread(2) ends up waiting inside __lock_page_killable. When pread(2) reads file content from disk into a memory page (page), it first locks that page and unlocks it once the read is done. If the page is already locked by another thread, for example while its I/O is in flight, then pread(2) has to wait; only after the page is unlocked by the I/O path can pread(2) continue reading the file content from that page. What we hit at the time was exactly this: when pread(2) wanted to read file content from disk into a page, the page was already locked, so the thread calling pread(2) waited here. This is perfectly reasonable kernel logic; there is nothing wrong with it. Next we need to find out why the page stayed locked for so long.
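To tie this back to code, below is a heavily simplified sketch of that read-side logic written as a kernel-style C fragment. The function name filemap_read_one_page() is invented for illustration; find_get_page(), PageUptodate(), lock_page_killable(), unlock_page() and put_page() are the real primitives the buffered read path uses:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Simplified sketch only: read one page of a file through the Page Cache.
 * (If the page is not in the Page Cache at all, the real code allocates it
 * and issues the disk read first; that part is omitted here.) */
static int filemap_read_one_page(struct address_space *mapping, pgoff_t index)
{
        struct page *page = find_get_page(mapping, index);  /* Page Cache lookup */

        if (page && !PageUptodate(page)) {
                /*
                 * The page is in the Page Cache but its contents are not valid
                 * yet, typically because its disk I/O is still in flight and
                 * the I/O path holds the page lock. Taking the lock here
                 * therefore sleeps, via __lock_page_killable() and
                 * bit_wait_io(), inside io_schedule() until the page is
                 * unlocked.
                 */
                if (lock_page_killable(page)) {
                        put_page(page);
                        return -EINTR;
                }
                unlock_page(page);
        }

        /* ... copy the now up-to-date page contents to the user buffer ... */

        if (page)
                put_page(page);
        return 0;
}

This also explains why the function_graph output above starts at __lock_page_killable: that is exactly the point where the reading thread gives up the CPU.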

Because the thread was blocked on disk I/O, we need to look at the system's disk I/O situation, which we can observe with iostat:

$ iostat -dxm 1

The tracking information is as follows:

Among them, sdb is the disk that holds the file read by the business's pread(2). Normally its read/write volume is small, but from the figure above we can see that the disk utilization (%util) sporadically shoots up to nearly 100%, the average I/O queue length (avgqu-sz) is very large, and w/s is much higher than usual. In other words, a burst of write I/O at that moment caused severe disk queuing and high disk utilization. Based on this information, we can judge that pread(2) takes so long to read the disk file because it is being starved by the write operations, so the next step is to find out who is issuing those writes.

Observing I/O behavior with iotop, we found that no user thread was performing write I/O; the writes were almost all issued by the kernel thread kworker. That is, user threads wrote content into the Page Cache, and kworker then wrote those Page Cache contents back to disk. This brings us to the subject of the first module of our course: how to observe the behavior of the Page Cache.

Write your own analysis tools

If you are still not sure how to observe the behavior of the Page Cache, I suggest going back and studying the first module of this course from the beginning; I will not repeat it here. However, I do want to mention a method that is not covered in the Page Cache module, which lets you find out which files are cached in memory and how large each one is.

The conventional way is to use fincore and mincore, but they are relatively inefficient. A more efficient way is to write a kernel module that traverses inodes to see what the Page Cache is made of. The module contains quite a lot of code, so I will only describe the core idea; the pseudocode is roughly as follows:

iterate_supers                 // traverse the super blocks
  iterate_pagecache_sb         // traverse the inodes in one super block
    list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
      // record the Page Cache size of this inode
      nrpages = inode->i_mapping->nrpages;
      /* Get the dentry corresponding to the inode, then resolve the file path
       * from that dentry. Note that the inode may no longer have a dentry,
       * because the dentry may have been reclaimed; in that case the file name
       * of the inode cannot be recovered.
       */
      dentry = dentry_from_inode(inode);
      dentry_path_raw(dentry, filename, PATH_MAX);

With this method you can see not only the files that processes currently have open, but also files that have already been closed while their contents are still in memory, so the analysis is more comprehensive.

By examining what was in the Page Cache, we found that a few specific files occupied a particularly large amount of memory, and that these files belonged to the offline business, that is, to the less important business. Because the offline business occupied so much Page Cache, the working set of the online business shrank considerably, so pread(2) frequently missed the Page Cache when reading file content and had to go to disk instead; in other words, the online business was suffering a large number of page-ins and page-outs.

At this point the solution is clear: by limiting the Page Cache size of the offline business, we can protect the working set of the online business and keep it from refaulting so often. After this adjustment, the business no longer experienced this kind of performance jitter.

Are you confused by the terms mentioned above, such as what the relationship between an inode and the Page Cache is? If so, it means you have not studied the Page Cache module of this course carefully enough, and I recommend working through it again from the beginning.

Summary

The content of this lesson is somewhat difficult for application developers and operations engineers. The reason I cover it anyway is that I hope to help you expand the boundary of the problems you can analyze. For kernel developers this lesson is essentially basic knowledge; if anything was unclear, it means you do not yet understand the kernel well enough and need to spend more time studying it.

Now review the key points of this lesson:

strace works at the boundary between applications and the kernel: the system calls. If you are an application developer and want to expand your analysis boundary, you need to understand how strace works and how to analyze the problems it reveals;

ftrace is a powerful tool for analyzing kernel problems, and you need to understand it;

For specific problems you may need to implement your own analysis tools; to implement such tools well, you have to master many kernel details.