18 Case Study | Should your business use transparent huge pages? Water can carry a boat, but it can also capsize it

The case in this lesson comes from a stability problem I helped a business team analyze many years ago. The team reported that the CPU utilization on some of their servers would spike abnormally and then recover quickly; each episode lasted from a few seconds to a few minutes, and on the monitoring charts it looked like a glitch.

Because this type of problem is common, I would like to share how I located and analyzed it, so that when you run into high CPU utilization in the future, you will know how to work through it step by step.

CPU utilization is a very general concept. When we encounter high CPU utilization, we need to look at what the CPU is actually busy doing: is it handling interrupts, waiting for I/O, executing kernel functions, or executing user functions? This is where fine-grained CPU utilization monitoring comes in, because these detailed metrics are very helpful for analyzing the problem.

Fine-grained CPU utilization monitoring

Here we take the commonly used top command as an example to look at more detailed CPU utilization metrics (different versions of top may display them slightly differently):

%Cpu(s): 12.5 us, 0.0 sy, 0.0 ni, 87.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

top displays the us, sy, ni, id, wa, hi, si, and st metrics, which together add up to 100%. You may wonder whether monitoring CPU utilization at this level of detail brings significant extra overhead. The answer is no: CPU utilization monitoring is usually done by parsing /proc/stat, and that file already contains all of these detailed metrics.
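To make this concrete, here is a minimal sketch of how such monitoring works: it reads the aggregate cpu line of /proc/stat twice, one second apart, and prints each state's share of the interval. The field order in the comment is the standard /proc/stat layout; the script itself is illustrative, not the exact code any particular monitoring agent uses.

#!/bin/sh
# /proc/stat's first line is: cpu user nice system idle iowait irq softirq steal ...
# Sample it twice, one second apart, and compute each state's share of the delta.
read -r _ u1 n1 s1 i1 w1 h1 q1 st1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 h2 q2 st2 _ < /proc/stat
total=$(( (u2-u1)+(n2-n1)+(s2-s1)+(i2-i1)+(w2-w1)+(h2-h1)+(q2-q1)+(st2-st1) ))
echo "us=$((100*(u2-u1)/total))% sy=$((100*(s2-s1)/total))% id=$((100*(i2-i1)/total))% wa=$((100*(w2-w1)/total))%"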

Let's look more closely at what each of these metrics means; the definitions below are from the top manual:

us, user   : time running un-niced user processes
sy, system : time running kernel processes
ni, nice   : time running niced user processes
id, idle   : time spent in the kernel idle handler
wa, IO-wait: time waiting for I/O completion
hi         : time spent servicing hardware interrupts
si         : time spent servicing software interrupts
st         : time stolen from this vm by the hypervisor

Some notes on these metrics and points to keep in mind:

Of these, idle and wait (wa) are time when the CPU is not doing work; the remaining items are time when the CPU is working. The main difference between idle and wait is that idle means the CPU has nothing to do, while wait means the CPU wants to do something but cannot. You can also think of wait as a special kind of idle: idle during which at least one thread on that CPU is blocked on I/O.
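If you want to see the difference for yourself, one simple way (on a test machine, not production) is to generate synchronous disk I/O while watching top in another terminal; wa should rise while us and sy stay low. The file name testfile below is just a placeholder:

# Write 2GB with direct I/O so the writer blocks on the disk rather than
# dirtying the page cache; watch %wa climb in top while this runs.
$ dd if=/dev/zero of=testfile bs=1M count=2048 oflag=direct
# Clean up afterwards.
$ rm -f testfile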

Through this fine-grained monitoring of CPU utilization, we found that the high CPU utilization in this case was caused by a rise in sys utilization: sys would suddenly spike, for example to above 15% while usr stayed below 30%, and then return to normal after a few seconds.

So our next step was to capture the moment when sys utilization spikes.

Capturing the moment when sys utilization spikes

As mentioned earlier, high sys utilization means the CPU is spending too much time executing kernel functions, so we need to collect which kernel functions the CPU is executing at the moment sys spikes. There are many ways to do this (example invocations are sketched after this list), for example:

With perf, you can sample CPU hot spots to see which kernel functions are consuming the CPU when sys utilization is high;

With perf's call-graph feature, you can view the call stack, that is, the path by which the thread got to where it is;

With perf's annotate feature, you can see which statements inside a kernel function the thread spends its time on;

With ftrace's function-graph tracer, you can see how long these kernel functions take and which path consumes the most time.
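For reference, here is roughly what those invocations look like. This is only a sketch: the sampling duration is arbitrary, <symbol_name> is a placeholder for whatever hot function perf report shows, and on newer systems the tracefs path may be /sys/kernel/tracing instead.

# Sample all CPUs for 10 seconds with call graphs, then browse the report.
$ perf record -a -g -- sleep 10
$ perf report
# Drill into a specific hot function at the instruction/source level.
$ perf annotate <symbol_name>
# Trace kernel function call durations with ftrace's function_graph tracer.
$ echo function_graph > /sys/kernel/debug/tracing/current_tracer
$ cat /sys/kernel/debug/tracing/trace | head -50
$ echo nop > /sys/kernel/debug/tracing/current_tracer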

However, these common tracing methods are not well suited to a problem that appears and disappears in an instant, because they are better at collecting information over a period of time.

For such a transient state, what we want is a system snapshot that records what the CPU is doing at that moment; we can then combine the snapshot with the kernel source code to work out why sys utilization is high.

There is a tool that is very good at capturing this kind of transient system state, that is, at taking a system snapshot: sysrq. sysrq is a tool often used to analyze kernel problems. With it you can dump a snapshot of current memory usage or of the current tasks, you can trigger a vmcore to save the full state of the system, and you can even use it to kill the process with the largest memory footprint when memory is tight. sysrq is a powerful tool for analyzing many difficult problems.

To analyze problems with sysrq, you first need to enable it. I recommend enabling all sysrq functions; there is no need to worry about extra overhead, and doing so carries no risk. Enable it as follows:

$ sysctl -w kernel.sysrq=1
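You can verify the setting afterwards; a value of 1 means all sysrq functions are enabled. Making it persistent across reboots via /etc/sysctl.conf is the usual approach, though the exact mechanism depends on your distribution:

# 1 means all sysrq functions are enabled, 0 means disabled.
$ cat /proc/sys/kernel/sysrq
# Persist the setting (assuming a standard sysctl setup):
$ echo "kernel.sysrq = 1" >> /etc/sysctl.conf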

Once sysrq is enabled, you can use its 't' command to dump a snapshot of the current tasks, showing which tasks exist in the system and what each of them is doing:

$ echo t > /proc/sysrq-trigger

The task snapshot is then printed into the kernel buffer, and you can view it with the dmesg command:

$ dmesg

At the time, in order to capture this transient state, I wrote a script to collect the data. Here is a simplified example of that script:

#!/bin/sh

while true; do
     top -bn2 | grep "Cpu(s)" | tail -1 | awk '{
         # $2 is us (user), $4 is sy (system).
         if ($2 < 30.0 && $4 > 15.0) {
              # Build a timestamped file name and save the current us/sy values.
              "date" | getline d;
              close("date");
              split(d, str, " ");
              prefix=sprintf("%s_%s_%s_%s", str[2], str[3], str[4], str[5]);

              sys_usr_file=sprintf("/tmp/%s_info.highsys", prefix);
              print $2 > sys_usr_file;
              print $4 >> sys_usr_file;

              # Trigger sysrq to dump a task snapshot into the kernel buffer.
              system("echo t > /proc/sysrq-trigger");
         }
     }'
     sleep 1m
done

This script checks whether sys utilization is above 15% while usr is low, that is, whether the CPU is spending too much time in the kernel. When that happens, it triggers sysrq to save a snapshot of the current tasks. Note that the script only runs once per minute: we did not want to introduce much performance overhead, and at the time several of the business team's machines hit this problem two or three times a day, with some episodes lasting several minutes, so once per minute was enough to catch it. If your problem occurs less frequently or lasts a shorter time, you will need a more precise way of capturing it.

Transparent huge pages: water can carry a boat, but it can also capsize it

After we deployed the script, we captured the problem in the act. From the information output by dmesg, we found that the threads in the R state were all performing compaction (memory compaction), with call stacks like the following (this is an older kernel, version 2.6.32):

java R running task 0 144305 144271 0x00000080
 ffff88096393d788 0000000000000086 ffff88096393d7b8 ffffffff81060b13
 ffff88096393d738 ffffea003968ce50 000000000000000e ffff880caa713040
 ffff8801688b0638 ffff88096393dfd8 000000000000fbc8 ffff8801688b0640

Call Trace:
 [<ffffffff81060b13>] ? perf_event_task_sched_out + 0x33/0x70
 [<ffffffff8100bb8e>] ? apic_timer_interrupt + 0xe/0x20
 [<ffffffff810686da>] __cond_resched + 0x2a/0x40
 [<ffffffff81528300>] _cond_resched + 0x30/0x40
 [<ffffffff81169505>] compact_checklock_irqsave + 0x65/0xd0
 [<ffffffff81169862>] compaction_alloc + 0x202/0x460
 [<ffffffff811748d8>] ? buffer_migrate_page + 0xe8/0x130
 [<ffffffff81174b4a>] migrate_pages + 0xaa/0x480
 [<ffffffff81169660>] ? compaction_alloc + 0x0/0x460
 [<ffffffff8116a1a1>] compact_zone + 0x581/0x950
 [<ffffffff8116a81c>] compact_zone_order + 0xac/0x100
 [<ffffffff8116a951>] try_to_compact_pages + 0xe1/0x120
 [<ffffffff8112f1ba>] __alloc_pages_direct_compact + 0xda/0x1b0
 [<ffffffff8112f80b>] __alloc_pages_nodemask + 0x57b/0x8d0
 [<ffffffff81167b9a>] alloc_pages_vma + 0x9a/0x150
 [<ffffffff8118337d>] do_huge_pmd_anonymous_page + 0x14d/0x3b0
 [<ffffffff8152a116>] ? rwsem_down_read_failed + 0x26/0x30
 [<ffffffff8114b350>] handle_mm_fault + 0x2f0/0x300
 [<ffffffff810ae950>] ? wake_futex + 0x40/0x60
 [<ffffffff8104a8d8>] __do_page_fault + 0x138/0x480
 [<ffffffff810097cc>] ? __switch_to + 0x1ac/0x320
 [<ffffffff81527910>] ? thread_return + 0x4e/0x76e
 [<ffffffff8152d45e>] do_page_fault + 0x3e/0xa0
 [<ffffffff8152a815>] page_fault + 0x25/0x30

We can see from the call stack that this Java thread is allocating a THP (do_huge_pmd_anonymous_page). A THP, or transparent huge page, is 2MB of physically contiguous memory. Because there was no contiguous 2MB region available in physical memory at that moment, direct compaction was triggered. The compaction process can be represented by the figure below:

THP Compaction

The process itself is not complicated. During compaction, the thread scans for in-use movable pages from the front of the memory zone and scans for free pages from the back. Once scanning is complete, the movable pages are migrated into the free pages, eventually producing a contiguous 2MB block of physical memory so the THP allocation can succeed.

Direct compaction is very time-consuming, and on the 2.6.32 kernel it has to hold coarse-grained locks. While it runs, the thread also periodically checks (_cond_resched) whether other higher-priority tasks need to run; if so, it lets them run first, which stretches its own execution time even further. This is why sys utilization was so high. You can also see this in the comments in the kernel source code:

/*
 * Compaction requires the taking of some coarse locks that are potentially
 * very heavily contended. Check if the process needs to be scheduled or
 * if the lock is contended. For async compaction, back out in the event
 * if contention is severe. For sync compaction, schedule.
 *...
 */
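If you want to check on your own system whether THP allocations are stalling in direct compaction, the compaction and THP counters in /proc/vmstat are a useful first stop. The counter names below exist on most kernels from this era onward, but they vary somewhat across versions, so treat this as a sketch:

# How often allocations stalled in direct compaction, and how often
# compaction succeeded or failed.
$ grep -E 'compact_(stall|fail|success)' /proc/vmstat
# How often THP page faults got a huge page versus falling back to 4KB pages.
$ grep -E 'thp_fault_(alloc|fallback)' /proc/vmstat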

Once we found the cause, in order to fix the problem quickly in the production environment, we turned off THP on the business servers. After that, the systems became very stable and the high sys utilization never occurred again. To turn off THP, use the following command:

$ echo never > /sys/kernel/mm/transparent_hugepage/enabled
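You can confirm the change by reading the same file back; the active policy is shown in square brackets. On many kernels there is also a separate defrag knob that controls whether THP allocations are allowed to trigger direct compaction, which is worth checking as well (knob values differ between kernel versions):

# The value in brackets is the active policy, e.g. "always madvise [never]".
$ cat /sys/kernel/mm/transparent_hugepage/enabled
# Controls whether a THP allocation may stall in direct compaction/reclaim.
$ cat /sys/kernel/mm/transparent_hugepage/defrag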

After turning off THP in production, we evaluated its performance impact on the business in an offline test environment. We found that THP did not bring a significant performance improvement to this business, even when memory was not tight and memory compaction would not be triggered. This got me thinking: what kind of business is THP actually suitable for?

This starts from the purpose of THP. To make a long story short, THP maps a larger region of memory (a huge page) with a single page table entry. This reduces page faults, because fewer pages are needed, and it improves the TLB hit rate, because fewer page table entries are required; for a sense of scale, mapping 1GB of memory takes 262,144 entries with 4KB pages but only 512 entries with 2MB pages. If the data the process accesses lies within such a huge page, that page becomes hot and stays in the CPU cache, and its page table entry stays in the TLB; from the storage hierarchy discussed in an earlier lecture, we know this helps performance. Conversely, if the application has poor data locality, so the data it needs over a short period is scattered randomly across different huge pages, the advantage of huge pages disappears.

Therefore, when optimizing business performance with huge pages, we must first evaluate the data locality of the business and try to gather its hot data together, so that the benefit of huge pages is fully realized. Take the huge-page optimization I did while at Huawei as an example: we aggregated the business's hot data and placed it on huge pages, and compared with not using huge pages this brought a performance improvement of more than 20%; on architectures with smaller TLBs (such as MIPS), it brought more than 50%. We also made many optimizations to the kernel's huge-page code along the way, which I won't go into here.
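When doing this kind of evaluation, it also helps to know how much of your workload's memory is actually backed by THP. A simple way to check (with <pid> as a placeholder for the process you care about) is:

# System-wide: anonymous memory currently mapped with transparent huge pages.
$ grep AnonHugePages /proc/meminfo
# Per-process: sum the AnonHugePages fields across all of its mappings.
$ grep AnonHugePages /proc/<pid>/smaps | awk '{sum += $2} END {print sum " kB"}'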

Here are some suggestions for the use of THP:

Do not set /sys/kernel/mm/transparent_hugepage/enabled to always; set it to madvise instead. If you are not sure how to configure it, set it to never;

If you want to use THP to optimize the business, it is best to let the business opt in via madvise, that is, modify the business code to mark specific data regions for THP with madvise(MADV_HUGEPAGE), because the business knows its own data access patterns best;

Many times it will be troublesome to modify the business code. If you don’t want to modify the business code,

Summary

To review the key points of this lesson:

Use fine-grained CPU utilization monitoring: when CPU utilization is high, check which specific metric has risen;

sysrq is a powerful tool for analyzing high kernel-mode (sys) CPU utilization, and for many other difficult kernel problems; it is worth knowing how to use it;

THP can improve performance for some workloads, but it can also cause serious stability problems. It is best used via madvise; if you are not sure how to use it, turn it off.