Linux Implementation Principles – Multi-thread Scheduling Overhead and Performance Optimization on the NUMA Multi-core Architecture

Foreword

NOTE: In this article, “thread” refers to the Kernel Thread, i.e. the schedulable execution unit of the kernel.

NUMA Architecture

The design concept of NUMA (Non-Uniform Memory Access) is to partition the CPUs and Main Memory into autonomous regions (Local NUMA nodes) that cooperate with each other across regions (Remote NUMA nodes), thereby alleviating the bandwidth bottleneck of a single shared memory bus.


Different NUMA nodes have roughly equal resources. A CPU accesses the Local Memory of its own node through the node's own memory bus, and accesses the Remote Memory on other Nodes through the shared interconnect on the motherboard.

Obviously, the time the CPU needs to access Local Memory and Remote Memory differs, hence the name “non-uniform memory access”. At the same time, because NUMA does not truly isolate storage, the system still keeps only a single copy of the operating system and of the database system; in other words, time-consuming remote accesses are likely to occur by default.

This approach gives NUMA a certain degree of scalability and makes it well suited to the server side. However, because NUMA does not achieve complete main-memory isolation, its scalability is also limited: it can support at most a few hundred CPUs/Cores. This is a compromise made in pursuit of higher concurrency performance.
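
On a Linux host, the node layout and the relative cost of local versus remote access can be inspected with the numactl tool (a hedged, illustrative example: numactl must be installed, and the sizes, CPU lists, and distances below are made up):

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 32143 MB
node 0 free: 26681 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 32253 MB
node 1 free: 25813 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

The distance matrix expresses the relative access cost: access within a node (10) is cheaper than access to the remote node (21), which is exactly the “non-uniform” part of NUMA.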


Basic object concepts

  • Node: A Node can contain several Sockets, usually one.
  • Socket: The package of a physical processor SoC.
  • Core: A physical processor core (Physical processor); a Socket typically encapsulates several Cores.
  • Hyper-Thread: Each Core can be virtualized into several (usually 2) logical processors (Virtual processors). Logical processors share most physical processor resources (e.g. memory cache, functional units).
  • Processor: CPU logical processor object at the operating system level.
  • Siblings: The operating-system-level mapping between a physical processor (Core) and the Virtual processors that belong to it.

The figure below shows a NUMA Topology, which means that the server has 2 Nodes, each Node contains one Socket, each Socket contains 6 Cores, and each Core is hyper-threaded into 2 Threads, so the server has a total of 2 x 1 x 6 x 2 = 24 Processors, where Siblings[0] = [cpu0, cpu1].

[Figure: NUMA Topology – 2 Nodes x 1 Socket x 6 Cores x 2 Hyper-Threads = 24 Processors]

View Host’s NUMA Topology

#!/usr/bin/env python
# SPDX-License-Identifier: BSD-3-Clause
# Copyright(c) 2010-2014 Intel Corporation
# Copyright(c) 2017 Cavium, Inc. All rights reserved.

from __future__ import print_function
import sys
try:
    xrange  # Python 2
except NameError:
    xrange = range # Python 3

sockets = []
cores = []
core_map = {}
base_path = "/sys/devices/system/cpu"
fd = open("{}/kernel_max".format(base_path))
max_cpus = int(fd.read())
fd.close()
for cpu in xrange(max_cpus + 1):
    try:
        fd = open("{}/cpu{}/topology/core_id".format(base_path, cpu))
    except IOError:
        continue
    except:
        break
    core = int(fd.read())
    fd.close()
    fd = open("{}/cpu{}/topology/physical_package_id".format(base_path, cpu))
    socket = int(fd.read())
    fd.close()
    if core not in cores:
        cores.append(core)
    if socket not in sockets:
        sockets.append(socket)
    key = (socket, core)
    if key not in core_map:
        core_map[key] = []
    core_map[key].append(cpu)

print(format("=" * (47 + len(base_path))))
print("Core and Socket Information (as reported by '{}')".format(base_path))
print("{}\n".format("=" * (47 + len(base_path))))
print("cores = ", cores)
print("sockets = ", sockets)
print("")

max_processor_len = len(str(len(cores) * len(sockets) * 2 - 1))
max_thread_count = len(list(core_map.values())[0])
max_core_map_len = (max_processor_len * max_thread_count) \
                       + len(", ") * (max_thread_count - 1) \
                       + len('[]') + len('Socket ')
max_core_id_len = len(str(max(cores)))

output = " ".ljust(max_core_id_len + len('Core '))
for s in sockets:
    output += "Socket %s" % str(s).ljust(max_core_map_len - len('Socket '))
print(output)

output = " ".ljust(max_core_id_len + len('Core '))
for s in sockets:
    output += "--------".ljust(max_core_map_len)
    output += " "
print(output)

for c in cores:
    output = "Core %s" % str(c).ljust(max_core_id_len)
    for s in sockets:
        if (s,c) in core_map:
            output += " " + str(core_map[(s, c)]).ljust(max_core_map_len)
        else:
            output += " " * (max_core_map_len + 1)
    print(output)

Output (on a host without Hyper-Threading):

$ python cpu_topo.py
======================================================================
Core and Socket Information (as reported by '/sys/devices/system/cpu')
======================================================================

cores = [0, 1, 2, 3, 4, 5]
sockets = [0, 1]

       Socket 0 Socket 1
       -------- --------
Core 0 [0] [6]
Core 1 [1] [7]
Core 2 [2] [8]
Core 3 [3] [9]
Core 4 [4] [10]
Core 5 [5] [11]

The meaning of the above output:

  • There are two Sockets (physical CPUs).
  • Each Socket has 6 Cores (physical cores), 12 in total.
  • Hyper-Threading is disabled (or not supported), so each Core corresponds to exactly one Processor.

Output (on a host with Hyper-Threading enabled):

$ python cpu_topo.py
======================================================================
Core and Socket Information (as reported by '/sys/devices/system/cpu')
======================================================================

cores = [0, 1, 2, 3, 4, 5]
sockets = [0, 1]

       Socket 0 Socket 1
       -------- --------
Core 0 [0, 12] [6, 18]
Core 1 [1, 13] [7, 19]
Core 2 [2, 14] [8, 20]
Core 3 [3, 15] [9, 21]
Core 4 [4, 16] [10, 22]
Core 5 [5, 17] [11, 23]

The meaning of the above output:

  • There are two Sockets (physical CPU).
  • Each Socket has 6 Cores (physical cores), 12 in total.
  • Each Core has two Virtual Processors, for a total of 24.

Multi-threading performance overhead in NUMA architecture

1. Cross-Node Memory access overhead

NUMA (non-uniform memory access) means that the time a Kernel Thread needs to access Local Memory and Remote Memory differs.

NUMA has the following two CPU allocation strategies:

  • cpu-node-bind: Constrains the Kernel Thread to run on specified NUMA Nodes.
  • phys-cpu-bind: Constrains the Kernel Thread to run on specified CPU Cores.

NUMA has the following four memory allocation strategies (both sets of strategies are illustrated with numactl after this list):

  • local-alloc: Constrains the Kernel Thread to allocate Memory only from the Local Node.
  • preferred: Loosely specifies a preferred Node for the Kernel Thread; if the preferred Node does not have enough Memory, allocations are allowed to fall back to Remote Node Memory.
  • mem-bind: The Kernel Thread may only allocate Memory from a specified set of Nodes, which do not necessarily include the Local NUMA Node.
  • interleave: The Kernel Thread's Memory allocations are spread across a specified set of Nodes in round-robin (RR) order.
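
A rough mapping of these policies onto the numactl command line (a hedged sketch: numactl must be installed, and ./app stands for an arbitrary user program):

$ numactl --cpunodebind=0 ./app       # cpu-node-bind: run only on the Cores of NUMA Node 0
$ numactl --physcpubind=0,1 ./app     # phys-cpu-bind: run only on CPU Cores 0 and 1
$ numactl --localalloc ./app          # local-alloc: allocate Memory only from the Local Node
$ numactl --preferred=1 ./app         # preferred: prefer Node 1, fall back to other Nodes if needed
$ numactl --membind=0,1 ./app         # mem-bind: allocate Memory only from Nodes 0 and 1
$ numactl --interleave=all ./app      # interleave: round-robin allocations across all Nodes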

2. Cross-Core multi-threaded Cache synchronization overhead

The NUMA Domain Scheduler is the Kernel Thread scheduler that the Kernel implements for the NUMA architecture; its purpose is to keep every Core in the NUMA system as busy as possible.

The NUMA Topology naturally forms a tree, so the scheduling domains are organized as a tree structure as well. NUMA Domain Scheduling traverses this tree from the leaf nodes up to the root node until the load in all NUMA Domains is balanced. Of course, users can set corresponding scheduling policies for different Domains.
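
On kernels built with CONFIG_SCHED_DEBUG, the domain tree of a given CPU can be inspected directly. A hedged example (the path, the number of domains, and the domain names depend on the topology and kernel version; newer kernels expose the same information under /sys/kernel/debug/sched/domains/ instead):

$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
MC
NUMA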


However, this balancing optimization across all Cores comes at a cost. For example, balancing several Kernel Threads that belong to the same User Process onto different Cores invalidates the per-Core Caches and causes performance degradation.

  1. Cache visibility (concurrency safety) issue: Kernel Threads running on Core1 and Core2 cache data in their respective L1/L2 Caches, and this data is not visible to the other Core. If Core1 does not write the data in its Cache back to Main Memory, Core2 will never see Core1’s modification of a variable, which leads to inconsistency of data shared between threads.
  2. Cache consistency (concurrency performance) issue: If multiple Kernel Threads running on multiple Cores share data, and that data is held in the Caches, then the Caches must be kept consistent. For example, to keep shared data consistent between the Kernel Threads running on Core1 and Core2, Core1’s modification of the variable must be forcibly written back to Main Memory, Core1 must then notify Core2 that the value has been updated, and Core2 must reload the latest value from Main Memory into its own Cache. The traffic generated to keep the Cache data consistent puts pressure on the main memory bus and thereby degrades CPU performance.
  3. Cache invalidation (concurrency performance) issue: For load-balancing reasons the scheduler may actively migrate threads, for example moving a Kernel Thread running on Core1 to an idle Core2. Before the migration, the data in Core1’s Cache must first be written back to Main Memory; and if Core1 and Core2 belong to different NUMA Nodes at that moment, the even more time-consuming Remote Memory access occurs as well.


As shown in the figure below, the Cache cost differs between Domains. Although NUMA Domain Scheduling itself also has soft-affinity characteristics, it focuses on balanced scheduling across NUMA Cores rather than on guaranteeing the execution performance of the application.

It can be seen that the balanced scheduling mechanism of the NUMA Domain Scheduler is at odds with the goal of high concurrency performance.

[Figure: Cache synchronization costs at different levels of the NUMA scheduling-domain hierarchy]

3. Multi-thread context switching overhead

While a Core executes a task, the thread’s execution-site information is held in the Core’s Registers and Cache. This data set is called the Context, and it comes in the following 3 types:

  • User Level Context: PC program counter, registers, thread stack, etc.
  • Register Context: General registers, PC program registers, processor status registers, stack pointers, etc.
  • Kernel Level Context: Process descriptor (task_struct), PC program counter, registers, virtual address space, etc.

Multi-threaded Context Switch can also be divided into 2 levels:

  1. User Level Thread switching: Multiple User Threads implemented by a high-level language’s thread library are switched on a single Core, either passively by time-slice rotation or cooperatively by yielding. Since the User Level Context of a User Thread is very lightweight and all User Threads share the virtual address space of the same User Process, a User Level Context Switch is cheap and fast.
  2. Kernel Level Thread switching: Multiple Kernel Threads are scheduled and switched across multiple Cores by the NUMA Domain Scheduler in the Kernel. Since the Context of a Kernel Thread is larger (the Kernel Thread’s execution site includes the task_struct structure, registers, program counter, thread stack, etc.) and the switch involves cross-Core data synchronization and main memory access, a Kernel Level Context Switch is expensive and slow.

During a thread switch, the Context of the current thread is first saved to the corresponding user or kernel stack, and then the Context of the next thread to run is loaded into the Core’s Registers and Cache.


It can be seen that multi-threaded Context Switch will inevitably lead to a decrease in processor performance. And User Level and Kernel Level switching are likely to occur at the same time. These are the costs of applying multi-threading mode.

Use the vmstat command to view the context switching of the current system:

$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 4 1 0 4505784 313592 7224876 0 0 0 23 1 2 2 1 94 3 0
  • r: The length of the CPU run queue and the number of running threads.
  • b: The number of blocked processes.
  • swpd: The size of virtual memory used. If it is greater than 0, it means that the machine’s physical memory is insufficient. If the program memory leak is not the cause, then the memory should be upgraded or memory-consuming tasks should be migrated to other machines.
  • si: The size of the virtual memory read from the disk per second. If it is greater than 0, it means that the physical memory is insufficient or there is a memory leak. Processes that consume a lot of memory should be killed or migrated.
  • so: The size of virtual memory written to disk per second, if greater than 0, the same as above.
  • bi: The number of blocks received from block devices per second. Block devices here are all disks and other block devices on the system; the default block size is 1024 bytes.
  • bo: The number of blocks sent to block devices per second. For example, when reading a file, bo will be greater than 0. bi and bo are generally close to 0, otherwise the I/O is too frequent and needs to be adjusted.
  • in: Number of CPU interrupts per second, including timer interrupts.
  • cs: The number of context switches per second. The smaller, the better; if it is too large, consider reducing the number of threads or processes, because too many context switches mean the CPU spends most of its time switching contexts instead of executing tasks (a per-process view is shown after this list).
  • st: CPU overhead on other tenants in a virtualized environment.
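
The cs column is system-wide. To attribute context switches to a specific process or thread, the counters in /proc can be read, or pidstat -w from the sysstat package can be used (a hedged example: the PID 2345 and the counts are illustrative):

$ grep ctxt /proc/2345/status
voluntary_ctxt_switches:        1874
nonvoluntary_ctxt_switches:     53
$ pidstat -wt -p 2345 1         # per-thread voluntary/involuntary switches per second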

4. CPU operating mode switching overhead

CPU operating mode switching also affects execution performance, but its cost is lower than that of a context switch, because a mode switch mainly has to save and restore the thread’s register context.

The following operations in Linux systems trigger CPU operating mode switching:

  1. System call / soft interrupt: When an application needs to access Kernel resources, it enters kernel mode through the SCI (system call interface) to execute the corresponding kernel code, and returns to user mode after completing the required operations (a rough way to observe this is shown after this list).
  2. Interrupt handling: When an interrupt event occurs in a peripheral, an interrupt signal is sent to the CPU. The Kernel must respond to the interrupt immediately, enter kernel mode to execute the corresponding interrupt handler, and then return to user mode after processing.
  3. Exception handling: When a runtime error or other abnormal situation occurs in the Kernel, such as a page fault, a division-by-zero error, or an illegal operation, the operating system enters kernel mode to execute the corresponding exception handler, performs error recovery or raises a prompt, and then returns to user mode.
  4. Kernel Thread switching: When the Kernel Thread underlying a User Process is switched, the corresponding Kernel Level Context must first be switched and executed in kernel mode, and execution finally returns to the User Process code in user mode.
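
As a rough illustration, the system calls a program issues (and therefore a lower bound on its user/kernel mode switches) can be counted with standard tools (a hedged sketch: ./app is an arbitrary program, and tracepoint counting may require root):

$ strace -c ./app                              # per-syscall count and time summary, printed when ./app exits
$ perf stat -e raw_syscalls:sys_enter ./app    # total number of system calls made by ./app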


5. Interrupt processing overhead

A hardware interrupt (HW Interrupt) is a mechanism for communication between peripherals (e.g. network cards, disk controllers, mice, serial adapter cards) and the CPU. It lets the CPU learn of events occurring in peripherals in a timely manner and, depending on the interrupt type, decide whether to put down the current task and handle the urgent peripheral event as soon as possible (e.g. the arrival of an Ethernet data frame, or keyboard input).

The essence of a hardware interrupt is an IRQ (Interrupt Request) electrical signal. The Kernel assigns each peripheral an IRQ Number to distinguish which type of device issued the interrupt, and the IRQ Number is mapped to an interrupt handler (usually provided by the peripheral’s driver) in the Kernel’s ISR (Interrupt Service Routines) table.
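
The IRQ Numbers and how interrupts are distributed across CPUs can be seen in /proc/interrupts (a hedged, heavily abbreviated example; the counts and device names are illustrative):

$ cat /proc/interrupts
            CPU0       CPU1
   0:         42          0   IO-APIC    2-edge      timer
  25:    1283921        117   PCI-MSI    524288-edge eth0-tx-rx-0
 ...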

Hardware interrupts are one of the highest-priority task types in Kernel scheduling and are scheduled preemptively, so a hardware interrupt is usually accompanied by a task switch: the current task is switched out in favor of the interrupt handler’s context.

An interrupt handler first needs to save the CPU’s status register data onto the stack in virtual memory space, then run the interrupt service routine, and finally restore the status register data from the stack back into the CPU. The whole process takes at least 300 CPU clock cycles. Moreover, on a multi-core processor platform each Core may execute hardware interrupt handlers, so the Cache consistency traffic of cross-Core processing is incurred here as well.

It can be seen that a large number of interrupt processing, especially hardware interrupt processing, will consume a lot of CPU resources.

6. Overhead of TLB cache invalidation

Because the TLB (Translation Lookaside Buffer, the cache of address mapping entries) is very small, on an operating system that uses 4K small pages frequent Kernel Thread switching causes the virtual-address mapping entries cached in the TLB to change frequently, resulting in a large number of TLB cache misses.
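
TLB behavior can be sampled with perf, provided the hardware exposes the corresponding events (a hedged example: event availability depends on the CPU, and ./app is an arbitrary program):

$ perf stat -e dTLB-loads,dTLB-load-misses,iTLB-load-misses ./app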

7. Memory copy overhead

In the network packet-processing scenario, the NIC Driver runs in kernel mode. When the Driver receives a packet, the packet is first copied into the kernel TCP/IP Stack for processing, and then copied again into the application buffer in user space. The time spent on these copies can account for 57.1% of the total packet-processing time.

Performance optimization in NUMA architecture: using multi-core programming instead of multi-threading

In order to solve the above problems and further improve the performance of multi-core processor platforms under the NUMA architecture, the idea of “multi-core programming instead of multi-thread programming” should be widely adopted: establish affinity between a Kernel Thread and a NUMA Node or Core, so as to avoid the overhead caused by multi-thread scheduling.

NUMA affinity: Prevent CPU from accessing memory across NUMA

On the Linux Shell, the numastat command can be used to view the memory allocation statistics of each NUMA Node, and the numactl command can be used to bind a User Process to a specified NUMA Node or to specified NUMA Cores.
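
A hedged example of numastat output (the counters are illustrative); growing numa_miss and other_node counters are a sign that memory is being allocated or accessed across nodes:

$ numastat
                           node0           node1
numa_hit              1775216108      1756905957
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit             68720           68720
local_node            1775205801      1756890439
other_node                 10307           15518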

CPU Affinity: Avoid Kernel Thread switching across CPU Cores

CPU Affinity is a scheduling property of the Kernel Thread which requests that the Kernel Thread run on a specific CPU for as long as possible without being migrated to other CPUs. In the NUMA architecture, setting the CPU affinity of a Kernel Thread can effectively improve the Thread’s CPU Cache hit rate and reduce the loss caused by Remote NUMA Memory access, thereby obtaining higher performance.

  • Soft CPU affinity: The default scheduling policy of the Linux Scheduler; the scheduler tries to keep a Kernel Thread running on the same CPU.
  • Hard CPU affinity: It is the programmable CPU affinity provided by the Linux Kernel. The user program can explicitly specify which CPU or CPUs the Kernel Thread corresponding to the User Process runs on.

Hard CPU affinity is implemented by extending the task_struct (process descriptor) structure with the cpus_allowed field, which represents the CPU affinity bitmask (BitMask). cpus_allowed consists of n bits, corresponding to the n Processors in the system: the lowest bit represents the first Processor and the highest bit represents the last. Affinity to a Processor is specified by setting its mask bit to 1; when multiple mask bits are set to 1, the running thread may be migrated among those Processors. By default all bits are set to 1, and a process’s CPU affinity is inherited by its child threads.

On the Linux shell, the taskset command can be used to set the CPU affinity of a User Process, but it does not guarantee NUMA-affine memory allocation.
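
A hedged example (./app is an arbitrary program and the PID 2345 is illustrative):

$ taskset -c 0,1 ./app                 # launch ./app pinned to CPUs 0 and 1
$ taskset -pc 2345                     # query the current affinity of PID 2345
pid 2345's current affinity list: 0-23
$ taskset -pc 0-5 2345                 # re-pin PID 2345 to CPUs 0-5

Because taskset only constrains the CPUs, it is typically combined with numactl --membind (or replaced by numactl altogether) when NUMA-affine memory allocation is also required.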

IRQ (Interrupt Request) Affinity

Linux provides the irqbalance daemon to optimize interrupt load. In most scenarios the interrupt distribution optimization provided by irqbalance plays a positive role: irqbalance automatically collects system data, analyzes the usage pattern, and, based on the system load, switches its working state between the following 2 modes:

  • Performance mode: irqbalance will distribute interrupts as evenly as possible to the cores of each CPU to fully improve performance.
  • Power-save mode: irqbalance will concentrate interrupt processing on the first CPU to ensure the sleep time of other idle CPUs and reduce energy consumption.

Of course, hardware interrupt handling also has an affinity attribute, which is used to specify the CPU that runs the ISR for a given IRQ. On the Linux shell, the smp_affinity of a given IRQ Number can be modified to steer it to specific CPUs. Note that manually specifying IRQ affinity requires shutting down the irqbalance daemon first.
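
A hedged example (the IRQ number 25 is illustrative; writing smp_affinity requires root, and the value is a hexadecimal CPU bitmask):

$ systemctl stop irqbalance               # stop the balancing daemon first
$ cat /proc/irq/25/smp_affinity           # current CPU bitmask for IRQ 25
ffffff
$ echo 2 > /proc/irq/25/smp_affinity      # steer IRQ 25 to CPU1 (bit 1)
$ cat /proc/irq/25/smp_affinity_list      # the same setting expressed as a CPU list
1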

Use huge page memory

  • “Linux Implementation Principles – Large Page Memory”
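
As a quick pointer (a hedged sketch assuming the default 2 MiB huge page size; the details are covered in the article referenced above):

$ grep Huge /proc/meminfo                       # current huge page usage
$ echo 1024 > /proc/sys/vm/nr_hugepages         # reserve 1024 x 2 MiB huge pages (requires root)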

– END –
