Part 3 of the DPDK series: CPU affinity and practical applications

Series of articles

Part 1 of the DPDK series: DPDK architecture explanation – CSDN Blog

Part 2 of the DPDK series: Detailed explanation of CPU Cache and performance application of DPDK in Cache – CSDN Blog

Basic concepts

The lscpu output shown below serves as a running example for the basic concepts that follow.

[root@cyber ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
Stepping: 7
CPU MHz: 3000.000
CPU max MHz: 3500.0000
CPU min MHz: 1000.0000
BogoMIPS: 4800.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 16896K
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47

NUMA:
Non-Uniform Memory Access (NUMA) differs from the classic design in which every core reaches memory through the North Bridge: the time a memory access takes depends on which processor issues it. That is because each processor in the system has its own local memory, which it can access quickly, while accessing remote memory, i.e. the local memory of another processor, must cross an extra bus and is therefore slower.

As the lscpu output above shows, this system uses the NUMA architecture and has two NUMA nodes.

Socket:

The number of physical CPUs in the machine. As the lscpu output above shows, there are two physical CPUs (Socket(s): 2).

Core:

The number of CPU cores on each physical CPU. In the lscpu output above, Core(s) per socket is 12.

Threads per core:

The number of logical CPUs each physical core presents when hyper-threading is enabled. In the lscpu output above, each core splits into 2 hardware threads.

Total number of CPU cores = number of physical CPUs × number of cores per physical CPU

Total number of logical CPUs = number of physical CPUs × number of cores per physical CPU × number of hyper-threads per core
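Plugging in the lscpu output above: 2 sockets × 12 cores per socket = 24 physical cores, and 24 × 2 threads per core = 48 logical CPUs, which matches the CPU(s): 48 line.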

What is CPU affinity

It is the tendency of a specific task to run on a given CPU for as long as possible without being migrated to another processor. Affinity in this sense can simply be called CPU binding.
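On Linux, a process can bind itself with the sched_setaffinity system call. Below is a minimal sketch; the choice of core 1 is arbitrary and for illustration only:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);   /* start from an empty CPU set */
    CPU_SET(1, &set); /* add logical core 1 to the set */

    /* Bind the calling process (pid 0 means "self") to the set. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("Now pinned to logical core 1\n");
    return 0;
}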

Why use affinity

As mentioned in the previous article on caches (Part 2 of the DPDK series: Detailed explanation of CPU Cache and performance application of DPDK in Cache – CSDN Blog), on a multi-core machine each CPU has its own cache holding the data the running process uses. If the process is not bound to a CPU, the operating system may schedule it onto another CPU, and cache-consistency problems arise. See the previous post for details.

Another important reason is that once a process is bound to a CPU core and the core is also isolated, a performance-critical process can monopolize that core, avoiding the performance loss caused by preemption from other processes.

The above two points have been fully applied in the practical application of DPDK.

When to use affinity

1. Doing heavy computation;

2. Testing complex applications (some products claim to perform better with more hardware; instead of buying one machine per processor configuration, we can ① buy a single multi-processor machine, ② progressively increase the number of processors allocated, ③ measure the transactions per second, and ④ evaluate how the results scale);

3. Running time-sensitive, deterministic threads.

Exclusive threads (affinity + core isolation)

DPDK avoids the overhead of cross-core task switching by binding threads to logical cores. However, other threads may still be scheduled onto the logical core a thread is bound to. To further reduce the impact of other tasks on top of affinity, core isolation can be used to remove logical cores from the kernel's scheduling system altogether.

Exclusive-thread configuration:

As shown below, add the isolcpus parameter in /etc/default/grub. This parameter specifies which cores to isolate and can be written in two ways: isolcpus=0,1,2 or isolcpus=0-2.
[Image: /etc/default/grub with the isolcpus parameter added]
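For reference, the relevant line might look like the sketch below, where the ellipsis stands for whatever parameters your system already carries:

GRUB_CMDLINE_LINUX="... isolcpus=0-2"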

After making the above change, run the grub update command so the configuration takes effect: on Ubuntu, run update-grub; on CentOS or Red Hat, run grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg (UEFI boot) or grub2-mkconfig -o /boot/grub2/grub.cfg (legacy BIOS boot), depending on how the system boots, and then reboot.

Practical use of DPDK

DPDK threads can serve as control threads or data threads. In the DPDK examples, the control thread is generally bound to the MASTER core (the first core read during eal_parse_args becomes the MASTER core); it accepts user configuration and passes the configuration parameters to the data threads. The data threads are spread across different cores to process packets. On a network processing device, the logical cores hosting the main thread and the data-processing threads are usually isolated to guarantee the workload exclusive use of them. Combined with DPDK's user-mode driver and polling mode (if this is unclear, refer to the first article: DPDK Series Part 1: DPDK Architecture Explanation – CSDN Blog), this largely prevents the corresponding logical cores from trapping into the kernel.

The following is the main-flow code of the DPDK l3fwd example. rte_eal_mp_remote_launch implements the logic described above: the registered main_loop is launched on every logical core. Inside main_loop, the same code can be executed everywhere, or the current core can be checked so that different cores execute different logic, as the sketch after the code shows.

/* launch per-lcore init on every lcore */
rte_eal_mp_remote_launch(l3fwd_lkp.main_loop, NULL, CALL_MASTER);
RTE_LCORE_FOREACH_SLAVE(lcore_id) {
    if (rte_eal_wait_lcore(lcore_id) < 0) {
        ret = -1;
        break;
    }
}
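To make the per-core branching concrete, here is a minimal sketch in the style of the CALL_MASTER-era API shown above; demo_main_loop is a hypothetical name, not l3fwd code:

#include <rte_lcore.h>

static int demo_main_loop(void *arg)
{
    (void)arg; /* unused */

    if (rte_lcore_id() == rte_get_master_lcore()) {
        /* control-plane work: accept and distribute configuration */
    } else {
        /* data-plane work: poll RX queues and process packets */
    }
    return 0;
}

/* launched on every lcore, e.g.:
 * rte_eal_mp_remote_launch(demo_main_loop, NULL, CALL_MASTER); */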

A simple example of core binding and isolation

By readers' request, here is a simple demonstration of core binding and isolation under Linux.

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("This is a demo program\n");

    /* Sleep forever so the process stays alive for the experiments below. */
    while (1) {
        sleep(1);
    }

    exit(0);
}

1. Compile the above code with gcc to produce an executable program (gcc -o core_bing core_bing.c).

2. As shown below, the device has 8 CPUs. After running the sample program built above and invoking the taskset tool, you can see that the process's affinity mask is 0xff, i.e. it has affinity for all 8 cores rather than being bound to a single one.

[Image: taskset output showing an affinity mask of 0xff across 8 CPUs]
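For reference, the check looks roughly like this (the PID 1234 is illustrative):

taskset -p 1234
pid 1234's current affinity mask: ff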

3. Use taskset to bind the process to one core. Checking again shows that only that single core is now associated with it.

[Image: taskset output after binding the process to a single core]
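With taskset, a running process can be re-bound using -p plus -c for a core list; the PID is again illustrative:

taskset -pc 1 1234
pid 1234's current affinity list: 0-7
pid 1234's new affinity list: 1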

4. Configure core isolation, isolating core 1 on its own.

[Image: grub configuration isolating core 1]
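After rebooting, the isolation can be verified from the kernel command line; on a system isolating core 1 the output should contain isolcpus=1:

cat /proc/cmdline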

5. After core isolation takes effect, only a handful of kernel threads still run on core 1.

[Image: process list showing only kernel threads running on core 1]
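One way to list the tasks currently running on core 1 is to print each task's processor (the PSR column) and filter on it:

ps -eo pid,psr,comm | awk '$2 == 1'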

6. Re-run the sample program bound to core 1; our sample program now appears on that core.

[Image: the sample program running on isolated core 1]
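Launching the program with the binding in place from the start looks like this:

taskset -c 1 ./core_bing &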