Linux Kernel: CPU Cache and Memory Barrier

1. CPU cache

The origin of the CPU cache

  • In every instruction cycle, the CPU must access memory at least once, if only to fetch the instruction itself from physical memory
  • Fetching operands or storing results usually requires additional memory accesses, so the CPU's processing speed is clearly limited by the speed of memory access
  • The solution is to exploit the principle of locality: provide a small-capacity, fast memory between the CPU and physical memory, called a cache

Caching overview

  • The cache is divided into "lines" (also called segments); each line corresponds to a block of memory. Typical line sizes are 32 bytes (earlier ARM, and x86 and PowerPC in the 90s/early 2000s), 64 bytes (newer ARM and x86), or 128 bytes (newer Power ISA machines)
  • The cache holds a partial copy of the data in physical memory
  • When the CPU reads data, it first checks whether the data is present in the cache. If it is, the cached copy is returned; if not, the data is read from physical memory and loaded into the cache

Cache and memory

The cache is divided into three levels: L1, L2, and L3

  • L1 Cache: the first-level cache of the CPU, split into an instruction cache and a data cache. The L1 cache capacity of a typical server CPU is 32-4096 KB. The L1 cache does not exchange data with main memory directly
  • L2 Cache: because the capacity of the L1 cache is limited, a larger high-speed memory, the second-level cache, was added (originally placed outside the CPU core) to further increase the CPU's computing speed
  • L3 Cache: today's L3 cache is built into the CPU package, mainly to further reduce memory latency and improve processor performance. Generally, multiple cores share one L3 cache

CPU system architecture

2. Cache consistency and MESI protocol

Single CPU cache read and write operations

  • Cache read operations
    • When the CPU reads data, it searches first in L1, then L2, then L3, then main memory, and finally in external storage (persistent media)
    • If only read operations were performed, the L1-L3 caches would always stay consistent with the data in main memory
  • Cache write operations
    • Write-through: data is written through the current cache level directly to the next level or main memory. After the write succeeds or fails, the corresponding cache content is updated or discarded accordingly, so the data in the cache stays consistent with main memory
    • Write-back: only the current cache level is modified, and a flag (dirty bit) is recorded for the line. The data is written back to the next level or main memory later based on that flag, and a dirty line must be written back before it can be evicted and reused. This also keeps the cache levels consistent with each other

Cache coherence protocol

  • Multi-core cache write problem
    • Scenario: read and write operations are performed on a multi-core processor where each core has its own cache
    • Suppose one CPU has cached a segment of main memory, and another CPU then writes to that memory segment. The writing CPU updates its own cache, but the other CPUs' cached copies are not updated. This is how cached data becomes inconsistent

Cache consistency convention

How can the above problem be solved? The causes of cache inconsistency are as follows:
  • Each core of a multi-core CPU has its own cache, and the data cached by one core is not shared with the others
  • The obvious idea is to let all cores share a single cache, but then processor performance would drop: every operation would have to wait for whichever CPU is currently writing before the next step could proceed
  • What we actually want is to keep the per-core caches but make them behave as if they were a single cache. The cache coherence protocol is designed to solve exactly this problem

Cached MESI protocol

There are many cache coherence protocols; the most typical is the MESI protocol, briefly described as follows:
  • Invalid (I) cache line: the line is not present in the cache, or its contents are stale
  • Shared (S) cache line: the data is valid and consistent with main memory and with other caches' copies; it can be used for read operations
  • Exclusive (E) cache line: the data is valid and consistent with main memory. The difference from S is that while this processor holds the line in the exclusive state, no other CPU's cache holds a valid copy
  • Modified (M) cache line: a dirty line, meaning the current CPU has modified it but has not yet written it back to main memory; it is exclusive to the current CPU
  • Summary: each CPU controls the reads and writes of its own cache while also snooping notifications from other CPUs, ensuring that the cached data is ultimately consistent. The E state removes the need to notify other processors before modifying a line, since no other cache holds a copy

CPU data reading and writing process

3. Memory barrier

CPU optimization technique: runtime instruction reordering

  • Why does instruction reordering occur?
When the CPU writes to a cache line and finds that the line is held by another CPU, then rather than stall, it may execute subsequent read instructions first to improve processing performance
  • Instruction reordering principle
Reordering must follow as-if-serial semantics: no matter how the compiler and processor reorder instructions to improve parallelism, the result of a (single-threaded) program must not change. Compilers, runtimes, and processors must all obey as-if-serial semantics, which means in particular that they will not reorder operations that have data dependencies on each other

Problems with CPU cache

  • The data in the caches and the data in main memory are not synchronized in real time, nor are the caches of different CPUs (or CPU cores); that is, at the same point in time, different CPUs may see different values for the same memory address
  • Instruction reordering is also a problem. Although it follows as-if-serial semantics, that only guarantees correct results for a single thread on a single-core CPU. With multiple cores and multiple threads, the instruction ordering no longer preserves cause and effect across threads, and out-of-order execution can make the program produce wrong results

Memory barrier

  • Definition
A memory barrier is a type of synchronization instruction that makes the CPU or compiler enforce a strict ordering on memory operations. That is, instructions before the memory barrier and instructions after it will not be reordered across it by system optimizations
  • Memory barrier instructions
    • Write memory barrier: inserting a Store Barrier after a store forces the latest updates in the write buffer/cache to be written to main memory, making them visible to other threads. With this explicit instruction, the CPU will not reorder the store with later writes for performance reasons
    • Read memory barrier: inserting a Load Barrier before a load invalidates the possibly stale data in the cache and forces it to be loaded fresh from main memory, keeping the CPU cache consistent with main memory and avoiding staleness problems
    • A full memory barrier ensures that the results of all read and write operations before the barrier are committed to memory before any read or write after the barrier is executed
  • Purpose
    • To solve the CPU cache visibility and reordering problems described above

Original author: keithl

Original address: CPU Cache and Memory Barrier, Tencent Cloud Developer Community (copyright belongs to the original author)

