C++ atomic variables and memory fences

Excerpted from: https://github.com/apache/brpc/blob/master/docs/cn/atomic_instructions.md#cacheline

English version

We all know that locks are commonly used in multi-core programming to prevent race conditions when multiple threads modify the same data. When locks become a performance bottleneck, we want to bypass them, and inevitably come into contact with atomic instructions. But in practice it is very difficult to write correct code with atomic instructions: unpredictable race conditions, the ABA problem, and memory fences are all brain-burning. This article tries to help everyone get started by introducing atomic instructions under the SMP architecture. C++11 officially introduced atomic instructions, so we describe them in its syntax.

As the name suggests, atomic instructions are instructions that software cannot subdivide. For example, x.fetch_add(n) adds n to x atomically; to software, the instruction either has not started or has completed, and no intermediate state can be observed. Common atomic instructions (x below is a std::atomic variable) include:

x.load(): returns the value of x.
x.store(n): sets x to n, returns nothing.
x.exchange(n): sets x to n and returns the value it held before.
x.compare_exchange_strong(expected_ref, desired): if x equals expected_ref, sets x to desired and returns true; otherwise writes the current value of x into expected_ref and returns false.
x.compare_exchange_weak(expected_ref, desired): like compare_exchange_strong, but may fail spuriously even when x equals expected_ref, so it is normally used in a loop.
x.fetch_add(n), x.fetch_sub(n): atomically performs x += n or x -= n and returns the value before the modification.
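As an illustration of how compare_exchange_weak is normally used (the retry loop absorbs its spurious failures), here is a minimal sketch; update_max is a made-up helper for illustration, not part of any library:

#include <atomic>

// Hypothetical helper: atomically raise `maximum` to `value` if `value` is
// larger. compare_exchange_weak may fail spuriously, which is why it is
// almost always wrapped in a retry loop like this one.
void update_max(std::atomic<int>& maximum, int value) {
    int current = maximum.load();
    // On failure the CAS writes the latest value of `maximum` back into
    // `current`, so the next iteration re-checks with fresh data.
    while (current < value &&
           !maximum.compare_exchange_weak(current, value)) {
        // retry with the refreshed `current`
    }
}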

You can already use these instructions to do atomic counting, for example having multiple threads accumulate the same atomic variable to count how many operations those threads have performed on some resource. However, there are probably two problems with this:

  • This operation is not as fast as you think.
  • If you try to control access to some resources through seemingly simple atomic operations, there is a high chance that your program will crash.
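For reference, the counting use case above in its simplest form; this is a sketch with made-up names (g_op_count, worker), not code from brpc:

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<long> g_op_count{0};   // shared counter touched by every thread

void worker() {
    for (int i = 0; i < 100000; ++i) {
        // fetch_add is atomic, so no increment is lost even under contention,
        // but each call may have to wait for the shared cacheline (see below).
        g_op_count.fetch_add(1, std::memory_order_relaxed);
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    std::printf("ops = %ld\n", g_op_count.load());   // always prints 400000
    return 0;
}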

Cacheline

Atomic operations that have no contention, or that are accessed by only one thread, are fast. "Contention" means multiple threads accessing the same cacheline at the same time. To obtain high performance at low cost, modern CPUs rely heavily on caches and divide them into multiple levels. The Intel E5-2620, common at Baidu, has 32K L1 dcache and icache, 256K L2 cache, and 15M L3 cache. The L1 and L2 caches are private to each core, while L3 is shared by all cores. A core writing to its own L1 cache is extremely fast (4 cycles, ~2ns), but when another core reads or writes the same memory, it has to make sure it sees the corresponding cacheline from the first core. For software this process is atomic: no other code can be interleaved in the middle, and it can only wait for the CPU to complete the consistency synchronization. This complex hardware algorithm makes atomic operations very slow; on the E5-2620 under fierce contention, fetch_add takes about 700 nanoseconds. Accessing memory that is frequently shared by multiple threads also tends to be slow. For example, some critical sections look very small, yet the spinlock protecting them performs poorly, because instructions such as exchange and fetch_add used by the spinlock must wait for the latest cacheline; even though only a few instructions are involved, it is not surprising that they take several microseconds.

To improve performance, we must keep the CPU from synchronizing cachelines frequently. This affects not only the performance of the atomic instruction itself but the overall performance of the program. The most effective solution is straightforward: try to avoid sharing.

  • A program that relies on a global multi-producer multi-consumer (MPMC) queue is unlikely to have good multi-core scalability, because the maximum throughput of that queue depends on the latency of cache synchronization rather than the number of cores. It is better to use multiple SPMC or MPSC queues, or even multiple SPSC queues, and avoid the contention at its source.
  • Another example is counters. If all threads frequently modify the same counter, performance will be poor, again because different cores keep synchronizing the same cacheline. If the counter is only used for logging, each thread can instead modify a thread-local variable and the values can be merged across threads when needed; the performance difference can be dozens of times (see the sketch after this list).
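A minimal sketch of the per-thread counter from the second bullet, assuming the total is only read occasionally; the names (counter_sketch, local_slot) are ours, not brpc's:

#include <atomic>
#include <mutex>
#include <vector>

// Each thread bumps its own slot (its own cacheline), so the hot path never
// synchronizes cachelines across cores. Readers merge all slots on demand.
namespace counter_sketch {

std::mutex g_mu;
std::vector<std::atomic<long>*> g_slots;   // one slot per thread, never reclaimed here

std::atomic<long>& local_slot() {
    // Initialized once per thread; registered so that read() can find it.
    thread_local std::atomic<long>* slot = [] {
        auto* s = new std::atomic<long>(0);
        std::lock_guard<std::mutex> g(g_mu);
        g_slots.push_back(s);
        return s;
    }();
    return *slot;
}

void add(long n) {   // hot path: touches no shared cacheline
    local_slot().fetch_add(n, std::memory_order_relaxed);
}

long read() {        // cold path: merge all per-thread slots
    std::lock_guard<std::mutex> g(g_mu);
    long sum = 0;
    for (const auto* s : g_slots) sum += s->load(std::memory_order_relaxed);
    return sum;
}

}  // namespace counter_sketch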

A related programming pitfall is false sharing: accesses to variables that are rarely modified, or even read-only, become significantly slower because other variables in the same cacheline are frequently modified, forcing a wait for cache synchronization. Variables used by multiple threads should be laid out according to their access patterns as much as possible, and variables frequently modified by other threads should be placed in separate cachelines. To align a variable or structure to a cacheline, include <butil/macros.h> and then use the BAIDU_CACHELINE_ALIGNMENT macro; please grep the brpc code to see the usage.
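In standard C++ the same idea can be expressed with alignas; a minimal sketch assuming 64-byte cachelines (kCachelineSize and PaddedCounter are made-up names, not the brpc macro):

#include <atomic>
#include <cstddef>

constexpr std::size_t kCachelineSize = 64;   // assumption: 64-byte cachelines

// alignas gives every PaddedCounter its own cacheline, so a thread writing
// `a` does not force the core reading `b` to re-fetch a shared cacheline.
struct alignas(kCachelineSize) PaddedCounter {
    std::atomic<long> value{0};
};

PaddedCounter a;   // frequently written by one thread
PaddedCounter b;   // read by another thread; no false sharing with `a`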

Memory fence

Atomic instructions alone cannot implement access control over resources. Even for something as simple as a spinlock or reference counting, code that looks correct may crash. The key is that instruction reordering changes the order of reads and writes: as long as there are no dependencies, instructions that appear later in the code may be moved earlier, and both the compiler and the CPU do this.

The motivation is natural: the CPU wants to fill every cycle and run as many instructions per unit time as possible. As mentioned in the previous section, memory access instructions spend hundreds of nanoseconds waiting for cacheline synchronization, and it is more efficient to synchronize several cachelines at once rather than one by one. As a result, modifications a thread makes to multiple variables in program order may become visible on the core running another thread in a different order. In addition, different threads need different data, and synchronizing on demand likewise leads to different orders of cacheline reads and writes.

If the first variable acts as a switch that controls access to the following variables, then when these variables are synchronized to other cores, the update order may change: the first variable may not be the first to be updated, yet another thread still takes it to mean that the other variables are valid, accesses data that is in fact not ready or already reclaimed, and triggers undefined behavior. For example, consider the following code snippet:

// Thread 1
// bool ready was initialized to false
p.init();
ready = true;

// Thread 2
if (ready) {
    p.bar();
}

From a human perspective, this is correct, because thread 2 will only access p when ready is true. According to the logic of thread 1, p should be initialized at this time. But on multi-core machines, this code may not run properly:

  • ready = true in thread 1 may be reordered by the compiler or the CPU to before p.init(), so that when thread 2 sees ready as true, p is still uninitialized. The same can happen inside thread 2: some code in p.bar() may be reordered to before the check of ready.
  • Even if there is no reordering, the values of ready and p will be independently synchronized to the cache of the core where thread 2 is located, and thread 2 may still see an uninitialized p when it sees ready as true.

Note: on x86/x64, loads have acquire semantics and stores have release semantics, so, setting aside compiler and CPU reordering, the code above can run correctly on those platforms.

This simple example gives a glimpse of the complexity of programming with atomic instructions. To solve the problem, the CPU and the compiler provide memory fences, which let users declare visibility relationships between memory access instructions. Boost and C++11 abstract memory fences into the following memory orders:

memory_order_relaxed: no fencing effect.
memory_order_consume: subsequent memory accesses that depend on this atomic variable must not be reordered before this instruction.
memory_order_acquire: subsequent memory accesses must not be reordered before this instruction.
memory_order_release: preceding memory accesses must not be reordered after this instruction; when the result of this instruction is visible to other threads, all preceding writes are visible as well.
memory_order_acq_rel: acquire + release semantics.
memory_order_seq_cst: acq_rel semantics, plus a strict total order among all instructions using seq_cst.

With memory order, the above example can be corrected as follows:

// Thread 1
// std::atomic<bool> ready was initialized to false
p.init();
ready.store(true, std::memory_order_release);

// Thread 2
if (ready.load(std::memory_order_acquire)) {
    p.bar();
}

The acquire in thread 2 is paired with the release in thread 1, ensuring that when thread 2 sees ready == true, it also sees all of thread 1's memory operations that happened before the release.

Note that a memory fence is not the same as visibility. Even if thread 2 happens to read ready just after thread 1 sets it to true, it may not see true, because cache synchronization has latency. What a memory fence guarantees is the order of visibility: "if I see the latest value of a, then I must also see the latest value of b."

A related question is: How do you know if the value you are seeing is new or old? Generally there are two situations:

  • The value is special. In the example above, ready == true is a special value: as soon as thread 2 sees ready as true, it knows the value has been updated. Once a special value is designated, reading or not reading it each carries a definite meaning.
  • The value always increases. In some scenarios there is no special value, but we can use instructions such as fetch_add to keep accumulating a variable. As long as the value range is large enough, over a long period a new value differs from every previous old value, so they can be told apart (see the sketch after this list).
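A minimal sketch of the "always increases" idea: a writer bumps a generation counter after publishing, and a polling reader treats any change as "new" (g_version and has_new_data are illustrative names):

#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> g_version{0};   // monotonically increasing generation

// Writer: after making the new data reachable, bump the generation.
void publish_new_data() {
    // ... publish the data itself first ...
    g_version.fetch_add(1, std::memory_order_release);
}

// Reader: since the generation only grows, a changed value is unambiguously
// newer; an equal value means nothing was published since the last check.
bool has_new_data(std::uint64_t& last_seen) {
    std::uint64_t v = g_version.load(std::memory_order_acquire);
    if (v == last_seen) return false;
    last_seen = v;
    return true;
}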

For examples of atomic instructions, see the examples in boost.atomic; the official description of std::atomic can be found in the standard library documentation.

wait-free & lock-free

Atomic instructions can give our services two important properties: wait-free and lock-free. The former means that no matter how the OS schedules threads, every thread is always doing useful work; the latter, which is weaker, means that no matter how the OS schedules threads, at least one thread is doing useful work. If our service uses a lock, the OS may switch out a thread that has just acquired it, in which case every thread that depends on that lock is waiting instead of doing useful work, so code that uses locks is not lock-free, let alone wait-free. To ensure that something is always completed within a bounded time, the critical code of real-time systems is at least lock-free. Baidu's extensive and diverse online services also have strict latency requirements; if the most critical parts of the RPC layer are wait-free or lock-free, they can provide more stable service quality. In fact, reads and writes in brpc are wait-free; see IO for details.

It is worth noting that lock-free or wait-free algorithms are commonly assumed to be faster, but the opposite may be true, because:

  • Lock-free and wait-free algorithms must deal with more numerous and more complex race conditions and ABA problems, so the code that achieves the same purpose is more complicated than the lock-based version. More code means more time spent executing it.
  • An algorithm that uses a mutex gets a "backoff" strategy for free. Backoff means that when contention occurs, the thread temporarily steps aside and tries again another way. When contention occurs on a mutex, the caller sleeps, so the thread that holds the lock can quickly finish a whole series of operations exclusively, and overall throughput may end up higher (see the sketch after this list).
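To make the backoff idea concrete, here is a hypothetical spinlock (not brpc's) whose lock() sleeps briefly under contention instead of hammering the shared cacheline:

#include <atomic>
#include <chrono>
#include <thread>

class BackoffSpinLock {
 public:
    void lock() {
        int delay_us = 1;
        // exchange() atomically sets the flag and returns its previous value;
        // `true` means another thread still holds the lock.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // Back off instead of spinning on the contended cacheline, giving
            // the holder a chance to finish its critical section exclusively.
            std::this_thread::sleep_for(std::chrono::microseconds(delay_us));
            if (delay_us < 64) delay_us *= 2;
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

 private:
    std::atomic<bool> locked_{false};
};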

When a mutex causes poor performance, it is usually because the critical section is too large (limiting concurrency) or contention is too fierce (context-switch overhead becomes prominent). The value of lock-free/wait-free algorithms lies in their guarantee that one or all threads are always doing useful work, not in absolute high performance. There is one case, however, where lock-free and wait-free algorithms probably do perform better: when the algorithm itself can be implemented with a handful of atomic instructions. Locks are themselves implemented with atomic instructions, so when the algorithm can be completed in one or two instructions, it is certainly faster than taking an extra lock.