Brief analysis of various locks in the Linux kernel: semaphore/mutex lock/read-write lock/atomic lock/spin lock/memory barrier, etc.

First of all, it must be clear that different locks act on different targets.
The following is a summary of the various locks, grouped by what they act on: the critical section, the CPU, memory, and the cache:

1. Atomic variable/spin lock – CPU

Since these are CPU-level locks, they matter on multi-core or multi-CPU machines. On a single core, only an interrupt can cause the running task to be preempted, so it is enough to disable interrupts before entering the critical section. On a multi-core CPU that is not enough: disabling interrupts on the current CPU only prevents the current CPU from running other code that enters the critical section, while the other CPUs may still execute code that enters it.

Atomic variable: in an x86 multi-core environment, when several cores compete for the data bus, the LOCK prefix is provided to lock the bus and guarantee the atomicity of a "read-modify-write" operation at the chip level. In practice, we simply declare a variable that is accessed by multiple threads with an atomic type, such as atomic_int x; or std::atomic<int> x;
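A minimal user-space sketch of this (the counter name and loop counts are just illustrative): several threads increment one std::atomic counter and no update is lost, even though no explicit lock is taken.

#include <atomic>
#include <thread>
#include <vector>

std::atomic<int> counter{0};   // atomic variable shared by all threads

int main()
{
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([] {
            for (int j = 0; j < 100000; ++j)
                counter.fetch_add(1, std::memory_order_relaxed); // atomic "read-modify-write"
        });
    for (auto &t : workers)
        t.join();
    // counter is exactly 400000: no increments are lost without any explicit lock
}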

Spin lock:
When a thread tries to acquire the lock and the lock is already held by another thread, the thread waits in a loop, repeatedly checking whether the lock can now be acquired. A kernel usage example is as follows:

#include <linux/spinlock.h>

// define the spin lock (it must be initialized before use, e.g. with spin_lock_init())
spinlock_t my_lock;

void my_function(void)
{
    spin_lock(&my_lock);
    // operations on the shared resource
    spin_unlock(&my_lock);
}

With a mutex, if the current thread fails to get the lock it gives up the CPU; with a spin lock, the thread busy-waits on the CPU until the lock becomes available, which gives a faster response. But if too many threads do this, several CPU cores are busy waiting, which degrades overall system performance.
Therefore a spin lock must not spin for long: if a spin lock is used to protect a critical section in user-mode code, the critical section must be as short as possible, and the granularity of the lock should be as small as possible.

Why is the response speed of the spin lock faster than that of the mutex?

As mentioned in Kobayashi coding, the spin lock is built on the CAS instruction (Compare And Swap) provided by the CPU, and the lock and unlock operations happen in user mode without triggering a thread context switch, so it is faster and cheaper than a mutex.
A mutex is different. As mentioned above, if locking a mutex fails, the thread relinquishes the CPU, and the resulting thread switch is done by the kernel. So when the lock attempt fails: 1) the thread first switches from user mode to kernel mode, the kernel changes the thread's state from "running" to "sleeping", and the CPU is switched to another thread; 2) when the mutex becomes available, the kernel moves the previously "sleeping" thread to the "ready" state (it enters the ready queue), and at a suitable time switches the CPU back to that thread, which then returns to user mode.
In this process there is not only the overhead of switching between user mode and kernel mode, but also the overhead of two thread context switches.
A thread context switch mainly involves saving and restoring the thread's stack, registers, thread-local storage, and so on.
A spin lock does not switch threads when the lock cannot be acquired; the thread just loops until it succeeds. The spin lock therefore never enters kernel mode and has no thread-switch overhead.
So if the lock is held only briefly, i.e. every thread gets in and out of the critical section quickly, a spin lock has the lowest cost.
Disadvantage of spin locks: as mentioned above, if the spinning lasts long or there are too many threads, CPU utilization drops, because every spinning thread is busy-waiting: it uses the CPU but does nothing useful.
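To make the "locked in user mode, no context switch" point concrete, here is a minimal user-space spin lock sketch built on an atomic test-and-set (std::atomic_flag) rather than a literal CAS instruction; the SpinLock class and its method names are illustrative, not a real library API.

#include <atomic>

class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()
    {
        // Busy-wait: keep retrying the atomic test-and-set until it reports
        // that the flag was previously clear, i.e. we now own the lock.
        while (flag.test_and_set(std::memory_order_acquire))
            ;                                   // spin: burns CPU, no context switch
    }
    void unlock()
    {
        flag.clear(std::memory_order_release);  // release the lock
    }
};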

2. Semaphore/mutex – critical section

Semaphore:
A semaphore is essentially a counter that describes the number of resources available in a critical section.
For example, a semaphore of 3 means 3 resources are available. If the semaphore was initialized to 3 and at some moment its value is 1, then 1 resource is still available and 2 processes/threads are currently using resources (or 2 resources have been consumed; what the resource is depends on the situation). Processes operate on the semaphore with P and V operations: the P operation decrements it by 1 before entering the shared resource, and the V operation increments it by 1 after leaving the shared resource (at any moment the semaphore therefore indicates how many more processes may enter the critical section).
When a semaphore is used for multi-thread communication, it is often initialized to 0 and two functions are used to synchronize the threads:
sem_wait(): wait on the semaphore; if its value is greater than 0, decrement it by 1 and return immediately; if its value is 0, the calling thread blocks.
sem_post(): release the resource, incrementing the semaphore by 1 (roughly an "unlock"), so that a thread blocked in sem_wait() can continue.
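A small sketch of this signalling pattern with POSIX semaphores on Linux (the data_ready name and the printed message are made up for illustration; link with -pthread):

#include <semaphore.h>   // POSIX semaphores: sem_init / sem_wait / sem_post
#include <cstdio>
#include <thread>

sem_t data_ready;                      // initialized to 0, used purely for signalling

void consumer()
{
    sem_wait(&data_ready);             // blocks while the semaphore is 0
    std::printf("data is ready\n");
}

int main()
{
    sem_init(&data_ready, 0, 0);       // pshared = 0 (threads of one process), initial value 0
    std::thread t(consumer);
    // ... produce the data ...
    sem_post(&data_ready);             // +1: wakes the waiting consumer
    t.join();
    sem_destroy(&data_ready);
}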

Note: the semaphore itself is also a shared resource, so its ++ operation (release a resource) and -- operation (acquire a resource) need protection as well. In the kernel this is done with a spin lock; if interrupts are involved, the interrupt flags (the EFLAGS register) are saved and interrupts are disabled for the duration of the operation, and afterwards the saved flags are restored and pending interrupts are handled.

struct semaphore {
    spinlock_t lock;              // spin lock protecting count and wait_list
    unsigned int count;
    struct list_head wait_list;
};

Mutex:
The semaphore indicates the number of available resources and allows multiple processes/threads inside the critical section. A mutex does not: its purpose is to let only one thread into the critical section, and any other thread that fails to get the lock blocks and waits. Threads enter the critical section mutually exclusively, which is where the name "mutex" comes from.
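A minimal user-space sketch with std::mutex (the variable and function names are illustrative); std::lock_guard locks on construction and unlocks automatically when the scope ends, so at most one thread is inside the critical section at a time.

#include <mutex>

std::mutex m;            // only one thread at a time can hold m
int shared_data = 0;     // illustrative shared resource

void update()
{
    std::lock_guard<std::mutex> guard(m);   // lock(); unlock() happens automatically on scope exit
    ++shared_data;                          // critical section: exactly one thread here at a time
}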
It is also worth mentioning the std::timed_mutex "sleeping" lock. The difference from a plain mutex is: with a mutex, a thread that fails to get the lock blocks and waits indefinitely, while with a timed lock the thread waits for at most a given time, e.g. 2 s; if the lock is still not acquired after that, the attempt fails (and we might, say, print a message), and if it is acquired the thread simply continues. For example, using the member function try_lock_for():

#include <chrono>
#include <iostream>
#include <mutex>

std::timed_mutex g_mutex;

void worker()
{
    // wait up to 2s to acquire the lock
    if (g_mutex.try_lock_for(std::chrono::seconds(2))) {
        // do something
        g_mutex.unlock();
    } else {
        // did not get the lock within 2s
        std::cout << "Failed to acquire lock\n";
    }
}

3. Read-write lock/preemption – critical section

Read-write lock:
For scenarios where reads are far more frequent than writes, locking reads and writes separately reduces lock granularity and improves performance.
A read-write lock allows multiple threads to read the shared resource at the same time, but only one thread at a time may write it. Since reads are usually much more frequent than writes, this improves concurrency. Read-write locks are higher-level locks and can be implemented on top of spin locks.
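A user-space sketch of the same idea with C++17 std::shared_mutex (the table and function names are illustrative): readers take a shared lock and may run concurrently, while a writer takes an exclusive lock.

#include <shared_mutex>
#include <map>
#include <string>

std::shared_mutex rw_lock;                 // read-write lock (C++17)
std::map<std::string, int> table;          // illustrative shared data

int read_value(const std::string &key)
{
    std::shared_lock<std::shared_mutex> lock(rw_lock);   // shared (read) lock: many readers at once
    auto it = table.find(key);
    return it == table.end() ? -1 : it->second;
}

void write_value(const std::string &key, int value)
{
    std::unique_lock<std::shared_mutex> lock(rw_lock);   // exclusive (write) lock: single writer
    table[key] = value;
}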

Preemption:
Preemption involves switching the process context, while an interrupt involves switching the interrupt context.
The kernel has supported kernel preemption since version 2.6; earlier kernels did not. Before that, as long as a task occupied the CPU and its time slice was not used up, it could keep the CPU unless an interrupt occurred.
A typical preemption scenario:
for example, a high-priority task gives up the CPU because it has to wait for a resource (or it is interrupted), and a lower-priority task then gets the CPU. When the resource arrives, the kernel lets the higher-priority task preempt the task currently running on the CPU. In other words, the running low-priority task has not used up its time slice and no interrupt has occurred, yet it is still kicked off the CPU.
To support kernel preemption, the kernel introduced the preempt_count field. Its initial value is 0; it is incremented by 1 whenever a lock is taken and decremented by 1 when the lock is released. When preempt_count is 0, the kernel can be preempted safely; when it is greater than 0, kernel preemption is disabled.

Per-CPU variables – act on the cache
A per-CPU variable gives each CPU its own copy of the data, so the CPUs do not contend for the same cache line; this avoids the coherence traffic (and the data-inconsistency problem between each CPU's cache and memory) as well as the need for a shared lock.

4. RCU mechanism/memory barrier – memory

RCU stands for Read-Copy-Update.
Like a read-write lock, the RCU mechanism allows multiple readers to read at the same time; but to update the data, a copy is made first, the modification is done on the copy, and then the old data is replaced with the copy in one step.
For example, to modify a node of a linked list, copy the node first, modify the value in the copy, and then point the predecessor's pointer at the copied node (refer to the link).

When no reader can still be reading the old data, its memory is reclaimed.
So the RCU mechanism has two cores: 1) update after copying; 2) deferred reclamation of memory.
With RCU there is no need to synchronize readers and writers and there is no contention between them, because readers read the original data while the writer modifies the copied memory; reads and writes can proceed in parallel.
Both sides access the data through a pointer: after the writer has finished writing, it assigns the pointer to the new data over the pointer the readers follow. The pointer assignment is atomic, so new readers will see the new data.
The old memory is reclaimed by a separate thread.
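A very rough user-space illustration of the copy-then-publish idea (this is not the kernel's rcu_read_lock()/synchronize_rcu() API; the Config type and names are made up, and the deferred-reclamation step is deliberately omitted, so the old object is simply leaked here):

#include <atomic>

struct Config { int value; };

std::atomic<Config*> g_config{new Config{1}};       // readers follow this pointer

int reader()
{
    // Readers just dereference the current pointer; no lock is taken.
    return g_config.load(std::memory_order_acquire)->value;
}

void writer(int new_value)
{
    Config *old = g_config.load(std::memory_order_relaxed);
    Config *copy = new Config(*old);                 // 1) copy the old data
    copy->value = new_value;                         // 2) modify the copy
    g_config.store(copy, std::memory_order_release); // 3) publish: atomic pointer swap
    // 4) real RCU defers freeing `old` until no reader can still hold it;
    //    that reclamation step is omitted in this sketch.
}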

Memory barrier:
Memory barriers are used to control the order of memory accesses, ensuring that instructions take effect in the expected order.
Code is often not executed in the order we wrote it; the reordering happens at two levels:
1) At the compiler level: compiler optimizations often rearrange the generated assembly instructions (refer to the blog).
2) At the CPU level: memory accesses can become out of order across multiple CPUs.
A memory barrier forces the compiler or the CPU to access memory in the intended order.

Out-of-order access at compile time:

int x, y, r;

void f()
{
    x = r;
    y = 1;
}

After compiling with optimization enabled, the resulting assembly may execute y = 1 first and x = r afterwards. You can generate the assembly with g++ -O2 -S test.cpp and inspect the result of -O2 optimization (refer to the article).
We can use the macro barrier() provided by the kernel to prevent this compiler reordering:

#define barrier() __asm__ __volatile__("" ::: "memory")

int x, y, r;

void f()
{
    x = r;
    barrier();   // compiler barrier: accesses may not be reordered across this point
    y = 1;
}

Or modify the related variables x and y with the volatile keyword:

volatile int x, y;

Note that the volatile keyword in C++ only prevents instruction reordering by the compiler; it does nothing about reordering across multiple CPUs, so at run time the accesses may still be out of order. Java's volatile keyword, by contrast, provides memory-barrier semantics at both the compiler and CPU levels.

Multiple CPUs access memory out of order:
On a single CPU (setting aside the reordering caused by compiler optimization), multi-threaded execution does not suffer from out-of-order memory access as observed by the program, because a single CPU fetches instructions in order (through a FIFO queue) and retires their results to the registers in order as well (also through a queue).
In a multi-CPU machine, however, each CPU has its own cache. When data x is first touched by a CPU, x is obviously not yet in that CPU's cache (a cache miss), so the CPU fetches it from memory and loads it into its cache, after which it can be accessed directly from the cache.
When a CPU performs a write, it must make sure the other CPUs have removed (invalidated) x from their caches to keep the data consistent; only after that invalidation completes can the CPU safely modify the data.
Clearly, with multiple caches a cache-coherence protocol is needed to avoid inconsistency, and this communication between caches is what can make memory accesses appear out of order.
There are three types of CPU-level memory barriers:

  1. General barrier, which orders both reads and writes: mb() and smp_mb()   // mb = memory barrier
  2. Write barrier, which only orders writes: wmb() and smp_wmb()
  3. Read barrier, which only orders reads: rmb() and smp_rmb()

These functions are also defined as macros. For example, mb() inserts an mfence instruction; applied to an example like the compile-time one above:

#define mb() __asm__ __volatile__("mfence" ::: "memory")   // full memory barrier in GNU inline asm

void f()
{
    x = 1;
    mb();     // mfence: the store above cannot be reordered after the load below
    r1 = y;
}

Note that all CPU-level Memory Barriers (except data-dependent barriers) imply compiler barriers.

Moreover, many thread synchronization mechanisms rely on memory barriers underneath. For example, atomic operations and spin locks are both built on the CAS operation (Compare And Swap) provided by the CPU. The basic idea of CAS is:
in a multi-threaded environment, to modify a shared variable you first read its old value and compute a new value, then atomically compare the variable's current value with the old value you read; if they are still equal, the new value is written and the modification succeeds, otherwise another thread got there first, the modification fails, and the operation must be retried.
When implementing CAS operations, memory barriers are needed to guarantee the ordering and visibility of the operation. For example, in Java, the compareAndSet method of the Atomic classes automatically inserts the necessary memory barriers to guarantee correctness.
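In C++ the same retry loop can be written explicitly with compare_exchange_weak; this sketch (the cas_add name is illustrative, and std::atomic::fetch_add would do the same job) shows the "read old value, compute new value, CAS, retry on failure" pattern described above.

#include <atomic>

std::atomic<int> value{0};

// Add `delta` to `value` with an explicit CAS loop.
void cas_add(int delta)
{
    int expected = value.load();
    // compare_exchange_weak: if value still equals `expected`, write the new value;
    // otherwise `expected` is refreshed with the current value and we retry.
    while (!value.compare_exchange_weak(expected, expected + delta))
        ;   // lost races (or spurious failures) simply loop again
}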

For application-level programming, C++11 introduced a memory model that ensures synchronization and consistency in multi-threaded programs. CPU-level memory barriers are part of this model and are used to enforce a particular ordering of memory operations. x86-64 itself performs only one kind of reordering: Store-Load, i.e. a later read may be reordered ahead of an earlier write.
The two basic kinds of barrier are store and load barriers. Usage examples:

// store (release) barrier
std::atomic<int> x;
x.store(1, std::memory_order_release); // reads/writes before this store cannot be reordered after it

// load (acquire) barrier
std::atomic<int> y;
int val = y.load(std::memory_order_acquire); // reads/writes after this load cannot be reordered before it

In addition to guaranteeing the order of instructions, the memory barrier at the CPU level must also ensure the visibility of data. Invisibility will lead to data inconsistency.
Therefore, acquire and release semantics are also used in the above code to set barriers for reading and writing respectively:

acquire: Ensure that the read and write operations after acquire will not occur before the acquire action
release: Ensure that the read and write operations before the release will not occur after the release action
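A short producer/consumer sketch of these acquire/release semantics (the payload/ready names are illustrative): the release store publishes the ordinary data, and the acquire load guarantees the consumer sees it.

#include <atomic>
#include <thread>
#include <cassert>

int payload = 0;                     // ordinary (non-atomic) data
std::atomic<bool> ready{false};

void producer()
{
    payload = 42;                                    // write the data...
    ready.store(true, std::memory_order_release);    // ...then publish: the write above stays before this store
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire))   // reads below stay after this load
        ;
    assert(payload == 42);   // guaranteed: the acquire load synchronizes with the release store
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
}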

In addition to the above atomic load and store, C++11 also provides a separate memory barrier function std::atomic_thread_fence, its usage is similar to the above:

#include <atomic>
std::atomic_thread_fence(std::memory_order_acquire);
std::atomic_thread_fence(std::memory_order_release);

5. Examples of using these locks in the kernel

Process scheduling: The kernel lock is used to protect the data structures of the scheduler to avoid errors caused by multiple CPUs modifying them at the same time.

// spin lock on the runqueue
spin_lock(&rq->lock);
...
spin_unlock(&rq->lock);

File system: The kernel lock is used to protect the metadata of the file system, such as inode, dentry and other data structures, to avoid errors caused by multiple processes accessing them at the same time.

spin_lock(&inode->i_lock);
...
spin_unlock(&inode->i_lock);

Network protocol stack: The kernel lock is used to protect the data structure of the network protocol stack, such as sockets, routing tables, etc., to avoid errors caused by multiple processes accessing them at the same time.

read_lock(&rt_hash_lock);
...
read_unlock(&rt_hash_lock);

Memory management: Kernel locks are used to protect memory management data structures, such as page tables, memory maps, etc., to avoid errors caused by multiple processes accessing them at the same time

spin_lock(&mm->page_table_lock);
...
spin_unlock(&mm->page_table_lock);