14 | Standard library: How can mutexes and other tools coordinate thread execution?

In the previous lecture, we covered the basics of concurrent programming and walked through a simple example of creating threads in C. We also saw how data races, race conditions, and instruction reordering can break the correctness of multi-threaded applications. So, what tools can help us solve these problems?

Today, let’s look at the main tools the C language provides for concurrent programming: mutexes, atomic operations, condition variables, and thread-local variables.

Using mutexes

In essence, a mutex (short for "mutual exclusion") is a lock. Before a thread accesses a shared resource, it must lock the mutex; from that moment on, any other thread that tries to lock the same mutex blocks until the current holder releases it. When the lock is released, the blocked threads wake up and "compete" for the lock again, repeating the same steps. This guarantees that only one thread at a time operates on a resource shared by multiple threads.

In C, mutexes are available through the interfaces prefixed with "mtx_" provided by the header threads.h. You may remember the C example with a data race from the previous lecture; here it is, rewritten as follows:

#include <threads.h>
#include <stdio.h>

#define THREAD_COUNT 10
#define THREAD_LOOP 100000000

mtx_t mutex;
long counter = 0;

int run(void* data) {
  for (int i = 0; i < THREAD_LOOP; i++) {
    mtx_lock(&mutex);    // lock the mutex
    counter++;
    mtx_unlock(&mutex);  // release the mutex
  }
  printf("Thread %d terminates.\n", *((int*) data));
  return thrd_success;
}

int main(void) {
#ifndef __STDC_NO_THREADS__
  int ids[THREAD_COUNT];
  mtx_init(&mutex, mtx_plain);  // create a plain, non-recursive mutex object
  thrd_t threads[THREAD_COUNT];
  for (int i = 0; i < THREAD_COUNT; i++) {
    ids[i] = i + 1;
    thrd_create(&threads[i], run, ids + i);
  }
  for (int i = 0; i < THREAD_COUNT; i++)
    thrd_join(threads[i], NULL);
  printf("Counter value is: %ld.\n", counter);
  mtx_destroy(&mutex);  // destroy the mutex object
#endif
  return 0;
}

As you can see, in main a mutex object of the basic type (mtx_plain) is created with the mtx_init function (referred to below simply as "the mutex"). Then, inside the run function, before incrementing the variable counter, we lock this mutex with mtx_lock. Likewise, once a thread has finished using the shared variable, it unlocks the mutex with mtx_unlock so that other threads get a chance to work with it. Finally, before the program exits, the mutex is destroyed with mtx_destroy.

Generally speaking, C mutexes come in three types: mtx_plain, mtx_recursive, and mtx_timed.

Among them, mtx_plain is the simplest type of mutex. It supports basic locking and unlocking, but it cannot be used in scenarios that require "repeated locking" of the same mutex while already holding it (for example, recursive function calls). Even though the current thread holds the lock, locking the same mtx_plain mutex a second time blocks the thread, and a deadlock results: the thread is waiting for a lock that only it can release, but it can only release the lock after it stops waiting and finishes executing the current function.

By contrast, a mutex of type mtx_recursive is also called a "reentrant mutex". As the name suggests, it can be used where repeated locking is required: the same thread can lock it multiple times without blocking. Correspondingly, fully releasing it requires calling mtx_unlock the same number of times.

The last type, mtx_timed, has a special "timeout" attribute. With the mtx_timedlock function, we can perform "mutex locking with a time limit": when a thread attempts to lock such a mutex, it blocks only until a given deadline. If the mutex has not been acquired by then, the call returns and the thread continues execution.

In addition to the functions mentioned above, the C standard library provides two other mutex-related functions, organized in the table below for your reference.

Mutexes solve the data-race problem, but in scenarios with stricter performance requirements they may not be the best choice. Next, let’s look at another way to avoid data races: atomic operations.

Using atomic operations

Atoms are the basic particles that cannot be further divided in chemical reactions, so, as the name suggests, an "atomic operation" is one that cannot be divided into smaller steps. When we perform atomic operations on shared resources from multiple threads, the compiler and CPU guarantee their correct execution: only one thread performs the operation at any moment, and other threads can proceed with the same operation only after it has completed in full.

These atomic-operation capabilities are conveniently available through the C11 header stdatomic.h. For example, the following code solves the data-race problem from the previous lecture this way.

#include <threads.h>
#include <stdio.h>
#include <stdatomic.h>

#define THREAD_COUNT 10
#define THREAD_LOOP 100000000

#if !defined(__STDC_NO_ATOMICS__)
_Atomic long counter = 0;  // atomic global that records the accumulated value
#endif

int run(void* data) {
  for (int i = 0; i < THREAD_LOOP; i++)
    atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);  // atomic addition
  printf("Thread %d terminates.\n", *((int*) data));
  return thrd_success;
}

int main(void) {
#if !defined(__STDC_NO_THREADS__) && !defined(__STDC_NO_ATOMICS__)
  int ids[THREAD_COUNT];
  thrd_t threads[THREAD_COUNT];
  for (int i = 0; i < THREAD_COUNT; i++) {
    ids[i] = i + 1;
    thrd_create(&threads[i], run, ids + i);
  }
  for (int i = 0; i < THREAD_COUNT; i++)
    thrd_join(threads[i], NULL);
  printf("Counter value is: %ld.\n", counter);
#endif
  return 0;
}

Similar to the thread-control interfaces, you need the macro __STDC_NO_ATOMICS__ to determine whether the compiler supports atomic operations. As you can see, we added the corresponding preprocessor checks around the definition of counter and around the body of main.

Next, the original global variable counter is declared with the _Atomic keyword newly introduced in C11, defining it as an atomic type (you could also use the atomic_long macro that the C standard library provides as a shorthand).

Immediately afterwards, inside the run function, a function named atomic_fetch_add_explicit performs the accumulation of counter. It gives us an atomic add: the thread has exclusive use of the variable for the duration of the operation. Note also that through this function’s third parameter, you can specify the memory order the operation must satisfy.

As mentioned in the previous lecture, the compiler and processor may reorder instructions to improve performance, which can break the correctness of multi-threaded applications with inter-thread data dependencies running on multi-core CPUs. So, how do we solve this problem? Take a look at the code below. Here, I fixed the example from the last section of the previous lecture by specifying an explicit memory order for each atomic operation.

#include <threads.h>
#include <stdio.h>
#include <stdatomic.h>

#if !defined(__STDC_NO_ATOMICS__)
atomic_int x = 0, y = 0;
#endif

int run(void* v) {
  atomic_store_explicit(&x, 10, memory_order_relaxed);
  atomic_store_explicit(&y, 20, memory_order_release);
  return thrd_success;
}

int observe(void* v) {
  while (atomic_load_explicit(&y, memory_order_acquire) != 20);
  printf("%d", atomic_load_explicit(&x, memory_order_relaxed));
  return thrd_success;
}

int main(void) {
#if !defined(__STDC_NO_THREADS__) && !defined(__STDC_NO_ATOMICS__)
  thrd_t threadA, threadB;
  thrd_create(&threadA, run, NULL);
  thrd_create(&threadB, observe, NULL);
  thrd_join(threadA, NULL);
  thrd_join(threadB, NULL);
#endif
  return 0;
}

As you can see, the read and write operations on the atomic variables x and y in the run and observe threads have been changed. The function atomic_load_explicit reads the value of an atomic variable, and atomic_store_explicit writes one. Both functions let you specify, through their last parameter, the memory order the corresponding operation must follow.

In this modified code, a total of three different memory orders (corresponding to three enumeration values) are used. First, let’s look at their specific definitions. For ease of observation, this information is organized in the table below.

I believe that after reading the definitions of these three memory orders, you already have a general idea of what they do. For operations using memory_order_relaxed, no ordering guarantees are made beyond atomicity itself; the compiler and processor are free to reorder them as they see fit.

In the run thread, to ensure that the modification of x happens before the modification of y becomes visible, the store to y uses memory_order_release: it may complete only after all preceding writes have completed. Symmetrically, in the observe thread, to prevent the processor from loading a stale value of x into a register or cache ahead of time, the load of y uses memory_order_acquire, which guarantees that the read of y happens before the read of x. You can get an intuitive view of the execution relationships between these operations from the picture below.

In addition to the three memory orders used in the examples above, the C language provides three more for use in other scenarios.

Overall, through the stdatomic.h header, C11 provides a large number of types, macros, and functions for atomic operations. Compared with mutexes, atomic operations let us express parallel code more clearly and concisely, without frequent locking and unlocking.

Not only that: from a performance perspective, atomic operations usually map directly onto atomic machine instructions provided by the CPU; on the x86-64 platform, for example, atomic_fetch_add_explicit corresponds to a lock add instruction. Using a mutex may require blocking threads and frequent context switches, so atomic operations usually perform better by comparison.

Here, some common standard library functions related to atomic operations are organized in the table below for your reference.

Using condition variables

Condition variables are a commonly used thread-synchronization mechanism. As the example in the previous section shows, a very common synchronization pattern in multi-threaded applications is that one thread’s execution depends on another thread first preprocessing some data. In that example, part of the observe thread’s logic had to wait for the run thread to set the atomic variable y to 20, and we achieved this through "busy waiting".

Busy waiting meets our expectations, but it is a very "expensive" approach, arguably even an anti-pattern to be avoided: the thread repeatedly checks whether a condition holds, wasting valuable CPU time on useless work. So, is there a better way to resolve data dependencies between threads while minimizing wasted processor resources? Yes: condition variables.

Consider the following example:

#include <threads.h>
#include <stdio.h>

mtx_t mutex;
cnd_t cond;  // a condition variable
int done = 0;

int run(void* data) {
  mtx_lock(&mutex);
  done = 1;
  cnd_signal(&cond);  // notify the waiting thread
  mtx_unlock(&mutex);
  return thrd_success;
}

int main(void) {
#ifndef __STDC_NO_THREADS__
  mtx_init(&mutex, mtx_plain);
  cnd_init(&cond);  // initialize the condition variable
  thrd_t thread;
  thrd_create(&thread, run, NULL);
  mtx_lock(&mutex);
  while (done == 0) {
    cnd_wait(&cond, &mutex);  // put the current thread on the wait queue
  }
  mtx_unlock(&mutex);
  printf("The value of done is: %d", done);
  thrd_join(thread, NULL);  // make sure the worker has finished
  mtx_destroy(&mutex);
  cnd_destroy(&cond);  // destroy the condition variable
#endif
  return 0;
}

The basic logic of this code is similar to the example in the previous section: the printf near the end of main must wait until the run thread has set the global variable done to 1. In main, we first initialize the mutex object and the condition-variable object. Then, still in the main thread, we call the condition-variable function cnd_wait. A thread calling this function must already hold the mutex, which it passes as an argument; the call releases the lock and blocks the calling thread.

Next, let’s turn our attention to the run thread.

In the run thread, the value of done is changed to 1. Immediately afterwards, the call to cnd_signal "notifies" the threads previously blocked in cnd_wait so that one of them can continue running; in this example there is only one such thread, the one executing main. At that point, cnd_wait re-acquires the mutex and returns, and the main thread continues: it prints the value of the global variable done and destroys the mutex and condition-variable objects. Finally, the program finishes.

As you can see, in fact, condition variables provide us with a “notification” capability between threads. After a thread has completed something, it can notify and wake up the waiting thread, allowing it to continue working and complete the next task. In this process, we do not need to use busy waiting to let the thread frequently query the flag. Therefore, CPU resources are better utilized.

Here, a small question: why does the loop around cnd_wait use a while statement instead of an if statement?

In concurrent programming, condition variables are a very powerful weapon. With them, we can build higher-level tools such as monitors and other synchronization primitives, and they offer a clean solution to the classic producer-consumer problem. If you are interested in this topic, you can refer to books such as "C++ Concurrency in Action" and "Modern Operating Systems" for deeper study. Although they do not specifically cover concurrent programming in C, many of the concepts, and even the C++ interfaces, are similar to and interoperable with their C counterparts.

In addition to the condition variable method used in the above code, the C standard library also provides two other commonly used functions. They are organized in the table below for your reference.

Finally, let’s go back and look at another thing directly related to threads, thread local variables.

Using thread-local variables

In addition to the global variables shared within a process, a thread can also have its own thread-local variables (thread-local storage, or TLS).

As the name suggests, a thread-local variable’s value is available only during the lifetime of a specific thread: its storage is allocated when the thread starts and reclaimed when the thread ends. Reads and writes of such variables therefore cannot cause data races between threads. Let’s look at an example:

#include <stdio.h>
#include <threads.h>

#define THREAD_COUNT 10
#define THREAD_LOOP 10000

_Thread_local int counter = 0;  // a thread-local variable

int run(void *data) {
  for (int i = 0; i < THREAD_LOOP; ++i)
    counter += 1;  // update the current thread's own counter
  return counter;
}

int main(int argc, char const *argv[]) {
  thrd_t threads[THREAD_COUNT];
  int sum = 0, result = 0;
  for (int i = 0; i < THREAD_COUNT; ++i)
    thrd_create(&threads[i], run, NULL);
  for (int i = 0; i < THREAD_COUNT; ++i) {
    thrd_join(threads[i], &result);
    sum += result;  // accumulate each thread's final value
  }
  printf("The value of count is %d.\n", sum);
  return 0;
}

As you can see, the logic of this code is very simple: create 10 threads (THREAD_COUNT) that each increment the variable counter 10,000 times (THREAD_LOOP) concurrently. Then, at the end, the main thread prints the accumulated total.

Seeing this, your first instinct is probably that a mutex or atomic operations are needed to prevent data races when multiple threads modify the counter variable. But here I did neither, and instead used a more convenient approach, all thanks to thread-local variables.

In the code, the global variable counter is marked as thread-local with the _Thread_local keyword (you can also use the thread_local macro). This means each thread gets its own copy of counter, created when the thread starts, so incrementing it is unaffected by other threads. When each thread exits, its final counter value is returned through thrd_join, and the main thread adds up these per-thread results to obtain the final total. The figure below illustrates this process intuitively.

In short, thread-local variables give us another way to avoid data races. They are also useful for storing per-thread information, such as the value of errno.

The code above uses a keyword to define thread-local variables. The standard library also provides a set of functions that achieve the same purpose. The difference is that when you create thread-local storage through functions such as tss_create, you can also register a destructor for it; this ensures that per-thread resources (such as heap memory) are cleaned up correctly when the thread exits. The relevant functions are listed in the table below for your reference.

Summary

In this lecture, we covered mutexes, atomic operations, condition variables, and thread-local variables. Used properly, these tools let us avoid the problems caused by data races, race conditions, and instruction reordering that multi-threaded applications often encounter.

Among them, the mutex lets us constrain the execution of multiple threads through locking and unlocking, so that they use shared resources in an orderly way. In C, mutexes come in three types: mtx_plain is the most basic; mtx_recursive can be used where repeated locking is required; and mtx_timed adds a timeout attribute that, together with mtx_timedlock, lets a thread try to acquire a lock for a limited time.

Atomic operations are a more convenient way to avoid data races. With the _Atomic keyword, we can define variables of atomic type; when a thread accesses such a variable, it completes the entire operation in one "indivisible" step. Moreover, when performing atomic operations, you can also specify the memory order the operation must satisfy. The implementation of atomic operations usually relies on platform-specific machine instructions, and the C standard library shields us from these details by providing common synchronization primitives directly.

Condition variables provide a notification capability between threads: after finishing some work, a thread can notify the waiting threads that depend on it, allowing threads with data dependencies to synchronize more efficiently. Condition variables can also be used to build more complex synchronization mechanisms such as monitors.

Finally, thread-local variables are another common way to avoid data races. The specific technique is to add the _Thread_local keyword to a global variable in the C code; each thread then gets its own same-named copy of the variable when it is created, so one thread’s modifications to it are not affected by other threads.