Detailed description of Java memory barriers and a thorough understanding of volatile


Directory of series articles

[JVM Series] Chapter 1 Runtime Data Area
[JVM Interview Questions] Chapter 2 From JDK7 to JDK8, why does the JVM use metaspace to replace the permanent generation?
[JVM Interview Questions] Chapter 3 Why is the JVM generation age 15 times? Can it be set to 16?
[JVM Series] Detailed description of Java memory barriers and a thorough understanding of volatile

Article directory

  • Directory of series articles
  • Foreword
  • 1. Compiler barrier
  • 2. x86 CPU barrier
  • 3. Memory barrier in HotSpot VM
  • Summary

Foreword

Generally speaking, memory barriers come in two layers: compiler barriers and CPU barriers. The former take effect only at compile time, preventing the compiler from emitting memory access instructions out of order; the latter insert or modify specific CPU instructions at runtime, preventing the CPU from executing memory access instructions out of order.

1. Compiler barrier

The compiler barrier is written as follows:

asm volatile("" : : : "memory");

This inline assembly inserts only an empty instruction "". The key is that "memory" appears in the clobber list, which tells the compiler: this (actually empty) instruction may read from or write to any memory address. The compiler then becomes conservative: it will not move memory accesses from above the barrier to below it, nor move accesses from below to above it. That prevents reordering, which is exactly the result we want. The statement has another side effect: it forces the compiler to flush all memory variables cached in registers back to memory, and to reload those values from memory afterwards.

To sum up, this statement serves two purposes: preventing instruction reordering and ensuring visibility.
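As an illustration, here is a minimal sketch of how such a barrier is typically used in a publish pattern, assuming GCC/Clang inline-assembly syntax (the names data, flag and publish are illustrative, not from HotSpot): without the barrier, the compiler could reorder the two stores or keep data cached in a register.

// Minimal sketch, assuming GCC/Clang; illustrative names, not HotSpot code.
int data = 0;
int flag = 0;

void publish() {
    data = 42;                          // prepare the payload
    __asm__ volatile("" ::: "memory");  // compiler barrier: the compiler may not
                                        // move memory accesses across this point
                                        // and must flush register-cached values
    flag = 1;                           // signal that data is ready
}

Note that this constrains only the compiler; on a CPU that reorders stores, a CPU barrier would still be required (on x86, store-store ordering happens to be guaranteed already).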

If you run Java with the pure bytecode interpreter (the Zero variant of HotSpot VM), the orderAccess_linux_zero.inline.hpp file has the following implementation:

static inline void compiler_barrier() {
  __asm__ volatile ("" : : : "memory");
}

inline void OrderAccess::loadload()   { compiler_barrier(); }
inline void OrderAccess::storestore() { compiler_barrier(); }
inline void OrderAccess::loadstore()  { compiler_barrier(); }

This approach relies entirely on the compiler. As long as the compiler supports it, there is no need to write separate implementations for different platforms and CPUs, which simplifies cross-platform work.

2. x86 CPU barrier

x86 has a strong memory model, meaning that in most cases the CPU guarantees memory access instructions execute in order. For the cases where the CPU can still reorder, we need to add a CPU memory barrier. The dedicated memory barrier instruction on x86 is mfence; alternatively, the lock instruction prefix achieves the same effect at lower cost. In other words, memory barriers fall into two categories:

  • Instructions that are memory barriers themselves, such as the lfence, sfence and mfence assembly instructions
  • Instructions that are not memory barriers per se, but become one when modified by the lock instruction prefix. In the x86 instruction set, one type of memory barrier is often implemented as "the lock instruction prefix plus a no-op", such as lock addl $0x0,(%esp)

The following introduces the lock instruction prefix. Its functions are as follows:

  • It makes the modified assembly instruction atomic
  • Together with the modified instruction, it provides a memory barrier effect

In the x86 instruction set there is a lock instruction prefix, and the assembly instructions that may be modified with it are:

ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, XCHG, etc.

It should be noted that the XCHG instruction carries an implicit lock when one of its operands is in memory, so it is atomic by itself, yet it still accepts an explicit lock prefix; XADD, by contrast, needs the explicit lock prefix to be atomic on multiprocessor systems.

There are two effects of the lock prefix to remember. The first is the memory barrier: any instruction carrying a lock prefix, explicitly or implicitly, as well as instructions such as CPUID, acts as a memory barrier. For example, xchg [mem], reg carries an implicit lock prefix. The second is atomicity: a single instruction is not necessarily indivisible. For example, mov is atomic only when its operand meets certain alignment conditions, whereas an instruction modified by the lock prefix is always atomic.
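For example, a fetch-and-add can be built directly on the lock prefix. The following is a minimal sketch using GCC/Clang inline assembly on x86 (the helper name atomic_fetch_add is illustrative):

// Sketch of atomic fetch-and-add via the lock prefix (GCC/Clang, x86).
static inline int atomic_fetch_add(volatile int* addr, int value) {
    int old = value;
    // lock makes the read-modify-write of *addr indivisible and also acts
    // as a full memory barrier; xadd leaves the previous *addr value in old
    __asm__ volatile("lock; xaddl %0, %1"
                     : "+r"(old), "+m"(*addr)
                     : : "cc", "memory");
    return old;
}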

3. Memory barrier in HotSpot VM

To let Java developers understand these concepts in a CPU-independent way, the JMM combines memory read (Load) and write (Store) operations in pairs: LoadLoad, LoadStore, StoreLoad and StoreStore. On x86, only the StoreLoad combination may be reordered, and only when the Store and the Load target different memory addresses.

Here we discuss only the CPU barriers under the x86 architecture, with reference to the Intel manual. The four barriers are a cross-platform design of Java; depending on the CPU, the JVM on a given platform can optimize some of them away. For example, LoadLoad, LoadStore and StoreStore ordering is the default behavior on x86, which simplifies development on that platform. x86-64 permits only one kind of reordering, StoreLoad: a read operation may be reordered before an earlier write operation. In addition, write operations from different threads are not guaranteed to become globally visible in a single order; for examples, see sections 8.6.1 and 8.2.3.7 of the "Intel® 64 and IA-32 Architectures Software Developer's Manual". This problem can be solved with lock or mfence, but not by combining sfence and lfence.
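The classic demonstration of StoreLoad reordering is the Dekker-style pattern sketched below (GCC/Clang inline assembly; illustrative names): each thread stores to its own flag and then loads the other's. Without a StoreLoad barrier between them, each load may be reordered before its preceding store, so both threads can read 0 and both enter the critical section.

int x = 0, y = 0;

void thread1() {
    x = 1;
    __asm__ volatile("mfence" ::: "memory");  // StoreLoad barrier; a lock-prefixed
                                              // instruction would work as well
    int r1 = y;  // without the fence, this load may pass the store to x
    (void)r1;
}

void thread2() {
    y = 1;
    __asm__ volatile("mfence" ::: "memory");
    int r2 = x;
    (void)r2;
}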

The loadload(), storestore(), loadstore() and storeload() functions implemented by HotSpot VM in JDK 1.8 on x86 are as follows:

inline void OrderAccess::loadload()   { acquire(); }
inline void OrderAccess::storestore() { release(); }
inline void OrderAccess::loadstore()  { acquire(); }
inline void OrderAccess::storeload()  { fence(); }

inline void OrderAccess::acquire() {
  volatile intptr_t local_dummy;
#ifdef AMD64
  __asm__ volatile ("movq 0(%%rsp), %0" : "=r" (local_dummy) : : "memory");
#else
  __asm__ volatile ("movl 0(%%esp),%0" : "=r" (local_dummy) : : "memory");
#endif // AMD64
}

inline void OrderAccess::release() {
  // Avoid hitting the same cache-line from different threads.
  volatile jint local_dummy = 0;
}

Acquire semantics prevent the read and write operations after it from being reordered before the acquire, so the combination of LoadLoad and LoadStore satisfies it; release semantics prevent the read and write operations before it from being reordered after the release, so the combination of StoreStore and LoadStore satisfies it. In this way, acquire and release form a "fence" that keeps the read and write operations inside it from escaping, although external read and write operations may still move inside the fence.

On x86, acquire and release do not involve StoreLoad, so they hold by default and the functions need no real operation. In the implementation, the acquire() function reads a C++ volatile variable and the release() function writes one. This probably echoes the synchronization semantics Microsoft added to the C++ volatile keyword starting with Visual Studio 2005: reads of volatile variables have acquire semantics, and writes to volatile variables have release semantics.
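In portable C++ the same pairing is written with std::atomic. The sketch below (the names payload, ready, writer and reader are illustrative) shows how a release store and an acquire load cooperate, mirroring the semantics described above:

#include <atomic>

int payload = 0;
std::atomic<bool> ready{false};

void writer() {
    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // release: earlier writes
                                                   // may not sink below this store
}

void reader() {
    if (ready.load(std::memory_order_acquire)) {   // acquire: later reads may
                                                   // not rise above this load
        // payload is guaranteed to be 42 here
    }
}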

As an aside, a mutex can be built on acquire and release semantics; in fact, mutexes are the origin of these two primitives. Acquire originally meant acquiring a lock, and release meant releasing one. A mutex therefore guarantees that data read inside the locked region is not stale, and that all writes reach memory before the release. That is why, when we implement a lock later, code like the following appears:

pthread_mutex_lock(&mutex);
// operate
pthread_mutex_unlock(&mutex);

The implementation of fence() called by the OrderAccess::storeload() function is as follows:

inline void OrderAccess::fence() {
    __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
}

As you can see, the lock prefix is used to implement the memory barrier.

Let’s take a look at the implementation of Java’s volatile variables.

At the bytecode level, a volatile field is marked with the ACC_VOLATILE flag in its access_flags. When HotSpot VM reads or writes such a field, it inserts barriers around the access. For example, a volatile variable read is accompanied by the following barriers:

volatile variable read operation
LoadLoad
LoadStore

Add the following barriers when writing volatile variables:

LoadStore
StoreStore
Volatile variable write operation
StoreLoad

As described above, operations after a volatile read may not be reordered before it, and operations before a volatile write may not be reordered after it, so volatile carries acquire and release semantics.
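To make the barrier placement concrete, here is a hedged C++ sketch of what the JVM conceptually does around a volatile access. The JVM emits these barriers itself; Java code never writes them, and the fences shown are only an approximation of the tables above:

#include <atomic>

int v;  // stands in for a Java volatile field

int volatile_read() {
    int value = v;
    std::atomic_thread_fence(std::memory_order_acquire);  // LoadLoad + LoadStore
    return value;
}

void volatile_write(int x) {
    std::atomic_thread_fence(std::memory_order_release);  // LoadStore + StoreStore
    v = x;
    std::atomic_thread_fence(std::memory_order_seq_cst);  // StoreLoad (lock addl on x86)
}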

On x86-64, only StoreLoad needs to be handled, so in the interpreted implementation of the putfield and putstatic instructions (see the article: Chapter 26 – virtual machine object operation instructions putstatic), the following instruction is emitted after a volatile variable write:

lock addl $0x0,(%rsp)

In multi-threaded programming, when mutexes, semaphores and events are used, their call points are designed to prevent memory reordering (they already include the necessary memory barriers implicitly), so memory reordering does not need to be considered. Only when lock-free techniques are used, that is, when threads share memory without any mutex, does the effect of memory reordering become apparent, and then we must add appropriate memory barriers in the appropriate places.

Summary

The main function of compiler barriers and CPU barriers is to preserve the order of memory accesses and prevent instruction reordering, thereby ensuring the correctness and reliability of a program.
Compiler barriers take effect at compile time and prevent the compiler from generating out-of-order memory access instructions. Once a compiler barrier is inserted, the compiler will not reorder or over-optimize the code across it, which preserves program correctness. More specifically, barriers can be divided into read barriers and write barriers, which ensure visibility when reading and writing shared variables respectively.
CPU barriers prevent memory access instructions from being executed out of order at runtime, which is achieved by inserting or modifying specific CPU instructions.