That's it! I'm surrounded by Out of Memory errors!

  • Is it the charming, carefree Java heap space?

  • Is it the intellectual, tender, and gentle GC overhead limit exceeded?

  • Is it the innocent, lively, and cute Metaspace?

  • If none of the above is your cup of tea, there's…

  • The unruly, willful, and elusive CodeCache!

  • The sexy, hot, and thoughtful Direct Memory!

  • The noble and aloof, one-and-only OOM Killer!

  • There is always one you will fall in love with! The choice of bug is now in your hands!


Java heap space

This is the most common OOM problem. Who hasn’t experienced a Heap OOM?

When the heap fills up and the GC cannot reclaim space fast enough while the application keeps creating new objects, the allocator eventually fails to find room for a new allocation and the JVM throws an OOM error:

java.lang.OutOfMemoryError: Java heap space

The analysis and solution are nothing more than these few steps:

  1. Dump the heap (the flags after this list can capture it automatically)

  2. Analyze the dump with tools such as MAT, YourKit, JProfiler, or the IDEA Profiler

  3. Find the objects that take up the most memory and see which little cutie is responsible

  4. Review the code and try to optimize it to reduce object creation

  5. Increase the JVM heap size, limit the number of requests or threads, add more nodes, etc.
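
If you want the dump in step 1 to be captured automatically, the standard HotSpot flags below write an .hprof file at the moment the OOM happens, and jmap can take one on demand (the file path and <pid> are only placeholders):

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap.hprof
jmap -dump:live,format=b,file=heap.hprof <pid>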

Common misunderstandings in using class libraries

With utility libraries in particular, try to avoid creating new instances every time; reusing a single instance saves memory and improves performance.

Most mainstream class libraries make their entry classes thread-safe singletons, so one instance can be shared globally.

Here are some examples of common incorrect usage:

Apache HttpClient

CloseableHttpClient is roughly the equivalent of a "browser process": behind it sit a connection pool, connection reuse, and a pile of supporting machinery. If you create a new one for every request, it is not only slow but also wastes a lot of resources.

The more sensible approach is to maintain one instance globally (or one per business scenario/group): create it when the service starts and close it when the service shuts down:

CloseableHttpClient httpClient = HttpClients.custom()
        .setMaxConnPerRoute(maxConnPerRoute)
        .setMaxConnTotal(maxConnTotal)
        // ...
        .build();

Gson

After all, it is a Google project, and the entry class is naturally thread-safe. Just maintain a Gson instance globally.
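
As a minimal sketch of what that looks like (the JsonUtil holder class is just an illustrative name):

import com.google.gson.Gson;

public final class JsonUtil {
    // one shared, thread-safe Gson instance for the whole application
    public static final Gson GSON = new Gson();

    private JsonUtil() {}
}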

Jackson

As Spring MVC's default JSON library, Jackson is powerful and widely used, and it supports all the mainstream formats: JSON, XML, YAML, properties, and CSV. It is likewise a thread-safe singleton: just maintain one ObjectMapper globally.
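
The same idea, again as a sketch (JacksonUtil is an illustrative name; format-specific mappers such as XmlMapper would be held the same way):

import com.fasterxml.jackson.databind.ObjectMapper;

public final class JacksonUtil {
    // one shared, thread-safe ObjectMapper; configure it once at startup
    public static final ObjectMapper MAPPER = new ObjectMapper();

    private JacksonUtil() {}
}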

GC overhead limit exceeded

This error is quite interesting. With the Java heap space error above, the heap is already completely full while the application keeps trying to create new objects; at that point the service is essentially frozen and cannot handle new requests.

GC overhead limit exceeded, on the other hand, just means the GC overhead is too high: the collector spends most of its time reclaiming memory but frees very little each time. It does not mean the service is dead.

The program is in a delicate state: the heap is full (or the collection threshold keeps being hit), GC is triggered constantly, yet most objects are still reachable and cannot be reclaimed, while the mutator keeps creating new objects at a low rate.

This error usually appears under low traffic, when there are too many resident reachable objects that cannot be reclaimed, yet the memory freed by each GC is still just enough to keep the service limping along.

By then, however, the old generation is being collected frequently. It is full of large, long-lived objects, so under the existing collection algorithm GC is very inefficient and burns a huge amount of resources; it may even max out the CPU.

When this error occurs, it may look like this from a monitoring perspective:

  1. Request volume may not be large

  2. GC runs non-stop, with long pause times

  3. New requests still trickle in, but response times are very high

  4. CPU utilization is high

After all, it is still a heap memory problem, and the troubleshooting ideas are no different from the Java heap space above.
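
If you want to see this state for yourself, here is a minimal sketch (not production code): keep almost every allocation reachable so GC keeps running but frees almost nothing. Run it with a small heap such as -Xmx64m; depending on timing you will get either GC overhead limit exceeded or Java heap space:

import java.util.ArrayList;
import java.util.List;

public class GcOverheadDemo {
    // everything allocated stays reachable, so GC can reclaim almost nothing
    private static final List<long[]> RETAINED = new ArrayList<>();

    public static void main(String[] args) {
        while (true) {
            RETAINED.add(new long[128]); // many small, long-lived allocations
        }
    }
}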

Metaspace/PermGen

The most important thing in the Metaspace area is class metadata: whatever the ClassLoaders load is recorded here.

Metaspace starts out small and has no upper limit by default. When too little of the committed Metaspace is free after a GC (governed by MinMetaspaceFreeRatio, default 40%), the capacity is expanded a little at a time; expansion itself does not directly trigger a Full GC.

The recommended approach is not to give an initial value, but to limit the maximum value:

-XX:MaxMetaspaceSize=

But you still have to be careful: if Metaspace fills up, the consequences are serious, ranging from Full GC all the way to OOM:

java.lang.OutOfMemoryError: Metaspace

When troubleshooting Metaspace problems, the main idea is to track class-loading activity. The mainstream approaches are:

  1. Use tools such as Arthas to inspect ClassLoader and loaded-class statistics, and look for the ClassLoaders or Classes that exist in suspiciously large numbers

  2. Print a log line for every class loaded and unloaded: -XX:+TraceClassLoading -XX:+TraceClassUnloading (for newer JDKs, see the note below)
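
On JDK 9 and later those Trace flags were folded into unified logging; the rough equivalent is:

-Xlog:class+load=info,class+unload=info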

Here are some common scenarios that may lead to MetaSpace growth:

Improper use of reflection

Reflection in Java is slow, and reflective objects should be cached. This is especially true of Method objects: in a concurrent scenario, if you look up a fresh Method every time and then invoke it, Metaspace will blow up on you before long!

Put simply, under concurrency Method.invoke can repeatedly generate and load dynamic accessor classes, causing the Metaspace area to grow.

When you need reflection, prefer mature utility classes such as those in Spring or Apache Commons: they cache reflection-related objects internally and are feature-rich and fast enough for everyday use. A minimal caching sketch follows.
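
As an illustration of the "cache your Method objects" advice, here is a minimal sketch; MethodCache is a hypothetical helper, not part of any library:

import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class MethodCache {
    private static final Map<String, Method> CACHE = new ConcurrentHashMap<>();

    // resolve a Method once and reuse it instead of looking it up on every call
    public static Method get(Class<?> type, String name, Class<?>... paramTypes) {
        String key = type.getName() + "#" + name + Arrays.toString(paramTypes);
        return CACHE.computeIfAbsent(key, k -> {
            try {
                return type.getMethod(name, paramTypes);
            } catch (NoSuchMethodException e) {
                throw new IllegalArgumentException(e);
            }
        });
    }

    private MethodCache() {}
}

// usage: Method trim = MethodCache.get(String.class, "trim"); trim.invoke(" hi ");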

Some Agent bugs

Java Agents, whether attached statically or injected at runtime, use the Instrumentation API to enhance classes in various ways: loading, redefining, removing. If one of them has a bug, it can easily generate a huge number of dynamic classes and fill Metaspace up.

Dynamic proxy problem

Spring's AOP is also implemented with dynamic proxies. Whether it uses CGLIB or JDK proxies, ASM or ByteBuddy, the end result always comes down to dynamically generating and loading classes, and both of those operations put pressure on Metaspace.

Spring beans are singletons by default. If a proxied bean is configured as prototype, every getBean call creates a new proxy object, regenerating and redefining dynamic classes, so Metaspace naturally grows.
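
To make "dynamic creation and loading of Class" concrete, here is a minimal JDK-proxy sketch; the generated proxy class (named something like com.sun.proxy.$Proxy0 on JDK 8) is exactly the kind of metadata that ends up in Metaspace:

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

public class ProxyDemo {
    interface Greeter {
        String hello(String name);
    }

    public static void main(String[] args) {
        InvocationHandler handler = (proxy, method, methodArgs) -> "hello " + methodArgs[0];
        Greeter greeter = (Greeter) Proxy.newProxyInstance(
                Greeter.class.getClassLoader(),
                new Class<?>[]{Greeter.class},
                handler);
        // the proxy instance itself is cheap, but its generated class lives in Metaspace
        System.out.println(greeter.hello("Metaspace"));
        System.out.println(greeter.getClass().getName()); // e.g. com.sun.proxy.$Proxy0
    }
}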

Code Cache

The Code Cache area stores the native code produced by JIT-compiling hot methods (note that memory used during compilation itself does not count toward the Code Cache), and it is also non-heap memory.

If the code cache is full, you may see a log like this:

Server VM warning: CodeCache is full. Compiler has been disabled.

At this point the JVM will disable JIT compilation and your service will start to slow down.

The upper limit of Code Cache is relatively low by default, generally 240MB/128MB, and may vary on different platforms.

The upper limit of Code Cache can be adjusted through parameters:

-XX:ReservedCodeCacheSize=

As long as you avoid excessively large classes and methods, this area rarely fills up, and the default 240MB/128MB is usually enough. If you do suspect it, see the flags below.
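
A hedged way to check: have HotSpot print Code Cache usage when the JVM exits, and raise the cap if needed (256m here is only an example value):

-XX:+PrintCodeCache
-XX:ReservedCodeCacheSize=256m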

Direct Memory

The Direct Memory area is what we usually call off-heap direct memory. Many disk I/O and socket I/O paths use Direct Memory to get "zero copy" performance gains.

Netty, for example, really pushes Direct Memory to its limits (I'll write up an analysis of Netty's memory management when I have time)…

Using Direct Memory essentially bypasses JVM memory management: you are effectively calling malloc() and enjoying the fun of manual memory management ~
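
The most common way to touch this area from plain Java code is ByteBuffer.allocateDirect, which allocates outside the heap and counts against MaxDirectMemorySize. A minimal sketch:

import java.nio.ByteBuffer;

public class DirectMemoryDemo {
    public static void main(String[] args) {
        // allocated outside the Java heap; released only when the buffer object
        // is garbage collected (or freed early via internal Cleaner machinery)
        ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024 * 1024); // 64 MB off-heap
        buffer.putLong(0, 42L);
        System.out.println(buffer.getLong(0)); // 42
        System.out.println(buffer.isDirect()); // true
    }
}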

However, this stuff is dangerous to use. It is usually manipulated through Unsafe, and if you read or write the wrong address, the JVM has a little surprise for you:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00007ffdbd5d19b4, pid=1208, tid=0x0000000000002ee0
#
# JRE version: Java(TM) SE Runtime Environment (8.0_301-b09) (build 1.8.0_301-b09)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.301-b09 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# C [msvcr100.dll + 0x119b4]
#
# No core dump will be written. Minidumps are not enabled by default on client versions of Windows
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#


Direct Memory has no explicit upper limit by default, but to avoid being killed by the OS you should still cap it, for example at 256MB or less, so it cannot grow without bound:

-XX:MaxDirectMemorySize=

If Direct Memory reaches MaxDirectMemorySize and cannot be released, you will get an OOM error:

java.lang.OutOfMemoryError: Direct buffer memory

Linux OOM Killer

Now we step outside JVM memory management entirely. When OS memory is exhausted, Linux picks the process that uses the most memory, has the lowest priority, or matters the least, and kills it.

In a container, the main process is generally our JVM, so once memory runs out it is the first to be killed, and the SIGKILL (kill -9) catches you completely off guard.

If the JVM memory parameters are properly configured and are far below the container memory limit, but OOM Killer still occurs, congratulations, there is a high probability that there is a Native memory leak.

The JVM cannot manage this part of memory.

Setting aside the small-probability case of a native-leak bug inside the JVM itself, the culprit is most likely a third-party library you depend on.

These problems are very troublesome to troubleshoot: once you are outside the JVM, you can only analyze them with native tools.

And those tools tend to require root permissions at every turn, which means asking the boss to approve access… the cost of troubleshooting is genuinely high.


The basic idea for troubleshooting Native memory is:

  1. Use pmap to inspect the process's memory mappings, locate suspicious memory blocks, and analyze their contents

  2. Use strace to trace the process's system calls and analyze the call paths of memory allocation (example commands after this list)

  3. Swap in an alternative memory allocator such as jemalloc/tcmalloc (or use async-profiler, which has a branch supporting native allocation analysis) to track the malloc call chain
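
As a rough illustration of what steps 1 and 2 can look like in practice (<pid> and the syscall list are placeholders; adjust to taste):

pmap -x <pid> | sort -k3 -n -r | head -30
strace -f -e trace=mmap,munmap,brk,mprotect -p <pid>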

Currently, the most common native-"leak" scenario involves the JDK's Inflater/Deflater, which provide GZIP compression and decompression: under glibc's default malloc implementation they are prone to apparent "memory leaks". So if you are facing a native memory leak, first check whether the application does anything GZIP-related; there may be surprises. A defensive sketch follows.
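
One code-level habit that helps (a hedged sketch, not a cure for the glibc allocator behavior itself): Inflater/Deflater hold native zlib buffers, so release them deterministically with end() instead of waiting for GC or finalization:

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class DeflateUtil {
    // compress a byte[] and always release the native zlib resources via end()
    public static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // frees the native memory immediately
        }
    }
}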

Okay, now that you've experienced OOM in all its styles, which one did you fall for?