Service log performance optimization: a series of incidents caused by logging

Only those who have been badly burned by production incidents understand how important logs are!

Who is in favor and who is against? If you feel the same way, congratulations, you are a seasoned veteran 🙂

The importance of logging to a program is self-evident. It is lightweight, simple, and requires almost no thought; it appears everywhere in our code and helps us troubleshoot and locate problems. Yet this seemingly innocuous tool hides all kinds of pitfalls. Used improperly, logs not only fail to help us, they become a service "killer".

This article introduces the pitfalls caused by improper logging in production, which are especially visible in high-concurrency systems, and how to avoid them. It also provides a set of solutions that let programs and logs "coexist harmoniously".

A guide to avoiding the pitfalls

In this section, I will walk through the logging problems we have hit in production and analyze their root causes one by one.

Improper log statement formats

//Format 1
log.debug("get user " + uid + " from DB is Empty!");

//Format 2
if (log.isDebugEnabled()) {
    log.debug("get user " + uid + " from DB is Empty!");
}

//Format 3
log.debug("get user {} from DB is Empty!", uid);

I believe everyone has seen all three of these styles in project code at one time or another. So what are the differences between them, and how do they affect performance?

If the DEBUG log level is turned off, the differences become apparent:

Format 1 still performs the string concatenation even though nothing is logged, which is pure waste.

Format 2 avoids that waste, but it needs an extra guard clause, which clutters the code and is not elegant.

Format 3 is therefore recommended: the message is only formatted when the statement actually executes at an enabled level, so there is no performance loss once the corresponding log level is turned off.
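Note that parameterized logging only defers the string formatting; the arguments themselves are still evaluated. If an argument is expensive to compute, an explicit level guard is still worthwhile. A minimal sketch (buildReport() is a hypothetical expensive call, not from the code above):

// buildReport() still runs even when DEBUG is disabled, because arguments are evaluated eagerly
log.debug("daily report for user {}: {}", uid, buildReport(uid));

// Keep the Format 2 style guard for expensive arguments only
if (log.isDebugEnabled()) {
    log.debug("daily report for user {}: {}", uid, buildReport(uid));
}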

Printing large volumes of logs in production consumes performance

The more logs there are, the easier it is to string together a user request and pinpoint the problematic code. In today's distributed systems with complex business logic, a single missing log line can be a huge obstacle to locating a problem. So programmers who have been burned by production issues tend to log as much as possible during development.

To be able to locate and fix future production problems as quickly as possible, programmers record as many key logs as they can during implementation. After going live, problems can indeed be located quickly, but a new challenge appears: as the business grows rapidly, user traffic keeps increasing and system pressure rises. A huge volume of INFO logs is now written in production, and during peak periods the massive disk writes eat into service performance.

This becomes a trade-off: more logs make troubleshooting easier but "eat" service performance; fewer logs leave service stability untouched but make troubleshooting painful for the programmer.

Question: why does performance suffer when there are too many INFO logs (with CPU usage spiking at the same time)?

Root cause 1: synchronous logging makes disk I/O the bottleneck, leading to a large number of blocked threads

Imagine all logs being written to the same log file by multiple threads at once; without coordination the output would be garbled. The solution is a lock that keeps the file output orderly. During peak periods, however, contention for that lock consumes the most performance: while one thread holds it, the other threads can only block and wait, which severely drags down the user threads. The symptom is upstream call timeouts and users perceiving the service as stuck.

The following is the stack of a thread stuck writing to the log file:

Stack Trace is:
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.logging.log4j.core.appender.OutputStreamManager.writeBytes(OutputStreamManager.java:352)
- waiting to lock <0x000000063d668298> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
at org.apache.logging.log4j.core.layout.TextEncoderHelper.writeEncodedText(TextEncoderHelper.java:96)
at org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeText(TextEncoderHelper.java:65)
at org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:68)
at org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:32)
at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:228)
...
So is reducing INFO logs in production enough? No: the volume of ERROR logs cannot be underestimated either. If a flood of bad data hits production, or a large number of downstream calls time out, a huge number of ERROR logs is generated in an instant. Disk I/O is still saturated, and user threads still block.

Question: suppose you give up on INFO-level troubleshooting entirely; is there no performance problem if production only prints ERROR logs?

Root cause 2: thread blocking caused by printing exception stack traces under high concurrency

At one point, a large number of downstream timeouts occurred, and all the exceptions were caught by our service. Fortunately, the disaster-recovery design had anticipated this problem and a fallback path kicked in, so there was no business impact; but the servers soon started to teach us a lesson. Monitoring began to alarm: CPU usage climbed too fast and shot straight past 90%. We urgently scaled out to stop the bleeding, then took one machine out of traffic and pulled a thread dump.

After examining the dumped thread stacks, combined with flame graph analysis, we found that most threads were stuck at the following stack location:

Stack Trace is:
java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.ClassLoader.loadClass(ClassLoader.java:404)
- waiting to lock <0x000000064c514c88> (a java.lang.Object)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.logging.log4j.core.impl.ThrowableProxyHelper.loadClass(ThrowableProxyHelper.java:205)
at org.apache.logging.log4j.core.impl.ThrowableProxyHelper.toExtendedStackTrace(ThrowableProxyHelper.java:112)
at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:112)
at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:96)
at org.apache.logging.log4j.core.impl.Log4jLogEvent.getThrownProxy(Log4jLogEvent.java:629)

The full stack is long; most of the blocked threads are waiting inside java.lang.ClassLoader.loadClass, and walking further down the stack shows they are all triggered by this line of code:

at org.apache.logging.slf4j.Log4jLogger.error(Log4jLogger.java:319)

// The corresponding business code is
log.error("ds fetcher get error", e);
Wait, what? That's outrageous. Why would logging load a class, and why does loading a class block so many threads?

After some review and analysis, we came to the following conclusions:

1. When Log4j's Logger.error prints an exception stack trace, it needs the ClassLoader to load the classes in the stack in order to print their location information (class, jar name, version);
2. Class loading is thread-safe. Parallel class loading improves efficiency when different classes are loaded, but when multiple threads load the same class they still have to wait for one another synchronously. This is especially risky when different threads print exactly the same exception stack; and when the ClassLoader tries to load a class it cannot find, efficiency drops sharply, which worsens the thread blocking further;
3. Because reflective calls are slow, the JDK optimizes them by dynamically generating Java classes that replace the native calls. These generated dynamic classes are loaded by a DelegatingClassLoader and cannot be loaded by other ClassLoaders. When such reflection-optimized dynamic classes appear in an exception stack, thread blocking under high concurrency becomes very likely.
Combining this with the stack above, it is clear where the threads get stuck:

A flood of requests causes downstream services to time out, so the timeout exception stack is printed very frequently. For every frame of the stack, the corresponding class, version, line number and other information is obtained via reflection, and loadClass waits synchronously: while one thread holds the class-loading lock, most other threads block waiting for the class to load, which hurts performance.

To be fair, even if most threads wait for one thread to finish loadClass, they should only be stalled for a moment. Why did the class loading keep happening? Combining the conclusions above with the program code, we found that the downstream-request logic in these threads executes Groovy scripts, which generate classes dynamically. As conclusion 3 explains, under high concurrency Log4j cannot resolve such dynamic classes when building the stack trace, so the reflective lookup is attempted again and again for every new exception. More and more threads could only join the queue and wait, blocked. The following measures can mitigate the cost of printing exception stacks.

1. Remove unnecessary exception stack printing

For obvious exceptions, don't print the stack trace and save some performance. Anything, once combined with high concurrency, takes on a different meaning 🙂

try {
    System.out.println(Integer.parseInt(number) + 100);
} catch (Exception e) {
    // Before improvement
    log.error("parse int error : " + number, e);
    // After improvement
    log.error("parse int error : " + number);
}

If Integer.parseInt throws an exception, the cause must be that the number passed in is not a valid integer. In this case printing the exception stack is completely unnecessary, and the stack printing can be removed.

2. Convert the stack information into a string and print it

import java.io.PrintWriter;
import java.io.StringWriter;

public static String stacktraceToString(Throwable throwable) {
    StringWriter stringWriter = new StringWriter();
    throwable.printStackTrace(new PrintWriter(stringWriter));
    return stringWriter.toString();
}
The stack information obtained through log.error is more complete: the JDK version, the class path information, and, for classes inside jars, the jar's name and version are also printed. All of this information is obtained by loading classes via reflection, which costs a great deal of performance.

Calling stacktraceToString converts the exception stack into a plain string. Compared with passing the Throwable, it gives up some of the version and jar metadata; you need to decide for yourself whether that information is necessary (for example, it is still useful when troubleshooting class conflicts by version).
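As a usage sketch (fetchFromDownstream() is a hypothetical call and the message text is only an example), the earlier business log statement could pass the pre-rendered stack string instead of the Throwable, so Log4j does not have to build its extended stack trace:

try {
    fetchFromDownstream();
} catch (Exception e) {
    // Passing a plain String instead of the Throwable avoids the class-loading work described above
    log.error("ds fetcher get error, stack: {}", stacktraceToString(e));
}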

3. Disable reflection optimization

When Log4j prints stack information and the stack contains a dynamically generated reflection-optimization proxy class, that class cannot be loaded by other ClassLoaders, and printing the stack then seriously hurts execution efficiency. However, disabling reflection optimization has its own side effect: reflective calls themselves become slower. Weigh the trade-off for your workload.
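For reference, and as an assumption on my part rather than something stated here: on HotSpot JDK 8 the reflection "inflation" that generates these accessor classes is governed by the following system properties, so pushing the threshold out of reach keeps the slower native accessors instead of generated classes. Verify the behavior on your own JDK version before relying on it.

# HotSpot-specific properties; the defaults are noInflation=false, inflationThreshold=15
-Dsun.reflect.noInflation=false
-Dsun.reflect.inflationThreshold=2147483647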

4. Print logs asynchronously

In production, especially for services with high QPS, asynchronous logging must be enabled. Of course, with asynchronous logging there is a chance of losing log entries, for example if the process is forcibly killed; this is another trade-off to make.
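A minimal sketch of enabling asynchronous logging with Log4j2 (my example, assuming the LMAX Disruptor dependency is on the classpath; file names and sizes are placeholders). All loggers can also be made asynchronous globally via -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector; the log4j2.xml variant below uses AsyncRoot:

<Configuration status="warn">
  <Appenders>
    <RollingFile name="app" fileName="logs/app.log" filePattern="logs/app-%d{yyyy-MM-dd}-%i.log">
      <PatternLayout pattern="[%d{yy-MM-dd.HH:mm:ss.SSS}] [%thread] [%-5p %-22c{0} -] %m%n"/>
      <Policies>
        <SizeBasedTriggeringPolicy size="200 MB"/>
      </Policies>
    </RollingFile>
  </Appenders>
  <Loggers>
    <!-- AsyncRoot hands log events to a background thread instead of writing on the caller's thread -->
    <AsyncRoot level="info">
      <AppenderRef ref="app"/>
    </AsyncRoot>
  </Loggers>
</Configuration>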

5. Log output format

Compare the two log output formats below used in our project:

//Format 1
[%d{yyyy/MM/dd HH:mm:ss.SSS}] [%X{traceId}] %t [%p] %C{1} (%F:%M:%L) %msg%n

//Format 2
[%d{yy-MM-dd.HH:mm:ss.SSS}] [%thread] [%-5p %-22c{0} -] %m%n

The official Log4j2 documentation also gives a clear performance warning: using the following fields in the output pattern greatly reduces performance.

%C or %class, %F or %file, %l or %location, %L or %line, %M or %method

To obtain the caller's method name and line number, Log4j relies on the exception mechanism: it creates a throwable, captures its stack, and parses the location information out of the stack content. The implementation adds lock acquisition and parsing on this path, so under high concurrency the performance loss is easy to imagine.

The following pattern parameters affect performance; configure them only when necessary:

%C - the caller's class name (slow, not recommended)
%F - the caller's file name (extremely slow, not recommended)
%l - the caller's location: class, method, file name and line number (extremely slow, strongly not recommended)
%L - the caller's line number (extremely slow, not recommended)
%M - the caller's method name (extremely slow, not recommended)

Solution: dynamic adjustment of the log level

Project code needs to print a large number of INFO-level logs to support problem location and testing. In the production environment, however, most of these INFO logs are of little use, and their sheer volume eats CPU. What we need is to adjust the log level dynamically: turn INFO logs on at any time when we want to look at them, and turn them off again when they are not needed so they do not affect service performance.

Solution: combine Apollo with Log4j2 to dynamically control the log level at the API level, with fine granularity, either globally or for a single class. The advantage is that changes take effect immediately: when troubleshooting a production problem you can open a specific log level for a single class, and close it again as soon as the troubleshooting is done.

Due to the length of this article I will not post the full implementation code. The implementation is actually very simple: it cleverly uses Apollo's change-notification mechanism to reset the log level. If you are interested, leave a comment or send me a private message, and I will write a dedicated article explaining the implementation in detail.
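For reference only, here is a minimal sketch of one possible approach, not the author's actual implementation. It assumes Apollo's Java client (ConfigService / ConfigChangeListener), Log4j2's Configurator API, and a hypothetical Apollo key convention of log.level.<loggerName>:

import com.ctrip.framework.apollo.Config;
import com.ctrip.framework.apollo.ConfigService;
import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.core.config.Configurator;

public class DynamicLogLevelListener {

    private static final String KEY_PREFIX = "log.level.";   // hypothetical key naming convention

    public static void register() {
        Config config = ConfigService.getAppConfig();
        config.addChangeListener(changeEvent -> {
            for (String key : changeEvent.changedKeys()) {
                if (!key.startsWith(KEY_PREFIX)) {
                    continue;
                }
                // e.g. key "log.level.com.foo.DsFetcher" with value "DEBUG"
                String loggerName = key.substring(KEY_PREFIX.length());
                String newLevel = changeEvent.getChange(key).getNewValue();
                // Takes effect immediately for that logger; use the root logger name for a global switch
                Configurator.setLevel(loggerName, Level.toLevel(newLevel, Level.INFO));
            }
        });
    }
}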

Summary and Outlook

This article has walked you through the common logging problems in day-to-day services and the corresponding solutions. Remember: something simple + high concurrency = not simple! Respect production!

Finally, thank you to everyone who read this article carefully.
