Hands-on: using Arthas to locate a Spring Boot interface timeout and let the application take off ~



Background

The company has a channel system dedicated to interfacing with third-party channels. It contains no business logic; it mainly does message conversion and parameter validation, acting as a bridge between the callers and the channels.

Recently, while optimizing the interface response time, I found that no matter how much the code was optimized, a strange cost of about 100ms remained: even with the processing time printed inside the interface, it still differed from the response time seen by the caller by roughly 100ms. For example, the program recorded 150ms, but the caller waited about 250ms.

The detailed process of locating and solving the problem at that time is recorded below (the fix itself is actually very simple; the key lies in how the problem was located and how a solution was found).

Location process

Analyzing code

The channel system is an ordinary spring-boot web project using the embedded Tomcat. After going through the code I found nothing special: no unusual filters or interceptors, so I initially ruled out a problem in the business code.

Analyze the calling process

After the problem appeared, we first confirmed the call path of the interface. Since this was an internal test, the call chain was short:

sender -> Nginx (reverse proxy) -> channel system

The servers are cloud hosts and the network is the cloud provider's intranet. Since the cause was not clear, we used the process of elimination and first confirmed whether there was a problem with the server network.

First, confirm whether there is any problem between the sender and the Nginx host:

[jboss@VM_0_139_centos ~]$ ping 10.0.0.139
PING 10.0.0.139 (10.0.0.139) 56(84) bytes of data.
64 bytes from 10.0.0.139: icmp_seq=1 ttl=64 time=0.029 ms
64 bytes from 10.0.0.139: icmp_seq=2 ttl=64 time=0.041 ms
64 bytes from 10.0.0.139: icmp_seq=3 ttl=64 time=0.040 ms
64 bytes from 10.0.0.139: icmp_seq=4 ttl=64 time=0.040 ms

Judging from the ping results, there is no latency problem from the sender to the Nginx host. Next, check the network from Nginx to the channel system.

# The result was the same, so the log above is reused here directly.
[jboss@VM_0_139_centos ~]$ ping 10.0.0.139
PING 10.0.0.139 (10.0.0.139) 56(84) bytes of data.
64 bytes from 10.0.0.139: icmp_seq=1 ttl=64 time=0.029 ms
64 bytes from 10.0.0.139: icmp_seq=2 ttl=64 time=0.041 ms
64 bytes from 10.0.0.139: icmp_seq=3 ttl=64 time=0.040 ms
64 bytes from 10.0.0.139: icmp_seq=4 ttl=64 time=0.040 ms

Judging from the ping results, there is no problem with the network delay from Nginx to the channel system server.

Since the network seems fine, we can continue the process of elimination: cut Nginx out and have the client request the channel system directly through the loopback address (localhost), bypassing the network card / DNS, to narrow the scope of the problem and see whether it can still be reproduced (the application and address below were simulated by me afterwards; the test hits an empty interface):

[jboss@VM_10_91_centos tmp]$ curl -w "@curl-time.txt" http://127.0.0.1:7744/send
success
              http: 200
               dns: 0.001s
          redirect: 0.000s
      time_connect: 0.001s
   time_appconnect: 0.000s
  time_pretransfer: 0.001s
time_starttransfer: 0.073s
     size_download: 7bytes
    speed_download: 95.000B/s
                  ----------
        time_total: 0.073s total request time
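By the way, the -w "@curl-time.txt" option reads a write-out template from a file. The exact file used at the time is not shown in the original; a minimal sketch that would produce output like the above, using only curl's built-in write-out variables, might look like this:

              http: %{http_code}\n
               dns: %{time_namelookup}s\n
          redirect: %{time_redirect}s\n
      time_connect: %{time_connect}s\n
   time_appconnect: %{time_appconnect}s\n
  time_pretransfer: %{time_pretransfer}s\n
time_starttransfer: %{time_starttransfer}s\n
     size_download: %{size_download}bytes\n
    speed_download: %{speed_download}B/s\n
                  ----------\n
        time_total: %{time_total}s\n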

Judging from the curl log, even calling an empty interface through the loopback address takes 73ms. That is strange: it skips all the intermediate nodes (including filters & interceptors, etc.) and hits an empty interface on the application directly, yet still takes 73ms. Let's request again and see:

[jboss@VM_10_91_centos tmp]$ curl -w "@curl-time.txt" http://127.0.0.1:7744/send
success
              http: 200
               dns: 0.001s
          redirect: 0.000s
      time_connect: 0.001s
   time_appconnect: 0.000s
  time_pretransfer: 0.001s
time_starttransfer: 0.003s
     size_download: 7bytes
    speed_download: 2611.000B/s
                  ----------
        time_total: 0.003s

What's even weirder is that the second request is normal and takes only 3ms. After looking it up: curl on Linux enables HTTP keep-alive by default, and even without keep-alive a new handshake would not cost 70ms on every request.

After further analysis and testing, we found that consecutive requests are very fast, each taking only a few milliseconds, but if you wait a while and then request again, it takes 70+ ms.

From this phenomenon it can be speculated that some caching mechanism is involved: consecutive requests are fast because they hit the cache, while requests made after the cache expires take a long time.
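This guess can be checked with a simple shell loop that compares back-to-back requests against a request made after a pause (a rough sketch; the pause length is just for illustration):

# three back-to-back requests: expected to take only a few ms each
for i in 1 2 3; do curl -s -o /dev/null -w "consecutive: %{time_total}s\n" http://127.0.0.1:7744/send; done
# wait a while, then request again: expected to take 70+ ms
sleep 10
curl -s -o /dev/null -w "after pause: %{time_total}s\n" http://127.0.0.1:7744/send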

So where does the problem lie, the Tomcat layer or spring-webmvc?

Guesswork alone won't locate it; it still has to be tested. Pull the channel system's code into the local IDE and start it to see whether the problem can be reproduced.

However, after importing the project into the local IDE and starting it there, the problem could not be reproduced: there was no 70+ ms delay. Now it gets painful: it cannot be reproduced locally and cannot be debugged online, and since the problem is not in the business code, it cannot be narrowed down by adding logs either.

At this point, it's time to bring out the magic weapon: Arthas.

Arthas analysis problem

Arthas is Alibaba’s open source Java diagnostic tool, which is very popular among developers. When you encounter the following similar problems and are at a loss, Arthas can help you solve them:

  • Which jar package is this class loaded from? Why are various class-related Exceptions reported?

  • Why hasn't the code I changed been executed? Could it be that I didn't commit it? Or am I on the wrong branch?

  • When you hit a problem that cannot be debugged online, do you really have to add logs and redeploy?

  • There is a problem with a user’s data processing online, but it cannot be debugged online and cannot be reproduced offline!

  • Is there a global view of how the system is performing?

  • Is there any way to monitor the real-time running status of the JVM?

The above is Arthas' official introduction. This time I only need one of its small features: trace, which dynamically traces method call paths and per-node timings, so I can locate where the time is being spent.

  • trace: trace the internal call path of a method and output the time consumed at each node along the path.

  • The trace command actively searches for the method call paths matching class-pattern / method-pattern, renders and counts all the performance overhead on the entire call link, and traces the call chain.

Now that we have the tool, which method should be traced? Since I am not very familiar with the Tomcat source code, I can only start from Spring MVC. First, trace the Spring MVC entry point:

(screenshots: attaching Arthas and the trace output for the Spring MVC entry point)
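For reference, the trace command behind the screenshots would look roughly like the following; tracing DispatcherServlet#doDispatch is my assumption of "the Spring MVC entry point", and the output itself is omitted:

# trace the Spring MVC dispatch entry and print per-node timings (a sketch)
[arthas@24851]$ trace org.springframework.web.servlet.DispatcherServlet doDispatch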

[jboss@VM_10_91_centos tmp]$ curl -w "@curl-time.txt" http://127.0.0.1:7744/send
success
              http: 200
               dns: 0.001s
          redirect: 0.000s
      time_connect: 0.001s
   time_appconnect: 0.000s
  time_pretransfer: 0.001s
time_starttransfer: 0.115s
     size_download: 7bytes
    speed_download: 60.000B/s
                  ----------
        time_total: 0.115s

This call took 115ms on the caller's side, but according to the Arthas trace, Spring MVC only consumed 18ms. Where did the remaining 97ms go?

Combined with the local test, Spring MVC can be ruled out. The last and only remaining suspect is Tomcat.

However, I am not familiar with the Tomcat source code either, and I don't even know the request entry point, so it is difficult to pick which classes to trace inside Tomcat...

But that doesn't matter. With Arthas, we can use the stack command to search the call path in reverse, using org.springframework.web.servlet.DispatcherServlet as the parameter:

stack: output the call paths through which the current method is invoked

Many times we know a method gets executed, but there are many possible paths to it, or we simply don't know where it is being called from. That's when the stack command is needed.

(screenshots: output of the stack command for DispatcherServlet)
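The stack invocation behind the screenshots would be roughly the following (a sketch of the command only, output omitted):

# print the call paths that lead into DispatcherServlet
[arthas@24851]$ stack org.springframework.web.servlet.DispatcherServlet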

From the stack log, the call stack of DispatcherServlet can be seen intuitively. With such a long path, which class should be traced? (Tracing the Spring MVC filters is skipped here; they were indeed traced during the actual troubleshooting, but this weird time cost was not caused by a filter.)

Experienced developers can probably guess where to start from the name alone: org.apache.coyote.http11.Http11Processor.service. Judging from the name, it is the HTTP/1.1 processor, which looks like a good entry point. Let's trace it:

(screenshots: trace output for org.apache.coyote.http11.Http11Processor.service)
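Again, only the screenshots were posted originally; the command itself would be roughly:

# trace the Tomcat HTTP/1.1 processor entry
[arthas@24851]$ trace org.apache.coyote.http11.Http11Processor service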

The log shows a time-consuming point of 129ms (the time is longer than without Arthas attached because of the performance overhead Arthas itself introduces, so use it with caution in production). This is the problem point we were looking for.

Now that the problem point has been found, how do you locate what caused the problem and how do you solve it?

Let's keep tracing, drilling down into specific code blocks. For performance reasons, trace does not display the full call path; if the path is too deep, you can only drill in manually, always following the method that takes the longest time:

(screenshot: trace output while drilling down along the most time-consuming path)

After some tedious manual in-depth tracing…

(screenshot: trace output after the manual drill-down)

Found a point worth pausing to think about:

 +---[min=0.004452ms,max=34.479307ms,total=74.206249ms,count=31] org.apache.catalina.webresources.TomcatJarInputStream:getNextJarEntry() #117

This method was invoked 31 times, taking 74ms in total. Judging from the name, it should be the time Tomcat spends reading entries from a jar package. So is it the cost of loading 31 jar packages, or the cost of loading resources from certain jar packages 31 times?

The comment in the source code of the TomcatJarInputStream class reads:

The purpose of this sub-class is to obtain references to the JarEntry objects for META-INF/ and META-INF/MANIFEST.MF that are otherwise swallowed by the JarInputStream implementation.

Roughly speaking, its purpose is to obtain the META-INF/ and META-INF/MANIFEST.MF entries of a jar package. This is only a subclass; most of the functionality is in the parent class JarInputStream.

At this point you can probably already guess the problem: Tomcat loads the META-INF/ and META-INF/MANIFEST.MF resources inside the jar packages, and that is where the time goes. As for why consecutive requests are fast, that should be Tomcat's caching mechanism (the source-code analysis is introduced below).

Don't rush to a conclusion; try to pin the problem down to the details through Arthas first, and continue the manual in-depth trace:

[arthas@24851]$ trace org.apache.catalina.webresources.TomcatJarInputStream *
Press Q or Ctrl + C to abort.
Affect(class-cnt:1, method-cnt:4) cost in 44 ms.
`---ts=2019-09-14 21:37:47;thread_name=http-nio-7744-exec-5;id=14;is_daemon=true;priority=5;TCCL=org.springframework.boot.loader.LaunchedURLClassLoader@20ad9418
    `---[0.234952ms] org.apache.catalina.webresources.TomcatJarInputStream:createZipEntry()
        +---[0.039455ms] java.util.jar.JarInputStream:createZipEntry() #43
        `---[0.007827ms] java.lang.String:equals() #44

`---ts=2019-09-14 21:37:47;thread_name=http-nio-7744-exec-5;id=14;is_daemon=true;priority=5;TCCL=org.springframework.boot.loader.LaunchedURLClassLoader@20ad9418
    `---[0.050222ms] org.apache.catalina.webresources.TomcatJarInputStream:createZipEntry()
        +---[0.001889ms] java.util.jar.JarInputStream:createZipEntry() #43
        `---[0.001643ms] java.lang.String:equals() #46
# There are 31 such trace entries in total; the rest have been omitted

Judging from the method names, this is still loading resources, and the trace has already reached JDK source code. Let's look at the createZipEntry implementation that TomcatJarInputStream ends up calling:

/**
 * Creates a new <code>JarEntry</code> (<code>ZipEntry</code>) for the
 * specified JAR file entry name. The manifest attributes of
 * the specified JAR file entry name will be copied to the new
 * <CODE>JarEntry</CODE>.
 *
 * @param name the name of the JAR/ZIP file entry
 * @return the <code>JarEntry</code> object just created
 */
protected ZipEntry createZipEntry(String name) {
    JarEntry e = new JarEntry(name);
    if (man != null) {
        e.attr = man.getAttributes(name);
    }
    return e;
}

createZipEntry has a name parameter; judging from the comments, it is the name of the jar/zip entry. If we can obtain key information like this name, we can locate the problem directly. We can keep using Arthas: the watch command dynamically observes a method's call data.

watch: observe method execution data

It lets you conveniently observe the invocations of a specified method. The observable scope includes the return value, thrown exceptions, and input parameters, and the corresponding variables can be inspected by writing OGNL expressions.

Watch the input parameters of this method:

(screenshot: watch output showing the input parameter of createZipEntry)
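The watch invocation behind the screenshot would look roughly like this; the OGNL expression and the expansion depth are my assumptions:

# observe the first input parameter (the entry name) passed to createZipEntry
[arthas@24851]$ watch org.apache.catalina.webresources.TomcatJarInputStream createZipEntry '{params[0]}' -x 2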

Now the specific resource being loaded is directly visible, and it is a familiar name: swagger-ui, a REST API documentation tool from abroad. Domestic developers have built a set of Spring MVC integration tools on top of swagger-ui which, driven by annotations, automatically generate the API-definition JSON files that swagger-ui needs. It is quite convenient to use, but also quite intrusive.

After deleting the swagger jar packages, the weird 70+ ms problem disappeared.

<!-- Delete these two dependencies from the pom. These two packages are wrappers made by domestic developers; swagger-ui itself does not provide a Java spring-mvc support package, it is just a browser-side UI + editor -->
<dependency>
    <groupId>io.springfox</groupId>
    <artifactId>springfox-swagger2</artifactId>
    <version>2.9.2</version>
</dependency>
<dependency>
    <groupId>io.springfox</groupId>
    <artifactId>springfox-swagger-ui</artifactId>
    <version>2.9.2</version>
</dependency>

So why does swagger make requests slow? And why are the static resources inside swagger loaded on every request?

This is in fact a bug in tomcat-embed; it is analyzed in detail below.

Tomcat embed Bug Analysis & Solution

The source-code analysis process is quite long and is not the focus of this article, so it is skipped; here are the results of the analysis directly.

By the way, here is a core class diagram of how Tomcat processes a request:

(image: core class diagram of Tomcat request processing)

Why does each request load the static resources in the Jar package

The key lies in the org.apache.catalina.mapper.Mapper#internalMapWrapper method. The way this version handles requests is problematic and causes the static resources to be checked on every request.

Why there is no problem with consecutive requests

Because Tomcat caches the resolution of such static resources: it looks in the cache first and only re-resolves after the cache expires. See org.apache.catalina.webresources.Cache for details; the default expiration time (ttl) is 5000ms.

Why it cannot be reproduced locally

Strictly speaking, the problem only shows up when the application is started from the fat jar built by the spring-boot packaging plugin. Because the startup method is different, when running in the IDE Tomcat uses different classes to handle static resources, so the problem does not appear.

How to solve

Just upgrade the tomcat-embed version

The version where the bug currently appears is:

spring-boot:2.0.2.RELEASE, the built-in tomcat embed version is 8.5.31

This problem can be solved by upgrading tomcat embed version to 8.5.40+. The new version has been fixed.

Override the Tomcat version via the Spring Boot pom properties

If the project uses Maven and inherits from Spring Boot, i.e. the parent is configured as spring-boot-starter-parent, or the Spring Boot dependencies are imported in dependencyManagement:

<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.0.2.RELEASE</version>
    <relativePath/> <!-- lookup parent from repository -->
</parent>

Just overwrite the properties directly in the pom:

<properties>
    <tomcat.version>8.5.40</tomcat.version>
</properties>
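After overriding the property, it is worth confirming which tomcat-embed version actually ends up on the classpath, for example (a sketch):

# the embedded Tomcat jars should now show the overridden version (8.5.40)
mvn dependency:tree | grep tomcat-embed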

Upgrade spring boot version

The tomcat embed version in springboot 2.1.0.RELEASE is already greater than 8.5.31, so directly upgrading springboot to this version and above can solve this problem.


End of text

