Improper new ArrayList usage causes a CPU spike





Yesterday, the CPU of one of our online containers suddenly spiked. This was my first time troubleshooting this kind of problem, so I'm writing it down~

Foreword

First, some background. On Thursday I was writing a document when an online alarm suddenly came in: CPU usage had reached more than 90%. I went to the platform monitoring system to check the container, and in the JVM monitoring I found that one pod had triggered 61 youngGc and one fullGc within two hours. This is a serious and rare situation. Since I had never investigated this kind of problem before, I had to search around as well, but I did come out of the process with some thoughts of my own, so I'll share them with you~

The scene

Let me first show you what a normal GC monitoring curve looks like (for confidentiality, I redrew it myself based on the platform monitoring):

Normal JVM monitoring curve

JVM monitoring curve at the time of the problem

You can see that under normal circumstances the system does very little GC (this depends on business load and JVM memory allocation), while in the second chart a large number of abnormal GCs occurred and even fullGc was triggered, so I immediately started analyzing.

Detailed analysis

First of all, the abnormal GC occurred on only one pod (the system has multiple pods). Find the corresponding pod in the monitoring system, go into the pod to investigate the cause, and stay calm while troubleshooting.

  1. After entering the pod, run top to see how much of the system's resources each Linux process is using (since I am reconstructing these steps after the fact, the resource usage is no longer high; just follow along).


  2. Analyze the resource usage in the context of what was happening at the time.


top

At the time, the CPU usage of the process with pid 1 had reached about 130% (multi-core), so I concluded there was a problem with the Java application. Press Ctrl+C to exit and continue.

  3. Run top -H -p pid, where pid is the process id with the highest resource usage from the previous step. This command shows the ids of the threads that are actually using the most CPU.


top -H -p pid

  4. The per-thread resource usage appears. In this table the PID column is actually the id of the thread, which we will call tid.


tid

  5. I remember that the tid at the time was 746 (the screenshot above is just me repeating the steps for you). Use the command printf "%x\n" 746 to convert the thread tid to hexadecimal.


Convert tid to hexadecimal

Because thread ids appear in hexadecimal in the stack dump (the nid field in jstack output), we need to convert the decimal tid to hex first.
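
If you would rather sanity-check the conversion in code instead of with printf, here is a minimal Java sketch of the same step (746 is just the example tid from above):

```java
public class TidToHex {
    public static void main(String[] args) {
        long tid = 746; // decimal thread id taken from top -H -p <pid>
        // jstack prints native thread ids in hexadecimal (the nid field),
        // so we convert the decimal tid before searching the stack dump.
        System.out.println(Long.toHexString(tid)); // prints "2ea"
    }
}
```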

  6. Run jstack pid | grep 2ea > gc.stack

jstack pid | grep 2ea >gc.stack


jstack

To explain: jstack is one of the monitoring and tuning tools shipped with the JDK. jstack generates a thread snapshot of the JVM at the current moment, which lets us inspect the thread stack information inside a Java process. Here we pipe the stack output through grep to collect only the lines for thread 2ea and redirect them into a gc.stack file (the file name is arbitrary, I just made it up).
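
As an aside, the same kind of snapshot can also be taken from inside the application through the JDK's ThreadMXBean. This is not what was done here (jstack from the shell was enough), just a minimal sketch of what a thread snapshot contains:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadSnapshot {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Dump all live threads; the two flags skip monitor/synchronizer details.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            // Note: getThreadId() is the Java-level id, not the OS-level nid
            // that top -H and jstack report.
            System.out.printf("%s (id=%d) %s%n",
                    info.getThreadName(), info.getThreadId(), info.getThreadState());
        }
    }
}
```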

  7. At the time I first ran cat gc.stack, but there was a lot of data and it was hard to read inside the container, so I decided to download it and browse it locally. Because the company restricts access between machines, I had to go through the jump server: first find an idle machine, pull the file onto it, and then pull it from there to my local machine (the local machine can reach the jump server). Start by running python -m SimpleHTTPServer 8080 (Linux ships with Python); this opens a simple HTTP server so the file can be fetched from outside.


Enable http service

Then log in to the jump server and download the file with curl: curl -O http://<ip address>/gcInfo.stack

For the convenience of demonstration, I replaced the IP address with a fake one in the picture.


curl

Then download it from the jump server to the local machine in the same way. Remember to shut down the temporary HTTP service started with Python afterwards.

  8. With the file downloaded locally, open it in an editor, search for 2ea, and find the stack information for the thread with nid 2ea.


Find the stack information with nid 2ea

Then, using the class and line numbers in the stack trace, locate the corresponding impl code and analyze it.

  9. It turned out that the asynchronous Excel export reused the public list-query interface. That list interface pages query results in batches of at most 200 records, while the amount of data each user is permitted to export ranges from tens of thousands to hundreds of thousands of rows.


export excel

On top of that, the filtering logic used nested loops, and with this particular business data the conditions matched easily. Each new ArrayList in Java returns a new List instance (probably no need to spell that out (; 1_1)), and all of the lists created this way stay reachable until the whole method finishes, so they cannot be collected before then. That is what triggered the repeated GCs, and after restarts it affected other pods as well. We fixed the code, released it urgently, and the problem was solved~
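
To make the pattern concrete, here is a minimal Java sketch of the shape described above; the class, method, and type names are my own placeholders, not the actual code from the incident. Every page of 200 records goes through nested loops, and every new ArrayList created along the way stays reachable through the result until the method returns, so with tens or hundreds of thousands of rows the young generation fills up fast and GC churns:

```java
import java.util.ArrayList;
import java.util.List;

public class ExportJob {

    private static final int PAGE_SIZE = 200;

    // Hypothetical sketch of the problematic shape: everything allocated here
    // stays reachable until the method returns, so none of it can be collected.
    List<Row> loadAllRowsForExport(QueryService queryService, long userId, int totalPages) {
        List<Row> all = new ArrayList<>();                // grows to hundreds of thousands of rows
        for (int page = 1; page <= totalPages; page++) {
            List<Row> batch = queryService.query(userId, page, PAGE_SIZE);
            for (Row row : batch) {
                List<String> cells = new ArrayList<>();   // one extra list per row, alive until the end
                for (String column : row.columns()) {
                    cells.add(format(column));
                }
                all.add(new Row(cells));
            }
        }
        return all;                                       // only now do the lists become collectable
    }

    private String format(String value) {
        return value == null ? "" : value.trim();
    }

    // Minimal supporting types so the sketch compiles
    record Row(List<String> columns) {}

    interface QueryService {
        List<Row> query(long userId, int page, int pageSize);
    }
}
```

A common way to relieve this kind of pressure is to write each batch of 200 rows to the Excel file as soon as it is fetched and let it become unreachable, instead of accumulating the entire result in memory before writing.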

Conclusion

Don't panic when you run into production problems. When a problem occurs, first make sure the service stays available, then analyze the limited information layer by layer until you find the root cause. If you know arthas, troubleshooting gets even easier!

Source: https://juejin.cn/post/7139202066362138654

END
