Improper use of new ArrayList causes the CPU to spike



Source: juejin.cn/post/7139202066362138654

  • Preface

  • The scene at the time

  • Normal JVM monitoring curve

  • The JVM monitoring curve that caused the problem

  • Detailed analysis

  • Conclusion

Yesterday, the CPU of an online container suddenly spiked. This was my first time troubleshooting this kind of problem, so I am writing it down~

Preface

First of all, here is the situation. I was writing a document on Friday when I suddenly received an online alarm: CPU usage had reached more than 90%. I went to the platform's monitoring system to check the container and, in the JVM monitoring, found that one pod had triggered 61 young GCs and one full GC within two hours. That is very serious and very rare. Since I had never investigated this kind of problem before, I also had to resort to Baidu, but I had some thoughts of my own along the way, so I will share the whole process with you~

The scene at the time

Let me first show you what a normal GC monitoring curve looks like (for confidentiality, I redrew it myself based on the platform monitoring):

Normal JVM monitoring curve

[Figure: normal JVM monitoring curve]


The JVM monitoring curve that caused the problem

[Figure: JVM monitoring curve that caused the problem]

As you can see, under normal circumstances the system triggers very few GCs (this depends on the business load and the JVM memory allocation), but in the second chart there were a large number of abnormal GCs, which even triggered a full GC, so I immediately started analyzing.

Detailed analysis

First of all, the abnormal GC occurred on only one pod (the system runs multiple pods). Find the corresponding pod in the monitoring system, go into the pod to look for the cause, and stay calm while troubleshooting.

  1. After entering the pod, run top to check how much of the system's resources each Linux process is using (since I am reproducing the steps after the fact, the resource usage shown here is not high; just follow along).

[Figure: top output]

  2. Analyze the resource usage in combination with the situation at the time.

[Figure: top output at the time of the incident]

At the time, the CPU usage of the process with pid 1 reached 130% (multi-core machine). From that I concluded that the Java application had a problem. Press Ctrl+C to exit top and continue.

  3. Run top -H -p pid, where pid is the process id with the highest resource usage from the previous step. This command shows the ids of the threads that are actually occupying the most CPU.

[Figure: top -H -p pid output]

  4. The resource usage of the individual threads now appears. The PID column in this view is actually the thread id, which we will call the tid.

[Figure: per-thread resource usage (tid)]

  5. I remember that the tid at the time was 746 (the screenshot above is just me repeating the steps for you). Use the command printf "%x\n" 746 to convert the thread tid to hexadecimal.

[Figure: tid converted to hexadecimal]

The conversion is needed because thread ids appear in hexadecimal (the nid field) in the thread stack dump.
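If you prefer to stay in Java, the same decimal-to-hexadecimal conversion can be double-checked with Integer.toHexString; a trivial sketch:

```java
// Quick check of the conversion done above with printf "%x\n" 746:
// jstack prints native thread ids (nid) in hexadecimal, so tid 746 shows up as nid=0x2ea.
public class TidToHex {
    public static void main(String[] args) {
        System.out.println(Integer.toHexString(746)); // prints "2ea"
    }
}
```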

  6. Enter jstack pid | grep 2ea > gc.stack

jstack pid | grep 2ea > gc.stack

[Figure: jstack output]

To explain: jstack is one of the monitoring and tuning tools that ships with the JDK. It generates a snapshot of all the threads in the JVM at the current moment, which lets us inspect the thread stacks of a given Java process. Here we pipe the stack output through grep to collect the lines belonging to thread 2ea and redirect them into a gc.stack file (the file name is arbitrary; I just picked one).
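As a side note, a similar snapshot can also be taken from inside the JVM with the standard ThreadMXBean API; a minimal sketch (not what I used at the time, just an alternative view):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Dumps information about all live threads, similar in spirit to a jstack snapshot.
// Note: ThreadInfo.getThreadId() is the JVM-level thread id, not the native nid
// (the hex id reported by jstack and top -H), so it cannot replace the tid lookup above.
public class ThreadDumpSketch {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.out.print(info); // thread name, state, and a (truncated) stack trace
        }
    }
}
```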

  7. At the time I first ran cat gc.stack, but there was a lot of data and it was hard to read inside the container, so I decided to download it and browse it locally. Because the company restricts direct access to each machine, I had to go through the jump server: first find an idle machine A, download the file to A, and then download it from A to my local machine (local access to the jump server is allowed). On the pod, run python -m SimpleHTTPServer 8080 (Linux comes with Python); this starts a simple HTTP server so the file can be fetched from outside.

[Figure: starting the HTTP service]

Then log in to the jump server and download the file with curl: curl -O http://<ip address>:8080/gcInfo.stack

For the convenience of the demonstration, I replaced the IP in the screenshot with a fake one.

[Figure: curl download]

After that, use the same method to download the file from the jump server to your local machine. Remember to shut down the temporary HTTP service started with Python.

  8. With the file downloaded locally, open it in an editor, search for 2ea, and find the stack entry whose nid is 2ea.

[Figure: stack information with nid 2ea]

Then find the corresponding implementation class and analyze the code according to the line numbers in the stack trace.

  9. It turned out that an asynchronous export to Excel was reusing the public list query interface. That list interface pages query results in batches of at most 200 rows, while the amount of data each user is allowed to export ranges from tens of thousands to hundreds of thousands of rows (a hypothetical sketch of this pattern follows the figure below).

[Figure: exporting to Excel]
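To make the pattern concrete, here is a hypothetical sketch of what such an export loop looks like; the names (ProblematicExport, pageQuery, exportAll) are illustrative and not the actual project code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical sketch: the async Excel export drives the shared paged list query
// (at most 200 rows per call) and accumulates every page into one ArrayList that
// stays reachable until the whole export method returns.
public class ProblematicExport {

    // pageQuery.apply(pageNo, pageSize) stands in for the shared list-query interface
    static List<String> exportAll(BiFunction<Integer, Integer, List<String>> pageQuery) {
        List<String> allRows = new ArrayList<>(); // grows to tens or hundreds of thousands of rows
        int pageNo = 1;
        List<String> page;
        do {
            page = pageQuery.apply(pageNo++, 200); // at most 200 rows per page
            allRows.addAll(page);                  // everything is held on the heap at once
        } while (!page.isEmpty());
        return allRows;                            // only collectable after the export finishes
    }
}
```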

Moreover, the permission check here used nested loops, and with this business data it matched a large number of rows. In Java, every new ArrayList allocates a new List collection on the heap (perhaps needless to spell out in such detail), and until the whole method ends those lists stay alive, so their lifetime spans the entire export. That is what triggered GC after GC; after restarts it affected other pods as well. The code was then fixed and released urgently, and the problem was solved~
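For comparison, a hedged sketch of one possible fix along these lines (an assumption about the shape of the change, not the project's exact code): export page by page and hand each batch straight to the Excel writer, so no page stays reachable once it has been written.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

// Hypothetical streaming version: each 200-row page is written out immediately and
// then becomes unreachable, so the heap never holds the full export at once.
public class StreamingExport {

    static void exportByPage(BiFunction<Integer, Integer, List<String>> pageQuery,
                             Consumer<List<String>> excelWriter) {
        int pageNo = 1;
        List<String> page;
        while (!(page = pageQuery.apply(pageNo++, 200)).isEmpty()) {
            excelWriter.accept(page); // write this batch, then let it become garbage
        }
    }
}
```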


Conclusion

Don't panic when you run into production problems. First make sure the service stays available, then work through the limited information you have, layer by layer, until you find the root cause. If you know Arthas, troubleshooting becomes even easier!

