Improper use of new ArrayList causes a CPU spike



Source: juejin.cn/post/7139202066362138654

  • Preface

  • The scene at that time

  • Normal JVM monitoring curve chart

  • The JVM monitoring curve that caused the problem

  • Specific analysis

  • Conclusion


Yesterday the CPU of an online container suddenly spiked. This was my first time troubleshooting this kind of problem, so I'm writing it down~

Preface

The problem was this: on Friday I was writing a document when an online alert suddenly came in, showing CPU usage above 90%. I went to the platform monitoring system to check the container and found that, according to the JVM monitoring, one pod had produced 61 young GCs and one full GC within two hours. That is particularly serious and rare. Since I had never investigated this kind of problem before, I was searching Baidu as I went, but I did have a line of thought through the whole process, so I'll share it with you~


The scene at that time

Let me first show you a normal GC monitoring curve (for confidentiality, I redrew it myself based on the platform monitoring):


Normal JVM monitoring curve chart

[Chart: normal JVM monitoring curve]

The JVM monitoring curve that caused the problem

[Chart: the JVM monitoring curve that caused the problem]

You can see that under normal circumstances the system does very little GC (this depends on business load and the JVM memory allocation), while in the second chart a large amount of abnormal GC activity even triggered full GC, so I started analyzing immediately.

Specific analysis

First, the abnormal GC occurred on only one pod (the system runs multiple pods). Find the corresponding pod in the monitoring system, enter the pod to look for the cause, and stay calm while troubleshooting.

  1. After entering the pod, run top to see each Linux process's use of system resources (because this write-up is after the fact, resource usage is no longer high; just follow the steps).


  2. Analyze the resource usage in light of the situation at the time.

[Screenshot: top]

At the time, the CPU usage of the process with pid 1 reached 130% (multiple cores), so I concluded the problem was in the Java application. Press Ctrl+C to exit and continue.

  3. Run top -H -p pid to see the ids of the threads that actually occupy the most CPU; here pid is the process id with the highest resource usage from the previous step.

[Screenshot: top -H -p pid]
  4. The resource usage of individual threads appears. The PID column in this table is the thread id, which we'll call the tid.

[Screenshot: thread ids (tids)]
  5. I remember the tid at the time was 746 (the screenshots above are just me repeating the steps for you). Use the command printf "%x\n" 746 to convert the thread tid to hexadecimal.

[Screenshot: converting the tid to hexadecimal]

We need this conversion because thread ids appear in hexadecimal (the nid field) in the stack dump.
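
As a quick sanity check, the same conversion can be reproduced in Java; this tiny sketch is only for illustration:

    public class TidToHex {
        public static void main(String[] args) {
            // 746 is the decimal thread id from top -H; jstack prints it
            // in hex as nid=0x2ea, which is what we grep for below.
            System.out.println(Integer.toHexString(746)); // prints: 2ea
        }
    }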

  6. Run jstack pid | grep 2ea > gc.stack

jstack pid | grep 2ea > gc.stack

[Screenshot: jstack]

To explain: jstack is one of the monitoring and tuning tools that ships with the JDK. It generates a snapshot of the JVM's threads at the current moment, so we can use it to inspect the thread stack information inside a Java process. Here we pipe the stack output through grep to collect the information for thread 2ea and write it to a file, gc.stack; the name was chosen casually.
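
As an aside, if you cannot shell into the container to run jstack, a similar snapshot can be taken in-process through the JDK's ThreadMXBean. This is a minimal sketch of that alternative, not what I did at the time:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class StackSnapshot {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            // dumpAllThreads(lockedMonitors, lockedSynchronizers) gives
            // roughly the same information as jstack -l.
            for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
                // Caveat: ThreadInfo carries the Java-level thread id, not
                // the OS nid printed by jstack, and its toString() output
                // truncates deep stacks.
                System.out.print(info);
            }
        }
    }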

  7. At the time I ran cat gc.stack first, but there was too much output to read in the container, so I downloaded the file to browse locally. Because the company restricts access between machines, I had to go through the jump server: first download the file to a reachable idle machine A, then download it from A to my local machine (the local machine can reach the jump server). Run python -m SimpleHTTPServer 8080 (Linux ships with Python; on Python 3 the equivalent is python3 -m http.server 8080). This starts a simple HTTP service for external access.

[Screenshot: starting the HTTP service]

Then log in to the jump server and download the file with curl: curl -O http://<ip>:8080/gcInfo.stack

For the convenience of demonstration, I replaced the IP address with a fake one in the picture.

[Screenshot: curl]

Then use the same method to download the file from the jump server to your local machine. Remember to shut down the simple HTTP service started with Python.
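
If Python happens not to be available, the JDK itself is enough to serve a file over HTTP. Here is a minimal sketch using the built-in com.sun.net.httpserver package (available since Java 6); the file name and port are just the ones from this walkthrough:

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class ServeFile {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            // Serve the stack dump at http://<ip>:8080/gc.stack
            server.createContext("/gc.stack", exchange -> {
                byte[] body = Files.readAllBytes(Path.of("gc.stack"));
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(body);
                }
            });
            server.start();
        }
    }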

  8. With the file downloaded locally, open it in an editor, search for 2ea, and find the stack information of the thread with nid=0x2ea.

[Screenshot: the stack information for the thread with nid 2ea]

Then locate the corresponding implementation class from the line numbers in the stack and analyze the program.

  9. It turned out that an asynchronous Excel export was built on top of the public list-query interface. That interface returns at most 200 records per page, and each user can export tens of thousands to hundreds of thousands of records depending on their permissions.

[Screenshot: the Excel export code]

On top of that, the matching logic used nested loops, and given the business data the lookup often found no match. Each new ArrayList in those loops returns a fresh List (perhaps needless to say in such detail (; one_one), and every list it created stayed alive until the whole method finished. After several GC-triggered restarts, the load spilled over to the other pods as well. We fixed the code, released urgently, and the problem was solved~ A sketch of the anti-pattern follows below.
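
To make the failure mode concrete, here is a minimal sketch of that anti-pattern and the obvious fix, as I understand it; every name here (Dao, listQuery, Record, and so on) is invented for illustration and is not the real business code:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ExportJob {

        interface Dao { List<Record> listQuery(int page, int size); }

        record Record(long ownerId, String data) {
            String toRow() { return data; }
        }

        // Anti-pattern: for every record of every 200-row page, a fresh
        // ArrayList is allocated for the nested-loop membership check, while
        // the accumulated rows keep the heap growing until the method returns
        // -> young GC storms, then full GC.
        List<String> exportSlow(List<Long> allowedIds, Dao dao) {
            List<String> rows = new ArrayList<>();
            for (int page = 1; ; page++) {
                List<Record> batch = dao.listQuery(page, 200);
                if (batch.isEmpty()) break;
                for (Record r : batch) {
                    List<Long> copy = new ArrayList<>(allowedIds); // per-record allocation
                    if (copy.contains(r.ownerId())) {              // O(n) nested scan
                        rows.add(r.toRow());
                    }
                }
            }
            return rows;
        }

        // Fix sketch: hoist the allocation out of the loops and use a Set
        // for O(1) lookups, so the loop creates no per-record garbage.
        List<String> exportFast(List<Long> allowedIds, Dao dao) {
            Set<Long> allowed = new HashSet<>(allowedIds); // built once
            List<String> rows = new ArrayList<>();
            for (int page = 1; ; page++) {
                List<Record> batch = dao.listQuery(page, 200);
                if (batch.isEmpty()) break;
                for (Record r : batch) {
                    if (allowed.contains(r.ownerId())) rows.add(r.toRow());
                }
            }
            return rows;
        }
    }

In the real fix you would also consider streaming each batch straight to the Excel writer instead of holding every exported row in memory at once.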

Conclusion

When you run into a production problem, don't panic. First make sure the service is still available, then analyze the limited information layer by layer until you find the root cause. If you know Arthas, troubleshooting becomes even easier!


If the article is helpful, please read it and forward it.
Thank you for your support (*^__^*)