Source: juejin.cn/post/7139202066362138654
- Preface
- The scene at the time
- Normal JVM monitoring curve
- The problematic JVM monitoring curve
- Analysis
- Conclusion
Yesterday the CPU of an online container suddenly spiked. It was my first time troubleshooting this kind of problem, so I'm writing it down~

Preface

First, the situation. On Friday I was writing a document when an online alarm suddenly fired: CPU usage had climbed above 90%. I went to the platform monitoring system to check the container and found that, according to the JVM monitoring, one pod had run 61 young GCs and one full GC within two hours. That is particularly serious and rare. Since I had never investigated this kind of problem before, I was searching Baidu as I went, but I did form some ideas of my own along the way, so I'd like to share the whole process with you~
The scene at the time

Let me first show a normal GC curve (for confidentiality, I redrew the charts myself based on the platform monitoring):
Normal JVM monitoring curve

The problematic JVM monitoring curve

As you can see, under normal circumstances the system does very little GC (how much depends on the business load and the JVM memory allocation), but in the second chart there was a large amount of abnormal GC activity, even triggering full GCs, so I started analyzing immediately.
Analysis

First, the abnormal GC occurred on only one pod (the system runs multiple pods). Find that pod in the monitoring system and enter it to look for the cause. Stay calm while troubleshooting.

After entering the pod, run top to see how much of the system's resources each Linux process is using (since I am re-tracing the steps after the fact, the usage shown now is not high; just follow along).

Combine the resource usage with the situation at the time. Back then, the process with pid 1 — my Java process — was at around 130% CPU (multi-core), so I concluded the Java application was the problem. Press Ctrl+C to quit top and continue.
Next, run top -H -p <pid>, where <pid> is the process id with the highest resource usage from the previous step. This shows which threads inside that process are actually using the most CPU.

A per-thread usage table appears. In this view the PID column is the thread id, which we'll call the tid.
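The two top steps above can also be captured non-interactively with batch mode, which is convenient inside a pod. A small sketch, assuming the procps top found on typical Linux containers, with the shell's own pid standing in for the busy Java process:

```shell
# One-shot snapshot of overall process usage (no interactive UI)
top -b -n 1 | head -n 12

# Per-thread view of a single process; in this output the PID column
# is the thread id (tid). $$ is this shell's pid, a stand-in for the
# Java process id found in the previous step.
top -H -b -n 1 -p $$ | head -n 12
```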
I remember the tid at the time was 746 (the screenshots here are just me re-tracing the steps). Run printf "%x\n" 746 to convert the thread tid to hexadecimal. The conversion is needed because thread ids appear in the thread dump in hexadecimal, as the nid field.
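For reference, the conversion itself is just printf with the %x format; a quick check:

```shell
# Convert the decimal thread id (tid) to the hex form jstack prints as nid
tid=746
nid=$(printf '%x' "$tid")
echo "$nid"    # prints 2ea
```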
Then run:

jstack <pid> | grep 2ea > gc.stack

To explain: jstack is one of the monitoring and tuning tools shipped with the JDK. It takes a snapshot of all the threads in the JVM at that moment, letting us inspect the thread stacks of a Java process. Here we pipe the dump through grep to pull out the lines for thread 2ea and redirect them into a file called gc.stack (the name was picked casually). In hindsight, grep -A <n> keeps the stack frames below the matching header line as well, or you can simply dump the full jstack output to the file and search it later.
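Since jstack needs a live JVM, here is the filtering half of the step sketched against a simulated dump; the thread name, class, and stack frames below are made up for illustration:

```shell
# Simulated jstack output (a real dump comes from: jstack <pid>)
cat > gc.stack.demo <<'EOF'
"async-export-1" #27 daemon prio=5 os_prio=0 nid=0x2ea runnable
   at com.example.ExportServiceImpl.buildRows(ExportServiceImpl.java:88)
   at com.example.ExportServiceImpl.export(ExportServiceImpl.java:41)
EOF

# -A 2 keeps the two stack-frame lines below the matching thread header,
# which a bare `grep 2ea` would drop
grep -A 2 'nid=0x2ea' gc.stack.demo > gc.stack
cat gc.stack
```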
At the time I first cat'ed gc.stack, but there was too much output to read comfortably in the container, so I downloaded it to browse locally. Because the company restricts access between machines, I had to go through a jump host: first get the file onto a spare machine A, then pull it from A down to my laptop (my laptop can reach the jump host). Inside the pod, run python -m SimpleHTTPServer 8080 — Linux ships with Python, and this starts a simple HTTP server for external access (on Python 3 the equivalent is python3 -m http.server 8080).

Then log in to the jump host and download the file with curl, e.g. curl -O http://<ip address>/gcInfo.stack (note that lowercase -o expects an output filename; capital -O keeps the remote name). For the demonstration, the IP address shown was a fake one.

Then pull the file from the jump host to the local machine the same way. Remember to shut down the temporary HTTP server started with Python afterwards.
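The pull-through-HTTP hop can be sketched end to end on one machine, assuming python3 and curl are installed; in the real incident the server ran in the pod and curl ran on the jump host, and the port and file names here are placeholders:

```shell
# A dummy dump standing in for the real gc.stack
echo 'demo stack contents' > gc.stack

# Serve the current directory over HTTP in the background
# (python -m SimpleHTTPServer 8080 is the Python 2 spelling of this)
python3 -m http.server 8765 >/dev/null 2>&1 &
SRV=$!
sleep 1

# Receiving side: -O keeps the remote file name
mkdir -p fetched && (cd fetched && curl -s -O http://127.0.0.1:8765/gc.stack)

# Always shut the temporary server down when done
kill "$SRV"
```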
With the file local, open it in an editor, search for 2ea, and locate the stack whose nid is 2ea.

Then, following the class and line numbers in those stack frames, open the corresponding Impl class and analyze the code.
It turned out to be an asynchronous export of a file to Excel: the export interface reused the public list-query interface, which returns at most 200 rows per page, and each person's export covered tens of thousands to hundreds of thousands of permission records.

On top of that, the permission check used nested loops, and given the business data it rarely short-circuited early. Each call allocated new ArrayLists to return as List collections (perhaps I don't need to spell it out in that much detail (; one_one)), and all of those lists stayed alive until the whole method returned. So the pod kept triggering GCs and restarting, which in turn put pressure on the other pods. We fixed the code, shipped an emergency release, and the problem was solved~
Conclusion

When you hit a production problem, don't panic. First make sure the service is still available, then analyze the limited information layer by layer until you find the root cause. If you know Arthas, troubleshooting becomes even easier!