“Simple” troubleshooting for the machine always restarting twice during pre-release deployment

Question

The day before yesterday, students reported that magellan, the core application of the search business, was always restarted twice when deployed in the pre-release environment. Just after it was deployed and joint debugging started, it suddenly restarted again, which also led to people always complaining that the search environment was unstable.

The first reaction is that there is a high probability that the application failed to start for the first time, right? It’s impossible for Aone (Alibaba’s internal CI/CD system) to schedule it twice. result:

I have never encountered this before, I can only say:

Does such a magical phenomenon occur occasionally or must it happen? Well, it turned out to be a must. . . . . .

Since it must appear, it can still be checked and is easy to check. Let’s try it~

Check the logs to confirm how many times it was deployed

The development students sent screenshots of the deployment log, as shown in the pictures above:

Stop the application at 01:41, then destroy the container and create a new container
The application started at 01:43 and the startup was successful.
At 01:50, the application started to stop and restart, but there was no new deployment record at this time.

Of course, it’s not very clear in the picture above. Let’s do a simple match: match the logs when stopping and starting (grep -E), and display one line of logs up and down (-C 1), mainly to match The time is printed.

cat magellan_deploy.log |grep -E "INFO: magellan try to stop...|INFO: magellan try to start..." -C 1

Startup log for machine 1

Startup log for machine 2

Of course, the above command only came to mind when I sorted out the overall process at this moment (so it is not consistent with the screenshots provided by the development students). When troubleshooting the problem, I just took a look at the log. However, through the above picture, I discovered another point that I missed before: The first startup time of the two machines is different (started in 2 batches, machine 1 is the first batch to be started, and machine 1 is started in the first batch. 2 is the second batch of starts), but their second restart time is the same, 21:00:02, indicating that they are scheduled at the same time, which also confirms the following conclusion , however, we will ignore this information for now. Whether it is from the picture taken by the development students beforehand or from the picture taken by me afterwards, it can be seen that: The application was indeed restarted twice. The first time should be scheduled by Aone, but I don’t know the second time. Who arranged it.

Will manually restarting the application also cause this problem?

I restarted it manually. After the startup was successful, the application came online. After waiting for 2 or 3 minutes, the application did not restart again. . . Um, does it only happen when Aone is deployed? After reading the deployment logs of the past week, it is indeed basically restarted twice.

Are there any scheduled tasks

Um, no.

I don’t have much idea, let’s see if there is any useful error message.

After looking through it carefully, there are no special or new errors in the application startup log. There is a warning message in the deployment log: Pandoraboot can be compared to Springboot.

I remember reading some articles before, saying that this error may affect the JDK version judgment (mainly because some JDK version judgment scripts are not very robust). In addition, when I deployed manually, there seemed to be an error in one line. It is probably caused by this error. I quickly re-executed sh stop.sh (Alibaba’s internal command to manually offline a certain machine):

It prompts that there is a problem with line 21 of update_setenv.sh (one of Alibaba’s internal application startup scripts). Take a look at the code:

After sorting out the logic, I probably executed java -version to get the version number. If it is greater than or equal to JDK8, replace some JVM parameters. It is estimated to be a legacy from the JDK8 upgrade that year. Why is an error reported? Log in to the machine and execute the following command:

Well, OpenJDK reported a Warning message when executing this command, and the error message was before the version line, so the first line would not match the real version number 8. According to the execution logic of the above code, JAVA_VERSION_MAJOR will get “Server The first “e” in ” retrieves a string, causing an error when compared with the number 8 below. So what will be the impact of error reporting here? The following replacement logic will not be executed. In the end, the JVM startup parameter settings are unreasonable, which may cause the JVM to be killed? Let’s take a look at the application’s current startup parameter settings:

Well, the startup parameters have been changed to those after JDK8,so if the above replacement logic is not executed, it will not affect the final startup parameters. Although it does not affect the startup parameters, there are so many error logs of this line in the deployment log, and the incorrect estimation of JDK version acquisition will also cause some unknown problems. Try to repair it first~

Fix the logic of incorrect JDK version judgment

After looking at the code of update_setenv.sh, I found that it is no longer useful, and the code logic for judging the JDK version is also very problematic and can be deleted directly. Pull changes, delete, deploy pre-releases.

After the deployment is successful and the application starts, log in to the machine and observe the deployment log: the application is deployed normally, there is no JDK warning message, and there is no redeployment after waiting for several minutes. It seems to be resolved. But why was it resolved? It’s not scientific. Logically speaking, this version has little to do with the two restarts. Forget it, don’t worry about it for now, and quickly give feedback to the development students.

In order to pass on this joy to the development as soon as possible, I also deliberately used the fast track-DING, but the result was a slap in the face very quickly.

Is it a runtime issue?

I quickly went to the machine and looked at the log, and found that the machine was successfully started at 20:54, and it was restarted after running for 5 minutes (I only waited 3 or 4 minutes). This means: It’s not a problem when the application starts, but a runtime problem. No wonder I couldn’t observe it before. It turns out that it will be restarted within an uncertain period of time after startup. I told my development classmates that I would try my best to look for it, but in fact I didn’t have any ideas in my heart. . . Ask your Aone classmates. What if Aone is scheduled twice?

Consult Aone students whether there is the possibility of duplicate scheduling

Aone (Alibaba’s internal CI/CD system) consulted R&D Xiaomi in the lower right corner and got a reply soon.

However, the answer given by the classmate was not related to my question, and he probably misunderstood what I meant. He probably thought that I wanted to pause the container and troubleshoot the problem. It is understandable that the problem is rather strange. Before asking the question, I also carefully read Aone’s historical question records and found that there were no similar questions in them. It was most likely not an Aone question. It’s better to try other methods first. If that doesn’t work, ask Aone to confirm whether there is duplicate scheduling.

From what angle can we investigate? It’s a bit hard to investigate anymore. Come to think of it, according to a developer classmate, only pre-release machines have this problem, and online machines don’t have this problem. Let’s compare online and pre-release machines.

Comprehensive comparison between online and pre-release machines

Compare application deployment logs: no exceptions. Compare application startup logs: no exceptions. Compare JVM parameters: a little different, as follows. The first one is the parameters of the online machine, and the second one is the parameters of the pre-release machine.

There are several differences:

The java paths used are different: online is /opt/taobao/install/ajdk-8_2_4-b52/bin/java, and pre-release is /opt/taobao/ java/bin/java
The machine in the pre-release environment has the Debug port enabled: -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000
The machine in the pre-release environment has a Java Agent mounted: -javaagent:/home/admin/remote-debug/libs/remote-debug-agent .jar

Java path: Log in to the machine and take a look. The two paths have soft links and are essentially the same (the difference between pre-release and online is because I deleted update_setenv.sh in the pre-release environment):

The Debug port of the pre-release machine is open: It is estimated that some students are pre-release Debug, so there will be no problem. Mount Java Agent on the machine in the pre-release environment: This is generally a routine operation, such as start-ops used for startup time-consuming analysis, rbc used for code dyeing, etc., which only mount the pre-release to prevent the influence of the Agent. Real online users just don’t know what this remote-debug is used for. After comparing for a long time, I still have no clue. . . It can’t be caused by this Agent, right? Agent generally does not, and cannot restart the application. Forget it, let’s first see what this Agent does

See what remote-debug Agent does

Since it is a Java Agent, and there is no code related to Agent deployment in the application code, the Agent should be delivered through the operation and maintenance plug-in, so that the application is imperceptible. First search in the staragent plug-in market (Alibaba internal operation and maintenance container plug-in market) to see what it does:

I tried a series of keywords such as remote debug, remote debug agent, etc., but found that none of them could be searched. Strange, isn’t the plug-in called this name? ? Then I searched online and found that the information was useless. It was all about remote debugging. There was nothing I could do about it. Who told this plug-in to be called this name?

Yuque search, the internal network is public, and the top ones are useless (actually, I forgot to search on the external network here, in fact, you can search for some relevant information on github). Let’s search the code (Alibaba’s internal code warehouse search platform), but this time I used the keyword “remote-debug-agent” for a more accurate match, which is great!

The paths are exactly the same. It is produced by QA and placed in the euler-service warehouse. Is the name of the plug-in called euler? This script seems to have executed appctl.sh restart to restart! I quickly searched the plug-in market again, and sure enough it was there!

Log in to the pre-release machine and look at the plug-in directory of staragent. Yes, it’s you, originally called euler!

Go into the directory and look at the actual script. Sure enough, it is basically consistent with the code search:

Look at the logic:

In pre-release, test, daily, and project environments, if remote-debug-agent.jar does not exist, download it from oss to the /home/admin directory
If remote-debug-agent has been mounted, do nothing
If remote-debug-agent is not mounted, modify the JVM parameters in the setenv.sh script, add Agent parameters, and then restart the application

Is the second restart to mount the remote-debug-agent Agent?

Observe two restarts of the application

After the first restart, quickly observe:

After waiting for the second reboot, observe:

Sure enough, the second time there was -javaagent:/home/admin/remote-debug/libs/remote-debug-agent.jar! It was almost the same. It should be caused by this euler operation and maintenance plug-in.

Confirm with the person in charge of the plug-in

I asked the maintainer of the euler plug-in to confirm whether this logic exists:

Haha, I finally found it. The test student said: How did you find out? It is difficult to find out that it is caused by this plug-in. I said, it’s so difficult that I’m going to give up. I quickly made an appointment with the test students to communicate in person and confirmed their appeal: they hoped that without the development students needing to modify the code, this Java Agent could be mounted during the pre-release deployment of the application to assist them in collecting some data. But after they studied it, they found that it was impossible to modify the JVM parameters in the startup script before the application started, so they had to restart it. It is scheduled every 10 minutes or so, so there will be a sudden restart 5 minutes after startup. They believe that application startup is generally faster and has less impact on development. In fact, the impact is indeed very small under normal circumstances, but today’s applications generally take 3 minutes to complete the restart. At the same time, magellan releases applications frequently (about 150 pre-release deployments per month), with many developers and many demands. Therefore, it is easy to be discovered by development students:

Now that you know the problem, how do you solve it? Simple. Recently, I happened to be developing and promoting another plug-in with my classmates from the architecture group. Our plug-in was mounted the first time we restarted. There was no need to restart a second time. I quickly recruited my test classmates and classmates from the architecture group. Let’s discuss the modification plan together (the focus of this article is not on how to repair it, so I won’t release the content related to the repair plan). After confirming the plan, the test students hurried back to change the code, saying that they were too embarrassed and that they would definitely optimize it today.

Actually, there is nothing to be embarrassed about. Don’t we all hurt each other? If you are really embarrassed, it’s just a cup of coffee.
at last

First remove the plug-in from the magellan pre-release environment to ensure the development experience, and then remount it after the plug-in logic is repaired.

In terms of design, the operation and maintenance container and the main container are in the same Pod and share the disk. The operation and maintenance container does have permission to operate the processes in the main container, so in the architecture How to prevent it? I don’t have the answer yet, but I wonder if you have any. If so, please leave a message.
If I have to say something profound about this investigation process, it would be “don’t let go of any suspicious place”!

Supplement: If the code search cannot be found, are there any other methods?

On my way home last night, I was thinking about a question. If I couldn’t find any information related to “remote-debug-agent” using code search, are there any other methods? After thinking for a while, I found that there is another stupid method-keyword matching, that is, searching for the “restart” keyword in the full text. It’s stupid, but it’s effective. First search in the /home/admin directory. If there is no result, search in the /home directory. If there is no result, search in the root directory. Why is this method possible? Because looking at the deployment log, the restart command in the application startup script appctrl.sh must have been executed, so it must contain the “appctrl.sh restart” or “restart” keyword.

Well, there were too many files being searched, and the execution was a bit slow. I terminated it. Can it be done faster? Since you want to execute this command, it should generally be a script file. You can first search in the *.sh script file.

In the /home/admin directory, there are fewer results. Now that the search scope has been expanded, there are 57 records in the /home directory, which contain the script of the euler-service plug-in:

Author｜Quchi

Click to try the cloud product for free now and start your practical journey on the cloud!

Original link

This article is original content of Alibaba Cloud and may not be reproduced without permission.

The knowledge points of the article match the official knowledge files, and you can further learn related knowledge. Java Skill TreeHomepageOverview 137861 people are learning the system