android system restart analysis process

Today I encountered a system restart problem. My analysis was confusing at first, but the problem gradually became clear later. I will summarize it here.

Premise

To analyze the system restart problem, first we must be familiar with the android system startup process.

The approximate process is:

  1. Start BootLoader: It can initialize hardware devices and establish a memory space map, thereby bringing the system’s software and hardware environment to a suitable state, so as to prepare the correct environment for the final call to the operating system kernel. [The most common thing we use is to turn off avb and selinux]
  2. Start LinuxKernel: set cache, load driver, etc., and then start the init process
  3. The init process is initialized according to init.rc: init.rc can be regarded as a script, in which you can modify file permissions, set attributes, pull up processes, etc. System processes such as zygote, servicemanager, and surfaceflinger are pulled up by it.
  4. Start zygote: When zygote starts, it will hatch the system_server process
  5. Start system_server: system_server will start system services such as PMS, WMS, AMS etc.
  6. Start AMS: When AMS starts, it will start some UI-related processes such as SystemUi, Launcher, etc.

1. Background

During the testing process, a test colleague found that a factory test application was stuck, and the system restarted after a while. The factory test application is maintained by me. Package name: com.ifpdos.xxx

2. Analysis process

1. The first step

The problem was originally reported to other colleagues, but when they saw the following line of log, they thought it was caused by my application anr.

10-27 19:56:22.946707 788 8448 I am_anr : [0,8369,com.ifpdos.xxx:float,814267975,executing service com.ifpdos.xxx/.FloatNavbarService]

But from my understanding, applying anr generally does not cause the system to restart. If you want to prove your innocence, you can only analyze the log carefully.

2. The second step

There are generally two situations when anr occurs:
1. A certain system service generates ANR, and there are keywords in the corresponding log.

ActivityManager: Dumping to /data/anr/anr_xxx

2. Anr appears in a certain process:

ActivityManager: ANR in xxx (application package name)

Next, we searched globally and found the following content: You can see that a certain service in the system generated anr.

3. The third step

Next, we analyze the anr_2023-10-27-19-56-23-276 file. The first thing that catches our eyes is
[Due to company confidentiality reasons, xxx represents a third-party library of the company]

"main" prio=5 tid=1 Runnable
  | group="main" sCount=0 dsCount=0 flags=0 obj=0x7255dc28 self=0x87ec6210
  | sysTid=8369 nice=0 cgrp=default sched=0/0 handle=0xf5e11470
  | state=R schedstat=( 1123868093 2684110237 2643 ) utm=14 stm=98 core=0 HZ=100
  | stack=0xff73c000-0xff73e000 stackSize=8192KB
  | held mutexes= "mutator lock"(shared held)
  at xxx.utils.BoardTypeUtils.getBoardType(BoardTypeUtils.java:105)
  at xxx.StorageHelper.initBordType(StorageHelper.java:119)
  at xxx.StorageHelper.<init>(StorageHelper.java:111)
  at xxx.StorageHelper.getInstance(StorageHelper.java:175)
  - locked <0x082099f2> (a java.lang.Class<xxx>)
  at com.ifpdos.xxx.App.onCreate(App.java:24)
  at android.app.Instrumentation.callApplicationOnCreate(Instrumentation.java:1192)
  at android.app.ActivityThread.handleBindApplication(ActivityThread.java:6735)
  at android.app.ActivityThread.access$1300(ActivityThread.java:239)
  at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1915)
  at android.os.Handler.dispatchMessage(Handler.java:106)
  at android.os.Looper.loop(Looper.java:223)
  at android.app.ActivityThread.main(ActivityThread.java:7680)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:592)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:952)
 

From the above log, you can clearly see that my application is waiting for a lock in onCreate.
At this point it is easy to think of a deadlock problem.

4. Step 4

Since we think it is a deadlock problem, we search globally for the keyword “waiting to lock” and we will find suprise and many wait locks.

According to the time when the problem occurred in my application, we checked the anr/anr_2023-10-27-19-57-19-998 file

----- pid 788 at 2023-10-27 19:57:21 -----
Cmd line: system_server
Build fingerprint: 'Droidlogic/t982_ar301/t982_ar301:11/RD2A.211001.002/eng.user2.20231025.150328:user/release-keys'
ABI: 'arm'
Build type: optimized
Zygote loaded classes=21264 post zygote classes=3587
Dumping registered class loaders


Indicates that the lock <0x07b403d3> is waiting for the thread with tid=148 to be released;

Search globally for the corresponding lock. The results are as follows:

"Binder:788_1D" prio=5 tid=148 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x8d480ab8 self=0x55005410
  | sysTid=6435 nice=-2 cgrp=default sched=0/0 handle=0x57e691c0
  | state=S schedstat=( 882900055 1221167633 3099 ) utm=37 stm=50 core=3 HZ=100
  | stack=0x57d6e000-0x57d70000 stackSize=1008KB
  | held mutexes=
  at com.android.server.wm.ActivityTaskManagerService$LocalService.getTopApp(ActivityTaskManagerService.java:7348)
  - waiting to lock <0x042f3493> (a com.android.server.wm.WindowManagerGlobalLock)
  at com.android.server.am.ActivityManagerService.getTopAppLocked(ActivityManagerService.java:17974)
  at com.android.server.am.OomAdjuster.updateOomAdjLocked(OomAdjuster.java:434)
  at com.android.server.am.OomAdjuster.updateOomAdjLocked(OomAdjuster.java:335)
  at com.android.server.am.ActivityManagerService.updateOomAdjLocked(ActivityManagerService.java:18020)
  at com.android.server.am.ActiveServices.sendServiceArgsLocked(ActiveServices.java:3235)
  at com.android.server.am.ActiveServices.bringUpServiceLocked(ActiveServices.java:2948)
  at com.android.server.am.ActiveServices.startServiceInnerLocked(ActiveServices.java:801)
  at com.android.server.am.ActiveServices.startServiceLocked(ActiveServices.java:735)
  at com.android.server.am.ActiveServices.startServiceLocked(ActiveServices.java:499)
  at com.android.server.am.ActivityManagerService.startService(ActivityManagerService.java:15056)
  - locked <0x07b403d3> (a com.android.server.am.ActivityManagerService)
  at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:2476)
  at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:2896)
  at android.os.Binder.execTransactInternal(Binder.java:1154)
  at android.os.Binder.execTransact(Binder.java:1123)

You can see that the tid=148 thread occupies this lock, causing a deadlock.

3. Conditions for deadlock

1. Mutually exclusive use, that is, when a resource is used (occupied) by one thread, other threads cannot use it.
2. It cannot be preempted. Resource requesters cannot forcibly seize resources from resource occupiers. Resources can only be released actively by resource occupiers.
3. Request and maintain, that is, when the resource requester maintains possession of the original resource while requesting other resources.
4. Loop waiting, that is, there is a waiting queue: P1 occupies the resources of P2, P2 occupies the resources of P3, and P3 occupies the resources of P1. This creates a waiting loop.
When the above four conditions are true, a deadlock is formed. Of course, in the case of deadlock, if any of the above conditions are broken, the deadlock can be solved
[Because the company code is involved, the solution will not be explained later.]

4. Why does the system restart?

10-27 20:07:32.153853 306 306 E Zygote: Zygote failed to write to system_server FD: Connection refused
10-27 20:07:32.154039 306 306 I Zygote: Process 788 exited due to signal 9 (Killed)
10-27 20:07:32.154151 306 306 E Zygote: Exit zygote because system server (pid 788) has terminated
10-27 20:07:32.156474 364 555 W AudioFlinger: power manager service died !!!

We continued to search the log and found that the system server was killed. Why was it killed?

10-27 20:07:30.766 788 813 W Watchdog: main annotated stack trace:
10-27 20:07:30.789 788 813 W Watchdog: at com.android.server.AlarmManagerService$DeliveryTracker.alarmComplete(AlarmManagerService.java:4697)
10-27 20:07:30.846 1687 2977 D PackageManager: getCustomerIcon: com.android.phasebeam
10-27 20:07:30.846 1687 2765 D PackageManager: getCustomerIcon: com.android.phasebeam
10-27 20:07:30.847 1687 2975 D PackageManager: getCustomerIcon: com.android.phasebeam
10-27 20:07:30.914 788 813 W Watchdog: - waiting to lock <0x0d33ff02> (a java.lang.Object)
10-27 20:07:30.915 788 813 W Watchdog: at com.android.server.AlarmManagerService$3.lambda$doAlarm$0$AlarmManagerService$3(AlarmManagerService.java:1555)
10-27 20:07:30.915 788 813 W Watchdog: at com.android.server.-$$Lambda$AlarmManagerService$3$jIkPWjqS66vG6aFVQoHxR2w4HPE.run(Unknown Source:4)
10-27 20:07:30.916 788 813 W Watchdog: at android.os.Handler.handleCallback(Handler.java:938)
10-27 20:07:30.916 788 813 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)
10-27 20:07:30.916 788 813 W Watchdog: at android.os.Looper.loop(Looper.java:223)
10-27 20:07:30.916 788 813 W Watchdog: at com.android.server.SystemServer.run(SystemServer.java:635)
10-27 20:07:30.916 788 813 W Watchdog: at com.android.server.SystemServer.main(SystemServer.java:419)
10-27 20:07:30.916 788 813 W Watchdog: at java.lang.reflect.Method.invoke(Native Method)
10-27 20:07:30.916 788 813 W Watchdog: at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:592)
10-27 20:07:30.916 788 813 W Watchdog: at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:930)
10-27 20:07:30.926 788 813 W Watchdog: *** GOODBYE!

It can be seen from the log that the watchdog killed SystemServer because it did not wait for the lock for a long time.

5. Summary

Restart problem

1. If you find that the process number of system_server is relatively large, then there is a high probability that it has been restarted.
2. You can use the “Exit zygote” keyword to search to find the time when system_server died, and then search for the following keywords to find the specific reason.
3. Search for “START com.android.internal.os.ZygoteInit” to see the restart time, and then search for the following keywords to find the specific cause of death:
DEBUG: When an exception occurs in the native layer, the stack will be printed.
crash: crash information printing
kill: Some exceptions trigger the system to be forcibly killed.
tombstone: Errors in the virtual machine and layer C trigger the generation of tombstone files
died: Some services died