Android includes a software-level Watchdog that protects important system services such as AMS, WMS, and PMS. These core services all run in the system_server process, so when one of them hangs, the Watchdog usually kills system_server, which restarts the Android system.
The Watchdog mainly checks whether the important threads and locks of the system's core services are blocked. It does this in two ways:
- It monitors several key locks in system_server by trying to acquire each lock on the foreground thread (FgThread, named android.fg).
- It monitors the execution time of several commonly used threads by posting a task to each thread and checking that it runs in time.
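The two checks above can be sketched in plain Java without any Android classes. This is only an illustration of the principle: `LivenessProbe`, `lockIsFree`, and `threadIsResponsive` are hypothetical names, and an `ExecutorService` stands in for an Android `Handler` thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not an AOSP class. */
final class LivenessProbe {
    private LivenessProbe() {}

    /** Returns true if a probe thread could enter synchronized (lock) within timeoutMs. */
    static boolean lockIsFree(Object lock, long timeoutMs) {
        CountDownLatch acquired = new CountDownLatch(1);
        Thread probe = new Thread(() -> {
            synchronized (lock) { // blocks while another thread holds the lock
                acquired.countDown();
            }
        }, "lock-probe");
        probe.setDaemon(true); // a stuck probe must not keep the JVM alive
        probe.start();
        return await(acquired, timeoutMs);
    }

    /** Returns true if the worker thread ran a no-op task within timeoutMs. */
    static boolean threadIsResponsive(ExecutorService worker, long timeoutMs) {
        CountDownLatch ran = new CountDownLatch(1);
        worker.execute(ran::countDown); // queued behind whatever the worker is doing
        return await(ran, timeoutMs);
    }

    private static boolean await(CountDownLatch latch, long timeoutMs) {
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In both cases the probe itself can get stuck, which is why the result is observed from a different thread with a timeout. The Watchdog follows the same idea with its own dedicated thread.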
The Watchdog is started early in system_server startup, at the beginning of startBootstrapServices:
/frameworks/base/services/java/com/android/server/SystemServer.java
private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
    t.traceBegin("startBootstrapServices");

    // Start the watchdog as early as possible so we can crash the system server
    // if we deadlock during early boot
    t.traceBegin("StartWatchdog");
    // Call Watchdog's constructor to initialize
    final Watchdog watchdog = Watchdog.getInstance();
    // Call the start method to start the thread
    watchdog.start();
    t.traceEnd();

    Slog.i(TAG, "Reading configuration...");
    final String TAG_SYSTEM_CONFIG = "ReadingSystemConfig";
    // ...
1. Initialization of Watchdog
Watchdog.getInstance() creates the singleton by calling the Watchdog constructor:
/frameworks/base/services/core/java/com/android/server/Watchdog.java
public static Watchdog getInstance() {
    if (sWatchdog == null) {
        // Singleton: create the Watchdog object on first use
        sWatchdog = new Watchdog();
    }

    return sWatchdog;
}
Watchdog constructor
private Watchdog() {
    // Create a thread named "watchdog" that executes the run method
    mThread = new Thread(this::run, "watchdog");

    // Checker for the foreground thread (FgThread, also a singleton). It checks both the
    // thread's handler and the registered lock monitors. FgThread is shared with other
    // objects, such as PermissionManagerService.
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    // Add mMonitorChecker to mHandlerCheckers so its handler is checked for timeouts.
    // DEFAULT_TIMEOUT is 60 seconds; the check interval is half of that, 30 seconds.
    mHandlerCheckers.add(mMonitorChecker);
    // Monitor the main thread of the system process
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // Monitor the UI thread
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // Monitor the I/O thread
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // Monitor the display thread
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));
    // Monitor the animation thread
    mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
            "animation thread", DEFAULT_TIMEOUT));
    // Monitor the surface animation thread, which also handles animations
    mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
            "surface animation thread", DEFAULT_TIMEOUT));

    // Monitor whether a binder thread is available
    addMonitor(new BinderThreadMonitor());

    // Add the system process to the list of Java processes of interest
    mInterestingJavaPids.add(Process.myPid());

    // See the notes on DEFAULT_TIMEOUT.
    assert DB ||
            DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;

    mTraceErrorLogger = new TraceErrorLogger();
}
The Watchdog is initialized during SystemServer startup. Besides creating its own thread (mThread), the constructor builds several HandlerCheckers, which fall roughly into two categories:
- Monitor Checker: checks for possible deadlocks in Monitor objects. Core system services such as AMS, IMS, WMS, and PMS are all Monitor objects.
- Looper Checker: checks whether a thread's message queue has been busy for too long. The main, foreground, UI, I/O, display, and animation threads shown above are all checked this way. The message queues of other important threads, such as those of AMS and WMS, are also added to the Looper Checkers when the corresponding objects are initialized.
The HandlerChecker class and its constructor
// HandlerChecker implements Runnable; its run method is called back on the checked thread
public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
    private boolean mCompleted;
    private Monitor mCurrentMonitor;
    private long mStartTime;
    private int mPauseCount;

    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
        mHandler = handler;
        mName = name;
        // The maximum waiting time is stored in mWaitMax
        mWaitMax = waitMaxMillis;
        // mCompleted starts out as true
        mCompleted = true;
    }

    // Adds a lock monitor; called through Watchdog.addMonitor
    void addMonitorLocked(Monitor monitor) {
        // We don't want to update mMonitors when the Handler is in the middle of checking
        // all monitors. We will update mMonitors on the next schedule if it is safe
        mMonitorQueue.add(monitor);
    }
    // ...
}

// In Watchdog: the public entry point that services call to register a monitor
public void addMonitor(Monitor monitor) {
    synchronized (mLock) {
        mMonitorChecker.addMonitorLocked(monitor);
    }
}
How system services register with the Watchdog
/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java
As shown above, monitors registered through addMonitor are checked on the foreground thread (FgThread), which detects locks that are held for too long.
// AMS implements the Watchdog.Monitor interface
public class ActivityManagerService extends IActivityManager.Stub
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback, ActivityManagerGlobalLock {

    public ActivityManagerService(Context systemContext, ActivityTaskManagerService atm) {
        LockGuard.installLock(this, LockGuard.INDEX_ACTIVITY);
        mInjector = new Injector(systemContext);
        mContext = systemContext;
        // ...
        // Monitor whether the lock is held for a long time
        Watchdog.getInstance().addMonitor(this);
        // Monitor whether the AMS handler has timed out
        Watchdog.getInstance().addThread(mHandler);
        // ...
    }

    // AMS implements Watchdog.Monitor, so the Watchdog calls back this monitor method
    public void monitor() {
        synchronized (this) { }
    }
Similarly, WMS registers itself as a monitor so that the Watchdog can detect when its lock is held for too long.
/frameworks/base/services/core/java/com/android/server/wm/WindowManagerService.java
public class WindowManagerService extends IWindowManager.Stub
        implements Watchdog.Monitor, WindowManagerPolicy.WindowManagerFuncs {

    public void onInitReady() {
        initPolicy();

        // Add ourselves to the Watchdog monitors.
        Watchdog.getInstance().addMonitor(this);
        // ...

    @Override
    public void monitor() {
        // Monitor the mGlobalLock lock
        synchronized (mGlobalLock) { }
    }
2. Call the start method to start the Watchdog thread
public void start() {
    // Starts mThread, which executes this::run
    mThread.start();
}
The thread then executes the run method:
/frameworks/base/services/core/java/com/android/server/Watchdog.java
private void run() {
    boolean waitedHalf = false;

    while (true) {
        List<HandlerChecker> blockedCheckers = Collections.emptyList();
        String subject = "";
        boolean allowRestart = true;
        int debuggerWasConnected = 0;
        boolean doWaitedHalfDump = false;
        final ArrayList<Integer> pids;
        synchronized (mLock) {
            // CHECK_INTERVAL is 30 seconds (half of DEFAULT_TIMEOUT)
            long timeout = CHECK_INTERVAL;
            // Make sure we (re)spin the checkers that have become idle within
            // this wait-and-check interval
            // 2-1) Iterate over all HandlerCheckers and post a check message to each thread
            for (int i = 0; i < mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }

            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }

            // Record the start time
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    // Wait up to 30 seconds
                    mLock.wait(timeout);
                    // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                // Recompute the remaining time so that the loop exits only after a full 30 seconds
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }

            // 2-2) Compute the wait state with evaluateCheckerCompletionLocked
            final int waitState = evaluateCheckerCompletionLocked();
            if (waitState == COMPLETED) {
                waitedHalf = false;
                continue;
            } else if (waitState == WAITING) {
                continue;
            // 2-3) The check has been pending for more than 30 seconds but less than 60
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    Slog.i(TAG, "WAITED_HALF");
                    // Remember that the half-timeout dump has been requested
                    waitedHalf = true;
                    // We've waited half, but we'd need to do the stack trace dump w/o the lock.
                    pids = new ArrayList<>(mInterestingJavaPids);
                    doWaitedHalfDump = true;
                } else {
                    continue;
                }
            } else {
                // 2-4) Execute the timeout process
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
                pids = new ArrayList<>(mInterestingJavaPids);
            }
        } // END synchronized (mLock)

        if (doWaitedHalfDump) {
            // After the first 30 seconds without a response, AMS dumps the stack traces first
            ActivityManagerService.dumpStackTraces(pids, null, null,
                    getInterestingNativePids(), null, subject);
            continue;
        }

        // The following event is logged when the timeout occurs
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
        // ...
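The inner `while (timeout > 0)` loop exists because `Object.wait(timeout)` may return early, for example when another thread calls `notifyAll()` on the lock or on a spurious wakeup. A minimal stand-alone sketch of that pattern follows. `IntervalWaiter` and its methods are hypothetical names, and `System.nanoTime()` is used here in place of Android's `SystemClock.uptimeMillis()`.

```java
/** Hypothetical illustration of the wait loop; not AOSP code. */
final class IntervalWaiter {
    private final Object lock = new Object();

    /** Blocks for at least intervalMs and returns the elapsed milliseconds. */
    long waitFullInterval(long intervalMs) {
        synchronized (lock) {
            long start = System.nanoTime();
            long remaining = intervalMs;
            while (remaining > 0) {
                try {
                    lock.wait(remaining); // may return early (notify or spurious wakeup)
                } catch (InterruptedException e) {
                    // Like the Watchdog loop, ignore the interruption and keep waiting.
                }
                long elapsed = (System.nanoTime() - start) / 1_000_000;
                remaining = intervalMs - elapsed; // wait again only for what is left
            }
            return (System.nanoTime() - start) / 1_000_000;
        }
    }

    /** Wakes the waiter early; the loop then goes back to waiting for the remainder. */
    void poke() {
        synchronized (lock) {
            lock.notifyAll();
        }
    }
}
```

Recomputing the remaining time from a fixed start point, rather than restarting the full interval after each wakeup, keeps the check period close to the intended interval.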
2-1) Iterate over all HandlerCheckers and post a check message to each thread
public void scheduleCheckLocked() {
    if (mCompleted) {
        // Safe to update monitors in queue, Handler is not in the middle of work
        mMonitors.addAll(mMonitorQueue);
        mMonitorQueue.clear();
    }
    // If this checker has no monitors (that is, it is not the FgThread checker) and its
    // looper is polling, or if the checker is paused via pauseLocked, return directly.
    if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
            || (mPauseCount > 0)) {
        // Don't schedule until after resume OR
        // If the target looper has recently been polling, then
        // there is no reason to enqueue our checker on it since that
        // is as good as it not being deadlocked. This avoid having
        // to do a context switch to check the thread. Note that we
        // only do this if we have no monitors since those would need to
        // be executed at this point.
        mCompleted = true;
        return;
    }
    // mCompleted is false: a previous check is still pending
    if (!mCompleted) {
        // we already have a check in flight, so no need
        return;
    }

    mCompleted = false;
    mCurrentMonitor = null;
    // Record the time at which the check starts
    mStartTime = SystemClock.uptimeMillis();
    // Insert this checker at the front of the target thread's message queue
    mHandler.postAtFrontOfQueue(this);
}
When the target thread processes the message, the checker's run method executes:
@Override
public void run() {
    final int size = mMonitors.size();
    // For the FgThread checker, call back each monitor() to check whether its lock is held too long
    for (int i = 0; i < size; i++) {
        synchronized (mLock) {
            mCurrentMonitor = mMonitors.get(i);
        }
        // If monitor() blocks, mCompleted is never set to true
        mCurrentMonitor.monitor();
    }

    synchronized (mLock) {
        // All monitors returned: mark the check as completed
        mCompleted = true;
        mCurrentMonitor = null;
    }
}
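The interplay between scheduling a check and completing it can be sketched in plain Java. This is a simplified analogue under stated assumptions, not the AOSP class: `SimpleChecker` is a hypothetical name, an `ExecutorService` stands in for the `Handler`, and the task is queued at the tail of the executor's queue rather than at the front as `postAtFrontOfQueue` does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;

/** Hypothetical plain-Java analogue of HandlerChecker; not AOSP code. */
final class SimpleChecker implements Runnable {
    interface Monitor { void monitor(); }

    private final Object lock = new Object();
    private final ExecutorService target; // the thread being watched
    private final List<Monitor> monitors = new ArrayList<>();
    private boolean completed = true;
    private Monitor current;
    private long startNanos;

    SimpleChecker(ExecutorService target) { this.target = target; }

    void addMonitor(Monitor m) { synchronized (lock) { monitors.add(m); } }

    /** Starts a check unless one is already in flight. */
    void scheduleCheck() {
        synchronized (lock) {
            if (!completed) return; // previous check still pending
            completed = false;
            current = null;
            startNanos = System.nanoTime();
        }
        target.execute(this);
    }

    @Override public void run() {
        List<Monitor> snapshot;
        synchronized (lock) { snapshot = new ArrayList<>(monitors); }
        for (Monitor m : snapshot) {
            synchronized (lock) { current = m; }
            m.monitor(); // does not return while the monitored lock is held elsewhere
        }
        synchronized (lock) { completed = true; current = null; }
    }

    boolean isCompleted() { synchronized (lock) { return completed; } }

    /** Milliseconds the current check has been pending, or 0 if none is pending. */
    long pendingMillis() {
        synchronized (lock) {
            return completed ? 0 : (System.nanoTime() - startNanos) / 1_000_000;
        }
    }
}
```

A supervising thread would call `scheduleCheck()`, wait, and then read `isCompleted()` and `pendingMillis()`. A check that stays incomplete means either the target thread never ran the task or a monitor call is still blocked.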
2-2) Compute the wait state with evaluateCheckerCompletionLocked
- COMPLETED = 0: the check has finished;
- WAITING = 1: the check has been pending for less than half of DEFAULT_TIMEOUT, that is, less than 30s;
- WAITED_HALF = 2: the check has been pending for at least 30s but less than 60s;
- OVERDUE = 3: the check has been pending for 60s or more.
// The four possible states
private static final int COMPLETED = 0;
private static final int WAITING = 1;
private static final int WAITED_HALF = 2;
private static final int OVERDUE = 3;

private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;
    // Iterate over all HandlerCheckers and keep the worst state
    for (int i = 0; i < mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}

// HandlerChecker.getCompletionStateLocked
public int getCompletionStateLocked() {
    // mCompleted is true if (1) the handler has processed the message at the head of its
    // queue, and (2) for FgThread, none of the monitored locks is stuck
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        // Pending for less than 30 seconds: WAITING
        if (latency < mWaitMax/2) {
            return WAITING;
        // Pending for at least 30 seconds but less than 60 seconds: WAITED_HALF
        } else if (latency < mWaitMax) {
            return WAITED_HALF;
        }
    }
    return OVERDUE;
}
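To make the thresholds concrete, the state calculation can be isolated as a pure function. The sketch below is a hypothetical stand-alone version (`CompletionState`, `of`, and `worst` are illustrative names), with the completion flag and latency passed in as parameters instead of being read from fields.

```java
/** Hypothetical stand-alone version of the state calculation; not AOSP code. */
final class CompletionState {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    private CompletionState() {}

    /** State of one checker, given whether it finished and how long it has been pending. */
    static int of(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) return COMPLETED;
        if (latencyMs < waitMaxMs / 2) return WAITING;
        if (latencyMs < waitMaxMs) return WAITED_HALF;
        return OVERDUE;
    }

    /** The overall state is the worst (largest) state among all checkers. */
    static int worst(int... states) {
        int state = COMPLETED;
        for (int s : states) state = Math.max(state, s);
        return state;
    }
}
```

With a 60-second maximum, `CompletionState.of(false, 30_000, 60_000)` yields `WAITED_HALF` and `CompletionState.of(false, 60_000, 60_000)` yields `OVERDUE`. Because the overall state is a maximum, a single stuck checker is enough to move the whole Watchdog into the timeout path.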
2-3) Handling WAITED_HALF: the check has been pending for more than 30 seconds but less than 60
            // 2-3) The check has been pending for more than 30 seconds but less than 60
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    Slog.i(TAG, "WAITED_HALF");
                    // Remember that the half-timeout dump has been requested
                    waitedHalf = true;
                    // We've waited half, but we'd need to do the stack trace dump w/o the lock.
                    pids = new ArrayList<>(mInterestingJavaPids);
                    doWaitedHalfDump = true;
                } else {
                    continue;
                }
            } else {
                // Execute the timeout process
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
                pids = new ArrayList<>(mInterestingJavaPids);
            }
        } // END synchronized (mLock)

        if (doWaitedHalfDump) {
            // After the first 30 seconds without a response, AMS dumps the stack traces,
            // then the loop continues without executing the rest of the body.
            // This also dumps the processes listed in the NATIVE_STACKS_OF_INTEREST array.
            ActivityManagerService.dumpStackTraces(pids, null, null,
                    getInterestingNativePids(), null, subject);
            continue;
        }
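The waitedHalf flag acts as a latch: the early stack dump happens once per stall, not on every pass through the loop. A small model of that behavior is sketched below. `HalfWaitLatch` and `Action` are hypothetical names, and the reset after the timeout path reflects the `waitedHalf = false` at the end of the loop body, which is only reached when the process is not killed.

```java
/** Hypothetical model of the waitedHalf latch; state constants mirror the ones above. */
final class HalfWaitLatch {
    enum Action { NONE, DUMP_STACKS, HANDLE_TIMEOUT }

    static final int COMPLETED = 0, WAITING = 1, WAITED_HALF = 2, OVERDUE = 3;

    private boolean waitedHalf = false;

    /** Decides what one pass of the loop does for the given overall state. */
    Action onState(int waitState) {
        switch (waitState) {
            case COMPLETED:
                waitedHalf = false; // all checkers finished: re-arm the early dump
                return Action.NONE;
            case WAITING:
                return Action.NONE;
            case WAITED_HALF:
                if (waitedHalf) return Action.NONE; // already dumped for this stall
                waitedHalf = true;
                return Action.DUMP_STACKS;
            default: // OVERDUE
                waitedHalf = false; // reset in case the process survives
                return Action.HANDLE_TIMEOUT;
        }
    }
}
```

Feeding the sequence WAITED_HALF, WAITED_HALF, COMPLETED, WAITED_HALF into `onState` produces DUMP_STACKS, NONE, NONE, DUMP_STACKS, which shows that a new stall after a recovery gets its own dump.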
2-4) Handling the timeout (OVERDUE)
            } else if (waitState == WAITED_HALF) {
                // ... (handled in 2-3)
            } else {
                // Execute the timeout process
                // First, getBlockedCheckersLocked collects the HandlerCheckers whose checks timed out
                blockedCheckers = getBlockedCheckersLocked();
                // describeCheckersLocked builds the timeout description
                subject = describeCheckersLocked(blockedCheckers);
                // Restart is allowed by default
                allowRestart = mAllowRestart;
                pids = new ArrayList<>(mInterestingJavaPids);
            }
        } // END synchronized (mLock)
        // ...
        // The following event is logged
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

        final UUID errorId;
        if (mTraceErrorLogger.isAddErrorIdEnabled()) {
            errorId = mTraceErrorLogger.generateErrorId();
            mTraceErrorLogger.addErrorIdToTrace("system_server", errorId);
        } else {
            errorId = null;
        }

        FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED, subject);

        long anrTime = SystemClock.uptimeMillis();
        StringBuilder report = new StringBuilder();
        report.append(MemoryPressureUtil.currentPsiState());
        ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
        StringWriter tracesFileException = new StringWriter();
        final File stack = ActivityManagerService.dumpStackTraces(
                pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
                tracesFileException, subject);

        // Give some extra time to make sure the stack traces get written.
        // The system's been hanging for a minute, another second or two won't hurt much.
        SystemClock.sleep(5000);

        processCpuTracker.update();
        report.append(processCpuTracker.printCurrentState(anrTime));
        report.append(tracesFileException.getBuffer());

        // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
        doSysRq('w');
        doSysRq('l');

        // Try to add the error to the dropbox, but assuming that the ActivityManager
        // itself may be deadlocked. (which has happened, causing this statement to
        // deadlock and the watchdog as a whole to be ineffective)
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                public void run() {
                    // If a watched thread hangs before init() is called, we don't have a
                    // valid mActivity. So we can't log the error to dropbox.
                    if (mActivity != null) {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null, null,
                                null, report.toString(), stack, null, null, null,
                                errorId);
                    }
                }
            };
        dropboxThread.start();
        try {
            dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
        } catch (InterruptedException ignored) {}

        IActivityController controller;
        synchronized (mLock) {
            controller = mController;
        }
        if (controller != null) {
            Slog.i(TAG, "Reporting stuck state to activity controller");
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                // 1 = keep waiting, -1 = kill system
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    Slog.i(TAG, "Activity controller requested to coninue to wait");
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }

        // Only kill the process if the debugger is not attached.
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
            Slog.w(TAG, "*** GOODBYE!");
            if (!Build.IS_USER && isCrashLoopFound()
                    && !WatchdogProperties.should_ignore_fatal_count().orElse(false)) {
                breakCrashLoop();
            }
            // Kill the system process
            Process.killProcess(Process.myPid());
            System.exit(10);
        }

        waitedHalf = false;
    }
}
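The dropbox write above runs on a separate thread and is joined with a 2-second limit, so that a deadlocked ActivityManager cannot stall the Watchdog itself. The same pattern can be sketched generically as follows. `BoundedReporter` and `runWithDeadline` are hypothetical names, and the sketch assumes `timeoutMs` is greater than zero, since `Thread.join(0)` waits indefinitely.

```java
/** Hypothetical helper showing the "run on a side thread, join with a deadline" pattern. */
final class BoundedReporter {
    private BoundedReporter() {}

    /** Runs task on its own thread; returns true if it finished within timeoutMs (> 0). */
    static boolean runWithDeadline(Runnable task, long timeoutMs) {
        Thread worker = new Thread(task, "bounded-report");
        worker.setDaemon(true); // a stuck report must not block process exit
        worker.start();
        try {
            worker.join(timeoutMs); // returns after timeoutMs even if the task is stuck
        } catch (InterruptedException ignored) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }
}
```

The caller continues whether or not the task finished. The task is abandoned rather than cancelled, which is acceptable in the Watchdog's situation because the process is about to be killed anyway.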
First, getBlockedCheckersLocked collects the HandlerCheckers whose checks have timed out:
private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
    ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
    // Iterate over all mHandlerCheckers
    for (int i = 0; i < mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        // isOverdueLocked reports whether this checker has timed out
        if (hc.isOverdueLocked()) {
            checkers.add(hc);
        }
    }
    return checkers;
}

// HandlerChecker.isOverdueLocked: checks whether the checker has timed out
boolean isOverdueLocked() {
    return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
}
Then describeCheckersLocked builds the description of the timeout:
private String describeCheckersLocked(List<HandlerChecker> checkers) {
    StringBuilder builder = new StringBuilder(128);
    for (int i = 0; i < checkers.size(); i++) {
        if (builder.length() > 0) {
            builder.append(", ");
        }
        builder.append(checkers.get(i).describeBlockedStateLocked());
    }
    return builder.toString();
}

// HandlerChecker.describeBlockedStateLocked
String describeBlockedStateLocked() {
    // If mCurrentMonitor is null, the problem is not a lock but the handler itself,
    // so report the handler
    if (mCurrentMonitor == null) {
        return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
    } else {
        // Otherwise a monitored lock is stuck; monitors only run on FgThread
        return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                + " on " + mName + " (" + getThread().getName() + ")";
    }
}
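When reading a Watchdog log, it helps to know exactly what shape the subject string takes. The sketch below reproduces the formatting in isolation. `BlockedSubject` and `Blocked` are hypothetical types, and `com.example.FakeService` in the usage note is a made-up class name used only as sample input.

```java
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical stand-alone version of the subject formatting; not AOSP code. */
final class BlockedSubject {
    private BlockedSubject() {}

    /** One blocked checker; monitorClass is null when the handler itself is stuck. */
    static final class Blocked {
        final String checkerName;
        final String threadName;
        final String monitorClass;

        Blocked(String checkerName, String threadName, String monitorClass) {
            this.checkerName = checkerName;
            this.threadName = threadName;
            this.monitorClass = monitorClass;
        }
    }

    /** Joins one description per blocked checker with ", ". */
    static String describe(List<Blocked> blocked) {
        StringJoiner joiner = new StringJoiner(", ");
        for (Blocked b : blocked) {
            String where = " on " + b.checkerName + " (" + b.threadName + ")";
            joiner.add(b.monitorClass == null
                    ? "Blocked in handler" + where
                    : "Blocked in monitor " + b.monitorClass + where);
        }
        return joiner.toString();
    }
}
```

For a stuck handler named "main thread" running on a thread named "main", `describe` returns "Blocked in handler on main thread (main)". For a monitor of class com.example.FakeService checked by "foreground thread" on thread "android.fg", it returns "Blocked in monitor com.example.FakeService on foreground thread (android.fg)". The first form points at a busy or stuck message loop, the second at a lock that could not be acquired.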
Information collected when the Watchdog detects a hang
- AMS.dumpStackTraces: writes the stack traces of the Java and native processes of interest
- doSysRq: triggers the kernel to dump all blocked threads, and backtraces on all CPUs, to the kernel log
- dropBox: records the report through addErrorToDropBox
After collecting this information, the Watchdog kills the system_server process. allowRestart defaults to true. When the am hang command is run, restart is not allowed (allowRestart = false), so the system_server process is not killed.
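The final decision about whether to kill the process depends on the debugger state and on allowRestart. Extracted as a pure function, it might look like the sketch below. `KillDecision` and `shouldKill` are hypothetical names, and the sketch leaves out the activity controller, which can also ask the Watchdog to keep waiting.

```java
/** Hypothetical stand-alone version of the kill decision; not AOSP code. */
final class KillDecision {
    private KillDecision() {}

    /**
     * Returns true only when the Watchdog would kill the process.
     * debuggerWasConnected is the countdown kept by the loop: it is set to 2 while
     * a debugger is attached and decremented once per pass afterwards.
     */
    static boolean shouldKill(boolean debuggerConnectedNow,
                              int debuggerWasConnected,
                              boolean allowRestart) {
        if (debuggerConnectedNow) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected > 0) {
            return false; // a debugger is, or recently was, attached
        }
        return allowRestart; // false when restart has been disallowed
    }
}
```

Only the combination of no debugger and allowRestart set to true leads to a kill. Every other combination just logs a warning and lets the loop run again.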