[Android source code] Android Watchdog mechanism

Android also provides a software-level Watchdog to protect important system services such as AMS, WMS, and PMS. Because these core services run in the system_server process, when one of them hangs the Watchdog usually kills system_server, which causes the Android system to restart.

Watchdog mainly checks whether the important threads and locks of the core system services are blocked. It does this in two ways:

  • Monitor several key locks in system_server. The principle is to try to acquire each lock on the foreground thread (FgThread).
  • Monitor the responsiveness of several commonly used threads. The principle is to post a task to each of these threads and check whether it runs in time.

Watchdog is started early in system_server startup, at the beginning of startBootstrapServices.

/frameworks/base/services/java/com/android/server/SystemServer.java

    private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
        t.traceBegin("startBootstrapServices");

        // Start the watchdog as early as possible so we can crash the system server
        // if we deadlock during early boot
        t.traceBegin("StartWatchdog");

        // Call Watchdog.getInstance(), which constructs the singleton on first use
        final Watchdog watchdog = Watchdog.getInstance();

        // Call the start method to start the watchdog thread
        watchdog.start();
        t.traceEnd();

        Slog.i(TAG, "Reading configuration...");
        final String TAG_SYSTEM_CONFIG = "ReadingSystemConfig";

1. Initialization of WatchDog

Watchdog.getInstance() creates the singleton by calling the private constructor:

/frameworks/base/services/core/java/com/android/server/Watchdog.java

 public static Watchdog getInstance() {
        if (sWatchdog == null) {
// Singleton mode, create Watchdog object
            sWatchdog = new Watchdog();
        }

        return sWatchdog;
    }

Watchdog constructor

 private Watchdog() {
        // Create a thread named "watchdog" that executes the run method
        mThread = new Thread(this::run, "watchdog");

        // The checker for the foreground thread FgThread (also a singleton, shared with other
        // components such as PermissionManagerService). It checks both the monitored locks
        // and the FgThread handler.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);

        // Add mMonitorChecker to mHandlerCheckers so its handler is checked for timeouts.
        // The timeout is DEFAULT_TIMEOUT (60 seconds); the half-way mark is 30 seconds.
        mHandlerCheckers.add(mMonitorChecker);

        // Monitor the main thread of the system process
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));

        // Monitor the UI thread
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // Monitor the I/O thread
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));

        // Monitor the DisplayThread
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));

        // Monitor the AnimationThread
        mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
                "animation thread", DEFAULT_TIMEOUT));

        // Monitor the SurfaceAnimationThread, which also handles animations
        mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
                "surface animation thread", DEFAULT_TIMEOUT));

        // Monitor whether a binder thread is available
        addMonitor(new BinderThreadMonitor());

        // Add the system process to the list of Java pids of interest
        mInterestingJavaPids.add(Process.myPid());

        // See the notes on DEFAULT_TIMEOUT.
        assert DB ||
                DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;

        mTraceErrorLogger = new TraceErrorLogger();
    }

Watchdog is initialized during SystemServer startup. In addition to creating the thread mThread, the constructor builds a number of HandlerCheckers, which fall roughly into two categories:

  1. Monitor Checker: checks for possible deadlocks on Monitor objects. Core system services such as AMS, IMS, WMS, and PMS are all Monitor objects.
  2. Looper Checker: checks whether a thread's message queue has been blocked for a long time. The global message queues created in the constructor (foreground, main, ui, i/o, display, and the animation threads) are all checked. The message queues of some other important threads, such as those of AMS and WMS, are added as well when the corresponding objects are initialized.
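
The Looper Checker idea can be illustrated outside Android with plain Java. The following is a minimal sketch, not AOSP code: it assumes a single-thread ExecutorService stands in for a Handler and its Looper, and the names LooperCheckSketch, isResponsive, and demo are hypothetical. A marker task is queued on the worker, and the worker counts as responsive only if the marker runs within the timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration: checks whether a worker thread still drains its task queue. */
final class LooperCheckSketch {

    /** Queues a marker task and reports whether the worker ran it within timeoutMillis. */
    static boolean isResponsive(ExecutorService worker, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        worker.execute(done::countDown);
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Checks an idle worker first, then the same worker while it is stuck in a long task. */
    static boolean[] demo() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            boolean idle = isResponsive(worker, 1000);

            // Simulate a message that never finishes until it is released.
            CountDownLatch release = new CountDownLatch(1);
            worker.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException ignored) {
                    // The worker is being shut down.
                }
            });
            boolean stuck = isResponsive(worker, 200);
            release.countDown();
            return new boolean[] { idle, stuck };
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("idle worker responsive: " + result[0]);
        System.out.println("stuck worker responsive: " + result[1]);
    }
}
```

Unlike this sketch, the real HandlerChecker posts itself at the front of the queue and does not block while waiting; the Watchdog thread polls the result on its own schedule.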

The HandlerChecker class and its constructor

// HandlerChecker implements Runnable; its run method is called back on the checked handler's thread
    public final class HandlerChecker implements Runnable {
        private final Handler mHandler;
        private final String mName;
        private final long mWaitMax;
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
        private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
        private boolean mCompleted;
        private Monitor mCurrentMonitor;
        private long mStartTime;
        private int mPauseCount;

        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            // Store the maximum waiting time in mWaitMax
            mWaitMax = waitMaxMillis;
            // mCompleted starts as true, meaning no check is in flight
            mCompleted = true;
        }

        // Queues a lock monitor on this checker; callers reach it through Watchdog.addMonitor
        void addMonitorLocked(Monitor monitor) {
            // We don't want to update mMonitors when the Handler is in the middle of checking
            // all monitors. We will update mMonitors on the next schedule if it is safe
            mMonitorQueue.add(monitor);
        }

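    // Watchdog.addMonitor (in the outer class) is the public entry point; it queues the
    // monitor on mMonitorChecker, the FgThread checker
    public void addMonitor(Monitor monitor) {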
        synchronized (mLock) {
            mMonitorChecker.addMonitorLocked(monitor);
        }
    }

How system services are monitored by Watchdog

/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java

As analyzed above, monitors registered through addMonitor are checked on the foreground thread (FgThread), which tries to acquire the monitored lock. AMS registers both its lock and its handler.

// AMS implements the Watchdog.Monitor interface
public class ActivityManagerService extends IActivityManager.Stub
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback, ActivityManagerGlobalLock {

    public ActivityManagerService(Context systemContext, ActivityTaskManagerService atm) {
        LockGuard.installLock(this, LockGuard.INDEX_ACTIVITY);
        mInjector = new Injector(systemContext);
        mContext = systemContext;

. . . .
        // Monitor whether the AMS lock is held for too long
        Watchdog.getInstance().addMonitor(this);
        // Monitor whether the AMS handler thread is blocked
        Watchdog.getInstance().addThread(mHandler);

    // AMS implements Watchdog.Monitor, so Watchdog calls back this monitor method;
    // it blocks for as long as another thread holds the AMS lock
    public void monitor() {
        synchronized (this) { }
    }
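
The empty synchronized block in monitor() is the whole lock check: the call returns only once the lock can be acquired. The following is a minimal plain-Java sketch of that idea, not AOSP code. The names MonitorCheckSketch, lockIsFree, and demo are hypothetical, and a daemon probe thread stands in for FgThread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration: detects a lock that is held for too long. */
final class MonitorCheckSketch {

    /** Mirrors the Monitor idea: returns only after the lock has been acquired and released. */
    static void monitor(Object lock) {
        synchronized (lock) { }
    }

    /** Returns true if monitor(lock) finished on a probe thread within timeoutMillis. */
    static boolean lockIsFree(Object lock, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread probe = new Thread(() -> {
            monitor(lock);
            done.countDown();
        }, "probe");
        probe.setDaemon(true);
        probe.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Probes the lock once while it is free and once while the caller holds it. */
    static boolean[] demo() {
        Object lock = new Object();
        boolean whenFree = lockIsFree(lock, 1000);
        boolean whenHeld;
        synchronized (lock) {
            // Simulate a service that keeps its lock for longer than the timeout.
            whenHeld = lockIsFree(lock, 200);
        }
        return new boolean[] { whenFree, whenHeld };
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("probe succeeded while lock was free: " + result[0]);
        System.out.println("probe succeeded while lock was held: " + result[1]);
    }
}
```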

Similarly, WMS registers itself so that Watchdog can detect when its lock is held for too long.

/frameworks/base/services/core/java/com/android/server/wm/WindowManagerService.java

public class WindowManagerService extends IWindowManager.Stub
        implements Watchdog.Monitor, WindowManagerPolicy.WindowManagerFuncs {

    public void onInitReady() {
        initPolicy();

        // Add ourselves to the Watchdog monitors.
        Watchdog.getInstance().addMonitor(this);

=========
    @Override
    public void monitor() {
        // Monitor the mGlobalLock lock
        synchronized (mGlobalLock) { }
    }

2. Call the start method to start WatchDog thread monitoring

 public void start() {
        // Start mThread, which executes this::run
        mThread.start();
    }

The thread then executes the run method:

/frameworks/base/services/core/java/com/android/server/Watchdog.java

 private void run() {
        boolean waitedHalf = false;
        while (true) {
            List<HandlerChecker> blockedCheckers = Collections.emptyList();
            String subject = "";
            boolean allowRestart = true;
            int debuggerWasConnected = 0;
            boolean doWaitedHalfDump = false;
            final ArrayList<Integer> pids;
            synchronized (mLock) {

                // CHECK_INTERVAL is half of DEFAULT_TIMEOUT, that is, 30 seconds
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval

                // 2-1) Traverse all HandlerCheckers and post a check to each handler
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }

                // Record the start time
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        // Wait for the remaining time (30 seconds on the first pass)
                        mLock.wait(timeout);
                        // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    // Recompute the remaining time so the loop only exits after the full 30 seconds
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }

                // 2-2) Compute the overall waiting state with evaluateCheckerCompletionLocked
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    continue;
                // 2-3) The check has taken more than 30 seconds but less than 60 seconds
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        Slog.i(TAG, "WAITED_HALF");
                        // Remember that the half-way dump has been triggered
                        waitedHalf = true;
                        // We've waited half, but we'd need to do the stack trace dump w/o the lock.
                        pids = new ArrayList<>(mInterestingJavaPids);
                        // Request the half-way stack dump
                        doWaitedHalfDump = true;
                    } else {
                        continue;
                    }
                } else {
                    // 2-4) Timeout (OVERDUE) handling
                    blockedCheckers = getBlockedCheckersLocked();
                    subject = describeCheckersLocked(blockedCheckers);
                    allowRestart = mAllowRestart;
                    pids = new ArrayList<>(mInterestingJavaPids);
                }
            } // END synchronized (mLock)

            if (doWaitedHalfDump) {
                // After 30 seconds without completion, dump the stack traces through AMS first
                ActivityManagerService.dumpStackTraces(pids, null, null,
                        getInterestingNativePids(), null, subject);
                continue;
            }

            // On timeout, the following event is written to the event log
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
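
The inner wait loop of run() recomputes the remaining time after every wakeup, so an early notify or a spurious wakeup cannot shorten the 30-second interval. The following is a minimal plain-Java sketch of that pattern, not AOSP code. The names FullIntervalWaitSketch, waitFullInterval, and demo are hypothetical, and System.nanoTime() stands in for SystemClock.uptimeMillis().

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration: waits for a full interval even when woken up early. */
final class FullIntervalWaitSketch {

    private static long elapsedMillis(long startNanos) {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }

    /** Waits on lock until at least intervalMillis have elapsed and returns the elapsed time. */
    static long waitFullInterval(Object lock, long intervalMillis) {
        final long start = System.nanoTime();
        synchronized (lock) {
            long remaining = intervalMillis;
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Like the loop in run(), this sketch keeps waiting after an interrupt.
                }
                // Recompute what is left; a notify or spurious wakeup only shortens this pass.
                remaining = intervalMillis - elapsedMillis(start);
            }
        }
        return elapsedMillis(start);
    }

    /** Wakes the waiter after roughly 50 ms and returns how long it actually waited. */
    static long demo() {
        final Object lock = new Object();
        Thread notifier = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) {
                return;
            }
            synchronized (lock) {
                lock.notifyAll();
            }
        }, "notifier");
        notifier.setDaemon(true);
        notifier.start();
        return waitFullInterval(lock, 300);
    }

    public static void main(String[] args) {
        System.out.println("waited " + demo() + " ms despite the early notify");
    }
}
```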

2-1) Traverse all HandlerCheckers and post a check message to each handler

 public void scheduleCheckLocked() {
            if (mCompleted) {
                // Safe to update monitors in queue, Handler is not in the middle of work
                mMonitors.addAll(mMonitorQueue);
                mMonitorQueue.clear();
            }

            // Return immediately if this checker has no monitors (that is, it is not the FgThread
            // checker) and its looper is polling, or if the checker has been paused via pauseLocked.
            if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
                    || (mPauseCount > 0)) {
                // Don't schedule until after resume OR
                // If the target looper has recently been polling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked. This avoid having
                // to do a context switch to check the thread. Note that we
                // only do this if we have no monitors since those would need to
                // be executed at this point.
                mCompleted = true;
                return;
            }
            // mCompleted is false, which means a check is already in flight
            if (!mCompleted) {
                // we already have a check in flight, so no need
                return;
            }

            mCompleted = false;
            mCurrentMonitor = null;
            // Record the time at which this check starts
            mStartTime = SystemClock.uptimeMillis();
            // Post this checker (a Runnable) at the front of the handler's message queue
            mHandler.postAtFrontOfQueue(this);
        }

When the handler processes the message, the checker's run method is executed:

 @Override
        public void run() {

            final int size = mMonitors.size();
            // Only the FgThread checker has monitors. For each one, call back monitor() to
            // check whether its lock is being held for too long.
            for (int i = 0 ; i < size ; i++) {
                synchronized (mLock) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                // If monitor() blocks here, mCompleted is never set to true
                mCurrentMonitor.monitor();
            }

            synchronized (mLock) {
                // All monitors returned, so mark this check as completed
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }
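
One consequence of scheduleCheckLocked is easy to miss: while a check is still in flight, the method returns early and mStartTime is not reset, so the measured latency keeps growing across 30-second rounds. The following is a minimal plain-Java sketch of that bookkeeping, not AOSP code. The class InFlightCheckSketch and its methods are hypothetical, and timestamps are passed in explicitly so the behaviour is deterministic.

```java
/** Hypothetical illustration of the mCompleted / mStartTime bookkeeping. */
final class InFlightCheckSketch {
    private boolean completed = true;   // plays the role of mCompleted
    private long startTime;             // plays the role of mStartTime

    /** Returns true if a new check was started, false if one is still in flight. */
    synchronized boolean schedule(long nowMillis) {
        if (!completed) {
            // A check is already pending: keep the old start time so latency accumulates.
            return false;
        }
        completed = false;
        startTime = nowMillis;
        return true;
    }

    /** Called when the checked thread has finally run the posted check. */
    synchronized void markCompleted() {
        completed = true;
    }

    /** Returns how long the current check has been pending, or 0 if none is pending. */
    synchronized long latency(long nowMillis) {
        return completed ? 0 : nowMillis - startTime;
    }

    public static void main(String[] args) {
        InFlightCheckSketch checker = new InFlightCheckSketch();
        checker.schedule(0);          // first round posts the check
        checker.schedule(30_000);     // second round: still pending, start time unchanged
        System.out.println("latency at 60s: " + checker.latency(60_000) + " ms");
    }
}
```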

2-2) Calculate the waiting state: evaluateCheckerCompletionLocked

  • COMPLETED = 0: the check has completed;
  • WAITING = 1: the wait is shorter than half of DEFAULT_TIMEOUT, that is, less than 30s;
  • WAITED_HALF = 2: the wait is between 30s and 60s;
  • OVERDUE = 3: the wait is 60s or longer.

// The four states
    private static final int COMPLETED = 0;
    private static final int WAITING = 1;
    private static final int WAITED_HALF = 2;
    private static final int OVERDUE = 3;

    private int evaluateCheckerCompletionLocked() {
        int state = COMPLETED;
        // Traverse all HandlerCheckers and keep the worst (largest) state
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            state = Math.max(state, hc.getCompletionStateLocked());
        }
        return state;
    }

========
// The getCompletionStateLocked method of HandlerChecker
        public int getCompletionStateLocked() {

            // mCompleted is true once the handler has run the check posted at the front of its
            // queue; for the FgThread checker, every monitored lock must also have been acquired
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                // Less than half of mWaitMax (30 seconds) has elapsed: WAITING
                if (latency < mWaitMax/2) {
                    return WAITING;
                // Between 30 and 60 seconds have elapsed: WAITED_HALF
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }
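
The state logic above can be written as two small pure functions, which makes the thresholds easy to verify. The following is a minimal plain-Java sketch, not AOSP code. The names CompletionStateSketch, stateFor, and worstState are hypothetical; the constants mirror the four states listed above.

```java
/** Hypothetical illustration of the per-checker state and the overall (worst) state. */
final class CompletionStateSketch {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Mirrors getCompletionStateLocked for a single checker. */
    static int stateFor(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** Mirrors evaluateCheckerCompletionLocked: the overall state is the maximum of all states. */
    static int worstState(int... states) {
        int worst = COMPLETED;
        for (int state : states) {
            worst = Math.max(worst, state);
        }
        return worst;
    }

    public static void main(String[] args) {
        long waitMax = 60_000;
        int ui = stateFor(true, 0, waitMax);          // finished its check
        int fg = stateFor(false, 45_000, waitMax);    // pending for 45 seconds
        System.out.println("overall state: " + worstState(ui, fg));
    }
}
```

Because the states are ordered by severity, a single slow checker is enough to move the whole Watchdog into WAITED_HALF or OVERDUE.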

2-3) WAITED_HALF: the check has taken more than 30 seconds but less than 60 seconds

                // 2-3) The check has taken more than 30 seconds but less than 60 seconds
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        Slog.i(TAG, "WAITED_HALF");
                        // Remember that the half-way dump has been triggered
                        waitedHalf = true;
                        // We've waited half, but we'd need to do the stack trace dump w/o the lock.
                        pids = new ArrayList<>(mInterestingJavaPids);
                        // Request the half-way stack dump
                        doWaitedHalfDump = true;
                    } else {
                        continue;
                    }
                } else {
                    // Timeout (OVERDUE) handling
                    blockedCheckers = getBlockedCheckersLocked();
                    subject = describeCheckersLocked(blockedCheckers);
                    allowRestart = mAllowRestart;
                    pids = new ArrayList<>(mInterestingJavaPids);
                }
            } // END synchronized (mLock)

            if (doWaitedHalfDump) {
                // After 30 seconds without completion, dump the stack traces through AMS first,
                // then continue with the next round without going any further.
                // This also dumps the native processes listed in NATIVE_STACKS_OF_INTEREST.
                ActivityManagerService.dumpStackTraces(pids, null, null,
                        getInterestingNativePids(), null, subject);
                continue;
            }

2-4) Timeout handling

                } else if (waitState == WAITED_HALF) {
                    // . . . . (WAITED_HALF handling, see 2-3)
                } else {
                    // Timeout (OVERDUE) handling
                    // First, getBlockedCheckersLocked collects the HandlerCheckers that are overdue
                    blockedCheckers = getBlockedCheckersLocked();

                    // describeCheckersLocked builds the description of what is blocked
                    subject = describeCheckersLocked(blockedCheckers);
                    // Restart is allowed by default
                    allowRestart = mAllowRestart;
                    pids = new ArrayList<>(mInterestingJavaPids);
                }
            } // END synchronized (mLock)

. . . .
            // Write the watchdog event to the event log
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            final UUID errorId;
            if (mTraceErrorLogger.isAddErrorIdEnabled()) {
                errorId = mTraceErrorLogger.generateErrorId();
                mTraceErrorLogger.addErrorIdToTrace("system_server", errorId);
            } else {
                errorId = null;
            }


            FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED, subject);

            long anrTime = SystemClock.uptimeMillis();
            StringBuilder report = new StringBuilder();
            report.append(MemoryPressureUtil.currentPsiState());
            ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
            StringWriter tracesFileException = new StringWriter();
            final File stack = ActivityManagerService.dumpStackTraces(
                    pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
                    tracesFileException, subject);

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            SystemClock.sleep(5000);

            processCpuTracker.update();
            report.append(processCpuTracker.printCurrentState(anrTime));
            report.append(tracesFileException.getBuffer());

            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');

            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked. (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        // If a watched thread hangs before init() is called, we don't have a
                        // valid mActivity. So we can't log the error to dropbox.
                        if (mActivity != null) {
                            mActivity.addErrorToDropBox(
                                    "watchdog", null, "system_server", null, null, null,
                                    null, report.toString(), stack, null, null, null,
                                    errorId);
                        }
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000); // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}

            IActivityController controller;
            synchronized (mLock) {
                controller = mController;
            }
            if (controller != null) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
                Slog.w(TAG, "*** GOODBYE!");
                if (!Build.IS_USER && isCrashLoopFound()
                        && !WatchdogProperties.should_ignore_fatal_count().orElse(false)) {
                    breakCrashLoop();
                }

// Kill system process
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }

// First, getBlockedCheckersLocked collects the HandlerCheckers whose checks are overdue.

 private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
        ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
        // Traverse all mHandlerCheckers again
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);

            // isOverdueLocked (below) reports whether this checker has timed out
            if (hc.isOverdueLocked()) {
                checkers.add(hc);
            }
        }
        return checkers;
    }

==========
// A checker is overdue if its check has not completed and mWaitMax has elapsed since it started
        boolean isOverdueLocked() {
            return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
        }

// Then, describeCheckersLocked builds the description of the blocked checkers.

 private String describeCheckersLocked(List<HandlerChecker> checkers) {
        StringBuilder builder = new StringBuilder(128);
        for (int i=0; i<checkers.size(); i++) {
            if (builder.length() > 0) {
                builder.append(", ");
            }
            builder.append(checkers.get(i).describeBlockedStateLocked());
        }
        return builder.toString();
    }

=======
        String describeBlockedStateLocked() {
            // If mCurrentMonitor is null, the problem is not a lock but the handler itself
            // being blocked, so describe the handler.
            if (mCurrentMonitor == null) {
                return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
            } else {

                // Otherwise a monitored lock is blocked; this can only happen on the FgThread checker.
                return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                         + " on " + mName + " (" + getThread().getName() + ")";
            }
        }
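
The subject string that ends up in the logs is just these per-checker descriptions joined with ", ". The following is a minimal plain-Java sketch that reproduces the two message formats from describeBlockedStateLocked, not AOSP code. The class BlockedSubjectSketch is hypothetical, and the thread names and monitor class name used in main are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of how the watchdog subject string is assembled. */
final class BlockedSubjectSketch {

    /** Describes one blocked checker; monitorClass is null when the handler itself is blocked. */
    static String describe(String checkerName, String threadName, String monitorClass) {
        if (monitorClass == null) {
            return "Blocked in handler on " + checkerName + " (" + threadName + ")";
        }
        return "Blocked in monitor " + monitorClass
                + " on " + checkerName + " (" + threadName + ")";
    }

    /** Joins the descriptions of all blocked checkers, as describeCheckersLocked does. */
    static String subject(List<String> descriptions) {
        return String.join(", ", descriptions);
    }

    public static void main(String[] args) {
        // The thread names and the monitor class below are assumptions for illustration.
        String handlerBlocked = describe("ui thread", "android.ui", null);
        String monitorBlocked = describe("foreground thread", "android.fg",
                "com.android.server.am.ActivityManagerService");
        System.out.println(subject(Arrays.asList(monitorBlocked, handlerBlocked)));
    }
}
```

When reading a watchdog report, "Blocked in monitor" therefore points at a held lock, while "Blocked in handler" points at a thread whose message queue is stuck.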

Information collected when Watchdog detects an abnormality

  • AMS.dumpStackTraces: outputs the stack traces of the Java and native processes of interest
  • doSysRq: triggers the kernel to dump all blocked threads and the backtraces on all CPUs to the kernel log
  • DropBox: the report is added to DropBox through addErrorToDropBox

After collecting this information, Watchdog kills the system_server process. allowRestart defaults to true. When the am hang command is used, restart is not allowed (allowRestart = false), and the system_server process is not killed.
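
The final branch of run() that decides whether to kill system_server depends on only two values, debuggerWasConnected and allowRestart. The following is a minimal plain-Java sketch of that decision, not AOSP code. The class KillDecisionSketch, the Outcome enum, and the decide method are hypothetical names, and the activity controller check that precedes this branch is left out.

```java
/** Hypothetical illustration of the last decision in the watchdog loop. */
final class KillDecisionSketch {

    enum Outcome {
        SKIP_DEBUGGER_CONNECTED,
        SKIP_DEBUGGER_WAS_CONNECTED,
        SKIP_RESTART_NOT_ALLOWED,
        KILL_SYSTEM_SERVER
    }

    /** Evaluates the conditions in the same order as the if/else chain in run(). */
    static Outcome decide(int debuggerWasConnected, boolean allowRestart) {
        if (debuggerWasConnected >= 2) {
            return Outcome.SKIP_DEBUGGER_CONNECTED;
        } else if (debuggerWasConnected > 0) {
            return Outcome.SKIP_DEBUGGER_WAS_CONNECTED;
        } else if (!allowRestart) {
            return Outcome.SKIP_RESTART_NOT_ALLOWED;
        }
        return Outcome.KILL_SYSTEM_SERVER;
    }

    public static void main(String[] args) {
        System.out.println("normal timeout: " + decide(0, true));
        System.out.println("restart not allowed: " + decide(0, false));
        System.out.println("debugger attached: " + decide(2, true));
    }
}
```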