Android Heating Monitoring in Practice

1. Background

Now that mobile devices are everywhere, most people have experienced some degree of battery anxiety and the unpleasant feeling of a phone running hot. Heating is a long-lived indicator that shows up in many scenarios and is influenced by several parties at once: the application layer on the device, the phone manufacturer's ROM, and the external environment. Measuring heating scenarios effectively, capturing the scene when heating occurs, and attributing the heating problem are the three major challenges of heating monitoring at the application layer. This article describes some of the monitoring practices already in place on the Android side at Dewu. Rather than getting bogged down in power consumption calculation, it focuses first on the heating scenario itself. I hope it offers a useful reference.

2. Definition of heating

Temperature is the most intuitive indicator of a heating problem. On Android we currently treat a device temperature above 37 °C as the dividing line and define one heating interval for every 3 °C above it. The intervals are capped at 49 °C, which gives five levels: 37-40, 40-43, 43-46, 46-49, and 49+.
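The interval rule above can be sketched as a small mapping function. This is only an illustration: the class name HeatLevels, the level numbering (0 for "not heating", 1 to 5 for the five intervals), and the choice to treat each lower bound as inclusive are assumptions, not details from the production code.

```java
/** Hypothetical helper that maps a temperature in degrees Celsius to a heating level. */
final class HeatLevels {
    static final float BASE_CELSIUS = 37f; // dividing line for heating
    static final float STEP_CELSIUS = 3f;  // width of each interval
    static final int MAX_LEVEL = 5;        // 49 and above

    /** Returns 0 below 37, otherwise 1..5 for 37-40, 40-43, 43-46, 46-49, 49+. */
    static int levelOf(float celsius) {
        if (celsius < BASE_CELSIUS) {
            return 0;
        }
        // Assumption: the lower bound of each interval is inclusive.
        int level = (int) ((celsius - BASE_CELSIUS) / STEP_CELSIUS) + 1;
        return Math.min(level, MAX_LEVEL);
    }
}
```

For example, HeatLevels.levelOf(41.5f) falls into the 40-43 interval and therefore yields level 2.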

Phone temperature and CPU usage serve as the first and second factors for deciding whether the user is in a heating scenario; the other parameters are collected to help reconstruct the heating scene.

The specific indicators are as follows:

Mobile phone temperature, CPU usage, GPU usage;

Thread stack;

System service usage frequency;

App foreground/background state, screen on/off duration;

Battery level and charging status;

Thermal mitigation status level;

System model and version;

….

3. Indicator acquisition

Temperature

Battery temperature

The system BatteryManager already provides a set of built-in interfaces and a sticky broadcast for obtaining battery information.

The sticky broadcast carries BatteryManager.EXTRA_TEMPERATURE, whose value is in tenths of a degree Celsius, that is, 10 times the temperature in degrees Celsius.

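import android.content.Context
import android.content.Intent
import android.content.IntentFilter
import android.os.BatteryManager

// Read the battery temperature from BatteryManager.EXTRA_TEMPERATURE.
// The raw value is in tenths of a degree Celsius, so divide by 10.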
fun getBatteryTempImmediately(context: Context): Float {
    return try {
        val batIntent = getBatteryStickyIntent(context) ?: return 0f
        batIntent.getIntExtra(BatteryManager.EXTRA_TEMPERATURE, 0) / 10F
    } catch (e: Exception) {
        0f
    }
}

private fun getBatteryStickyIntent(context: Context): Intent? {
    return try {
        context.registerReceiver(null, IntentFilter(Intent.ACTION_BATTERY_CHANGED))
    } catch (e: Exception) {
        null
    }
}

Besides the battery temperature carried in the system broadcast, BatteryManager also exposes additional information such as battery level and charging status, all defined in its source code.

The following are some worth noting:
// BATTERY_PROPERTY_CHARGE_COUNTER remaining battery capacity, in microampere-hours
// BATTERY_PROPERTY_CURRENT_NOW instantaneous battery current, in microamperes
// BATTERY_PROPERTY_CURRENT_AVERAGE average battery current, in microamperes
// BATTERY_PROPERTY_CAPACITY remaining battery capacity, as an integer percentage
// BATTERY_PROPERTY_ENERGY_COUNTER remaining energy, in nanowatt-hours
// EXTRA_BATTERY_LOW whether the battery is currently considered low
// EXTRA_HEALTH integer constant describing the battery health
// EXTRA_LEVEL current battery level, from 0 to EXTRA_SCALE
// EXTRA_VOLTAGE current battery voltage
// ACTION_CHARGING broadcast when the device enters the charging state
// ACTION_DISCHARGING broadcast when the device enters the discharging state

Sensor temperature

Android is an open-source operating system built on the Linux kernel. As on Linux, the /sys/class/thermal/ directory of the phone contains thermal_zoneX entries, each representing the temperature zone of a sensor, and cooling_deviceX entries, each representing a cooling device such as a fan or heat sink.

Taking OnePlus 9 as an example, there are a total of 105 temperature sensors or temperature partitions, and 48 cooling devices.

Each thermal zone directory records the parameters of that zone. We focus on the type file and the temp file, which hold the name of the sensor device and its current temperature respectively. Taking thermal_zone29 as an example, the zone represents the fifth processing unit of the first CPU core, and its temperature is 33.2 degrees Celsius. On a given device the name of each zone is fixed, so we can scan the thermal_zone directories and take the first sensor whose type name contains "CPU" as the CPU temperature.
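A minimal sketch of that scan follows. The class ThermalZones and its methods are hypothetical names, and the code assumes the temp file reports millidegrees Celsius (for example 33200 for 33.2 degrees), which is common but not guaranteed on every device. On recent Android versions some zones may also be unreadable to ordinary apps, so unreadable zones are simply skipped.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Hypothetical reader for /sys/class/thermal style directories. */
final class ThermalZones {
    /** Converts raw temp file content, assumed to be millidegrees Celsius, to degrees. */
    static float parseMilliCelsius(String raw) {
        return Integer.parseInt(raw.trim()) / 1000f;
    }

    /** True when the zone type name suggests a CPU sensor (naming varies by vendor). */
    static boolean looksLikeCpu(String type) {
        return type.trim().toLowerCase(Locale.ROOT).contains("cpu");
    }

    /** Numeric suffix of a thermal_zoneX directory, used to scan zones in order. */
    static int zoneIndex(Path zone) {
        String name = zone.getFileName().toString();
        try {
            return Integer.parseInt(name.substring("thermal_zone".length()));
        } catch (RuntimeException e) {
            return Integer.MAX_VALUE;
        }
    }

    private static String read(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    /** Returns the temperature of the first readable CPU zone, or NaN if none is found. */
    static float readCpuTemperature(Path thermalRoot) {
        List<Path> zones = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(thermalRoot, "thermal_zone*")) {
            for (Path zone : stream) {
                zones.add(zone);
            }
        } catch (IOException e) {
            return Float.NaN;
        }
        zones.sort(Comparator.comparingInt(ThermalZones::zoneIndex));
        for (Path zone : zones) {
            try {
                if (looksLikeCpu(read(zone.resolve("type")))) {
                    return parseMilliCelsius(read(zone.resolve("temp")));
                }
            } catch (IOException | NumberFormatException ignored) {
                // Unreadable or malformed zone: try the next one.
            }
        }
        return Float.NaN;
    }
}
```

On a device this would be called as ThermalZones.readCpuTemperature(Paths.get("/sys/class/thermal")); taking the root directory as a parameter keeps the logic testable off-device.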

Case temperature

With Android 10, Google officially introduced a thermal mitigation framework. It monitors the underlying hardware sensors (mainly USB and skin sensors) through Thermal HAL 2.0 and reports changes in the thermal status of the USB port and the device case. The system PowerManager exposes a callback for thermal status changes and defines 7 thermal status levels, which developers can obtain either actively or passively.

final PowerManager powerManager = (PowerManager) mContext.getSystemService(Context.POWER_SERVICE);
powerManager.addThermalStatusListener(new PowerManager.OnThermalStatusChangedListener() {
    @Override
    public void onThermalStatusChanged(int status) {
       //Return the corresponding thermal state
    }
});
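For logging or reporting, the integer passed to the callback can be turned into a readable name. The sketch below mirrors the seven PowerManager.THERMAL_STATUS_* values (0 to 6) with plain integers so that it runs without the Android SDK; the class name ThermalStatusNames is an assumption introduced for illustration.

```java
/** Hypothetical mapper from a thermal status integer to a readable name. */
final class ThermalStatusNames {
    /** The cases mirror PowerManager.THERMAL_STATUS_NONE (0) through THERMAL_STATUS_SHUTDOWN (6). */
    static String nameOf(int status) {
        switch (status) {
            case 0: return "NONE";
            case 1: return "LIGHT";
            case 2: return "MODERATE";
            case 3: return "SEVERE";
            case 4: return "CRITICAL";
            case 5: return "EMERGENCY";
            case 6: return "SHUTDOWN";
            default: return "UNKNOWN(" + status + ")";
        }
    }
}
```

In real Android code, comparing against the PowerManager constants directly is preferable to hard-coded integers.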

Among these signals, the case temperature undoubtedly reflects the phone's heating most directly. The Android system does provide an AIDL interface through which a thermal event listener can be registered and Temperature objects obtained. However, it is marked as a hidden API and is not available to regular applications. Taking Android version compatibility into account, we read it by reflecting and proxying ThermalManagerService.

Contrary to expectations, however, domestic manufacturers have not fully adapted to the official thermal mitigation framework, and the thermal status callback is often not accurate enough. To obtain the case temperature directly, each manufacturer's thermal mitigation SDK has to be integrated separately; the specific APIs are described in the integration documents that each manufacturer provides on request.

CPU usage

CPU usage is collected by reading and parsing the stat files under /proc.

/proc/[pid]/stat and /proc/[pid]/task/[tid]/stat record the CPU information of a process and of each thread in that process, respectively. The individual fields are not described here; for details, see: https://man7.org/linux/man-pages/man5/procfs.5.html.

We focus on fields 14 and 15 (utime and stime), which hold the time the process or thread has spent in user mode and in kernel mode respectively.

We parse the stat file of the current process and the stat files of all threads under its task directory. For each thread, the difference in utime + stime between two consecutive samples, divided by the sampling interval (currently 1s), is taken as the thread's CPU usage. That is, thread CPU usage = ((utime + stime) - (lastutime + laststime)) / period, where utime and stime are counted in clock ticks.
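The parsing and the formula can be sketched as follows. The class ProcStat is a hypothetical name. The number of clock ticks per second is passed in as a parameter because it is system dependent; it is commonly 100, but that is an assumption to verify on the target device (for example through sysconf(_SC_CLK_TCK)).

```java
/** Hypothetical parser for /proc/[pid]/stat and /proc/[pid]/task/[tid]/stat lines. */
final class ProcStat {
    /** Returns utime + stime (fields 14 and 15) in clock ticks. */
    static long cpuTicks(String statLine) {
        // Field 2 (comm) is wrapped in parentheses and may itself contain spaces,
        // so start splitting only after the last ')'.
        int end = statLine.lastIndexOf(')');
        String[] rest = statLine.substring(end + 1).trim().split("\\s+");
        // rest[0] is field 3 (state), so field 14 is rest[11] and field 15 is rest[12].
        return Long.parseLong(rest[11]) + Long.parseLong(rest[12]);
    }

    /** CPU usage in percent of one core between two samples taken periodSeconds apart. */
    static double usagePercent(long lastTicks, long currentTicks,
                               double periodSeconds, long ticksPerSecond) {
        return (currentTicks - lastTicks) * 100.0 / (ticksPerSecond * periodSeconds);
    }
}
```

With a 1s period and 100 ticks per second, a thread whose utime + stime grows by 50 ticks between samples is using about 50% of one core.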

GPU usage

For devices with Qualcomm chips, we can read /sys/class/kgsl/kgsl-3d0/gpubusy and refer to the instructions on Qualcomm's official website.

GPU usage = value 1 / value 2 * 100, where value 1 and value 2 are the two values in that file. This has been verified to be basically consistent with the value collected by Snapdragon Profiler.

For devices with MediaTek chips, we can directly read the usage value under /d/ged/hal/gpu_utilization.

Similarly, by sampling at a fixed interval (once per second), we obtain the current GPU usage for each second.
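As an illustration of the Qualcomm calculation, the sketch below parses a string holding the two values and returns a percentage. The class name GpuBusy is hypothetical, and the assumption that the file content is two whitespace-separated integers (value 1 followed by value 2) should be checked on the target device.

```java
/** Hypothetical parser for the content of a gpubusy style file. */
final class GpuBusy {
    /** Returns value 1 / value 2 * 100, or 0 when the content is idle or malformed. */
    static double usagePercent(String content) {
        String[] parts = content.trim().split("\\s+");
        if (parts.length < 2) {
            return 0.0;
        }
        try {
            long busy = Long.parseLong(parts[0]);
            long total = Long.parseLong(parts[1]);
            // An idle GPU can report "0 0"; avoid dividing by zero.
            return total <= 0 ? 0.0 : Math.min(100.0, busy * 100.0 / total);
        } catch (NumberFormatException e) {
            return 0.0;
        }
    }
}
```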

System service usage

Android system services include WakeLock, Alarm, Sensor, Wifi, Net, Location, Bluetooth, Camera, etc.

Our approach differs little from the conventional monitoring methods in the industry: hook the system ServiceManager to monitor the Binder communication of system services, match the names of the methods being called, and record them through callbacks in the monitoring middle layer.

Developers familiar with Android know that Zygote is one of the first processes started when the Android system boots. When Zygote forks, it spawns SystemServer, the process that hosts system services. Its core run method registers and starts a large number of system services, which are then managed through ServiceManager.

Therefore, taking LocationManager as an example, we can monitor it by reflecting and proxying ServiceManager, intercepting the relevant LocationManager methods and recording the data we are interested in.

//Get the Class object of ServiceManager
Class<?> serviceManagerClass = Class.forName("android.os.ServiceManager");
// Get getService method
Method getServiceMethod = serviceManagerClass.getDeclaredMethod("getService", String.class);
// Call the getService method through reflection to obtain the original IBinder object
IBinder originalBinder = (IBinder) getServiceMethod.invoke(null, "location");
//Create a proxy object Proxy
Class<?> iLocationManagerStubClass = Class.forName("android.location.ILocationManager$Stub");
Method asInterfaceMethod = iLocationManagerStubClass.getDeclaredMethod("asInterface", IBinder.class);
final Object originalLocationManager = asInterfaceMethod.invoke(null, originalBinder);
Object proxyLocationManager = Proxy.newProxyInstance(context.getClassLoader(),
        new Class[]{Class.forName("android.location.ILocationManager")},
        new InvocationHandler() {
            @Override
            public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                // Interception and processing of methods are carried out here
                Log.d("LocationManagerProxy", "Intercepted method: " + method.getName());
                //Execute original method
                return method.invoke(originalLocationManager, args);
            }
        });
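// ServiceManager.getService(String) takes a single argument, so it cannot install the proxy.
// A common approach (an assumption here, and subject to hidden API restrictions) is to wrap
// the original IBinder so that queryLocalInterface() returns the proxy, then put the
// wrapper into ServiceManager's private sCache map.
IBinder proxyBinder = (IBinder) Proxy.newProxyInstance(context.getClassLoader(),
        new Class[]{IBinder.class},
        (proxy, method, args) -> "queryLocalInterface".equals(method.getName())
                ? proxyLocationManager
                : method.invoke(originalBinder, args));
Field cacheField = serviceManagerClass.getDeclaredField("sCache");
cacheField.setAccessible(true);
@SuppressWarnings("unchecked")
Map<String, IBinder> cache = (Map<String, IBinder>) cacheField.get(null);
cache.put("location", proxyBinder);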

In the same way, within a fixed sampling period we record how many times each system service is requested, the intervals between calls, and so on.

The power_profile file in the Android source defines the current drawn by each system component in each state.

Once the working time of each component in each state has been recorded, the components can be ranked by their heating contribution using the following calculation:

Component power consumption (heating contribution) ≈ current * running time * voltage (generally a fixed value that can be ignored)
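A minimal sketch of the ranking, with voltage treated as constant so that the contribution reduces to current * running time. The class PowerRanking, its fields, and the units (milliamperes and milliseconds, giving mAh) are assumptions chosen for illustration; real current values would come from the device's power profile.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical ranking of components by estimated charge drawn. */
final class PowerRanking {
    static final class ComponentUsage {
        final String name;
        final double currentMa; // current drawn in the active state, in mA
        final long activeMs;    // recorded working time, in ms

        ComponentUsage(String name, double currentMa, long activeMs) {
            this.name = name;
            this.currentMa = currentMa;
            this.activeMs = activeMs;
        }
    }

    /** Estimated charge in mAh: current (mA) * time (h). Voltage is assumed constant. */
    static double chargeMah(ComponentUsage usage) {
        return usage.currentMa * usage.activeMs / 3_600_000.0;
    }

    /** Returns a copy of the input sorted from largest to smallest contribution. */
    static List<ComponentUsage> rank(List<ComponentUsage> usages) {
        List<ComponentUsage> sorted = new ArrayList<>(usages);
        sorted.sort(Comparator.comparingDouble((ComponentUsage u) -> chargeMah(u)).reversed());
        return sorted;
    }
}
```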

Thread stack

Heating is a cumulative problem. Unlike a crash, there is no single thread that we can identify as the trigger at the moment it occurs. Dumping and recording the stacks of all threads is clearly unreasonable when 200+ sub-threads are running. The question becomes: how do we find the thread stacks of the hot code more precisely?

As mentioned above, computing CPU usage already requires reading the stat files of all threads in the process, so we have the CPU usage of every sub-thread. We sort the threads by usage in descending order and keep those that exceed the threshold (currently 50%) or rank in the top N. Because frequent stack collection degrades performance, we sacrifice some sampling precision and accuracy: stacks are collected, for the duration issued by the remote configuration, only after indicators such as temperature and CPU usage exceed their thresholds.
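The selection step can be sketched as a sort followed by a filter. The class HotThreadSelector and its ThreadUsage type are hypothetical names, and the way the two criteria combine (keep a thread if it ranks in the top N or exceeds the threshold) is one reading of the description above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical selector for the threads whose stacks are worth collecting. */
final class HotThreadSelector {
    static final class ThreadUsage {
        final int tid;
        final String name;
        final double cpuPercent;

        ThreadUsage(int tid, String name, double cpuPercent) {
            this.tid = tid;
            this.name = name;
            this.cpuPercent = cpuPercent;
        }
    }

    /** Sorts by usage in descending order and keeps the top N plus any thread above the threshold. */
    static List<ThreadUsage> select(List<ThreadUsage> samples, double thresholdPercent, int topN) {
        List<ThreadUsage> sorted = new ArrayList<>(samples);
        sorted.sort(Comparator.comparingDouble((ThreadUsage t) -> t.cpuPercent).reversed());
        List<ThreadUsage> hot = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            ThreadUsage usage = sorted.get(i);
            if (i < topN || usage.cpuPercent > thresholdPercent) {
                hot.add(usage);
            }
        }
        return hot;
    }
}
```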

One concept needs clarifying. The file name under the task directory is the thread's kernel identifier (tid), whereas Thread.id is the thread ID assigned by Java.

The two are not equivalent, but the native layer gives us a way to establish the mapping between them.

In ART's thread.cc, the Java Thread object is converted into the C++ Thread object, and ShortDump is called to print the thread's information. By matching the tid= part of that string we obtain the thread's tid.

The core code logic is as follows:

// Get the latest CPU sampling data in the queue
val threadCpuUsageData = cpuProfileStoreQueue.last().threadUsageDataList
val hotStacks = mutableListOf<HotStack>()
if (threadCpuUsageData != null) {
    val dataCount = minOf(threadCpuUsageData.size, TOP_THREAD_COUNT)
    val traces: MutableMap<Thread, Array<StackTraceElement>> = Thread.getAllStackTraces()
    // Map from kernel tid to the Java Thread object
    val tidMap: MutableMap<String, Thread> = mutableMapOf()
    traces.keys.forEach { thread ->
        // Call the native method to obtain the tid information
        val tidInfo = hotMonitorListener?.findTidInfoByThread(thread)
        tidInfo?.let {
            val tid = findTidByTidInfo(it)
            if (tid.isNotEmpty()) {
                tidMap[tid] = thread
            }
        }
    }
    // Collect the heating stacks of the top N threads
    for (index in 0 until dataCount) {
        val singleThreadData = threadCpuUsageData[index]
        // The main thread's tid equals the process id
        val isMainThread = singleThreadData.pid == singleThreadData.tid
        val thread = tidMap[singleThreadData.tid.toString()] ?: continue
        val stackTrace = traces[thread] ?: continue
        if (stackTrace.isEmpty()) continue
        // Join the stack frames of the thread, one per line
        val sb = StringBuilder()
        for (element in stackTrace) {
            sb.append(element.toString()).append("\n")
        }
        sb.append("\n")
        // Assemble the HotStack
        val hotStack = HotStack(
            singleThreadData.pid, // process id
            singleThreadData.tid,
            singleThreadData.name,
            singleThreadData.cpuUseRate,
            sb.toString(),
            thread.state,
            isMainThread
        )
        // Log.d("HotMonitor", sb.toString())
        hotStacks.add(hotStack)
    }
}
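The findTidByTidInfo helper called in the code above is not shown in the article. A minimal sketch of what such a helper might look like, assuming the ShortDump string contains a segment of the form tid=<number>, is given here in Java; the class name TidParser and the sample string in the usage note are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical extractor for the kernel tid inside a ShortDump style string. */
final class TidParser {
    private static final Pattern TID = Pattern.compile("tid=(\\d+)");

    /** Returns the digits following the first "tid=", or an empty string when absent. */
    static String findTid(String shortDump) {
        Matcher matcher = TID.matcher(shortDump);
        return matcher.find() ? matcher.group(1) : "";
    }
}
```

For a string such as Thread[2,tid=12345,Native,...] this returns "12345", which matches the directory name under /proc/[pid]/task.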

4. Monitoring plan

Once we understand how the core indicator data is obtained, the core idea of the monitoring solution is straightforward. The remote APM configuration center issues sampling configurations such as sampling thresholds, sampling periods, and per-module data switches; a Handler on a sub-thread periodically sends messages to collect and assemble the data of each module, and the data is reported at the appropriate time. The detailed breakdown and analysis of the data is then handled by the heating platform.

Overall module architecture

Reporting timing

Core Collection Process

Online and offline distinction

CPU collection and stack collection across all sub-threads do cost performance: reading 200+ threads takes about 200ms in total, and the sampling sub-thread itself uses about 10% CPU. For the sake of the online user experience, high-frequency sampling cannot be fully enabled online.

Therefore, in the overall plan, the offline scenario focuses on discovering, troubleshooting, and fixing all problems; it reports all logs and uses CPU and GPU usage as the primary measurement indicators.

The online scenario focuses on observing the overall heating trend and analyzing potential problem scenarios; it reports only core logs and uses battery temperature as the primary measurement indicator.

Heating platform

With the support of colleagues on the platform side, the heating scene data is consumed by the platform, the core heating stacks are aggregated after passing through the Android stack deobfuscation service, and basic fields such as charging status, main thread CPU usage, problem type, and battery temperature are filled in. The platform thus supports a process-based workflow for discovering, analyzing, and resolving heating problems.

The platform displays the stack and heating information as follows:

Battery temperature and CPU usage are the most intuitive indicators of a heating scenario at runtime. In the first phase we focus on managing heating scenarios and do not continue with in-depth analysis of power consumption scenarios such as component hooks. The current analysis therefore uses battery temperature and CPU usage as the first and second indicators to build four quadrants of heating problems, giving priority to scenarios with both high temperature and high CPU usage.
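The four quadrants can be sketched as a simple classification over the two indicators. The class HeatQuadrant, the enum names, and the thresholds passed by the caller are all hypothetical; the article does not state which cut-off values the platform uses.

```java
/** Hypothetical four-quadrant classification by battery temperature and CPU usage. */
final class HeatQuadrant {
    enum Quadrant {
        HIGH_TEMP_HIGH_CPU, // handled first
        HIGH_TEMP_LOW_CPU,
        LOW_TEMP_HIGH_CPU,
        LOW_TEMP_LOW_CPU
    }

    /** Thresholds are supplied by the caller; values at or above a threshold count as high. */
    static Quadrant classify(float batteryCelsius, double cpuPercent,
                             float tempThreshold, double cpuThreshold) {
        boolean hot = batteryCelsius >= tempThreshold;
        boolean busy = cpuPercent >= cpuThreshold;
        if (hot) {
            return busy ? Quadrant.HIGH_TEMP_HIGH_CPU : Quadrant.HIGH_TEMP_LOW_CPU;
        }
        return busy ? Quadrant.LOW_TEMP_HIGH_CPU : Quadrant.LOW_TEMP_LOW_CPU;
    }
}
```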

During data analysis, we found that troubleshooting was not efficient enough and that problems were not identified accurately enough.

How do we determine that a high-temperature scene occurs inside the app and that the temperature rises significantly during use? We filter out the scenes where the temperature is already high at launch or already high when the app returns from the background, and focus on scenes where the temperature rises inside the app.
Even after online sampling, 60,000+ records are still reported in a single day. How do we pick out the most important data? The current approach is to define a temperature span and give priority to cases with a larger temperature span inside the app.
Some reported stacks belong to threads blocked in wait and similar calls. Such threads accumulate kernel-mode time slices but do not actually consume CPU, so they are false positives. We therefore supplement each record with the thread's running state and the State recorded in the proc file, so that high-temperature, high-CPU problems on RUNNABLE threads are handled first.
A phone heats up gradually, so how do we attribute a temperature rise accurately to a page? We increase the temperature sampling frequency and aggregate instantaneous data such as CPU usage and real-time stacks as supporting evidence. Given the data volume, however, we are still exploring a more reasonable way to aggregate and trim the reported data, striving for a balance between the two.
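The first two points, filtering out sessions that were already hot and ranking by temperature span, can be illustrated with a small sketch. The definition of span used here (highest sample minus the first sample taken inside the app) and the names TemperatureSpan, span, and isInAppRise are assumptions; the article does not give the exact definition.

```java
/** Hypothetical helper for the in-app temperature span of one session. */
final class TemperatureSpan {
    /** Highest sample minus the first sample; 0 when there are fewer than two samples. */
    static float span(float[] samplesCelsius) {
        if (samplesCelsius.length < 2) {
            return 0f;
        }
        float max = samplesCelsius[0];
        for (float sample : samplesCelsius) {
            max = Math.max(max, sample);
        }
        return max - samplesCelsius[0];
    }

    /** True when the session started below hotThreshold and rose by at least minSpan. */
    static boolean isInAppRise(float[] samplesCelsius, float hotThreshold, float minSpan) {
        if (samplesCelsius.length == 0 || samplesCelsius[0] >= hotThreshold) {
            return false; // already hot at launch or when returning to the foreground
        }
        return span(samplesCelsius) >= minSpan;
    }
}
```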

5. Results

Since Android heating monitoring went live, and with the support of the platform side, a number of problems have been discovered, and together with the development teams we have optimized the corresponding scenarios, for example:

Time-consuming tasks on standalone threads moved to unified thread pool scheduling;

Detection and repair of animations that run in an infinite loop;

Optimization of file read and write strategies in high IO scenarios;

Lock granularity optimization for highly concurrent tasks;

More efficient serialization in scenarios with frequent JSON parsing, such as the log library;

Tiered capture parameters for components with excessive system power consumption, such as the system camera;

Lower frame rates and timely resource reclamation in WebGL-based game scenes to optimize runtime memory;

….

This work has undoubtedly accumulated valuable experience for technology selection and implementation in future experience work, in line with our high standards for App experience.

6. Future Outlook

Phone heating is a gradual experience problem that involves multiple factors, including phone hardware, system services, software usage, and the external environment. For on-device troubleshooting, the current priority is unreasonable usage at the application layer. Troubleshooting toolchain enhancement, business attribution of problems, dynamic strategy degradation in low-battery and low-power modes, and automated diagnostic reports all leave plenty worth digging into, such as:

Monitoring/Tool Enhancements

In-app floating analysis tool (CPU/GPU/frequency/temperature/power consumption and other information)
Borrow from tools such as Battery Historian, Snapdragon Profiler, and Systrace to enhance the capabilities of the self-developed TeslaLab.

Business Attribution

Automatic assignment of heating stacks
Call tracing and finer-grained attribution

Scenario strategies and degradation

CPU tuning, dynamic frame rate, resolution downgrade
Exploration of an in-app low power mode

Automated diagnostic reporting

Automated analysis for a single targeted user, producing a diagnostic report

7. Summary

This is only a rough introduction to the preliminary work done so far on heating management, along with ideas for future work on heating and power consumption. I hope the App can deliver a better experience and give users a stronger sense of yearning for better things.

*Text/GavinX

This article is original to Dewu Technology. For more exciting articles, please see: Dewu Technology official website

Reprinting without the permission of Dewu Technology is strictly prohibited, otherwise legal liability will be pursued according to law!