ps vs. top: two different ways of calculating CPU usage

How to calculate CPU utilization rate?

Simply put, the CPU usage of a process is the share of time the CPU spends running that process. On Linux, the running time of a process is counted in jiffies[1]. Dividing the jiffies count by HZ (the number of ticks per second) gives the CPU time consumed by the process in seconds, and dividing that by the total elapsed time gives the CPU usage of the process: jiffies / HZ / total_time.
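As a minimal sketch of this arithmetic in Go: the tick rate is assumed to be 100 per second (the usual value user space sees on Linux; `getconf CLK_TCK` reports the real one), and the function names and sample numbers are made up for illustration.

```go
package main

import "fmt"

// clockTicksPerSecond is an assumption: 100 is the usual tick rate exposed
// to user space on Linux. Check `getconf CLK_TCK` before relying on it.
const clockTicksPerSecond = 100.0

// cpuSeconds converts a tick count into seconds of CPU time.
func cpuSeconds(ticks uint64) float64 {
	return float64(ticks) / clockTicksPerSecond
}

// cpuUsagePercent returns the CPU time as a percentage of the elapsed
// wall-clock time. Values above 100 mean more than one core was busy.
func cpuUsagePercent(ticks uint64, elapsedSeconds float64) float64 {
	if elapsedSeconds <= 0 {
		return 0
	}
	return cpuSeconds(ticks) / elapsedSeconds * 100
}

func main() {
	// Hypothetical process: 600 ticks of CPU time over 3 seconds of wall time.
	fmt.Printf("%.1f s of CPU time, %.0f%% usage\n", cpuSeconds(600), cpuUsagePercent(600, 3))
}
```

A multi-threaded process can accumulate more CPU time than wall-clock time, which is why both tools can report values well above 100%.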

Differences between ps and top

ps and top are the two most commonly used tools for viewing CPU usage, and both can quickly locate processes that are currently using a lot of CPU. However, the two tools calculate the figure in completely different ways.

We use the following simple Go program to test the difference between the two tools:

package main

import (
    "bytes"
    "fmt"
    "strconv"
    "sync"
    "time"
)

var testData = []byte(`testdata`)

// testBuffer burns CPU for a short while by filling 100 buffers
// with 1024 copies of testData each.
func testBuffer(idx int) {
    m := map[string]*bytes.Buffer{}
    for i := 0; i < 100; i++ {
        buf, ok := m[strconv.Itoa(i)]
        if !ok {
            buf = new(bytes.Buffer)
        }
        for j := 0; j < 1024; j++ {
            buf.Write(testData)
        }
        m[strconv.Itoa(i)] = buf
    }
    fmt.Println("done, ", idx)
    wg.Done()
}

var wg sync.WaitGroup

func main() {
    // Start 10 goroutines so that several cores are busy at once.
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go testBuffer(i)
    }
    wg.Wait()
    // Stay alive but idle so ps and top can be compared after the burst.
    fmt.Println("sleeping")
    time.Sleep(time.Hour)
}

Then we run this program and check the CPU usage of the process through top and ps aux respectively.

top -n 1:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
39753 infini    20   0 14.663g 0.014t   1200 S 611.1 22.2   0:23.53 test-cpu

ps aux:

USER       PID %CPU %MEM      VSZ      RSS TTY      STAT START   TIME COMMAND
infini   39881  767 39.1 26505284 25791892 pts/16   Sl+  07:04   0:38 ./test-cpu

As shown, ps and top report similar CPU usage while the program is busy (the two samples were not taken at exactly the same moment, so the values differ slightly). The difference appears after testBuffer finishes: the CPU usage reported by top drops to nearly 0, but ps still reports a very high value:

USER       PID %CPU %MEM      VSZ      RSS TTY      STAT START   TIME COMMAND
infini   39881 82.3 42.4 28638148 27953532 pts/16   Sl+  07:04   0:40 ./test-cpu

Why are ps and top statistics different?

The difference comes from the way the two tools run: top has to run for a period of time, whereas ps returns immediately. This is visible when running top -n 1 and ps aux: top returns after a delay, while ps returns at once. These two operating modes determine the algorithm each tool can use.

At the beginning of the article, we mentioned that Linux counts CPU time in jiffies. For efficiency, Linux keeps only the cumulative total and does not record historical data. Because ps takes a single snapshot, it cannot compute real-time CPU usage: that requires (current_cpu_time - last_cpu_time) / time_duration, and with only one sample time_duration is 0, so the rate cannot be calculated. Instead, ps reports the average CPU usage over the entire lifetime of the process[2]:

(total_cpu_time / total_process_uptime)

For the short burst of CPU usage in the test program, ps gives a roughly accurate average at the beginning. After the CPU usage drops back, however, the value reported by ps does not fall immediately; it decreases slowly as the process uptime total_process_uptime grows.
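The slow decay can be illustrated with a small Go sketch of the lifetime average. The function name and the numbers (40 s of CPU time burned during the first 5 s, then idle) are hypothetical and are not taken from the test run above.

```go
package main

import "fmt"

// lifetimeCPUPercent mirrors the ps-style average described above:
// total CPU time divided by the time since the process started.
func lifetimeCPUPercent(totalCPUSeconds, uptimeSeconds float64) float64 {
	if uptimeSeconds <= 0 {
		return 0
	}
	return totalCPUSeconds / uptimeSeconds * 100
}

func main() {
	// Hypothetical process: burns 40 s of CPU in its first 5 s, then idles,
	// so the numerator stays fixed while the denominator keeps growing.
	const totalCPU = 40.0
	for _, uptime := range []float64{5, 50, 500, 5000} {
		fmt.Printf("uptime %5.0f s -> %6.1f%%\n", uptime, lifetimeCPUPercent(totalCPU, uptime))
	}
}
```

Because the numerator never shrinks, the reported value only approaches 0 as the uptime becomes large compared with the burst.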

The top command works differently: it runs continuously and refreshes its CPU usage statistics on every iteration. The -n 1 parameter tells top to exit after one iteration, and top uses the delay of that iteration to compute the CPU usage within it:

(current_cpu_time - last_cpu_time) / iteration_duration
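The interval formula can be sketched in Go as follows. This is an illustration under stated assumptions, not top's actual implementation: the sample type, the tick rate of 100, and the readings are all hypothetical.

```go
package main

import "fmt"

// ticksPerSecond is an assumed tick rate; see `getconf CLK_TCK`.
const ticksPerSecond = 100.0

// sample is one reading of a process's cumulative CPU time.
type sample struct {
	atSeconds float64 // wall-clock time of the reading, in seconds
	cpuTicks  uint64  // cumulative CPU time (user + kernel), in ticks
}

// intervalCPUPercent computes
// (current_cpu_time - last_cpu_time) / iteration_duration as a percentage.
func intervalCPUPercent(prev, cur sample) float64 {
	elapsed := cur.atSeconds - prev.atSeconds
	if elapsed <= 0 || cur.cpuTicks < prev.cpuTicks {
		return 0 // no usable interval, or the counter went backwards
	}
	used := float64(cur.cpuTicks-prev.cpuTicks) / ticksPerSecond
	return used / elapsed * 100
}

func main() {
	busy := intervalCPUPercent(sample{0, 1000}, sample{3, 2800})
	idle := intervalCPUPercent(sample{3, 2800}, sample{6, 2800})
	fmt.Printf("busy interval: %.0f%%, idle interval: %.0f%%\n", busy, idle)
}
```

Unlike the lifetime average, the result depends only on what happened between the two readings, so an idle interval reports 0 regardless of earlier activity.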

How to continuously monitor CPU utilization rate?

Generally speaking, a monitoring system has two separate components: collection and statistics. The collection component only gathers metric values, and the statistics are computed by the database or dashboard. For CPU monitoring, ps fits the behavior of a collection component well, since each collection returns the “current” CPU usage. However, because of the way ps calculates that value, what we actually collect is the lifetime average CPU usage, which does not reflect the real-time state of the process.

Taking INFINI Console as an example, we run a short-lived data migration task and then check the CPU usage monitoring of the corresponding INFINI gateway instance (payload.instance.system.cpu, which uses ps to obtain the current CPU usage). The CPU usage rises along a curve and drops only slowly after the task ends:

To continuously monitor real-time CPU usage, we need to borrow top's approach: collect the raw cumulative CPU time of the process, and then calculate the CPU usage when aggregating the data.

On Linux, both ps and top calculate CPU usage from the information provided by /proc/[PID]/stat[2]:

#   Name    Description
14  utime   CPU time spent in user code, measured in jiffies
15  stime   CPU time spent in kernel code, measured in jiffies
16  cutime  CPU time spent in user code by waited-for children, measured in jiffies
17  cstime  CPU time spent in kernel code by waited-for children, measured in jiffies
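As a hedged sketch of how a collector might read these four fields in Go: the parser below works on the text of one stat line. The type and function names are hypothetical, and the sample line is made up and truncated after field 22.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// procTimes holds fields 14-17 of /proc/[PID]/stat, in clock ticks.
type procTimes struct {
	Utime, Stime, Cutime, Cstime int64
}

// parseStatTimes extracts the CPU time fields from one stat line.
// Field 2 (comm) is wrapped in parentheses and may itself contain spaces
// or parentheses, so splitting starts after the last ')'.
func parseStatTimes(line string) (procTimes, error) {
	end := strings.LastIndexByte(line, ')')
	if end < 0 {
		return procTimes{}, fmt.Errorf("malformed stat line: comm field not found")
	}
	// rest[0] is field 3 (state), so field N is at rest[N-3].
	rest := strings.Fields(line[end+1:])
	if len(rest) < 15 {
		return procTimes{}, fmt.Errorf("malformed stat line: only %d fields after comm", len(rest))
	}
	var v [4]int64
	for i := range v {
		n, err := strconv.ParseInt(rest[11+i], 10, 64)
		if err != nil {
			return procTimes{}, fmt.Errorf("field %d: %w", 14+i, err)
		}
		v[i] = n
	}
	return procTimes{Utime: v[0], Stime: v[1], Cutime: v[2], Cstime: v[3]}, nil
}

func main() {
	// Hypothetical stat line; note the space inside the comm field.
	line := "39881 (test cpu) S 1 39881 39881 34832 39881 4194304 100 0 0 0 3000 800 5 2 20 0 11 0 123456"
	t, err := parseStatTimes(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("utime=%d stime=%d cutime=%d cstime=%d\n", t.Utime, t.Stime, t.Cutime, t.Cstime)
}
```

On a Linux host, the same function could be fed the contents of /proc/[PID]/stat read with os.ReadFile. Splitting naively on spaces from the start of the line would shift every field for a process whose name contains a space.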

After obtaining the process information of each sampling time, we can use this formula to calculate the CPU usage during the sampling period:

delta(cpu_time) / delta(timestamp)
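Applied to a series of collected samples, the formula amounts to a discrete derivative. The Go sketch below assumes each sample carries a timestamp and a cumulative CPU time in milliseconds (for example user time plus system time); the type, field names, and data are hypothetical.

```go
package main

import "fmt"

// point is one collected sample. Both fields are in milliseconds.
type point struct {
	timestampMs int64
	cpuMs       int64 // cumulative CPU time, user + system
}

// usageSeries returns delta(cpu_time) / delta(timestamp) as a percentage
// for each pair of consecutive samples. A pair is skipped when time does
// not advance or the counter goes backwards (for example after a restart).
func usageSeries(points []point) []float64 {
	var out []float64
	for i := 1; i < len(points); i++ {
		dt := points[i].timestampMs - points[i-1].timestampMs
		dc := points[i].cpuMs - points[i-1].cpuMs
		if dt <= 0 || dc < 0 {
			continue
		}
		out = append(out, float64(dc)/float64(dt)*100)
	}
	return out
}

func main() {
	// Hypothetical samples every 10 s: one busy interval, then two quiet ones.
	points := []point{{0, 0}, {10000, 60000}, {20000, 61000}, {30000, 61000}}
	fmt.Println(usageSeries(points))
}
```

Storing the raw cumulative counters and differentiating at query time keeps the collector stateless, which is the division of labor between collection and statistics described earlier.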

In INFINI Console, we can apply the derivative function to payload.instance.system.user_in_ms and payload.instance.system.sys_in_ms with respect to timestamp to obtain accurate CPU usage statistics.

New utilization algorithm configuration in Console

In this way, we can count the real-time CPU usage of the gateway before and after running the task load:

New utilization algorithm

Summary

Although both top and ps report CPU usage, their algorithms are completely different. Once we understand the principles behind the two algorithms, we can design data collection and aggregation methods suited to a monitoring system and obtain accurate CPU usage.

References

[1] Jiffies
[2] Top and ps not showing the same cpu result