Learn runqlat in 5 minutes

Content preview

The first article in the learn X in 5 series. This article mainly introduces the process delay statistics method and rawtracepoint.

runqlat

“Why does the application freeze under high load scenarios”, “Why is process A not scheduled?”. When we have such questions in our work and life, the scheduling delay of the target process is a good starting point for observation. runqlat can help us complete this statistics. The following is the parent-child thread communication through pipe, and the delay information of both parties can be obtained through runqlat.bt statistics:

We did some tricks to make the running conditions of thread 524785 even worse. From the above figure, we can see that there is a significant difference in the scheduling delays of both parties. runqlat.bt was used to collect the data. It is bcc/tools/runqlat rewritten using bpftrace. For bpftrace syntax, refer to the previous article:

$ wget -qO - https://raw.githubusercontent.com/lilstaz/perf-tool-examples/main/bpftrace/runqlat.bt
BEGIN
{
    if (!$1) // 1
    {
        printf("Specify the pid of the task first\\
");
        exit();
    }
    @pid = $1;
    printf("Tracing latency of task %d. Hit Ctrl-C to end.\\
", @pid);
}

rt:sched_wakeup, // 2
rt:sched_wakeup_new
{
    $wakee = (struct task_struct*)arg0;
    if ($wakee->tgid == @pid) // 3
    {
        @qt[$wakee->pid] = nsecs;
    }
}

rt:sched_switch // 4
{
    $prev = (struct task_struct*)arg1;
    $next = (struct task_struct*)arg2;

    if ($prev->tgid == @pid & amp; & amp; $prev->state == TASK_RUNNING) // 5
    {
        @qt[$prev->pid] = nsecs;
    }

    if ($next->tgid == @pid & amp; & amp; @qt[$next->pid]) // 6
    {
        @[@pid] = hist((nsecs - @qt[$next->pid]) / 1000);
        delete(@qt[$next->pid]);
    }
}
# running result
$ wget -qO - https://raw.githubusercontent.com/lilstaz/perf-tool-examples/main/bpftrace/runqlat.bt| bpftrace - // 7
Attaching 5 probes...
Tracing latency of task 524783. Hit Ctrl-C to end.

@[524783]:
[128, 256) 2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @@@@@@@@@|

@[524785]:
[8K, 16K) 79 |@@@@@@@@@@@@@ |
[16K, 32K) 307 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @@@@@@@@@|
[32K, 64K) 28 |@@@@ |

This script implements statistics on the scheduling delays of all threads in the specified process (from joining the run queue to actual running of the process), where:

Force the user to enter the process number to be observed, and assign the process number entered by the user to the global variable @pid;
By separating mount points by ,, this code segment can be mounted on multiple mount points. Here rt is the abbreviation of rawtracepoint, Will be introduced in the next section;
The first parameter of the mount point is the PCB of the process to be awakened. This structure saves the tgid (corresponding to the process ID of the user mode), pid (corresponding to the user mode) of the process. status thread ID), comm process name and other information. When the condition $wakee->tgid == @pid is met, it means that a thread of the target process has entered the wake-up logic , will be placed on the run queue, we use the thread number as the key of the @qt hash table to save the system nanoseconds at this time;
The following code segment will be executed when the process switches;
We use $prev to mark the process that has been deprived of the right to run, and use $next to mark the process that is about to gain the right to run. In Linux, the status of the process is running and waiting in the queue. Both are TASK_RUNNING. If the $prev process status is TASK_RUNNING, it means it will re-enqueue. We need to update its enqueue time to the current time;
$next The thread finally reaches the end of the run queue and will soon have the right to run the CPU. By calculating the difference between the current time and the queue entry time, we can get the scheduling delay of the thread. By hist() function, bpftrace can help statistics of data into a histogram;
If it doesn’t work please download the latest version of bpftrace.

rawtracepoint

Tracepoints are static event sources built into the kernel. The interface is stable and includes most subsystems. It is a high-quality source of information. For example, for the scheduling subsystem, the following tracepoints are available:

$ grep sched: /sys/kernel/debug/tracing/available_events
...
sched:sched_migrate_task
sched:sched_switch
...

Tracepoints are essentially print statements in the kernel. They are printed in a fixed format and can be viewed through the /sys/kernel/debug/tracing/events///format file. Format. Take sched:sched_switch as an example:

$ cat /sys/kernel/debug/tracing/events/sched/sched_switch/format
name: sched_switch
ID: 313
format:
        field:unsigned short common_type; offset:0; size:2; signed:0;
        field:unsigned char common_flags; offset:2; size:1; signed:0;
        field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
        field:int common_pid; offset:4; size:4; signed:1;

        field:char prev_comm[16]; offset:8; size:16; signed:1;
        field:pid_t prev_pid; offset:24; size:4; signed:1;
        field:int prev_prio; offset:28; size:4; signed:1;
        field:long prev_state; offset:32; size:8; signed:1;
        field:char next_comm[16]; offset:40; size:16; signed:1; // 1
        field:pid_t next_pid; offset:56; size:4; signed:1;
        field:int next_prio; offset:60; size:4; signed:1;

print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", ...) :

Assuming that we are interested in the name of the process that is about to obtain the right to run, we can use the following statement to print the information defined at label 1:

$ bpftrace -e 't:sched:sched_switch {printf("%s\\
", args->next_comm)}'
swapper/5
pthread_pipe
pthread_pipe
...

Tracepoint is perfect, it always appears in the critical path of the code. It is very useful for understanding the kernel code or debugging. But in the following scenarios:

There are requirements for the performance of observation scripts;
Want to get richer information from the original data structure. Using rawtracepoint may be more appropriate.

So what is rawtracepoint? It can be understood as the other side of tracepoint. Rawtracepoint can all use events available in /sys/kernel/debug/tracing/available_events, but relative to tracepoint, rawtracepoint provides specific values. Directly provide the kernel data structure. That is, the ‘first-hand data’ that is closer to the kernel. Taking sched:sched_switch as an example again, the relevant code (include/trace/events/sched.h ) as follows:

TRACE_EVENT(sched_switch,
 TP_PROTO(bool preempt, struct task_struct *prev, struct task_struct *next), // 1

 TP_fast_assign(
  memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
  __entry->prev_pid = prev->pid; // 2
  __entry->prev_prio = prev->prio;
  __entry->prev_state = __trace_sched_switch_state(preempt, prev);
  memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
  __entry->next_pid = next->pid; // 3
  __entry->next_prio = next->prio;
 ),

 TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", ...)
);

According to the definition of label 1, the tracepoint accepts the three parameters preempt, prev, next passed to it by the scheduling subsystem, and prints them to the ring buffer through TP_printk(). In the first section runqlat, in order to filter the scheduling delay of a specific process, we need to obtain the tgid and pid of the two processes when the process is switched. But according to labels 2 and 3, tracepoint can only obtain the pid. In order to use directly For the three variables preempt, prev, next, we can use rawtracepoint, which is abbreviated as rt in bpftrace. Review the code in the first section:

rt:sched_switch
{
    $prev = (struct task_struct*)arg1;
    $next = (struct task_struct*)arg2;
...

Here $prev $next actually obtains the second and third parameters defined by TP_PROTO in the tracepoint. Because only the parameters are obtained, there is no need to build a context structure, so that rawtracepoint is more efficient than tracepoint.

Postscript

There is actually another version of runqlat.bt in the bpftrace git repository, but it does not provide the function of tracking a specific process. When I used it to check problems, I wrote quite convoluted and ugly logic. After knowing that there is rawtracepoint, I The tracking and observation of the scheduling subsystem is obviously convenient by several orders of magnitude.

In scenarios where tracepoint cannot satisfy you, remember to look through the code to see TP_PROTO, rawtracepoint can always surprise you.

ref

Frequently asked questions about using raw tracepoint with ebpf/libbpf programs[1]