Linux kernel hung task detection: mechanism and troubleshooting

All code analysis in this article is based on linux-5.4.18

1 hung task

The hung task mechanism in the Linux kernel checks whether any task has remained in the D state (TASK_UNINTERRUPTIBLE, uninterruptible sleep) for a long time.

If a task in the D state has not been scheduled for more than 120s (the kernel default, which can be changed), the kernel considers it a hung task and prints a warning message. If a critical task stays hung for a long time, the system may behave abnormally.

Hung tasks are commonly caused by I/O. For example, writing dirty data from the memory cache back to disk can take so long that tasks waiting on it are reported as hung.
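To see from user space whether a task is currently in the D state, one option is to look at the state field of /proc/<pid>/stat. The following is a minimal sketch, not part of the kernel code analyzed here; the helper names task_state_from_stat_line() and task_is_uninterruptible() are hypothetical. It assumes the usual "pid (comm) state ..." layout of that file and searches for the last ')' because the command name itself may contain spaces or parentheses.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: extract the state character from one line of
 * /proc/<pid>/stat. The assumed layout is "pid (comm) state ...".
 * comm may contain spaces or parentheses, so look for the last ')'.
 * Returns the state character ('D' is uninterruptible sleep),
 * or '\0' if the line does not match the assumed layout.
 */
static char task_state_from_stat_line(const char *line)
{
    const char *close;

    assert(line != NULL);

    close = strrchr(line, ')');
    if (close == NULL || close[1] != ' ' || close[2] == '\0')
        return '\0';
    return close[2];
}

/* Returns 1 if the stat line describes a task in the D state, 0 otherwise. */
static int task_is_uninterruptible(const char *line)
{
    return task_state_from_stat_line(line) == 'D';
}
```

A caller would read each /proc/<pid>/stat file into a buffer and pass the line to task_is_uninterruptible(). Seeing a task in D state once is normal; it is only a hung task candidate if it stays there past the timeout.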

1.1 Implementation Principle

Hung task detection in Linux is carried out by the kernel thread khungtaskd.

1. The khungtaskd kernel thread is created during the kernel startup phase. The relevant code is:

kernel/hung_task.c
static int __init hung_task_init(void)
{
    atomic_notifier_chain_register(&panic_notifier_list, &panic_block);

    /* Disable hung task detector on suspend */
    pm_notifier(hungtask_pm_notify, 0);

    /* Create the hung task detection kernel thread khungtaskd; its thread function is watchdog() */
    watchdog_task = kthread_run(watchdog, NULL, "khungtaskd");

    return 0;
}
subsys_initcall(hung_task_init);

2. The thread function watchdog() of khungtaskd does the following:

1) Get the hung task detection timeout and the check interval

2) Determine whether the time elapsed since the last check exceeds the check interval

3) If it does, perform the hung task check

4) After the check and any related handling, sleep on a timer until the next wake-up time

kernel/hung_task.c
static int watchdog(void *dummy)
{
    unsigned long hung_last_checked = jiffies;

    set_user_nice(current, 0);

    for ( ; ; ) {
        /*
         * Get the hung task detection timeout and the check interval.
         * sysctl_hung_task_timeout_secs is initialized from the kernel build option
         * CONFIG_DEFAULT_HUNG_TASK_TIMEOUT, which defaults to 120s.
         * User space can view or change it through the sysctl parameter
         * kernel.hung_task_timeout_secs or the corresponding /proc/sys file.
         * sysctl_hung_task_check_interval_secs has no explicit initializer,
         * so as a global variable it is zero-initialized.
         */
        unsigned long timeout = sysctl_hung_task_timeout_secs;
        unsigned long interval = sysctl_hung_task_check_interval_secs;
        long t;

        if (interval == 0)
            interval = timeout;
        /*
         * interval becomes the smaller of interval and timeout.
         * Because interval defaults to 0, it ends up equal to timeout, 120s by default.
         * If the user changes interval or timeout, the smaller of the two values is used.
         */
        interval = min_t(unsigned long, interval, timeout);
        /*
         * hung_timeout_jiffies() checks whether the value passed in is 0.
         * If timeout is 0 (so interval is also 0), t is MAX_SCHEDULE_TIMEOUT (i.e. LONG_MAX),
         * the check below never runs, and hung task detection is effectively disabled.
         * Otherwise t = last check time + interval - current time (in jiffies),
         * which tells whether interval (120s by default) has elapsed since the last check.
         */
        t = hung_timeout_jiffies(hung_last_checked, interval);
        /* t <= 0 means that at least interval has elapsed since the last check */
        if (t <= 0) {
            if (!atomic_xchg(&reset_hung_task, 0) &&
                !hung_detector_suspended)
                /* Run the hung task check */
                check_hung_uninterruptible_tasks(timeout);
            /* Record the time of this check */
            hung_last_checked = jiffies;
            continue;
        }
        /*
         * Set the current task to TASK_INTERRUPTIBLE and start a timer of duration t,
         * then call schedule() to give up the CPU. The task leaves the run queue
         * and is woken up when the timer expires, similar to a sleep function.
         */
        schedule_timeout_interruptible(t);
    }

    return 0;
}
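The interval and wake-up arithmetic in watchdog() can be tried out in user space. The sketch below follows the description in the comments above (t = last check time + interval - current time, and MAX_SCHEDULE_TIMEOUT when the value is 0). The names effective_interval() and remaining_jiffies() are hypothetical, the current time is passed in as a parameter instead of reading jiffies, and HZ = 250 is only an assumed tick rate for illustration.

```c
#include <assert.h>
#include <limits.h>

#define HZ 250UL                      /* assumed tick rate, illustration only */
#define MAX_SCHEDULE_TIMEOUT LONG_MAX /* per the comment in watchdog() */

/*
 * Effective check interval in seconds: an interval of 0 means
 * "use the timeout"; otherwise the smaller of the two values wins.
 */
static unsigned long effective_interval(unsigned long interval,
                                        unsigned long timeout)
{
    if (interval == 0)
        interval = timeout;
    return interval < timeout ? interval : timeout;
}

/*
 * Jiffies left until the next check is due. A result <= 0 means the
 * check should run now. An interval of 0 disables detection, modeled
 * here by returning MAX_SCHEDULE_TIMEOUT.
 */
static long remaining_jiffies(unsigned long last_checked, unsigned long now,
                              unsigned long interval_secs)
{
    assert(interval_secs <= (unsigned long)LONG_MAX / HZ);

    if (interval_secs == 0)
        return MAX_SCHEDULE_TIMEOUT;
    /* Unsigned arithmetic wraps, so this stays correct across counter overflow */
    return (long)(last_checked - now + interval_secs * HZ);
}
```

With the defaults (interval 0, timeout 120), effective_interval() yields 120, so the thread sleeps for 120 * HZ jiffies between checks. Setting the timeout to 0 makes remaining_jiffies() return MAX_SCHEDULE_TIMEOUT, which is why the check never runs in that case.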

3. Hung task detection function check_hung_uninterruptible_tasks(timeout)

check_hung_uninterruptible_tasks() traverses the tasks in the system. For each task whose state is TASK_UNINTERRUPTIBLE, it calls check_hung_task() to decide whether that task is hung.
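The traversal can be modeled in user space as follows. This is a simplified sketch, not the kernel function: struct model_task and count_tasks_to_check() are hypothetical, locking is left out, and the state constants are assumed to mirror include/linux/sched.h. It also assumes two details that should be verified against the 5.4 source: the walk stops after kernel.hung_task_check_count tasks, and the state is compared with == so that TASK_KILLABLE tasks are skipped.

```c
#include <assert.h>
#include <stddef.h>

/* State bits, assumed to mirror include/linux/sched.h */
#define TASK_RUNNING          0x0000
#define TASK_INTERRUPTIBLE    0x0001
#define TASK_UNINTERRUPTIBLE  0x0002
#define TASK_WAKEKILL         0x0100
#define TASK_KILLABLE         (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)

/* Hypothetical stand-in for the fields of struct task_struct used here */
struct model_task {
    int pid;
    long state;
};

/*
 * Walk at most max_count tasks and return how many of them would be
 * handed to check_hung_task(). The comparison is an exact equality,
 * so a TASK_KILLABLE task (which also has the uninterruptible bit set)
 * is not selected.
 */
static size_t count_tasks_to_check(const struct model_task *tasks, size_t n,
                                   int max_count)
{
    size_t i, selected = 0;

    assert(tasks != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (!max_count--)
            break;              /* check budget exhausted */
        if (tasks[i].state == TASK_UNINTERRUPTIBLE)
            selected++;
    }
    return selected;
}
```

The max_count parameter corresponds to kernel.hung_task_check_count: with a very small value, tasks late in the traversal are never examined.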

4. Per-task check function check_hung_task()

check_hung_task() compares the task's context switch count between consecutive checks. If the count has not changed and more than the timeout has elapsed since it last changed, the task is considered hung.

kernel/hung_task.c
static void check_hung_task(struct task_struct *t, unsigned long timeout)
{
    /*
     * Total number of context switches of task t as of this check.
     * nvcsw: voluntary context switches; nivcsw: involuntary context switches
     */
    unsigned long switch_count = t->nvcsw + t->nivcsw;

    /*
     * Ensure the task is not frozen.
     * Also, skip vfork and any other user process that freezer should skip.
     */
    if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
        return;

    /*
     * When a freshly created task is scheduled once, changes its state to
     * TASK_UNINTERRUPTIBLE without having ever been switched out once, it
     * musn't be checked.
     */
    if (unlikely(!switch_count))
        return;

    /*
     * Compare the switch count seen in this check with the one recorded in the previous check.
     * If they differ, task t was scheduled between the two checks, so it is not hung:
     * update last_switch_count and last_switch_time and return.
     * If they are the same, check the elapsed time: 1) if the timeout has not expired yet, return
     * without doing anything; 2) otherwise treat the task as hung and continue with the handling below.
     */
    if (switch_count != t->last_switch_count) {
        t->last_switch_count = switch_count;
        t->last_switch_time = jiffies;
        return;
    }
    /*
     * Check whether the deadline (time of the last observed switch + timeout) is still in the future.
     * If so, the timeout has not expired and the function returns; otherwise the task is considered hung.
     */
    if (time_is_after_jiffies(t->last_switch_time + timeout * HZ))
        return;
    /* Emit the sched_process_hang tracepoint */
    trace_sched_process_hang(t);
    /*
     * sysctl_hung_task_panic decides whether to panic when a hung task is detected,
     * after the relevant information has been printed.
     * Its default value comes from the kernel build option CONFIG_BOOTPARAM_HUNG_TASK_PANIC.
     * User space can view or change it through the sysctl parameter kernel.hung_task_panic
     * or the corresponding /proc/sys file.
     * It can also be set with the kernel boot parameter "hung_task_panic=".
     */
    if (sysctl_hung_task_panic) {
        console_verbose();
        hung_task_show_lock = true;
        hung_task_call_panic = true;
    }

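    /*
     * Print the hung task warning.
     * sysctl_hung_task_warnings defaults to 10, so by default 10 warnings are printed.
     * Each warning decrements a positive value; at 0 no more warnings are printed,
     * and a negative value is never decremented, so warnings are unlimited.
     * It can be viewed or changed through the sysctl parameter kernel.hung_task_warnings
     * or the corresponding /proc/sys file.
     */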
    if (sysctl_hung_task_warnings) {
        if (sysctl_hung_task_warnings > 0)
            sysctl_hung_task_warnings--;
        pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",
               t->comm, t->pid, (jiffies - t->last_switch_time) / HZ);
        pr_err("      %s %s %.*s\n",
               print_tainted(), init_utsname()->release,
               (int)strcspn(init_utsname()->version, " "),
               init_utsname()->version);
        pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
               " disables this message.\n");
        /* Print the call stack of task t */
        sched_show_task(t);
        hung_task_show_lock = true;
    }


    touch_nmi_watchdog();
}

1.2 Related parameters

kernel.hung_task_panic: whether to panic when a hung task is detected. The corresponding stack information is printed before the panic.

kernel.hung_task_check_count: the maximum number of tasks examined in one hung task check. The default is PID_MAX_LIMIT, which is 4*1024*1024 on typical 64-bit configurations.

kernel.hung_task_timeout_secs: the hung task timeout. If a task stays in the D state without being scheduled for longer than this value, it is considered hung. Setting it to 0 disables detection.

kernel.hung_task_check_interval_secs: the hung task check interval. The default is 0, which means the timeout value is used as the check interval. If it is set to a non-zero value, the check interval is the smaller of this value and the timeout.

kernel.hung_task_warnings: the number of hung task warnings that will still be printed. The default is 10, and each printed warning decrements it. At 0 no further warnings are printed; a negative value means the number of warnings is unlimited.
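These parameters can be read from user space through the files under /proc/sys/kernel/. The sketch below is one way to do that in C; read_sysctl_long() and print_hung_task_sysctls() are hypothetical helper names, and the code assumes each file holds a single integer. Files that do not exist on a given kernel build are reported as not available.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a single integer from a sysctl file such as
 * /proc/sys/kernel/hung_task_timeout_secs.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_sysctl_long(const char *path, long *value)
{
    FILE *fp;
    int ok;

    assert(path != NULL && value != NULL);

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    ok = (fscanf(fp, "%ld", value) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}

/* Print the hung task parameters listed above, where present. */
static void print_hung_task_sysctls(void)
{
    static const char *const names[] = {
        "hung_task_panic",
        "hung_task_check_count",
        "hung_task_timeout_secs",
        "hung_task_check_interval_secs",
        "hung_task_warnings",
    };
    char path[128];
    long value;
    size_t i;

    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        snprintf(path, sizeof(path), "/proc/sys/kernel/%s", names[i]);
        if (read_sysctl_long(path, &value) == 0)
            printf("kernel.%s = %ld\n", names[i], value);
        else
            printf("kernel.%s: not available\n", names[i]);
    }
}
```

Calling print_hung_task_sysctls() from main() gives the same information as querying the five parameters with the sysctl command. Changing a value means writing to the same files, which requires root privileges.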

1.3 Problem Handling

A hung task occurs because a task has stayed in the D state for a long time without being scheduled. There are many possible reasons for this, and most of them are external to the task itself. Therefore, even though the hung task mechanism prints the task's call stack, there is a good chance that the stack alone will not reveal the root cause.

For example, some hung task problems are commonly mitigated by tuning the parameters that control writeback of cached data to disk: vm.dirty_ratio and vm.dirty_background_ratio. The underlying cause in these cases is that writeback takes too long and holds the related resources for too long, so other tasks cannot obtain those resources and remain in the D state.

To locate the root cause of a hung task problem, the recommended approach is kdump combined with kernel.hung_task_panic: a panic is triggered when a hung task is detected, and the crash dump captured by kdump is then analyzed to find the problem.