The "BUG: scheduling while atomic" detection mechanism: principle and problem handling

All code analysis in this article is based on linux-5.4.18

1 BUG: scheduling while atomic

During task switching, if the scheduler detects that the environment before the switch is an atomic operation in which task switching is prohibited, it prints an exception message such as "BUG: scheduling while atomic:" and then, depending on the configuration, decides whether to trigger a panic and restart the system.

1.1 Principle of the detection mechanism in the kernel

In the Linux kernel scheduler, every time task scheduling occurs, the kernel checks whether it is currently in an atomic operation environment. If so, it prints the "BUG: scheduling while atomic:***" exception message together with the scheduling stack. If "panic_on_warn" is configured in the kernel, a kernel panic is triggered and the system restarts.

In the kernel scheduler implementation, __schedule() is the key task-scheduling function; whenever task scheduling occurs, this function is eventually executed. Inside it, schedule_debug(prev) is executed to determine whether a "BUG: scheduling while atomic:***" situation has occurred.

The code implementation process is as follows:

1.1.1 __schedule()

Before the key scheduling function __schedule() performs task switching, schedule_debug(prev) is called to perform the relevant checks; prev is the task that is currently running at the point of the switch.

Two points need to be noted here: 1) preemption must be disabled before __schedule() is executed; 2) __schedule() is declared static and is used only within this file.

The first point explains why, when in_atomic_preempt_off() below checks for atomic context, preempt_count is compared with "1" rather than "0": before __schedule() is executed, preemption has already been disabled, which adds "1" to preempt_count.

The second point means that task-scheduling operations elsewhere go through wrapper functions; inside such a wrapper, preemption is disabled, __schedule() is called, and preemption is re-enabled (see the sketch after the __schedule() listing below).

/*
 * __schedule() is the main scheduler function.
 * WARNING: must be called with preemption disabled!
 */
static void __sched notrace __schedule(bool preempt)
{
    . . . . . .
    /* Monitor the task-switching environment, including the "BUG: scheduling while atomic:***" check */
    schedule_debug(prev);
    . . . . . .
}
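
As a reference for the second point above, the schedule() wrapper in kernel/sched/core.c of linux-5.4 follows roughly this pattern: it disables preemption, calls __schedule(), and then re-enables preemption (simplified excerpt, some details elided):

asmlinkage __visible void __sched schedule(void)
{
    struct task_struct *tsk = current;

    sched_submit_work(tsk);
    do {
        /* Disable preemption: preempt_count increases by 1 */
        preempt_disable();
        __schedule(false);
        /* Re-enable preemption without immediately rescheduling */
        sched_preempt_enable_no_resched();
    } while (need_resched());
    sched_update_worker(tsk);
}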

1.1.2 schedule_debug()

“BUG: scheduling while atomic:***” is checked in the schedule_debug() function.

First, in_atomic_preempt_off() checks whether the current environment is an atomic operation, essentially through preempt_count() != PREEMPT_DISABLE_OFFSET. Only when preempt_count equals PREEMPT_DISABLE_OFFSET (which is 1) is task switching allowed; otherwise the kernel considers itself to be in an atomic operation (in_atomic_preempt_off() returns true).

preempt_count can be understood as a 32-bit count (per-CPU on some architectures, kept per task on others; on arm64 it lives in the task's thread_info structure), and it can always be read with preempt_count(). Different bit fields of preempt_count represent PREEMPT_MASK, SOFTIRQ_MASK, HARDIRQ_MASK, and so on. Disabling/enabling preemption and entering/leaving softirq or hardirq context modify only the corresponding bit fields. See "include/linux/preempt.h" for details.
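
For reference, the relevant definitions in include/linux/preempt.h of linux-5.4 look roughly as follows (simplified excerpt):

/*
 * preempt_count bit layout (simplified):
 *
 *         PREEMPT_MASK: 0x000000ff   preemption-disable nesting depth
 *         SOFTIRQ_MASK: 0x0000ff00   softirq context / nesting
 *         HARDIRQ_MASK: 0x000f0000   hardirq nesting
 *             NMI_MASK: 0x00100000   NMI context
 */
#define PREEMPT_OFFSET          (1UL << PREEMPT_SHIFT)   /* == 1 */
#define PREEMPT_DISABLE_OFFSET  PREEMPT_OFFSET           /* with CONFIG_PREEMPT_COUNT */

/*
 * Check whether we were atomic before we did preempt_disable():
 * (used by the scheduler)
 */
#define in_atomic_preempt_off() (preempt_count() != PREEMPT_DISABLE_OFFSET)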

If the kernel is currently in an atomic operation, __schedule_bug(prev) is called to print the exception information and, if panic_on_warn is set, to execute panic().

The relevant code is implemented as follows:

static inline void schedule_debug(struct task_struct *prev)
{
    . . . . . .
    /*
     * Determine whether we are currently in an atomic operation:
     * preempt_count == PREEMPT_DISABLE_OFFSET: not atomic, the check passes
     * preempt_count != PREEMPT_DISABLE_OFFSET: atomic, print the exception
     *                                          and possibly panic()
     */
    if (unlikely(in_atomic_preempt_off())) {
        /* Print out exception information and execute panic() */
        __schedule_bug(prev);
        preempt_count_set(PREEMPT_DISABLED);
    }
    . . . . . .
}
1.1.3 __schedule_bug()

When the problem is detected, __schedule_bug(prev) is called: it prints the exception information and, depending on the configuration, performs the panic operation. The specific implementation is as follows:

/*
 * Print scheduling while atomic bug:
 */
static noinline void __schedule_bug(struct task_struct *prev)
{
    . . . . . .
    /* 1) Print out "BUG: scheduling while atomic:" related information */
    printk(KERN_ERR "BUG: scheduling while atomic: %s/%d/0x%08x\n",
                        prev->comm, prev->pid, preempt_count());

    /* 2) After an exception occurs, print out relevant information */
    debug_show_held_locks(prev);
    print_modules();

    /* 3) If panic_on_warn is configured, execute the panic() action */
    if (panic_on_warn)
        panic("scheduling while atomic\\
");

    /* 4) dump scheduling stack */
    dump_stack();
}
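
To make the detection mechanism concrete, here is a minimal, hypothetical test module (not taken from the kernel or from the case below) that forces the condition: it disables preemption and then sleeps, so __schedule() runs with preempt_count() != PREEMPT_DISABLE_OFFSET and schedule_debug() reports "BUG: scheduling while atomic". Do not load something like this on a production system; with CONFIG_DEBUG_ATOMIC_SLEEP enabled, might_sleep() will additionally report "BUG: sleeping function called from invalid context".

#include <linux/module.h>
#include <linux/init.h>
#include <linux/delay.h>
#include <linux/preempt.h>

static int __init atomic_sleep_demo_init(void)
{
    preempt_disable();   /* preempt_count: 0 -> 1, now in an atomic region */
    msleep(10);          /* sleeps: __schedule() then sees preempt_count == 2 != PREEMPT_DISABLE_OFFSET */
    preempt_enable();
    return 0;
}

static void __exit atomic_sleep_demo_exit(void)
{
}

module_init(atomic_sleep_demo_init);
module_exit(atomic_sleep_demo_exit);
MODULE_LICENSE("GPL");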

1.2 Case 1:

1.2.1 Problem description:

In a customer application scenario, the device restarted unexpectedly. The log showed the "BUG: scheduling while atomic:" message, after which the kernel panicked and restarted.

1.2.2 Log

[ 2004.917769][ 14] BUG: scheduling while atomic: ktimersoftd/14/131/0x00000002
... ...
[ 2004.917816][ 14] CPU: 14 PID: 131 Comm: ktimersoftd/14 Not tainted 4.19.90-25.2.rt.2101.gfb01.ky10.aarch64 #1
... ...
[ 2004.917820][ 14] Call trace:
[ 2004.917829][ 14] dump_backtrace+0x0/0x170
[ 2004.917830][ 14] show_stack+0x24/0x30
[ 2004.917834][ 14] dump_stack+0xa4/0xe8
[ 2004.917838][ 14] __schedule_bug+0x68/0x88
[ 2004.917840][ 14] __schedule+0x618/0x748
[ 2004.917841][ 14] schedule+0x48/0x100
[ 2004.917844][ 14] rt_spin_lock_slowlock_locked+0x110/0x2c0
[ 2004.917845][ 14] rt_spin_lock_slowlock+0x54/0x70
[ 2004.917847][ 14] rt_spin_lock+0x60/0x70
[ 2004.917850][ 14] timerfd_triggered+0x24/0x60
[ 2004.917851][ 14] timerfd_tmrproc+0x20/0x30
[ 2004.917854][ 14] __hrtimer_run_queues+0xf8/0x388
[ 2004.917855][ 14] hrtimer_run_softirq+0x78/0xf0
[ 2004.917858][ 14] do_current_softirqs+0x1c4/0x3e8
[ 2004.917859][ 14] run_ksoftirqd+0x30/0x50
[ 2004.917860][ 14] smpboot_thread_fn+0x1a8/0x2a0
[ 2004.917862][ 14] kthread+0x134/0x138
[ 2004.917864][ 14] ret_from_fork+0x10/0x18
[ 2006.350914][ 14] Kernel panic - not syncing: Fatal exception

1.2.3 Reason analysis:

The log confirms the cause: a timerfd timer callback runs in the soft-interrupt processing thread, and timerfd uses a sleepable spin_lock.

In PREEMPT RT Linux, soft interrupts are processed in kernel threads (ksoftirqd / ktimersoftd), so they are normally allowed to sleep. However, an earlier modification made the hrtimer soft-interrupt processing path take a raw_spin_lock, and holding that lock increments preempt_count.

In this soft-interrupt processing thread (run_ksoftirqd), the hrtimer softirq handled by the ktimersoftd mechanism ends up calling timerfd code that uses the sleepable spin_lock. When the spin_lock cannot be acquired immediately, the current task sleeps and a task switch occurs. During the switch, __schedule() finds preempt_count() != PREEMPT_DISABLE_OFFSET, judges that it is currently in an atomic operation, reports the error, and the system restarts after the panic.
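
Reading the value printed in the log under this analysis (an interpretation sketch, not part of the original log):

/*
 * What schedule_debug() effectively sees here (value taken from the log):
 *
 *   preempt_count()        == 0x00000002  (PREEMPT field = 2)
 *   PREEMPT_DISABLE_OFFSET == 1           (only the preempt_disable() done
 *                                          before __schedule() is expected)
 *
 *   in_atomic_preempt_off(): (2 != 1) -> true -> __schedule_bug()
 *
 * The extra count of 1 corresponds to the raw_spin_lock held on the
 * hrtimer soft-interrupt path described above.
 */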

1.2.4 Solution

In the PREEMPT RT Linux adaptation, the implementation of spin_lock is changed to rt_spin_lock, which makes spin_lock sleepable; both the lock itself and the critical section it protects can be preempted, which helps guarantee the real-time behaviour of the system.

However, some places in the Linux kernel must use atomic operations, where sleeping and task switching are prohibited; using a sleepable spin_lock there causes exactly this kind of problem. For this reason, PREEMPT RT Linux still keeps a non-sleepable, non-preemptible spinlock mechanism, raw_spin_lock, and in scenarios where task switching is prohibited, raw_spin_lock must be used instead of the preemptible spin_lock.

Solution: for the current problem, in regions where sleeping and preemption are prohibited, use the non-sleepable raw_spin_lock instead of the sleepable spin_lock to protect critical-section resources.

The general shape of such a modification is illustrated below (the actual patch is not reproduced here):
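
The sketch uses a hypothetical structure and hypothetical function names purely for illustration; it only shows the pattern of switching a lock that is taken in a non-sleepable context from spinlock_t to raw_spinlock_t:

/* Hypothetical example, for illustration only */
struct demo_ctx {
    raw_spinlock_t lock;          /* was: spinlock_t lock; */
    u64 ticks;
};

static void demo_timer_triggered(struct demo_ctx *ctx)
{
    unsigned long flags;

    /* was: spin_lock_irqsave(&ctx->lock, flags); */
    raw_spin_lock_irqsave(&ctx->lock, flags);
    ctx->ticks++;
    raw_spin_unlock_irqrestore(&ctx->lock, flags);
}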

In addition, in non-PREEMPT RT Linux, functions that may cause task switching (sleeping voluntarily or otherwise giving up the CPU) must not be used in the top half of an interrupt or inside a critical section protected by a spinlock. Note that the compiler does not catch such misuse; it is detected at run time, typically by the might_sleep() check inside sleeping functions when CONFIG_DEBUG_ATOMIC_SLEEP is enabled, which prints "BUG: sleeping function called from invalid context".
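
As a rough illustration (simplified sketch with a hypothetical function name, not actual kernel source), sleeping APIs typically start with a might_sleep() annotation, which is what produces that run-time warning in atomic context:

void some_sleeping_api(void)
{
    /*
     * Debug-time check: with CONFIG_DEBUG_ATOMIC_SLEEP, this warns
     * ("BUG: sleeping function called from invalid context") if called
     * while preemption or interrupts are disabled.
     */
    might_sleep();

    /* ... the function may eventually call schedule() ... */
}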