Why the Linux CFS scheduler doesn’t bring amazing crushing effects

Anyone who knows the Linux kernel knows its CFS process scheduler. Whether it is the write-ups from when CFS was first merged in 2.6.23, or the endless source-code analyses, articles, and Linux kernel books, they all leave the same impression: that the CFS scheduler is revolutionary and will reshape process scheduling. Naturally, people expect it to deliver stunning results.

But this is an illusion.

People want CFS to win decisively, but on closer analysis it is only slightly better than the O(1) scheduler in some respects, and in other respects not even as good as the ancient 4.4BSD scheduler. Yet people still flock to it, and the source-code analyses in particular keep pouring out.

Why does CFS not have a crushing effect on other scheduling algorithms?

First of all, crushing victories barely exist in the real world. When people and things in the same weight class are compared, the differences between them are far smaller than imagined, and nobody truly crushes anybody. Don't be fooled by novels, TV series and movies. (Xu Xiaodong knocking out Lei Lei doesn't count; they were not in the same class.)

In any field, revolutionary, crushing innovation is not impossible, but the probability is extremely low. People tend to be conceited in assuming that their own era is undergoing some crushing change, when in all likelihood what they end up with is nothing but mediocrity.

What you usually get in the end is a long struggle and a stalemate.

Secondly, note that the CFS scheduler's main selling point is better treatment of interactive processes. Here CFS does do better than O(1), but the "amazing" part mostly comes from fans' enthusiasm. There are simply not many interactive processes on a Linux system: Linux mostly runs on servers, and from a server's point of view throughput matters more than interactive response.

What about the highly interactive Android system? We know Android also uses the CFS scheduler (and there was the BFS episode as well). Why doesn't CFS bring amazing results there either?

Admittedly, Android did not exist yet when CFS appeared around 2008. By the time Android did appear, the Linux kernels it shipped with already defaulted to the CFS scheduler; just compare Android versions, the Linux kernel versions they use, and their release dates.

The Linux kernel adopted CFS in 2.6.23, so one reason is simply that there is no comparison to be made: on Android, CFS never got a chance to go head to head with O(1).

Besides, even if you ported an O(1) scheduler onto Android and ran an A/B test against CFS, in my opinion CFS still would not look amazing. The reason is simple: Android processes are almost all interactive, but there is only ever one foreground process, so you can hardly feel any lag caused by process switching. In other words, even if CFS treats interactive processes far better than O(1), you will not notice it, because on phones and tablets the time it takes you to switch apps is far larger than the time granularity of process switching.

So, what is so good about CFS?

To put it simply, the significance of CFS is this: in a system that mixes a large number of compute-bound processes with IO-bound interactive processes, CFS treats the IO-bound interactive processes more fairly and more kindly than the O(1) scheduler does. Understanding this is critical.

In fact, the idea behind CFS is very old. The same fair-queuing idea has long been applied to disk IO scheduling, packet scheduling, and so on, and it can even be seen in the process scheduling of the venerable SVR3 and 4.3BSD UNIX systems. It would be fair to say that Linux merely adopted the CFS scheduler rather than invented it!

Taking the 4.3BSD scheduler as an example, let’s take a look at its scheduling principles.

4.3BSD uses a one-second recomputation scheme: every second, the priorities of all processes in the system are recalculated, and the highest-priority process is then picked to run. The idea is very simple; let's look at how the priority is calculated.
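Roughly speaking, BSD keeps a decayed estimate of each process's recent CPU usage and recomputes the priority from it. A minimal sketch, written from memory of the 4.3BSD/4.4BSD design (constant names such as PUSER and the exact decay factor are illustrative and varied between releases):

/*
 * Sketch of the classic BSD priority recomputation (illustrative, not a quote
 * of the original sources). Lower numeric priority means "runs sooner".
 */
#define PUSER 50                /* base priority of a user-mode process (assumed value) */

struct bsd_proc {
	int p_nice;             /* user-settable niceness */
	int p_estcpu;           /* decayed estimate of recent CPU usage, bumped each tick the process runs */
	int p_usrpri;           /* computed user-mode priority */
};

/* Called roughly once per second for every process in the system. */
static void recompute_priority(struct bsd_proc *p, int loadavg)
{
	/* Decay the CPU-usage estimate; the heavier the load, the slower old usage is forgiven. */
	p->p_estcpu = (2 * loadavg * p->p_estcpu) / (2 * loadavg + 1) + p->p_nice;

	/* The more CPU you used recently, the worse (numerically higher) your priority becomes. */
	p->p_usrpri = PUSER + p->p_estcpu / 4 + 2 * p->p_nice;
}

A process that has been hogging the CPU drifts down in priority, and a process that has been waiting drifts back up: fairness by construction.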

If you want to understand the essence of CFS, that is already it. Put into words, the essence of CFS is: "in a system with n processes, over any sufficiently long period T, each process runs for T/n of the time."
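In symbols (my notation, not the kernel's): for n runnable processes observed over any interval of length T,

\[
t_i = \frac{T}{n} \quad\text{(equal weights)}, \qquad
t_i = \frac{T\,w_i}{\sum_{j=1}^{n} w_j} \quad\text{(weighted generalization)}.
\]

The weighted form on the right is what CFS actually aims for; nice values only enter the picture through the weights w_i.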

Of course, in the real implementation, 80% of the code handles the remaining 20% of the problems, such as how to reward processes that have slept for a long time, but those details are not the essence.

To sum up:

  • In the real world, it is hard to crush people or things in the same weight class.
  • The vast majority of Linux servers do not need to pamper interactive processes, so CFS's advantage never shows.
  • The vast majority of Android systems never gave CFS a chance to be compared against O(1).
  • The many interactive processes on Android can hardly perceive scheduling granularity at all.
  • The idea behind CFS scheduling has been around since ancient times.

So, neither conceptually nor in its effects does the Linux CFS scheduler deliver an eye-catching wow. Something is still missing, though: the technical explanation.

Before analyzing any mechanism, first ask what the mechanism's goal is and what problem it solves; only then does the analysis make sense. Merely understanding how it works is not enough.

So when Linux adopted the CFS scheduler, what problem was it meant to solve? It was certainly introduced to replace O(1) in response to some defect of the O(1) algorithm. That defect may not be notorious, but it was a nail that had to be pulled.

The essential problem with the O(1) scheduler is that there is a hard-wired mapping between a process's priority and the time slice that process is allowed to run!

That is, given a process's priority, a fixed time slice is computed from it. Ignoring the dynamic priority used for rewards and penalties, here is how the original O(1) algorithm computed a process's time slice:

/* Early 2.6 O(1) scheduler: the time slice is a pure linear function of
 * the static priority (the nice value); nothing else enters the formula. */
#define BASE_TIMESLICE(p) (MIN_TIMESLICE + \
	((MAX_TIMESLICE - MIN_TIMESLICE) * \
	(MAX_PRIO-1 - (p)->static_prio) / (MAX_USER_PRIO-1)))

static inline unsigned int task_timeslice(task_t *p)
{
	return BASE_TIMESLICE(p);
}

Intuitively, this is a single straight line: the time slice falls linearly from MAX_TIMESLICE at the highest priority (most negative nice) down to MIN_TIMESLICE at the lowest, as a function of static priority alone.

To address this problem, the O(1) scheduler in the 2.6 kernel introduced a double slope:

/* SCALE_PRIO(x, prio) scales the base x linearly by how far prio is from the
 * bottom of the user priority range, clamped at MIN_TIMESLICE (see
 * kernel/sched.c of the same era for the exact macro). */
static unsigned int task_timeslice(task_t *p)
{
	/* nice < 0: scale a 4x larger base; nice >= 0: scale the default base. */
	if (p->static_prio < NICE_TO_PRIO(0))
		return SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
	else
		return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
}

Intuitively, this gives two line segments instead of one: a steep segment for nice < 0 (base DEF_TIMESLICE*4) and a gentler one for nice >= 0 (base DEF_TIMESLICE).

It looks as if the problem is solved, but zoom in on any sub-range of priorities and a problem remains: the problem of relative priority. Near the high-priority end, a one-step difference in nice changes the time slice only a little in relative terms, while near the low-priority end the same one-step difference can halve or double the slice. Two pairs of processes with the same priority difference end up with very different time-slice ratios, purely because of where they sit on the scale (the small program below makes this concrete).
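Here is a small standalone program that reproduces the 2.6-era double-slope mapping. The constants are taken from my recollection of kernel/sched.c of that period (assuming HZ=1000 so one jiffy is 1 ms) and are meant to be illustrative rather than an exact quote:

#include <stdio.h>

/* Illustrative 2.6-era constants (assumed values). */
#define HZ              1000
#define MAX_RT_PRIO     100
#define MAX_PRIO        (MAX_RT_PRIO + 40)
#define MAX_USER_PRIO   40
#define MIN_TIMESLICE   (5 * HZ / 1000)     /* 5 ms   */
#define DEF_TIMESLICE   (100 * HZ / 1000)   /* 100 ms */
#define NICE_TO_PRIO(n) (MAX_RT_PRIO + (n) + 20)

/* Linear scaling of a base slice by priority, clamped at MIN_TIMESLICE. */
static unsigned int scale_prio(unsigned int x, int prio)
{
	int t = x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2);
	return t > MIN_TIMESLICE ? t : MIN_TIMESLICE;
}

/* The double-slope mapping discussed above. */
static unsigned int task_timeslice(int static_prio)
{
	if (static_prio < NICE_TO_PRIO(0))
		return scale_prio(DEF_TIMESLICE * 4, static_prio);
	else
		return scale_prio(DEF_TIMESLICE, static_prio);
}

int main(void)
{
	/* Same nice distance (1), very different relative effect. */
	printf("nice  0 -> %3u ms   nice  1 -> %3u ms\n",
	       task_timeslice(NICE_TO_PRIO(0)),  task_timeslice(NICE_TO_PRIO(1)));
	printf("nice 18 -> %3u ms   nice 19 -> %3u ms\n",
	       task_timeslice(NICE_TO_PRIO(18)), task_timeslice(NICE_TO_PRIO(19)));
	return 0;
}

With these assumed constants it prints 100 ms vs 95 ms for the nice 0/1 pair but 10 ms vs 5 ms for the nice 18/19 pair: the same one-step difference in nice is a roughly 5% change at one end of the scale and a 2x change at the other.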

It is like setting out to cure a limp and ending up breaking an arm.

Essentially, this is due to the following two reasons:

  1. Fixed priorities map to fixed time slices.
  2. Relative and absolute priorities are mixed.

So how to solve this problem?

Priority and time slice are two different concepts, and something must mediate between them. A high priority only means the process deserves to run longer; how long it actually runs cannot be determined by priority alone, it has to take the whole system into account. In other words, if there is only one process, then no matter how low its priority is, it can run forever; if the system is full of processes, then even the highest-priority process has to yield some time to the others.

So, taking the overall process population into account, converting priorities into weights and time slices into shares, you arrive at CFS. The natural coordinate system is weight share versus CPU time, not raw weight (or priority) versus time slice, and in that coordinate system the mapping is smooth.
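A minimal sketch of what "weights and shares" means, loosely modeled on the idea behind CFS's sched_slice(): the 1.25x-per-nice-step weight rule mirrors the kernel's prio_to_weight table (nice 0 is about 1024), while the 24 ms period is just an assumed value for illustration:

#include <stdio.h>
#include <math.h>

/* Each runnable task gets a share of a common period proportional to its
 * weight, and the weight is derived from nice so that one nice step changes
 * the weight by roughly 1.25x. */
#define NICE_0_WEIGHT 1024.0
#define PERIOD_MS     24.0   /* assumed scheduling period, for illustration */

static double nice_to_weight(int nice)
{
	return NICE_0_WEIGHT / pow(1.25, nice);
}

int main(void)
{
	int nices[] = { -5, 0, 0, 5 };          /* four runnable tasks */
	int n = sizeof(nices) / sizeof(nices[0]);
	double total = 0.0;

	for (int i = 0; i < n; i++)
		total += nice_to_weight(nices[i]);

	/* A task's slice of the period is weight_i / total_weight: higher
	 * priority means a bigger share, never an unconditional fixed slice. */
	for (int i = 0; i < n; i++) {
		double w = nice_to_weight(nices[i]);
		printf("nice %+d: weight %6.1f -> %5.2f ms of every %.0f ms\n",
		       nices[i], w, PERIOD_MS * w / total, PERIOD_MS);
	}
	return 0;
}

Drop three of the four tasks and the survivor gets the whole period regardless of its nice value, which is exactly the point made above.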

So Linux CFS really exists to fix a single "static priority to time slice mapping" problem in O(1). Knowing that, how amazing could its effect possibly be? And there is another "but": this O(1) problem is not actually a problem for compute-bound background services; if anything it is a blessing, since a high-priority process can keep running for a long stretch without being switched out, which is good for throughput and cache utilization. It only hurts interactive processes, and on a server, who cares.

Of course, when tuning CFS you still end up dealing with the leftover issues, such as rewards and penalties for IO sleep, and designing little trick heuristics for them, which is a drain on energy.

Incidentally, you also need to build the kernel with HZ=1000 for CFS's smoothness to show, just as advertised. But apart from fancy desktop distributions such as Ubuntu and SUSE, which Linux deployment really needs HZ=1000? Isn't HZ=250 good enough for servers?
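For reference, HZ is a build-time choice made through Kconfig; a desktop-oriented .config would typically contain something like the following (a server build would select CONFIG_HZ_250 and CONFIG_HZ=250 instead):

# Desktop-style tick rate
# CONFIG_HZ_250 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000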

That basically wraps up the scheduling topic, but before moving on to the obligatory complaining, two more points are worth emphasizing:

  1. CFS time slices are dynamic: a process's share is a function of the current load (the number and weights of runnable processes) as well as its own weight. This lets scheduling adapt to the system and keeps switching overhead reasonable.
  2. Even in the multi-core era, real-time processes are still scheduled strictly by highest priority, just as in the single-core era.

I still want to say that when it comes to scheduler design, most people focus on the wrong thing!

In an era when the number of CPU cores keeps growing, people should be more concerned with which CPU core a process is placed on than with which process runs next on a given core.

Linux, which grew up in the single-core era, has developed rapidly, and that is understandable; but what makes an operating system kernel succeed is not only technology. Programmers hate hearing this, of course: programmers are most annoyed by non-technical matters, programmers compete with everyone at writing code, and programmers especially love to complain that their managers cannot code.

Purely in terms of technology, Linux is not outstanding. Linux is excellent overall because a crowd of contributors whose work is not code keep making it better, and also thanks to open source and the community. The barrier to learning Linux is extremely low: if a company can hire a Linux programmer effortlessly, why would it go to the trouble of hiring a high-end BSD programmer? The end result is that so many people use Linux that it cannot be displaced even if someone wanted to.

But none of that makes up for certain design-level mistakes in the Linux kernel.

The Linux kernel still follows its original main line. Take the classic Linux kernel books: Robert Love's "Linux Kernel Development" and "Understanding the Linux Kernel". When they discuss process scheduling, they say little or nothing about multi-core load balancing. Such classics have led many fans into an abyss of code from which they never return, hence the flood of CFS source-code analyses.

Setting this rather ordinary Linux kernel aside, modern operating systems have entered the multi-core era, whose core is innovation in cache utilization, and the changes it brings are innovations in process scheduling and memory management. If you review the Linux kernel source today, those changes are already reflected there.

Sadly, the classic Linux kernel books have never been updated, and the traditionalists who like to read and study are still working through tomes from ten years ago.

Of course, as a codebase the Linux kernel aims to be universal, so it is hard for the community to single out multi-core issues for special attention. What the community cares about most is maintainability, not performance. If a new Linux feature runs without problems on an i386 box with 128MB of memory, it is considered fine. Unless a problem affects more than 80% of users, the community will not care, and for the same reason the community keeps introducing bugs, which one can only sigh about.

In my view, the community is just a programmers' community where everything revolves around code; it will not pay much attention to architectural evolution or new features. Those are the vendors' business.

Original author: Linux code reading field

Original address: Why the Linux CFS scheduler does not bring amazing crushing effects – Tencent Cloud Developer Community – Tencent Cloud (copyright belongs to the original author)