Python multithreading and multiprocessing explained: why, when, and how to use them

The purpose of this guide is to explain why multithreading and multiprocessing are needed in Python, when to use them, and how to use them in your programs. As an AI researcher, I use them extensively when preparing data for my models!

Before getting to the point, let me tell you a story:

A long time ago, in a distant galaxy…

A clever and powerful wizard lived in a small, remote village. Let’s call him Dumbledore. Not only was he smart and capable, he was also willing to help anyone who asked, which meant that people came from all over to seek his help. Our story begins one fine day when a young traveler brought a magic scroll to the wizard. The traveler didn’t know what was in the scroll, but he knew that if anyone could decipher its secret, it would be the great wizard Dumbledore.

Chapter 1: Single thread, single process

If you haven’t guessed the meaning of my story, it is a metaphor for the CPU and what it does. Our wizard is the CPU, and the magic scroll is a list of URLs that lead to the power of Python and the knowledge to wield that power.

The wizard deciphered the scroll without much difficulty, and his first thought was to send a trusted friend to each of the locations given on the scroll, one after another, to see what he could find and bring it back.

In code, this is just a for loop that iterates through the URLs one after another and reads each response. Thanks to the %%time magic from IPython, we can see that it takes about 12 seconds on my poor internet connection.
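A minimal sketch of that sequential loop, assuming the standard-library urllib for the requests. The function names (fetch, fetch_all_sequential, timed) and the URLs in the usage comment are placeholders of my own, not the original list from the scroll.

```python
import time
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return the response body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all_sequential(urls, fetch_one=fetch):
    """Fetch each URL in order; a request starts only after the previous one has finished."""
    results = []
    for url in urls:
        results.append(fetch_one(url))
    return results


def timed(func, *args, **kwargs):
    """Return (result, elapsed_seconds); a stand-in for %%time outside a notebook."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Example usage (needs network access; the URLs are placeholders):
# urls = ["https://www.python.org/", "https://docs.python.org/3/"]
# pages, seconds = timed(fetch_all_sequential, urls)
# print(f"fetched {len(pages)} pages in {seconds:.2f} s")
```

The fetch_one parameter is only there so the loop can be tried with a fake downloader when no network is available.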

Chapter 2: Multithreading

The wizard’s wisdom was renowned throughout the land, and he soon came up with a more effective method. Instead of sending one person to each location in order, why not gather a group of (trustworthy) people and send each of them to a different location at the same time? Once they were all back, the wizard could simply combine everything they had brought back.

Yes, we can use multi-threading to access multiple URLs at the same time instead of traversing the list one after another.
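One way to sketch this is with ThreadPoolExecutor from the standard-library concurrent.futures module. Whether the original code used this class or the lower-level threading module is an assumption on my part, and fetch_all_threaded and max_workers=8 are illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return the response body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all_threaded(urls, fetch_one=fetch, max_workers=8):
    """Fetch the URLs concurrently; results come back in the same order as `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map hands each URL to a worker thread. While one thread
        # waits on the network, another thread can start its own request.
        return list(executor.map(fetch_one, urls))


# Example usage (needs network access; the URLs are placeholders):
# urls = ["https://www.python.org/", "https://docs.python.org/3/"]
# pages = fetch_all_threaded(urls)
```

Each worker thread plays the role of one of the wizard's friends: all of them set off at once, and the with block waits until every one of them has returned.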

Much better! It’s like… magic. Using multithreading can significantly speed up many I/O-bound tasks. Here, most of the time taken to read the URLs is due to network latency. I/O-bound programs spend most of their time waiting for input/output (you guessed it, similar to how the wizard has to wait for his friends to travel to the locations on the scroll and back). This could be I/O from the network, a database, files, or even users. This kind of I/O tends to take a lot of time because the source itself may need to do its own processing before delivering the data. For example, a CPU works much faster than a network connection can transfer data.

Note: Multithreading is very useful in tasks such as web scraping.

Chapter 3: Multiprocessing

As time passed and our wizard’s reputation grew, a rather nasty dark wizard, driven by jealousy, used cunning means to place a terrible spell on Dumbledore. Once the spell was cast, Dumbledore knew he had only a few moments to break it. In desperation, he rummaged through his spell book and found a counterspell that seemed to work. The only problem was that it required him to calculate the sum of all prime numbers below 1 million.

Now, the wizard knew that given enough time, calculating the value would be trivial, but time was a luxury he did not have. Although he was a wizard, he was still limited by human nature and could only check one number at a time. Adding up the prime numbers one by one would take far too long. With a few seconds left, he suddenly remembered the multiprocessing spell he had learned from the magic scroll years ago. This spell allowed him to make copies of himself, and by splitting the numbers up among his copies, he could check whether multiple numbers were prime at the same time. In the end, all he had to do was add up all the prime numbers he and his copies had found.

Since modern CPUs often have multiple cores, we can speed up CPU-bound tasks by using the multiprocessing module. CPU-bound tasks are programs that spend most of their time performing calculations on the CPU (mathematical calculations, image processing, etc.). If the calculations can be performed independently of each other, we can distribute them across the available CPU cores, significantly increasing processing speed.

All you have to do is:

  1. Define the function to be applied

  2. Prepare a list of items to which the function will be applied

  3. Use Pool to spawn the processes. The number passed to Pool() is the number of processes spawned. Wrapping it in a with statement ensures that the worker processes are cleaned up once execution completes.

  4. Use the pool’s map function to collect the output. The inputs to map are the function to be applied to each item and the list of items.

Note: This function can be defined to perform any task that can be executed in parallel. For example, a function might contain code that writes the results of a calculation to a file.
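The four steps above can be sketched as follows for the wizard's prime-sum problem. The trial-division is_prime, the worker count of 4, and the chunksize are my own illustrative choices rather than the original code.

```python
import math
from multiprocessing import Pool


def is_prime(n):
    """Step 1: the function to apply. Trial division up to sqrt(n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for divisor in range(3, math.isqrt(n) + 1, 2):
        if n % divisor == 0:
            return False
    return True


def sum_primes_parallel(limit, processes=4):
    """Sum all primes below `limit`, spreading the checks over `processes` workers."""
    numbers = range(limit)                 # step 2: the items
    with Pool(processes) as pool:          # step 3: spawn the worker processes
        # step 4: map applies is_prime to every item and returns the results in order
        flags = pool.map(is_prime, numbers, chunksize=10_000)
    return sum(n for n, prime in zip(numbers, flags) if prime)


if __name__ == "__main__":
    # The guard matters: on platforms that start workers by re-importing
    # this module, unguarded code would try to spawn pools recursively.
    print(sum_primes_parallel(1_000_000))
```

A larger chunksize means fewer, bigger batches are sent to each worker, which cuts down on the cost of passing data between processes when each individual check is cheap.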

So why do we need separate multiprocessing and multithreading? If you try to use multithreading to improve performance on a CPU-bound task, you may notice that what you actually get is a performance penalty. Heresy! Let’s see why this is the case.

Just as the wizard is limited by human nature and can only calculate one number at a time, Python (more precisely, the standard CPython interpreter) comes with the Global Interpreter Lock (GIL). Python will happily let you spawn any number of threads, but the GIL ensures that only one of them is executing Python code at any given time.

For an I/O-bound task, this is perfectly fine. One thread makes a request to one URL, and while it is waiting for a response, that thread can be swapped out for another thread that makes a request to another URL. Because a thread doesn’t need to do anything until it receives a response, it doesn’t matter that only one thread is executing at a given time.

For CPU-bound tasks, since only one thread executes at a time, even if multiple threads are spawned and each thread has its own number to check for primality, the CPU will still only process one thread at a time. In effect, the numbers are still checked one by one. If you use multithreading in a CPU-bound task, the overhead of managing the threads will degrade performance.
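To see this for yourself, here is a rough sketch that runs the same prime check once in a plain loop and once through a thread pool. The function names and the limit of 100,000 are my own choices. On a standard CPython build with the GIL you should expect the threaded version to be no faster, and often slower, though the exact timings depend on your machine and Python version.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def is_prime(n):
    """Trial division up to sqrt(n): pure computation, no waiting on I/O."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for divisor in range(3, math.isqrt(n) + 1, 2):
        if n % divisor == 0:
            return False
    return True


def sum_primes_serial(limit):
    """Check every number below `limit` one after another in a single thread."""
    return sum(n for n in range(limit) if is_prime(n))


def sum_primes_threaded(limit, max_workers=4):
    """Same work handed to a thread pool; under the GIL the checks still run one at a time."""
    numbers = range(limit)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        flags = list(executor.map(is_prime, numbers))
    return sum(n for n, prime in zip(numbers, flags) if prime)


if __name__ == "__main__":
    for label, func in [("serial", sum_primes_serial), ("threaded", sum_primes_threaded)]:
        start = time.perf_counter()
        total = func(100_000)
        print(f"{label:>8}: sum={total} in {time.perf_counter() - start:.2f} s")
```

Both versions return the same sum; only the elapsed time differs.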

To overcome this “limitation” we use the multiprocessing module. Instead of using threads, multiprocessing uses multiple processes. Each process has its own interpreter and memory space, so the GIL of one process doesn’t block the others. Essentially, each process can run on a different CPU core and work on different numbers at the same time.

You may notice that CPU utilization is much higher with multiprocessing than with a simple for loop, or even with multithreading. This is because your program is using multiple CPU cores, not just one.

Keep in mind that multiprocessing inherently comes with the overhead of managing multiple processes, which is typically more expensive than managing threads (multiprocessing starts a separate interpreter and allocates a separate memory space for each process). This means that, as a rule of thumb, it is best to use lightweight multithreading when it does the job (I/O-bound tasks). When CPU processing becomes the bottleneck, it is usually time to call on the multiprocessing module. But remember, with great power comes great responsibility.

If you spawn more processes at once than your CPU can handle, you’ll notice that performance starts to degrade. This is because you now have more processes than cores, so the operating system has to do more work swapping processes in and out of the CPU cores. The reality is probably more complicated than this simple explanation, but that’s the basic idea. On my system, performance drops off once we reach 16 processes, because my CPU has only 16 logical cores.
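If you want to look for the same drop-off on your own machine, a rough experiment could look like the sketch below. The helper names, the limit of 200,000 and the choice of trying powers of two up to twice the core count are assumptions of mine, and the timings will vary from run to run.

```python
import math
import os
import time
from multiprocessing import Pool


def is_prime(n):
    """Trial division up to sqrt(n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for divisor in range(3, math.isqrt(n) + 1, 2):
        if n % divisor == 0:
            return False
    return True


def time_pool(processes, limit=200_000):
    """Return the seconds taken to check every number below `limit` with `processes` workers."""
    start = time.perf_counter()
    with Pool(processes) as pool:
        pool.map(is_prime, range(limit), chunksize=1_000)
    return time.perf_counter() - start


def process_counts(logical_cores):
    """Worker counts to try: powers of two up to twice the number of logical cores."""
    counts = []
    n = 1
    while n <= 2 * logical_cores:
        counts.append(n)
        n *= 2
    return counts


if __name__ == "__main__":
    cores = os.cpu_count() or 1  # os.cpu_count() can return None
    print(f"logical cores: {cores}")
    for processes in process_counts(cores):
        print(f"{processes:>3} processes: {time_pool(processes):.2f} s")
```

With a workload this small, the cost of starting the processes may dominate the measurement; raising limit makes the effect of the core count easier to see.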

Chapter 4: TLDR

  • For IO-bound tasks, using multi-threading can improve performance.

  • For IO-bound tasks, using multiprocessing can also improve performance, but the overhead is often higher than using multithreading.

  • The Python GIL means that only one thread can execute at any given time in a Python program.

  • For CPU-bound tasks, using multiple threads can actually reduce performance.

  • For CPU-bound tasks, using multiprocessing can improve performance.

The above is an introduction to multithreading and multiprocessing in Python. Now please, go forward bravely and conquer everything!

syntaxbug.com © 2021 All Rights Reserved.