Understanding Python generators (yield)

We know that a function whose body contains the yield keyword is not an ordinary function. Such a function is called a generator, and it is typically used in loop-processing code; applied properly, it can greatly reduce memory usage. For example, consider a function that opens a file, converts each line to uppercase, and returns the results:

def read_file_upper(path):
    lines = []
    with open(path) as f:
        for line in f:
            lines.append(line.upper())
    return lines

This version of the function builds a list internally to hold the conversion results. The for loop iterates over each line of the file, converts it to uppercase and appends it to the list, so every line of the file ends up being kept in the list at once. If the file is large, the memory overhead is considerable.

We can use the yield keyword to turn read_file_upper into a generator version. The logic of the function body is unchanged, except that the processing result for each line is returned one at a time through yield instead of being collected into a list and returned at the end.

def iter_file_upper(path):
    with open(path) as f:
        for line in f:
            yield line.upper()

Suppose there is a text file data.txt with the following content:

hello, world
life is short, use python
my wechat id is: coding-fan
bye

Using the iter_file_upper generator, we can process it like this:

>>> for line in iter_file_upper('data.txt'):
...     print(line.strip())
HELLO, WORLD
LIFE IS SHORT, USE PYTHON
MY WECHAT ID IS: CODING-FAN
BYE

The iter_file_upper generator is used in much the same way as the read_file_upper function, but it does not hold all the lines of the file at once; it processes and returns them one by one, so memory usage is kept to a minimum.

Behavior Observation

So why does the generator have this effect? Let's observe:

>>> g = iter_file_upper('data.txt')
>>> g
<generator object iter_file_upper at 0x103becd68>

After we call iter_file_upper, we get a generator object instead of the file processing result. At this time, iter_file_upper has not yet started execution.

When we call the next function to fetch data from the generator, iter_file_upper starts executing, stops at yield, and returns the processing result of the first line to us:

>>> next(g)
'HELLO, WORLD\n'

At this point, the generator is paused and will not process the second line of data until we tell it to.

When we call the next function again, the generator resumes execution, processes the next line of data and pauses at yield once more:

>>> next(g)
'LIFE IS SHORT, USE PYTHON\n'

The generator remembers its own execution progress: every time we call the next function, it processes and produces the next piece of data, without us having to keep track of anything:

>>> next(g)
'MY WECHAT ID IS: CODING-FAN\n'
>>> next(g)
'BYE\n'

When the code in iter_file_upper runs to completion, the next call raises an exception to notify the caller that the generator is exhausted:

>>> next(g)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Therefore, we can roughly think of the for-in loop as being implemented inside the Python virtual machine like this (see the sketch after the list):

  • keep calling the next function to make the generator produce data;
  • until the generator raises a StopIteration exception.
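
A minimal sketch of this idea in plain Python (the real interpreter does this at the bytecode level, so the following is only an approximation):

def for_in(iterable, handle):
    # Rough equivalent of: for item in iterable: handle(item)
    it = iter(iterable)          # obtain an iterator (a generator already is one)
    while True:
        try:
            item = next(it)      # ask the iterator/generator for the next value
        except StopIteration:    # the producer signals that it is exhausted
            break
        handle(item)

For example, for_in(iter_file_upper('data.txt'), print) behaves like the for loop shown earlier.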

Task Context

In the classic threading model, each thread has an independent execution flow and can only perform one task. If a program needs to handle multiple tasks at the same time, it can use multi-process or multi-thread techniques. For example, if a server needs to serve multiple client connections at the same time, it can create a separate thread for each connection.

Whether with threads or processes, switching brings significant overhead: user-mode/kernel-mode transitions, saving and restoring the execution context, CPU cache flushes, and so on. Using threads or processes to drive the execution of small tasks is therefore clearly not an ideal choice.

So, is there any other solution besides threads and processes? Before we begin the discussion, let us first summarize the key points of implementing a multi-task execution system.

If a program wants to handle multiple tasks at the same time, it must provide a mechanism that can record the progress of task execution. In the classic threading model, this mechanism is provided by the CPU:

(Figure: program memory layout and CPU registers)

As shown in the figure above, the program's memory space is divided into segments such as code, data, heap and stack. The CS register in the CPU points to the code segment and the SS register points to the stack segment. While a program task (thread) is executing, the IP register points to the instruction currently being executed in the code segment, the BP register points to the base of the current stack frame, and the SP register points to the top of the stack.

With the IP register, the CPU can fetch the next instruction to execute; with the BP register, the CPU can return to the caller and continue execution when a function call ends. The CPU registers and the memory address space together therefore form the task's execution context and record the task's progress. When tasks are switched, the operating system first saves the CPU's current registers to memory, and then restores the registers of the task about to run.

At this point we have some inspiration: doesn't the generator also remember its own execution progress? Could generators, then, be used to implement task execution flows? Since a generator runs entirely in user mode, the cost of switching between generators is much smaller than that of threads or processes, making generators an ideal way to organize micro-tasks.

Now let's use a generator to write a toy coroutine and get a feel for how coroutines work:

def co_process(data):
    print('task with data {} started'.format(data))

    yield  # give up execution; the scheduler decides when to resume
    print('step one for data {} finished'.format(data))

    yield
    print('step two for data {} finished'.format(data))

    yield
    print('step three for data {} finished'.format(data))

The coroutine co_process processes the given data in 3 steps. Between steps a task switch may occur, that is, the coroutine gives up its right to execute through yield.

Next, we create 3 coroutines to process 3 different pieces of data:

>>> t1 = co_process('a')
>>> t2 = co_process('b')
>>> t3 = co_process('c')

At this point the coroutines have been initialized but have not started executing yet. We need to call the next function to activate them one by one:

>>> next(t1)
task with data a started
>>> next(t2)
task with data b started
>>> next(t3)
task with data c started

Once activated, a coroutine starts executing and printing its messages until it reaches the first yield statement. yield gives up the coroutine's right to execute, and control returns to our hands. In real projects, coroutines generally give up execution through yield while waiting for IO.
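
As a rough illustration of that last point (not part of this example), a network coroutine might yield the socket it is waiting on, so that a scheduler knows to resume it only once the socket becomes readable. The names and the convention of yielding the awaited object are assumptions here, not the API of any particular library:

def co_echo(conn):
    # Hypothetical sketch: conn is an already-connected socket object.
    while True:
        yield conn                # give up execution while waiting for IO on conn
        data = conn.recv(1024)    # resumed by the scheduler once conn is readable
        if not data:              # the client closed the connection
            break
        conn.sendall(data)        # echo the data back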

Note that in this example we play the role of the scheduler and decide how the coroutines run. Let the first coroutine execute two steps:

>>> next(t1)
step one for data a finished
>>> next(t1)
step two for data a finished

Then let the second and third coroutines execute two steps alternately:

>>> next(t2)
step one for data b finished
>>> next(t3)
step one for data c finished
>>> next(t2)
step two for data b finished
>>> next(t3)
step two for data c finished

Please note that in real projects the scheduler role is usually played by an event loop, rather than by us by hand (a minimal sketch of such a scheduler appears at the end of this section). In the section “Teaching you step-by-step on how to design a coroutine library”, we will discuss ideas for implementing event loops.

We continue scheduling the coroutine tasks. When a coroutine finishes executing, the next call raises a StopIteration exception to notify us:

>>> next(t1)
step three for data a finished
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> next(t3)
step three for data c finished
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> next(t2)
step three for data b finished
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
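
Calling next on each coroutine by hand quickly becomes tedious. As a taste of the scheduler mentioned above, here is a minimal round-robin sketch (a real event loop would also wait for IO events; this version simply keeps rotating the ready tasks):

from collections import deque

def run_all(*tasks):
    # Minimal round-robin scheduler sketch: resume each coroutine in turn
    # until every one of them has raised StopIteration.
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)           # resume the coroutine until its next yield
        except StopIteration:
            continue             # this task has finished; drop it
        ready.append(task)       # not finished yet; schedule it again

Called as run_all(co_process('a'), co_process('b'), co_process('c')), it interleaves the three tasks much like the manual scheduling above.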
