A “flyweight” C language coroutine library

A coroutine, as the name implies, is a "co-operative routine". Unlike threads, which are an operating-system concept, coroutines are a programming technique: they achieve logical multitasking in user space purely through the syntax and semantics of a programming language. In fact, the concept of the coroutine predates the thread. According to Knuth, "a subroutine is a special case of a coroutine." A subroutine is an ordinary function call, so a coroutine is, in effect, a function-like program component; you can easily create hundreds of thousands of coroutines in a single thread, just as you can make hundreds of thousands of function calls. The difference is that a subroutine has exactly one entry point and is finished once it returns, whereas a coroutine may enter either at its start or at the point right after its previous return, so execution can continue from where it last left off. In other words, coroutines hand execution rights to each other via yield, calling each other symmetrically, as peers, rather than in the caller/callee hierarchy of ordinary routines. Of course, Knuth's "special case" also means coroutines can simulate that hierarchical calling relationship; such coroutines are called asymmetric coroutines.

Based on the event-driven model

Let's look at a symmetric coroutine calling scenario through the most familiar example: the "producer-consumer" event-driven model. One coroutine is responsible for producing items and adding them to a queue; the other is responsible for taking items off the queue and consuming them. For efficiency, each side wants to add or remove several items at a time. The pseudocode could look like this:

# producer coroutine
loop
  while queue is not full
    create some new items
    add the items to queue
  yield to consumer

# consumer coroutine
loop
  while queue is not empty
    remove some items from queue
    use the items
  yield to producer

Most textbooks use this model as an example of multithreading. In fact, multithreading is a bit "heavyweight" here: lacking yield semantics, threads have to rely on synchronization mechanisms to avoid conflicts over the shared queue, which inevitably incurs system overhead in the form of sleeping, scheduling, and context switching, and thread scheduling also introduces timing non-determinism. For coroutines, "suspending" simply means handing the right of execution to another coroutine; when the other coroutine reaches a stopping point, it calls back, and the first one "wakes up" at the point where it was suspended. Calls between coroutines are logically controlled and deterministic in their ordering; everything, you might say, is under control.

Languages with built-in coroutine semantics today range from the heavyweight C#, Erlang, and Go, to the lighter Python, Lua, JavaScript, and Ruby, to functional languages such as Scala and Scheme. C, by contrast, finds itself in an awkward position, because C routine calls depend on the stack frame: a routine's local state and return value live on the stack, which means a producer and a consumer cannot call each other as peers. You could, of course, rewrite the code to make the producer the main routine and have it call the consumer routine with the produced items as parameters, but such code is thankless to write and uncomfortable to read, and it becomes hopelessly rigid once the number of coroutines reaches the hundreds of thousands.

This leads to the idea behind coroutines. If each coroutine's context (its program counter and so on) is saved somewhere other than the stack, then when coroutines call each other, the callee only needs to restore the context it saved at its last transfer point. This is rather like the CPU's own context switching. Unfortunately, it would seem that only lower-level assembly language can do this.

Is multithreading really C's only option? Fortunately, the C standard library and glibc give us two primitives that can serve as a basis for coroutine switching: setjmp/longjmp, and the ucontext component. Both implement the context switch internally (in assembly, naturally). Of the two, setjmp/longjmp brings considerable uncertainty in practice (it is hard to encapsulate cleanly, for example; see its documentation for details), so ucontext is the more widely used, and most of the C coroutine libraries you find online are built on it.
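To make the ucontext approach concrete, here is a minimal sketch of symmetric switching between main and one coroutine using getcontext/makecontext/swapcontext. This is only an illustration of the glibc API, not code from any particular coroutine library; note also that the ucontext functions were removed from POSIX.1-2008 and are deprecated on some platforms.

#include <stdio.h>
#include <ucontext.h>

static ucontext_t ctx_main, ctx_co;
static char co_stack[64 * 1024];       /* each ucontext coroutine needs its own stack */

static void co_func(void)
{
  printf("coroutine: first run\n");
  swapcontext(&ctx_co, &ctx_main);     /* "yield" back to main */
  printf("coroutine: resumed\n");
}                                      /* on return, control goes to uc_link */

int main(void)
{
  getcontext(&ctx_co);                 /* must initialize before makecontext */
  ctx_co.uc_stack.ss_sp = co_stack;
  ctx_co.uc_stack.ss_size = sizeof(co_stack);
  ctx_co.uc_link = &ctx_main;          /* where to resume when co_func returns */
  makecontext(&ctx_co, co_func, 0);

  swapcontext(&ctx_main, &ctx_co);     /* run the coroutine until it yields */
  printf("main: back in main\n");
  swapcontext(&ctx_main, &ctx_co);     /* resume it until it finishes */
  printf("main: coroutine finished\n");
  return 0;
}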

“Flyweight” coroutine library

Here I want to introduce protothreads, a "flyweight" open-source C coroutine library written entirely in ANSI C. It is called "flyweight" because the implementation could hardly be slimmed down any further; it is almost at the level of raw primitives. The whole of protothreads needs no linking or loading, because all of its source code lives in header files, rather like the STL; it depends on no third-party libraries and is portable to any platform. There are only 5 header files in total, with fewer than 100 lines of effective code; the entire API is defined as macros, so there is no call overhead; and finally, the space overhead of each coroutine is 2 bytes (yes, you read that right: the whole "stack" is one short). Of course, this minimalism comes at the cost of restrictions in use, as the analysis below will show.

First, a word about the author of protothreads, Adam Dunkels, a computer whiz from the Royal Institute of Technology in Sweden. He is an interesting character: he has written a number of lightweight pieces of software, all released under the BSD license, and few authors of lightweight open-source software are as well known. The embedded network operating system Contiki and the widely used TCP/IP stacks uIP and lwIP are also his work. All of this software has survived years of enterprise-level use, so its quality speaks for itself.

Many people will wonder how such "flyweight" code can possibly be implemented. Before analyzing the protothreads source, let me first give a little C lesson ;-) In short, it exploits a "marvelous trick" of the C language, one that even veteran C programmers with ten-plus years of experience may never have come across. I should state up front that I do not recommend using it casually: it comes at the cost of breaking the language's coding conventions, and in serious projects you should treat it with caution, unless you want to get fired.

The "yield semantics" of the C language

The following tutorial comes from a blog post, Coroutines in C, by Simon Tatham, an ARM engineer and gifted hacker (author of the open-source Telnet/SSH client PuTTY and of the assembler NASM; as an aside, PuTTY's source has a reputation as some of the hardest C to hack among well-known projects, so you can probably guess which language its author cut his teeth on).

We know that Python's yield behaves like an iterator generator: the function retains the state of its last call and, on the next call, resumes execution from the point of the last return. Written in C, it would look like this:

int function(void) {
  int i;
  for (i = 0; i < 10; i++)
    return i; /* won't work, but wouldn't it be nice */
}

Called 10 times in a row, it should return 0 through 9. How can this be achieved? We can use goto: by adding a state variable to the function, we can do it like this:

int function(void) {
  static int i, state = 0;
  switch (state) {
    case 0: goto LABEL0;
    case 1: goto LABEL1;
  }
  LABEL0: /* start of function */
  for (i = 0; i < 10; i++) {
    state = 1; /* so we will come back to LABEL1 */
    return i;
    LABEL1:; /* resume control straight after the return */
  }
}

This approach works. We add a label at every point where we need to yield: one at the start of the function, and one after each return statement. Each label is numbered, and the number is kept in a state variable, so it tells us which label to jump to on the next call. Before every return we update the state variable to point at the right label; then, no matter how many times the function is called, the switch on the state variable finds the place we want to jump to.

But it’s still very ugly. The worst part is that all the labels need to be maintained manually, and the labels in the function must be consistent with the labels in the opening switch statement. Every time you add a return statement, you must think of a new label name and add it to the switch statement; every time you delete a return statement, you must also delete the corresponding label. This doubles the effort of maintaining the code.

If you think about it, we don't actually need the switch to decide where to goto: we can use the switch statement itself to perform the jump:

int function(void) {
  static int i, state = 0;
  switch (state) {
    case 0: /* start of function */
    for (i = 0; i < 10; i++) {
      state = 1; /* so we will come back to "case 1" */
      return i;
      case 1:; /* resume control straight after the return */
    }
  }
}

Cool! Who would have thought switch-case could be used this way? When you get down to it, C grew out of assembly language, and switch-case, like if-else, is just another rendering of the assembler's conditional-jump instructions (which also indirectly explains why assembly programmers often deride C code). We can go further and use the __LINE__ macro to make it more general:

int function(void) {
  static int i, state = 0;
  switch (state) {
    case 0: /* start of function */
    for (i = 0; i < 10; i++) {
      state = __LINE__ + 2; /* so we will come back to "case __LINE__" */
      return i;
      case __LINE__:; /* resume control straight after the return */
    }
  }
}

In this way, we can use macros to extract a paradigm and encapsulate it into components:

#define Begin() static int state=0; switch(state) { case 0:
#define Yield(x) do { state=__LINE__; return x; case __LINE__:; } while (0)
#define End() }
int function(void) {
  static int i;
  Begin();
  for (i = 0; i < 10; i++)
    Yield(i);
  End();
}

So, does it feel as if you have invented a whole new language? In effect, we exploited the branching behavior of switch-case together with the preprocessor's __LINE__ macro to build an implicit state machine, and thereby obtained "yield semantics".
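For completeness, here is how the generator above might be driven. This is a self-contained sketch: the macros are repeated so it compiles on its own, and the -1 end-of-sequence marker is my own addition, not part of the original article.

#include <stdio.h>

#define Begin()  static int state = 0; switch (state) { case 0:
#define Yield(x) do { state = __LINE__; return x; case __LINE__:; } while (0)
#define End()    }

static int function(void)
{
  static int i;
  Begin();
  for (i = 0; i < 10; i++)
    Yield(i);
  End();
  return -1;                      /* assumed end-of-sequence marker */
}

int main(void)
{
  int v;
  while ((v = function()) >= 0)   /* each call resumes right after the last Yield */
    printf("%d\n", v);
  return 0;
}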

The other question is: when you happily apply this little-known trick in your project and go to your boss to claim credit, what will he think of your code? The braces in your macro definitions don't pair up, stray case labels sit inside other code blocks, and Begin and Yield are incomplete fragments patched together... you are the company's textbook example of violating the coding standards!

Don't worry: in the original article the great Simon Tatham has already prepared a firm rebuttal for you, one that I consider golden advice for programmers.

It would be wrong to apply the usual coding conventions here. The sample code in the article is short and simple, and it would still be intelligible if rewritten as an explicit state machine; but as the code grows longer, the rewrite becomes harder and harder, and the loss of intuitiveness it causes grows accordingly.

Think about a function that contains a small block of code like this:

case STATE1:
/* perform some activity */
if (condition) state = STATE2; else state = STATE3;

To someone looking at the code, this is not much different from a function containing the following small block of code:

LABEL1:
/* perform some activity */
if (condition) goto LABEL2; else goto LABEL3;

Yes, the two functions look structurally identical, and in both cases that visible structure obscures the algorithm the function actually implements. The same people who would fire you for using coroutine macros would also yell at you for writing a function built out of little code fragments stitched together with goto statements, and this time they would not be wronging you, because a function designed that way really does wreck the structure of the algorithm.

The goal of coding standards is to make code clear. By hiding important things such as the switch, return, and case statements inside "camouflage" macros, you could be said to have obscured the program's syntactic structure and, from the standards' point of view, violated the requirement of code clarity. But we did it precisely to highlight the program's algorithmic structure, and the algorithmic structure is exactly what a reader of the code most wants to understand.

Any coding standard that insists on sacrificing algorithmic clarity for syntactic clarity should be rewritten. If your boss fires you for using this technique, keep telling him that as the security guards drag you out.

The author of the original article ends by providing a coroutine.h header under the MIT license. It is worth noting, as the article itself points out, that this style of coroutine has one limitation: the scheduling state is kept in static variables rather than in locals on the stack (indeed, locals cannot be used to hold it), which makes the code non-reentrant and not thread-safe. The author later adds a further technique: wrap the would-be locals into a context structure whose pointer is passed in as a function parameter, and use dynamically allocated heap memory to "simulate" the stack, which solves the reentrancy problem. But this hurts the clarity of the code. For one thing, every local variable has to be written as a member access through the context object, which is tiresome when there are many of them; for another, the malloc/free bookkeeping hidden in the macros is complex and easy to get wrong, which again raises the risk of getting fired (except that this time you would deserve it).

Personally, I think that since coroutines are by nature a single-threaded solution, we might as well assume the environment is single-threaded and reentrancy is not an issue, and boldly use static variables to keep the code simple and readable. We should not be using such a simple coroutine mechanism in a multithreaded environment in the first place; if we must, the glibc ucontext component mentioned earlier is a workable alternative, since its context gives each coroutine a private stack. Even then, cross-thread use is not without restrictions; read the documentation carefully.

Context of Protothreads

With Simon Tatham's lesson behind us, we can now hack the source. Let's start with the data structure at the heart of protothreads: the coroutine's context structure, which holds its state. I believe you will soon see why its "stack" is only 2 bytes:

struct pt {
  lc_t lc;
};
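For reference, in the switch-case based implementation (lc-switch.h), lc_t is just an unsigned short; roughly this (quoted from memory, check the header for the exact wording):

typedef unsigned short lc_t;   /* the entire per-coroutine "stack": 2 bytes */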

Inside there is just a single short-sized variable, which holds the program counter of the last transfer point. This also shows one way coroutines are more flexible than threads: a coroutine can be stackless. If the job to be done is very simple, say event notification in a producer-consumer model, then the only state a coroutine really needs to save is a program counter. Python generators are likewise stackless, although an iterator generator may also need to keep the previous iteration value; the earlier C example kept it in a static variable, and you can equally add member variables to the context structure. If you genuinely cannot bound how much state your coroutines need to save, you are better off with ucontext, whose context provides a stack and a signal mask, with the caller responsible for allocating the resources (see the documentation for details):

typedef struct ucontext_t {
  struct ucontext_t *uc_link;
  sigset_t uc_sigmask;
  stack_t uc_stack;
  ...
} ucontext_t;

Primitives and components of Protothreads

Back from that digression, let's return to protothreads and look at the coroutine "primitives" it provides. There are two implementations. Under ANSI C, it is the classic switch-case trick:

#define LC_INIT(s) s = 0; // the trailing semicolon really is in the source -- a small wart, aha~
#define LC_RESUME(s) switch (s) { case 0:
#define LC_SET(s) s = __LINE__; case __LINE__:
#define LC_END(s) }

But this "primitive" has a subtle flaw: in the code between LC_RESUME and LC_END (or between the components built on them), you must not use switch-case statements yourself, because the case labels generated by the macros would then bind to your inner switch, and the outer switch would jump to the wrong place. For this reason protothreads also ships a set of scheduling "primitives" based on GNU C. GNU C has a piece of syntactic sugar called "labels as values": prefixing a label with && (not the address-of operator applied twice, but a GNU-specific operator) yields the label's address, which can be stored in a void pointer and later jumped to with a computed goto:

typedef void * lc_t;
#define LC_INIT(s) s = NULL
#define LC_RESUME(s) \
  do { \
    if (s != NULL) { \
      goto *s; \
    } \
  } while (0)
#define LC_CONCAT2(s1, s2) s1##s2
#define LC_CONCAT(s1, s2) LC_CONCAT2(s1, s2)
#define LC_SET(s) \
  do { \
    LC_CONCAT(LC_LABEL, __LINE__): \
    (s) = &&LC_CONCAT(LC_LABEL, __LINE__); \
  } while (0)
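If you have never met "labels as values" before, this tiny standalone program (GCC/Clang only, not ISO C, and not part of protothreads) shows the whole mechanism that LC_RESUME and LC_SET rely on:

#include <stdio.h>

int main(void)
{
  void *resume = NULL;

  if (resume != NULL)
    goto *resume;             /* computed goto through a saved label address */

  printf("first pass\n");
  resume = &&done;            /* && yields the address of the label "done" */
  goto *resume;

done:
  printf("jumped via label pointer\n");
  return 0;
}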

Okay, with the background above, understanding these "primitives" is a piece of cake. Now let's see how the "components", i.e. the protothreads API, are built on top of them. First, four exit codes that act as the return values of the coroutine scheduling state machine:

#define PT_WAITING 0
#define PT_YIELDED 1
#define PT_EXITED 2
#define PT_ENDED 3

The following APIs can be called directly from the application:

/* Initialize a coroutine, that is, initialize state variables */
#define PT_INIT(pt) LC_INIT((pt)->lc)

/* Declare a function returning char, the exit code; it marks the function body as containing a protothread (personally I find this wrapper a bit unnecessary) */
#define PT_THREAD(name_args) char name_args

/* Coroutine entry point. PT_YIELD_FLAG is set to 1 on every entry (0 means a yield is in progress); the LC_RESUME that follows jumps back to the last transfer point on the next call */
#define PT_BEGIN(pt) { char PT_YIELD_FLAG = 1; LC_RESUME((pt)->lc)

/* Coroutine exit point: the coroutine terminates here, and its context and flag are reset */
#define PT_END(pt) LC_END((pt)->lc); PT_YIELD_FLAG = 0; \
                   PT_INIT(pt); return PT_ENDED; }

/* Coroutine transfer point. On the first pass PT_YIELD_FLAG is 0, so the macro returns PT_YIELDED; on the next call control re-enters just after LC_SET with PT_YIELD_FLAG == 1, so execution continues past the yield */
#define PT_YIELD(pt) \
  do { \
    PT_YIELD_FLAG = 0; \
    LC_SET((pt)->lc); \
    if(PT_YIELD_FLAG == 0) { \
      return PT_YIELDED; \
    } \
  } while(0)

/* Yield until an additional condition also holds */
#define PT_YIELD_UNTIL(pt, cond) \
  do { \
    PT_YIELD_FLAG = 0; \
    LC_SET((pt)->lc); \
    if((PT_YIELD_FLAG == 0) || !(cond)) { \
      return PT_YIELDED; \
    } \
  } while(0)

/* Coroutine blocking point: essentially PT_YIELD_UNTIL except that the exit code is PT_WAITING; used to emulate semaphore-style synchronization */
#define PT_WAIT_UNTIL(pt, condition) \
  do { \
    LC_SET((pt)->lc); \
    if(!(condition)) { \
      return PT_WAITING; \
    } \
  } while(0)

/* Same as PT_WAIT_UNTIL with the condition inverted */
#define PT_WAIT_WHILE(pt, cond) PT_WAIT_UNTIL((pt), !(cond))

/* Coroutine scheduling: invoke coroutine f and check its exit code; evaluates to 1 while f is still running, and to 0 once it has exited or ended */
#define PT_SCHEDULE(f) ((f) < PT_EXITED)

/* For asymmetric coroutines: the caller is the main coroutine, pt is the caller's own context handle, and thread is the child coroutine invocation (there can be several); the caller blocks itself, scheduling the child until it terminates */
#define PT_WAIT_THREAD(pt, thread) PT_WAIT_WHILE((pt), PT_SCHEDULE(thread))

/* Nested coroutine scheduling: child is the child coroutine's context handle, initialized before the parent waits on it */
#define PT_SPAWN(pt, child, thread) \
  do { \
    PT_INIT((child)); \
    PT_WAIT_THREAD((pt), (thread)); \
  } while(0)
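Before moving on, here is a minimal sketch of how these macros fit together: a single protothread that yields after every step, driven to completion by PT_SCHEDULE. It assumes pt.h is on the include path; the rest is my own illustration rather than code from the protothreads package.

#include <stdio.h>
#include "pt.h"

static struct pt pt1;

static PT_THREAD(counter(struct pt *pt))
{
  static int i;                    /* static, because protothreads are stackless */
  PT_BEGIN(pt);
  for (i = 0; i < 5; i++) {
    printf("count = %d\n", i);
    PT_YIELD(pt);                  /* hand control back to the caller */
  }
  PT_END(pt);
}

int main(void)
{
  PT_INIT(&pt1);
  while (PT_SCHEDULE(counter(&pt1))) {
    /* other work could be interleaved here */
  }
  return 0;
}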

That's all for the core API. Users can also extend the components to suit their own needs, for example by implementing semaphores; you will find that, without an operating system underneath, a semaphore becomes astonishingly simple:

struct pt_sem {
  unsigned int count;
};

#define PT_SEM_INIT(s, c) (s)->count = c

#define PT_SEM_WAIT(pt, s) \
  do { \
    PT_WAIT_UNTIL(pt, (s)->count > 0); \
    --(s)->count; \
  } while(0)

#define PT_SEM_SIGNAL(pt, s) ++(s)->count

I hardly need to explain the above. Now let's go back to the producer-consumer model from the beginning and see how protothreads performs.

#include "pt-sem.h"

#define NUM_ITEMS 32
#define BUFSIZE 8

static struct pt_sem mutex, full, empty;

PT_THREAD(producer(struct pt *pt))
{
  static int produced;

  PT_BEGIN(pt);
  for (produced = 0; produced < NUM_ITEMS; ++produced) {
    PT_SEM_WAIT(pt, &full);
    PT_SEM_WAIT(pt, &mutex);
    add_to_buffer(produce_item());
    PT_SEM_SIGNAL(pt, &mutex);
    PT_SEM_SIGNAL(pt, &empty);
  }
  PT_END(pt);
}

PT_THREAD(consumer(struct pt *pt))
{
  static int consumed;

  PT_BEGIN(pt);
  for (consumed = 0; consumed < NUM_ITEMS; ++consumed) {
    PT_SEM_WAIT(pt, &empty);
    PT_SEM_WAIT(pt, &mutex);
    consume_item(get_from_buffer());
    PT_SEM_SIGNAL(pt, &mutex);
    PT_SEM_SIGNAL(pt, &full);
  }
  PT_END(pt);
}

PT_THREAD(driver_thread(struct pt *pt))
{
  static struct pt pt_producer, pt_consumer;

  PT_BEGIN(pt);
  PT_SEM_INIT(&empty, 0);
  PT_SEM_INIT(&full, BUFSIZE);
  PT_SEM_INIT(&mutex, 1);
  PT_INIT(&pt_producer);
  PT_INIT(&pt_consumer);
  PT_WAIT_THREAD(pt, producer(&pt_producer) & consumer(&pt_consumer));
  PT_END(pt);
}

example-buffer.c in the source package contains a complete runnable example, so I won't paste it all here. The overall structure is asymmetric: a main coroutine, driver_thread, drives two child coroutines, producer and consumer. The code speaks for itself: it is clear and intuitive, and a simple event-processing requirement is satisfied entirely within a single thread. You could add hundreds of thousands of such coroutines with almost no additional system overhead or resource usage. The one thing to watch is that there are no local variables, because protothreads are stackless; but this is not really a problem. First, we have already assumed a single-threaded environment, and second, a requirement this simple does not use many "locals" anyway. If you do need to carry extra state across a yield, as with an iterator generator, you can extend the coroutine's context structure yourself, as long as the number and size of the state variables are fixed and under control.
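For orientation, the driver loop in main() looks roughly like this (a sketch of the idea, not the verbatim code from example-buffer.c):

int main(void)
{
  static struct pt driver_pt;

  PT_INIT(&driver_pt);
  /* keep scheduling the main coroutine until it has run to completion */
  while (PT_SCHEDULE(driver_thread(&driver_pt))) {
    /* on a multitasking host you might sleep briefly here to yield the CPU */
  }
  return 0;
}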

Of course, this does not mean protothreads is all-powerful; it merely contributes a model, and if you want to use it you must first learn to live with it. Here are its main usage restrictions:

  • Since the coroutines are stackless, avoid local variables unless the variable is irrelevant to the coroutine's state. For the same reason, the code containing the coroutine is not reentrant.
  • If a coroutine uses components built on the switch-case primitives, switch-case statements must not be used in the application code itself, unless you switch to the GNU C label-pointer implementation.
  • A coroutine may call ordinary routines such as library functions or system calls, but they must be non-blocking, otherwise every coroutine in the thread blocks along with it. The thread is, after all, the smallest unit of execution; coroutines merely take turns on its "time slice". A small sketch of this point follows the list.
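As an illustration of that last point, here is how a wait is usually expressed so that it does not block the thread: poll a condition with PT_WAIT_UNTIL instead of calling something like sleep(). The sketch below uses only time() and assumes pt.h is on the include path; it is my own illustration, not code from the library.

#include <time.h>
#include "pt.h"

static time_t deadline;

static PT_THREAD(waiter(struct pt *pt))
{
  PT_BEGIN(pt);
  deadline = time(NULL) + 1;                  /* "sleep" for about one second */
  PT_WAIT_UNTIL(pt, time(NULL) >= deadline);  /* yields instead of blocking */
  /* ... the work that had to wait goes here ... */
  PT_END(pt);
}

It would be driven by a scheduling loop just like the counter example earlier.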

There are more examples on the official website, all of them quite practical. In addition, an engineer named Craig Graham extended pt.h so that protothreads supports operations such as sleep/wake/kill; the file is graham-pt.h.

Coroutine library DIY guide

Having read this far, do you feel like building a coroutine component of your own? Even though many dynamic languages support coroutine semantics natively, plenty of C programmers still prefer to roll their own. Most of the open-source implementations you will find online use glibc's ucontext underneath; a coroutine component that provides a stack is, after all, more general and more convenient. Write yourself a scheduler, simulate a thread context on top, and then... you have a cross-platform "COS" of your own (laughs). That is essentially how the GNU Pth thread library is implemented; its author, the German Ralf S. Engelschall (another open-source veteran, with OpenSSL among his many other works), wrote a paper teaching exactly how to build such a thread library. The protothreads website also has a good list of recommended reading. Have fun!
