I discussed a stackful coroutine implementation to coordinate CUDA streams last year.
That was an implementation based on
APIs. Increasingly, as I thought about porting nnc over to WASM, this became problematic because these APIs are more or less deprecated. Popular libc implementations such as musl don't implement these methods.
After the article, it became obvious that I cannot
into the internal CUDA thread (that thread cannot launch any kernels). Thus, the real benefit of such a stackful coroutine is convenience: writing a coroutine that way is no different from writing a normal C function.
This is the moment where C++ makes sense. The coroutine proposal in C++20 is a much better fit. The extra bits of compiler support make coroutines much easier to write.
If we don't use
, the natural choice is either
or good-old Duff's device. It is a no-brainer to me that I will come back to Duff's device. It is simple enough and the most platform-agnostic way.
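For readers who have not seen the trick before, here is a minimal sketch of a Duff's device coroutine in plain C. The names `counter_t` and `counter_next` are made up for illustration and come from no library. The `switch` jumps back into the middle of the loop on every call after the first, which is the whole mechanism.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; not the API of any library. */
typedef struct {
    int line; /* Resume point: 0 means "start from the top". */
    int i;    /* Loop counter, kept here because locals do not survive a return. */
} counter_t;

/* Yields 0, 1, ..., limit - 1 across successive calls, then returns -1. */
static int counter_next(counter_t* const self, const int limit)
{
    assert(self != NULL);
    switch (self->line) {
        case 0: /* The first call starts here. */
            for (self->i = 0; self->i < limit; self->i++) {
                self->line = 1; /* Remember where to resume. */
                return self->i; /* "Yield" by returning to the caller. */
        case 1:; /* Later calls jump straight back into the loop body. */
            }
    }
    self->line = 2; /* Finished: no case matches 2, so the switch is skipped. */
    return -1;
}
```

Note that the loop counter lives in the struct rather than on the stack. A plain `int i` inside the function would be reinitialized on every call, because each "yield" is an ordinary `return`.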
There are many existing stackless coroutines implemented in C. The most interesting one with Duff's device is Protothreads. To me, the problem with Protothreads is its inability to maintain local variables. Yes, you can allocate additional state by passing in additional parameters. But it can quickly become an exercise in drifting away from a simple stackless coroutine to one with all the bells and whistles of structs for parameters and variables. You can declare everything as
. But that is certainly not going to work for anything other than the most trivial examples.
I spent this weekend sharpening my C-macro skills to figure out how to write the most natural stackless coroutine in C. The implementation preserves local variables. You can declare the parameters and return values almost as naturally as you would for normal functions.
Here is an example of how you can write a function-like stackless coroutine in C:
will declare the interface and the implementation. You can also separate the interface into a header file with
and implementation into
. In this case,
keyword continues to work and scopes the coroutine to file-level visibility. Take a look at this:
The first parameter is the return type, followed by the function name and the parameters. It all feels very natural for a C function. Local variables have to be declared within the
block; that's the only catch.
To access parameters and local variables, you have to use the
macro to wrap the access; otherwise it is the same.
Of course, there are a few more catches:
There is no magic really, just some ugly macros that hide away the complexity of allocating parameters and local variables on the heap.
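To make the idea concrete, here is a rough sketch of what such macros could boil down to once expanded: a heap-allocated frame that holds the resume point, the parameters, and the locals. Everything here (`sum_frame_t`, `sum_new`, `sum_resume`, `VAR`, `ARG`) is a hypothetical name for illustration. It is not the API in the repo, and the real macros are more involved.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical frame layout, for illustration only. */
typedef struct {
    int line;                          /* Resume point for the switch. */
    struct { int n; } params;          /* Parameters, copied in once at creation. */
    struct { int i; int sum; } locals; /* "Local" variables that survive a yield. */
} sum_frame_t;

/* Hypothetical accessors: all state lives in the frame, not on the stack. */
#define VAR(frame, name) ((frame)->locals.name)
#define ARG(frame, name) ((frame)->params.name)

/* Allocates a zeroed frame on the heap; returns NULL if allocation fails. */
static sum_frame_t* sum_new(const int n)
{
    sum_frame_t* const frame = calloc(1, sizeof(sum_frame_t));
    if (frame)
        frame->params.n = n;
    return frame;
}

/* Stores the running total 1 + 2 + ... in *out and returns 1 on each yield;
 * returns 0 once the coroutine has finished. */
static int sum_resume(sum_frame_t* const frame, int* const out)
{
    assert(frame != NULL && out != NULL);
    switch (frame->line) {
        case 0:
            VAR(frame, sum) = 0;
            for (VAR(frame, i) = 1; VAR(frame, i) <= ARG(frame, n); VAR(frame, i)++) {
                VAR(frame, sum) += VAR(frame, i);
                *out = VAR(frame, sum);
                frame->line = 1;
                return 1; /* Yield the running total. */
        case 1:;          /* Resume here on the next call. */
            }
    }
    frame->line = 2; /* Done; later calls skip the switch entirely. */
    return 0;
}
```

The caller owns the frame in this sketch and releases it with `free` once `sum_resume` returns 0.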
There are examples in the repo that show the usage of
in various forms. You can check out more there: https://github.com/liuliu/co
Currently, I have a single-threaded scheduler. However, it is not hard to switch that to a multi-threaded scheduler, with the catch that you can no longer maintain the dependencies as a linked list and need a tree instead.
It is a weekend exercise, and I don't expect to maintain this repo going forward. Some form of this will be ported into nnc.
can support much more complex interactions between functions, to the point that an extra scheduler object is not needed. For what it's worth, Protothreads also doesn't have a central scheduler. But in practice, I found it still miles easier to have a scheduler like the one libtask has. Tracking and debugging are much easier with a central scheduler, especially if you want to make it multi-thread safe as well.
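As a rough illustration of what a central scheduler means here, below is a minimal single-threaded round-robin sketch. It is neither libtask's API nor the scheduler in the repo. The names `task_t`, `scheduler_t`, `scheduler_push`, `scheduler_run`, and `countdown_resume` are all made up, and the task body is a trivial countdown instead of a real coroutine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only. */
typedef struct task_s task_t;
struct task_s {
    int (*resume)(task_t* self); /* Returns nonzero while the task has more work. */
    task_t* next;                /* Intrusive link in the run queue. */
    int remaining;               /* State for the toy countdown task below. */
};

typedef struct {
    task_t* head; /* Next task to run. */
    task_t* tail; /* Last task in the queue. */
} scheduler_t;

/* Appends a task to the end of the run queue. */
static void scheduler_push(scheduler_t* const s, task_t* const t)
{
    t->next = NULL;
    if (s->tail)
        s->tail->next = t;
    else
        s->head = t;
    s->tail = t;
}

/* Resumes tasks in FIFO order until the queue is empty.
 * Returns the total number of resumes, which is handy for tracing. */
static int scheduler_run(scheduler_t* const s)
{
    int resumes = 0;
    while (s->head) {
        task_t* const t = s->head;
        s->head = t->next;
        if (!s->head)
            s->tail = NULL;
        resumes++;
        if (t->resume(t)) /* Unfinished tasks go back to the end of the queue. */
            scheduler_push(s, t);
    }
    return resumes;
}

/* Toy task: needs `remaining` resumes before it is finished. */
static int countdown_resume(task_t* const self)
{
    assert(self != NULL);
    return --self->remaining > 0;
}
```

Because every resume goes through `scheduler_run`, there is a single place to add logging or a breakpoint, which is the debugging convenience I mean above.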