What is the basis for Mojo's `async`?

As per the title, I'm wondering what the underlying mechanism for async libraries will be. In the community meeting, it was mentioned that async and coroutines would be areas of focus in Mojo's design for the next few months, so I think this is the right time to discuss their design. In C++, the underlying mechanism on which async libraries were built was initially chosen to be coroutines, but there's work going on right now to introduce a better alternative:
In a suite of generic async algorithms that are expected to be callable from hot code paths, the extra allocations and indirections are a deal-breaker. It is for these reasons that we consider coroutines a poor choice for a basis of all standard async.
Source Can Mojo avoid these disadvantages or is there work to be done here?
15 Replies
Dmitry Salin
Dmitry Salin4mo ago
Current Rust async code generation can also be far from optimal. It can avoid allocations, but leads to multiple levels of indirection. https://github.com/newpavlov/fsm-bench
Melody Daniel
Melody Daniel4mo ago
Mojo async is based on coroutines. The recent work is on a new coroutine dialect created by modular, we don't have all the details yet.
Darkmatter
Darkmatter4mo ago
C++ wants you to be able to dynamically link async functions, which is absolute insanity in my opinion and is what is producing a lot of that overhead. Rust is much better, although it does produce very large binaries due to creating large static state machines. There are some inefficiences in Rust due to LLVM limitations and those limitations are why Zig is leaving LLVM. If anyone has the capabilties to go fix LLVM, it would be the Mojo team. I can beat Rust's async by hand for heterogeneous state machines right now, but the method I use would be interesting to try to implement in a compiler. I essentally pack the state machines together and have fully disjoint state machines contained in the same coroutine implementation. So, I would have a network state machine and a disk state machine represented by a single tagged union, with each variant having a function pointer that will give a next state pointer.
Dmitry Salin
Dmitry Salin4mo ago
It seems that a function pointer representing the current state is the fastest possible approach, only one level of indirection. I'm doing something similar in Rust, but not using compiler-generated async.
Heyitsmeguys
Heyitsmeguys4mo ago
@Dmitry Salin @Darkmatter have you ever experimented with senders/receivers (the better alternative mentioned above)? There are many implementations: stdexec, HPX, libunifex.
Darkmatter
Darkmatter4mo ago
Our discussion ended up 5 or 6 layers down in the weeds about the implementation of coroutines themselves. You build basically all other concurrency models on top of them. I prefer the actor model because it handles heterogeneous hardware and distributed computation better. Erlang and other BEAM languages have extensively proven out this model in production environments. If all you care about is moving messages as fast as possible, then you hardcode the data flow graph and use something like DPDK’s ring buffer (channel) implementation, which have a practical throughput limit of somewhere above 1 billion messages per second per CPU core. The limit is higher than that I just haven’t ever talked to anyone, even people working on DPDK internals, who felt the need to push numbers on them because you can burn memory for more throughput so long as you have either a DMA device or memory bandwidth can keep up. However, you are likely to run out of CPU cycles first, so for programs where you aren’t starting to count clock cycles, just use the actor model for the loose coupling, since you can strictly wire actors up for a particular hardware configuration later.
Heyitsmeguys
Heyitsmeguys4mo ago
It’s also worth mentioning that the term computation applies to multiple paradigms. It can be easily used to describe imperative work, it can be well assimilated by functional programmers, it can apply to reactive programming and to all stream-based paradigms; although we haven’t talked about it, we can think of computations also in the context of the actor model. We haven’t explicitly pursued this, but one can infer that computations can be used as a basis for concurrency in multiple programming paradigms.
Source - p2504r0 - Computations as a global solution to concurrency S&R can be used as the basis for an actor framework, as mentioned above. If you're interested in an actor framework for Mojo, you can look at #1445 . I wrote a whole bunch of stuff about S&R in a comment there.
GitHub
Proposal For An Actor System Based On Mojo by reid-spencer · Pull R...
This is currently a work in progress. There are no code changes, just a proposal written in the proposals section. This was pre-approved by Chris Lattner in a conversation in June 2023. I will kee...
Heyitsmeguys
Heyitsmeguys4mo ago
I think the current interest is making a generic, high level, general purpose async computation and/or I/O framework. I don't know if what you're suggesting can be made into one.
Darkmatter
Darkmatter4mo ago
My question would be whether senders and receivers are a performance bottleneck due to the indirection they add. Sometimes you only have a few logical actors and the communication between them crushes your performance. Also, my understanding is that senders and receivers is incompatible with hardware message forwarding like the Intel Dynamic Load Balancer, which runs an actor model scheduler in hardware and skips on quite a few Xeon SKUs. Mostly due to how dynamic the message forwarding can be. So the question is "Is it worth it to exclude the state of the art in performance to make it easier to use async?" Not a fun question for a performance focused language. As a point of reference, DPDK is what io_uring benchmarks itself against for networking performance, and io_uring was at 60% throughput per core with much higher latency last I saw Jens present on it. If you were to write a distributed AI framework that didn't use RDMA, because RDMA doesn't scale and some clusters have already run into this, DPDK is what you would write it with. I would have less of an objection if you could have static connections between actors that are defined at compile time and the slower, dynamic behavior was opt-in. A mix would also be fine.
Heyitsmeguys
Heyitsmeguys4mo ago
There are ways to allow programmers to specialize certain parts of their async code to their hardware (if that part needs it) while keeping the rest high level. This is already what C++ plans on doing and it is one of the promises of Mojo.
Darkmatter
Darkmatter4mo ago
If that restriction is "You cannot add actors while the system is running" is that doable? That's what the DLB imposes right now. In exchange for removing the need for you to use CPU cores for scheduling. At least in task per core scenarios. Keep in mind that's 288 tasks on upcoming CPUs, which is more than most systems actually need.
Heyitsmeguys
Heyitsmeguys4mo ago
Sometimes you only have a few logical actors and the communication between them crushes your performance.
Under the hood, S&R just uses function calls:
The first (and most common) way for such optimizations to happen is thanks to the structure of the implementation: because all the work is done within callbacks invoked on the completion of an earlier sender, recursively up to the original source of computation, the compiler is able to see a chain of work described using senders as a tree of tail calls, allowing for inlining and removal of most of the sender machinery. In fact, when work is not submitted to execution resources outside of the current thread of execution, compilers are capable of removing the senders abstraction entirely, while still allowing for composition of functions across different parts of a program.
S&R is basically callbacks on steroids. Their overhead and size are meant to be around the same as functions. So I don't think that'll be an issue.
Darkmatter
Darkmatter4mo ago
It might work, assuming LLVM likes this approach better than Rust's version. Rust's async is much slower than it needs to be because of pass ordering in LLVM iirc.
Heyitsmeguys
Heyitsmeguys4mo ago
That's the thing. Mojo doesn't have to rely on LLVM. They've built the compiler from scratch and can replace any part of LLVM they don't like with their own.
Darkmatter
Darkmatter4mo ago
So can Rust. But it's labor intensive. Given a good enough set of optimization passes, I can see that S&R would converge to decent code-gen. The next challenge would be the borrow checker making callbacks painful. This is part of why Rust went with the async/await syntax. You can express linear chains easily, but the borrow checker kicks in more strongly when you want to have multiple children in the chain of work.
Want results from more Discord servers?
Add your server