Fastest Matrix Multiplication
I have read somewhere in this Discord that the matmul function showcased in the matmul notebook is not the fastest implementation possible in Mojo. When will the Mojo language include the fastest implementation, if ever?
The fast matrix multiplication (which is written in Mojo) is a part of the Modular AI Engine. It's possible we could expose it for use from other Mojo code, but it may still formally remain part of the AI Engine, and is not likely to be open sourced.
I hope it will be distributed similarly to MKL so people can still use it when running Mojo on their own computers.
Does this mean that with the current Mojo version someone could write this fast matrix multiplication? Or are there additional Mojo decorators or features that enable this and can only be used internally?
The fast matrix multiplication we have is normal Mojo (just a lot of it); I don't think we're doing anything there that you couldn't do yourself if you put enough effort into it. We've put a ton of effort into our matrix multiplication, though, yielding state-of-the-art performance: https://www.modular.com/blog/the-worlds-fastest-unified-matrix-multiplication. In addition, part of what makes our matrix multiplication so compelling is how it can be fused with other operations, which is squarely in the domain of our AI Engine.
Wow, that is great to hear! I know the AI Engine is the product, but I'm glad Mojo features aren't being restricted, so someone can write their own operators with similar performance. I asked a similar question in a different channel, but to achieve this operator fusion, is the AI Engine parsing textual static graph IRs produced by, say, ONNX or TorchScript? Then, when it sees operators like matmul and relu, instead of using libtorch or ONNX Runtime to run them, it would use a matmul_relu function written in Mojo?
It's a lot more general than pattern matching; there are whole classes of operators that can be automatically fused without anyone having to specify a fusion manually. With your matmul and relu example, there is no rule saying "fuse matmul and relu into a matmul_relu" -- the AI Engine looks at the definitions of matmul and relu and sees whether it has the ability to fuse them (which it can do because it still has the MLIR of each kernel; it's not machine code yet), and it can automatically create a matmul_relu if it thinks that would be beneficial. We recently announced our developer conference ModCon, and we'll likely discuss more about the AI Engine's fusion capabilities there. This automatic fusion is probably not something you could write on your own with the Mojo SDK we've released at this point, though you could probably write any kind of manual fusion you want.
That is very cool! Thank you so much for your insight on the AI Engine; I just want to make sure I understand this all correctly. So functions written in Mojo parse the static graph IRs, and instead of running the C++ kernels from ONNX Runtime or libtorch, kernels written in Mojo are used to run the operators. This Mojo code is compiled into MLIR, and that level is where a lot of the magic of the AI Engine happens, since it can perform optimizations like automatic fusion there. Afterwards the code is compiled to machine code. Is this correct?
On a side note, it would be really awesome if in the future the AI Engine provided a framework for users to write their own custom operators in Mojo. That would then allow them to take advantage of these AI Engine optimizations like automatic fusion.
Yes, basically. And you're right that this plays well with custom operators. We'll likely be talking about this more at ModCon, and I think (not 100% sure, but I would be surprised if not) we will have recordings to share afterwards for people who are not able to attend as well.
Great, I have a much better understanding of the AI Engine now; thanks again! Looking forward to hearing more about it in the ModCon videos.
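To make the manual-fusion idea discussed above concrete, here is a rough sketch -- in plain Python rather than Mojo, just to keep it self-contained, and not Modular's code -- of the difference between running matmul and relu as two separate passes and fusing them into a single kernel that clamps each output row as it is produced. The names matmul, relu, and matmul_relu_fused are purely illustrative; the AI Engine's automatic fusion operates on the MLIR of the kernels and is far more general than this hand-written example.

```python
def matmul(A, B):
    """Plain triple-loop matrix multiply on lists of lists."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a = A[i][p]
            for j in range(n):
                C[i][j] += a * B[p][j]
    return C


def relu(C):
    """Separate second pass over the whole output."""
    return [[x if x > 0.0 else 0.0 for x in row] for row in C]


def matmul_relu_fused(A, B):
    """Fused variant: clamp each output row right after it is produced,
    so the output is traversed only once instead of twice."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a = A[i][p]
            for j in range(n):
                C[i][j] += a * B[p][j]
        # relu applied while row i is still hot in cache -- no second pass
        C[i] = [x if x > 0.0 else 0.0 for x in C[i]]
    return C


if __name__ == "__main__":
    A = [[1.0, -2.0], [3.0, 4.0]]
    B = [[5.0, 6.0], [7.0, 8.0]]
    assert relu(matmul(A, B)) == matmul_relu_fused(A, B)
```

The benefit of fusion is exactly this: the intermediate matmul result never has to be written out and re-read by a separate relu kernel, which is what the engine derives automatically from the kernel definitions.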
Hi Alex, I improved the matmul example's perf by 2.3x by implementing tile swizzling (L3 cache), better unrolling, and better loop ordering. Should I make a PR to improve the example? Would be keen to see how close it's gotten to AI Engine perf.
GitHub
[examples] Improve examples/matmul perf by 2.3x by jon-chuang · ...
CPU Results
Python: 0.005 GFLOPS
Numpy: 125.389 GFLOPS
Naive: 7.611 GFLOPS (1684.48x Python, 0.06x Numpy)
Vectorized: 31.086 GFLOPS (6880.12x Python, 0.25x Numpy)
Parall...
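For anyone wondering what tiling and better loop ordering mean in practice, the sketch below shows the general idea -- again in plain Python for self-containment, and hypothetical rather than taken from the PR above. The multiply is carried out one block at a time so the block of B being reused stays resident in cache, and the innermost loop walks rows of B and C contiguously. The names matmul_tiled and TILE are illustrative; the real Mojo example additionally layers vectorization, parallelization, and unrolling on top of this kind of blocking.

```python
import random

TILE = 64  # illustrative block size; real code tunes this per cache level


def matmul_tiled(A, B):
    """Cache-blocked matrix multiply: process one TILE x TILE block of the
    iteration space at a time, with an i-k-j inner order so the innermost
    loop accesses B and C along contiguous rows."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, TILE):
        for p0 in range(0, k, TILE):
            for j0 in range(0, n, TILE):
                # multiply one block of A against one block of B
                for i in range(i0, min(i0 + TILE, m)):
                    for p in range(p0, min(p0 + TILE, k)):
                        a = A[i][p]
                        for j in range(j0, min(j0 + TILE, n)):
                            C[i][j] += a * B[p][j]
    return C


if __name__ == "__main__":
    # sanity check against a straightforward reference computation
    A = [[random.random() for _ in range(70)] for _ in range(65)]
    B = [[random.random() for _ in range(80)] for _ in range(70)]
    ref = [[sum(A[i][p] * B[p][j] for p in range(70)) for j in range(80)]
           for i in range(65)]
    out = matmul_tiled(A, B)
    assert all(abs(out[i][j] - ref[i][j]) < 1e-9
               for i in range(65) for j in range(80))
```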
@Alex Kirchhoff how might I get access to the AI Engine to perform an apples-to-apples comparison?
The matmul example in the notebooks is there to show how to use Mojo's features for improving performance. Its primary purpose is educational, so we want to keep it simple and easy for beginners to understand, at the expense of performance. I have not checked, but I think examples/matmul.mojo is a copy of the code from the notebook in our documentation, so I think the same principles apply. We could probably manage to be a little bit more explicit about this. For people who really want to do matrix multiplications fast, I believe Modular would want to push people towards the AI Engine's implementation. (I understand that this might be frustrating when the Engine is not currently available.)
Because of this, I don't know that it makes sense to include your improved implementation in our official examples repository. But, I think people would still be interested in seeing this, so you might consider putting your improved matrix multiplication in a different repository and getting it included in one of the awesome-mojo lists.
Regarding access to the AI Engine: We are currently offering early access to a select set of customers. If you have a production use case, you can reach out to our sales team, and they can evaluate whether you would be a good candidate for early access: https://www.modular.com/contact/sales. However, most people won't fit in this category -- for everyone else, please stay tuned until ModCon, where we will likely have additional news for you.
@jon-chuang Very impressive result. Thanks for sharing!
I spent some time talking with @Jack Clayton and I am revising my thoughts somewhat -- we likely would be interested in integrating your proposed changes into the examples. I think Jack will take over from me w.r.t. communication around these changes from here.
Hi Alex, hope you can answer my question regarding this fast MM implementation. Is all of the computation done as pure codegen through MLIR? Or do you offload parts of the computation to calls to precompiled kernels? Thanks!
"Is all of the computation done as pure codegen through MLIR?" -- To my understanding, Mojo takes a compiler-first approach, i.e. no hand-tuned kernel lowering: what you code is what gets compiled and codegen-ed.
All of our kernels are written in Mojo, so if there is any offloading to pre-compiled kernels (which I'm not sure about), then those pre-compiled kernels would themselves have gone through MLIR.
Thanks, that's pretty cool! Looking forward to seeing some of the compiler internals being open sourced some day