Modular•9mo ago

Matmul.mojo

Hello, Mojocians! I'm thrilled to share my implementation of matrix multiplication in Mojo. The repository includes benchmark results with the performance metrics. It outperforms NumPy (OpenBLAS) and comes close to the MAX engine! Feel free to explore the repository, run the benchmarks, and integrate it into your projects. Happy coding :). https://github.com/YichengDWu/matmul.mojo

24 Replies

Ryulord•9mo ago

I wonder if the fact that NumPy uses E-cores could be sabotaging its perf. It would be interesting to see a benchmark on a CPU with only one kind of core, or with E-cores disabled.

Mohamed Mabrouk•9mo ago

Amazing effort! It would be really interesting to benchmark this implementation further in other scenarios: single-core performance, scaling with the number of cores, comparison against Intel MKL, a range of architectures, etc.
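
A minimal sketch of how the single-core and core-scaling runs could be set up for the NumPy baseline, assuming a BLAS build (OpenBLAS or MKL) that honors the usual thread-count environment variables. The helper name `limit_blas_threads` is hypothetical, and the variables need to be set before NumPy is first imported. This only caps the number of threads; it does not pin them to particular cores.

```python
import os

# Environment variables that common BLAS backends read when they are loaded.
_THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_blas_threads(n_threads: int) -> None:
    """Cap BLAS thread pools (hypothetical helper).

    Call this before the first `import numpy` in the process.
    """
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    for var in _THREAD_VARS:
        os.environ[var] = str(n_threads)


limit_blas_threads(1)  # single-core baseline
```

For a scaling study, the same benchmark script would be launched once per thread count (1, 2, 4, ...), each in a fresh process so the setting takes effect.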

sora•9mo ago

Great work! How does it compare to MAX on shapes benchmarked in this blog? https://www.modular.com/blog/the-worlds-fastest-unified-matrix-multiplication

EthanOP•9mo ago

You're right. Great insight! I will push a commit reflecting that.

EthanOP•9mo ago

The revenge of numpy. I don't have enough motivation to benchmark certain shapes. I spent about two weeks on matmul.mojo and probably won't have more free time to put into it.

Darin Simmons•9mo ago

@adakkak You might be interested in this. I like the lines, and things go brrrrr. Kudos for a great accomplishment in Mojo alone. For this graph, a BIG FAT legend beneath the lines would be great; at this resolution it's hard to read even when zooming in. Also, it says "Max", which reads like it's short for maximum; "MAX" or "MAX engine" would be clearer. I think you should get a hat AND a cup AND a shirt 😃. Also feel-goods.

EthanOP•9mo ago

I noticed that too. The image shown by plt.show() looks great, but for some reason, the saved image is quite different. I'll try adding markers and changing the format.

EthanOP•9mo ago

@Darin Simmons How about this one?

Darin Simmons•9mo ago

Very nice, legible, and the colors are clear. The last nit would be to turn "Max" into "MAX". Of course, putting your name, contacts, GitHub, and a title on it would typically be suggested, but that's really up to you.

EthanOP•9mo ago

All great suggestions, thank you!

Ryulord•9mo ago

Still impressive how close you are. I think this is the best open-source Mojo matmul we have? Is running on E-cores really the default behavior for NumPy? It's an insane difference.

EthanOP•9mo ago

NumPy's default behavior is to utilize all available threads, which is quite sensible. When using the MKL backend, leveraging all cores is recommended, as MKL utilizes them optimally. On my machine, MKL can achieve over 930 GFLOP/s. Disabling E-cores results in performance close to OpenBLAS.
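
For reference, a hedged sketch of how a GFLOP/s figure like the one above is typically derived, assuming the conventional count of 2·M·N·K floating-point operations for an (M×K)·(K×N) product. The helper names `gflops` and `best_time` are hypothetical and are not taken from the repository.

```python
import time


def gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of an (m x k) @ (k x n) matmul, counting 2*m*n*k flops."""
    return 2.0 * m * n * k / seconds / 1e9


def best_time(fn, repeats: int = 5) -> float:
    """Smallest wall-clock time in seconds over `repeats` calls of `fn`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# A 1000 x 1000 x 1000 matmul that takes 2 seconds runs at 1 GFLOP/s.
print(gflops(1000, 1000, 1000, 2.0))  # 1.0
```

With NumPy arrays `a` and `b`, the measurement would look like `gflops(m, n, k, best_time(lambda: a @ b))`. Taking the best of several runs reduces noise from warm-up and scheduling.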

EthanOP•9mo ago

Updated graph.

Darin Simmons•9mo ago

OOOOOOhhhhh, aaaaaahhh 😃
Also, don't forget to acknowledge yourself for all the work you've put in. In other words, take a victory lap, pat yourself on your back, etc.

EthanOP•9mo ago

Thanks for the reminder 😂! Yeah it's definitely a side project that I can take some pride in.

TilliFe•9mo ago

nice work! 🔥

EthanOP•9mo ago

Thank you!

Dune•9mo ago

@Ethan Hi, wonderful work! Quick question: why is hyperthreading disabled?

EthanOP•9mo ago

I was following the settings in this blog: https://www.modular.com/blog/how-to-be-confident-in-your-performance-benchmarking

DobyDabaDu•7mo ago

@Ethan I tried to implement my own, but things didn't work well in Mojo (unlike in C). It's amazing that you did it even without using the built-in prefetch and tile functions. Anyway, I updated your code for Mojo 24.5, since I'm going to use it in a project, and opened a PR to your repo. Once GPU support lands, it would be amazing to see a GPU version as well. Thanks!
