Matrix multiplication (matmul): is `numpy` hard to beat, even by Mojo?
Super interested in Mojo and wanted to try out some of the documentation/blog examples. 🤔 🔥
https://docs.modular.com/mojo/notebooks/Matmul
Great explanations, and the step-by-step speed improvements are amazing to see!
However, in the end a comparison to a real-world alternative is interesting. No one would seriously do matmul in pure Python.
So I compared the performance to `numpy`, which is a much better baseline for comparison.
Results on my machine:
- Naive matrix multiplication: 0.854 GFLOP/s
- Vectorized matrix multiplication without `vectorize`: 5.71 GFLOP/s
- Vectorized matrix multiplication with `vectorize`: 5.81 GFLOP/s
- Parallelized matrix multiplication: 35.2 GFLOP/s
- Tiled parallelized matrix multiplication: 36.8 GFLOP/s
- Unrolled tiled parallelized matrix multiplication: 35.3 GFLOP/s
- Numpy matrix multiplication: 134.2 GFLOP/s
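For reference, the numpy side of such a comparison can be measured with a few lines of Python. This is a minimal sketch, not the exact script I used: `numpy_matmul_gflops` is a made-up helper name, and the 512x512 float32 matrices are an assumption, so adjust the size and dtype to match the Mojo benchmark you're comparing against.

```python
import time
import numpy as np

def numpy_matmul_gflops(n: int = 512, iters: int = 10) -> float:
    """Time an n x n float32 matmul and return the achieved GFLOP/s."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warm-up call so one-time initialisation doesn't skew the timing
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    elapsed = time.perf_counter() - start
    # Each n x n output element costs n multiplies + n adds => 2*n^3 flops per matmul.
    flops = 2.0 * n * n * n * iters
    return flops / elapsed / 1e9

print(f"numpy matmul: {numpy_matmul_gflops():.1f} GFLOP/s")
```

One caveat with timings like this: the result depends heavily on matrix size, dtype, and which BLAS backend numpy is linked against, so numbers are only comparable when those match across implementations.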
Takeaways:
- gigantic speedup compared against naive, pure Python 🔥
- still almost 4x SLOWER compared to numpy
Wondering if `numpy` is so heavily optimised for this operation that there is little room to keep up with or improve upon it?
Does anyone have ideas for further optimisations to get Mojo closer to numpy?
Is this something that only a framework like MAX or super low-level bit manipulation can achieve? 🤔
2 Replies
https://www.linkedin.com/posts/pavanmv_benchmarking-mojo-the-supercharged-superset-activity-7220047088577888256-siYI/
You can use MAX Engine for speedy matmul.
There's a whole discussion here as well:
https://github.com/modularml/mojo/issues/2660
While it is true that numpy is heavily optimized, it's also true that numpy delegates the actual work to a dedicated engine. Decades of work have gone into those engines: BLAS, LAPACK, OpenBLAS, and Intel MKL. The typical numpy installation uses OpenBLAS; Intel's MKL build of numpy is even faster than OpenBLAS.
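Since the backend matters so much for matmul throughput, it's worth checking which one a given numpy build actually links against before comparing numbers. numpy exposes this via `np.show_config()`, which prints the build's BLAS/LAPACK configuration:

```python
import numpy as np

# Print the BLAS/LAPACK libraries this numpy build was compiled against.
# A typical pip wheel reports OpenBLAS; conda/Intel builds may report MKL.
np.show_config()
```

Two machines reporting very different numpy GFLOP/s often differ here rather than in numpy itself.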
The community is attacking this issue in the inaugural #mojo-marathons. While Ethan Darkmatter and others are making strides in pure Mojo, a solution that closes the gap has not been found yet.
Additionally, other members of the community have been building computing libraries; see #community-showcase. Mojo is built on MLIR, which is a different IR than the one C runs on, and the Modular team has been writing their own MLIR dialects. The MAX engine is also in development. There are many areas of optimization still left to try.