Advice on Nbabel benchmark with Mojo?
I'm writing a long blog post on Mojo. I should be able to share it here quite soon, but before that I'd like to ask again about a question I posted on GitHub: https://github.com/modularml/mojo/discussions/1275.
I was not able to get very good performance with Mojo on the N-body problem, in particular compared to a very efficient Julia implementation. It would be great if I could get some help to speed up this code.
23 Replies
Does the Julia implementation also use SIMD4 to hold 3 values? In julia/nbabel4_threads.jl it appears to be looping over the 3 axes. Organizing the data differently in Mojo to make use of the full SIMD vector length seems like it would recover 25% of the performance.
The fastest Julia implementations (for example https://github.com/paugier/nbabel/blob/reply-zwart2020/julia/nbabel5_serial.jl) also use 4 floats to hold 3 values, and they are very efficient because they use SIMD.
I tried to reproduce this in https://github.com/paugier/nbabel/blob/main/mojo/bench_no_part.mojo but it is not much better than the other Mojo implementation.
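Concretely, the layout in question looks roughly like this (a minimal sketch with made-up helper names, not code from either repository):

```mojo
alias fp = DType.float64

# One 4-lane vector per particle, with the 4th lane unused.  A position
# difference is a single SIMD op, but the vector width is pinned to 4
# whatever the hardware prefers.
fn delta(pos_i: SIMD[fp, 4], pos_j: SIMD[fp, 4]) -> SIMD[fp, 4]:
    return pos_j - pos_i

fn squared_distance(pos_i: SIMD[fp, 4], pos_j: SIMD[fp, 4]) -> Float64:
    var d = delta(pos_i, pos_j)
    return (d * d).reduce_add()  # the padding lane is zero, so it adds nothing
```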
Yes, I agree that using SIMD can be very fast. My intuition was that by using SIMD4 to hold only 3 values you were leaving performance on the table. Organising the data that way also makes it difficult to use other SIMD sizes, or to work with data from more than two particles while still benefitting from SIMD ops.
I did a quick trial of my thinking, and just reorganizing the data layout got a speedup of 6%, and that is still using SIMD width 1. Using SIMD4 as you are, but keeping x, y, z as separate pointers so that I am actually working with 4 floats across 4 particles, got a speedup of 2.5x. On my machine SIMD16 is generally fastest and got a speedup of 2.9x. With your layout I am not sure how you would try other SIMD sizes, though maybe with some cleverness it is possible. My limited understanding is that the vector length matters a lot and the optimal value varies with the hardware it is running on, so designing for a variable SIMD width is valuable.
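To make the alternative layout concrete, this is roughly the kind of inner kernel I mean, with x, y and z in separate buffers; the names are made up for illustration and this is not the exact code from my repo:

```mojo
from memory.unsafe import DTypePointer
from math import sqrt

alias fp = DType.float64
alias nelts = 4  # or simdwidthof[fp]() for the hardware default

# Contribution of particles j .. j + nelts - 1 to the acceleration of one
# particle at (xi, yi, zi).  With x, y and z in separate buffers, each
# simd_load reads the same coordinate of nelts consecutive particles.  The
# caller is assumed to skip or mask the j == i term and to pad the buffers
# to a multiple of nelts.
fn block_acceleration(
    xi: Float64, yi: Float64, zi: Float64, j: Int,
    x: DTypePointer[fp], y: DTypePointer[fp], z: DTypePointer[fp],
    m: DTypePointer[fp],
) -> SIMD[fp, 4]:
    var dx = x.simd_load[nelts](j) - SIMD[fp, nelts](xi)
    var dy = y.simd_load[nelts](j) - SIMD[fp, nelts](yi)
    var dz = z.simd_load[nelts](j) - SIMD[fp, nelts](zi)
    var d2 = dx * dx + dy * dy + dz * dz
    var coef = m.simd_load[nelts](j) / (d2 * sqrt(d2))  # m_j / |d|^3 per lane
    # reduce the nelts partial contributions to one (ax, ay, az, padding)
    return SIMD[fp, 4](
        (coef * dx).reduce_add(),
        (coef * dy).reduce_add(),
        (coef * dz).reduce_add(),
        0.0,
    )
```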
My implementation based on your gist is in nbody.mojo in my mikowals/nbody repo. Let me know if you find any mistakes I made. The values from my method differed only at about the 10th decimal place.
The way you are using SIMD is wrong for this application. You aren't using the full buffer, and InlinedFixedVector is slower than a DTypePointer. Additionally, those decorators in your Julia code are doing a lot of heavy lifting behind the scenes that Mojo can do with slightly more difficulty.
I am the author of the Mojo-Arrays package (MadAlex1997/Mojo-Arrays, soon to be renamed numojo). I am holding off on further development until we have traits and know what they intend to do with the native Tensor struct. But after my next update I will try the NBody calculation to see if I can make it faster.
Thanks a lot! It's great! I'm going to use this to create a more efficient Mojo implementation for the whole problem. I'd like to avoid having the number of particles as an alias.
What do you think about the technical feasibility of this potential project: the "Implementation of the Python array API standard in Mojo?" discussion on modularml/mojo? It would be great to build a Mojo implementation of the Python array API standard that would be usable both from Python (after creating a Python package with AOT compilation) and from Mojo.
I think that's quite achievable after the std Tensor type has been built up a bit more.
I did not know that it existed. From a quick skim, it seems pretty similar to my goals for the ndarray struct. I will have to read more, and do some side-by-side performance testing, before I decide whether numojo should fully follow this standard.
I am also waiting to see what they do with Tensor. If Tensor is usable, then numojo can focus purely on fast math operations.
Or being dissolved altogether, supposing Tensor provides more or less what NumPy provides. But Chris did mention Tensor is mostly a placeholder.
I don't think they are going to give us all the optimizers, calculus, etc. If I don't have to worry about the base data type, numojo will include many of the scipy features.
Much of what is in numpy does not belong in a standard lib anyway.
Ah, I thought you meant broadcasting, fancy indexing, shape arithmetic, etc., without which the Tensor (NDArray) is not really that "ND".
I want that stuff, and so far it looks like Tensor might not have it. But if it does, then I can focus on what I really want to do, which is super fast math.
I have a slice implementation in my 2D array, for example, but Tensor lacks one. Depending on how traits and composition work, I may still tack things onto Tensor regardless of how good it is; as long as I don't change how the data is stored, it would still be convertible to the standard library Tensor.
I don't think having the number of particles as an alias is a relevant detail. The main thing is keeping your data grouped so that the data you want to work with at the same time is next to each other.
I get a very good result (something like a 2.5x speedup) with your approach (with nelts = 16) on a recent computer, but on an older one the other code is actually a bit faster, so I'm getting a "speedup" of ~0.8 with your file. On this old computer the best results are obtained with nelts = 2, and consistently the "speedup" is 0.4 with nelts = 1. This is not a big deal, but I don't understand why your approach is slower on this computer, and I have no idea how to investigate it.
Yes, hardware matters. You can use simdwidthof (documented in the info module of the Modular docs) to find a sensible default width for a specific machine. But generally nelts = 1 would be expected to be slower because it does not use SIMD.
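For example, querying the hardware default width looks like this (a minimal sketch assuming the sys.info module of the Mojo release used here):

```mojo
from sys.info import simdwidthof

fn main():
    # Number of float64 lanes the target CPU handles in one SIMD operation;
    # a sensible default for nelts.
    alias nelts = simdwidthof[DType.float64]()
    print(nelts)
```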
Also, I found another 20% improvement by eliminating an extra "/ distance_cube". The change is in my repo, so the cumulative improvement on my machine is 3.8x.
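The kind of change I mean, sketched with hypothetical names rather than the actual diff: divide by distance_cube once and reuse the result for both particles, replacing the second division with a multiplication.

```mojo
alias fp = DType.float64

# diff is assumed to be pos_i - pos_j packed into a 4-lane vector.
# Before: one "/ distance_cube" for each particle of the pair.
# After: one division, reused for both updates.
fn pair_update(
    inout acc_i: SIMD[fp, 4],
    inout acc_j: SIMD[fp, 4],
    diff: SIMD[fp, 4],
    m_i: Float64,
    m_j: Float64,
    distance_cube: Float64,
):
    var coef = 1.0 / distance_cube
    acc_i -= diff * SIMD[fp, 4](m_j * coef)
    acc_j += diff * SIMD[fp, 4](m_i * coef)
```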
I now have a very efficient implementation in Mojo: https://github.com/paugier/nbabel/blob/main/mojo/bench_simd.mojo. However, it seems slightly buggy: the results depend (a lot) on the size of the SIMD vector used... But the only function that uses SIMD is "tested" in https://github.com/paugier/nbabel/blob/main/mojo/mikowals_experiments.mojo and seems to work.
I think with accelerate_tile and accelerate_vectorize you are doing exactly the same thing, and the performance differences between them are noise. I think when the code for the vectorize function is open-sourced we will see that it is built on tile_and_unswitch, which is in turn built on tile. I originally made the change to tile because it makes the code more readable: it avoids the math to create a range starting at 0 and then more math inside the loop to recreate the appropriate index. That matters particularly once you also have to reason about moving in SIMD-width-sized steps and working with SIMD vectors of data.
Once tile was there, though, I added the multiple SIMD width options, because widths bigger than 1 are usually faster, and doing as many iterations as you can with the bigger widths will be faster. The tile function only behaves differently if you include other values besides nelts and 1 for the SIMD widths. So between simplifying the range and iterator for readability, and reducing the total number of iterations by maximising the use of SIMD widths larger than 1, I think tile is the better choice for this use.
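Spelled out as plain loops, the iteration pattern I am describing looks like this (a sketch of the behaviour, not the standard library tile implementation):

```mojo
from memory.unsafe import DTypePointer

alias fp = DType.float64
alias nelts = 8

# Process full SIMD-width chunks while a whole vector fits, then finish the
# tail one element at a time.  This is the effect of tiling with the width
# list (nelts, 1).
fn scale[width: Int](p: DTypePointer[fp], i: Int, factor: Float64):
    p.simd_store[width](i, p.simd_load[width](i) * SIMD[fp, width](factor))

fn scale_all(p: DTypePointer[fp], n: Int, factor: Float64):
    var i = 0
    while i + nelts <= n:
        scale[nelts](p, i, factor)
        i += nelts
    while i < n:
        scale[1](p, i, factor)
        i += 1
```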
Yes, agreed about tile versus vectorize; my goal was just to check that they were basically equivalent.
My issue is not about that but about the strange numerical results that I have with the different Mojo implementations. I tried to summarize the problems in a README file in the nbabel/mojo directory https://github.com/paugier/nbabel/tree/main/mojo. I don't understand what happens but the Mojo implementations seem somehow buggy.
Have you tried making a minimal reproducer of the "numerical instability"? I didn't try comparing across runs, so I figured what I was seeing was a version of one of the open issues on modularml/mojo. But I expect Modular would be keen to learn about, and fix or explain, any new issue you open.
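One way to build such a reproducer, sketched with made-up code rather than anything from nbabel: do the same reduction at two SIMD widths and also run the same binary twice on the same data. Width-dependent differences in the last bits are expected because the additions are reassociated, but run-to-run differences on identical inputs are not.

```mojo
from memory.unsafe import DTypePointer
from random import rand

alias fp = DType.float64
alias nelts = 8

# Sum n values with a given SIMD width: vector accumulation for the bulk,
# scalar accumulation for the tail.
fn sum_width[width: Int](p: DTypePointer[fp], n: Int) -> Float64:
    var acc = SIMD[fp, width](0)
    var i = 0
    while i + width <= n:
        acc += p.simd_load[width](i)
        i += width
    var total = acc.reduce_add()
    while i < n:
        total += p.load(i)
        i += 1
    return total

fn main():
    alias n = 1003
    var p = DTypePointer[fp].alloc(n)
    rand(p, n)
    print(sum_width[1](p, n))      # purely sequential additions
    print(sum_width[nelts](p, n))  # reassociated additions; may differ slightly
    p.free()
```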
FYI, I opened a PR to Modular's nbody example. By changing a few lines of code I get a 6x performance gain.
PR: "nbody.mojo - remove unnecessary memory writes and @unroll inner loop".
Edit: I found a second optimisation, unrolling the inner loop, so the total improvement is 6.53 s -> 1.13 s on M1 Pro. That change is discussed in the next comment.
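The gist of the memory-write part of that change, sketched here with hypothetical names rather than the actual diff: accumulate in a SIMD register inside the inner loop and write back once.

```mojo
from memory.unsafe import DTypePointer

alias fp = DType.float64

# Keep the running update for body i in a SIMD register inside the inner loop
# and write it back to memory once, instead of a load/store round trip per j.
fn accumulate_in_register(
    acc: DTypePointer[fp], deltas: DTypePointer[fp], i: Int, n: Int
):
    var total = SIMD[fp, 4](0)
    for j in range(n):
        total += deltas.simd_load[4](4 * j)  # stays in a register
    # single write-back instead of n read-modify-write trips to memory
    acc.simd_store[4](4 * i, acc.simd_load[4](4 * i) + total)
```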
This might be the issue you were seeing. It appears to be specific to the architecture it is running on and doesn't occur on Apple Silicon.
Issue #1755 on modularml/mojo: "[BUG]: Code showing non deterministic behaviour". It doesn't seem to reproduce on an M3 Mac.