Advice on NBabel benchmark with Mojo?

I'm writing a long blog post on Mojo. I should be able to share it here quite soon, but first I'd like to ask again about a question I posted on GitHub: https://github.com/modularml/mojo/discussions/1275. I haven't been able to get very good performance with Mojo on the N-body problem, in particular compared to a very efficient Julia implementation. It would be great to get some help speeding up this code.
Michael K
Michael K10mo ago
Does the Julia implementation also use a SIMD4 to hold 3 values? In this line it appears to be looping over the 3 axes. Organizing the data differently in Mojo to make use of the full SIMD vector length seems like it would recover the 25% of performance lost to the unused lane.
Link: nbabel/julia/nbabel4_threads.jl at main · paugier/nbabel
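For concreteness, the layout in question stores each particle's position as one SIMD vector of width 4, with the fourth lane padded to zero. A minimal sketch of that pattern in the Mojo syntax of the time (dist_sq is an illustrative name, not code from either repo):

```mojo
alias Vec4 = SIMD[DType.float64, 4]  # lanes: x, y, z, padding

# Squared distance between two particles, each held in one Vec4.
# The padding lane stays at zero, so it adds nothing to the sum,
# but it also means one of the four SIMD lanes does no useful work.
fn dist_sq(a: Vec4, b: Vec4) -> Float64:
    var d = a - b
    return (d * d).reduce_add()
```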
Pierre Augier
Pierre Augier10mo ago
The fastest Julia implementations (for example this one: https://github.com/paugier/nbabel/blob/reply-zwart2020/julia/nbabel5_serial.jl) also use 4 floats to hold 3 values, and it's very efficient because it uses SIMD. I tried to reproduce this in https://github.com/paugier/nbabel/blob/main/mojo/bench_no_part.mojo but it is not much better than the other Mojo implementation.
Michael K
Michael K10mo ago
Yes, I agree that using SIMD can be very fast. My intuition was that by using a SIMD4 to hold only 3 values you were leaving potential on the table. Organizing the data that way also makes it difficult to try other SIMD sizes, or to work with data from more than two particles at a time while still benefiting from SIMD ops. I did a quick trial of this idea: just reorganizing the data layout got a speedup of 6%, and that is with SIMD1. Using SIMD4 as you do, but keeping x, y, z as separate pointers so each SIMD4 actually holds 4 floats across 4 particles, got a speedup of 2.5x. On my machine SIMD16 is generally fastest and got a speedup of 2.9x. With your layout I am not sure how you would try other SIMD sizes, though maybe it is possible with some cleverness. My limited understanding is that the vector length matters a lot and the optimal value varies with the hardware it runs on, so designing for a variable SIMD width is valuable. My implementation, based off your gist, is here. Let me know if you find any mistakes; the values from my method differed from yours at about the 10th decimal place.
Link: nbody/nbody.mojo at main · mikowals/nbody
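The structure-of-arrays idea described above looks roughly like the following sketch. This is not the code from the linked repo; accel_block and the inverse-r³ kernel are illustrative, and the syntax is the Mojo of that era. The point is that one SIMD load per axis covers nelts particles at once, and nelts is a compile-time parameter you can tune:

```mojo
from math import sqrt
from sys.info import simdwidthof

alias dtype = DType.float64
alias nelts = simdwidthof[dtype]()  # try 2, 4, 8, 16 to tune per machine

# Accumulate the acceleration that particle j exerts on the block of
# particles [i, i + nelts): each axis is one SIMD op over nelts particles.
fn accel_block(
    x: DTypePointer[dtype], y: DTypePointer[dtype], z: DTypePointer[dtype],
    ax: DTypePointer[dtype], ay: DTypePointer[dtype], az: DTypePointer[dtype],
    m_j: Float64, i: Int, j: Int,
):
    var dx = x.simd_load[nelts](i) - SIMD[dtype, nelts](x.load(j))
    var dy = y.simd_load[nelts](i) - SIMD[dtype, nelts](y.load(j))
    var dz = z.simd_load[nelts](i) - SIMD[dtype, nelts](z.load(j))
    var r2 = dx * dx + dy * dy + dz * dz
    var coef = SIMD[dtype, nelts](m_j) / (r2 * sqrt(r2))
    ax.simd_store[nelts](i, ax.simd_load[nelts](i) - coef * dx)
    ay.simd_store[nelts](i, ay.simd_load[nelts](i) - coef * dy)
    az.simd_store[nelts](i, az.simd_load[nelts](i) - coef * dz)
```

With the per-particle SIMD4 layout, a particle's three coordinates are glued into a single vector, so the width cannot be changed; with separate x, y, z arrays, changing nelts is a one-line edit.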
mad alex 1997
mad alex 199710mo ago
The way you are using SIMD is wrong for this application: you aren't using the full vector, and InlinedFixedVector is slower than a DTypePointer. Additionally, those decorators in your Julia code are doing a lot of heavy lifting behind the scenes that Mojo can do with slightly more difficulty. I am the author of the Mojo-Arrays package (soon to be renamed numojo). I am holding off on further development until we have traits and know what they intend to do with the native Tensor struct, but after my next update I will try the N-body calculation to see if I can make it faster.
Link: GitHub - MadAlex1997/Mojo-Arrays
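On the InlinedFixedVector point: a raw DTypePointer buffer is just typed, contiguous heap memory, so there is no abstraction between you and the SIMD loads and stores. A minimal era-Mojo sketch (make_buffer is an illustrative helper, not from any of the repos above):

```mojo
from memory import memset_zero

alias dtype = DType.float64

# A raw, typed, contiguous heap buffer: no bounds checks, and
# simd_load / simd_store work directly on it.
fn make_buffer(n: Int) -> DTypePointer[dtype]:
    var buf = DTypePointer[dtype].alloc(n)
    memset_zero(buf, n)
    return buf  # caller is responsible for calling buf.free()
```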
Pierre Augier
Pierre Augier10mo ago
Thanks a lot! It's great! I'm going to use this to create a more efficient Mojo implementation for the whole problem. I'd like to avoid having the number of particles as an alias.
Pierre Augier
Pierre Augier10mo ago
What do you think about the technical feasibility of this potential project about implementing the Python array API standard in Mojo?
Link: Implementation of the "Python array API standard" in Mojo? · modularml/mojo
It would be great to build a Mojo implementation of the Python array API standard, that would be usable both from Python (after creation of a Python package with AOT compilation) and from Mojo. Thi...
sora
sora10mo ago
I think that's quite achievable once the std Tensor type has been built up a bit more.
mad alex 1997
mad alex 199710mo ago
I did not know that it existed. From a quick skim, it seems pretty similar to my goals for the ndarray struct. I will have to read more and do some side-by-side performance testing before I decide whether numojo should fully follow this standard. I am also waiting to see what they do with Tensor. If Tensor is usable, then numojo can focus purely on fast math operations.
sora
sora10mo ago
Or it could be dissolved altogether, supposing Tensor ends up providing more or less what NumPy provides. But Chris did mention that Tensor is mostly a placeholder.
mad alex 1997
mad alex 199710mo ago
I don't think they are going to give us all the optimizers, calculus, etc. If I don't have to worry about the base data type, numojo will include many of the SciPy features. Much of what is in NumPy does not belong in a standard library anyway.
sora
sora10mo ago
Ah, I thought you meant broadcasting, fancy indexing, shape arithmetic, etc., without which the Tensor (NDArray) is not really that "ND".
mad alex 1997
mad alex 199710mo ago
I want that stuff, and so far it looks like Tensor might not have it. But if it does, then I can focus on what I really want to do, which is super fast math. For example, I have a slice implementation in my 2D array, but Tensor lacks one. Depending on how traits and composition work, I may still tack things onto Tensor regardless of how good it is; as long as I don't change how the data is stored, it would still be convertible to the standard-library Tensor.
Michael K
Michael K10mo ago
I don't think having the number of particles as an alias is a relevant detail. The main thing is keeping your data grouped so that values you want to work with at the same time are next to each other in memory.
Pierre Augier
Pierre Augier10mo ago
I get very good results (something like a 2.5x speedup) with your approach (with nelts = 16) on a recent computer, but on an older one the other code is actually a bit faster, so I'm getting a "speedup" of ~0.8 with your file. On this old computer the best results are obtained with nelts = 2, and consistently the "speedup" is 0.4 with nelts = 1. This is not a big deal, but I don't understand why your approach is slower on this computer, and I have no idea how to investigate it.
Michael K
Michael K10mo ago
Yes, hardware matters. You can use simdwidthof to find a sensible default width for a specific machine. But nelts = 1 would generally be expected to be slower, since it does not use SIMD at all.
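In era-Mojo that looks like the line below; simdwidthof reports the native vector width of the build target for a given dtype, which makes a reasonable starting point for nelts before benchmarking other values:

```mojo
from sys.info import simdwidthof

# Native f64 vector width of the target: e.g. 8 with AVX-512,
# 4 with AVX2, 2 with NEON. A sensible default for nelts.
alias nelts = simdwidthof[DType.float64]()
```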