splitting vectorize and parallelize
My first approach when optimizing a single loop is to apply `vectorize`. Now I wonder if it makes sense in some cases to transform the single loop into a nested loop, vectorizing the inner loop and parallelizing the outer one.
Instead of vectorizing a single loop over `i` in `range(n)`, rewrite it as a nested loop (with `i = k * chunk + j`), and then vectorize over j and parallelize over k.
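To make the question concrete, here is a minimal sketch of that split, assuming the `vectorize`/`parallelize` helpers from Mojo's `algorithm` package (exact signatures have changed between Mojo versions, and the names `update_split`, `param`, `grad`, and `num_chunks` are purely illustrative):

```mojo
from algorithm import vectorize, parallelize
from sys.info import simdwidthof

alias nelts = simdwidthof[DType.float32]()

# Illustrative sketch: split a flat update over n elements into
# num_chunks outer chunks (parallelized), each of chunk inner
# elements (vectorized). API details vary between Mojo versions.
fn update_split(
    param: UnsafePointer[Float32],
    grad: UnsafePointer[Float32],
    lr: Float32,
    n: Int,
    num_chunks: Int,
):
    var chunk = n // num_chunks  # assumes num_chunks divides n

    @parameter
    fn outer(k: Int):
        @parameter
        fn inner[width: Int](j: Int):
            var i = k * chunk + j
            param.store(
                i, param.load[width=width](i) - lr * grad.load[width=width](i)
            )

        vectorize[inner, nelts](chunk)  # vectorize over j

    parallelize[outer](num_chunks)  # parallelize over k
```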
If it makes sense, how do I find a good balance between vectorize and parallelize? In my concrete example, I have a loop of around 120 million .... (updating parameters in llm.mojo)
What I also wonder in this regard: does the compiler detect these optimizations anyway, so that it is better to keep the code simple and let the compiler do this type of standard optimization?
Thanks
6 Replies
this is a very interesting question, something i’ve not directly thought about but have kind of done multiple times
this would be something you’d have to benchmark, but i’d assume just vectorization would be better solely due to parallelize overhead, but 100% could be wrong
I made some basic tests with an artificial example, and my feeling is that when most of the work is still done in the inner loop (small k range, large j range) it can benefit. What I would love to learn is whether there is a more general strategy here; what works on my device might not work on another, and the other way around.
No idea if this is doable, but I would love Mojo to have a
par_and_vec()
method where you tell Mojo to find, depending on the hardware, the best balance between parallelize and vectorize, so we developers don't need to start being clever. When optimizing llm.mojo
it became clear to me how all these optimization steps follow pretty much the same pattern in most cases, and it feels like something the compiler could take care of to some degree when we as developers give the go ... Maybe even better if Mojo does the parallelize internally anyway whenever possible.

Hey @Martin Dudek, parallelize has overhead associated with it that will actually slow things down if the jobs aren't big enough. Vectorizing the inner loop and parallelizing the outer loop is the best way, because it keeps the pipeline full for the 16+ SIMD registers your CPU has. This is also why using 2x SIMD_WIDTH is good for a lot of algorithms and CPUs, because it keeps the pipeline for the SIMD registers more full to execute SIMD operations in parallel. But it's highly dependent on both the algorithm and CPU. Some compilers do what you're describing, but that'll never get peak performance; Mojo is focused on being explicit, and Autotuning (currently being reimplemented) is focused on finding the best params for max performance across different machines.
The
par_and_vec()
idea is interesting, I'll have a go at that.

Thanks for the explanation @Jack Clayton, very helpful to learn these details. Peak performance is something we aim for, for sure, at least for the crucial parts of our programs. Exciting sphere :mojo:
In our MoCodes project's heavy-lifting functions we had foreseen a vectorized inner loop and parallel outer loop from the get-go, yet had to discover that setting num_workers to 1 (effectively disabling parallelism completely) for the outer loop led to the best performance, despite the fact that on a GPU you would definitely want the outer loop to be executed by hundreds of threads in parallel. We discovered that through profiling and noticing a big percentage of time being spent in sync_wait. Guess our core functions don't perform a big enough job. We had to take parallelism one macro level higher, using it only for sub-batching the batch of words to be decoded.
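For reference, the worker count can be pinned explicitly; a sketch under the assumption that a `parallelize` overload takes the worker count as a second argument (the function name and per-chunk body here are placeholders, not MoCodes code):

```mojo
from algorithm import parallelize

fn decode_chunks(num_chunks: Int, num_workers: Int):
    @parameter
    fn outer(k: Int):
        pass  # ... heavy per-chunk work would go here ...

    # With num_workers == 1 the chunks run serially on the calling
    # thread, so the time otherwise spent in sync_wait disappears.
    parallelize[outer](num_chunks, num_workers)
```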
thanks for sharing your experiences @alain, very interesting for me. In my project a relatively small number of num_workers seems to work best, but I haven't done solid tests on that. The inner jobs in my project are pretty well behaved, meaning they all need a very similar amount of compute to complete, so that should make things easier.
i am coming from high-level programming and am just starting to dive into this subject, so every input is most welcome. Would love Mojo to assist here and make reasonable choices based on the hardware, as mentioned above (the par_and_vec() idea).
I am a bit lost here as it seems clear that what works well on my laptop will possibly be a poor choice on other hardware.