Parallelize in the Mandelbrot example does not use all physical cores of an Intel Core i9 CPU

Parallelize doesn't use all 24 physical cores of a "13th Gen Intel® Core™ i9-13900KF × 32" processor. Only the 8 "performance" cores are actually used, so the speedup stays just below 8X on a processor with 24 physical cores. It seems that, on this processor, the parallelize implementation sizes the thread pool from the number of performance cores rather than the number of physical cores. I tried changing the parallelize runtime argument num_workers, with essentially no practical effect except bringing the speedup slightly closer to 8X, with optimal values around 100.

(Mojo examples) .../mojo/examples$ mojo mandelbrot.mojo
Number of logical cores: 32
Number of physical cores: 24
Number of performance cores: 8
Vectorized: 5.576460718604651 ms
Parallelized: 0.7627593649078195 ms
Parallel speedup: 7.310904297161366

I would expect a parallel speedup below 24X, but probably above 20X. The Mandelbrot computation is highly parallelizable, but the 16 non-"performance" cores are usually slower.

I hope this can be the first of my two cents' contributions to a truly promising language & friends (Mojo->MAX&Magic 😉)
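As a quick check of the numbers above, here is a minimal Python sketch that recomputes the reported speedup and expresses it as a parallel efficiency against each core count. The timings and core counts are copied from the output in the post; the helper names `speedup` and `efficiency` are made up for illustration and are not part of Mojo.

```python
# Timings copied from the mandelbrot.mojo output in the post (milliseconds).
vectorized_ms = 5.576460718604651
parallelized_ms = 0.7627593649078195


def speedup(baseline_ms: float, parallel_ms: float) -> float:
    """Ratio of single-threaded (vectorized) time to parallel time."""
    return baseline_ms / parallel_ms


def efficiency(speedup_value: float, cores: int) -> float:
    """Fraction of the ideal linear speedup achieved on `cores` cores."""
    return speedup_value / cores


s = speedup(vectorized_ms, parallelized_ms)
print(f"Parallel speedup: {s:.3f}X")

# Core counts as reported by the example for the i9-13900KF.
for label, cores in (("performance", 8), ("physical", 24), ("logical", 32)):
    print(f"vs {cores:2d} {label} cores: {efficiency(s, cores):.1%} of linear")
```

Measured against the 8 performance cores the run scales well, while against all 24 physical cores it reaches under a third of linear scaling, which is the gap the post is about. Linear scaling over 24 cores is only an upper bound here, since the post itself notes that the 16 other cores are usually slower.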
4 Replies
Jack Clayton, 4w ago
Hi @andrea onofri, at the moment Mojo defaults to using just the performance cores, which is the best default for ops like matmul on CPU. Work is under way to expose an API so that users can turn on efficiency cores and hyperthreading, which is better for things like Mandelbrot where each task executes at a different speed. The team working on the runtime was moved to higher-priority tasks for GPU performance, but will implement this when they have cycles.
andrea onofri (OP), 4w ago
Thanks for the reply! But I still think that physical cores would be a better default. I agree that logical cores are not a good default, because that would use hyperthreading by default. I understand (and agree) that GPU is now the main priority, so I propose only a change of default, which should be very easy to implement: change the parallelize default from num_performance_cores() to num_physical_cores().

With the current default, on many CPUs only a fraction of the available computing power is actually used. For example, on my CPU only 8 physical cores are used, but 24 are available. The fraction used is not exactly 8/24 = 1/3, because the 16 non-performance cores can be slower (though the main difference is that they do not support hyperthreading). Using less than half of the available cores is not a good calling card, especially for newcomers to a language devoted to making full use of otherwise unused resources such as vector ops, multicore, and GPUs. I think that changing this default would also benefit tasks that are less embarrassingly parallel than the Mandelbrot set, e.g. matmul.

You probably haven't spotted this problem because on higher-end CPUs all physical cores are performance cores. This is the case for the CPU (Xeon Platinum 8481C) used for Mandelbrot in "How Mojo🔥 gets a 35,000x speedup over Python" (https://www.modular.com/blog/how-mojo-gets-a-35-000x-speedup-over-python-part-1).

(Sorry for the multiple messages, I am also new to Discord and I don't know how to add newlines.)
Jack Clayton, 4w ago
Using a default of all physical cores does slow things down a lot. It hurts pipelining, because something like matmul is tiled and all the separate tasks take the same time to execute; if some tasks go to the efficiency cores, it messes up the scheduling and the performance cores end up waiting for work. Mandelbrot is somewhat unique because all the tasks take different times to execute, so it doesn't matter if they're spread out all over the place. But I totally agree we need to fix this and make it configurable.
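The scheduling argument above can be illustrated with a toy model in Python. This is only a sketch under loud assumptions: it is not Mojo's scheduler, the relative efficiency-core speeds (0.5 and 0.25 of a performance core) are invented parameters, and caches, memory bandwidth, and hyperthreading are ignored. `static_makespan` models equal-sized tasks split evenly and fixed up front, so the slowest core determines the finish time; `ideal_dynamic_makespan` models perfect load balancing as a lower bound.

```python
def static_makespan(num_tasks: int, task_time: float, core_speeds: list[float]) -> float:
    """Finish time when equal tasks are split evenly across cores in advance.

    Each core receives the same number of tasks (rounded up), so the
    slowest core determines when the whole job is done.
    """
    tasks_per_core = -(-num_tasks // len(core_speeds))  # ceiling division
    return max(tasks_per_core * task_time / speed for speed in core_speeds)


def ideal_dynamic_makespan(num_tasks: int, task_time: float, core_speeds: list[float]) -> float:
    """Lower bound: work is balanced perfectly according to core speed."""
    return num_tasks * task_time / sum(core_speeds)


NUM_TASKS = 240   # hypothetical number of equal-sized tiles
TASK_TIME = 1.0   # time for one tile on a performance core (arbitrary units)

p_only = [1.0] * 8  # 8 performance cores, as on the i9-13900KF

# e_speed is an assumed relative throughput of one efficiency core.
for e_speed in (0.5, 0.25):
    all_cores = [1.0] * 8 + [e_speed] * 16
    print(
        f"E-core speed {e_speed}: "
        f"P-only static = {static_makespan(NUM_TASKS, TASK_TIME, p_only)}, "
        f"all-core static = {static_makespan(NUM_TASKS, TASK_TIME, all_cores)}, "
        f"all-core ideal dynamic = {ideal_dynamic_makespan(NUM_TASKS, TASK_TIME, all_cores)}"
    )
```

In this model, an even static split over all 24 cores beats the 8 performance cores only while an efficiency core runs faster than 8/24 = 1/3 of a performance core; below that, the performance cores sit idle waiting for the slower ones. Workloads with uneven task times that are handed out dynamically sit closer to the ideal bound, which matches the point made about Mandelbrot. Where a real workload falls depends on factors this sketch leaves out.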
andrea onofri (OP), 4w ago
In my experience, using all physical cores pays off even when you account for a tail of slower threads. But you surely know better than I do what is best for Mojo at this teething stage. Thanks again for the kind reply.
