Parallelize help (running time increases by a factor of 10 when adding a "var += 1").
I'm trying to do the 1BRC (One Billion Row Challenge) in Mojo and I'm at the code-optimization stage.
I've reduced the code to a minimum. If someone could try it, it would be great to know it's not just my system.
In a function that uses parallelize, modifying a value from inside the @parameter closure slows down the execution by a lot when that value is declared outside the closure rather than inside it.
Am I missing something obvious?
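Roughly this pattern (a sketch, not the exact code; the import paths and the size-first StaticTuple parameter order match the Mojo version from the time of this thread and may differ in newer releases):

```mojo
from algorithm import parallelize
from time import now
from utils import StaticTuple

alias num_chunks = 8  # placeholder; the real code derives this from the input

fn main():
    # Per-worker counters, declared OUTSIDE the @parameter closure.
    var total = StaticTuple[num_chunks, Int64]()
    for i in range(num_chunks):
        total[i] = 0

    @parameter
    fn read_chunk(bound_idx: Int):
        # Incrementing the captured slot on every iteration is the write
        # that slows the run down ~10x; a counter declared inside the
        # closure doesn't have this problem.
        for _ in range(100_000_000):
            total[bound_idx] = total[bound_idx] + 1

    var start = now()
    parallelize[read_chunk](num_chunks)
    print("elapsed:", Float64(now() - start) / 1e9, "s")
```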
8 Replies
1) Are your writes staying in bounds? I have len(bounds) = 11 on my machine, so if I change the tuple size to 16 it is fast.
2) parallelize takes a second argument to set the number of workers, which you probably want to be num_performance_cores(). Other values may work well too, but that is a good default. On my machine it cut the time in half.
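For example (a fragment; it assumes your existing bounds and read_chunk):

```mojo
from sys.info import num_performance_cores

# One work item per chunk, capped at the number of performance cores.
parallelize[read_chunk](len(bounds), num_performance_cores())
```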
If I set the number of workers to 1 (no parallelism) and use the @always_inline decorator, the code runs in under a millisecond. Probably a quirk: because the function isn't doing anything, all the time is spent setting up the parallel threads.
I'm using @always_inline, and without the increment of total[bound_idx] it takes less than 1ms.
With total[bound_idx] = total[bound_idx] + 1 in the code, it reaches 3s 😫
I've increased the tuple size to 16 and added num_performance_cores(), which brought it down to 1.5s. So it goes faster, but 1.5s just to increment a counter seems like too much...
Why are you not using SIMD?
var total = SIMD[DType.int64, 8]()
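Meaning something like this, one Int64 lane per worker (a sketch only; whether it avoids the slow writes isn't verified here):

```mojo
fn main():
    # One Int64 counter lane per worker, splat-initialized to zero.
    var total = SIMD[DType.int64, 8](0)

    # Stand-in for what each worker would do with its own lane:
    for bound_idx in range(8):
        total[bound_idx] = total[bound_idx] + 1

    # Combine the per-lane counts once all workers are done.
    print(total.reduce_add())  # prints 8 here
```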
With StaticTuple it crashes on my computer:
[19990:19990:20240229,105942.389187:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
With StaticTuple[16, ... it works.
What is the number of cores on your machine? Or the length of bounds? On my machine with 14 cores I stay in bounds with a tuple of size 16.
@asosoman, did this get resolved so your code runs fast when writing to total?
@Michael K Sorry, was out for a couple of days.
I still don't understand why it behaves this way; I stripped out everything not needed.
From my understanding, both functions should take similar time, as each just increments one value.
As it is below, the int version takes 0.0009s and the tuple version takes 0.68s.
If I increase parallelize to 2 workers (parallelize[read_chunk](2)), they go to 0.0009s and 3.36s...
And with higher worker counts it gets worse and worse... 😦
I'm really lost, as I can't understand why updating the tuple takes that much time 😦
But also, I don't see a way of using 8 workers and keeping track of things without a tuple, storing each worker's count at its own index.
Could you try this simple code and give me your timed results?
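Approximately this (a reconstruction from the description above, not the verbatim snippet):

```mojo
from algorithm import parallelize
from time import now
from utils import StaticTuple

alias iters = 100_000_000

fn main():
    # Int version: bump a variable declared INSIDE the closure.
    @parameter
    fn read_chunk_int(bound_idx: Int):
        var count: Int64 = 0
        for _ in range(iters):
            count += 1
        # count is dead here, so the compiler may delete the whole loop,
        # which would explain the ~0.0009s timing.

    # Tuple version: bump a slot of a tuple captured from outside.
    var total = StaticTuple[16, Int64]()
    for i in range(16):
        total[i] = 0

    @parameter
    fn read_chunk_tuple(bound_idx: Int):
        for _ in range(iters):
            total[bound_idx] = total[bound_idx] + 1

    var start = now()
    parallelize[read_chunk_int](2)
    print("int version:  ", Float64(now() - start) / 1e9, "s")

    start = now()
    parallelize[read_chunk_tuple](2)
    print("tuple version:", Float64(now() - start) / 1e9, "s")
```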
My output from your code as is:
If the tuple is behaving the way you describe on your machine, I think you should file a bug report. It seems like a performance cliff that either needs to be documented or fixed.
That said, your code is running a simple operation a huge number of times, so any change in the inner loop can have a huge impact. One slow operation is writing to the heap or stack when a register would do.
total[bound_idx] is loop-invariant and always points to the same location, so read from it once and write to it once, outside of the inner loop. With read_chunk like this:
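(a sketch of the restructure, reusing the names from the snippet above; it drops into the same main as before)

```mojo
@parameter
fn read_chunk(bound_idx: Int):
    # Accumulate in a local variable the compiler can keep in a register.
    var count: Int64 = 0
    for _ in range(iters):
        count += 1
    # total[bound_idx] is loop-invariant: one read and one write,
    # both outside the hot loop.
    total[bound_idx] = total[bound_idx] + count
```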
produces:
Thanks a lot! 🙂
I have a lot to learn about new ways of doing things 😄