Modular•10mo ago
asosoman

Parallelize help (running time increases by a factor of 10 when adding a "var += 1").

I'm trying to do 1BRC in Mojo and I'm at the code-optimization stage. I've reduced the code to a minimum; if someone could try it, it would be great to confirm it's not my system. In a function that uses parallelize, modifying a value from inside the @parameter function, where the value is declared outside that function, slows down the execution by a lot. Am I missing something obvious?
```mojo
# TEST FOR SPEED
from algorithm import parallelize
from time import now
from sys.info import num_physical_cores

fn read_file(bounds: DynamicVector[Int]):
    var total: StaticTuple[8, Int64] = StaticTuple[8, Int64](0, 0, 0, 0, 0, 0, 0, 0)

    @parameter
    fn read_chunk(bound_idx: Int):
        var offset: Int = bounds[bound_idx]
        var r: Int = bounds[bound_idx + 1]
        while True:
            # IF I UNCOMMENT THE NEXT LINE, EXECUTION TIME GOES FROM 0.001 s TO MORE THAN 2 s
            # total[bound_idx] = total[bound_idx] + 1
            offset += 1
            if offset + 1 >= r:
                break

    parallelize[read_chunk](len(bounds) - 1)
    return

fn main() raises:
    var bounds = DynamicVector[Int]()
    for i in range(0, num_physical_cores() + 1):
        var offset = int(100_000_000 / num_physical_cores() * i)
        bounds.append(offset)

    var start: Int = now()
    read_file(bounds)
    print("Total time:", (now() - start) / 1_000_000_000)
```
8 Replies
Michael K
Michael K•10mo ago
1) Are your writes staying in bounds? I have len(bounds) = 11 on my machine, so if I change the tuple size to 16 it is fast. 2) parallelize takes a second argument to set the number of workers, which you probably want to be num_performance_cores(). Other values may work well too, but that is a good default; on my machine it cut the time in half. If I set the number of workers to 1 (no parallelism) and use the @always_inline decorator, the code runs in under a millisecond. That is probably a quirk: the function isn't doing anything, so all the time goes into setting up the parallel threads.
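The bounds check in point 1 is just arithmetic on the chunk count, and can be sketched in Python (needed_slots is a hypothetical helper name, not from the original code):

```python
# parallelize runs read_chunk for bound_idx in 0 .. len(bounds) - 2, so the
# accumulator tuple needs at least len(bounds) - 1 slots to stay in bounds.
def needed_slots(num_cores: int) -> int:
    # mirrors the bounds construction in main(): one offset per core, plus one
    bounds = [int(100_000_000 / num_cores * i) for i in range(num_cores + 1)]
    return len(bounds) - 1  # one chunk per pair of adjacent bounds

assert needed_slots(8) == 8    # StaticTuple[8, ...] is just enough
assert needed_slots(10) == 10  # ...but writes out of bounds with more cores
```

So on any machine where len(bounds) exceeds 9, writing total[bound_idx] with a StaticTuple of size 8 is an out-of-bounds write.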
asosoman
asosomanOP•10mo ago
I'm using @always_inline, and without the increment of total[bound_idx] it takes less than 1 ms. With total[bound_idx] = total[bound_idx] + 1 in the code, it reaches 3 s 😫
I have increased the tuple to 16 and added num_performance_cores(), which brought it down to 1.5 s. So it goes faster, but 1.5 s just to increment a counter seems too much...
Rty Li
Rty Li•10mo ago
Why are you not using SIMD: var total = SIMD[DType.int64, 8]()? The StaticTuple version crashes on my computer: [19990:19990:20240229,105942.389187:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2). With StaticTuple[16,... it works.
Michael K
Michael K•10mo ago
What is the number of cores on your machine, or the length of bounds? On my machine with 14 cores I stay in bounds with a tuple of size 16. @asosoman, did this get resolved so that yours runs fast when writing to total?
asosoman
asosomanOP•10mo ago
@Michael K Sorry, I was out for a couple of days. I still don't understand why it behaves this way; I stripped out everything not needed. From my understanding, both functions should take similar time, since they only increment one value. As it is below, the int version takes 0.0009 s and the tuple version takes 0.68 s. If I increase parallelize to 2 workers (parallelize[read_chunk](2)), they go to 0.0009 and 3.36... and with more workers it gets worse and worse... 😦 I'm really lost, as I can't understand why updating the tuple takes that much 😦 But also, I don't see a way of using 8 workers and keeping track of things without a tuple, storing each worker's count at its own index. Could you try this simple code and give me your timed results?
```mojo
from algorithm import parallelize
from time import now

fn read_file_tuple():
    var total: StaticTuple[8, Int64] = StaticTuple[8, Int64](0, 0, 0, 0, 0, 0, 0, 0)

    @parameter
    fn read_chunk(bound_idx: Int):
        var offset: Int = 100_000_000
        while offset != 0:
            total[bound_idx] += 1
            offset -= 1

    parallelize[read_chunk](1)
    print(total[0])
    return

fn read_file_int():
    var total: Int = 0

    @parameter
    fn read_chunk(bound_idx: Int):
        var offset: Int = 100_000_000
        while offset != 0:
            total += 1
            offset -= 1

    parallelize[read_chunk](1)
    print(total)
    return

fn main() raises:
    var start: Int = now()
    read_file_int()
    print("Total time int:", (now() - start) / 1_000_000_000)
    start = now()
    read_file_tuple()
    print("Total time tuple:", (now() - start) / 1_000_000_000)
```
Michael K
Michael K•10mo ago
My output from your code as is:
100000000
Total time int: 9.7999999999999997e-05
100000000
Total time tuple: 0.35048099999999999
If the tuple is behaving as you say on your machine, I think you should file a bug report. It seems like a performance cliff that either needs to be documented or fixed. That said, your code is running a simple operation a huge number of times, so any change in the inner loop can have a huge impact. One slow operation is writing to the heap or stack when a register will do. total[bound_idx] is loop-invariant and always points to the same location, so read from it once and write to it once, outside of the inner loop. With read_chunk like this:
```mojo
@parameter
fn read_chunk(bound_idx: Int):
    var offset: Int = 100_000_000
    var t = total[bound_idx]
    while offset != 0:
        t += 1
        offset -= 1
    total[bound_idx] = t
```
produces:
100000000
Total time int: 0.00032699999999999998
100000000
Total time tuple: 0.00025900000000000001
asosoman
asosomanOP•10mo ago
Thanks a lot! 🙂 I have a lot to learn about new ways of doing things 😄