1 Billion nested loop iterations Mojo Implementation - seeking feedback

Hey everyone, first time writing mojo and I am seeking feedback for my implementation on 1 Billion nested loop iterations, it's making its rounds on X and I thought I give it go in Mojo. Here is my current working branch was wondering if there is a better way to do this? The goal is not to implement the fastest way but the most "straight forward". I had a hard time figuring out how to initialize a list of 10,000 elements to 0 without doing another loop.
45 Replies
Tyler Hillery
Tyler HilleryOP2mo ago
here are my results
languages/loops git:(tyler/feat/add-mojo) ✗
➜ neofetch
'c.
.OMMMMo ------------------------------
OMMM0, OS: macOS 14.6.1 23G93 arm64
.;loddo:' loolloddol;. Host: Mac15,3
cKMMMMMMMMMMNWMMMMMMMMMM0: Kernel: 23.6.0
.KMMMMMMMMMMMMMMMMMMMMMMMWd. Uptime: 3 days, 2 hours, 39 mins
XMMMMMMMMMMMMMMMMMMMMMMMX. Packages: 56 (brew)
;MMMMMMMMMMMMMMMMMMMMMMMM: Shell: zsh 5.9
:MMMMMMMMMMMMMMMMMMMMMMMM: Resolution: 2560x1440, 1512x982
.MMMMMMMMMMMMMMMMMMMMMMMMX. DE: Aqua
kMMMMMMMMMMMMMMMMMMMMMMMMWd. WM: Rectangle
.XMMMMMMMMMMMMMMMMMMMMMMMMMMk Terminal: vscode
.XMMMMMMMMMMMMMMMMMMMMMMMMK. CPU: Apple M3
kMMMMMMMMMMMMMMMMMMMMMMd GPU: Apple M3
;KMMMMMMMWXXWMMMMMMMk. Memory: 4251MiB / 24576MiB
.cooc,. .,coo:.




languages/loops git:(tyler/feat/add-mojo)
➜ bash ../compile.sh
languages/loops git:(tyler/feat/add-mojo) ✗
➜ bash ../run.sh

Benchmarking Mojo
Benchmark 1: ./mojo/code 40
Time (mean ± σ): 778.6 ms ± 10.9 ms [User: 769.5 ms, System: 3.6 ms]
Range (min … max): 769.9 ms … 790.8 ms 3 runs
languages/loops git:(tyler/feat/add-mojo) ✗
➜ neofetch
'c.
.OMMMMo ------------------------------
OMMM0, OS: macOS 14.6.1 23G93 arm64
.;loddo:' loolloddol;. Host: Mac15,3
cKMMMMMMMMMMNWMMMMMMMMMM0: Kernel: 23.6.0
.KMMMMMMMMMMMMMMMMMMMMMMMWd. Uptime: 3 days, 2 hours, 39 mins
XMMMMMMMMMMMMMMMMMMMMMMMX. Packages: 56 (brew)
;MMMMMMMMMMMMMMMMMMMMMMMM: Shell: zsh 5.9
:MMMMMMMMMMMMMMMMMMMMMMMM: Resolution: 2560x1440, 1512x982
.MMMMMMMMMMMMMMMMMMMMMMMMX. DE: Aqua
kMMMMMMMMMMMMMMMMMMMMMMMMWd. WM: Rectangle
.XMMMMMMMMMMMMMMMMMMMMMMMMMMk Terminal: vscode
.XMMMMMMMMMMMMMMMMMMMMMMMMK. CPU: Apple M3
kMMMMMMMMMMMMMMMMMMMMMMd GPU: Apple M3
;KMMMMMMMWXXWMMMMMMMk. Memory: 4251MiB / 24576MiB
.cooc,. .,coo:.




languages/loops git:(tyler/feat/add-mojo)
➜ bash ../compile.sh
languages/loops git:(tyler/feat/add-mojo) ✗
➜ bash ../run.sh

Benchmarking Mojo
Benchmark 1: ./mojo/code 40
Time (mean ± σ): 778.6 ms ± 10.9 ms [User: 769.5 ms, System: 3.6 ms]
Range (min … max): 769.9 ms … 790.8 ms 3 runs
➜ magic --version
magic 0.5.0 - (based on pixi 0.37.0)

➜ magic run mojo --version
mojo 24.5.0 (e8aacb95)
➜ magic --version
magic 0.5.0 - (based on pixi 0.37.0)

➜ magic run mojo --version
mojo 24.5.0 (e8aacb95)
import sys
import random

fn main() raises:
var size = 10000
var u = atol(sys.argv()[1]) # Get an input number from the command line
var r = int(random.random_ui64(0, size)) # Get a random number 0 <= r < 10k
var a = List[Int]() # Array of 10k elements initialized to 0
for _ in range(size):
a.append(0)
for i in range(10000): # 10k outer loop iterations
for j in range(100000): # 100k inner loop iterations, per outer loop iteration
a[i] += j%u # Simple sum
a[i] += r # Add a random value to each element in array
print(a[r]) # Print out a single element from the array
import sys
import random

fn main() raises:
var size = 10000
var u = atol(sys.argv()[1]) # Get an input number from the command line
var r = int(random.random_ui64(0, size)) # Get a random number 0 <= r < 10k
var a = List[Int]() # Array of 10k elements initialized to 0
for _ in range(size):
a.append(0)
for i in range(10000): # 10k outer loop iterations
for j in range(100000): # 100k inner loop iterations, per outer loop iteration
a[i] += j%u # Simple sum
a[i] += r # Add a random value to each element in array
print(a[r]) # Print out a single element from the array
Tyler Hillery
Tyler HilleryOP2mo ago
Lot of people making fun of how long python takes, I think it would be nice to show an implementation that looks similar to python to demonstrate how much faster your code is with very little changes
GitHub
languages/loops/py/code.py at main · bddicken/languages
Compare languages. Contribute to bddicken/languages development by creating an account on GitHub.
btokarzewski
btokarzewski2mo ago
:mojo:
Manuel Saelices
Manuel Saelices2mo ago
I would change var a = List[Int]() with var a = List[Int](capacity=size) just to prevent an memory allocation on each a.append(0) sentence Also, you can add hint_trivial_type=True Meaning, var a = List[Int, hint_trivial_type=True](capacity=size) Also, I would make sure you are using the nightly version, as it's faster than the current stable one And will be even faster soon CC/ @Martin Vuyk
Martin Vuyk
Martin Vuyk2mo ago
thou hath summoned me I'd like to not have to show the world that we so liberally use UnsafePointer but we still need to define some other stuff before we can migrate away towards something higher level. We will probably use Span for many things. Having said that, I think this is what you're looking for (if you want it to look as pythonic as possible):
import sys
import random
from memory import memset_zero


def main():
size = 10_000
u = int(sys.argv()[1]) # Get an input number from the command line
r = int(random.random_ui64(0, size)) # Get a random number 0 <= r < 10k
a = List[Int](capacity=size) # Array of 10k elements initialized to 0
memset_zero(a.unsafe_ptr(), size)
for i in range(size): # 10k outer loop iterations
# 100k inner loop iterations, per outer loop iteration
for j in range(100_000):
a[i] += j % u # Simple sum
a[i] += r # Add a random value to each element in array
print(a[r]) # Print out a single element from the array
import sys
import random
from memory import memset_zero


def main():
size = 10_000
u = int(sys.argv()[1]) # Get an input number from the command line
r = int(random.random_ui64(0, size)) # Get a random number 0 <= r < 10k
a = List[Int](capacity=size) # Array of 10k elements initialized to 0
memset_zero(a.unsafe_ptr(), size)
for i in range(size): # 10k outer loop iterations
# 100k inner loop iterations, per outer loop iteration
for j in range(100_000):
a[i] += j % u # Simple sum
a[i] += r # Add a random value to each element in array
print(a[r]) # Print out a single element from the array
This is quite literally Python's weakness, for loops and int lists
Tyler Hillery
Tyler HilleryOP2mo ago
Thanks for the feedback everyone, I had fun dipping my toes into Mojo. I will work on implementing your suggestions and open a PR
Martin Vuyk
Martin Vuyk2mo ago
One more thing, if we want to compare Mojo with C, Zig and Rust, the datatype has to be the same. Python uses infinite precision int, Mojo uses platform dependent Int (for x86-64 it's Int64), but I saw the code for those other languages and it's UInt32.
Darkmatter
Darkmatter2mo ago
C is Int32
Martin Vuyk
Martin Vuyk2mo ago
potato potato for this kind of operations a the cpu level afaik
Darkmatter
Darkmatter2mo ago
Signed vs unsigned has a lot of disparity in available vector ops.
Martin Vuyk
Martin Vuyk2mo ago
just for modulo and add? I don't see much perf difference between Fortran, C, Zig and Rust in the website
Darkmatter
Darkmatter2mo ago
Look at the NEON instruction sets
Tyler Hillery
Tyler HilleryOP2mo ago
Updated results using code above and switching to nightly. My machine is different than the one used in the official repo as well.
➜ bash ../run.sh

Benchmarking Mojo
Benchmark 1: ./mojo/code 40
Time (mean ± σ): 774.5 ms ± 10.0 ms [User: 767.7 ms, System: 3.0 ms]
Range (min … max): 767.6 ms … 786.0 ms 3 runs

➜ magic run mojo --version
mojo 24.6.0.dev2024120416 (36394a81)
➜ bash ../run.sh

Benchmarking Mojo
Benchmark 1: ./mojo/code 40
Time (mean ± σ): 774.5 ms ± 10.0 ms [User: 767.7 ms, System: 3.0 ms]
Range (min … max): 767.6 ms … 786.0 ms 3 runs

➜ magic run mojo --version
mojo 24.6.0.dev2024120416 (36394a81)
Martin Vuyk
Martin Vuyk2mo ago
Interesting, do you have a way to test C or Zig or Rust in your machine as a comparison point?
Tyler Hillery
Tyler HilleryOP2mo ago
let me set those up real quick
ModularBot
ModularBot2mo ago
Congrats @Tyler Hillery, you just advanced to level 1!
Tyler Hillery
Tyler HilleryOP2mo ago
➜ ../run.sh

Benchmarking C
Benchmark 1: ./c/code 40
Time (mean ± σ): 507.1 ms ± 1.0 ms [User: 505.2 ms, System: 1.4 ms]
Range (min … max): 506.1 ms … 508.0 ms 3 runs


Benchmarking Rust
Benchmark 1: ./rust/target/release/code 40
Time (mean ± σ): 504.6 ms ± 0.8 ms [User: 502.8 ms, System: 1.3 ms]
Range (min … max): 503.8 ms … 505.4 ms 3 runs


Benchmarking Zig
Benchmark 1: ./zig/code 40
Time (mean ± σ): 503.2 ms ± 0.5 ms [User: 501.1 ms, System: 1.5 ms]
Range (min … max): 502.8 ms … 503.8 ms 3 runs


Benchmarking Mojo
Benchmark 1: ./mojo/code 40
Time (mean ± σ): 764.5 ms ± 0.1 ms [User: 762.4 ms, System: 2.1 ms]
Range (min … max): 764.4 ms … 764.5 ms 3 runs
➜ ../run.sh

Benchmarking C
Benchmark 1: ./c/code 40
Time (mean ± σ): 507.1 ms ± 1.0 ms [User: 505.2 ms, System: 1.4 ms]
Range (min … max): 506.1 ms … 508.0 ms 3 runs


Benchmarking Rust
Benchmark 1: ./rust/target/release/code 40
Time (mean ± σ): 504.6 ms ± 0.8 ms [User: 502.8 ms, System: 1.3 ms]
Range (min … max): 503.8 ms … 505.4 ms 3 runs


Benchmarking Zig
Benchmark 1: ./zig/code 40
Time (mean ± σ): 503.2 ms ± 0.5 ms [User: 501.1 ms, System: 1.5 ms]
Range (min … max): 502.8 ms … 503.8 ms 3 runs


Benchmarking Mojo
Benchmark 1: ./mojo/code 40
Time (mean ± σ): 764.5 ms ± 0.1 ms [User: 762.4 ms, System: 2.1 ms]
Range (min … max): 764.4 ms … 764.5 ms 3 runs
Martin Vuyk
Martin Vuyk2mo ago
Do you mind changing List[Int] to List[UInt32] ? and testing that We still don't have many profiling tools so it's mostly just guessing rn 😦
Tyler Hillery
Tyler HilleryOP2mo ago
That seemed to help!
➜ ../run.sh

Benchmarking C
Benchmark 1: ./c/code 40
Time (mean ± σ): 505.6 ms ± 0.1 ms [User: 503.9 ms, System: 1.1 ms]
Range (min … max): 505.5 ms … 505.7 ms 3 runs


Benchmarking Rust
Benchmark 1: ./rust/target/release/code 40
Time (mean ± σ): 504.5 ms ± 0.2 ms [User: 502.7 ms, System: 1.2 ms]
Range (min … max): 504.3 ms … 504.8 ms 3 runs


Benchmarking Zig
Benchmark 1: ./zig/code 40
Time (mean ± σ): 501.7 ms ± 0.3 ms [User: 500.2 ms, System: 1.0 ms]
Range (min … max): 501.3 ms … 502.0 ms 3 runs


Benchmarking Mojo
Benchmark 1: ./mojo/code 40
Time (mean ± σ): 607.4 ms ± 0.3 ms [User: 605.7 ms, System: 1.8 ms]
Range (min … max): 607.2 ms … 607.7 ms 3 runs
➜ ../run.sh

Benchmarking C
Benchmark 1: ./c/code 40
Time (mean ± σ): 505.6 ms ± 0.1 ms [User: 503.9 ms, System: 1.1 ms]
Range (min … max): 505.5 ms … 505.7 ms 3 runs


Benchmarking Rust
Benchmark 1: ./rust/target/release/code 40
Time (mean ± σ): 504.5 ms ± 0.2 ms [User: 502.7 ms, System: 1.2 ms]
Range (min … max): 504.3 ms … 504.8 ms 3 runs


Benchmarking Zig
Benchmark 1: ./zig/code 40
Time (mean ± σ): 501.7 ms ± 0.3 ms [User: 500.2 ms, System: 1.0 ms]
Range (min … max): 501.3 ms … 502.0 ms 3 runs


Benchmarking Mojo
Benchmark 1: ./mojo/code 40
Time (mean ± σ): 607.4 ms ± 0.3 ms [User: 605.7 ms, System: 1.8 ms]
Range (min … max): 607.2 ms … 607.7 ms 3 runs
Martin Vuyk
Martin Vuyk2mo ago
That would put us around 13th place if we assume both CPUs have similar SIMD capabilities (which I think it's safe to assume since both are M3 and took around the same time) I'm mostly guessing, but this line u = int(sys.argv()[1]) # Get an input number from the command line triggers atol which has a lot of logic that has to deal with a lot of edge cases which are only true to comply with Python. I'm not sure if that has enough impact for this difference though
Tyler Hillery
Tyler HilleryOP2mo ago
I'm not adamant on submitting PR, thought it was just a fun way to try out Mojo, if you rather have me not submit that's fine by me. I still think these are awesome results ( speaking as a python dev )
Martin Vuyk
Martin Vuyk2mo ago
It's fine by me either way, I still think its too early for Mojo to start being compared since we are still changing much of the stdlib and figuring stuff out.
I still think these are awesome results ( speaking as a python dev )
That's one of the main goals, providing a simple to use language where the default is high performance and the option to go deeper is always there. Your code wasn't that much different in performance to the end result for example 🙂
sb
sb2mo ago
@Owen Hilyard has one, some updates:
import random
from sys import argv

fn inner(u: Int32) -> Int32:
var jb: Int32 = 100000
var full_iterations = jb / u
var full_iteration_sum = u * (u+1) / 2
var remaining_iterations = jb - (jb / u)
var partial_iteration_sum = remaining_iterations * (remaining_iterations+1) / 2
return (full_iterations * full_iteration_sum) + partial_iteration_sum

fn main() raises:
var u: Int32 = Int32(int(argv()[1]))
var r: Int32 = random.random_si64(0, 10000).cast[DType.int32]()
var loop_value = inner(u) + r
print(loop_value)
import random
from sys import argv

fn inner(u: Int32) -> Int32:
var jb: Int32 = 100000
var full_iterations = jb / u
var full_iteration_sum = u * (u+1) / 2
var remaining_iterations = jb - (jb / u)
var partial_iteration_sum = remaining_iterations * (remaining_iterations+1) / 2
return (full_iterations * full_iteration_sum) + partial_iteration_sum

fn main() raises:
var u: Int32 = Int32(int(argv()[1]))
var r: Int32 = random.random_si64(0, 10000).cast[DType.int32]()
var loop_value = inner(u) + r
print(loop_value)
sb
sb2mo ago
No description
sb
sb2mo ago
(also this is on m1, so may be slow for them) (yes this is operationally equivalent, it is also why we haven't tried opening a PR because it sorta mocks the benchmark itself)
Tyler Hillery
Tyler HilleryOP2mo ago
what sort of wizardary is this results on machine
➜ ../run.sh

Benchmarking C
Benchmark 1: ./c/code 40
Time (mean ± σ): 505.8 ms ± 0.4 ms [User: 504.1 ms, System: 1.2 ms]
Range (min … max): 505.5 ms … 506.2 ms 3 runs


Benchmarking Rust
Benchmark 1: ./rust/target/release/code 40
Time (mean ± σ): 504.5 ms ± 0.4 ms [User: 502.6 ms, System: 1.2 ms]
Range (min … max): 504.0 ms … 504.9 ms 3 runs


Benchmarking Zig
Benchmark 1: ./zig/code 40
Time (mean ± σ): 501.9 ms ± 0.4 ms [User: 500.1 ms, System: 1.2 ms]
Range (min … max): 501.7 ms … 502.4 ms 3 runs


Benchmarking Mojo
Benchmark 1: ./mojo/code 40
Time (mean ± σ): 2.7 ms ± 0.1 ms [User: 1.7 ms, System: 0.7 ms]
Range (min … max): 2.7 ms … 2.8 ms 3 runs
➜ ../run.sh

Benchmarking C
Benchmark 1: ./c/code 40
Time (mean ± σ): 505.8 ms ± 0.4 ms [User: 504.1 ms, System: 1.2 ms]
Range (min … max): 505.5 ms … 506.2 ms 3 runs


Benchmarking Rust
Benchmark 1: ./rust/target/release/code 40
Time (mean ± σ): 504.5 ms ± 0.4 ms [User: 502.6 ms, System: 1.2 ms]
Range (min … max): 504.0 ms … 504.9 ms 3 runs


Benchmarking Zig
Benchmark 1: ./zig/code 40
Time (mean ± σ): 501.9 ms ± 0.4 ms [User: 500.1 ms, System: 1.2 ms]
Range (min … max): 501.7 ms … 502.4 ms 3 runs


Benchmarking Mojo
Benchmark 1: ./mojo/code 40
Time (mean ± σ): 2.7 ms ± 0.1 ms [User: 1.7 ms, System: 0.7 ms]
Range (min … max): 2.7 ms … 2.8 ms 3 runs
sb
sb2mo ago
it's a modulo so we're just integrating across a sawtooth fn :) I should close some tabs, it might have been swapping while it was running 😬 /s
Darkmatter
Darkmatter2mo ago
I’ll think about submitting it
Tyler Hillery
Tyler HilleryOP2mo ago
Sounds good, I got what I wanted out of this which was to try out Mojo 🙂
Darkmatter
Darkmatter2mo ago
import sys
import random

u = int(sys.argv[1])
r = random.randint(0, 10000)
jb = 100000
full_iterations = jb / u
full_iteration_sum = u * (u+1) / 2
remaining_iterations = jb - (jb / u)
partial_iteration_sum = remaining_iterations * (remaining_iterations+1) / 2
result = (full_iterations * full_iteration_sum) + partial_iteration_sum + r
print(result)
import sys
import random

u = int(sys.argv[1])
r = random.randint(0, 10000)
jb = 100000
full_iterations = jb / u
full_iteration_sum = u * (u+1) / 2
remaining_iterations = jb - (jb / u)
partial_iteration_sum = remaining_iterations * (remaining_iterations+1) / 2
result = (full_iterations * full_iteration_sum) + partial_iteration_sum + r
print(result)
No description
Darkmatter
Darkmatter2mo ago
Look python is faster than C! To be clear, there was an argument in #offtopic about this benchmark, and I don't think is particularly good because it has a zero-iteration solution. But for learning go ahead and optimize.
rd4com
rd4com2mo ago
There is also the Advent of Code in mojo if you'd like to continue 👍 #advent-of-code-2024
Serg Gini
Serg Gini2mo ago
this is different algorithm of course the problem is too simple and constraints from the author are kinda strange.. So I think, it is "bad" to compare absolute values in this bench But it is "good" to investigate your and other language design details based on this bench that took some attention
sb
sb2mo ago
Compilers can, and do, do algorithmic transformations fwiw. A lot of the "magic" of compilers is just matching on idioms/patterns and doing templated replacement
Serg Gini
Serg Gini2mo ago
the idea of the bench to let compiler do the magic, not the dev
Darkmatter
Darkmatter2mo ago
My point is that I'm 3-4 llvm patches away from making C, C++, Rust and Mojo all do this.
Serg Gini
Serg Gini2mo ago
for some optimizations it is possible, for others it is harder..
Darkmatter
Darkmatter2mo ago
This means that for smaller languages with only a few contributors, they could game this benchmark easily with what are realistically general purpose optimization passes.
sb
sb2mo ago
The reason I brought it up was that it's like judging the critical thinking skills of a person based on a single answer they give to a question on Jeopardy--it's a scenario that has some very specific "step" tricks that don't necessarily generalize well and aren't really representative of the overall in-practice effectiveness of the compilers/interpreters tested. That algo from above is just an extreme example to show that.
Serg Gini
Serg Gini2mo ago
for sure nobody should do any significant decisions only on this one microbench)
ModularBot
ModularBot2mo ago
Congrats @Serg Gini, you just advanced to level 5!
sb
sb2mo ago
My worry is that people are, since I've seen this particular bench making the rounds in a bunch of places without much critical discourse And a bench that relies on compilers recognizing different levels of the 99.99+% of work it's doing can be thrown away feels disingenuous in that light
Darkmatter
Darkmatter2mo ago
I think some of this comes from the 1 billion row challenge, which actually was an interesting challenge. CSV parsing a known format is something that is reasonable to bench.
Serg Gini
Serg Gini2mo ago
1brc was interesting yes. I think the problem itself is quite uncommon.. like usually you have thousands smaller CSV/TSV/Json files.. not one big though
Darkmatter
Darkmatter2mo ago
Tons of small ones means you throw map/reduce at it.

Did you find this page helpful?