1 Billion nested loop iterations Mojo Implementation - seeking feedback

Hey everyone, first time writing Mojo and I am seeking feedback on my implementation of 1 Billion nested loop iterations; it's making the rounds on X and I thought I'd give it a go in Mojo. Here is my current working branch. I was wondering if there is a better way to do this? The goal is not to implement the fastest way, but the most "straightforward" one. I had a hard time figuring out how to initialize a list of 10,000 elements to 0 without doing another loop.
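For comparison, this particular pain point doesn't exist in Python, where a zero-filled list needs no explicit loop (an illustrative snippet, not code from the benchmark repo):

```python
# Python builds a zero-initialized list with sequence repetition:
size = 10_000
a = [0] * size          # one allocation, no explicit loop
print(len(a), a[0], a[-1])
```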
45 Replies
Tyler Hillery
Tyler HilleryOP3w ago
here are my results
languages/loops git:(tyler/feat/add-mojo) ✗
➜ neofetch
(neofetch Apple-logo ASCII art omitted)
OS: macOS 14.6.1 23G93 arm64
Host: Mac15,3
Kernel: 23.6.0
Uptime: 3 days, 2 hours, 39 mins
Packages: 56 (brew)
Shell: zsh 5.9
Resolution: 2560x1440, 1512x982
DE: Aqua
WM: Rectangle
Terminal: vscode
CPU: Apple M3
GPU: Apple M3
Memory: 4251MiB / 24576MiB




languages/loops git:(tyler/feat/add-mojo)
➜ bash ../compile.sh
languages/loops git:(tyler/feat/add-mojo) ✗
➜ bash ../run.sh

Benchmarking Mojo
Benchmark 1: ./mojo/code 40
Time (mean ± σ): 778.6 ms ± 10.9 ms [User: 769.5 ms, System: 3.6 ms]
Range (min … max): 769.9 ms … 790.8 ms 3 runs
➜ magic --version
magic 0.5.0 - (based on pixi 0.37.0)

➜ magic run mojo --version
mojo 24.5.0 (e8aacb95)
import sys
import random

fn main() raises:
    var size = 10000
    var u = atol(sys.argv()[1])  # Get an input number from the command line
    var r = int(random.random_ui64(0, size))  # Get a random number 0 <= r < 10k
    var a = List[Int]()  # Array of 10k elements initialized to 0
    for _ in range(size):
        a.append(0)
    for i in range(10000):  # 10k outer loop iterations
        for j in range(100000):  # 100k inner loop iterations, per outer loop iteration
            a[i] += j % u  # Simple sum
        a[i] += r  # Add a random value to each element in array
    print(a[r])  # Print out a single element from the array
Tyler Hillery
Tyler HilleryOP3w ago
Lots of people are making fun of how long Python takes. I think it would be nice to show an implementation that looks similar to Python, to demonstrate how much faster your code is with very few changes
GitHub
languages/loops/py/code.py at main · bddicken/languages
btokarzewski
btokarzewski3w ago
:mojo:
Manuel Saelices
I would change var a = List[Int]() to var a = List[Int](capacity=size), just to prevent a memory allocation on each a.append(0) call. Also, you can add hint_trivial_type=True, meaning var a = List[Int, hint_trivial_type=True](capacity=size). Also, I would make sure you are using the nightly version, as it's faster than the current stable one, and it will be even faster soon. CC/ @Martin Vuyk
Martin Vuyk
Martin Vuyk3w ago
thou hath summoned me. I'd like to not have to show the world that we so liberally use UnsafePointer, but we still need to define some other stuff before we can migrate away towards something higher level. We will probably use Span for many things. Having said that, I think this is what you're looking for (if you want it to look as Pythonic as possible):
import sys
import random
from memory import memset_zero


def main():
    size = 10_000
    u = int(sys.argv()[1])  # Get an input number from the command line
    r = int(random.random_ui64(0, size))  # Get a random number 0 <= r < 10k
    a = List[Int](capacity=size)  # Array of 10k elements initialized to 0
    memset_zero(a.unsafe_ptr(), size)
    for i in range(size):  # 10k outer loop iterations
        # 100k inner loop iterations, per outer loop iteration
        for j in range(100_000):
            a[i] += j % u  # Simple sum
        a[i] += r  # Add a random value to each element in array
    print(a[r])  # Print out a single element from the array
This is quite literally Python's weakness, for loops and int lists
Tyler Hillery
Tyler HilleryOP3w ago
Thanks for the feedback everyone, I had fun dipping my toes into Mojo. I will work on implementing your suggestions and open a PR
Martin Vuyk
Martin Vuyk3w ago
One more thing: if we want to compare Mojo with C, Zig, and Rust, the datatype has to be the same. Python uses arbitrary-precision int, Mojo uses platform-dependent Int (for x86-64 it's Int64), but I saw the code for those other languages and it's UInt32.
Darkmatter
Darkmatter3w ago
C is Int32
Martin Vuyk
Martin Vuyk3w ago
potato potato for this kind of operation at the CPU level, afaik
Darkmatter
Darkmatter3w ago
Signed vs unsigned has a lot of disparity in available vector ops.
Martin Vuyk
Martin Vuyk3w ago
just for modulo and add? I don't see much perf difference between Fortran, C, Zig and Rust on the website
Darkmatter
Darkmatter3w ago
Look at the NEON instruction sets
Tyler Hillery
Tyler HilleryOP3w ago
Updated results using the code above and switching to nightly. My machine is different from the one used in the official repo as well.
➜ bash ../run.sh

Benchmarking Mojo
Benchmark 1: ./mojo/code 40
Time (mean ± σ): 774.5 ms ± 10.0 ms [User: 767.7 ms, System: 3.0 ms]
Range (min … max): 767.6 ms … 786.0 ms 3 runs

➜ magic run mojo --version
mojo 24.6.0.dev2024120416 (36394a81)
Martin Vuyk
Martin Vuyk3w ago
Interesting, do you have a way to test C or Zig or Rust in your machine as a comparison point?
Tyler Hillery
Tyler HilleryOP3w ago
let me set those up real quick
ModularBot
ModularBot3w ago
Congrats @Tyler Hillery, you just advanced to level 1!
Tyler Hillery
Tyler HilleryOP3w ago
➜ ../run.sh

Benchmarking C
Benchmark 1: ./c/code 40
Time (mean ± σ): 507.1 ms ± 1.0 ms [User: 505.2 ms, System: 1.4 ms]
Range (min … max): 506.1 ms … 508.0 ms 3 runs


Benchmarking Rust
Benchmark 1: ./rust/target/release/code 40
Time (mean ± σ): 504.6 ms ± 0.8 ms [User: 502.8 ms, System: 1.3 ms]
Range (min … max): 503.8 ms … 505.4 ms 3 runs


Benchmarking Zig
Benchmark 1: ./zig/code 40
Time (mean ± σ): 503.2 ms ± 0.5 ms [User: 501.1 ms, System: 1.5 ms]
Range (min … max): 502.8 ms … 503.8 ms 3 runs


Benchmarking Mojo
Benchmark 1: ./mojo/code 40
Time (mean ± σ): 764.5 ms ± 0.1 ms [User: 762.4 ms, System: 2.1 ms]
Range (min … max): 764.4 ms … 764.5 ms 3 runs
Martin Vuyk
Martin Vuyk3w ago
Do you mind changing List[Int] to List[UInt32] and testing that? We still don't have many profiling tools, so it's mostly just guessing rn 😦
Tyler Hillery
Tyler HilleryOP3w ago
That seemed to help!
➜ ../run.sh

Benchmarking C
Benchmark 1: ./c/code 40
Time (mean ± σ): 505.6 ms ± 0.1 ms [User: 503.9 ms, System: 1.1 ms]
Range (min … max): 505.5 ms … 505.7 ms 3 runs


Benchmarking Rust
Benchmark 1: ./rust/target/release/code 40
Time (mean ± σ): 504.5 ms ± 0.2 ms [User: 502.7 ms, System: 1.2 ms]
Range (min … max): 504.3 ms … 504.8 ms 3 runs


Benchmarking Zig
Benchmark 1: ./zig/code 40
Time (mean ± σ): 501.7 ms ± 0.3 ms [User: 500.2 ms, System: 1.0 ms]
Range (min … max): 501.3 ms … 502.0 ms 3 runs


Benchmarking Mojo
Benchmark 1: ./mojo/code 40
Time (mean ± σ): 607.4 ms ± 0.3 ms [User: 605.7 ms, System: 1.8 ms]
Range (min … max): 607.2 ms … 607.7 ms 3 runs
Martin Vuyk
Martin Vuyk3w ago
That would put us around 13th place, if we assume both CPUs have similar SIMD capabilities (which I think is safe to assume, since both are M3 and took around the same time). I'm mostly guessing, but this line: u = int(sys.argv()[1]) # Get an input number from the command line triggers atol, which has a lot of logic dealing with edge cases that only exist to comply with Python. I'm not sure if that has enough impact to explain this difference though
Tyler Hillery
Tyler HilleryOP3w ago
I'm not adamant on submitting a PR; I thought it was just a fun way to try out Mojo, so if you'd rather I not submit, that's fine by me. I still think these are awesome results (speaking as a Python dev)
Martin Vuyk
Martin Vuyk3w ago
It's fine by me either way. I still think it's too early for Mojo to start being compared, since we are still changing much of the stdlib and figuring stuff out.
I still think these are awesome results ( speaking as a python dev )
That's one of the main goals: providing a simple-to-use language where the default is high performance and the option to go deeper is always there. Your code wasn't that different in performance from the end result, for example 🙂
sb
sb3w ago
@Owen Hilyard has one, some updates:
import random
from sys import argv

fn inner(u: Int32) -> Int32:
    # Closed form for sum_{j=0}^{jb-1} (j % u): complete 0..u-1 cycles
    # plus one partial trailing cycle.
    var jb: Int32 = 100000
    var full_iterations = jb // u              # number of complete cycles
    var full_iteration_sum = u * (u - 1) // 2  # sum of one complete 0..u-1 cycle
    var remaining_iterations = jb % u          # length of the partial cycle
    var partial_iteration_sum = remaining_iterations * (remaining_iterations - 1) // 2
    return (full_iterations * full_iteration_sum) + partial_iteration_sum

fn main() raises:
    var u: Int32 = Int32(int(argv()[1]))
    var r: Int32 = random.random_si64(0, 10000).cast[DType.int32]()
    var loop_value = inner(u) + r
    print(loop_value)
sb
sb3w ago
sb
sb3w ago
(also this is on an M1, so it may be slow for them) (yes, this is operationally equivalent; it is also why we haven't tried opening a PR, because it sorta mocks the benchmark itself)
Tyler Hillery
Tyler HilleryOP3w ago
what sort of wizardry is this? results on my machine:
➜ ../run.sh

Benchmarking C
Benchmark 1: ./c/code 40
Time (mean ± σ): 505.8 ms ± 0.4 ms [User: 504.1 ms, System: 1.2 ms]
Range (min … max): 505.5 ms … 506.2 ms 3 runs


Benchmarking Rust
Benchmark 1: ./rust/target/release/code 40
Time (mean ± σ): 504.5 ms ± 0.4 ms [User: 502.6 ms, System: 1.2 ms]
Range (min … max): 504.0 ms … 504.9 ms 3 runs


Benchmarking Zig
Benchmark 1: ./zig/code 40
Time (mean ± σ): 501.9 ms ± 0.4 ms [User: 500.1 ms, System: 1.2 ms]
Range (min … max): 501.7 ms … 502.4 ms 3 runs


Benchmarking Mojo
Benchmark 1: ./mojo/code 40
Time (mean ± σ): 2.7 ms ± 0.1 ms [User: 1.7 ms, System: 0.7 ms]
Range (min … max): 2.7 ms … 2.8 ms 3 runs
sb
sb3w ago
it's a modulo, so we're just integrating across a sawtooth fn :) I should close some tabs, it might have been swapping while it was running 😬 /s
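The sawtooth observation can be cross-checked in a few lines of Python (illustrative, not code from the thread): the inner loop sums j % u over j = 0..99999, which is a number of complete 0..u-1 sawtooth cycles plus one partial trailing cycle, so the whole thing collapses to a constant-time formula.

```python
# Brute-force inner loop, as in the benchmark.
def inner_loop(u, jb=100_000):
    return sum(j % u for j in range(jb))

# Closed form: full 0..u-1 sawtooth cycles plus a partial trailing cycle.
def inner_closed_form(u, jb=100_000):
    full_cycles = jb // u            # number of complete cycles
    cycle_sum = u * (u - 1) // 2     # sum of one complete 0..u-1 cycle
    rem = jb % u                     # length of the partial cycle
    rem_sum = rem * (rem - 1) // 2   # sum of 0..rem-1
    return full_cycles * cycle_sum + rem_sum

for u in (1, 3, 7, 40, 99_999, 100_000):
    assert inner_loop(u) == inner_closed_form(u)
print(inner_closed_form(40))  # value each a[i] holds before r is added
```

With u = 40 this prints 1950000, matching what every element of the array holds after the inner loop, which is why the benchmark has a zero-iteration solution.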
Darkmatter
Darkmatter3w ago
I’ll think about submitting it
Tyler Hillery
Tyler HilleryOP3w ago
Sounds good, I got what I wanted out of this which was to try out Mojo 🙂
Darkmatter
Darkmatter3w ago
import sys
import random

u = int(sys.argv[1])
r = random.randint(0, 10000)
jb = 100000
full_iterations = jb // u                # number of complete 0..u-1 cycles of j % u
full_iteration_sum = u * (u - 1) // 2    # sum of one complete cycle
remaining_iterations = jb % u            # length of the partial cycle
partial_iteration_sum = remaining_iterations * (remaining_iterations - 1) // 2
result = (full_iterations * full_iteration_sum) + partial_iteration_sum + r
print(result)
Darkmatter
Darkmatter3w ago
Look, Python is faster than C! To be clear, there was an argument in #offtopic about this benchmark, and I don't think it is particularly good, because it has a zero-iteration solution. But for learning, go ahead and optimize.
rd4com
rd4com3w ago
There is also the Advent of Code in mojo if you'd like to continue 👍 #advent-of-code-2024
Serg Gini
Serg Gini2w ago
this is a different algorithm, of course. The problem is too simple and the constraints from the author are kinda strange.. So I think it is "bad" to compare absolute values in this bench, but it is "good" to investigate your language's and other languages' design details based on this bench that got some attention
sb
sb2w ago
Compilers can, and do, do algorithmic transformations fwiw. A lot of the "magic" of compilers is just matching on idioms/patterns and doing templated replacement
Serg Gini
Serg Gini2w ago
the idea of the bench is to let the compiler do the magic, not the dev
Darkmatter
Darkmatter2w ago
My point is that I'm 3-4 LLVM patches away from making C, C++, Rust and Mojo all do this.
Serg Gini
Serg Gini2w ago
for some optimizations it is possible, for others it is harder..
Darkmatter
Darkmatter2w ago
This means that for smaller languages with only a few contributors, they could game this benchmark easily with what are realistically general purpose optimization passes.
sb
sb2w ago
The reason I brought it up was that it's like judging the critical thinking skills of a person based on a single answer they give to a question on Jeopardy--it's a scenario that has some very specific "step" tricks that don't necessarily generalize well and aren't really representative of the overall in-practice effectiveness of the compilers/interpreters tested. That algo from above is just an extreme example to show that.
Serg Gini
Serg Gini2w ago
for sure nobody should do any significant decisions only on this one microbench)
ModularBot
ModularBot2w ago
Congrats @Serg Gini, you just advanced to level 5!
sb
sb2w ago
My worry is that people are, since I've seen this particular bench making the rounds in a bunch of places without much critical discourse. And a bench that relies on compilers recognizing that 99.99+% of the work it's doing can be thrown away feels disingenuous in that light
Darkmatter
Darkmatter2w ago
I think some of this comes from the 1 billion row challenge, which actually was an interesting challenge. CSV parsing a known format is something that is reasonable to bench.
Serg Gini
Serg Gini2w ago
1brc was interesting, yes. I think the problem itself is quite uncommon though.. usually you have thousands of smaller CSV/TSV/JSON files, not one big one
Darkmatter
Darkmatter2w ago
Tons of small ones means you throw map/reduce at it.
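The map/reduce pattern Darkmatter refers to can be sketched as a toy in Python (illustrative only; the station names and two-column CSV schema are made up for this example, not taken from 1brc or the thread):

```python
import csv
import io
from collections import Counter
from functools import reduce

def map_one(csv_text):
    """Map step: compute per-file partial counts of readings per station."""
    counts = Counter()
    for station, _reading in csv.reader(io.StringIO(csv_text)):
        counts[station] += 1
    return counts

def reduce_two(acc, part):
    """Reduce step: merge a partial result into the accumulator."""
    acc.update(part)  # Counter.update adds counts rather than replacing
    return acc

# Stand-ins for many small CSV files; in practice map_one would run per file,
# possibly in parallel, and the partial results would be reduced at the end.
files = ["s1,1.0\ns2,2.0\n", "s1,3.0\n"]
total = reduce(reduce_two, map(map_one, files), Counter())
print(dict(total))  # {'s1': 2, 's2': 1}
```

Because each map step touches only one file and the reduce step is associative, the same structure scales out to a process pool or a cluster without changing the logic.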