1 Billion nested loop iterations Mojo Implementation - seeking feedback
Hey everyone, first time writing Mojo and I'm seeking feedback on my implementation of 1 billion nested loop iterations. It's making the rounds on X and I thought I'd give it a go in Mojo. Here is my current working branch; I was wondering if there is a better way to do this? The goal is not to implement the fastest version but the most "straightforward" one. I had a hard time figuring out how to initialize a list of 10,000 elements to 0 without writing another loop.
here are my results
A lot of people are making fun of how long Python takes. I think it would be nice to show an implementation that looks similar to Python, to demonstrate how much faster your code is with very few changes.
languages/loops/py/code.py at main · bddicken/languages
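For context, the linked Python benchmark does roughly the following (paraphrased; the `loops` function wrapper and its parameters are my own framing for testability, since the real script is flat and reads `u` from `sys.argv`):

```python
import random

def loops(u: int, outer: int = 10_000, inner: int = 100_000) -> int:
    r = random.randint(0, outer - 1)  # one random index into the list
    a = [0] * outer                   # 10,000 zeros without an explicit loop
    for i in range(outer):
        for j in range(inner):        # outer * inner = 1 billion iterations
            a[i] += j % u
        a[i] += r
    return a[r]
```

The hot path is just `j % u` plus an add, repeated a billion times, which is why the thread below keeps coming back to modulo.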
:mojo:
I would change
var a = List[Int]()
to var a = List[Int](capacity=size)
just to prevent a memory reallocation on each a.append(0)
Also, you can add hint_trivial_type=True
Meaning, var a = List[Int, hint_trivial_type=True](capacity=size)
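For readers coming from Python: CPython lists manage capacity internally, so there's no exact analogue of Mojo's capacity hint, but the principle is the same. A sketch (illustrative only):

```python
size = 10_000

# Growing by append: the backing buffer gets reallocated
# periodically as it fills up.
a = []
for _ in range(size):
    a.append(0)

# Allocating up front: a single allocation, and also the idiomatic
# answer to "10,000 zeros without writing a loop".
b = [0] * size

assert a == b and len(b) == size
```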
Also, I would make sure you are using the nightly version, as it's faster than the current stable one
And will be even faster soon. CC @Martin Vuyk
thou hath summoned me
I'd like to not have to show the world that we so liberally use
UnsafePointer
but we still need to define some other stuff before we can migrate away towards something higher level. We will probably use Span
for many things.
Having said that, I think this is what you're looking for (if you want it to look as pythonic as possible):
This is quite literally Python's weakness: for loops and int lists.
Thanks for the feedback everyone, I had fun dipping my toes into Mojo. I will work on implementing your suggestions and open a PR.
One more thing: if we want to compare Mojo with C, Zig, and Rust, the datatype has to be the same. Python uses infinite-precision int; Mojo uses platform-dependent Int (on x86-64 it's Int64), but I saw the code for those other languages and it's UInt32.
C is Int32
potato potato, for this kind of operation at the CPU level afaik
Signed vs unsigned has a lot of disparity in available vector ops.
just for modulo and add? I don't see much perf difference between Fortran, C, Zig and Rust in the website
Look at the NEON instruction sets
Updated results using code above and switching to nightly. My machine is different than the one used in the official repo as well.
Interesting, do you have a way to test C or Zig or Rust on your machine as a comparison point?
let me set those up real quick
Congrats @Tyler Hillery, you just advanced to level 1!
Do you mind changing
List[Int]
to List[UInt32]?
and testing that
We still don't have many profiling tools so it's mostly just guessing rn 😦
That seemed to help!
That would put us around 13th place if we assume both CPUs have similar SIMD capabilities (which I think is safe to assume, since both are M3 and took around the same time)
I'm mostly guessing, but this line
u = int(sys.argv()[1]) # Get an input number from the command line
triggers atol
which has a lot of logic to deal with a lot of edge cases that exist only to comply with Python. I'm not sure if that has enough impact to explain this difference though.
I'm not adamant about submitting a PR; I thought this was just a fun way to try out Mojo. If you'd rather I not submit, that's fine by me. I still think these are awesome results (speaking as a Python dev).
It's fine by me either way. I still think it's too early for Mojo to start being compared, since we are still changing much of the stdlib and figuring stuff out.
> I still think these are awesome results (speaking as a Python dev)
That's one of the main goals: providing a simple-to-use language where the default is high performance and the option to go deeper is always there. Your code wasn't that different in performance from the end result, for example 🙂
@Owen Hilyard has one, some updates:
(also this is on m1, so may be slow for them)
(yes this is operationally equivalent, it is also why we haven't tried opening a PR because it sorta mocks the benchmark itself)
what sort of wizardry is this
results on machine
it's a modulo so we're just integrating across a sawtooth fn :)
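To spell that out (my own sketch of the trick, under my reading of the benchmark): the inner loop sums `j % u` over a fixed range, which is just the area under a sawtooth and has a closed form, so the whole billion-iteration workload collapses to arithmetic:

```python
def sawtooth_sum(n: int, u: int) -> int:
    # Sum of j % u for j in range(n), with no loop:
    # each full period of u values contributes u*(u-1)/2,
    # plus a partial tail of 0 + 1 + ... + (rem - 1).
    q, rem = divmod(n, u)
    return q * (u * (u - 1) // 2) + rem * (rem - 1) // 2

# Sanity check against the brute-force inner loop
for u in range(1, 50):
    assert sawtooth_sum(1000, u) == sum(j % u for j in range(1000))
```

Every outer iteration produces the same sum, which is why this benchmark has a "zero-iteration" solution once a compiler (or a human) spots the pattern.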
I should close some tabs, it might have been swapping while it was running 😬 /s
I’ll think about submitting it
Sounds good, I got what I wanted out of this which was to try out Mojo 🙂
Look, Python is faster than C!
To be clear, there was an argument in #offtopic about this benchmark, and I don't think it is particularly good because it has a zero-iteration solution.
But for learning go ahead and optimize.
There is also the Advent of Code in Mojo if you'd like to continue 👍
#advent-of-code-2024
this is a different algorithm
of course the problem is too simple and constraints from the author are kinda strange..
So I think it is "bad" to compare absolute values in this bench.
But it is "good" to use this bench, which attracted some attention, to investigate design details of your language and others.
Compilers can, and do, do algorithmic transformations fwiw. A lot of the "magic" of compilers is just matching on idioms/patterns and doing templated replacement
the idea of the bench is to let the compiler do the magic, not the dev
My point is that I'm 3-4 llvm patches away from making C, C++, Rust and Mojo all do this.
for some optimizations it is possible, for others it is harder..
This means that for smaller languages with only a few contributors, they could game this benchmark easily with what are realistically general purpose optimization passes.
The reason I brought it up was that it's like judging the critical thinking skills of a person based on a single answer they give to a question on Jeopardy: it's a scenario that has some very specific "step" tricks that don't necessarily generalize well and aren't really representative of the overall in-practice effectiveness of the compilers/interpreters tested. That algo from above is just an extreme example to show that.
for sure nobody should do any significant decisions only on this one microbench)
Congrats @Serg Gini, you just advanced to level 5!
My worry is that people are, since I've seen this particular bench making the rounds in a bunch of places without much critical discourse
And a bench that relies on compilers recognizing, to varying degrees, that 99.99+% of the work it's doing can be thrown away feels disingenuous in that light
I think some of this comes from the 1 billion row challenge, which actually was an interesting challenge.
CSV parsing a known format is something that is reasonable to bench.
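For instance, the 1BRC task boils down to something like this (a toy sketch; the station names and values are made up, and the real challenge streams a one-billion-row file rather than an in-memory list):

```python
from collections import defaultdict

rows = ["Hamburg;12.0", "Bulawayo;8.9", "Hamburg;34.2"]  # made-up sample

# Per station: [min, max, sum, count]
stats = defaultdict(lambda: [float("inf"), float("-inf"), 0.0, 0])
for row in rows:
    station, temp = row.split(";")
    t = float(temp)
    s = stats[station]
    s[0] = min(s[0], t)
    s[1] = max(s[1], t)
    s[2] += t
    s[3] += 1

for station in sorted(stats):
    low, high, total, n = stats[station]
    print(f"{station}: min={low} mean={total / n:.1f} max={high}")
```

Parsing a known delimiter-separated format and aggregating is real, representative work, which is what made that challenge more interesting than a pure arithmetic loop.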
1brc was interesting, yes. I think the problem itself is quite uncommon though... usually you have thousands of smaller CSV/TSV/JSON files, not one big one.
Tons of small ones means you throw map/reduce at it.