Modular•7mo ago

Mojo added to SpeedTests repo on github

I recently came across jabbalaci's SpeedTests https://github.com/jabbalaci/SpeedTests, and I made a PR to add Mojo, which got accepted today! The project's idea is to stick to the basic implementation pattern of the task without using language-specific tricks, so I just followed the Python implementation. It's an insightful list of benchmark results, and Mojo is holding up pretty well. 😉

12 Replies

Caroline Frasca•7mo ago

This is super cool! Mojo does seem to hold up well 🙂

name•7mo ago

Can you update Mojo to the latest version to improve the speed of print? I think it will make a huge difference. https://github.com/modularml/mojo/commit/9cbfa411d1ea0f01d0b7c1732cc4e36f08c1f969

GitHub

[stdlib] Buffer output before printing and writing to file · modula...

This greatly improves the print performance, minimizing syscalls and multiple variadic pack loads. These are results from printing 5 args in a 10k loop on the fast alacritty terminal emulator: ```...

Martin DudekOP•7mo ago

There are only five numbers printed by the program, so this change likely won't have a noticeable impact on performance. When I run the program locally, it seems that the current nightly version is actually performing a bit slower for some reason. Let's wait for the next stable version to see if there’s a noticeable performance improvement. If there is, I'll request the GitHub repository owner to rerun the benchmark with the latest version.
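To illustrate why buffering helps a print-heavy loop but hardly matters for a handful of lines, here is a rough sketch in Python (not Mojo, and not the Mojo stdlib's actual implementation). `CountingSink` and `count_writes` are hypothetical names made up for this example; the sink stands in for a terminal or file and simply counts how many low-level writes reach it.

```python
import io


class CountingSink(io.RawIOBase):
    """Stand-in for a terminal or file: counts the low-level writes it receives."""

    def __init__(self):
        super().__init__()
        self.write_calls = 0

    def writable(self):
        return True

    def write(self, data):
        # Each call here corresponds to one write reaching the "device".
        self.write_calls += 1
        return len(data)


def count_writes(lines, flush_every_line):
    """Print `lines` short lines and return how many writes reached the sink."""
    sink = CountingSink()
    buffered = io.BufferedWriter(sink, buffer_size=1 << 16)
    out = io.TextIOWrapper(buffered, encoding="utf-8")
    for i in range(lines):
        out.write(f"{i}\n")
        if flush_every_line:
            out.flush()  # unbuffered behaviour: one write per printed line
    out.flush()
    return sink.write_calls


print("5 lines, flushed each time:", count_writes(5, True))
print("5 lines, buffered:         ", count_writes(5, False))
print("10k lines, flushed:        ", count_writes(10_000, True))
print("10k lines, buffered:       ", count_writes(10_000, False))
```

With only a few lines of output the two modes differ by a few writes at most, which is consistent with expecting no noticeable change for this benchmark.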

Martin DudekOP•7mo ago

Some insightful comments by @Owen Hilyard and @Martin Vuyk on the various ways this simple task can be implemented have appeared on GitHub: https://github.com/jabbalaci/SpeedTests/pull/63 :mojo:

GitHub

Use more ideomatic Mojo and unify integer width with C. by owenhil...

Mojo prefers to make use of the SIMD type for small collections. In this case, it allocates a i32x16 block because it has to be a power of 2, but only the first 10 slots are used as with the origin...

Darkmatter•7mo ago

I’ll make a comment later but I closed it because I realized it would be a massive performance regression on the system the repo owner tests on due to it having half the SIMD width I do.
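The width arithmetic behind this can be sketched as follows (Python, for illustration only). The 10 cache entries and the i32x16 shape come from the PR description above. The 512-bit register width is an assumption about the commenter's machine, inferred from the "half the SIMD width" remark; 256 bits is the AVX2 register width available on Haswell. `next_power_of_two` and `registers_needed` are hypothetical helper names.

```python
def next_power_of_two(n):
    """Smallest power of two >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def registers_needed(lanes, lane_bits, register_bits):
    """How many hardware registers one SIMD value of this shape occupies."""
    total_bits = lanes * lane_bits
    return -(-total_bits // register_bits)  # ceiling division


CACHE_ENTRIES = 10  # digit powers for 0..9
LANE_BITS = 32      # i32 elements, as in the PR description

lanes = next_power_of_two(CACHE_ENTRIES)
print(lanes)                                    # 16
print(registers_needed(lanes, LANE_BITS, 512))  # 1 (assumed 512-bit registers)
print(registers_needed(lanes, LANE_BITS, 256))  # 2 (256-bit AVX2 registers)
```

So a 16-lane i32 vector that fits in one register on a 512-bit machine has to be split across two on a 256-bit one, which is the kind of cost the comment is worried about.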

name•7mo ago

For some of the languages, the repo has parallelized and vectorized versions of the code. https://github.com/jabbalaci/SpeedTests

GitHub

GitHub - jabbalaci/SpeedTests: comparing the execution speeds of va...

comparing the execution speeds of various programming languages - jabbalaci/SpeedTests

aurelian•7mo ago

Is this right? A simple change to UInt32 took a full second off my time.

Before:

Benchmark 1: ./munch
  Time (mean ± σ):      2.751 s ±  0.000 s    [User: 2.739 s, System: 0.012 s]
  Range (min … max):    2.751 s …  2.751 s    2 runs

After:

Benchmark 1: ./munch
  Time (mean ± σ):      1.726 s ±  0.016 s    [User: 1.710 s, System: 0.009 s]
  Range (min … max):    1.715 s …  1.737 s    2 runs
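For reference, the difference between the two means can be worked out like this (a small Python sketch using only the mean times reported above; each measurement is based on just 2 runs, so treat the result as indicative rather than precise):

```python
before_s = 2.751  # mean time before the change, in seconds
after_s = 1.726   # mean time after switching to UInt32, in seconds

saved_s = before_s - after_s
speedup = before_s / after_s
reduction_pct = 100 * saved_s / before_s

print(f"saved {saved_s:.3f} s, speedup {speedup:.2f}x, {reduction_pct:.1f}% less time")
```

That works out to roughly one second saved, or a bit over a third of the original run time.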

# Upper bound of the search range (exclusive)
alias N: UInt32 = 440_000_000

# A Munchausen number equals the sum of its digits, each raised to itself.
fn is_munchausen(number: UInt32, cache: List[UInt32]) -> Bool:
    n = number
    var total: UInt32 = 0

    while n > 0:
        digit = n % 10
        total += cache[int(digit)]
        # Early exit: the running sum can only grow
        if total > number:
            return False
        n //= 10

    return total == number

# Precompute d**d for the digits 1..9; the entry for 0 is set to 0.
fn get_cache() -> List[UInt32]:
    ca = List[UInt32](capacity=10)
    ca.append(0)

    @parameter
    for i in range(1,10):
        ca.append(i**i)
    return ca

fn main():
    cache = get_cache()
    for n in range(0, N):
        if is_munchausen(n, cache):
            print(n)
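For comparison, the same algorithm can be sketched in Python, which makes it quick to check the logic on a small range. This is only an illustration and not the repo's reference implementation; `build_cache` and `find_munchausen` are names chosen for this example. The assertion also shows why a 32-bit unsigned total is wide enough here: even a 9-digit input made of all nines sums to less than 2**32.

```python
def build_cache():
    """Digit powers d**d for d = 1..9, with the entry for 0 set to 0 as in the code above."""
    return [0] + [d ** d for d in range(1, 10)]


def is_munchausen(number, cache):
    """True if `number` equals the sum of its digits, each raised to itself."""
    n = number
    total = 0
    while n > 0:
        total += cache[n % 10]
        if total > number:  # early exit: the running sum can only grow
            return False
        n //= 10
    return total == number


def find_munchausen(limit):
    """All Munchausen numbers in range(limit)."""
    cache = build_cache()
    return [n for n in range(limit) if is_munchausen(n, cache)]


# Largest possible sum for a 9-digit input still fits in an unsigned 32-bit integer.
assert 9 * 9 ** 9 < 2 ** 32

print(find_munchausen(10_000))  # [0, 1, 3435]
```

The small limit keeps the run short; the full range up to 440_000_000 takes far longer in pure Python.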

Darkmatter•7mo ago

The main issue is that the benchmarks are run on a Haswell system, which is more than a decade old. It lacks a lot of modern SIMD, which Mojo uses heavily, so it takes poorly performing fallback paths.

name•7mo ago

v2 of the code is faster. https://github.com/jabbalaci/SpeedTests?tab=readme-ov-file#mojo

Martin DudekOP•7mo ago

Interesting comments, benchmark results, and various PRs aimed at improving the initial Mojo implementation have appeared on GitHub: :mojo: https://github.com/jabbalaci/SpeedTests/pulls?q=mojo Currently, we have a standard, Python-like version (v1) and an additional version (v2) aimed at further improving performance.

GitHub

Pull requests · jabbalaci/SpeedTests

comparing the execution speeds of various programming languages - Pull requests · jabbalaci/SpeedTests

aurelian•7mo ago

I should have put InlineArray and unsafe_get in v2, since Zig has no bounds checks in ReleaseFast. Though it seems to be only a tiny improvement on my machine.

Martin DudekOP•7mo ago

It seems the GitHub repo owner is very open to further PRs. Even so, it's probably best not to bombard him with PRs too frequently 😉
