Modular•4mo ago
Martin Dudek

Mojo added to SpeedTests repo on github

I recently came across jabbalaci's SpeedTests https://github.com/jabbalaci/SpeedTests, and I made a PR to add Mojo, which got accepted today! The project's idea is to stick to the basic implementation pattern of the task without using language-specific tricks, so I just followed the Python implementation. It's an insightful list of benchmark results, and Mojo is holding up pretty well. 😉
12 Replies
Caroline•4mo ago
This is super cool! Mojo does seem to hold up well 🙂
name•4mo ago
can you update mojo to the latest version to improve the speed of print? i think it will make a huge difference. https://github.com/modularml/mojo/commit/9cbfa411d1ea0f01d0b7c1732cc4e36f08c1f969
GitHub
[stdlib] Buffer output before printing and writing to file · modula...
This greatly improves the print performance, minimizing syscalls and multiple variadic pack loads. These are results from printing 5 args in a 10k loop on the fast alacritty terminal emulator: ```...
Martin Dudek (OP)•4mo ago
There are only five numbers printed by the program, so this change likely won't have a noticeable impact on performance. When I run the program locally, it seems that the current nightly version is actually performing a bit slower for some reason. Let's wait for the next stable version to see if there’s a noticeable performance improvement. If there is, I'll request the GitHub repository owner to rerun the benchmark with the latest version.
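To make the reasoning above concrete, here is a rough Python sketch (not Mojo, and not the code from the linked commit) that counts write calls with and without buffering. `CountingWriter`, `write_each`, and `write_buffered` are hypothetical names made up for this illustration. The 10,000-value case mirrors the "10k loop" mentioned in the commit description, and the five-value case stands in for this benchmark's output.

```python
import io


class CountingWriter(io.StringIO):
    """In-memory text sink that counts how often write() is called."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def write(self, s):
        self.calls += 1
        return super().write(s)


def write_each(values, out):
    # Unbuffered style: one write call per value.
    for v in values:
        out.write(f"{v}\n")


def write_buffered(values, out):
    # Buffered style: build the whole output first, then write once.
    out.write("".join(f"{v}\n" for v in values))


def writes_saved(count):
    """Number of write calls saved by buffering `count` values."""
    unbuffered, buffered = CountingWriter(), CountingWriter()
    write_each(range(count), unbuffered)
    write_buffered(range(count), buffered)
    # Both strategies must produce identical output.
    assert unbuffered.getvalue() == buffered.getvalue()
    return unbuffered.calls - buffered.calls


print("saved for 5 values:", writes_saved(5))
print("saved for 10,000 values:", writes_saved(10_000))
```

With only a handful of values there are only a handful of write calls to save, which is why buffering is unlikely to move a multi-second benchmark.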
Martin Dudek (OP)•4mo ago
Some insightful comments by @Owen Hilyard and @Martin Vuyk on the various ways this simple task can be implemented have appeared on GitHub: https://github.com/jabbalaci/SpeedTests/pull/63 :mojo:
GitHub
Use more ideomatic Mojo and unify integer width with C. by owenhil...
Mojo prefers to make use of the SIMD type for small collections. In this case, it allocates a i32x16 block because it has to be a power of 2, but only the first 10 slots are used as with the origin...
Darkmatter•4mo ago
I’ll make a comment later but I closed it because I realized it would be a massive performance regression on the system the repo owner tests on due to it having half the SIMD width I do.
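A back-of-the-envelope Python sketch of the width argument: the PR description says the 10-entry cache is padded to an i32x16 block, which is 512 bits in total. The 256-bit figure matches AVX2, the widest vector extension on Haswell. The 512-bit figure is an assumption about the PR author's machine, inferred only from the "half the SIMD width" remark. The function names are hypothetical.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def registers_needed(lanes: int, lane_bits: int, register_bits: int) -> int:
    """How many hardware vector registers one logical SIMD value spans."""
    total_bits = lanes * lane_bits
    return -(-total_bits // register_bits)  # ceiling division


lanes = next_power_of_two(10)  # 10 cache entries padded to 16 lanes
print("lanes:", lanes, "unused:", lanes - 10)

for label, width in (("256-bit registers (AVX2)", 256),
                     ("512-bit registers (assumed)", 512)):
    print(label, "->", registers_needed(lanes, 32, width), "register(s)")
```

On the narrower machine the same logical vector has to be split across two registers, so an operation that is a single instruction on wide hardware becomes several.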
name•4mo ago
for some of the languages, the repo has parallelized and vectorized versions of the code. https://github.com/jabbalaci/SpeedTests
aurelian•4mo ago
is this right? A simple change to UInt32 took a full second off my time:
Before:
Benchmark 1: ./munch
Time (mean ± σ): 2.751 s ± 0.000 s [User: 2.739 s, System: 0.012 s]
Range (min … max): 2.751 s … 2.751 s 2 runs
After:
Benchmark 1: ./munch
Time (mean ± σ): 1.726 s ± 0.016 s [User: 1.710 s, System: 0.009 s]
Range (min … max): 1.715 s … 1.737 s 2 runs
alias N: UInt32 = 440_000_000

fn is_munchausen(number: UInt32, cache: List[UInt32]) -> Bool:
    n = number
    var total: UInt32 = 0

    while n > 0:
        digit = n % 10
        total += cache[int(digit)]
        if total > number:
            return False
        n //= 10

    return total == number

fn get_cache() -> List[UInt32]:
    ca = List[UInt32](capacity=10)
    ca.append(0)

    @parameter
    for i in range(1,10):
        ca.append(i**i)
    return ca

fn main():
    cache = get_cache()
    for n in range(0, N):
        if is_munchausen(n, cache):
            print(n)
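For readers who want to follow the logic without a Mojo toolchain, here is a minimal Python sketch of the same approach as the snippet above: a digit-power cache with 0 mapped to 0, and an early exit once the running total exceeds the number. It is not the repo's reference implementation. `build_cache` and `find_munchausen` are names chosen for this sketch, and the search limit is kept small so it finishes quickly. The last lines check that the largest possible digit-power sum for a nine-digit input fits in an unsigned 32-bit integer, which a `UInt32` total requires.

```python
def build_cache() -> list[int]:
    # cache[d] == d**d for d in 1..9; cache[0] is 0, as in the snippet above.
    return [0] + [d ** d for d in range(1, 10)]


def is_munchausen(number: int, cache: list[int]) -> bool:
    n = number
    total = 0
    while n > 0:
        total += cache[n % 10]
        if total > number:  # early exit: the sum can only grow
            return False
        n //= 10
    return total == number


def find_munchausen(limit: int) -> list[int]:
    cache = build_cache()
    return [n for n in range(limit) if is_munchausen(n, cache)]


print(find_munchausen(10_000))  # [0, 1, 3435]

# Worst case for a 9-digit input: every digit is 9.
worst_case_total = 9 * 9 ** 9
print(worst_case_total, worst_case_total < 2 ** 32)
```

The full benchmark runs the same loop up to 440_000_000, which is far too slow for a quick check in plain Python.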
Darkmatter•4mo ago
The main issue is that the benchmarks are run on a Haswell system, which is more than a decade old. It lacks a lot of the modern SIMD instructions that Mojo uses heavily, so the code takes poorly performing fallback paths.
Martin Dudek (OP)•4mo ago
Interesting comments, benchmark results, and various PRs aimed at improving the initial Mojo implementation have appeared on GitHub: :mojo: https://github.com/jabbalaci/SpeedTests/pulls?q=mojo Currently, we have a standard, Python-like version (v1) and an additional version (v2) aimed at further improving performance.
aurelian•4mo ago
I should have put InlineArray and unsafe_get in v2. Zig has no bounds checks in ReleaseFast. Though it seems to be only a tiny improvement on my machine.
Martin Dudek (OP)•4mo ago
It seems the GitHub repo owner is very open to further PRs. Even so, it's probably good not to bombard him with PRs too frequently 😉
