A Benchmark with Files and Bytes
Crossposting my forum post since the formatting is a bit nicer there: https://forum.modular.com/t/showcase-a-benchmark-with-files-and-bytes-standard-benchmark-warnings-apply/420
tl;dr: pretty vanilla Mojo was beating out pretty vanilla Rust (all the normal caveats about benchmarks being worthless apply).
Oh yeah, take that Rust! 🦀 🚒 :mojo:
I also personally find the Mojo implementation more readable than either Rust or Python in this case
Updated with some new numbers after fixing up my SIMD code a bit more and moving Rust to use memchr!
When I get a chance, I'll create a SIMD rust version
tho SIMD is not a huge selling point of rust and requires unsafe or external crates, so even if rust wins with SIMD, it may not be a fair comparison...
I did swap in the memchr crate, which really should be going fast. Someone on the Modular forum post said that in previous benchmarking, for unknown reasons, Mojo tends to have faster SIMD code on M-series Macs compared to Rust 🤷
Agreed though, part of the sell of Mojo is the easy-to-write SIMD, and it really was easy, even for a SIMD newbie like me, and it's not that easy in Rust (at least not yet, maybe someday they'll land the portable simd stuff).
Now a bit more pythonic with a context manager! (code updated in forum post)
There's always the chance I'm doing something wildly unsafe, or just wrong, that happens to work on this particular dataset as well but would explode elsewhere. I have tests, but that's not the same as running this IRL all over the place.
the memchr crate uses SIMD already and goes to great lengths to provide vectorized, arch-specific implementations. I am not sure there is a better solution in the Rust ecosystem.
I can give the code a try on an Intel machine, and I can report the results here
That would be awesome! Lmk if you run into any issues running it / setting it up.
Here are the hyperfine results after fixing a small import bug in the Rust script.
Rust 1.83, Mojo 26.6.
As expected, the Rust implementation is more performant on Intel x86 than on ARM Mac and now beats Mojo.
Fair enough, I guess there's no need to manually implement a SIMD version then.
Thanks for running that! Just to double check, that was latest master of ExtraMojo?
And that makes sense; memchr really should be faster than my first-attempt SIMD. Wild that Rust has that big a difference between Mac and Intel, though.
Today we learned that one should use Mojo on M chips and Rust on Intel for best performance 😁
yes, I used the latest extramojo, and used the script in the records directory.
I can't say whether it's Rust itself or the memchr crate specifically that's using a subpar algorithm on ARM.
Based on my previous experiments, I can match the performance of memchr on x86 by careful optimization: removing unnecessary function calls and allocations, and tuning the buffer size. I can try to see if using the rolled-up buffered reader that I have makes a difference.
You could try the Rust version without memchr:
I'd be pretty willing to bet memchr is the fast bit. But I'd also be curious to see how your rolled-up buffered reader works on it! The only optimization I haven't pushed to ExtraMojo is upping the buffer size to 128 KiB in file.mojo.
Pushed a few more updates to ExtraMojo, which is now around 1.44 times faster on my Mac than the Rust-with-memchr version. All just improvements to my buffered reader though, nothing to do with improving the find algorithm.

Here are the corresponding results on Intel.
Better than I was expecting even!
Same warnings as always - this is very much a toy benchmark, but I'm taking the win in getting on par with Rust perf on this one and getting some solid building blocks with the Buffered Reader and memchr.
Of note, I discovered today that there is a memchr function in the stdlib: https://github.com/modularml/mojo/blob/fa8d0dfcf38fe21dc43ff9887e33fe52edf32800/stdlib/src/utils/stringref.mojo#L700
It works on pointers, and is missing a few of the optimizations that I've got, but it's in the standard library and that counts for a lot.
Something else I learned while working on this: the print function does no buffering. If you want to write to stdout and match other languages, you can use the _WriteBufferStack. I may pull that into ExtraMojo as well, or just finally add a real Buffered Writer.
Lastly, I have both memchr and memchr_wide. The wide impl is pretty much exactly what is in the Rust memchr crate, operating on 4 SIMD vectors at a time. I opted to keep the non-unrolled version as well, since more often than not you use memchr on short distances (finding your next ',', for example), where the non-unrolled one has a small advantage due to how I get the pointers set up to be aligned.
But for long distances (finding the next '\n') the wide version is faster. The Reader uses the wide version for finding newlines. The SplitIterator uses the short version for finding delimiters. Your mileage may vary.

Does Mojo have small string optimization by default? If so, it may be appropriate to use arraystring (or some other stack-based string implementation) in Rust.
I'm not sure, but I also don't think I'd have hit that with what I'm doing. I stay strictly in the realm of Spans.