duck_tape
MModular
•Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
@Caroline I don't think I can make it work schedule-wise for this upcoming 2/3 meeting, but I'd love to present at a future meeting! Probably not on "Mojo is faster than Rust" but more of a "Here's a side by side of these two programs doing the same thing and how to compare them" type of thing.
27 replies
•Created by duck_tape on 12/10/2024 in #community-showcase
ExtraMojo
Would there be a downside to a concrete implementation in the stdlib currently that is just a BufferedReader that is created with a FileHandle? That could at a future point take something that implements a Read trait, and anything calling it would still work because presumably FileHandle will implement Read.
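For illustration, here is roughly what that shape looks like in Rust, whose Read/BufReader design this thread leans on. This is a minimal sketch, not the ExtraMojo or Mojo stdlib implementation: the `BufferedReader` type, its fields, and `read_line` are hypothetical names, and it copies each line out rather than attempting zero-copy.

```rust
use std::io::{self, Read};

/// Hypothetical buffered line reader, generic over any `Read` source.
pub struct BufferedReader<R: Read> {
    inner: R,
    buf: Vec<u8>,
    start: usize, // first unread byte in `buf`
    end: usize,   // one past the last valid byte in `buf`
}

impl<R: Read> BufferedReader<R> {
    pub fn new(inner: R, capacity: usize) -> Self {
        Self { inner, buf: vec![0; capacity.max(1)], start: 0, end: 0 }
    }

    /// Appends the next line (without the trailing '\n') to `line`.
    /// Returns Ok(false) once the source is exhausted.
    pub fn read_line(&mut self, line: &mut Vec<u8>) -> io::Result<bool> {
        let mut saw_any = false;
        loop {
            if self.start == self.end {
                // Buffer drained: refill from the underlying source.
                self.end = self.inner.read(&mut self.buf)?;
                self.start = 0;
                if self.end == 0 {
                    return Ok(saw_any); // EOF
                }
            }
            let window = &self.buf[self.start..self.end];
            match window.iter().position(|&b| b == b'\n') {
                Some(i) => {
                    line.extend_from_slice(&window[..i]);
                    self.start += i + 1; // skip the newline
                    return Ok(true);
                }
                None => {
                    // No newline yet: keep what we have and refill.
                    line.extend_from_slice(window);
                    saw_any = true;
                    self.start = self.end;
                }
            }
        }
    }
}
```

Callers only ever see `read_line`, so swapping the source from an in-memory `&[u8]` to a `std::fs::File` (or anything else implementing `Read`) needs no changes on their side, which is the compatibility argument being made here.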
That doesn't address the zero-copy portion. If there's an implementation that you like out there that you could point me at to go and read about, I'd love to understand how that would work, especially when very large files are involved. @Owen Hilyard
I still read all of this as "please don't try to submit a 'classical' BufferedReader/BufferedWriter to the stdlib right now" which is totally fine 👍
83 replies
•Created by duck_tape on 12/10/2024 in #community-showcase
ExtraMojo
I'd be very interested in doing the work to add this to the stdlib, and happy to make it conform to whatever is most palatable to the stdlib team.
Per older conversation with @Owen Hilyard I was under the impression that the stdlib team wanted to wait a bit before adding something like this, for some of the reasons he outlined just now, and also because of the desire to have zero-copy io be the default.
I agree that some form of lines iterator is a commonly reached-for tool in standard libraries. I also lean towards Rust's suite of Read/Write, BufferedReader / BufferedWriter implementations.
•Created by duck_tape on 1/7/2025 in #questions
Mojo for-loop performance
Startup is slower, but it doesn't explain the full delta between Rust and Mojo in the above programs, especially when the loops are large.
5 replies
•Created by duck_tape on 1/7/2025 in #questions
Mojo for-loop performance
Possibly largely answered by this: https://discord.com/channels/1087530497313357884/1151418895417233429/1326217184963334217
Since it's comparing two binaries running, if Mojo has a slow startup time, that could be it.
•Created by duck_tape on 1/7/2025 in #questions
Mojo for-loop performance
Possibly related, I posted a bug demonstrating the difference in perf when using range(start, end) vs just range(end): https://github.com/modularml/mojo/issues/3931
Is range somehow getting in the way of optimizations?
•Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
I'm not sure, but I also don't think I'd have hit that with what I'm doing. I stay strictly in the realm of Spans.
•Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
Better than I was expecting even!
Same warnings as always - this is very much a toy benchmark, but I'm taking the win in getting on par with Rust perf on this one and getting some solid building blocks with the Buffered Reader and memchr.
Of note, I discovered today that there is a memchr function in the stdlib: https://github.com/modularml/mojo/blob/fa8d0dfcf38fe21dc43ff9887e33fe52edf32800/stdlib/src/utils/stringref.mojo#L700
It works on pointers, and is missing a few of the optimizations that I've got, but it's in the standard library and that counts for a lot.
Something else I learned while working on this: the print function does no buffering. If you want to be able to write to stdout and match other languages you can use the _WriteBufferStack. I may pull that into ExtraMojo as well, or just finally add a real Buffered Writer.
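For comparison, this is roughly what the buffered side of such a benchmark tends to look like in Rust. It is a sketch under my own assumptions, not the benchmark's actual code: `write_lines` and the 64 KiB capacity are made up for illustration; `BufWriter::with_capacity` and `flush` are the standard library calls.

```rust
use std::io::{self, BufWriter, Write};

/// Writes `n` numbered lines to any `Write` sink through one buffer,
/// so the sink sees a few large writes instead of one write per line.
fn write_lines<W: Write>(sink: W, n: usize) -> io::Result<()> {
    let mut out = BufWriter::with_capacity(64 * 1024, sink);
    for i in 0..n {
        writeln!(out, "line {}", i)?;
    }
    // Flush explicitly so a write error is reported, not lost on drop.
    out.flush()
}
```

Calling `write_lines(std::io::stdout().lock(), 1_000_000)` sends the output to stdout. An unbuffered print in one language against this in the other mostly measures write overhead rather than the work being benchmarked.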
Lastly, I have both memchr and memchr_wide. The wide impl is pretty much exactly what is in the Rust memchr crate, operating on 4 SIMD vectors at a time. I opted to keep the not-unrolled version as well since more often than not you use memchr on short distances (finding your next ',' for example), where the not-unrolled one has a small advantage due to how I get the pointers set up to be aligned.
But for long distances (finding the next '\n') the wide version is faster. The Reader uses the wide version for finding newlines. The SplitIterator uses the short version for finding delimiters. Your mileage may vary.
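To make the narrow vs wide distinction concrete, here is a scalar sketch of the two loop shapes in Rust. It is not the ExtraMojo code and not the memchr crate's implementation: `LANES`, `find_in_chunk`, and `memchr_narrow` are hypothetical names, a 16-byte chunk stands in for one SIMD vector, and the pointer-alignment setup mentioned above is left out entirely.

```rust
/// Width of one "vector" in this sketch; a real SIMD version would use
/// the hardware register width instead.
const LANES: usize = 16;

/// Returns the index of the first `needle` in one chunk, if any.
fn find_in_chunk(chunk: &[u8], needle: u8) -> Option<usize> {
    chunk.iter().position(|&b| b == needle)
}

/// One chunk per iteration: low setup cost, suited to short distances.
pub fn memchr_narrow(needle: u8, haystack: &[u8]) -> Option<usize> {
    let mut offset = 0;
    for chunk in haystack.chunks(LANES) {
        if let Some(i) = find_in_chunk(chunk, needle) {
            return Some(offset + i);
        }
        offset += chunk.len();
    }
    None
}

/// Four chunks per iteration: a single "any match?" test covers 64 bytes,
/// and the exact position is only computed once a block is known to match.
pub fn memchr_wide(needle: u8, haystack: &[u8]) -> Option<usize> {
    let block = 4 * LANES;
    let mut offset = 0;
    while haystack.len() - offset >= block {
        let window = &haystack[offset..offset + block];
        // Branch-free OR-reduction over the whole block.
        let any = window.iter().fold(false, |acc, &b| acc | (b == needle));
        if any {
            return find_in_chunk(window, needle).map(|i| offset + i);
        }
        offset += block;
    }
    // Fewer than 64 bytes remain: fall back to the narrow loop.
    memchr_narrow(needle, &haystack[offset..]).map(|i| offset + i)
}
```

The trade-off shows up in the structure: the wide loop does less branching per byte when the match is far away, but it always examines a full 64-byte block before it can answer, which is wasted work when the delimiter is a few bytes ahead.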
•Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
Pushed a few more updates to ExtraMojo, which is now around 1.44 times faster on my Mac than the Rust version that uses memchr. All just improvements to my buffered reader though, nothing to do with improving the find algorithm.
•Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
You could try the Rust version without memchr:
I'd be pretty willing to bet memchr is the fast bit. But I'd also be curious to see how your rolled-up buffered reader works on it! The only optimization I haven't pushed on ExtraMojo is upping the buffer size to 128kb in file.mojo.
•Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
Thanks for running that! Just to double check, that was latest master of ExtraMojo?
And makes sense. memchr really should be faster than my first-attempt SIMD. Wild that Rust has that big a difference between Mac and Intel though.
•Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
That would be awesome! Lmk if you run into any issues running it / setting it up.
•Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
There's always the chance I'm doing something wildly unsafe, or just wrong, that happens to work on this particular dataset as well but would explode elsewhere. I have tests, but that's not the same as running this IRL all over the place.
•Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
Now a bit more pythonic with a context manager! (code updated in forum post)
•Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
I did swap in the memchr crate, which really should be going fast. Someone on the Modular forum post said that in previous benchmarking, for unknown reasons, Mojo tends to have faster SIMD code on M-series Macs compared to Rust 🤷
Agreed though, part of the sell of Mojo is the easy-to-write SIMD, and it really was easy, even for a SIMD newbie like me, and it's not that easy in Rust (at least not yet, maybe someday they'll land the portable simd stuff).
•Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
Updated with some new numbers after fixing up my SIMD code a bit more and moving Rust to use memchr!
•Created by duck_tape on 12/10/2024 in #community-showcase
ExtraMojo
mmap on osx isn't worth it in my experience, so you end up supporting both ways of doing things, which means sorting out your string lifetimes at the end of the day anyways.
•Created by duck_tape on 12/10/2024 in #community-showcase
ExtraMojo
Unrelated to your perf experiments (which I'm following along with!). ExtraMojo v0.2.0 https://github.com/ExtraMojo/ExtraMojo
Still very rough around the edges, but it improves the buffered file reader to allocate less and move off the Tensors and use Spans etc. It also adds more byte-string helper functions like split and find. Still very much a "things I need when trying to benchmark against other languages" catchall.
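For readers who have not used byte-string helpers like these, here is a sketch of the two operations in Rust. The names `find` and `split` and their signatures are illustrative assumptions, not ExtraMojo's actual API; the point is only that both work on borrowed bytes and never allocate new strings for the fields.

```rust
/// Returns the index of the first occurrence of `needle` in `haystack`.
pub fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0); // `windows(0)` would panic, so handle this up front
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Splits `line` on a single delimiter byte, borrowing each field
/// from the input instead of copying it.
pub fn split(line: &[u8], delim: u8) -> Vec<&[u8]> {
    line.split(|&b| b == delim).collect()
}
```

For example, `split(b"a,b,,c", b',')` yields four fields with an empty third one, and `find(b"chr1\t100\t200", b"\t")` points at the first tab.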
•Created by duck_tape on 12/10/2024 in #community-showcase
ExtraMojo
If you’re alluding to io_uring at the end there, totally agree!
Obviously both @Mohamed Mabrouk and I fall in the buffered IO camp 🙂
We’ll have to do some bake offs as APIs mature a bit more. I’ve gone fairly deep on exactly this in Rust and everything I’ve done with Mojo so far indicates it’ll be just as speedy.
•Created by duck_tape on 12/10/2024 in #community-showcase
ExtraMojo
I’m all for zero copy! How does that work though without a buffer? Buffered IO is just such a common abstraction I didn’t realize there was a zero copy performant way to do it.