duck_tape
MModular
Created by duck_tape on 1/7/2025 in #questions
Mojo for-loop performance
Startup is slower, but it doesn't explain the full delta between Rust and Mojo in the above programs, especially when the loops are large.
5 replies
MModular
Created by duck_tape on 1/7/2025 in #questions
Mojo for-loop performance
Possibly largely answered by this: https://discord.com/channels/1087530497313357884/1151418895417233429/1326217184963334217 Since it's comparing two binaries running, if Mojo has a slow startup time, that could be it.
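One way to take startup out of the comparison is to time the loop from inside the process rather than timing the whole binary. A minimal Rust sketch of that idea follows; `sum_loop` and `time_loop` are hypothetical names and the loop body is a stand-in, not the program from the original question.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Stand-in workload: sum 0..n. `black_box` keeps the optimizer from
/// folding the whole loop into a closed-form constant.
fn sum_loop(n: u64) -> u64 {
    let mut total = 0u64;
    for i in 0..n {
        total = total.wrapping_add(black_box(i));
    }
    total
}

/// Time only the loop, so process startup and teardown are excluded.
fn time_loop(n: u64) -> (u64, Duration) {
    let start = Instant::now();
    let total = sum_loop(black_box(n));
    (total, start.elapsed())
}

fn main() {
    let (total, elapsed) = time_loop(10_000_000);
    println!("sum = {total}, loop took {elapsed:?}");
}
```

If the in-process numbers are close but the wall-clock numbers for the binaries are not, startup is the likely culprit.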
5 replies
MModular
Created by duck_tape on 1/7/2025 in #questions
Mojo for-loop performance
Possibly related: I posted a bug demonstrating the difference in perf when using range(start, end) vs just range(end): https://github.com/modularml/mojo/issues/3931 Is range somehow getting in the way of optimizations?
5 replies
MModular
Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
I'm not sure, but I also don't think I'd have hit that with what I'm doing. I stay strictly in the realm of Spans.
24 replies
MModular
Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
Better than I was expecting even! Same warnings as always - this is very much a toy benchmark, but I'm taking the win in getting on par with Rust perf on this one and getting some solid building blocks with the Buffered Reader and memchr. Of note, I discovered today that there is a memchr function in the stdlib: https://github.com/modularml/mojo/blob/fa8d0dfcf38fe21dc43ff9887e33fe52edf32800/stdlib/src/utils/stringref.mojo#L700 It works on pointers, and is missing a few of the optimizations that I've got, but it's in the standard library and that counts for a lot. Something else I learned while working on this: the print function does no buffering. If you want to be able to write to stdout and match other languages, you can use the _WriteBufferStack. I may pull that into ExtraMojo as well, or just finally add a real Buffered Writer.
import sys
from utils.write import _WriteBufferStack as BufWriter
var writer = BufWriter[4096](sys.stdout)
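For reference, the Rust side of a comparison like this would typically wrap stdout in a buffered writer as well. A minimal sketch using only the standard library; `write_numbers` is a hypothetical helper, and the 4096-byte capacity is chosen only to mirror the Mojo snippet above.

```rust
use std::io::{self, BufWriter, Write};

/// Write the numbers 0..n, one per line, through a 4096-byte buffer
/// so the underlying writer sees a few large writes, not one per line.
fn write_numbers<W: Write>(out: W, n: usize) -> io::Result<()> {
    let mut writer = BufWriter::with_capacity(4096, out);
    for i in 0..n {
        writeln!(writer, "{i}")?;
    }
    // Flush explicitly so any write error surfaces here, not on drop.
    writer.flush()
}

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    write_numbers(stdout.lock(), 5)
}
```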
Lastly, I have both memchr and memchr_wide. The wide impl is pretty much exactly what is in the Rust memchr crate, operating on 4 SIMD vectors at a time. I opted to keep the not-unrolled version as well, since more often than not you use memchr on short distances (finding your next ',' for example), where the not-unrolled one has a small advantage due to how I get the pointers set up to be aligned. But for long distances (finding the next '\n') the wide version is faster. The Reader uses the wide version for finding newlines. The SplitIterator uses the short version for finding delimiters. Your mileage may vary.
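To make the narrow vs wide distinction concrete, here is a rough scalar sketch in Rust of the two loop shapes. It is not the ExtraMojo code and not the memchr crate: there are no SIMD intrinsics and no pointer alignment here, `LANES` is an assumed stand-in for one vector width, and all function names are hypothetical.

```rust
/// Assumed stand-in for the byte width of one SIMD vector.
const LANES: usize = 16;

/// Branch-free "does any byte match" check over a chunk, mimicking a
/// vector compare followed by a reduction.
fn chunk_has(chunk: &[u8], needle: u8) -> bool {
    chunk.iter().fold(false, |acc, &b| acc | (b == needle))
}

/// Narrow version: test one vector-sized chunk per iteration.
fn find_byte(haystack: &[u8], needle: u8) -> Option<usize> {
    let mut offset = 0;
    for chunk in haystack.chunks(LANES) {
        if chunk_has(chunk, needle) {
            return chunk.iter().position(|&b| b == needle).map(|i| offset + i);
        }
        offset += chunk.len();
    }
    None
}

/// Wide version: test four vector-sized chunks per iteration, then let
/// the narrow version handle whatever is left at the end.
fn find_byte_wide(haystack: &[u8], needle: u8) -> Option<usize> {
    let wide = LANES * 4;
    let mut offset = 0;
    while haystack.len() - offset >= wide {
        let block = &haystack[offset..offset + wide];
        if chunk_has(block, needle) {
            return block.iter().position(|&b| b == needle).map(|i| offset + i);
        }
        offset += wide;
    }
    find_byte(&haystack[offset..], needle).map(|i| offset + i)
}

fn main() {
    let line = b"name\tabc\tbcd\n";
    println!("{:?}", find_byte(line, b'\t'));
    println!("{:?}", find_byte_wide(line, b'\n'));
}
```

The wide loop does fewer branch checks per byte on long runs, while the narrow loop has less work to do before it can return on short ones, which matches the trade-off described above.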
24 replies
MModular
Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
Pushed a few more updates to ExtraMojo, which is now around 1.44 times faster on my mac than the Rust with memchr version. All just improvements to my buffered reader though, nothing to do with improving the find algorithm.
24 replies
MModular
Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
You could try the Rust version without memchr:
use std::io::stdin;
use std::io::BufRead;
use std::io::BufReader;

struct Record {
    pub name: String,
    pub count: usize,
}

impl Record {
    pub fn new(name: String, count: usize) -> Record {
        Record { name, count }
    }
}

fn create_record(line: &str) -> Record {
    let mut iter = line.split('\t').peekable();
    let name = iter.peek().unwrap().to_string();
    let count = iter.filter(|s| s.contains("bc")).count();
    Record::new(name, count)
}

fn main() {
    let mut records = vec![];
    let mut buffer = String::new();
    let stdin = stdin();
    let mut input = BufReader::new(stdin.lock());
    while let Ok(bytes_read) = input.read_line(&mut buffer) {
        if bytes_read == 0 {
            break;
        }
        buffer.make_ascii_lowercase();
        records.push(create_record(&buffer));
        buffer.clear();
    }
    let count: usize = records.iter().map(|r| r.count).sum();
    println!("{}", count);
}
I'd be pretty willing to bet memchr is the fast bit. But I'd also be curious to see how your rolled-up buffered reader works on it! The only optimization I haven't pushed to ExtraMojo is upping the buffer size to 128 KiB in file.mojo:
# 128 KiB: http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/ioblksize.h;h=266c209f48fc07cb4527139a2548b6398b75f740;hb=HEAD#l23
alias BUF_SIZE: Int = 1024 * 128
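To keep the comparison even, the Rust reader can be given the same buffer size. A minimal sketch, assuming the same 128 KiB figure; `count_lines` is a hypothetical helper and the in-memory `Cursor` stands in for stdin or a file.

```rust
use std::io::{self, BufRead, BufReader, Cursor, Read};

/// 128 KiB, matching the buffer size discussed above.
const BUF_SIZE: usize = 128 * 1024;

/// Count lines from any byte source through a 128 KiB buffered reader,
/// reusing one line buffer so there is no allocation per line.
fn count_lines<R: Read>(source: R) -> io::Result<usize> {
    let mut reader = BufReader::with_capacity(BUF_SIZE, source);
    let mut line = Vec::new();
    let mut count = 0;
    while reader.read_until(b'\n', &mut line)? != 0 {
        count += 1;
        line.clear();
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    let n = count_lines(Cursor::new("a\tb\nc\td\n"))?;
    println!("{n}");
    Ok(())
}
```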
24 replies
MModular
Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
Thanks for running that! Just to double check, that was latest master of ExtraMojo? And that makes sense: memchr really should be faster than my first-attempt SIMD. Wild that Rust has that big a difference between Mac and Intel, though.
24 replies
MModular
Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
That would be awesome! Lmk if you run into any issues running it / setting it up.
24 replies
MModular
Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
There's always the chance I'm doing something wildly unsafe, or just wrong, that happens to work on this particular dataset as well but would explode elsewhere. I have tests, but that's not the same as running this IRL all over the place.
24 replies
MModular
Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
Now a bit more pythonic with a context manager! (code updated in forum post)
24 replies
MModular
Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
I did swap in the memchr crate, which really should be going fast. Someone on the Modular forum post said that in previous benchmarking, for unknown reasons, Mojo tends to have faster SIMD code on M-series Macs compared to Rust 🤷 Agreed though, part of the sell of Mojo is the easy-to-write SIMD, and it really was easy, even for a SIMD newbie like me, and it's not that easy in Rust (at least not yet, maybe someday they'll land the portable SIMD stuff).
24 replies
MModular
Created by duck_tape on 1/3/2025 in #community-showcase
A Benchmark with Files and Bytes
Updated with some new numbers after fixing up my SIMD code a bit more and moving Rust to use memchr!
24 replies
MModular
Created by duck_tape on 12/10/2024 in #community-showcase
ExtraMojo
mmap on osx isn't worth it in my experience, so you end up supporting both ways of doing things, which means sorting out your string lifetimes at the end of the day anyways.
68 replies
MModular
Created by duck_tape on 12/10/2024 in #community-showcase
ExtraMojo
Unrelated to your perf experiments (which I'm following along with!). ExtraMojo v0.2.0 https://github.com/ExtraMojo/ExtraMojo Still very rough around the edges, but it improves the buffered file reader to allocate less and move off the Tensors and use Spans etc. It also adds more byte-string helper functions like split and find. Still very much a "things I need when trying to benchmark against other languages" catchall.
68 replies
MModular
Created by duck_tape on 12/10/2024 in #community-showcase
ExtraMojo
If you’re alluding to io_uring at the end there, totally agree! Obviously both @Mohamed Mabrouk and I fall in the buffered IO camp 🙂 We’ll have to do some bake offs as APIs mature a bit more. I’ve gone fairly deep on exactly this in Rust and everything I’ve done with Mojo so far indicates it’ll be just as speedy.
68 replies
MModular
Created by duck_tape on 12/10/2024 in #community-showcase
ExtraMojo
I’m all for zero copy! How does that work though without a buffer? Buffered IO is just such a common abstraction I didn’t realize there was a zero copy performant way to do it.
68 replies
MModular
Created by duck_tape on 12/10/2024 in #community-showcase
ExtraMojo
Very interested! Personally I want the Rust suite of IO traits, or something like it, which I’ve found to have the full range of high/low level control. I really want to dive into a big project around this but don’t want to get too ahead of where the language is either, so I’ve just been building “as needed” type things.
68 replies
MModular
Created by duck_tape on 12/10/2024 in #community-showcase
ExtraMojo
Thank you for the detailed answer! That all makes a lot of sense. If I find nice carve outs (buffered by-line reader) that don't conflict with known-in-waiting-things for the stdlib (like strip/split), I'll look into making a PR to the stdlib to begin the discussion 👍 It does feel like a weird stalemate right now where there's almost enough features to do everything, but enough change coming that it's not worth doing yet. Overall very excited for Mojo this year.
68 replies
MModular
Created by duck_tape on 12/10/2024 in #community-showcase
ExtraMojo
Hi @Darkmatter - I'm sure there's a better place to ask this, I just haven't found it! What's the best path to contributing to the standard library? There's stuff I'd love to work on, like adding a buffered file reader / line-by-line reading and such, but I'm also sure you have a plan for what features you want and when you want them? Also helper functions around the String API, like split and strip and such but with iterators. I'm just guessing the Iterator API isn't solid yet and that's why that hasn't been taken up?
68 replies