duck_tape Comments - Answer Overflow

duck_tape

•Created by duck_tape on 1/3/2025 in #community-showcase

A Benchmark with Files and Bytes

@Caroline I don't think I can make the upcoming 2/3 meeting work schedule-wise, but I'd love to present at a future meeting! Probably not on "Mojo is faster than Rust", but more of a "here's a side-by-side of these two programs doing the same thing, and here's how to compare them" type of thing.

27 replies

•Created by duck_tape on 12/10/2024 in #community-showcase

ExtraMojo

Would there be a downside to having a concrete implementation in the stdlib right now that is just a BufferedReader created from a FileHandle? At a future point it could take anything that implements a Read trait, and existing callers would still work because FileHandle will presumably implement Read. That doesn't address the zero-copy portion. If there's an implementation out there that you like and could point me at, I'd love to read it and understand how that would work, especially when very large files are involved. @Owen Hilyard I still read all of this as "please don't try to submit a 'classical' BufferedReader/BufferedWriter to the stdlib right now" which is totally fine 👍
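
A minimal sketch of the "concrete now, generic later" shape, written in Rust because that is the design the thread leans on. `LineReader` and its methods are hypothetical names for illustration, not an existing Mojo or Rust stdlib API. Callers only see `read_line`, so swapping a file-only reader for one that is generic over `Read` would not change call sites.

```rust
use std::io::{self, Read};

/// Hypothetical buffered line reader, generic over anything that implements `Read`.
pub struct LineReader<R: Read> {
    inner: R,
    buf: Vec<u8>,
    start: usize, // first unread byte in `buf`
    end: usize,   // one past the last valid byte in `buf`
    eof: bool,
}

impl<R: Read> LineReader<R> {
    pub fn with_capacity(capacity: usize, inner: R) -> Self {
        LineReader { inner, buf: vec![0; capacity.max(1)], start: 0, end: 0, eof: false }
    }

    /// Appends the next line (without its trailing '\n') to `out`.
    /// Returns Ok(false) once the input is exhausted.
    pub fn read_line(&mut self, out: &mut Vec<u8>) -> io::Result<bool> {
        let mut got_any = false;
        loop {
            if self.start == self.end {
                // Buffer is drained: refill it, or stop at end of input.
                if self.eof {
                    return Ok(got_any);
                }
                let n = self.inner.read(&mut self.buf)?;
                self.start = 0;
                self.end = n;
                if n == 0 {
                    self.eof = true;
                    return Ok(got_any);
                }
            }
            let window = &self.buf[self.start..self.end];
            match window.iter().position(|&b| b == b'\n') {
                Some(i) => {
                    out.extend_from_slice(&window[..i]);
                    self.start += i + 1; // skip past the newline
                    return Ok(true);
                }
                None => {
                    // Line continues past this buffer: keep what we have and refill.
                    out.extend_from_slice(window);
                    got_any = true;
                    self.start = self.end;
                }
            }
        }
    }
}

fn main() -> io::Result<()> {
    // A byte slice implements `Read`; a `std::fs::File` would work the same way.
    let mut reader = LineReader::with_capacity(8, &b"first\nsecond\n"[..]);
    let mut line = Vec::new();
    while reader.read_line(&mut line)? {
        println!("{}", String::from_utf8_lossy(&line));
        line.clear();
    }
    Ok(())
}
```

Note that this version copies each line into `out`, so it says nothing about the zero-copy question raised above.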

83 replies

•Created by duck_tape on 12/10/2024 in #community-showcase

ExtraMojo

I'd be very interested in doing the work to add this to the stdlib, and happy to make it conform to whatever is most palatable to the stdlib team. Per an older conversation with @Owen Hilyard, I was under the impression that the stdlib team wanted to wait a bit before adding something like this, for some of the reasons he outlined just now, and also because of the desire to have zero-copy IO be the default. I agree that some form of lines iterator is a commonly reached-for tool in standard libraries. I also lean towards Rust's suite of Read/Write and BufferedReader/BufferedWriter implementations.

83 replies

•Created by duck_tape on 1/7/2025 in #questions

Mojo for-loop performance

Startup is slower, but it doesn't explain the full delta between Rust and Mojo in the above programs, especially when the loops are large.

5 replies

•Created by duck_tape on 1/7/2025 in #questions

Mojo for-loop performance

Possibly largely answered by this: https://discord.com/channels/1087530497313357884/1151418895417233429/1326217184963334217 Since it's comparing two binaries running, if Mojo has a slow startup time, that could be it.
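
One way to check the startup-time theory is to time the loop from inside the process, so that process startup never enters the measurement. The following is a rough Rust sketch of that idea; `sum_to` and `time_it` are made-up helper names, and the loop body is a placeholder rather than the program from the linked thread.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Sums 0..n; `black_box` discourages the optimizer from folding the loop away.
fn sum_to(n: u64) -> u64 {
    let mut total = 0u64;
    for i in 0..n {
        total = total.wrapping_add(black_box(i));
    }
    total
}

/// Runs `f` once and returns its result together with the elapsed wall time.
fn time_it<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

fn main() {
    let (total, elapsed) = time_it(|| sum_to(100_000_000));
    println!("sum = {total}, loop took {elapsed:?} (process startup excluded)");
}
```

Comparing this in-process number with the wall time of the whole binary gives a rough split between startup cost and loop cost.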

5 replies

•Created by duck_tape on 1/7/2025 in #questions

Mojo for-loop performance

Possibly related: I posted a bug demonstrating the difference in perf when using range(start, end) vs. just range(end): https://github.com/modularml/mojo/issues/3931 Is range somehow getting in the way of optimizations?

5 replies

•Created by duck_tape on 1/3/2025 in #community-showcase

A Benchmark with Files and Bytes

I'm not sure, but I also don't think I'd have hit that with what I'm doing. I stay strictly in the realm of Spans.

27 replies

•Created by duck_tape on 1/3/2025 in #community-showcase

A Benchmark with Files and Bytes

Better than I was expecting even! Same warnings as always: this is very much a toy benchmark, but I'm taking the win in getting on par with Rust perf on this one and getting some solid building blocks with the Buffered Reader and memchr. Of note, I discovered today that there is a memchr function in the stdlib: https://github.com/modularml/mojo/blob/fa8d0dfcf38fe21dc43ff9887e33fe52edf32800/stdlib/src/utils/stringref.mojo#L700 It works on pointers and is missing a few of the optimizations that I've got, but it's in the standard library and that counts for a lot. Something else I learned while working on this: the print function does no buffering. If you want to write to stdout and match other languages, you can use the _WriteBufferStack. I may pull that into ExtraMojo as well, or just finally add a real Buffered Writer.

from utils.write import _WriteBufferStack as BufWriter
var writer = BufWriter[4096](sys.stdout)
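
For comparison, this is roughly what the equivalent looks like on the Rust side, as a sketch rather than the benchmark's actual code. Rust's stdout is line-buffered by default, so wrapping the locked handle in a `BufWriter` batches many lines into fewer writes. The helper `write_lines` is a made-up name, and the 4096-byte capacity simply mirrors the Mojo snippet above.

```rust
use std::io::{self, BufWriter, Write};

/// Writes `n` numbered lines to any writer.
fn write_lines<W: Write>(out: &mut W, n: usize) -> io::Result<()> {
    for i in 0..n {
        writeln!(out, "line {i}")?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // Lock once and buffer, instead of locking and flushing on every line.
    let mut writer = BufWriter::with_capacity(4096, stdout.lock());
    write_lines(&mut writer, 3)?;
    // Flush explicitly so write errors are reported rather than swallowed on drop.
    writer.flush()
}
```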

Lastly, I have both memchr and memchr_wide. The wide impl is pretty much exactly what is in the Rust memchr crate, operating on 4 SIMD vectors at a time. I opted to keep the not-unrolled version as well, since more often than not you use memchr over short distances (finding your next ',' for example), where the not-unrolled one has a small advantage due to how I set up the pointers to be aligned. But for long distances (finding the next '\n') the wide version is faster. The Reader uses the wide version for finding newlines. The SplitIterator uses the short version for finding delimiters. Your mileage may vary.
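
The short/wide split can be illustrated with a scalar stand-in. This is not the ExtraMojo code or the memchr crate's implementation, and it uses no explicit SIMD: it only shows the control flow, where the wide variant tests a 64-byte block per iteration (the size of four 16-byte vectors, an assumed width) and pinpoints the exact index only after a block reports a hit. The names `find_byte`, `find_byte_wide`, and `block_has` are hypothetical.

```rust
/// Byte-at-a-time search; stands in for the not-unrolled "short distance" variant.
fn find_byte(haystack: &[u8], needle: u8) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

const LANES: usize = 16; // assumed vector width in bytes
const WIDE: usize = 4 * LANES; // four "vectors" per iteration

/// True if any byte in `block` equals `needle`.
/// Written without early exit so the compiler is free to vectorize it.
fn block_has(block: &[u8], needle: u8) -> bool {
    block.iter().fold(false, |acc, &b| acc | (b == needle))
}

/// Wide search: scan whole 64-byte blocks first, then finish the tail bytewise.
fn find_byte_wide(haystack: &[u8], needle: u8) -> Option<usize> {
    let mut offset = 0;
    let mut chunks = haystack.chunks_exact(WIDE);
    for chunk in chunks.by_ref() {
        if block_has(chunk, needle) {
            // A hit somewhere in this block: locate the exact index.
            return find_byte(chunk, needle).map(|i| offset + i);
        }
        offset += WIDE;
    }
    find_byte(chunks.remainder(), needle).map(|i| offset + i)
}

fn main() {
    let mut line = vec![b'a'; 200];
    line[150] = b'\n';
    println!("newline at {:?}", find_byte_wide(&line, b'\n'));
    println!("comma at {:?}", find_byte(b"a,b", b','));
}
```

For short inputs the wide version falls straight through to the bytewise tail, which is one reason a separate short variant can still come out ahead there.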

27 replies

•Created by duck_tape on 1/3/2025 in #community-showcase

A Benchmark with Files and Bytes

Pushed a few more updates to ExtraMojo, which is now around 1.44 times faster on my Mac than the Rust version that uses memchr. All of it comes from improvements to my buffered reader though, nothing to do with improving the find algorithm.

27 replies

•Created by duck_tape on 1/3/2025 in #community-showcase

A Benchmark with Files and Bytes

You could try the Rust version without memchr:

use std::io::stdin;
use std::io::BufRead;
use std::io::BufReader;

struct Record {
    pub name: String,
    pub count: usize,
}

impl Record {
    pub fn new(name: String, count: usize) -> Record {
        Record { name, count }
    }
}

fn create_record(line: &str) -> Record {
    let mut iter = line.split('\t').peekable();
    let name = iter.peek().unwrap().to_string();
    let count = iter.filter(|s| s.contains("bc")).count();
    Record::new(name, count)
}

fn main() {
    let mut records = vec![];
    let mut buffer = String::new();
    let stdin = stdin();
    let mut input = BufReader::new(stdin.lock());
    while let Ok(bytes_read) = input.read_line(&mut buffer) {
        if bytes_read == 0 {
            break;
        }
        buffer.make_ascii_lowercase();
        records.push(create_record(&buffer));
        buffer.clear();
    }
    let count: usize = records.iter().map(|r| r.count).sum();
    println!("{}", count);
}

I'd be pretty willing to bet memchr is the fast bit. But I'd also be curious to see how your rolled-up buffered reader works on it! The only optimization I haven't pushed to ExtraMojo is upping the buffer size to 128 KiB in file.mojo:

# 128 KiB: http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/ioblksize.h;h=266c209f48fc07cb4527139a2548b6398b75f740;hb=HEAD#l23
alias BUF_SIZE: Int = 1024 * 128
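
To try the same buffer size on the Rust side of the comparison, `BufReader::with_capacity` accepts it directly. This is a sketch under the assumption that the Rust program keeps its `read_line` loop; `count_lines` is a made-up helper, not part of the benchmark.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// 128 KiB, the size referenced in the coreutils ioblksize.h comment above.
const BUF_SIZE: usize = 128 * 1024;

/// Counts lines from any reader through a `BUF_SIZE` buffer.
fn count_lines<R: Read>(input: R) -> io::Result<usize> {
    let mut reader = BufReader::with_capacity(BUF_SIZE, input);
    let mut line = String::new();
    let mut count = 0;
    // `read_line` returns the number of bytes read; 0 means end of input.
    while reader.read_line(&mut line)? != 0 {
        count += 1;
        line.clear();
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    let n = count_lines(io::stdin().lock())?;
    println!("{n}");
    Ok(())
}
```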

27 replies

•Created by duck_tape on 1/3/2025 in #community-showcase

A Benchmark with Files and Bytes

Thanks for running that! Just to double check, that was the latest master of ExtraMojo? And that makes sense: memchr really should be faster than my first-attempt SIMD. Wild that Rust has that big a difference between Mac and Intel though.

27 replies

•Created by duck_tape on 1/3/2025 in #community-showcase

A Benchmark with Files and Bytes

That would be awesome! Lmk if you run into any issues running it / setting it up.

27 replies

•Created by duck_tape on 1/3/2025 in #community-showcase

A Benchmark with Files and Bytes

There's always the chance I'm doing something wildly unsafe, or just wrong, that happens to work on this particular dataset as well but would explode elsewhere. I have tests, but that's not the same as running this IRL all over the place.

27 replies

•Created by duck_tape on 1/3/2025 in #community-showcase

A Benchmark with Files and Bytes

Now a bit more pythonic with a context manager! (code updated in forum post)

27 replies

•Created by duck_tape on 1/3/2025 in #community-showcase

A Benchmark with Files and Bytes

I did swap in the memchr crate, which really should be going fast. Someone on the Modular forum post said that in previous benchmarking, for unknown reasons, Mojo tends to have faster SIMD code on M-series Macs compared to Rust 🤷 Agreed though, part of the sell of Mojo is the easy-to-write SIMD, and it really was easy, even for a SIMD newbie like me. It's not that easy in Rust (at least not yet, maybe someday they'll land the portable SIMD stuff).

27 replies

•Created by duck_tape on 1/3/2025 in #community-showcase

A Benchmark with Files and Bytes

Updated with some new numbers after fixing up my SIMD code a bit more and moving Rust to use memchr!

27 replies

•Created by duck_tape on 12/10/2024 in #community-showcase

ExtraMojo

mmap on macOS isn't worth it in my experience, so you end up supporting both ways of doing things, which means sorting out your string lifetimes at the end of the day anyway.

83 replies

•Created by duck_tape on 12/10/2024 in #community-showcase

ExtraMojo

Unrelated to your perf experiments (which I'm following along with!). ExtraMojo v0.2.0 https://github.com/ExtraMojo/ExtraMojo Still very rough around the edges, but it improves the buffered file reader to allocate less and move off the Tensors and use Spans etc. It also adds more byte-string helper functions like split and find. Still very much a "things I need when trying to benchmark against other languages" catchall.

83 replies

•Created by duck_tape on 12/10/2024 in #community-showcase

ExtraMojo

If you’re alluding to io_uring at the end there, totally agree! Obviously both @Mohamed Mabrouk and I fall in the buffered IO camp 🙂 We’ll have to do some bake offs as APIs mature a bit more. I’ve gone fairly deep on exactly this in Rust and everything I’ve done with Mojo so far indicates it’ll be just as speedy.

83 replies

•Created by duck_tape on 12/10/2024 in #community-showcase

ExtraMojo

I’m all for zero copy! How does that work though without a buffer? Buffered IO is just such a common abstraction I didn’t realize there was a zero copy performant way to do it.

83 replies