A Benchmark with Files and Bytes

Crossposting my forum post, since the formatting is a bit nicer there: https://forum.modular.com/t/showcase-a-benchmark-with-files-and-bytes-standard-benchmark-warnings-apply/420 tl;dr: pretty vanilla Mojo was beating out pretty vanilla Rust (all the normal caveats about benchmarks being worthless apply).
a2svior · 3w ago
Oh yeah, take that Rust! 🦀 🚒 :mojo: I also personally find the Mojo implementation more readable than either Rust or Python in this case
duck_tape (OP) · 3w ago
Updated with some new numbers after fixing up my SIMD code a bit more and moving Rust to use memchr!
Lil'Nish · 3w ago
When I get a chance, I'll create a SIMD Rust version
ModularBot · 3w ago
Congrats @Lil'Nish, you just advanced to level 4!
Lil'Nish · 3w ago
Though SIMD is not a huge selling point of Rust, and it requires unsafe or external crates, so even if Rust wins with SIMD it may not be a fair comparison...
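To make that concrete: hand-rolled SIMD in stable Rust today means arch-specific intrinsics behind unsafe. Here is a minimal sketch of a 16-bytes-at-a-time byte search (x86_64/SSE2 assumed; the function name is illustrative, not from the benchmark code):

// Sketch: what hand-written SIMD search looks like in stable Rust,
// using x86_64 SSE2 intrinsics behind `unsafe`.
#[cfg(target_arch = "x86_64")]
fn find_byte_sse2(needle: u8, haystack: &[u8]) -> Option<usize> {
    use std::arch::x86_64::*;
    let mut i = 0;
    unsafe {
        // Broadcast the needle into all 16 lanes of a vector.
        let needles = _mm_set1_epi8(needle as i8);
        while i + 16 <= haystack.len() {
            // Unaligned 16-byte load, lane-wise compare, then collapse
            // the per-lane results into a 16-bit mask.
            let chunk = _mm_loadu_si128(haystack.as_ptr().add(i) as *const __m128i);
            let eq = _mm_cmpeq_epi8(chunk, needles);
            let mask = _mm_movemask_epi8(eq) as u32;
            if mask != 0 {
                return Some(i + mask.trailing_zeros() as usize);
            }
            i += 16;
        }
    }
    // Scalar fallback for the tail.
    haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
}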
duck_tape (OP) · 3w ago
I did swap in the memchr crate, which really should be going fast. Someone on the Modular forum post said that in previous benchmarking, for unknown reasons, Mojo tends to have faster SIMD code on M-series Macs compared to Rust 🤷
Agreed, though: part of the sell of Mojo is the easy-to-write SIMD, and it really was easy, even for a SIMD newbie like me. It's not that easy in Rust (at least not yet; maybe someday they'll land the portable SIMD stuff).
Now a bit more Pythonic with a context manager! (code updated in the forum post)
There's always the chance I'm doing something wildly unsafe, or just wrong, that happens to work on this particular dataset but would explode elsewhere. I have tests, but that's not the same as running this IRL all over the place.
Mohamed Mabrouk
The memchr crate uses SIMD already and goes to great lengths to provide vectorized, arch-specific implementations. I'm not sure there is a better solution in the Rust ecosystem. I can give the code a try on an Intel machine and report the results here.
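For anyone who hasn't used the crate, its core API is tiny; a sketch:

// Sketch: the memchr crate's core API. memchr() finds the first
// occurrence of a byte; memchr_iter() yields every occurrence.
use memchr::{memchr, memchr_iter};

fn first_tab(buf: &[u8]) -> Option<usize> {
    memchr(b'\t', buf)
}

fn count_newlines(buf: &[u8]) -> usize {
    memchr_iter(b'\n', buf).count()
}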
duck_tape (OP) · 3w ago
That would be awesome! Lmk if you run into any issues running it / setting it up.
Mohamed Mabrouk
Here are the hyperfine results after fixing a small import bug in the Rust script (Rust 1.83, Mojo 26.6):
Benchmark 1: < big.tsv ./mojo/count_lines
  Time (mean ± σ):      1.216 s ±  0.007 s    [User: 1.123 s, System: 0.095 s]
  Range (min … max):    1.209 s …  1.225 s    10 runs

Benchmark 2: < big.tsv ./rust/target/release/count_lines
  Time (mean ± σ):      1.006 s ±  0.008 s    [User: 0.901 s, System: 0.105 s]
  Range (min … max):    0.997 s …  1.016 s    10 runs

Summary
  < big.tsv ./rust/target/release/count_lines ran
    1.21 ± 0.01 times faster than < big.tsv ./mojo/count_lines
As expected, the Rust implementation is more performant on Intel x86 than on ARM Macs, and it now beats Mojo.
Lil'Nish · 3w ago
Fair enough, I guess there's no need to manually implement a SIMD version then
duck_tape (OP) · 3w ago
Thanks for running that! Just to double-check: that was the latest master of ExtraMojo? And that makes sense; memchr really should be faster than my first-attempt SIMD. Wild that Rust has that big a difference between Mac and Intel, though.
a2svior · 3w ago
Today we learned that one should use Mojo on M chips and Rust on Intel for best performance 😁
Mohamed Mabrouk
Yes, I used the latest ExtraMojo, and I used the script in the records directory. I can't say whether it's Rust or the memchr crate specifically that is using a subpar algorithm on ARM. Based on my previous experiments, I can match the performance of memchr on x86 with careful optimization: removing unnecessary function calls and allocations, and tuning the buffer size. I can try to see if the rolled-up buffered reader that I have makes a difference.
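For reference, the buffer-size knob is a one-liner on the Rust side via BufReader::with_capacity; a minimal sketch (the 128 KiB value is an assumption, matching the coreutils figure discussed later in the thread):

use std::io::{stdin, BufRead, BufReader};

fn main() {
    // Default BufReader capacity is 8 KiB; a larger buffer means fewer
    // read syscalls on big streaming inputs.
    let stdin = stdin();
    let mut input = BufReader::with_capacity(128 * 1024, stdin.lock());
    let mut line = String::new();
    let mut lines = 0usize;
    while input.read_line(&mut line).expect("read failed") > 0 {
        lines += 1;
        line.clear();
    }
    println!("{}", lines);
}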
duck_tape (OP) · 3w ago
You could try the Rust version without memchr:
use std::io::stdin;
use std::io::BufRead;
use std::io::BufReader;

struct Record {
    pub name: String,
    pub count: usize,
}

impl Record {
    pub fn new(name: String, count: usize) -> Record {
        Record { name, count }
    }
}

fn create_record(line: &str) -> Record {
    let mut iter = line.split('\t').peekable();
    let name = iter.peek().unwrap().to_string();
    let count = iter.filter(|s| s.contains("bc")).count();
    Record::new(name, count)
}

fn main() {
    let mut records = vec![];
    let mut buffer = String::new();
    let stdin = stdin();
    let mut input = BufReader::new(stdin.lock());
    while let Ok(bytes_read) = input.read_line(&mut buffer) {
        if bytes_read == 0 {
            break;
        }
        buffer.make_ascii_lowercase();
        records.push(create_record(&buffer));
        buffer.clear();
    }
    let count: usize = records.iter().map(|r| r.count).sum();
    println!("{}", count);
}
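For comparison, here is a sketch of what the memchr-backed variant might look like (illustrative only, not necessarily the exact code that was benchmarked): slurp stdin and let memchr_iter find the newlines instead of read_line.

use std::io::{stdin, Read};

use memchr::memchr_iter;

fn count_fields(line: &[u8]) -> usize {
    // Count tab-separated fields containing "bc", mirroring create_record
    // above (which also counts the name field, since peek() doesn't
    // consume it).
    line.split(|&b| b == b'\t')
        .filter(|f| f.windows(2).any(|w| w == b"bc"))
        .count()
}

fn main() {
    let mut buf = Vec::new();
    stdin().lock().read_to_end(&mut buf).expect("read failed");
    buf.make_ascii_lowercase();
    let mut count = 0usize;
    let mut start = 0usize;
    for end in memchr_iter(b'\n', &buf) {
        count += count_fields(&buf[start..end]);
        start = end + 1;
    }
    if start < buf.len() {
        // Handle a final line with no trailing newline.
        count += count_fields(&buf[start..]);
    }
    println!("{}", count);
}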
I'd be pretty willing to bet memchr is the fast bit. But I'd also be curious to see how your rolled-up buffered reader does on it! The only optimization I haven't pushed to ExtraMojo is upping the buffer size to 128 KiB in file.mojo:
# 128 KiB: http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/ioblksize.h;h=266c209f48fc07cb4527139a2548b6398b75f740;hb=HEAD#l23
alias BUF_SIZE: Int = 1024 * 128
Pushed a few more updates to ExtraMojo, which is now around 1.44 times faster on my Mac than the Rust-with-memchr version. All just improvements to my buffered reader, though; nothing to do with improving the find algorithm.
Mohamed Mabrouk
Here are the corresponding results on Intel:
hyperfine --warmup 3 '< big.tsv ./mojo/count_lines' '< big.tsv ./rust/target/release/count_lines'
Benchmark 1: < big.tsv ./mojo/count_lines
  Time (mean ± σ):     893.3 ms ±   8.0 ms    [User: 793.7 ms, System: 101.4 ms]
  Range (min … max):   883.1 ms … 911.5 ms    10 runs

Benchmark 2: < big.tsv ./rust/target/release/count_lines
  Time (mean ± σ):      1.012 s ±  0.004 s    [User: 0.897 s, System: 0.116 s]
  Range (min … max):    1.004 s …  1.020 s    10 runs

Summary
  < big.tsv ./mojo/count_lines ran
    1.13 ± 0.01 times faster than < big.tsv ./rust/target/release/count_lines
duck_tape (OP) · 3w ago
Better than I was expecting, even! Same warnings as always: this is very much a toy benchmark, but I'm taking the win in getting on par with Rust perf on this one, and in getting some solid building blocks with the buffered reader and memchr.
Of note, I discovered today that there is a memchr function in the stdlib: https://github.com/modularml/mojo/blob/fa8d0dfcf38fe21dc43ff9887e33fe52edf32800/stdlib/src/utils/stringref.mojo#L700 It works on pointers and is missing a few of the optimizations that I've got, but it's in the standard library, and that counts for a lot.
Something else I learned while working on this: the print function does no buffering. If you want to write to stdout and match other languages, you can use the _WriteBufferStack. I may pull that into ExtraMojo as well, or just finally add a real buffered writer.
from utils.write import _WriteBufferStack as BufWriter
var writer = BufWriter[4096](sys.stdout)
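For comparison, the equivalent pattern on the Rust side is std::io::BufWriter from the standard library:

use std::io::{stdout, BufWriter, Write};

fn main() {
    let stdout = stdout();
    // Writes accumulate in the buffer and flush in large chunks
    // (plus once more when the writer is dropped).
    let mut out = BufWriter::new(stdout.lock());
    for i in 0..1000 {
        writeln!(out, "line {}", i).expect("write failed");
    }
}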
Lastly, I have both memchr and memchr_wide. The wide impl is pretty much exactly what's in the Rust memchr crate, operating on 4 SIMD vectors at a time. I opted to keep the not-unrolled version as well, since more often than not you use memchr over short distances (finding your next ',', for example), where the not-unrolled one has a small advantage due to how I get the pointers set up to be aligned. But over long distances (finding the next '\n') the wide version is faster. The Reader uses the wide version for finding newlines; the SplitIterator uses the short version for finding delimiters. Your mileage may vary.
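As a language-neutral illustration of the "wide" idea, here is a SWAR (SIMD-within-a-register) byte search in Rust: it scans 8 bytes per step using the classic zero-byte trick, where a real implementation like the memchr crate uses full SIMD vectors and unrolls over four of them per iteration. The function name is hypothetical.

// Sketch: treat 8 bytes as one u64 and flag any byte equal to the needle.
fn memchr_swar(needle: u8, haystack: &[u8]) -> Option<usize> {
    const LO: u64 = 0x0101_0101_0101_0101;
    const HI: u64 = 0x8080_8080_8080_8080;
    let pattern = LO * needle as u64; // needle broadcast to all 8 bytes
    let mut chunks = haystack.chunks_exact(8);
    let mut offset = 0;
    for chunk in &mut chunks {
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        // XOR zeroes out bytes equal to the needle; the classic
        // (x - 0x01..) & !x & 0x80.. expression flags those zero bytes.
        let x = word ^ pattern;
        let found = x.wrapping_sub(LO) & !x & HI;
        if found != 0 {
            return Some(offset + (found.trailing_zeros() / 8) as usize);
        }
        offset += 8;
    }
    // Scalar tail for the last < 8 bytes.
    chunks
        .remainder()
        .iter()
        .position(|&b| b == needle)
        .map(|i| offset + i)
}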
Lil'Nish · 3w ago
Does Mojo have small-string optimization by default? If so, it may be appropriate to use ArrayString (or some other stack-based string implementation) in Rust.
duck_tape (OP) · 3w ago
I'm not sure, but I also don't think I'd have hit that with what I'm doing; I stay strictly in the realm of Spans.
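In Rust terms, the analogous move would be a Record that borrows its name from the line buffer instead of allocating a String, so small-string optimization never comes into play. A sketch (illustrative, not the benchmarked code):

// Sketch: a zero-copy Record that borrows from the line instead of
// allocating, sidestepping small-string concerns entirely.
struct Record<'a> {
    name: &'a str,
    count: usize,
}

fn create_record(line: &str) -> Record<'_> {
    let mut iter = line.split('\t').peekable();
    // peek() borrows the first field without consuming it, matching the
    // earlier version's behavior (the name field is also counted below).
    let name = *iter.peek().unwrap_or(&"");
    let count = iter.filter(|s| s.contains("bc")).count();
    Record { name, count }
}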
