A Benchmark with Files and Bytes

Crossposting my forum post, since the formatting is a bit nicer there: https://forum.modular.com/t/showcase-a-benchmark-with-files-and-bytes-standard-benchmark-warnings-apply/420 tl;dr: pretty vanilla Mojo was beating out pretty vanilla Rust (all the normal caveats about benchmarks being worthless apply).
a2svior · 3w ago
Oh yeah, take that Rust! 🦀 🚒 :mojo: I also personally find the Mojo implementation more readable than either Rust or Python in this case
duck_tape (OP) · 3w ago
Updated with some new numbers after fixing up my SIMD code a bit more and moving Rust to use memchr!
Lil'Nish · 3w ago
When I get a chance, I'll create a SIMD Rust version
ModularBot · 3w ago
Congrats @Lil'Nish, you just advanced to level 4!
Lil'Nish · 3w ago
Though SIMD is not a huge selling point of Rust, and it requires unsafe or external crates, so even if Rust wins with SIMD it may not be a fair comparison...
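To make that concrete: hand-rolled SIMD in stable Rust today means arch-specific intrinsics behind unsafe. Here is a minimal sketch of a 16-bytes-at-a-time byte search (x86_64/SSE2 assumed; the function name is illustrative, not from the benchmark code):

// Sketch: what hand-written SIMD search looks like in stable Rust,
// using x86_64 SSE2 intrinsics behind `unsafe`.
#[cfg(target_arch = "x86_64")]
fn find_byte_sse2(needle: u8, haystack: &[u8]) -> Option<usize> {
    use std::arch::x86_64::*;
    let mut i = 0;
    unsafe {
        // Broadcast the needle into all 16 lanes of a vector.
        let needles = _mm_set1_epi8(needle as i8);
        while i + 16 <= haystack.len() {
            // Unaligned 16-byte load, lane-wise compare, then collapse
            // the per-lane results into a 16-bit mask.
            let chunk = _mm_loadu_si128(haystack.as_ptr().add(i) as *const __m128i);
            let eq = _mm_cmpeq_epi8(chunk, needles);
            let mask = _mm_movemask_epi8(eq) as u32;
            if mask != 0 {
                return Some(i + mask.trailing_zeros() as usize);
            }
            i += 16;
        }
    }
    // Scalar fallback for the tail.
    haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
}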
duck_tape (OP) · 3w ago
I did swap in the memchr crate, which really should be going fast. Someone on the Modular forum post said that in previous benchmarking, for unknown reasons, Mojo tends to have faster SIMD code on M-series Macs compared to Rust 🤷
Agreed, though: part of the sell of Mojo is the easy-to-write SIMD, and it really was easy, even for a SIMD newbie like me. It's not that easy in Rust (at least not yet; maybe someday they'll land the portable SIMD stuff).
Now a bit more Pythonic with a context manager! (code updated in the forum post)
There's always the chance I'm doing something wildly unsafe, or just wrong, that happens to work on this particular dataset but would explode elsewhere. I have tests, but that's not the same as running this IRL all over the place.
Mohamed Mabrouk
The memchr crate uses SIMD already and goes to great lengths to provide vectorized, arch-specific implementations. I'm not sure there is a better solution in the Rust ecosystem. I can give the code a try on an Intel machine and report the results here.
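For anyone who hasn't used the crate, its core API is tiny; a sketch:

// Sketch: the memchr crate's core API. memchr() finds the first
// occurrence of a byte; memchr_iter() yields every occurrence.
use memchr::{memchr, memchr_iter};

fn first_tab(buf: &[u8]) -> Option<usize> {
    memchr(b'\t', buf)
}

fn count_newlines(buf: &[u8]) -> usize {
    memchr_iter(b'\n', buf).count()
}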
duck_tape (OP) · 3w ago
That would be awesome! Lmk if you run into any issues running it / setting it up.
Mohamed Mabrouk
Here are the hyperfine results after fixing a small import bug in the Rust script (Rust 1.83, Mojo 26.6):
Benchmark 1: < big.tsv ./mojo/count_lines
  Time (mean ± σ):      1.216 s ±  0.007 s    [User: 1.123 s, System: 0.095 s]
  Range (min … max):    1.209 s …  1.225 s    10 runs

Benchmark 2: < big.tsv ./rust/target/release/count_lines
  Time (mean ± σ):      1.006 s ±  0.008 s    [User: 0.901 s, System: 0.105 s]
  Range (min … max):    0.997 s …  1.016 s    10 runs

Summary
  < big.tsv ./rust/target/release/count_lines ran
    1.21 ± 0.01 times faster than < big.tsv ./mojo/count_lines
As expected, the Rust implementation is more performant on Intel x86 than on ARM Macs, and it now beats Mojo.
Lil'Nish · 3w ago
Fair enough, I guess there's no need to manually implement a SIMD version then
duck_tape (OP) · 3w ago
Thanks for running that! Just to double-check: that was the latest master of ExtraMojo? And that makes sense; memchr really should be faster than my first-attempt SIMD. Wild that Rust has that big a difference between Mac and Intel, though.
a2svior · 3w ago
Today we learned that one should use Mojo on M chips and Rust on Intel for best performance 😁
Mohamed Mabrouk
Yes, I used the latest ExtraMojo, and I used the script in the records directory. I can't say whether it's Rust or the memchr crate specifically that is using a subpar algorithm on ARM. Based on my previous experiments, I can match the performance of memchr on x86 with careful optimization: removing unnecessary function calls and allocations, and tuning the buffer size. I can try to see if the rolled-up buffered reader that I have makes a difference.
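For reference, the buffer-size knob is a one-liner on the Rust side via BufReader::with_capacity; a minimal sketch (the 128 KiB value is an assumption, matching the coreutils figure discussed later in the thread):

use std::io::{stdin, BufRead, BufReader};

fn main() {
    // Default BufReader capacity is 8 KiB; a larger buffer means fewer
    // read syscalls on big streaming inputs.
    let stdin = stdin();
    let mut input = BufReader::with_capacity(128 * 1024, stdin.lock());
    let mut line = String::new();
    let mut lines = 0usize;
    while input.read_line(&mut line).expect("read failed") > 0 {
        lines += 1;
        line.clear();
    }
    println!("{}", lines);
}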
duck_tape (OP) · 3w ago
You could try the Rust version without memchr:
use std::io::stdin;
use std::io::BufRead;
use std::io::BufReader;

struct Record {
    pub name: String,
    pub count: usize,
}

impl Record {
    pub fn new(name: String, count: usize) -> Record {
        Record { name, count }
    }
}

fn create_record(line: &str) -> Record {
    let mut iter = line.split('\t').peekable();
    let name = iter.peek().unwrap().to_string();
    let count = iter.filter(|s| s.contains("bc")).count();
    Record::new(name, count)
}

fn main() {
    let mut records = vec![];
    let mut buffer = String::new();
    let stdin = stdin();
    let mut input = BufReader::new(stdin.lock());
    while let Ok(bytes_read) = input.read_line(&mut buffer) {
        if bytes_read == 0 {
            break;
        }
        buffer.make_ascii_lowercase();
        records.push(create_record(&buffer));
        buffer.clear();
    }
    let count: usize = records.iter().map(|r| r.count).sum();
    println!("{}", count);
}
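For comparison, here is a sketch of what the memchr-backed variant might look like (illustrative only, not necessarily the exact code that was benchmarked): slurp stdin and let memchr_iter find the newlines instead of read_line.

use std::io::{stdin, Read};

use memchr::memchr_iter;

fn count_fields(line: &[u8]) -> usize {
    // Count tab-separated fields containing "bc", mirroring create_record
    // above (which also counts the name field, since peek() doesn't
    // consume it).
    line.split(|&b| b == b'\t')
        .filter(|f| f.windows(2).any(|w| w == b"bc"))
        .count()
}

fn main() {
    let mut buf = Vec::new();
    stdin().lock().read_to_end(&mut buf).expect("read failed");
    buf.make_ascii_lowercase();
    let mut count = 0usize;
    let mut start = 0usize;
    for end in memchr_iter(b'\n', &buf) {
        count += count_fields(&buf[start..end]);
        start = end + 1;
    }
    if start < buf.len() {
        // Handle a final line with no trailing newline.
        count += count_fields(&buf[start..]);
    }
    println!("{}", count);
}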
I'd be pretty willing to bet memchr is the fast bit. But I'd also be curious to see how your rolled-up buffered reader does on it! The only optimization I haven't pushed to ExtraMojo is upping the buffer size to 128 KiB in file.mojo:
# 128 KiB: http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/ioblksize.h;h=266c209f48fc07cb4527139a2548b6398b75f740;hb=HEAD#l23
alias BUF_SIZE: Int = 1024 * 128
Pushed a few more updates to ExtraMojo, which is now around 1.44 times faster on my Mac than the Rust-with-memchr version. All just improvements to my buffered reader, though; nothing to do with improving the find algorithm.
Mohamed Mabrouk
Here are the corresponding results on Intel:
hyperfine --warmup 3 '< big.tsv ./mojo/count_lines' '< big.tsv ./rust/target/release/count_lines'
Benchmark 1: < big.tsv ./mojo/count_lines
  Time (mean ± σ):     893.3 ms ±   8.0 ms    [User: 793.7 ms, System: 101.4 ms]
  Range (min … max):   883.1 ms … 911.5 ms    10 runs

Benchmark 2: < big.tsv ./rust/target/release/count_lines
  Time (mean ± σ):      1.012 s ±  0.004 s    [User: 0.897 s, System: 0.116 s]
  Range (min … max):    1.004 s …  1.020 s    10 runs

Summary
  < big.tsv ./mojo/count_lines ran
    1.13 ± 0.01 times faster than < big.tsv ./rust/target/release/count_lines
duck_tape (OP) · 3w ago
Better than I was expecting, even! Same warnings as always: this is very much a toy benchmark, but I'm taking the win in getting on par with Rust perf on this one, and in getting some solid building blocks with the buffered reader and memchr.
Of note, I discovered today that there is a memchr function in the stdlib: https://github.com/modularml/mojo/blob/fa8d0dfcf38fe21dc43ff9887e33fe52edf32800/stdlib/src/utils/stringref.mojo#L700 It works on pointers and is missing a few of the optimizations that I've got, but it's in the standard library, and that counts for a lot.
Something else I learned while working on this: the print function does no buffering. If you want to write to stdout and match other languages, you can use the _WriteBufferStack. I may pull that into ExtraMojo as well, or just finally add a real buffered writer.
from utils.write import _WriteBufferStack as BufWriter
var writer = BufWriter[4096](sys.stdout)
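For comparison, the equivalent pattern on the Rust side is std::io::BufWriter from the standard library:

use std::io::{stdout, BufWriter, Write};

fn main() {
    let stdout = stdout();
    // Writes accumulate in the buffer and flush in large chunks
    // (plus once more when the writer is dropped).
    let mut out = BufWriter::new(stdout.lock());
    for i in 0..1000 {
        writeln!(out, "line {}", i).expect("write failed");
    }
}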
Lastly, I have both memchr and memchr_wide. The wide impl is pretty much exactly what's in the Rust memchr crate, operating on 4 SIMD vectors at a time. I opted to keep the not-unrolled version as well, since more often than not you use memchr over short distances (finding your next ',', for example), where the not-unrolled one has a small advantage due to how I get the pointers set up to be aligned. But over long distances (finding the next '\n') the wide version is faster. The Reader uses the wide version for finding newlines; the SplitIterator uses the short version for finding delimiters. Your mileage may vary.
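As a language-neutral illustration of the "wide" idea, here is a SWAR (SIMD-within-a-register) byte search in Rust: it scans 8 bytes per step using the classic zero-byte trick, where a real implementation like the memchr crate uses full SIMD vectors and unrolls over four of them per iteration. The function name is hypothetical.

// Sketch: treat 8 bytes as one u64 and flag any byte equal to the needle.
fn memchr_swar(needle: u8, haystack: &[u8]) -> Option<usize> {
    const LO: u64 = 0x0101_0101_0101_0101;
    const HI: u64 = 0x8080_8080_8080_8080;
    let pattern = LO * needle as u64; // needle broadcast to all 8 bytes
    let mut chunks = haystack.chunks_exact(8);
    let mut offset = 0;
    for chunk in &mut chunks {
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        // XOR zeroes out bytes equal to the needle; the classic
        // (x - 0x01..) & !x & 0x80.. expression flags those zero bytes.
        let x = word ^ pattern;
        let found = x.wrapping_sub(LO) & !x & HI;
        if found != 0 {
            return Some(offset + (found.trailing_zeros() / 8) as usize);
        }
        offset += 8;
    }
    // Scalar tail for the last < 8 bytes.
    chunks
        .remainder()
        .iter()
        .position(|&b| b == needle)
        .map(|i| offset + i)
}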
Lil'Nish · 3w ago
Does Mojo have small-string optimization by default? If so, it may be appropriate to use ArrayString (or some other stack-based string implementation) in Rust.
duck_tape (OP) · 3w ago
I'm not sure, but I also don't think I'd have hit that with what I'm doing; I stay strictly in the realm of Spans.
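In Rust terms, the analogous move would be a Record that borrows its name from the line buffer instead of allocating a String, so small-string optimization never comes into play. A sketch (illustrative, not the benchmarked code):

// Sketch: a zero-copy Record that borrows from the line instead of
// allocating, sidestepping small-string concerns entirely.
struct Record<'a> {
    name: &'a str,
    count: usize,
}

fn create_record(line: &str) -> Record<'_> {
    let mut iter = line.split('\t').peekable();
    // peek() borrows the first field without consuming it, matching the
    // earlier version's behavior (the name field is also counted below).
    let name = *iter.peek().unwrap_or(&"");
    let count = iter.filter(|s| s.contains("bc")).count();
    Record { name, count }
}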
