ExtraMojo
https://github.com/sstadick/ExtraMojo
ExtraMojo has been updated to support the latest Mojo and Magic, and now has more tests and examples!
ExtraMojo is just things I wish were in the stdlib, which mostly means a buffered file reader that can read files by line (or any delimiter), and a tiny regex implementation.
Feedback welcome and appreciated.
Hah. I actually wanted this while working on AoC. The stdlib is something you can contribute to, no?
I hadn't looked, but will now!
The stdlib is open for contributions.
Hi @Darkmatter - I'm sure there's a better place to ask this, I just haven't found it!
What's the best path to contributing to the standard library? There's stuff I'd love to work on, like adding a buffered file reader / line-by-line reading and such, but I'm also sure you have a plan for what features you want and when you want them?
Also helper functions around the String API, like split and strip and such but with iterators. I'm just guessing the Iterator API isn't solid yet and that's why that hasn't been taken up?
The best path to getting things into the standard library is to open PRs, at which point discussion will commence.
Proper iterators are blocked by the lack of parametric traits, which is a general theme in the stdlib. We want to have split and strip, but we need to generalize them (and also strings need to do UTF-8 properly). Generalizing them means a lot of type system things that aren't well supported yet. There are also a few places where things are technically possible but we've been asked to not do it that way, such as making print(List[T](...)) work.
The things we can generally agree belong in the stdlib are things that can also be found in Rust's core and alloc crates.
Beyond that, it's a matter of debate between batteries included and the lessons of C++.
Those factors have resulted in a bit of a stall in stdlib development, since the compiler team was focused on the needs of MAX for a bit.

Thank you for the detailed answer! That all makes a lot of sense.
If I find nice carve-outs (buffered by-line reader) that don't conflict with known in-waiting things for the stdlib (like strip/split), I'll look into making a PR to the stdlib to begin the discussion 👍
It does feel like a weird stalemate right now where there are almost enough features to do everything, but enough change coming that it's not worth doing yet. Overall very excited for Mojo this year.
You're exactly right on the stalemate, that's why stdlib work has stalled.
It is nice that you found the buffered line reader useful. I originally hand-rolled it for my needs and made several modifications to it over time. It is also not a straightforward implementation, as it returns tensors of bytes rather than strings (it was started in the pre-List era). I was planning on splitting a more complicated version of the buffered reader (and a companion buffered writer) out as its own package. If you are interested in the buffered IO part, we can work on a more Pythonic implementation, which, if useful, could eventually find its way into the stdlib (Python does buffered reads by default in some cases).
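For reference, the pattern under discussion looks roughly like this minimal C sketch (function and buffer names are illustrative, not ExtraMojo's API): fill a fixed buffer, hand out complete lines, and carry the partial tail forward before each refill.

```c
#include <stdio.h>
#include <string.h>

#define BUF_CAP (64 * 1024)

/* Calls cb once per line (without the newline). Lines longer than
   BUF_CAP are not handled; a real reader would grow the buffer or error. */
void for_each_line(FILE *fp, void (*cb)(const char *line, size_t len)) {
    char buf[BUF_CAP];
    size_t have = 0;
    for (;;) {
        size_t n = fread(buf + have, 1, BUF_CAP - have, fp);
        have += n;
        char *start = buf;
        char *nl;
        while ((nl = memchr(start, '\n', have - (size_t)(start - buf))) != NULL) {
            cb(start, (size_t)(nl - start));
            start = nl + 1;
        }
        size_t rest = have - (size_t)(start - buf);
        if (n == 0) {                  /* EOF: flush any partial last line */
            if (rest > 0) cb(start, rest);
            return;
        }
        memmove(buf, start, rest);     /* carry the partial line forward */
        have = rest;
    }
}
```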
IO traits are one of those things stalled on more type system stuff, and I'm aiming to do zero-copy by default (which means no buffering).
Very interested!
Personally I want the Rust suite of IO traits, or something like it, which I’ve found to have the full range of high/low level control.
I really want to dive into a big project around this but don’t want to get too ahead of where the language is either, so I’ve just been building “as needed” type things.
I’m all for zero copy! How does that work though without a buffer? Buffered IO is just such a common abstraction I didn’t realize there was a zero copy performant way to do it.
Essentially, you do IO into a buffer, and then you only move a tiny amount of the data, say a header, and keep a pointer around to the buffer after the header. For instance, for a safetensors model, you would ask the OS how large the file is, make a buffer that large, and then tell the OS "read that whole file into this buffer, let me know when you're done", or "wake me up when you're done" for blocking IO. Then, you read the JSON that's at the start of the file to learn about all of the tensors. You make a dict to hold the tensor name -> tensor mappings, but you crucially never move the multi-GB arrays of floats around.
There's no reason to buffer here because you know how much you want.
If you have a normal buffered IO interface, it does dumb things when you try stuff like this.
If you want to, say, iterate the lines of a file, you should read in the whole file since the OS can do that for you while you do other things, instead of asking the OS for roughly block-sized chunks of data over and over.
There are cases where you only want to read some of the file, but those are much more rare than "read in this whole file and pass it to a JSON library", so I don't think they deserve to be the default.
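A minimal C sketch of that "size it, read it all, keep pointers" approach (error handling trimmed; the header-parsing step stands in for, e.g., the JSON at the front of a safetensors file):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns a heap buffer holding the entire file, or NULL on error. */
unsigned char *read_whole_file(const char *path, size_t *len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;                    /* ask the OS how large the file is */
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }

    unsigned char *buf = malloc((size_t)st.st_size);
    if (buf == NULL) { close(fd); return NULL; }

    size_t off = 0;                    /* read the whole file into the buffer */
    while (off < (size_t)st.st_size) {
        ssize_t n = read(fd, buf + off, (size_t)st.st_size - off);
        if (n <= 0) { free(buf); close(fd); return NULL; }
        off += (size_t)n;
    }
    close(fd);

    *len = off;
    /* Parse the small header at the front of buf, then keep pointers into
       buf for the big payloads instead of copying them anywhere. */
    return buf;
}
```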
If you use mmap, you don't need to preallocate a buffer, and the OS can figure out from the page access pattern what you want and that you aren't modifying the buffer, which saves on memory.
If you toss a few madvise calls in there, the OS will be able to help you only keep just enough in memory for your drive to keep up.
This is sufficient for most use cases.
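As a rough C sketch of the mmap + madvise approach just described (assuming a POSIX system, with Linux semantics for the readahead hint):

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Maps the whole file read-only; the kernel pages it in on demand. */
const unsigned char *map_file(const char *path, size_t *len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                         /* the mapping keeps the file alive */
    if (p == MAP_FAILED) return NULL;

    /* Hint the access pattern: aggressive readahead, drop pages behind us. */
    madvise(p, (size_t)st.st_size, MADV_SEQUENTIAL);

    *len = (size_t)st.st_size;
    return p;
}
```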
For more advanced use cases, unbuffered IO is usually preferred since the application will do its own caching, and there are some new toys in Linux we may be able to expose to help with that.

I think for some use cases reading the whole file at once would not work. I work regularly with file sizes of tens to hundreds of GB per file, and it may not be feasible for most systems to even have that much memory. mmap could be used in this case, but I haven't tried it yet. Also, in my early experiments I found that doing buffered IO and parsing with a smaller buffer (64 KB) is as efficient or even slightly more efficient than reading the whole file at once (at least on my hardware) for file sizes around 1 GB.
For the zero-copy iterator, I currently have a version which returns byte spans instead of doing additional memory allocation; this minimizes the memalloc overhead at least. The main problem with this approach is the lifetime associated with the span and when it gets invalidated by a buffer refill.
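That span-returning iterator, sketched in C to make the borrowing explicit (names are illustrative): each call yields a pointer + length into the caller's buffer, which is exactly why a refill invalidates it.

```c
#include <stddef.h>
#include <string.h>

/* A borrowed view into someone else's buffer: pointer + length, no copy. */
typedef struct { const unsigned char *ptr; size_t len; } Span;

/* Advances *cursor through buf; returns 0 once the buffer is exhausted.
   The returned span borrows from buf, so any refill of buf invalidates it. */
int next_line(const unsigned char *buf, size_t buf_len,
              size_t *cursor, Span *out) {
    if (*cursor >= buf_len) return 0;
    const unsigned char *start = buf + *cursor;
    const unsigned char *nl = memchr(start, '\n', buf_len - *cursor);
    out->ptr = start;
    out->len = nl != NULL ? (size_t)(nl - start) : buf_len - *cursor;
    *cursor += out->len + (nl != NULL ? 1 : 0);
    return 1;
}
```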
This is where mmap comes in.

In general, I think buffered IO is a nice interface to have in addition to the fancier IO strategies. Coming from other languages and domains, it's one of the tools that you intuitively reach for in the beginning.
The kernel will transparently do buffered IO for you, but will automatically tune things for the drives you are pulling data from.
I got so many segfaults when I tried to use the mmap API through early Mojo FFI that it kept me away 😅
I didn't have any issues.
And IIRC MAX will actually mmap .onnx files.

It would be great if you could pass around some example code. I tried this in early Mojo (before 0.6) and I would like to circle back to it again now.
I use external_call, and I got the constants by running the C preprocessor over a file.
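That trick looks roughly like this in practice: a throwaway C file you compile and run (or feed to cc -E) to recover the platform-specific values to hard-code on the Mojo side, since they differ between Linux and macOS. The file name is illustrative.

```c
/* print_consts.c: prints the values needed for external_call on this
   platform; do not assume these numbers are portable across OSes. */
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    printf("PROT_READ       = %d\n", PROT_READ);
    printf("MAP_PRIVATE     = %d\n", MAP_PRIVATE);
    printf("MADV_SEQUENTIAL = %d\n", MADV_SEQUENTIAL);
    return 0;
}
```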
If you’re alluding to io_uring at the end there, totally agree!
Obviously both @Mohamed Mabrouk and I fall in the buffered IO camp 🙂
We'll have to do some bake-offs as the APIs mature a bit more. I've gone fairly deep on exactly this in Rust, and everything I've done with Mojo so far indicates it'll be just as speedy.
I'm a database person, so anything less than the rated read speed of the drives is slow for me 🙂
We'll see.
It's really a matter of how you use it. There are ways to cause crippling perf issues with both styles, but my thought is that most programs are less likely to have issues with the mmap style, and it removes the question of how to tune the buffer size and lets the kernel answer that.
In my experiments, I'm running a zero-copy buffered line iterator at around 5 GB/s from an NVMe disk. I will play around with mmap to see how it works and whether I can extract a bit more performance from that strategy (at least avoiding the syscall overhead).
You want mmap and madvise(MADV_SEQUENTIAL).

I did some initial experiments on mmap vs the buffered reader. For small files (less than 10 GB) the difference is negligible; however, for large files (50 GB) mmap is 2x slower than a buffered reader with a 4, 8, or 64 KB buffer size. This is also after using the MADV_SEQUENTIAL directive and varying the chunk size for reading from the mapped file from simdsizeof[UINT8]() up to 4096 elements. I am not sure I am reading the mmapped files in the best way, though.
I made a quick plot of the reading times (file sizes are 1.5 GB, 8.5 GB, and 50 GB).

I've just found out we have no way to check the "magic pointer values" POSIX uses, so I can't actually get an error code out of mmap. I'm trying to tell it to pre-fault things and to use hugepages (which I have), but I'm getting mystery errors I can't see because we have no way to get at errno.
GitHub: [Feature Request] Errno Getter · Issue #3921 · modularml/mojo
I'll check back on this when I can actually use mmap properly. We may need to apply some extra flags to avoid tons of extra page faults for the mmap version.
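For context, the flags in question probably look something like this C sketch, assuming Linux: MAP_POPULATE pre-faults the whole mapping, and it checks the POSIX "magic pointer" MAP_FAILED plus errno, the part that wasn't reachable from Mojo at the time. Note that MAP_HUGETLB on a regular file fails (it requires hugetlbfs-backed files), which is one plausible source of mystery errors.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Pre-fault the whole mapping up front so later reads take no page faults. */
void *map_populated(int fd, size_t len) {
    void *p = mmap(NULL, len, PROT_READ,
                   MAP_PRIVATE | MAP_POPULATE,  /* MAP_POPULATE is Linux-only */
                   fd, 0);
    if (p == MAP_FAILED) {             /* the POSIX "magic pointer value" */
        fprintf(stderr, "mmap: %s\n", strerror(errno));
        return NULL;
    }
    return p;
}
```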
That would be great; the cost of page faults becomes non-trivial with increasing file size.
4k pages are not good for a lot of things honestly.
1G pages would mean the 50 GB file would have 50 page faults, instead of ~10 million.
Would the huge page config differ between macOS and Linux, or is it a POSIX thing? Also, how portable would this setup be to Windows?
MacOS does 16K and nothing else.
Windows pretends 1G doesn't exist on x86.
Linux and FreeBSD expose it all properly.
However, if you are frequently doing 50GB IOs on Windows you will run into NTFS very quickly.
The windows filesystem (NTFS) isn't really capable of keeping pace with a raid array.
ReFS is a bit better, roughly what *nix had in 2003.
Linux also supports 512 GB pages on RISC-V.
It's user-specific; most people won't use 50 GB files on Windows. I am just cloning a cross-platform industry-standard tool in the field to Mojo, and I want to have good cross-platform sequential IO as the first step. I would like to ideally support multithreading at some point, and mmap is much more ergonomic than a buffered reader on Linux, but it seems less predictable and portable on Mac and Windows.
mmap helps for MT, since you just map it all in and then go to town. It is quite a bit nicer for non-sequential IO.
We could also do both.
Have "read_all" variants for mmap with the populate flag and normal IO.
I will probably keep experimenting for a bit to see how this works. One obvious advantage is that references into the mapped file remain valid as long as the file is still mapped; that would simplify multi-threading, where threads can be in different stages of a multi-step pipeline.
This is a big one, it makes lifetimes easier for the "simple IO" case.
Having a buffer that gets wiped out makes MT hard.
Unrelated to your perf experiments (which I'm following along with!): ExtraMojo v0.2.0 https://github.com/ExtraMojo/ExtraMojo
Still very rough around the edges, but it improves the buffered file reader to allocate less and move off the Tensors and use Spans etc. It also adds more byte-string helper functions like split and find. Still very much a "things I need when trying to benchmark against other languages" catchall.
mmap on osx isn't worth it in my experience, so you end up supporting both ways of doing things, which means sorting out your string lifetimes at the end of the day anyways.
Of course apple is "special"...