ExtraMojo
https://github.com/sstadick/ExtraMojo
ExtraMojo has been updated to support the latest Mojo and Magic, and now has more tests and examples!
ExtraMojo is just things I wish were in the stdlib, which mostly means a buffered file reader that can read files by line (or any delimiter), and a tiny regex implementation.
Feedback welcome and appreciated.
Hah. I actually wanted this while working on AoC. The stdlib is something you can contribute to, no?
I hadn't looked, but will now!
The stdlib is open for contributions.
Hi @Darkmatter - I'm sure there's a better place to ask this, I just haven't found it!
What's the best path to contributing to the standard library? There's stuff I'd love to work on, like adding a buffered file reader / line-by-line reading and such, but I'm also sure you have a plan for what features you want and when you want them?
Also helper functions around the String API, like split and strip and such but with iterators. I'm just guessing the Iterator API isn't solid yet and that's why that hasn't been taken up?
The best path to getting things into the standard library is to open PRs, at which point discussion will commence.
Proper iterators are blocked by the lack of parametric traits, which is a general theme in the stdlib. We want to have split and strip, but we need to generalize them (and strings also need to do UTF-8 properly). Generalizing them means a lot of type system things that aren't well supported yet. There are also a few places where things are technically possible but we've been asked not to do it that way, such as making `print(List[T](...))` work.
The things we can generally agree belong in the stdlib are things that can also be found in Rust's `core` and `alloc` crates.
Beyond that, it's a matter of debate between batteries included and the lessons of C++.
Those factors have resulted in a bit of a stall in stdlib development, since the compiler team was focused on the needs of MAX for a bit.

Thank you for the detailed answer! That all makes a lot of sense.
If I find nice carve-outs (like the buffered by-line reader) that don't conflict with known in-waiting things for the stdlib (like strip/split), I'll look into making a PR to the stdlib to begin the discussion 👍
It does feel like a weird stalemate right now where there's almost enough features to do everything, but enough change coming that it's not worth doing yet. Overall very excited for Mojo this year.
You're exactly right on the stalemate, that's why stdlib work has stalled.
It is nice that you found the buffered line reader useful. I originally hand-rolled it for my needs and made several modifications to it over time. It is also not a straightforward implementation, as it returns a tensor of bytes and not strings (it was started in the pre-List era). I was planning on splitting a more complicated version of the buffered reader (and a companion buffered writer) into its own package. If you are interested in the buffered IO part, we can work on a more Pythonic implementation, which, if useful, could eventually find its way into the stdlib (Python does buffered reads by default in some cases).
IO traits are one of those things stalled on more type system stuff, and I'm aiming to do zero-copy by default (which means no buffering).
Very interested!
Personally I want the Rust suite of IO traits, or something like it, which I’ve found to have the full range of high/low level control.
I really want to dive into a big project around this but don’t want to get too ahead of where the language is either, so I’ve just been building “as needed” type things.
I’m all for zero copy! How does that work though without a buffer? Buffered IO is just such a common abstraction I didn’t realize there was a zero copy performant way to do it.
Essentially, you do IO into a buffer, and then you only move a tiny amount of the data, say a header, and keep a pointer around to the buffer after the header. For instance, for a safetensors model, you would ask the OS how large the file is, make a buffer that large, and then tell the OS "read that whole file into this buffer, let me know when you're done", or "wake me up when you're done" for blocking IO. Then, you read the JSON that's at the start of the file to learn about all of the tensors. You make a dict to hold the tensor name -> tensor mappings, but you crucially never move the multi-GB arrays of floats around.
There's no reason to buffer here because you know how much you want.
If you have a normal buffered IO interface, it does dumb things when you try stuff like this.
If you want to, say, iterate the lines of a file, you should read in the whole file since the OS can do that for you while you do other things, instead of asking the OS for roughly block-sized chunks of data over and over.
There are cases where you only want to read some of the file, but those are much more rare than "read in this whole file and pass it to a JSON library", so I don't think they deserve to be the default.
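A minimal Rust sketch of that whole-file pattern. The header format here is invented for illustration (a 4-byte little-endian length followed by header bytes), standing in for the safetensors JSON:

```rust
use std::collections::HashMap;
use std::fs;

/// Read the whole file once, then hand out borrowed slices into that
/// buffer instead of copying the large payloads around.
fn load(buf: &[u8]) -> HashMap<String, &[u8]> {
    let header_len = u32::from_le_bytes(buf[..4].try_into().unwrap()) as usize;
    let _header = &buf[4..4 + header_len]; // tensor names/offsets would live here
    let mut tensors = HashMap::new();
    // Pretend the header said one tensor spans the rest of the file.
    tensors.insert("weights".to_string(), &buf[4 + header_len..]);
    tensors // the multi-GB float arrays were never moved
}

fn main() -> std::io::Result<()> {
    let buf = fs::read("model.safetensors")?; // one "read it all" request to the OS
    let tensors = load(&buf);
    println!("weights: {} bytes", tensors["weights"].len());
    Ok(())
}
```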
If you use `mmap`, you don't need to preallocate a buffer, and the OS can figure out from the page access pattern what you want and that you aren't modifying the buffer, which saves on memory.
If you toss a few madvise calls in there, the OS will help you keep just enough in memory for your drive to keep up.
This is sufficient for most use cases.
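A minimal sketch of the mmap + madvise combination via the libc crate (Rust as a stand-in for the eventual Mojo FFI, with error handling kept terse):

```rust
use std::fs::File;
use std::os::fd::AsRawFd;

/// Map a file read-only and tell the kernel we'll scan it sequentially,
/// so it can read ahead of us and drop pages behind us.
fn main() -> std::io::Result<()> {
    let file = File::open("big.txt")?; // assumes a non-empty file
    let len = file.metadata()?.len() as usize;
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ,
            libc::MAP_PRIVATE,
            file.as_raw_fd(),
            0,
        )
    };
    if ptr == libc::MAP_FAILED {
        return Err(std::io::Error::last_os_error());
    }
    unsafe { libc::madvise(ptr, len, libc::MADV_SEQUENTIAL) };
    let data = unsafe { std::slice::from_raw_parts(ptr as *const u8, len) };
    // Count newlines without ever issuing an explicit read() call.
    let lines = data.iter().filter(|&&b| b == b'\n').count();
    println!("{lines} lines");
    unsafe { libc::munmap(ptr, len) };
    Ok(())
}
```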
For more advanced use cases, unbuffered IO is usually preferred since the application will do its own caching, and there are some new toys in Linux we may be able to expose to help with that.

I think for some use cases reading the whole file at once would not work. I regularly work with file sizes of tens to hundreds of GB per file, and most systems may not even have that much memory. mmap could be used in this case, but I haven't tried it yet. Also, in my early experiments I found that doing buffered IO and parsing with a smaller buffer (64 KB) is as efficient or even slightly more efficient than reading the whole file at once (at least on my hardware) for file sizes around 1 GB.
For the zero-copy iterator, I currently have a version that returns byte spans instead of doing additional memory allocation, which at least minimizes the memalloc overhead. The main problem with this approach is the lifetime associated with the span, and how it should be invalidated upon buffer refill.
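For comparison, here's how that span-returning design looks sketched in Rust, where the borrow checker enforces exactly the invalidation rule described: the returned slice borrows from the reader, so it cannot be held across the next refill.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Each call hands back a slice borrowed from the reader's internal
/// buffer; the borrow is tied to `&mut self`, so a caller cannot keep a
/// line alive across the refill the next call may trigger.
struct LineReader<R: Read> {
    inner: BufReader<R>,
    buf: Vec<u8>,
}

impl<R: Read> LineReader<R> {
    fn new(inner: R) -> Self {
        Self { inner: BufReader::new(inner), buf: Vec::new() }
    }

    fn next_line(&mut self) -> io::Result<Option<&[u8]>> {
        self.buf.clear();
        let n = self.inner.read_until(b'\n', &mut self.buf)?;
        if n == 0 { Ok(None) } else { Ok(Some(self.buf.as_slice())) }
    }
}

fn main() -> io::Result<()> {
    let mut reader = LineReader::new(&b"a\nbb\n"[..]);
    while let Some(line) = reader.next_line()? {
        println!("{} bytes", line.len());
    }
    Ok(())
}
```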
This is where `mmap` comes in.

In general, I think buffered IO is a nice interface to have in addition to the fancier IO strategies. Coming from other languages and domains, it's one of the tools you intuitively reach for in the beginning.
The kernel will transparently do buffered IO for you.
But it will automatically tune things for the drives you are pulling data from.
I got so many segfaults when I tried to use the mmap API through early Mojo FFI that it kept me away 😅
I didn't have any issues.
And IIRC MAX will actually mmap `.onnx` files.

It would be great if you could pass around some example code. I tried this in early Mojo (before 0.6) and would like to circle back to it now.
I use `external_call`, and I grabbed the constants by running the C preprocessor over a file.
If you’re alluding to io_uring at the end there, totally agree!
Obviously both @Mohamed Mabrouk and I fall in the buffered IO camp 🙂
We'll have to do some bake-offs as the APIs mature a bit more. I've gone fairly deep on exactly this in Rust, and everything I've done with Mojo so far indicates it'll be just as speedy.
I'm a database person, so anything less than the rated read speed of the drives is slow for me 🙂
We'll see.
It's really a matter of how you use it. There are ways to cause crippling perf issues with both styles, but my thought is that most programs are less likely to have issues with the mmap style, and it removes the question of how to tune the buffer size and lets the kernel answer that.
In my experiments, I'm running a zero-copy buffered line iterator at around 5 GB/s from an NVMe disk. I will play around with mmap to see how it works and whether I can extract a bit more performance from that strategy (at least by avoiding the syscall overhead).
You want mmap and `madvise(MADV_SEQUENTIAL)`.

I did some initial experiments on mmap vs. a buffered reader. For small files (less than 10 GB) the difference is negligible; however, for large files (50 GB) mmap is 2x slower than a buffered reader with a 4, 8, or 64 KB buffer size. This is even after using the `MADV_SEQUENTIAL` directive and varying the chunk size for reads from the mapped file from `simdwidthof[UInt8]()` up to 4096 elements. I am not sure I am reading the mmapped files in the best way, though. I made a quick plot of the reading times (file sizes: 1.5 GB, 8.5 GB, 50 GB).

I've just found out we have no way to check the "magic pointer values" POSIX uses, so I can't actually get an error code out of mmap. I'm trying to tell it to pre-fault things and to use hugepages (which I have), but I'm getting mystery errors I can't see because we have no way to get at errno.
[Feature Request] Errno Getter · Issue #3921 · modularml/mojo
I'll check back on this when I can actually use mmap properly. We may need to apply some extra flags to avoid tons of extra page faults for the mmap version.
That would be great; the cost of page faults becomes non-trivial with increasing file size.
4k pages are not good for a lot of things honestly.
1G pages would mean the 50 GB file would have 50 page faults, instead of ~10 million.
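Roughly what asking for 1 GiB pages looks like through mmap on Linux, sketched with the libc crate; this assumes the admin has reserved 1 GiB hugepages, and the flag encoding follows mmap(2):

```rust
use std::ptr;

fn main() -> std::io::Result<()> {
    let len: usize = 1 << 30; // one 1 GiB page
    let huge_1gb = 30 << libc::MAP_HUGE_SHIFT; // log2(1 GiB), encoded per mmap(2)
    let ptr = unsafe {
        libc::mmap(
            ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_HUGETLB | huge_1gb,
            -1,
            0,
        )
    };
    if ptr == libc::MAP_FAILED {
        // Without reserved hugepages this fails; errno says why.
        return Err(std::io::Error::last_os_error());
    }
    unsafe { libc::munmap(ptr, len) };
    Ok(())
}
```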
Would the hugepage config differ between macOS and Linux, or is it a POSIX thing? Also, how portable would this setup be to Windows?
MacOS does 16K and nothing else.
Windows pretends 1G doesn't exist on x86.
Linux and FreeBSD expose it all properly.
However, if you are frequently doing 50GB IOs on Windows you will run into NTFS very quickly.
The windows filesystem (NTFS) isn't really capable of keeping pace with a raid array.
ReFS is a bit better, roughly what *nix had in 2003.
Linux also supports 512 GB pages on RISC-V.
It's use-case specific; most people won't use 50 GB files on Windows. I am just cloning a cross-platform, industry-standard tool in the field to Mojo, and I want good cross-platform sequential IO as the first step. I would ideally like to support multithreading at some point, and mmap is much more ergonomic than a buffered reader on Linux, but it seems less predictable and portable on Mac and Windows.
mmap helps for MT, since you just map it all in and then go to town. It is quite a bit nicer for non-sequential IO.
We could also do both.
Have "read_all" variants for mmap with the populate flag and normal IO.
I will probably keep experimenting for a bit to see how this works. One obvious advantage is that references into the mapped file remain valid as long as the file is still mapped, which would simplify multiple threads working in different stages of a multi-step pipeline.
This is a big one, it makes lifetimes easier for the "simple IO" case.
Having a buffer that gets wiped out makes MT hard.
Unrelated to your perf experiments (which I'm following along with!): ExtraMojo v0.2.0 https://github.com/ExtraMojo/ExtraMojo
Still very rough around the edges, but it improves the buffered file reader to allocate less, moves off Tensors in favor of Spans, and adds more byte-string helper functions like split and find. Still very much a "things I need when trying to benchmark against other languages" catch-all.
mmap on osx isn't worth it in my experience, so you end up supporting both ways of doing things, which means sorting out your string lifetimes at the end of the day anyways.
Of course apple is "special"...
Have you considered adding your buffered read logic to the FileHandle struct in the stdlib? Would be nice to get readline and iteration over lines added to it!
It really should be a separate struct.
Some of us want to use O_DIRECT or other options that are very particular about alignment.
That's kind of what stopped me from poking at it myself. I wasn't sure if the general direction was to make the default r/w buffered or to separate that out.
If we make the default buffered, we need to buffer to comply with the hugepage write alignment requirements for NVMe, so 1 GiB aligned.
And, right now aligned pointers don't really work properly.
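For reference, this is the kind of guarantee needed, sketched with Rust's aligned allocator; the 4096-byte alignment here is a typical O_DIRECT requirement, standing in for the stricter 1 GiB case above:

```rust
use std::alloc::{alloc, dealloc, Layout};

fn main() {
    // Ask the allocator for a 1 MiB buffer aligned to 4096 bytes.
    let layout = Layout::from_size_align(1 << 20, 4096).unwrap();
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null());
    assert_eq!(ptr as usize % 4096, 0); // alignment guaranteed by the allocator
    unsafe { dealloc(ptr, layout) };
}
```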
We need some kind of subtyping feature so that `UnsafePointer[T, alignment=2]` is a subtype of `UnsafePointer[T, alignment=1]`.

I personally think it should be added to the stdlib with an explicit method, something like `for line in FileHandle().BufferedLineIterator[buf_size]()`.
I would prefer the Rust method of writing a generic `BufferedWriter[Inner: Writable, buffer_alignment: UInt = alignof[UInt8]()]` struct we can use for all buffered IO.
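A rough Rust analogue of that shape, modeled on std's BufWriter; the names and buffer policy here are illustrative, not a proposed Mojo API (alignment is left out):

```rust
use std::io::{self, Write};

/// Buffer writes to any inner `Write` type, flushing when the buffer fills.
struct BufferedWriter<W: Write> {
    inner: W,
    buf: Vec<u8>,
    cap: usize,
}

impl<W: Write> BufferedWriter<W> {
    fn new(inner: W, cap: usize) -> Self {
        Self { inner, buf: Vec::with_capacity(cap), cap }
    }

    fn write(&mut self, data: &[u8]) -> io::Result<()> {
        if self.buf.len() + data.len() > self.cap {
            self.flush()?;
        }
        if data.len() >= self.cap {
            self.inner.write_all(data) // large writes bypass the buffer
        } else {
            self.buf.extend_from_slice(data);
            Ok(())
        }
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.write_all(&self.buf)?;
        self.buf.clear();
        self.inner.flush()
    }
}

fn main() -> io::Result<()> {
    let mut w = BufferedWriter::new(Vec::new(), 8192); // Vec<u8> implements Write
    w.write(b"hello\n")?;
    w.flush()
}
```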
I'd be very interested in doing the work to add this to the stdlib, and happy to make it conform to whatever is most palatable to the stdlib team.
Per an older conversation with @Owen Hilyard, I was under the impression that the stdlib team wanted to wait a bit before adding something like this, for some of the reasons he outlined just now, and also because of the desire to have zero-copy IO be the default.
I agree that some form of lines iterator is a pretty commonly reached-for tool in standard libraries. I also lean towards Rust's suite of Read/Write and BufferedReader/BufferedWriter implementations.
Then AFAIK this is currently blocked on having basic IO traits that allow a buffered reader/writer to wrap a generic object implementing, for example, a `Reader` trait. We will need to settle that first.

Making zero-copy IO the default means that buffered IO gets a whole lot more complicated. It's still doable, but it can be kind of a mess.
Yes, we need basic IO traits, which really should wait until async is more solidified so we don't end up with both `read` and `async_read`.
Would there be a downside to a concrete implementation in the stdlib currently that is just a `BufferedReader` created with a `FileHandle`? That could at a future point take something that implements a `Read` trait, and anything calling it would still work, because presumably `FileHandle` will implement read.
That doesn't address the zero-copy portion. If there's an implementation that you like out there that you could point me at to go and read about, I'd love to understand how that would work, especially when very large files are involved. @Owen Hilyard
I still read all of this as "please don't try to submit a 'classical' BufferedReader/BufferedWriter to the stdlib right now", which is totally fine 👍

It's very likely that an implementation that works with POSIX would be forced to allocate a buffer, since for actual zero-copy the IO subsystem hands you buffers, and we will need to do some work regarding the buffer type.