Mojo port of Andrej Karpathy's minbpe
https://github.com/dorjeduck/minbpe.mojo
Minbpe is an implementation of various tokenizers as they are commonly used in LLM applications. This Mojo port is a work in progress, and I am particularly looking into performance optimizations.
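For readers who have not seen byte-pair encoding before, here is a minimal Python sketch of the core training loop such tokenizers are built on. It is an illustration only, not the code of this Mojo port: the function names `get_stats`, `merge`, and `train` are chosen for the example, and ties between equally frequent pairs are broken by first occurrence, which is an assumption of this sketch.

```python
from collections import Counter


def get_stats(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, num_merges):
    """Byte-level BPE: start from raw UTF-8 bytes, then repeatedly
    merge the most frequent adjacent pair into a new token id."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        stats = get_stats(ids)
        if not stats:
            break  # fewer than two tokens left, nothing to merge
        pair = max(stats, key=stats.get)
        new_id = 256 + step  # ids 0..255 are reserved for raw bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


if __name__ == "__main__":
    ids, merges = train("aaabdaaabac", 3)
    print(ids)
    print(merges)
```

Most of the time in a loop like this goes into counting pairs and rewriting the id list, so those are the natural places to look when comparing implementations for speed.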
One interesting aspect of this project for me is that there is a Rust port of minbpe, and in some aspects of tokenization this Mojo port is not yet as performant as the Rust implementation (which seems to be done very well). This is not so much a competition with Rust for me, but it is extremely helpful to have this performance comparison to find aspects of my implementation that can be optimized ...
Any feedback is most welcome.
1 Reply
This is where I am right now. I know this graph is not very insightful; it is just to give an idea. One "issue" with the RegexTokenizer is that it uses the Python "regex" package, which works great, but a native regex engine would be faster, I guess.
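To make the regex step concrete, here is a rough Python sketch of the pre-splitting a regex-based tokenizer does before running BPE on each chunk. The pattern is a simplified stand-in written for the standard-library `re` module, and it is an assumption of this example: GPT-style split patterns use Unicode classes such as `\p{L}`, which `re` does not support, and that is why the third-party "regex" package is needed in the first place. The names `SPLIT_PATTERN` and `split_chunks` are hypothetical.

```python
import re

# Simplified approximation of a GPT-2 style split pattern (assumption,
# not the exact pattern used by minbpe or by this port).
SPLIT_PATTERN = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"   # common English contractions
    r"| ?[^\W\d_]+"            # optional space + run of letters
    r"| ?\d+"                  # optional space + run of digits
    r"| ?(?:[^\s\w]|_)+"       # optional space + run of punctuation
    r"|\s+(?!\S)"              # whitespace not followed by a non-space
    r"|\s+"                    # any remaining whitespace
)


def split_chunks(text):
    """Split text into chunks; BPE merges never cross chunk borders."""
    return SPLIT_PATTERN.findall(text)


if __name__ == "__main__":
    print(split_chunks("Hello world's 123!"))
```

Every character of the input ends up in exactly one chunk, so joining the chunks gives back the original text. A native engine would only have to support this small set of constructs (alternation, character classes, a negative lookahead) to replace the Python call.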