Mojo port of Andrej Karpathy's minbpe

https://github.com/dorjeduck/minbpe.mojo Minbpe is an implementation of various tokenizers as they are commonly used in LLM applications. This Mojo port is a work in progress, and I am particularly looking into performance optimizations. One interesting aspect of this project for me is that there is also a Rust port of minbpe, and in some aspects of tokenization this Mojo port is not yet as performant as the Rust implementation (which seems to be done very well). This is not so much a competition with Rust for me, but it is extremely helpful to have this performance comparison to find aspects of my implementation that can be optimized. Any feedback is most welcome.
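For readers unfamiliar with what is being ported, here is a minimal Python sketch of byte-level BPE training, the core idea behind minbpe. It is not code from the Mojo port or from the original repository; the names `get_stats`, `merge`, and `train` are chosen for illustration, and the tie-breaking rule (first most-frequent pair wins) is an assumption.

```python
from collections import Counter


def get_stats(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new token id
    for step in range(num_merges):
        stats = get_stats(ids)
        if not stats:
            break  # fewer than two tokens left, nothing to merge
        pair = max(stats, key=stats.get)  # most frequent adjacent pair
        new_id = 256 + step  # ids 0..255 are reserved for single bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


ids, merges = train("aaabdaaabac", 3)
```

Counting pairs and rewriting the id sequence once per merge is the hot loop of training, so it is a natural place to compare a port's performance against other implementations.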
1 Reply
Martin Dudek (OP), 7mo ago
This is where I am right now. I know this graph is not very insightful; it is just to give an idea. One "issue" with the RegexTokenizer is that it uses the Python "regex" module, which works great, but a native regex engine would be faster, I guess.
[Attached image: graph, no description]
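To illustrate the step the regex engine is responsible for, here is a rough Python sketch of the pre-tokenization split that a regex-based tokenizer performs before applying BPE merges to each chunk. It is an approximation under stated assumptions: it uses the standard-library `re` module and a simplified, ASCII-only pattern (`SIMPLE_SPLIT_PATTERN`, a hypothetical name), whereas GPT-style patterns rely on Unicode classes such as `\p{L}` that need the third-party `regex` module.

```python
import re

# Simplified, ASCII-only stand-in for a GPT-2 style split pattern (assumption).
# Real patterns use \p{L} and \p{N}, which the stdlib `re` does not support.
SIMPLE_SPLIT_PATTERN = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"   # common English contractions
    r"| ?[A-Za-z]+"            # optional space + run of letters
    r"| ?[0-9]+"               # optional space + run of digits
    r"| ?[^\sA-Za-z0-9]+"      # optional space + run of punctuation
    r"|\s+(?!\S)"              # whitespace not followed by non-whitespace
    r"|\s+"                    # any remaining whitespace
)


def split_chunks(text):
    """Split text into chunks; BPE merges never cross chunk boundaries."""
    return SIMPLE_SPLIT_PATTERN.findall(text)


chunks = split_chunks("Hello world's 123!")
```

Every input string passes through this split before any merging happens, so the speed of the regex engine directly affects encoding throughput.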