Regular Expression Engine for Mojo, any plans?
Do we have a regular expression engine implemented in Mojo? I haven't come across one yet, but it would be fantastic to have. Or is there a native solution planned for Mojo? If not, what could be our approach to developing one? I have no clue about the complexity involved in this task. Would we base it on some existing C engine or write it from scratch in Mojo? Thx
65 Replies
The only one I know of is a very simple (60 LOC) regex match implementation in Mojo by sstadick (https://github.com/sstadick/ExtraMojo/blob/main/ExtraMojo/regex/simple_re.mojo). However, I'm also very interested to know of any plans. To my understanding, Rust has a fast regex library.
Go's regex is known to be blazing fast, too. I recall reading that it beats Rust's (if barely) in some benchmarks. Last I looked into it (years ago; 5? 6? not sure), Go and Rust were the two fastest implementations in any major programming language.
I use https://pypi.org/project/regex/ via Python integration right now, which works fine, but regex seems so essential that I really hope we will have something native at some point. For performance, and for elegance :mojo:
If we're talking about absolute performance, there's also hyperscan. It absolutely smashed the benchmarks. I don't know if we can write a portable regex library which is as fast as it, but one can dream. :mojo:
A quick Google search found this
oh! I forgot about RE2. It was still unstable when I was doing that eval, years ago. Google does some amazing engineering.
When they want to do something right, they seem to be good at it
Thanks for finding these gems. But now the question remains, how can we do it right in Mojo?
we bully somebody with a bunch of spare time into implementing it for us
😂
I vote for not me lol
One thing to keep in mind when looking at those performance charts is that not all regex variants have the same expressive power. PCRE, for example, can describe any context-free language, not just regular languages (e.g., a backreference pattern like `(a+)b\1` matches a^n b a^n, which no true regular expression can). Parsing/recognizing languages of different types on the Chomsky hierarchy is well studied in CS, and there's an inherent expressiveness/speed tradeoff to consider.
Came here to also mention RE2. I've looked into porting it, but it's a pretty large library. Definitely an interesting issue.
I took a look at the docs and it's fairly complex to use, though maybe I just need to take a better look at it
While skimming the RE2 source code I found this link in a comment: https://swtch.com/~rsc/regexp/. It has some bare-bones examples in C that may provide a less intimidating starting point than the RE2 library for someone toying with the idea of implementing regexes for Mojo.
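To get a sense of scale: those articles build on a tiny backtracking matcher from Rob Pike and Brian Kernighan that fits in a few dozen lines. Here's a rough, untested sketch of what a Mojo port could look like (my own translation, not from the article, handling only literals, `.`, `*`, `^`, and `$`; being a backtracker, it has none of RE2's linear-time guarantees):

```mojo
# Rough sketch of the classic Pike/Kernighan backtracking matcher.
# Supports only literal characters, '.', '*', '^', and '$'.
# Exponential worst case: a starting point, not an RE2 competitor.

fn match_here(pattern: String, p: Int, text: String, t: Int) -> Bool:
    if p == len(pattern):
        return True  # consumed the whole pattern: success
    if p + 1 < len(pattern) and pattern[p + 1] == "*":
        # 'x*': zero or more of x, then the rest of the pattern
        return match_star(pattern[p], pattern, p + 2, text, t)
    if pattern[p] == "$" and p + 1 == len(pattern):
        return t == len(text)  # '$' anchors to the end of the text
    if t < len(text) and (pattern[p] == "." or pattern[p] == text[t]):
        return match_here(pattern, p + 1, text, t + 1)
    return False

fn match_star(c: String, pattern: String, p: Int, text: String, t: Int) -> Bool:
    # Try the rest of the pattern after consuming 0, 1, 2, ... copies of c.
    var i = t
    while True:
        if match_here(pattern, p, text, i):
            return True
        if i >= len(text) or (c != "." and c != text[i]):
            return False
        i += 1

fn re_match(pattern: String, text: String) -> Bool:
    # '^' anchors the match at the start; otherwise try every offset.
    if len(pattern) > 0 and pattern[0] == "^":
        return match_here(pattern, 1, text, 0)
    var t = 0
    while True:
        if match_here(pattern, 0, text, t):
            return True
        if t >= len(text):
            return False
        t += 1
```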
I also recommend this video: https://youtu.be/DDe-S3uef2w?si=NP6TqGm0PYFdg7zI. About halfway in, it gives a great description of RE2 (also a great vid in general)
A SIMD-accelerated multiple-substring matching algorithm called "Teddy". This is used in Rust's ripgrep. https://github.com/rust-lang/regex/blob/3de8c44f5357d5b582a80b7282480e38e8b7d50d/src/simd_accel/teddy128.rs
I finally watched that vid. Wow. That was awesome!
So, in "a former life," I worked somewhere that used a TON of regex. I recall that we had an engine that was "optimized for that time" (many years ago), but we also wrapped our regex-engine with a staging filter that would prevent certain regex formulas. For example, our system flat-out denied use of unlimited matching. I mean, you could get the rough equivalent of
.*
by doing .{0,99999999999999999}
or something. And .+
-> .{1,999...}
. But the programmer would be intentionally building that numeric limit. In addition, IIRC, we also limited the upper end of matching on unlimited character sets (.
) and also wide-ranging sets (\w
, \S
, etc); I do not recall the numeric limits. The real use case for .*
and .+
is really close to zero. Anytime you think "I should put an unlimited match here" you are really thinking "I should break the universe" and should probably approach the problem differently.
I don't remember all the safety-checks we had built into the staging filter. It was just too many years ago. But we essentially sanitized for any regex and kicked out errors and suggestions on how to get the same intended results without breaking everything.
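A first cut of that kind of filter doesn't need to be big, either. Purely as a hypothetical sketch (the names and error messages here are mine, not from any real library), rejecting `*`, `+`, and open-ended `{m,}` before the pattern ever reaches the engine could look like:

```mojo
# Hypothetical "staging filter": reject unbounded quantifiers
# ('*', '+', and open-ended '{m,}') before compiling the pattern.
fn check_bounded(pattern: String) raises:
    var escaped = False
    var i = 0
    while i < len(pattern):
        var c = pattern[i]
        if escaped:
            escaped = False  # character after '\' is a literal
        elif c == "\\":
            escaped = True
        elif c == "*" or c == "+":
            raise Error("unbounded quantifier '" + c + "'; use {m,n} with an explicit upper bound")
        elif c == "{":
            # Scan to '}'; a comma with nothing after it means open-ended {m,}.
            var saw_comma = False
            var bounded = False
            var j = i + 1
            while j < len(pattern) and pattern[j] != "}":
                if pattern[j] == ",":
                    saw_comma = True
                elif saw_comma:
                    bounded = True
                j += 1
            if saw_comma and not bounded:
                raise Error("open-ended {m,} quantifier; give an explicit upper bound")
            i = j
        i += 1
```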
---
Not editing anything above, but before hitting enter, I looked for something similar in the open-source universe. This looks pretty cool: https://github.com/tjenkinson/redos-detector
Anyway, sorry for the diversion from discussing the underlying regex engine. Just wanted to share a tip for anybody working with regex in their current life who might be at an org that is not willing to adopt something like RE2.
I start getting worried we need to wait for a future life before somebody comes along who is actually willing to implement a regex engine for Mojo 😉 It feels more and more like a pretty dry and quite complex task to dive into, to be honest ...
It's super dry, super complex, and ....... we just gotta find somebody that enjoys a lot of pain. 😂
The channel is great, by the way. I’d recommend watching their other videos
I think that having a bifurcated version of regex the way Rust does would be a good idea. One which implements most of the things most people will ever need but has guaranteed linear time evaluation, and another which is a full PCRE implementation with all the dangers that entails. Call the safe one regex and the full one PCRE since that is correct from a computer science theory perspective, so that unaware people reach for the safe and faster implementation first.
You could also possibly make a single library that just has multiple backends and uses whichever is most optimal for your pattern. Could even have an optional flag to just give you a compile time error if your pattern isn't compatible with the fast backend.
That would probably require more compile-time capabilities than Mojo currently has. If the compiler is willing to cooperate it would be great, but otherwise we are probably looking at needing Zig comptime/Rust proc macros to actually give compile-time errors. IIRC the Rust regex crate doesn't give compile-time errors, in the name of allowing you to give it a string at runtime.
Also, automatically swapping to a backend with polynomial-in-input-size performance regressions is behavior I'm not sure I want to see in a library. It should be a manual switch.
This could probably be solved by making the default mode strict regex only, with the slower modes requiring explicit opt-in. A little more verbose for "script-like" code, though. As for compiling your pattern at comptime vs. runtime, it should be as simple as `alias my_pattern = re.compile(my_regex_string)` vs `var my_pattern = re.compile(my_regex_string)`, but Mojo doesn't currently support calling functions that might raise at compile time. It seems like it should support that eventually though.
What would the actual data layout of my_pattern look like in this case?
I could see a potential path to Turing completeness in the type system via Church encoding and lambda calculus, but that amounts to a torture test of the compiler.
Yeah, that's a bit tricky. Might have to be a variant type or a class
aliases can't be recursive or co-recursive.
Not sure if that's by design or a current implementation limitation.
^ That errors out
Remember that it has to be done 100% at compile time in order to construct a state machine you can then traverse at run-time in an O(n) manner.
And the state machine will be required so that you can check for anything that is incompatible with linear-time evaluation.
That's why I said this needed compiler support or more powerful compile-time capabilities.
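To make the run-time half concrete, here's a toy sketch of what the traversal looks like once a pattern has been compiled into a transition table. The table is hand-built here (for the pattern "ab*c" over the alphabet {a, b, c}) and stands in for whatever the compile step would emit:

```mojo
# Toy O(n) DFA walk for "ab*c". 3 states x 3 symbols; -1 is the dead
# state; state 2 accepts. One pass over the input, no backtracking.
fn dfa_match(text: String) -> Bool:
    var table = List[Int](
        1, -1, -1,   # state 0: need 'a'
        -1, 1, 2,    # state 1: loop on 'b', finish on 'c'
        -1, -1, -1,  # state 2: accepting; any extra input fails
    )
    var state = 0
    for i in range(len(text)):
        var col: Int
        if text[i] == "a":
            col = 0
        elif text[i] == "b":
            col = 1
        elif text[i] == "c":
            col = 2
        else:
            return False
        state = table[state * 3 + col]
        if state < 0:
            return False
    return state == 2
```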
This isn't true, is it? Can't we just check for the presence of certain syntax?
`([a-zA-Z]+)*`
What part of that syntax can you ban?
Because that's a regex which allowed a denial-of-service attack in 2014.
And maybe have an optional type wrapping the data for each backend? Then calling match would just check for the fastest non-None?
I ended up giving up and doing this since nobody was answering my question about a JSON parser and I ran into the variant issue. This is VERY wasteful on space, and there's probably a magic type somewhere in the standard library that introduces indirection and lets you do recursive variants, but I can't find it.
Oh, now I understand. I was confused about you mentioning recursion in Variant. The idea would be that my_pattern would have type `Variant[re_state_machine, cfg_parser]`, and the issue here isn't syntax but the fact that the engine can do backtracking. You can change the semantics so this isn't quadratic.
I was thinking about the parse tree, since I'm currently writing a JSON parser while we're talking.
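For what it's worth, a minimal sketch of that Variant idea (the engine types and method names here are made up for illustration):

```mojo
from utils import Variant

@value
struct LinearEngine:
    # Placeholder: would hold the DFA tables for linear-time matching.
    fn matches(self, text: String) -> Bool:
        return True

@value
struct BacktrackEngine:
    # Placeholder: would hold the full backtracking (PCRE-style) program.
    fn matches(self, text: String) -> Bool:
        return True

alias Pattern = Variant[LinearEngine, BacktrackEngine]

fn do_match(p: Pattern, text: String) -> Bool:
    # Dispatch to whichever backend the pattern compiled to,
    # preferring the linear-time engine when it was possible.
    if p.isa[LinearEngine]():
        return p[LinearEngine].matches(text)
    return p[BacktrackEngine].matches(text)
```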
I don't see a good way to build the parse tree at compile time.
Most of the approaches I can think of would probably run into issues around co-recursive type definitions.
That makes more sense. Yeah, not sure how that would work. You could, I guess, do something kind of cursed where you represent the tree using a `List[Node]` and just have each node hold indices, but that's pretty gross.
Except it's not a List, because I think you can't have a list with members at compile time; it's a Cons list, which makes it much worse.
I just meant an ordinary stdlib List, which works fine at compile time
If that does work then that's nice.
And it might be doable, if very annoying.
Just checked, and this does all work fine
Could make things easier to use if the `List[BinaryTreeNode]` was ARCed and every node had a reference to it
Ok, a parse tree is doable. Is there a recursion depth limit at compile time?
Like, can you get a stack overflow? I would assume so.
I've definitely hit a compile-time memory limit, I'm just not sure of the value.
I'm pretty sure you can get `llvm out of memory`, an MLIR error about memory stuff, and it can just stall forever.
That's probably a real OOM.
Which does give us some space to work with.
Ok, time to validate that with some large tensors.
Ok, the parser (compiler and LSP) crashed before LLVM did, so I guess I'm filing a bug report tomorrow.
I figured out where you might have been going wrong and this also works fine. Your recursive structs just need pointers or references so they have some known size.
Should probably add as an extra heads-up that this isn't a safe abstraction. It really shouldn't be copyable, and I also forgot to add `__del__`. Ideally it would just use references, but those are still a bit rough to use.
I'll probably write my own `HeapRef[T]` type that does do safe management.
Is the pointer actually necessary here?
Yup. A struct has all its fields stored contiguously in memory and the total size is the sum of the size of those fields. This means you can't have a struct with a field that has the same type as the struct itself because you get infinite recursion trying to determine the size. You get around this by using pointers since a pointer has fixed size regardless of what's being pointed to.
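A minimal sketch of the workaround, assuming the current stdlib `UnsafePointer` (allocation and cleanup are omitted, so this is exactly the unsafe abstraction discussed above):

```mojo
from memory import UnsafePointer

struct TreeNode:
    var value: Int
    # A field of type TreeNode would make the size equation unsolvable;
    # a pointer to TreeNode is always one word, so this compiles.
    var left: UnsafePointer[TreeNode]
    var right: UnsafePointer[TreeNode]

    fn __init__(inout self, value: Int):
        self.value = value
        self.left = UnsafePointer[TreeNode]()   # null until a child is attached
        self.right = UnsafePointer[TreeNode]()
```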
This feels like a bug, but it isn't
No, this is how every single language that doesn't do heap indirection for you works. Languages that let you do it are usually automatically putting things behind pointers.
I know 😄
I figured it out pretty fast
Well, it is kind of a bug atm, in the sense that if you try it, your syntax highlighter will crash, and trying to compile gives you a seemingly unrelated error message
Is there an issue on GH?
not sure
You might want to raise one 🙂
@Darkmatter FYI I'm writing a parser in Mojo and it works elegantly for me.
We were discussing doing the parsing at compile time.
Never heard of it. How is that different from regular parsers?
The result of doing the parsing, and as much other work as you want to do, is embedded into the resulting binary instead of being done at runtime. For instance, you could parse a string that specifies a regex at compile time and convert it into a state machine, so that the resulting binary only has a "match" function which takes the state machine. This will be much faster than compiling the regex at runtime. For something like a serverless web request handler, this is useful because it lets you convert all of your routes into a big lookup table with perfect hashing.
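A tiny illustration of the mechanism, using a cheap bit-counting function as a stand-in for the expensive "parse the pattern" work (no real regex involved):

```mojo
fn bits_set(x: Int) -> Int:
    # Stand-in for expensive compile-the-pattern work: count set bits.
    var n = x
    var count = 0
    while n != 0:
        count += n & 1
        n = n >> 1
    return count

fn main():
    # Evaluated by the compiler; only the constant 16 lands in the binary.
    alias at_compile_time = bits_set(0xFFFF)
    # The same computation, executed every time the program runs.
    var at_run_time = bits_set(0xFFFF)
    print(at_compile_time, at_run_time)  # 16 16
```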
I could imagine significant benefits from this in places like Pygments