M
Modular•9mo ago
Martin Dudek

Regular Expression Engine for Mojo, any plans?

Do we have a regular expression engine implemented in Mojo?I haven't come across one yet, but it would be fantastic to have. Or is there a native solution planned for Mojo? If not, what could be our approach to developing one? I have no clue about the complexity involved in this task. Would we base it on some existing C engine or write it from scratch in Mojo. Thx
70 Replies
MM
MM•9mo ago
The only one I know of is a very simple (60 loc) regex match implementaion in Mojo by sstadic (https://github.com/sstadick/ExtraMojo/blob/main/ExtraMojo/regex/simple_re.mojo). However I'm also very interested to know of any plans. To my understanding Rust has a fast regex library.
ModularBot
ModularBot•9mo ago
Congrats @MM, you just advanced to level 3!
bunny
bunny•9mo ago
Go's regex is known to be blazing fast, too. I recall reading that it beats Rust's (if barely) in some benchmarks. Last I looked into it (years ago; 5? 6? not sure), Go and Rust were the two fastest implementations in any major programming language.
Martin Dudek
Martin DudekOP•9mo ago
I use https://pypi.org/project/regex/ via Python integration right now which works fine but regex seems so essential that i really hope we will have something native at one point. for performance, and for elegance :mojo:
PyPI
regex
Alternative regular expression module, to replace re.
Heyitsmeguys
Heyitsmeguys•9mo ago
If we're talking about absolute performance, there's also hyperscan. It absolutely smashed the benchmarks. I don't know if we can write a portable regex library which is as fast as it, but one can dream. :mojo:
Moosems / Three chickens
A quick google search found this
No description
bunny
bunny•9mo ago
oh! I forgot about RE2. It was still unstable when I was doing that eval, years ago. Google does some amazing engineering.
Moosems / Three chickens
When they want to do something right they seem to be good at it
Martin Dudek
Martin DudekOP•9mo ago
Thanks for finding these gems. But now the question remains, how can we do it right in Mojo?
bunny
bunny•9mo ago
we bully somebody with a bunch of spare time into implementing it for us 😂
Moosems / Three chickens
I vote for not me lol
Ryulord
Ryulord•9mo ago
One thing to keep in mind when looking at those performance charts is that not all regex variants have the same expressive power. PCRE for example can describe any context-free language, not just regular languages. parsing/recognizing languages of different types on the chomsky hierarchy is well studied is CS and there's an inherent expressiveness/speed tradeoff to consider
benny
benny•9mo ago
came here to also mention re2, i’ve looked into porting it but it’s a pretty large library, definitely an interesting issue
Moosems / Three chickens
I took a look at the docs and it’s fairly complex to use though maybe I just need to take a better look at it
MM
MM•9mo ago
While glimpsing through the re2 source code I found this link in a comment https://swtch.com/~rsc/regexp/. It has some bare bones examples in C that may provide a less intimidating starting point than the re2 library for someone toying with the idea of implementing regexes for Mojo.
benny
benny•9mo ago
MM
MM•9mo ago
A simd accelerated multiple substring matching algorithm called "Teddy". This is used in Rust's ripgrep. https://github.com/rust-lang/regex/blob/3de8c44f5357d5b582a80b7282480e38e8b7d50d/src/simd_accel/teddy128.rs
GitHub
regex/src/simd_accel/teddy128.rs at 3de8c44f5357d5b582a80b7282480e3...
An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs. - rust-lang/regex
bunny
bunny•9mo ago
I finally watched that vid. Wow. That was awesome! So, in "a former life," I worked somewhere that used a TON of regex. I recall that we had an engine that was "optimized for that time" (many years ago), but we also wrapped our regex-engine with a staging filter that would prevent certain regex formulas. For example, our system flat-out denied use of unlimited matching. I mean, you could get the rough equivalent of .* by doing .{0,99999999999999999} or something. And .+ -> .{1,999...}. But the programmer would be intentionally building that numeric limit. In addition, IIRC, we also limited the upper end of matching on unlimited character sets (.) and also wide-ranging sets (\w, \S, etc); I do not recall the numeric limits. The real use case for .* and .+ is really close to zero. Anytime you think "I should put an unlimited match here" you are really thinking "I should break the universe" and should probably approach the problem differently. I don't remember all the safety-checks we had built into the staging filter. It was just too many years ago. But we essentially sanitized for any regex and kicked out errors and suggestions on how to get the same intended results without breaking everything. --- Not editing anything above, but before hitting enter, I looked for something similar in open-source universe. This looks pretty cool: https://github.com/tjenkinson/redos-detector Anyway, sorry for the diversion from discussing the underlying regex engine. Just wanted to share a tip for anybody working with regex in their current life, who might be at an org that is not willing to adopt something like RE2.
Martin Dudek
Martin DudekOP•9mo ago
i start getting worried we need to wait for a future life before somebody comes along who is actually willing to implement a regex engine for Mojo 😉 it feels more and more like a pretty dry and quite complex task to dive into to be honest ...
bunny
bunny•9mo ago
It's super dry, super complex, and ....... we just gotta find somebody that enjoys a lot of pain. 😂
Moosems / Three chickens
The channel is great, by the way. I’d recommend watching their other videos
Darkmatter
Darkmatter•9mo ago
I think that having a bifurcated version of regex the way Rust does would be a good idea. One which implements most of the things most people will ever need but has guaranteed linear time evaluation, and another which is a full PCRE implementation with all the dangers that entails. Call the safe one regex and the full one PCRE since that is correct from a computer science theory perspective, so that unaware people reach for the safe and faster implementation first.
Ryulord
Ryulord•9mo ago
You could also possibly make a single library that just has multiple backends and uses whichever is most optimal for your pattern. Could even have an optional flag to just give you a compile time error if your pattern isn't compatible with the fast backend.
Darkmatter
Darkmatter•9mo ago
That would probably require more compile time capabilities than Mojo currently has. If the compiler is willing to cooperate it would be great, but otherwise we are probably looking at needing Zig comptime/Rust proc macros to actually give compile time errors. Iirc the Rust regex crate doesn't give compile time errors in the name of allowing you to give it a string at runtime.
ModularBot
ModularBot•9mo ago
Congrats @Darkmatter, you just advanced to level 1!
Darkmatter
Darkmatter•9mo ago
Also, automatically swapping to a backend with polynomial-in-input-size performance regressions is behavior I'm not sure I want to see in a library. It should be a manual switch.
Ryulord
Ryulord•9mo ago
This could probably be solved by making the default mode be strict regex only and slower modes require explicit opt in. A little more verbose for "script-like" code though. As for compiling your pattern at comptime vs. runtime it should be as simple as alias my_pattern = re.compile(my_regex_string) vs var my_pattern = re.compile(my_regex_string) but Mojo doesn't currently support calling functions that might raise at compile time. It seems like it should support that eventually though.
Darkmatter
Darkmatter•9mo ago
What would the actual data layout of my_pattern look like in this case? I could see a potential path to turing completeness in the type system via Church encoding and lambda calculus, but that amounts to a torture test of the compiler.
Ryulord
Ryulord•9mo ago
Yeah, that's a bit tricky. Might have to be a variant type or a class
Darkmatter
Darkmatter•9mo ago
aliases can't be recursive or co-recursive. Not sure if that's by design or a current implementation limitation.
alias my_pattern = Variant[Int16, my_pattern]
alias my_pattern = Variant[Int16, my_pattern]
^ That errors out Remember that it has to be done 100% at compile time in order to construct a state machine you can then traverse at run-time in an O(n) manner. And the state machine will be required so that you can check for anything that is incompatible with linear-time evaluation. That's why I said this needed compiler support or more powerful compile-time capabilities.
Ryulord
Ryulord•9mo ago
This isn't true is it? Can't we just check for presence of certain syntax?
Darkmatter
Darkmatter•9mo ago
([a-zA-Z]+)* What part of that syntax can you ban? Because that's a Regex which allowed a Denial of Service attack in 2014.
Darkmatter
Darkmatter•9mo ago
Regular expression Denial of Service - ReDoS on the main website for The OWASP Foundation. OWASP is a nonprofit foundation that works to improve the security of software.
Ryulord
Ryulord•9mo ago
And maybe have an optional type wrapping the data for each backend? then calling match would just check for the fastest non-None?
Darkmatter
Darkmatter•9mo ago
I ended up giving up and doing this since nobody was answering my question about a JSON parser and I ran into the variant issue. This is VERY wasteful on space, and there's probably a magic type somewhere in the standard library that introduces indirection and lets you do recursive variants, but I can't find it.
@value
struct JsonType:
var value: UInt8
alias null = UInt8(0)
alias bool = UInt8(1)
alias string = UInt8(2)
alias number = UInt8(3)
alias list = UInt8(4)
alias object = UInt8(5)
alias integer = UInt8(6)

@value
struct JsonValue:
var type: JsonType
var bool: Bool
var integer: Int64
var number: Float64
var string: String
var list: List[Self]
var object: Dict[String, Self]
@value
struct JsonType:
var value: UInt8
alias null = UInt8(0)
alias bool = UInt8(1)
alias string = UInt8(2)
alias number = UInt8(3)
alias list = UInt8(4)
alias object = UInt8(5)
alias integer = UInt8(6)

@value
struct JsonValue:
var type: JsonType
var bool: Bool
var integer: Int64
var number: Float64
var string: String
var list: List[Self]
var object: Dict[String, Self]
Ryulord
Ryulord•9mo ago
Oh now I understand. I was confused about you mentioning recursion in Variant. The idea would be that my_pattern would have type Variant[re_state_machine, cfg_parser] and the issue here isn't syntax but the fact that the engine can do backtracking. You can change the semantics so this isn't quadratic
Darkmatter
Darkmatter•9mo ago
I was thinking about the parse tree since I'm currently writing a JSON parser while we're talking.
ModularBot
ModularBot•9mo ago
Congrats @Darkmatter, you just advanced to level 2!
Darkmatter
Darkmatter•9mo ago
I don't see a good way to build the parse tree at compile time. Most of the approaches I can think of would probably run into issues around co-recursive type definitions.
Ryulord
Ryulord•9mo ago
That makes more sense. Yeah, not sure how that would work. You could I guess do something kind of cursed where you represent the tree using a List[Node] and just have each node hold incides but that's pretty gross
Darkmatter
Darkmatter•9mo ago
Except it's not a List because I think you can't have a list with members at compile time, it's a Cons list, which makes it much worse.
Cons[Node, Cons[Node, Cons[Node, None]]]
Cons[Node, Cons[Node, Cons[Node, None]]]
Ryulord
Ryulord•9mo ago
I just meant an ordinary stdlib List, which works fine at compile time
Darkmatter
Darkmatter•9mo ago
If that does work then that's nice. And it might be doable, if very annoying.
Ryulord
Ryulord•9mo ago
just checked and this does all work fine
@value
struct BinaryTreeNode:
var l: Optional[Int]
var r: Optional[Int]

fn __str__(self) -> String:
return "This is a node"

fn build_binary_tree() -> List[BinaryTreeNode]:
var tree = List[BinaryTreeNode]()
tree.append(BinaryTreeNode(None, None))
tree.append(BinaryTreeNode(None, None))
tree.append(BinaryTreeNode(0, 1))
return tree

alias my_tree = build_binary_tree()

fn main():
var root_idx = 2
var l_idx = my_tree[root_idx].l
if l_idx:
var l = my_tree[l_idx.or_else(0)]
print(l)
@value
struct BinaryTreeNode:
var l: Optional[Int]
var r: Optional[Int]

fn __str__(self) -> String:
return "This is a node"

fn build_binary_tree() -> List[BinaryTreeNode]:
var tree = List[BinaryTreeNode]()
tree.append(BinaryTreeNode(None, None))
tree.append(BinaryTreeNode(None, None))
tree.append(BinaryTreeNode(0, 1))
return tree

alias my_tree = build_binary_tree()

fn main():
var root_idx = 2
var l_idx = my_tree[root_idx].l
if l_idx:
var l = my_tree[l_idx.or_else(0)]
print(l)
could make things easier to use if the List[BinaryTreeNode] was ARCed and every node had a reference to it
Darkmatter
Darkmatter•9mo ago
Ok, a parse tree is doable. Is there a recursion depth limit at compile time?
Ryulord
Ryulord•9mo ago
like can you get a stack overflow? I would assume so
benny
benny•9mo ago
i’ve definitely gotten to a compile time memory limit i’m just not sure the value i’m pretty sure you can get llvm out of memory, an MLIR error about memory stuff, and it can just stall forever
Darkmatter
Darkmatter•9mo ago
That's probably a real OOM. Which does give us some space to work with. Ok, time to validate that with some large tensors. Ok, the parser (compiler and LSP) crashed before LLVM did, so I guess I'm filing a bug report tomorrow.
Ryulord
Ryulord•9mo ago
I figured out where you might have been going wrong and this also works fine. Your recursive structs just need pointers or references so they have some known size.
@value
struct BinaryTreeNode:
var data: Int
var _l: UnsafePointer[Optional[BinaryTreeNode]]
var _r: UnsafePointer[Optional[BinaryTreeNode]]

fn __init__(inout self, data: Int):
self._l = self._l.alloc(1)
self._r = self._r.alloc(1)
self._l[] = None
self._r[] = None
self.data = data

fn getl(self) -> Optional[BinaryTreeNode]:
return self._l[]

fn getr(self) -> Optional[BinaryTreeNode]:
return self._r[]

fn setl(self, new_value: Optional[BinaryTreeNode]):
self._l[] = new_value

fn setr(self, new_value: Optional[BinaryTreeNode]):
self._r[] = new_value

fn __str__(self) -> String:
return 'BinaryTreeNode(' + str(self.data) + ')'

fn make_tree() -> BinaryTreeNode:
var l = BinaryTreeNode(1)
var r = BinaryTreeNode(2)
var root = BinaryTreeNode(3)
root.setl(l)
root.setr(r)
return root

alias tree = make_tree()

fn main():
print(tree) # BinaryTreeNode(3)
print(tree.getl().or_else(0) if tree.getl() else String('None')) # BinaryTreeNode(1)
print(tree.getr().or_else(0) if tree.getr() else String('None')) # BinaryTreeNode(2)
@value
struct BinaryTreeNode:
var data: Int
var _l: UnsafePointer[Optional[BinaryTreeNode]]
var _r: UnsafePointer[Optional[BinaryTreeNode]]

fn __init__(inout self, data: Int):
self._l = self._l.alloc(1)
self._r = self._r.alloc(1)
self._l[] = None
self._r[] = None
self.data = data

fn getl(self) -> Optional[BinaryTreeNode]:
return self._l[]

fn getr(self) -> Optional[BinaryTreeNode]:
return self._r[]

fn setl(self, new_value: Optional[BinaryTreeNode]):
self._l[] = new_value

fn setr(self, new_value: Optional[BinaryTreeNode]):
self._r[] = new_value

fn __str__(self) -> String:
return 'BinaryTreeNode(' + str(self.data) + ')'

fn make_tree() -> BinaryTreeNode:
var l = BinaryTreeNode(1)
var r = BinaryTreeNode(2)
var root = BinaryTreeNode(3)
root.setl(l)
root.setr(r)
return root

alias tree = make_tree()

fn main():
print(tree) # BinaryTreeNode(3)
print(tree.getl().or_else(0) if tree.getl() else String('None')) # BinaryTreeNode(1)
print(tree.getr().or_else(0) if tree.getr() else String('None')) # BinaryTreeNode(2)
should probably add as an extra heads up that this isn't a safe abstraction. It really shouldn't be copyable and also I forgot to add __del__. Ideally it would just use references but those are still a bit rough to use.
Darkmatter
Darkmatter•9mo ago
I'll probably write my own HeapRef[T] type that does do safe management.
Moosems / Three chickens
Is the pointer actually necessary here?
Ryulord
Ryulord•9mo ago
Yup. A struct has all its fields stored contiguously in memory and the total size is the sum of the size of those fields. This means you can't have a struct with a field that has the same type as the struct itself because you get infinite recursion trying to determine the size. You get around this by using pointers since a pointer has fixed size regardless of what's being pointed to.
Moosems / Three chickens
This feels like a bug but it isnt
Darkmatter
Darkmatter•9mo ago
No, this is how every single language that doesn't do heap indirection for you works. Languages that let you do it are usually automatically putting things behind pointers.
Moosems / Three chickens
I know 😄 I figured it out pretty fast
Ryulord
Ryulord•9mo ago
Well it is kinda a bug atm in the sense that if you try your syntax highlighter will crash and trying to compile gives you a seemingly unrelated error message
Moosems / Three chickens
Is there an issue on GH?
Ryulord
Ryulord•9mo ago
not sure
Moosems / Three chickens
You might want to raise one 🙂
Ethan
Ethan•8mo ago
@Darkmatter FYI I'm writing a parser in mojo and it works elegently for me.
No description
Darkmatter
Darkmatter•8mo ago
We were discussing doing the parsing at compile time.
Ethan
Ethan•8mo ago
Never heard of it. How it that different from regular parsers?
Darkmatter
Darkmatter•8mo ago
The result of doing the parsing and as much other stuff as you want to do is embedded into the resulting binary, instead of being done at runtime. For instance, you could parse a string the specifies a regex at compile time and convert it into a state machine, so that the resulting binary only has a "match" function which takes the state machine. This will be much faster than compiling the regex at runtime. For something like a serverless web request handler, this is useful because it lets you convert all of your routes into a big lookup table with perfect hashing.
Moosems / Three chickens
I could imagine significant benefits from this in places like pygments
Heyitsmeguys
Heyitsmeguys•5d ago
Or we could make a regex engine similar to this one: https://dl.acm.org/doi/abs/10.1145/3704837 It guarantees linear time complexity in a lot more cases
ACM Digital Library
Heyitsmeguys
Heyitsmeguys•5d ago
Also, making a regex dialect (similar to in this paper) seems like a good idea rather than making yet another ad-hoc compiler in a compiler stack that is supposed to unify compiler development (MLIR).
ACM Digital Library
ModularBot
ModularBot•5d ago
Congrats @Heyitsmeguys, you just advanced to level 9!
Heyitsmeguys
Heyitsmeguys•5d ago
It was already mentioned that this is in development: https://youtu.be/Lpr_GcX5uKE We'll need a lot of learning materials though, for people who aren't compiler experts to be able to use this
LLVM
YouTube
2024 LLVM Dev Mtg - Unlocking High Performance in Mojo through User...
2024 LLVM Developers' Meeting https://llvm.org/devmtg/2024-10/ ------ Unlocking High Performance in Mojo through User-Defined Dialects Speaker: Mathieu Fehr, Jeff Niu ------ Slides: https://llvm.org/devmtg/2024-10/slides/quicktalks/Fehr-Niu-UnlockingHighPerformanceInMojo.pdf ----- Traditionally, a clear separation exists between language librari...
Darkmatter
Darkmatter•5d ago
Linear complexity is probably a good idea. We might even be able to reuse most of the linear engine for the one that can have exponential behavior, but the exponential engine needs to be separated out. Regex dialect sounds like a decent idea, but we also want runtime evaluation without needing to spin up a compiler. If Modular exposed the JIT in a "give me a function pointer" way, that would work. I think the learning materials are compiler textbooks. This is "just" another way to interface with a compiler. If we write it as a library, then it's much easier to use and people don't need to learn that.

Did you find this page helpful?