Regular Expression Engine for Mojo, any plans?
Do we have a regular expression engine implemented in Mojo? I haven't come across one yet, but it would be fantastic to have. Or is there a native solution planned for Mojo? If not, what could be our approach to developing one? I have no clue about the complexity involved in this task. Would we base it on some existing C engine or write it from scratch in Mojo? Thx
65 Replies
The only one I know of is a very simple (60 LOC) regex match implementation in Mojo by sstadick (https://github.com/sstadick/ExtraMojo/blob/main/ExtraMojo/regex/simple_re.mojo). However, I'm also very interested to know of any plans. To my understanding, Rust has a fast regex library.
Go's regex is known to be blazing fast, too. I recall reading that it beats Rust's (if barely) in some benchmarks. Last I looked into it (years ago; 5? 6? not sure), Go and Rust were the two fastest implementations in any major programming language.
I use https://pypi.org/project/regex/ via Python integration right now, which works fine, but regex seems so essential that I really hope we will have something native at some point. For performance, and for elegance :mojo:
If we're talking about absolute performance, there's also hyperscan. It absolutely smashed the benchmarks. I don't know if we can write a portable regex library which is as fast as it, but one can dream. :mojo:
A quick Google search found this
oh! I forgot about RE2. It was still unstable when I was doing that eval, years ago. Google does some amazing engineering.
When they want to do something right, they seem to be good at it
Thanks for finding these gems. But now the question remains, how can we do it right in Mojo?
we bully somebody with a bunch of spare time into implementing it for us
😂
I vote for not me lol
One thing to keep in mind when looking at those performance charts is that not all regex variants have the same expressive power. PCRE, for example, can describe any context-free language, not just regular languages (e.g., a backreference pattern like `(a+)b\1` matches a^n b a^n, which no true regular expression can). Parsing/recognizing languages of different types on the Chomsky hierarchy is well studied in CS, and there's an inherent expressiveness/speed tradeoff to consider.
Came here to also mention RE2. I've looked into porting it, but it's a pretty large library. Definitely an interesting issue.
I took a look at the docs and it's fairly complex to use, though maybe I just need to take a better look at it
While skimming the RE2 source code I found this link in a comment: https://swtch.com/~rsc/regexp/. It has some bare-bones examples in C that may provide a less intimidating starting point than the RE2 library for someone toying with the idea of implementing regexes for Mojo.
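To get a sense of scale: those articles build on a tiny backtracking matcher from Rob Pike and Brian Kernighan that fits in a few dozen lines. Here's a rough, untested sketch of what a Mojo port could look like (my own translation, not from the article, handling only literals, `.`, `*`, `^`, and `$`; being a backtracker, it has none of RE2's linear-time guarantees):

```mojo
# Rough sketch of the classic Pike/Kernighan backtracking matcher.
# Supports only literal characters, '.', '*', '^', and '$'.
# Exponential worst case: a starting point, not an RE2 competitor.

fn match_here(pattern: String, p: Int, text: String, t: Int) -> Bool:
    if p == len(pattern):
        return True  # consumed the whole pattern: success
    if p + 1 < len(pattern) and pattern[p + 1] == "*":
        # 'x*': zero or more of x, then the rest of the pattern
        return match_star(pattern[p], pattern, p + 2, text, t)
    if pattern[p] == "$" and p + 1 == len(pattern):
        return t == len(text)  # '$' anchors to the end of the text
    if t < len(text) and (pattern[p] == "." or pattern[p] == text[t]):
        return match_here(pattern, p + 1, text, t + 1)
    return False

fn match_star(c: String, pattern: String, p: Int, text: String, t: Int) -> Bool:
    # Try the rest of the pattern after consuming 0, 1, 2, ... copies of c.
    var i = t
    while True:
        if match_here(pattern, p, text, i):
            return True
        if i >= len(text) or (c != "." and c != text[i]):
            return False
        i += 1

fn re_match(pattern: String, text: String) -> Bool:
    # '^' anchors the match at the start; otherwise try every offset.
    if len(pattern) > 0 and pattern[0] == "^":
        return match_here(pattern, 1, text, 0)
    var t = 0
    while True:
        if match_here(pattern, 0, text, t):
            return True
        if t >= len(text):
            return False
        t += 1
```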
I also recommend this video: https://youtu.be/DDe-S3uef2w?si=NP6TqGm0PYFdg7zI. About halfway in, it gives a great description of RE2 (also a great vid in general)
A SIMD-accelerated multiple-substring matching algorithm called "Teddy". This is used in Rust's ripgrep. https://github.com/rust-lang/regex/blob/3de8c44f5357d5b582a80b7282480e38e8b7d50d/src/simd_accel/teddy128.rs
I finally watched that vid. Wow. That was awesome!
So, in "a former life," I worked somewhere that used a TON of regex. I recall that we had an engine that was "optimized for that time" (many years ago), but we also wrapped our regex-engine with a staging filter that would prevent certain regex formulas. For example, our system flat-out denied use of unlimited matching. I mean, you could get the rough equivalent of
.*
by doing .{0,99999999999999999}
or something. And .+
-> .{1,999...}
. But the programmer would be intentionally building that numeric limit. In addition, IIRC, we also limited the upper end of matching on unlimited character sets (.
) and also wide-ranging sets (\w
, \S
, etc); I do not recall the numeric limits. The real use case for .*
and .+
is really close to zero. Anytime you think "I should put an unlimited match here" you are really thinking "I should break the universe" and should probably approach the problem differently.
I don't remember all the safety-checks we had built into the staging filter. It was just too many years ago. But we essentially sanitized for any regex and kicked out errors and suggestions on how to get the same intended results without breaking everything.
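A first cut of that kind of filter doesn't need to be big, either. Purely as a hypothetical sketch (the names and error messages here are mine, not from any real library), rejecting `*`, `+`, and open-ended `{m,}` before the pattern ever reaches the engine could look like:

```mojo
# Hypothetical "staging filter": reject unbounded quantifiers
# ('*', '+', and open-ended '{m,}') before compiling the pattern.
fn check_bounded(pattern: String) raises:
    var escaped = False
    var i = 0
    while i < len(pattern):
        var c = pattern[i]
        if escaped:
            escaped = False  # character after '\' is a literal
        elif c == "\\":
            escaped = True
        elif c == "*" or c == "+":
            raise Error("unbounded quantifier '" + c + "'; use {m,n} with an explicit upper bound")
        elif c == "{":
            # Scan to '}'; a comma with nothing after it means open-ended {m,}.
            var saw_comma = False
            var bounded = False
            var j = i + 1
            while j < len(pattern) and pattern[j] != "}":
                if pattern[j] == ",":
                    saw_comma = True
                elif saw_comma:
                    bounded = True
                j += 1
            if saw_comma and not bounded:
                raise Error("open-ended {m,} quantifier; give an explicit upper bound")
            i = j
        i += 1
```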
---
Not editing anything above, but before hitting enter, I looked for something similar in the open-source universe. This looks pretty cool: https://github.com/tjenkinson/redos-detector
Anyway, sorry for the diversion from discussing the underlying regex engine. Just wanted to share a tip for anybody working with regex in their current life who might be at an org that is not willing to adopt something like RE2.
I start getting worried we need to wait for a future life before somebody comes along who is actually willing to implement a regex engine for Mojo 😉 It feels more and more like a pretty dry and quite complex task to dive into, to be honest ...
It's super dry, super complex, and ....... we just gotta find somebody that enjoys a lot of pain. 😂
The channel is great, by the way. I’d recommend watching their other videos
I think that having a bifurcated version of regex the way Rust does would be a good idea. One which implements most of the things most people will ever need but has guaranteed linear time evaluation, and another which is a full PCRE implementation with all the dangers that entails. Call the safe one regex and the full one PCRE since that is correct from a computer science theory perspective, so that unaware people reach for the safe and faster implementation first.
You could also possibly make a single library that just has multiple backends and uses whichever is most optimal for your pattern. Could even have an optional flag to just give you a compile time error if your pattern isn't compatible with the fast backend.
That would probably require more compile-time capabilities than Mojo currently has. If the compiler is willing to cooperate it would be great, but otherwise we are probably looking at needing Zig comptime/Rust proc macros to actually give compile-time errors. IIRC the Rust regex crate doesn't give compile-time errors, in the name of allowing you to give it a string at runtime.
Also, automatically swapping to a backend with polynomial-in-input-size performance regressions is behavior I'm not sure I want to see in a library. It should be a manual switch.
This could probably be solved by making the default mode strict regex only, with the slower modes requiring explicit opt-in. A little more verbose for "script-like" code, though. As for compiling your pattern at comptime vs. runtime, it should be as simple as `alias my_pattern = re.compile(my_regex_string)` vs `var my_pattern = re.compile(my_regex_string)`, but Mojo doesn't currently support calling functions that might raise at compile time. It seems like it should support that eventually though.
What would the actual data layout of my_pattern look like in this case?
I could see a potential path to Turing completeness in the type system via Church encoding and lambda calculus, but that amounts to a torture test of the compiler.
Yeah, that's a bit tricky. Might have to be a variant type or a class
aliases can't be recursive or co-recursive.
Not sure if that's by design or a current implementation limitation.
^ That errors out
Remember that it has to be done 100% at compile time in order to construct a state machine you can then traverse at run-time in an O(n) manner.
And the state machine will be required so that you can check for anything that is incompatible with linear-time evaluation.
That's why I said this needed compiler support or more powerful compile-time capabilities.
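To make the run-time half concrete, here's a toy sketch of what the traversal looks like once a pattern has been compiled into a transition table. The table is hand-built here (for the pattern "ab*c" over the alphabet {a, b, c}) and stands in for whatever the compile step would emit:

```mojo
# Toy O(n) DFA walk for "ab*c". 3 states x 3 symbols; -1 is the dead
# state; state 2 accepts. One pass over the input, no backtracking.
fn dfa_match(text: String) -> Bool:
    var table = List[Int](
        1, -1, -1,   # state 0: need 'a'
        -1, 1, 2,    # state 1: loop on 'b', finish on 'c'
        -1, -1, -1,  # state 2: accepting; any extra input fails
    )
    var state = 0
    for i in range(len(text)):
        var col: Int
        if text[i] == "a":
            col = 0
        elif text[i] == "b":
            col = 1
        elif text[i] == "c":
            col = 2
        else:
            return False
        state = table[state * 3 + col]
        if state < 0:
            return False
    return state == 2
```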
This isn't true, is it? Can't we just check for the presence of certain syntax?
`([a-zA-Z]+)*`
What part of that syntax can you ban?
Because that's a regex which allowed a denial-of-service attack in 2014.
And maybe have an optional type wrapping the data for each backend? Then calling match would just check for the fastest non-None?
I ended up giving up and doing this since nobody was answering my question about a JSON parser and I ran into the variant issue. This is VERY wasteful on space, and there's probably a magic type somewhere in the standard library that introduces indirection and lets you do recursive variants, but I can't find it.
Oh, now I understand. I was confused about you mentioning recursion in Variant. The idea would be that my_pattern would have type `Variant[re_state_machine, cfg_parser]`, and the issue here isn't syntax but the fact that the engine can do backtracking. You can change the semantics so this isn't quadratic.
I was thinking about the parse tree, since I'm currently writing a JSON parser while we're talking.
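For what it's worth, a minimal sketch of that Variant idea (the engine types and method names here are made up for illustration):

```mojo
from utils import Variant

@value
struct LinearEngine:
    # Placeholder: would hold the DFA tables for linear-time matching.
    fn matches(self, text: String) -> Bool:
        return True

@value
struct BacktrackEngine:
    # Placeholder: would hold the full backtracking (PCRE-style) program.
    fn matches(self, text: String) -> Bool:
        return True

alias Pattern = Variant[LinearEngine, BacktrackEngine]

fn do_match(p: Pattern, text: String) -> Bool:
    # Dispatch to whichever backend the pattern compiled to,
    # preferring the linear-time engine when it was possible.
    if p.isa[LinearEngine]():
        return p[LinearEngine].matches(text)
    return p[BacktrackEngine].matches(text)
```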
I don't see a good way to build the parse tree at compile time.
Most of the approaches I can think of would probably run into issues around co-recursive type definitions.
That makes more sense. Yeah, not sure how that would work. You could, I guess, do something kind of cursed where you represent the tree using a `List[Node]` and just have each node hold indices, but that's pretty gross.
Except it's not a List, because I think you can't have a list with members at compile time; it's a Cons list, which makes it much worse.
I just meant an ordinary stdlib List, which works fine at compile time
If that does work then that's nice.
And it might be doable, if very annoying.
Just checked, and this does all work fine
Could make things easier to use if the `List[BinaryTreeNode]` was ARCed and every node had a reference to it
Ok, a parse tree is doable. Is there a recursion depth limit at compile time?
Like, can you get a stack overflow? I would assume so.
I've definitely hit a compile-time memory limit, I'm just not sure of the value.
I'm pretty sure you can get `llvm out of memory`, an MLIR error about memory stuff, and it can just stall forever.
That's probably a real OOM.
Which does give us some space to work with.
Ok, time to validate that with some large tensors.
Ok, the parser (compiler and LSP) crashed before LLVM did, so I guess I'm filing a bug report tomorrow.
I figured out where you might have been going wrong and this also works fine. Your recursive structs just need pointers or references so they have some known size.
Should probably add as an extra heads-up that this isn't a safe abstraction. It really shouldn't be copyable, and I also forgot to add `__del__`. Ideally it would just use references, but those are still a bit rough to use.
I'll probably write my own `HeapRef[T]` type that does do safe management.
Is the pointer actually necessary here?
Yup. A struct has all its fields stored contiguously in memory and the total size is the sum of the size of those fields. This means you can't have a struct with a field that has the same type as the struct itself because you get infinite recursion trying to determine the size. You get around this by using pointers since a pointer has fixed size regardless of what's being pointed to.
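A minimal sketch of the workaround, assuming the current stdlib `UnsafePointer` (allocation and cleanup are omitted, so this is exactly the unsafe abstraction discussed above):

```mojo
from memory import UnsafePointer

struct TreeNode:
    var value: Int
    # A field of type TreeNode would make the size equation unsolvable;
    # a pointer to TreeNode is always one word, so this compiles.
    var left: UnsafePointer[TreeNode]
    var right: UnsafePointer[TreeNode]

    fn __init__(inout self, value: Int):
        self.value = value
        self.left = UnsafePointer[TreeNode]()   # null until a child is attached
        self.right = UnsafePointer[TreeNode]()
```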
This feels like a bug, but it isn't
No, this is how every single language that doesn't do heap indirection for you works. Languages that let you do it are usually automatically putting things behind pointers.
I know 😄
I figured it out pretty fast
Well, it is kind of a bug atm, in the sense that if you try it, your syntax highlighter will crash, and trying to compile gives you a seemingly unrelated error message
Is there an issue on GH?
not sure
You might want to raise one 🙂
@Darkmatter FYI I'm writing a parser in Mojo and it works elegantly for me.
We were discussing doing the parsing at compile time.
Never heard of it. How is that different from regular parsers?
The result of doing the parsing, and as much other work as you want to do, is embedded into the resulting binary instead of being done at runtime. For instance, you could parse a string that specifies a regex at compile time and convert it into a state machine, so that the resulting binary only has a "match" function which takes the state machine. This will be much faster than compiling the regex at runtime. For something like a serverless web request handler, this is useful because it lets you convert all of your routes into a big lookup table with perfect hashing.
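A tiny illustration of the mechanism, using a cheap bit-counting function as a stand-in for the expensive "parse the pattern" work (no real regex involved):

```mojo
fn bits_set(x: Int) -> Int:
    # Stand-in for expensive compile-the-pattern work: count set bits.
    var n = x
    var count = 0
    while n != 0:
        count += n & 1
        n = n >> 1
    return count

fn main():
    # Evaluated by the compiler; only the constant 16 lands in the binary.
    alias at_compile_time = bits_set(0xFFFF)
    # The same computation, executed every time the program runs.
    var at_run_time = bits_set(0xFFFF)
    print(at_compile_time, at_run_time)  # 16 16
```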
I could imagine significant benefits from this in places like Pygments