#Compiling Regex to NFA

15 messages · Page 1 of 1 (latest)

gentle zenith
#

As a little side learning project, I am attempting to implement my own little Regex engine. I've decided to go with compiling Regex expressions into nondeterministic finite automata since that seems to be what a lot of literature suggests, but I'm a little stuck on the actual implementation.

Specifically, I'm trying to figure out how to handle priority between transitions. A state can have multiple transitions, for instance a* will have one ε transition to the end state and one a transition back to itself. * is greedy, so even though the ε transition is always an option, it should "prefer" the a transition back to itself, and only when out of as or when backtracking should the ε transition be traversed. a*? has the same transitions but conversely "prefers" the ε transition since *? is lazy and should only consume input when backtracking.

How would you handle this kind of priority system in practice? I'm currently representing the automata as a graph where each state is a node and each transition an edge.

modern idol
#

NFA doesn't directly encode any "priority" system. You take whichever path gives you a more favorable outcome.

gentle zenith
modern idol
gentle zenith
#

Like, (a+) can match just a single a in the string aaa, but if you care about the capture group being correct then you can't just use the first best option. I think at least.

abstract cipher
#

Can you just order them appropriately? How are you storing them?

gentle zenith
#

Yeah I could probably just order the transitions tbh

vital juniper
# gentle zenith As a little side learning project, I am attempting to implement my own little Re...

Confusingly, "regular expressions" refer to two different things.

  1. Regular expressions as found in theoretical computer science. The * thing is not greedy. It matches however many characters is needed to make the rest of the pattern matches. This kind of regular expression can be straightforwardly converted into an NFA, without needing any kind of priority system.

  2. Regex, as descended from Perl. This is found in many programming languages. The * thing is greedy, and will match as many repetitions as possible. I don't know if this can be compiled to an NFA or not.

gentle zenith
#

I did find this article about it, and it doesn't seem that hard.

vital juniper
#

Searching for "pcre" might find the kind of regex you want

gentle zenith
vital juniper