#Simplified Regex Parser Code Review

20 messages · Page 1 of 1 (latest)

tired marsh
#

I just finished making a simplified regex parser from scratch, and would like to see if there are things that could be done better.
This version of regex is targeted at what has given me the most issues, so it's limited to simple groups ( ), alternations |, star repeats *, and literal chars.

The main new idea I had is to use splicing on the groups, so a sequence of tokens [..., GroupStart, ... GroupEnd, ...] gets coalesced into [..., Group(...), ...], and [..., x, Repeat] to [..., Repeat(x)].
Additionally I plan to stuff most of the complexity like char classes and escapes into the lexing stage, since they can be treated as literals for the actual parsing stage.

I really don't like how the tokens and output share the Regex enum instead of being a separate Token enum, but I don't see a clean way to do the splicing without it.
Theoretically Regex: Clone shouldn't be needed, since if I could split the Vec<Regex> into 3 owned Vecs, only the middle one needs to be moved, and then the two remaining get joined back with the resulting group, but I don't see a good way to do that.

I also still don't see how the parsing is possible using any of the existing libraries despite trying almost all of them for way too long, so if anyone wants to make an example it would be very welcome.

Playground link of the code

dull dirge
#

At a risk of giving you something you're trying to avoid for learning purposes, you're aware of regex_syntax?

Gimme some time, I should be able to take a look at your code soonish

#

I don't see a clean way to do the splicing without it.
The conventional answer would be that you don't. Usually, the parser's output is not made by modifying the lexer's output in-place, it's its own entirely unrelated thing. First you produce a list of tokens, and then you process that into a regex AST

#

As for the parsing... if we ignore the fact you for some reason decided to do it with Vec::splice and that gave you a number of problems (it's the reason you need Clone and it's the reason you can't make two separate enums), your parser appears to work, so I'm not sure what you mean when you say you don't see how it's possible?

tired marsh
tired marsh
dull dirge
#

The initial recap is that it is possible to parse using an API like this

pub enum Regex {
    Empty,
    Literal(char),
    Alt(Vec<Regex>),
    Concat(Vec<Regex>),
    Group(Box<Regex>),
    Repeat(Box<Regex>),
    Error(Error),
}
pub enum Token {
    GroupStart,
    GroupEnd,
    Alt,
    Repeat,
    Literal(char),
}
pub enum Error {
    UnclosedGroup(Box<Regex>),
    UnopenedGroup,
    NoTokenToRepeat,
    UnrepeatableToken,
}
pub fn lex(input: &str) -> Vec<Token> {
    let mut chars = input.chars();
    let mut tokens = Vec::<Regex>::new();

    while let Some(chr) = chars.next() {
        tokens.push(match chr {
            '(' => Token::GroupStart,
            ')' => Token::GroupEnd,
            '|' => Token::Alt,
            '*' => Token::Repeat,
            c => Token::Literal(c),
        })
    }

    tokens
}
pub fn parse(tokens: &[Token]) -> Regex {...}
#

As for the body of parse... You might be interested in asking #lang-dev if they have advice on parsing expressions. Parsing stuff that looks like (a|b)c* isn't very different, conceptually, from (a+b)*c.

When the entire grammar is solely expressions, I like to write a pratt parser, which there is a good tutorial for here.

If you want to do this with existing libs, chumsky has a pratt combinator, but otherwise you'll probably benefit from reading up on how to parse expressions in a "recursive descent parser" and in particular on what "left recursion" is and how to deal with it

#

(I don't curently have time to write an example, apologies)

tired marsh
#

@dull dirge I've read through that article, and some other resources multiple times, and I still don't get it, since I can't seem to modify it to do what I need. I still don't get how to do anything besides parse into S-Exprs, which are terrible for representing regex and would need to be fixed after, and I also need two pre-processing steps to get the implicit tokens correct. And it still doesn't work, because the alts only pull from the surrounding tokens, not all surrounding tokens of lower power. Could you please help me understand where I'm going wrong with this? playground link version with trace

I'm also slowly approaching understanding recursive descent. My most recent try uses a pre-parsing step to add the depth, sort of like pratt parsing, then uses that to determine the group and alt ranges for something like recursive descent. The issue is that this doesn't work nicely with repeat, even with the ability to parse the tokens in reverse.

One other fun thing I thought of is that the complexity of repeat can't be eliminated from the equation simply by parsing in reverse, since the source of group type information and unclosed group error both rely on parsing left to right

dull dirge
# tired marsh <@319492293352620044> I've read through that article, and some other resources m...

I had some time today, so here's a recursive descent parser: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=1d1e389993386b0684fabdc905ace47b. Note this isn't an exact match for your parser: some of the errors are slightly different. For example, if you feed it |* it will tell you you're trying to repeat an empty sequence (<empty>|<empty>*, in a way) rather than telling you you're trying to repeat the alternation token. It could be made an exact match with some more work, but at some point error messages become subjective (as best as I can tell, both my and your interpretation of |* are reasonable, for example)

Apparently I misled you by accident: There isn't really a need for precedence shenanigans (or rather, the only place where you could have precedence issues, code like a|b*, ends up with the correct precedences without the need to explicitly worry about that). Apologies, I expected parsing regex to require more complex tooling.

While I was writing it, it also occurred to me that your original splice-based implementation is closer in principle to a recursive ascent parser (which is a kind of LR parser). I know those extremely vaguely, but the fundamental principle is something like "look at the local sequence of tokens, and if it matches some AST variant, replace it with that AST variant". I personally find splice parsing to be really hard to understand, but since you clearly find it simpler, you might be interested in this idea.

tired marsh
# dull dirge I had some time today, so here's a recursive descent parser: <https://play.rust-...

Thanks for taking the time to write this

Looking at it, I think I understand. The state is stored in two different places, the call stack and sequence vec. The call stack stores the nesting, and the sequence vec stores adjacent items. This lets the previous sequence be used for the calls that need it. I was able to fix some things caused by my non-comprehensive examples, so I think that does show I've improved. I would have never thought of Peeked though, that is really smart way to check the last item irrespective of the depth it got to. As usual my fix is of dubious quality, since I always seem to need to have Regex: Clone. link

Regarding recursive ascent, I have looked into those since I also saw that connection, but those seem even more impossible to grasp, or even find quality literature on it that isn't a bajillion dollars. The key difference with all of those and mine seem to be that they put all the stuff onto a single stack, and either shift or reduce, thus the name shirt-reduce parsers, instead of actually parsing the bottom elements first as the name would imply. It's probably still possible to do, but the state table would most likely be hellishly complicated to account for the alt/concat/repeat issues. There's also a variant I've seen that uses the call stack as the stack, but that looks even more impossible to me, at least in a "pure" manner that doesn't use an extra storage, which in the end would probably result in code like what you just wrote.

#

It looks to me like the key methodology difference between my understanding an recursive decent is how to handle precedence issues. Using the separate stacks allows for operations that need to take the past items to take only them, and since regex has clear scope demarcations, it ends up fine. My idea is to "parse what you need before you need it", which I can't seem to find anywhere else. I have also since realized it can be made cleaner by making the stack a Vec<Either<Token, Regex>>, which still works the same, but I still can't find a closely comparable idea.

dull dirge
# tired marsh Thanks for taking the time to write this Looking at it, I think I understand. T...

Peeked was originally just a Peekable iterator while I wrote this, but I realized that I wanted the "peeked element" to be a different type from Token since most variants of Token would have been flatly impossible and matching them anyway seemed sad, so I did that trick :). I considered having a &mut Peeked argument instead, but it doesn't make much of a difference.

Something else I considered was a &mut Token argument for "last token seen", which would have allowed things like parse_repeat to check if this looked like an attempt to parse a nonrepeatable token, but in the end that would only have changed the kind of error, so I didn't. It's a neat trick if you prefer the error message to say "attempted to repeat |, which isn't repeatable" rather than "attempted to repeat an empty sequence", though.

As for Regex: Clone, this is only necessary because you used [left, right].concat(): concat more or less works on &[&[T]] (ok, &[impl Borrow<[T]>], but that's fairly similar in the end), and that means it needs to clone the Ts to get owned ones.
You could get rid of it by going through iterators instead: [left, right].into_iter().flatten().collect() (or equivalently iter::once(left).chain([right]).flatten().collect() or left.into_iter().chain(right).collect())

#

Finally, I see you've decided to represent a|b|c as an Alt([a, b, c]) rather than Alt([Alt([a, b]), c]). I suppose that explains why your original code had a vector, I'd assumed an Alt stood for one alternative. My first instinct is to check if this is worth it: does whatever you're doing with this AST benefit from the increase in parser complexity?
If it does, your design for concatenating consecutive Alts isn't bad. It's not ideal, but I'm reasonably sure the ideal design involves a lot more complexity (I think you'd want to duplicate parse_sequence to decide whether it should consume |s or just peek them, and add a Peeked::Alt variant which parse_alt looks for?), so you might not find the hypothetical extra performance (I'm not sure this would be faster, but I suspect it would be) worthwhile

#

About recursive ascent, I can't really help you much with that. I know it exists, but that's about all I know. #lang-dev would likely know more

#

huh, nightly has an iter::chain(left, right).collect(), that would be the shortest way to write the concatenation

tired marsh
dull dirge
#

Excellent, objection retracted