instaparse | Discljord | Page 1

runic sierra Oct 24, 2023, 12:19 AM

#

Hey folks - I'm struggling with operator precedence in an instaparse grammar, and wondering if the collected brains trust might be able to help out? (probably doesn't help that I'm super tired while trying to work on this, so it's entirely possible I'm doing something incredibly dumb...)

#

Basically I have a "simple" little test grammar that I thought would support operator precedence, but it doesn't appear to do what I'm after.

#

Welp it appears in creating a test case for this thread I might have figured out what was going on. Typical! 🤡

#

Here it is, for anyone who's curious:
Here's the grammar:

(def grammar "expression = ('+'|'-'|ε) term (('+'|'-') term)*
              term = factor (('*'|'/') factor)*
              factor = var | number | (expression)
              var = #\"\\p{Alpha}\\p{Alnum}*\"
              number = #\"\\d+\"")

And here's some code that exercises it:

(require '[instaparse.core :as insta])
(require '[clojure.pprint :as pp])
(def parser (insta/parser grammar :start :factor))
(defn parse [s] (pp/pprint (parser s)))

; Some test cases

(parse "20")
;=> [:factor [:number "20"]] - ✅

(parse "1+2")
;=> [:factor
;=>  [:expression
;=>   [:term [:factor [:number "1"]]]
;=>   "+"
;=>   [:term [:factor [:number "2"]]]]] - ✅

(parse "1+2*3")
;=> [:factor
;=>  [:expression
;=>   [:term
;=>    [:factor
;=>     [:expression
;=>      [:term [:factor [:number "1"]]]
;=>      "+"
;=>      [:term [:factor [:number "2"]]]]]
;=>    "*"
;=>    [:factor [:number "3"]]]]] - ❌

(parse "1*2+3")
;=> [:factor
;=>  [:expression
;=>   [:term
;=>    [:factor [:number "1"]]
;=>    "*"
;=>    [:factor
;=>     [:expression
;=>      [:term [:factor [:number "2"]]]
;=>      "+"
;=>      [:term [:factor [:number "3"]]]]]]]] - ❌

tl;dr - I do seem to be getting precedence, just with the specific operators around the wrong way. 🤡

gloomy crypt Oct 24, 2023, 1:35 AM

#

your parens expression in factor doesn't match parens

#

it's just expression in a group with no alternates

#

match actual parens and it should work @runic sierra

runic sierra Oct 24, 2023, 1:40 AM

#

Yeah I copied this from an EBNF grammar.

#

Looks like I forgot to quote those parens.

gloomy crypt Oct 24, 2023, 1:44 AM

#

yeep

runic sierra Oct 24, 2023, 1:51 AM

#

Thanks!!

gloomy crypt Oct 24, 2023, 1:51 AM

#

No problem!

runic sierra Oct 24, 2023, 6:57 PM

#

So revisiting now that I'm fresher, and quoting the parens didn't fix the incorrect precedence.

#

New grammar:

(def grammar "expression = ('+'|'-'|ε) term (('+'|'-') term)*
              term = factor (('*'|'/') factor)*
              factor = var | number | expression | '(' expression ')'
              var = #\"\\p{Alpha}\\p{Alnum}*\"
              number = #\"\\d+\"")

gloomy crypt Oct 24, 2023, 6:58 PM

#

expression should not be a factor

#

only expression in parens should be

runic sierra Oct 24, 2023, 6:59 PM

#

Nah - then simple cases like (parse "1+1") fail.

#

user=> (def grammar "expression = ('+'|'-'|ε) term (('+'|'-') term)*
              term = factor (('*'|'/') factor)*
              factor = var | number | '(' expression ')'
              var = #\"\\p{Alpha}\\p{Alnum}*\"
              number = #\"\\d+\"")
#'user/grammar
user=>
user=> (def parser (insta/parser grammar :start :factor))
(defn parse [s] (pp/pprint (parser s)))
#'user/parser
user=> #'user/parse
user=>
user=> (parse "1+1")
{:index 0,
 :reason
 [{:tag :string, :expecting "("}
  {:tag :regexp, :expecting #"\d+", :full true}
  {:tag :regexp, :expecting #"\p{Alpha}\p{Alnum}*", :full true}],
 :line 1,
 :column 1,
 :text "1+1"}
nil

#

Oh doh - I have the wrong start tag. disappointed_toledo

#

Now it's working.

#

Man I really shouldn't try to code+fatigue...

gloomy crypt Oct 24, 2023, 7:01 PM

#

no idea why 1+1 would fail, it's at your top level expression grammar

#

yeah, fatigue is as bad as drinking for a lot of things

runic sierra Oct 24, 2023, 7:01 PM

#

Yeah I was starting the parse at factor, instead of expression:

(def parser (insta/parser grammar :start :factor))  ; This is wrong - it should be :expression

gloomy crypt Oct 24, 2023, 7:02 PM

#

ah, that'd do it

runic sierra Oct 24, 2023, 7:02 PM

#

Yeah, and because it's not "in" the grammar I wasn't even looking there.

#

Thanks again!!

gloomy crypt Oct 24, 2023, 7:02 PM

#

no problem!

runic sierra Oct 24, 2023, 7:03 PM

#

Have you heard of "rubber ducky" debugging? It's from The Pragmatic Programmer book (a mega classic).

#

I swear the results they describe from that practice are 100% true.

#

Sometimes literally speaking your problem out loud to an inanimate object makes it obvious what the problem is, even if you've been silently wrestling with it for ages.

gloomy crypt Oct 24, 2023, 7:04 PM

#

yeah, I use my friends to rubber duck all the time

runic sierra Oct 24, 2023, 7:04 PM

#

I feel like this has to be telling us something about how the verbal brain machinery engages a different part, or more, of the brain.

#

I should mention this to my older kiddo, who just started a neuroscience degree. Maybe they can formally study it. 😉

gloomy crypt Oct 24, 2023, 7:05 PM

#

that'd be cool

runic sierra Oct 24, 2023, 7:05 PM

#

Not that you're a rubber duck btw @gloomy crypt - not sure I would have found that unquoted parens issue as fast if you hadn't pointed it out right away. 😉

gloomy crypt Oct 24, 2023, 7:06 PM

#

lol I wouldn't mind anyway, but I'm happy to help

runic sierra Nov 16, 2024, 8:16 PM

#

Does anyone know if instaparse has a "find" mode? Basically like re-find but using a grammar rather than a regex.

I can think of many reasons why that's probably not something it offers, but was just wondering, as it does have a "total parse" mode that is kind of halfway to accomplishing this (in that mode, extraneous input at the end can be ignored - if there's a way to do that at the start too, then I think that would be enough for a "find" mode).

gloomy crypt Nov 16, 2024, 8:17 PM

#

I mean the way I'd do that is to just make a "prefix" node, but yeah it might be worth checking their API

runic sierra Nov 16, 2024, 8:17 PM

#

oh doh yeah of course.

gloomy crypt Nov 16, 2024, 8:17 PM

#

#"(?s).*?" or something

#

(?s) is the "dotall" mode where . can match newlines

#

as an aside I wonder if instaparse uses instaparse to parse the specification

runic sierra Nov 16, 2024, 8:19 PM

#

Though then I guess I'd need to repeatedly slice the input after each (successful) parse, in order to find other instances of the parsable text.

#

(which isn't a big deal ofc)

gloomy crypt Nov 16, 2024, 8:20 PM

#

oh yeah if you're trying to get re-seq using instaparse that's definitely something you should be doing by building a specification that matches the whole text

runic sierra Nov 16, 2024, 8:20 PM

#

The problem is the input is unstructured text.

#

The use case here is "finding" all SPDX expressions in a freeform text (e.g. a README.txt file)

gloomy crypt Nov 16, 2024, 8:21 PM

#

gloomy crypt `#"(?s).*?"` or something

yeah, you'd just do something where between the expressions you want to parse you put "arbitrary text" nodes which do the same kind of matching as here and build this into your whole parser.

runic sierra Nov 16, 2024, 8:21 PM

#

I also thought about writing a regex that "approximately matches" SPDX expressions, and then runs the parser over what it finds.

#

That may actually perform better?

#

Basically find "potential candidates" expressions using regex (which can't precisely match actual SPDX expressions AFAICT), then attempt to parse each snippet to determine whether it's actually an expression or not.

#

(given I already have a well tested SPDX expressions parser)

gloomy crypt Nov 16, 2024, 8:23 PM

#

runic sierra That may actually perform better?

could be

runic sierra Nov 16, 2024, 8:30 PM

#

Something like this would be a start, but it's not going to narrow things down much...

#"(s?)[\p{Alnum}\:\(\)\+\.\-]+(\s+(AND|OR|WITH)\s+[\p{Alnum}\:\(\)\+\.\-]+)*+"

Though perhaps if the first character class was replaced with an alternation of all possible ids (which is how the ABNF is defined) it'd be specific (albeit the size of the regex would blow out - there are ~750 ids in SPDX).

#

Plus LicenseRefs and AdditionRefs

#

(though I already have well tested regexes for those too)

#

I already have so much programmatic generation of regexes going on, what's a little more? 😜

gloomy crypt Nov 16, 2024, 8:33 PM

#

lol fair enough

runic sierra Nov 16, 2024, 8:34 PM

#

I'm sure if someone looked at the memory usage of this thing it'll be 75% instances of java.util.regex.Pattern by count. 😜

#

I already generate 2 regexes per license (so ~1500 regexes, based on SPDX license list v3.25.0), and then I have a bunch of hand-written ones (that, in some cases, are partially programmatically generated) on top of that.

#

Sometimes I can't help wondering what a "regexes as data structures" might look like...

#

Because generating valid regex strings, with correct escaping etc., is hellish...

#instaparse