#instaparse

1 messages ยท Page 1 of 1 (latest)

runic sierra
#

Hey folks - I'm struggling with operator precedence in an instaparse grammar, and wondering if the collected brains trust might be able to help out? (probably doesn't help that I'm super tired while trying to work on this, so it's entirely possible I'm doing something incredibly dumb...)

#

Basically I have a "simple" little test grammar that I thought would support operator precedence, but it doesn't appear to do what I'm after.

#

Welp it appears in creating a test case for this thread I might have figured out what was going on. Typical! ๐Ÿคก

#

Here it is, for anyone who's curious:
Here's the grammar:

(def grammar "expression = ('+'|'-'|ฮต) term (('+'|'-') term)*
              term = factor (('*'|'/') factor)*
              factor = var | number | (expression)
              var = #\"\\p{Alpha}\\p{Alnum}*\"
              number = #\"\\d+\"")

And here's some code that exercises it:

(require '[instaparse.core :as insta])
(require '[clojure.pprint :as pp])
(def parser (insta/parser grammar :start :factor))
(defn parse [s] (pp/pprint (parser s)))

; Some test cases

(parse "20")
;=> [:factor [:number "20"]] - โœ…

(parse "1+2")
;=> [:factor
;=>  [:expression
;=>   [:term [:factor [:number "1"]]]
;=>   "+"
;=>   [:term [:factor [:number "2"]]]]] - โœ…

(parse "1+2*3")
;=> [:factor
;=>  [:expression
;=>   [:term
;=>    [:factor
;=>     [:expression
;=>      [:term [:factor [:number "1"]]]
;=>      "+"
;=>      [:term [:factor [:number "2"]]]]]
;=>    "*"
;=>    [:factor [:number "3"]]]]] - โŒ

(parse "1*2+3")
;=> [:factor
;=>  [:expression
;=>   [:term
;=>    [:factor [:number "1"]]
;=>    "*"
;=>    [:factor
;=>     [:expression
;=>      [:term [:factor [:number "2"]]]
;=>      "+"
;=>      [:term [:factor [:number "3"]]]]]]]] - โŒ

tl;dr - I do seem to be getting precedence, just with the specific operators around the wrong way. ๐Ÿคก

gloomy crypt
#

your parens expression in factor doesn't match parens

#

it's just expression in a group with no alternates

#

match actual parens and it should work @runic sierra

runic sierra
#

Yeah I copied this from an EBNF grammar.

#

Looks like I forgot to quote those parens.

gloomy crypt
#

yeep

runic sierra
#

Thanks!!

gloomy crypt
#

No problem!

runic sierra
#

So revisiting now that I'm fresher, and quoting the parens didn't fix the incorrect precedence.

#

New grammar:

(def grammar "expression = ('+'|'-'|ฮต) term (('+'|'-') term)*
              term = factor (('*'|'/') factor)*
              factor = var | number | expression | '(' expression ')'
              var = #\"\\p{Alpha}\\p{Alnum}*\"
              number = #\"\\d+\"")
gloomy crypt
#

expression should not be a factor

#

only expression in parens should be

runic sierra
#

Nah - then simple cases like (parse "1+1") fail.

#
user=> (def grammar "expression = ('+'|'-'|ฮต) term (('+'|'-') term)*
              term = factor (('*'|'/') factor)*
              factor = var | number | '(' expression ')'
              var = #\"\\p{Alpha}\\p{Alnum}*\"
              number = #\"\\d+\"")
#'user/grammar
user=>
user=> (def parser (insta/parser grammar :start :factor))
(defn parse [s] (pp/pprint (parser s)))
#'user/parser
user=> #'user/parse
user=>
user=> (parse "1+1")
{:index 0,
 :reason
 [{:tag :string, :expecting "("}
  {:tag :regexp, :expecting #"\d+", :full true}
  {:tag :regexp, :expecting #"\p{Alpha}\p{Alnum}*", :full true}],
 :line 1,
 :column 1,
 :text "1+1"}
nil
#

Oh doh - I have the wrong start tag. disappointed_toledo

#

Now it's working.

#

Man I really shouldn't try to code+fatigue...

gloomy crypt
#

no idea why 1+1 would fail, it's at your top level expression grammar

#

yeah, fatigue is as bad as drinking for a lot of things

runic sierra
#

Yeah I was starting the parse at factor, instead of expression:

(def parser (insta/parser grammar :start :factor))  ; This is wrong - it should be :expression
gloomy crypt
#

ah, that'd do it

runic sierra
#

Yeah, and because it's not "in" the grammar I wasn't even looking there.

#

Thanks again!!

gloomy crypt
#

no problem!

runic sierra
#

Have you heard of "rubber ducky" debugging? It's from The Pragmatic Programmer book (a mega classic).

#

I swear the results they describe from that practice are 100% true.

#

Sometimes literally speaking your problem out loud to an inanimate object makes it obvious what the problem is, even if you've been silently wrestling with it for ages.

gloomy crypt
#

yeah, I use my friends to rubber duck all the time

runic sierra
#

I feel like this has to be telling us something about how the verbal brain machinery engages a different part, or more, of the brain.

#

I should mention this to my older kiddo, who just started a neuroscience degree. Maybe they can formally study it. ๐Ÿ˜‰

gloomy crypt
#

that'd be cool

runic sierra
#

Not that you're a rubber duck btw @gloomy crypt - not sure I would have found that unquoted parens issue as fast if you hadn't pointed it out right away. ๐Ÿ˜‰

gloomy crypt
#

lol I wouldn't mind anyway, but I'm happy to help

runic sierra
#

Does anyone know if instaparse has a "find" mode? Basically like re-find but using a grammar rather than a regex.

I can think of many reasons why that's probably not something it offers, but was just wondering, as it does have a "total parse" mode that is kind of halfway to accomplishing this (in that mode, extraneous input at the end can be ignored - if there's a way to do that at the start too, then I think that would be enough for a "find" mode).

gloomy crypt
#

I mean the way I'd do that is to just make a "prefix" node, but yeah it might be worth checking their API

runic sierra
#

oh doh yeah of course.

gloomy crypt
#

#"(?s).*?" or something

#

(?s) is the "dotall" mode where . can match newlines

#

as an aside I wonder if instaparse uses instaparse to parse the specification

runic sierra
#

Though then I guess I'd need to repeatedly slice the input after each (successful) parse, in order to find other instances of the parsable text.

#

(which isn't a big deal ofc)

gloomy crypt
#

oh yeah if you're trying to get re-seq using instaparse that's definitely something you should be doing by building a specification that matches the whole text

runic sierra
#

The problem is the input is unstructured text.

#

The use case here is "finding" all SPDX expressions in a freeform text (e.g. a README.txt file)

gloomy crypt
# gloomy crypt `#"(?s).*?"` or something

yeah, you'd just do something where between the expressions you want to parse you put "arbitrary text" nodes which do the same kind of matching as here and build this into your whole parser.

runic sierra
#

I also thought about writing a regex that "approximately matches" SPDX expressions, and then runs the parser over what it finds.

#

That may actually perform better?

#

Basically find "potential candidates" expressions using regex (which can't precisely match actual SPDX expressions AFAICT), then attempt to parse each snippet to determine whether it's actually an expression or not.

#

(given I already have a well tested SPDX expressions parser)

gloomy crypt
runic sierra
#

Something like this would be a start, but it's not going to narrow things down much...

#"(s?)[\p{Alnum}\:\(\)\+\.\-]+(\s+(AND|OR|WITH)\s+[\p{Alnum}\:\(\)\+\.\-]+)*+"

Though perhaps if the first character class was replaced with an alternation of all possible ids (which is how the ABNF is defined) it'd be specific (albeit the size of the regex would blow out - there are ~750 ids in SPDX).

#

Plus LicenseRefs and AdditionRefs

#

(though I already have well tested regexes for those too)

#

I already have so much programmatic generation of regexes going on, what's a little more? ๐Ÿ˜œ

gloomy crypt
#

lol fair enough

runic sierra
#

I'm sure if someone looked at the memory usage of this thing it'll be 75% instances of java.util.regex.Pattern by count. ๐Ÿ˜œ

#

I already generate 2 regexes per license (so ~1500 regexes, based on SPDX license list v3.25.0), and then I have a bunch of hand-written ones (that, in some cases, are partially programmatically generated) on top of that.

#

Sometimes I can't help wondering what a "regexes as data structures" might look like...

#

Because generating valid regex strings, with correct escaping etc., is hellish...