2017-07-06 instaparse | Clojure Slack Archive

instaparse

matan 2017-07-06T17:01:13.035226Z

@aengelberg many thanks for the confirmation! of course, it stems from the difference between what regular expression language is and what grammars are, not from how instaparse works doesn't it 🙂 Would you agree to this?

aengelberg 2017-07-06T17:05:46.181443Z

You mean regexes’ greediness? I would say that’s an artifact of regexes’ supported use case not matching what Instaparse needs it to do. If regexes had a way to lazily emit multiple parse results of the same string, we could maybe use that to make regex non-terminals behave more intuitively.

aengelberg 2017-07-06T17:08:47.278085Z

e.g. if I match #"^a+" on "aaa" I’d like to see "a", "aa", "aaa"

matan 2017-07-06T17:10:15.323480Z

Yes, well, at a higher level, regular expressions and grammars generate disparate automatons, whereas what you mention is one difference in the detail thereof (the most relevant one). This relates to two area I'm lately looking at ― fuzzy parsing and scannerless parsing

matan 2017-07-06T17:10:47.340146Z

As to regex, have you ever used something like this? https://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches

matan 2017-07-06T17:11:44.369659Z

Oh, not really, ignore that last one

aengelberg 2017-07-06T17:12:17.387099Z

That’s returning one match for each of a variety of starting indexes, which is not quite what I want

matan 2017-07-06T17:12:36.396673Z

Yes, again, ignore that

aengelberg 2017-07-06T17:13:08.413012Z

You could theoretically do (for [i (range (count s))] (re-match re (subs s 0 i)))

aengelberg 2017-07-06T17:13:23.421224Z

but that is super slow and might not actually work depending on the regex

aengelberg 2017-07-06T17:15:22.483776Z

the automaton structure of regexes is simply not designed to (efficiently) reason about the set of all possible matches for one string, using backtracking and laziness

aengelberg 2017-07-06T17:15:53.500052Z

actually maybe if you compile a NDFA

aengelberg 2017-07-06T17:16:02.504707Z

then run it through one character at a time

aengelberg 2017-07-06T17:16:56.532679Z

and mark every time you are in a success state

matan 2017-07-06T17:25:03.793104Z

The interaction of a grammar with the need to account for syntactic categories (a.k.a variables, and/or non-terminals) that translate to non-finite sets of productions is very interesting. Right now, we typically "escape" to regular expressions for that. Of course, this goes beyond the scope of my original question with its particular silly use case.

aengelberg 2017-07-06T17:27:11.863939Z

In this case, the set of productions is not infinite; for the current index i, all I really need to know is, the set of indexes j \in [0,N) such that s[i:j] is a complete parse according to the regex.

matan 2017-07-06T17:29:22.938677Z

Right, obviously, when looking at a specific input text (or string).

matan 2017-07-06T17:32:13.034978Z

But in the general sense, I also meant to say that we use regular expressions for where we want to stipulate a non-finite set, whereas a plain CFG doesn't provide (IIRC) that kind of support, which is why we use regex along with it. I have to refresh and brush up more on formal languages though, maybe what I just said is totally incorrect, or just sufficiently inaccurate to be incorrect.

Clojurians Log v2

instaparse