Fork me on GitHub
#instaparse
<
2017-02-23
>
anthony.naddeo21:02:02

I've eliminated a ton of ambiguity but I'm still seeing that its performance scales poorly with the size of the input, even when there are only 1 or two possible parses

anthony.naddeo21:02:23

but that same input when fed chunk by chunk adds up to a much more reasonable parse time

anthony.naddeo21:02:22

I was reading about the :optimize :memory option, but that seems to make it worse

anthony.naddeo21:02:41

I even tried it with Java instead of Node to see if it was just a JS thing and the results were still similar

anthony.naddeo21:02:27

My next step was about to be to write a function that breaks the input up into chunks, but to do that I'll need to know where a chunk starts/stops or impose some sort of convention around the input

hiredman22:02:57

have you seen https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md? I feel like I have seen splitting input in to chunks recommended for larger inputs to instaparse

anthony.naddeo22:02:30

Yeah that's where I got it from

anthony.naddeo22:02:47

It just seems like something I shouldn't have to do

anthony.naddeo22:02:03

If only because I need my parser to determine when a readable chunk starts/stops

anthony.naddeo22:02:10

it isn't a line by line thing

hiredman22:02:32

I might try and eliminate regexes in the grammar definition, if I recall correctly those can be (or maybe they were?) inefficient

anthony.naddeo22:02:59

Are there any examples you know of large grammars that scale linearly with the size of the input?

anthony.naddeo22:02:10

Before I go down a rabbit hole I want to make sure I know what to expect

hiredman22:02:59

no, I don't

hiredman22:02:44

I don't know of any large grammars, and I have only used smallish grammars on small inputs

anthony.naddeo22:02:48

I'm just afraid that the performance actually won't get better as hard as I may try and I want to cut my losses if I can.

anthony.naddeo22:02:57

Or just use this for things that only require parsing snippets

anthony.naddeo22:02:07

Thanks for the advice though @hiredman

hiredman22:02:53

it would be neat if instaparse could emit warnings and suggest fixes if your grammar is not LL(1)

anthony.naddeo22:02:06

At this point, it seems like what I really want is a mode where it parses in chunks itself. My grammar is really just a repetition of 4 possible blocks. It seems like it could just parse the first block independent of the second. That is to say, as soon as it matches just consider it a block and move on.

anthony.naddeo22:02:25

Given the right level of ambiguity I would be ok with massaging the results

aengelberg23:02:21

How large are your problematic inputs?

anthony.naddeo23:02:30

Well, it scales poorly. There isn't a size in specific. It parses the Elm programming langauge. When testing, I start with a single function, then I just keep duplicating that function and observe the performance

anthony.naddeo23:02:47

each additional function adds more than its fair share of time

anthony.naddeo23:02:55

I'll paste a snippet of something that takes too long

anthony.naddeo23:02:55

(def input "nextSource : Source a -> Source a
nextSource (Source a next) =
    Source (next a) next


nextSource : Source a -> Source a
nextSource (Source a next) =
    Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
    Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
    Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
    Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
    Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
  Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
  Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
  Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
  Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
  Source (next a) next

")

anthony.naddeo23:02:15

That one I actually just had to kill

anthony.naddeo23:02:28

but if you cut it in half it only takes about a second or so to parse

anthony.naddeo23:02:44

And its a pretty reasonable size for an Elm source file

aengelberg23:02:14

I'll try to take a look soon

anthony.naddeo23:02:52

Cool, thanks a lot

anthony.naddeo23:02:19

Also, definitely pretty new to clojure still. If I can make the code more approachable/replable feel free to point that out

anthony.naddeo23:02:24

I kind of just hacked everything together

aengelberg23:02:53

@anthony.naddeo I fixed it before:

start = (<ws>* block <ws>*)+
after:
start = (<ws> block <ws>)+

anthony.naddeo23:02:14

That? Let me try now

aengelberg23:02:34

the large input now takes 100ms on my machine

anthony.naddeo23:02:46

100ms? I just tried it and it is WAY better, but I wish I had those numbers

anthony.naddeo23:02:52

are you just in a node repl?

anthony.naddeo23:02:55

I'm down to 900ms now

anthony.naddeo23:02:02

compared to never stopping

anthony.naddeo23:02:14

I wouldn't have thought to do that

aengelberg23:02:26

sorry I'm in JVM, not JS

aengelberg23:02:46

900ms makes sense for node, instaparse in cljs is known to be ~ 10x slower

anthony.naddeo23:02:50

oh interesting. Did you use my parser.clj file?

anthony.naddeo23:02:10

yeah this is way better

aengelberg23:02:07

I just copied the grammar string from parser.cljs into my clojure file since the EBNF syntax is the same on both platforms

anthony.naddeo23:02:43

One odd thing though

anthony.naddeo23:02:58

I would expect that the * impacts performance because it adds ambiguity right?

anthony.naddeo23:02:29

If so, shouldn't there be more than a single parse for it? (time (count (insta/parses parser/parser input :unhide :all))) returned 1 unless I did something wrong

hiredman23:02:11

ambiguity can mean multiple results, but it can also mean a single result that required checking lots of different cases to arrive at

anthony.naddeo23:02:42

What metrics could people use to determine ambiguity if not parse counts?

hiredman23:02:22

#6 in that peformance doc suggests looking at rules individually to see if you have rules that in isolation can result in multiple parses

aengelberg23:02:24

Instaparse de-duplicates results, so some internal ambiguity can lurk around but still impact the perf

anthony.naddeo23:02:14

well, I'm glad that was an easy fix, thanks guys

anthony.naddeo23:02:24

I don't think I can go back to other parsers. I would have been heart broken

aengelberg23:02:11

btw @anthony.naddeo, I made the following optimization to your Name and name which halved the time for me:

Name = #'(?!\\b(if|then|else|in|let|case|of)\\b)[A-Z][a-zA-Z0-9]*'
    name = #'(?!\\b(if|then|else|in|let|case|of|type)\\b)[a-z][a-zA-Z0-9]*'

aengelberg23:02:39

Actually the first \b doesn't really do anything, because the first character of this particular token is the first char in the string from the regex's perspective

aengelberg23:02:43

But my optimization was to shift more responsibility into the regex; that is always ideal