Fork me on GitHub
#clojure-spec
<
2018-02-25
>
Petrus Theron14:02:26

Having a hard time with Spec performance on runaway regular expressions for a simple lexer/parser to lex a string into tokens before I parse it. I'm trying to parse a string into tokens so I can put it through a second "compiler" pass:

(s/def ::plexpr*
  (s/+
    (s/alt
      :whitespace (s/+ #{\space \tab \formfeed \backspace})
      :line-sep (s/+ #{\newline \return \formfeed})                   ;(s/+ line-separator?)
      :atom (s/+ (set "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890")))))

(s/def ::plexpr
  (s/and
    (s/conformer seq)
    ::plexpr*))

(s/conform ::plexpr "drive home fast") ;; => (fast), [[:atom [\d \r \i \v \e]] [[:whitespace [\space]] [:atom [\h \o \m \e]] [:whitespace [\space]] [:atom [\f \a \s \t]]]]
(s/conform ::plexpr "drive home fast but now it gets slow") ;; => extremely slow
I'm not using any optional s/? regexes, so I don't understand why this would cause any kind of exponential search

seancorfield18:02:05

@petrus clojure.spec is not intended for string parsing -- you know that, right?

seancorfield18:02:45

The regexes are for "parsing" sequences, not strings. You probably want to look at Instaparse which is designed for parsing strings.

Petrus Theron07:02:43

Thanks - a string is a sequence, no? 😉 I can switch to instaparse for tokenising, but I'm worried about the length of data sequences I'll be able to handle with Spec

seancorfield17:02:07

A string can be coerced to a sequence 🙂 I saw @U0516PHE3 mention that your spec is an ambiguous grammar and that might be causing the exponential slow down. As for data sequences, I would expect most folks would use coll-of or every and the elements would be self-contained specs (so, no ambiguity).

Petrus Theron19:02:49

How can I make my spec unambiguous? If I split my string into a word tokens, can I still use spec and coll-of/every to parse the substrings? Or do I need an instaparse step before I send it to spec?

seancorfield20:02:24

Right now "aa" can parse as "atom (a)" "atom (a)" or "atom (aa)" so you could specify that the overall sequence is "atom" then a sequence of "whitespace" and "atom" perhaps? (assuming you trim leading/trailing whitespace before parsing?)

seancorfield20:02:02

If you require whitespace separating the atoms, then "aa" can only parse as "atom (aa)"

seancorfield20:02:10

You'll want to make sure whitespace and line endings don't overlap too (so there's no ambiguity there) and if you care about the difference have your overall spec clearly indicate line / sep / line structure and line as a spec indicate atom / whitespace / atom structure. Make sense?

gfredericks19:02:43

it's still a surprising slowdown, I would say

gfredericks19:02:54

(and I did say it, in fact)

seancorfield19:02:34

I'm kinda surprised even the 15-element string is handled "fast". That would be long for an argument list so I'd say it was outside the intended design space?

gfredericks19:02:19

the spec regexes are only meant for arglists?

seancorfield19:02:26

Processing "drive" takes 3ms, "drive home" takes 72ms, "drive home fast" takes almost 1 second. That doesn't feel "fast" to me.

gfredericks19:02:41

I coincidentally just wrote code that uses spec to parse the lines of a 17277-line file; it took 0.6s

seancorfield19:02:09

@gfredericks Not just arglists but I don't think folks are spec'ing long sequences in data...? And data is more likely to get spec'd with coll-of or every I think?

ghadi19:02:19

I wonder why that example appears to show exponential behavior. What are the bounds on parsing with derivatives?

gfredericks19:02:30

I'm using regexes because I actually want the parsing capabilities, and I think this would be too hard with instaparse

ghadi19:02:40

There was an updated paper Matt Might released that improved performance.

ghadi19:02:22

Even if that example above isn't a bug -- it's a valuable example for spec's test suite.

ghadi19:02:45

(... but it's probably a bug)

gfredericks19:02:04

yeah I can't imagine any good reason for something that simple to take a long time

seancorfield19:02:32

I got an OOM error trying to parse that long string 👀

ghadi19:02:58

Is the performance still bad if you feed it the seq directly and remove the s/and & the conformer?

seancorfield19:02:43

Adding each new word seems to double the time it takes. I think there's a clue in how it conforms -- for one word you get a sequence of one element which is a sequence for the atom: (time (s/conform ::plexpr "one")) => [[:atom ["o" "n" "e"]]] (in just a ms or two), but (time (s/conform ::plexpr "one two")) => [[:atom ["o" "n" "e"]] [[:whitespace [" "]] [:atom ["t" "w" "o"]]]] -- note: two elements, with the second containing the whitespace and the next word. Adding more words still produces a two-element sequence with the second one containing all the extra elements.

aengelberg19:02:18

You could change the set predicates to functions that print some debug information. That would give you an idea of what things the spec engine is unnecessarily repeating.

aengelberg19:02:13

I think that parser is ambiguous, because a string of "aa" could match [:atom ["a"]] [:atom ["a"]] or [:atom ["a" "a"]]. That could explain the exponential growth.

gfredericks19:02:07

ooh good point

devn23:02:06

How does one access a spec within the registry's req/req-un/opt/opt-un info?

devn23:02:27

I'd like to peek at a spec's list of req/req-un/opt/opt-un

Petrus Theron07:02:43

Thanks - a string is a sequence, no? 😉 I can switch to instaparse for tokenising, but I'm worried about the length of data sequences I'll be able to handle with Spec

Petrus Theron19:02:49

How can I make my spec unambiguous? If I split my string into a word tokens, can I still use spec and coll-of/every to parse the substrings? Or do I need an instaparse step before I send it to spec?