This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2018-02-25
Channels
- # aleph (1)
- # beginners (72)
- # boot (3)
- # cider (28)
- # cljs-dev (193)
- # cljsrn (11)
- # clojure (73)
- # clojure-brasil (3)
- # clojure-gamedev (2)
- # clojure-russia (6)
- # clojure-spec (30)
- # clojure-uk (26)
- # clojured (1)
- # clojurescript (32)
- # code-reviews (9)
- # core-async (4)
- # datascript (5)
- # datomic (9)
- # dirac (38)
- # fulcro (23)
- # garden (11)
- # leiningen (1)
- # luminus (11)
- # lumo (6)
- # off-topic (17)
- # quil (2)
- # re-frame (2)
- # reagent (3)
- # ring (3)
- # shadow-cljs (12)
- # spacemacs (4)
- # sql (2)
- # unrepl (85)
- # vim (3)
Having a hard time with Spec performance on runaway regular expressions for a simple lexer/parser to lex a string into tokens before I parse it. I'm trying to parse a string into tokens so I can put it through a second "compiler" pass:
(s/def ::plexpr*
(s/+
(s/alt
:whitespace (s/+ #{\space \tab \formfeed \backspace})
:line-sep (s/+ #{\newline \return \formfeed}) ;(s/+ line-separator?)
:atom (s/+ (set "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890")))))
(s/def ::plexpr
(s/and
(s/conformer seq)
::plexpr*))
(s/conform ::plexpr "drive home fast") ;; => (fast), [[:atom [\d \r \i \v \e]] [[:whitespace [\space]] [:atom [\h \o \m \e]] [:whitespace [\space]] [:atom [\f \a \s \t]]]]
(s/conform ::plexpr "drive home fast but now it gets slow") ;; => extremely slow
I'm not using any optional s/? regexes, so I don't understand why this would cause any kind of exponential search@petrus clojure.spec
is not intended for string parsing -- you know that, right?
The regexes are for "parsing" sequences, not strings. You probably want to look at Instaparse which is designed for parsing strings.
Thanks - a string is a sequence, no? 😉 I can switch to instaparse for tokenising, but I'm worried about the length of data sequences I'll be able to handle with Spec
A string can be coerced to a sequence 🙂 I saw @U0516PHE3 mention that your spec is an ambiguous grammar and that might be causing the exponential slow down. As for data sequences, I would expect most folks would use coll-of
or every
and the elements would be self-contained specs (so, no ambiguity).
How can I make my spec unambiguous? If I split my string into a word tokens, can I still use spec and coll-of/every to parse the substrings? Or do I need an instaparse step before I send it to spec?
Right now "aa"
can parse as "atom (a)" "atom (a)" or "atom (aa)" so you could specify that the overall sequence is "atom" then a sequence of "whitespace" and "atom" perhaps? (assuming you trim leading/trailing whitespace before parsing?)
If you require whitespace separating the atoms, then "aa"
can only parse as "atom (aa)"
You'll want to make sure whitespace and line endings don't overlap too (so there's no ambiguity there) and if you care about the difference have your overall spec clearly indicate line / sep / line structure and line as a spec indicate atom / whitespace / atom structure. Make sense?
it's still a surprising slowdown, I would say
(and I did say it, in fact)
I'm kinda surprised even the 15-element string is handled "fast". That would be long for an argument list so I'd say it was outside the intended design space?
the spec regexes are only meant for arglists?
Processing "drive"
takes 3ms, "drive home"
takes 72ms, "drive home fast"
takes almost 1 second. That doesn't feel "fast" to me.
I coincidentally just wrote code that uses spec to parse the lines of a 17277-line file; it took 0.6s
@gfredericks Not just arglists but I don't think folks are spec'ing long sequences in data...? And data is more likely to get spec'd with coll-of
or every
I think?
I wonder why that example appears to show exponential behavior. What are the bounds on parsing with derivatives?
I'm using regexes because I actually want the parsing capabilities, and I think this would be too hard with instaparse
Even if that example above isn't a bug -- it's a valuable example for spec's test suite.
yeah I can't imagine any good reason for something that simple to take a long time
I got an OOM error trying to parse that long string 👀
Is the performance still bad if you feed it the seq directly and remove the s/and & the conformer?
Adding each new word seems to double the time it takes. I think there's a clue in how it conforms -- for one word you get a sequence of one element which is a sequence for the atom: (time (s/conform ::plexpr "one"))
=> [[:atom ["o" "n" "e"]]]
(in just a ms or two), but (time (s/conform ::plexpr "one two"))
=> [[:atom ["o" "n" "e"]] [[:whitespace [" "]] [:atom ["t" "w" "o"]]]]
-- note: two elements, with the second containing the whitespace and the next word. Adding more words still produces a two-element sequence with the second one containing all the extra elements.
You could change the set predicates to functions that print some debug information. That would give you an idea of what things the spec engine is unnecessarily repeating.
I think that parser is ambiguous, because a string of "aa" could match [:atom ["a"]] [:atom ["a"]]
or [:atom ["a" "a"]]
. That could explain the exponential growth.
ooh good point
Thanks - a string is a sequence, no? 😉 I can switch to instaparse for tokenising, but I'm worried about the length of data sequences I'll be able to handle with Spec
How can I make my spec unambiguous? If I split my string into a word tokens, can I still use spec and coll-of/every to parse the substrings? Or do I need an instaparse step before I send it to spec?