#clojure-spec
2021-11-06
qqq01:11:29

Since spec allows arbitrary predicates, which can use arbitrary time / memory, how do most people reason about the time/space usage of spec functions / algorithms? Do they (a) not worry about it, since in most cases it's fine, or (b) have sufficient knowledge of spec internals to know which predicate is called on what, and when?

seancorfield02:11:21

@qqq The same way you reason about any other functions in Clojure.

qqq03:11:55

@seancorfield: Are you saying this because spec internals are so simple that one should be able to reason about its runtime the way one reasons about map / filter / vec-ops, or are you trying to end the conversation before the discussion of spec internals can start?

seancorfield03:11:46

You said "arbitrary predicates, which can use arbitrary time / memory" -- can you reason about those predicates?

seancorfield04:11:08

Because if you start from the premise that the complexity of the predicates can be arbitrary, then you have to see that the complexity of Specs that are built on predicates can also be arbitrary, right?

seancorfield04:11:12

How would you reason about any library that can take in arbitrary functions (predicates) and calls them on a data structure you pass in?

seancorfield04:11:02

Two people have already said there are no guarantees on the run time of Spec -- and I agree -- and there are also no guarantees on the space usage (for the same reason: it's built on arbitrary predicates). But Spec isn't "magic", it's open source so anyone can inspect the source code and reason about it just as much as they can with any other open source library.

seancorfield04:11:01

I think, in general, people don't worry much about the internals of any library like that -- unless they see unpredictable or unacceptable time/space usage and then they profile their code and identify the root cause.

seancorfield04:11:29

But with any of these complex libraries (rules engines and databases/query-based systems also come to mind), you just have to be careful about the complexity of the "queries" you run -- and there are nearly always pathological cases that catch us out. That's just kind of the way things are in software.

seancorfield04:11:31

At work, we've used Spec very heavily ever since it was released. We use it for a wide variety of things (I've blogged about our usage of Spec at http://corfield.org if you're interested) -- and we do use it in production code for data validation. Most of the data structures we validate are pretty flat and, in general, relatively small -- so the Specs around them are relatively straightforward. And, in that context, we have seen no performance problems.

seancorfield04:11:59

But there are (obviously) going to be data structures that are sufficiently complex that a full Spec is also going to be very complex and checking might be "slow". There might be values that have sufficiently complex semantics that a fully-fleshed out Spec might be "slow" -- but that's true whether you use Spec or whether you write the validation logic by hand (with the caveat that a generic system like Spec is always likely to have more overhead than hand-crafted code).

seancorfield04:11:11

Does that help answer your questions @qqq?

qqq04:11:23

@seancorfield That is a reasonable interpretation of my question, but not the way I was intending it. Consider for a moment a strict, non-lazy map; let's call it smap. Let f be some arbitrary function. What is the running time of (smap f lst)? Now, you could make the argument that because f can take arbitrary memory/time, we can't say anything about (smap f lst). On the other hand, we can say that its running time is (reduce + (map (fn [x] (running_time 'f x)) lst)). One can argue about whether this is useful, but knowing the internals of how smap works definitely allows us to create some expression (in terms of f and lst) for the running time of (smap f lst). Now, if we look at https://clojure.org/guides/spec -- unless I am reading it incorrectly, it teaches the reader a mental model of "what value spec will return." What it does not seem to do, however, is cover the internals of how spec computes the answer. I.e. it talks about "what", not "how".
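
(A minimal sketch of the cost argument above. `running-time` is hypothetical -- a stand-in for a per-element cost function, not a real Clojure function; the point is only that a strict map's total cost is the sum of the per-element costs.)

```clojure
;; hypothetical: `running-time` gives the cost of applying f to one element;
;; the cost of a strict map is then (roughly) the sum over the list
(defn smap-cost [running-time lst]
  (reduce + (map running-time lst)))

;; e.g. if f costs 1 unit per element:
(smap-cost (constantly 1) [:a :b :c]) ;; => 3
```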

qqq04:11:47

So what does this have to do with my earlier question about predicates taking arbitrary memory / time? Because spec allows this flexibility, the answer to "what is the running time of this spec query" depends on the predicates we pass it -- and this likely means we need to figure out (1) how often our predicate will be called and (2) what data it will be called on -- and that requires quite detailed knowledge of spec internals.

seancorfield05:11:07

But this is true of any library that you pass both functions and data structures into and have it call those functions on those data structures. There's nothing novel about Spec in that regard.

qqq04:11:48

Let's consider this example of regex vs spec. Normal regexes can't contain arbitrary predicates, so matching goes something like: regex -> NFA -> DFA -> run the DFA on the input string. So we can say: the running time for a regex of size S on input of length N will be O(N * 2^S) [the DFA can be exponential in the size of the NFA]. This is easy to assert without knowing much about the internals of how the regex is matched. In the case of spec, on the other hand, because we can pass arbitrary predicates, it seems that to reason about spec's runtime / space we have to figure out how often / when / on what input the user-specified predicates are called -- which AFAIK can only be understood via spec internals, i.e. HOW spec recognizes, rather than WHAT spec recognizes. Does this sound reasonable? @seancorfield

seancorfield05:11:49

I already answered this in the thread that you either ignored or chose not to respond to.

qqq05:11:35

One practical question here is: do you ever do "untrusted user input => parse as edn => run spec on it"? Do you know whether there exists some malicious input of size N that can cause spec to take 2^N (or even N^6) time?

seancorfield05:11:58

Just parsing "untrusted user input" as EDN (or JSON) can be subject to malicious input attacks. That's why there are often recommendations to not do this -- but, practically speaking, people do this all the time. Adding Spec to the mix adds more processing and increases that risk.
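
(A small illustration of the EDN-parsing side of this. clojure.edn/read-string is the non-evaluating reader; the resource-exhaustion risk remains even so.)

```clojure
(require '[clojure.edn :as edn])

;; clojure.edn/read-string reads data only -- it never evaluates code,
;; unlike clojure.core/read-string with *read-eval* enabled.
(edn/read-string "{:a 1, :b [2 3]}") ;; => {:a 1, :b [2 3]}

;; Even so, pathological input (deeply nested or very large collections)
;; can consume memory before any spec ever runs, so size limits on the
;; raw input still matter.
```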

Eddie05:11:19

Some of this is touched on in a few of the docstrings. https://clojuredocs.org/clojure.spec.alpha/every and https://clojuredocs.org/clojure.spec.alpha/coll-of come to mind. Both are intended to declare a collection where all the elements match a spec/predicate, but s/every will sample the collection and s/coll-of will exhaustively check each element.
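
(A sketch of that s/every vs s/coll-of difference, assuming the default s/*coll-check-limit* of 101; the exact sampling behavior is per the docstrings linked above.)

```clojure
(require '[clojure.spec.alpha :as s])

(s/def ::sampled    (s/every int?))   ;; checks at most s/*coll-check-limit* elements
(s/def ::exhaustive (s/coll-of int?)) ;; checks every element, O(n)

;; a bad element hidden past the sampling limit:
(def xs (vec (concat (repeat 200 1) [:oops])))

(s/valid? ::sampled xs)    ;; likely true -- the bad element may never be looked at
(s/valid? ::exhaustive xs) ;; false -- every element is checked
```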

Alex Miller (Clojure team)05:11:39

I think most specs are pretty closely tied to the size of collections (s/coll-of will check all N elements of the coll, s/map-of will check all N entries of a map etc). if you're using regex specs, I would look to the regex derivatives papers for complexity bounds (https://matt.might.net/papers/might2011derivatives.pdf and related there but it's exponential in N). From a practical pov though, regexes are mostly used to describe syntax for things like macros, and in general most macro calls have small N (you don't invoke a macro with 100s of args or whatever).
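
(To make the "regex specs describe macro syntax" point concrete, a typical small-N usage -- a let-like binding form; ::binding / ::bindings are names invented for this example.)

```clojure
(require '[clojure.spec.alpha :as s])

;; regex spec for macro-style syntax: N (the number of forms) is small in practice
(s/def ::binding  (s/cat :name simple-symbol? :init any?))
(s/def ::bindings (s/* ::binding))

(s/conform ::bindings '[x 1 y (+ x 1)])
;; => [{:name x, :init 1} {:name y, :init (+ x 1)}]
```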

Alex Miller (Clojure team)05:11:36

"do you know whether there exists some malicious input of size N which can cause spec to take 2^N (or even N^6) time" - no, you don't know this, and it is possible.

qqq06:11:51

Does spec do memoization / caching / dynamic programming? If not, can we generate 2^N time on O(N)-size input as follows: (s/def :foo (s/or :a something_that_uses_foo :c something_that_uses_foo)), then run it on an s-expr of (1 (1 (1 (1 (1 (1 ... ))))))? This is similar to the CFG rule A => B | C, where B and C both use A. With plain backtracking, this can easily go exponential. In the case of strings, I believe packrat / PEG parsers get around this by memoization -- using O(N * S) memory, where N = length of input and S = # of states in the grammar. How does spec handle this? Clearly it has to backtrack at s/or; the question is whether it memoizes results or just re-evaluates everything.
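
(One way the A => B | C shape above might be written down as a recursive spec -- ::foo and the `nested` helper are hypothetical, invented for this sketch. Whether validation actually blows up depends on how much work each s/or branch repeats on failure, which would have to be confirmed against the spec.alpha source.)

```clojure
(require '[clojure.spec.alpha :as s])

;; both s/or branches re-descend into ::foo, so a failing deeply nested
;; input could force backtracking at every level
(s/def ::foo
  (s/or :a (s/cat :x int? :rest (s/? (s/spec ::foo)))
        :b (s/cat :x int? :rest (s/? (s/spec ::foo)))))

;; builds (1 (1 (1 ...))) nested to depth n
(defn nested [n]
  (nth (iterate #(list 1 %) (list 1)) n))
```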

qqq06:11:57

@seancorfield: I have a custom "untrusted string" -> s-exp parser that takes O(N) space, O(N) time, and O(1) stack space, so the "untrusted user input => edn" step is not a problem for me. The "edn => spec" step, however, currently scares me (among other things, because of the example just raised above).

qqq07:11:53

@ikitommi: Is this how the attack works? You pull in some dependency Foo that defines a slow spec ::foo-slow-spec. You are unaware of these slow specs (and don't care, because you're not explicitly using them). An attacker constructs an HTTP request that you run spec on, adding a field ::foo-slow-spec, which triggers the spec from the Foo dependency. And the defense against this attack is to ensure that the incoming data only has keys explicitly mentioned in the spec?

ikitommi08:11:40

yes. you could also have a slow spec in your own codebase, triggered by normal input, that is slow.
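
(A sketch of the attack and defense being described. ::user / ::expensive are names invented for this example; the key behavior it relies on -- s/keys validating every qualified key that has a registered spec, whether or not it appears in :req/:opt -- is documented in the s/keys docstring.)

```clojure
(require '[clojure.spec.alpha :as s])

(s/def ::name string?)
(s/def ::user (s/keys :req [::name]))

;; stand-in for a slow spec pulled in by a dependency
(s/def ::expensive (fn [x] (Thread/sleep 1) true))

;; s/keys validates *every* qualified key with a registered spec,
;; so an attacker-supplied ::expensive key triggers that spec too:
(s/valid? ::user {::name "a" ::expensive :attacker-data})

;; one defense: whitelist the keys you actually care about first
(s/valid? ::user (select-keys {::name "a" ::expensive :attacker-data}
                              [::name]))
;; => true, and ::expensive never runs
```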