Fork me on GitHub
#xtdb
<
2020-12-06
>
nivekuil02:12:27

why does iterator-seq on an open-q result take so long? it doesn't look like any of it is becoming realized, but here's what tufte says:

pId                  nCalls        Min      50% ≤      90% ≤      95% ≤      99% ≤        Max       Mean   MAD      Clock  Total

:iterator-seq             1   154.49ms   154.49ms   154.49ms   154.49ms   154.49ms   154.49ms   154.49ms   ±0%   154.49ms    85%
:open-q                   1   606.99μs   606.99μs   606.99μs   606.99μs   606.99μs   606.99μs   606.99μs   ±0%   606.99μs     0%
ah, it's :order-by. no way that can be done lazily?

refset10:12:42

Yep that's right. Unfortunately, order-by is necessarily eager. You can use simple "range constraint" queries that will naturally return results in the order found in the index though, e.g. https://github.com/juxt/crux/blob/master/crux-test/test/crux/query_test.clj#L1813

nivekuil10:12:00

ah, that's actually quite unfortunate. looks like I'll have to DIY a sorted index to make use of this fancy manifold-powered lazy sort I wrote

nivekuil10:12:32

hm, so I have one predicate that looks like this: [(time-after? ?time ￱~￱oldest-time)] is this less efficient than a range constraint could be?

nivekuil10:12:54

not sure what the difference is between range constraint and predicate per that comment -- what does it mean to jump to an expected place in the index when the natural order doesn't seem to be sorted in any particular way?

refset23:12:20

what type of triple values are you using behind that ?time lvar? On the release you're using Crux will only sort Date values natively in the index, but we have just added index support for a bunch of new Java time types (as well as other Java types) to master: https://github.com/juxt/crux/pull/1281

nivekuil23:12:57

?time there is an #inst made from juxt/tick, which I think would be the freshly added Instant? So, crux does sort values natively in the index? What about plain integers (whatever class underlies them in clojure)? That's what I want to order by, in another case -- can those be done effectively lazily?

refset09:12:01

yeah so Instants are only just supported, so you could test your approach on master or switch to Dates briefly (with your current version) to check your query returns tuples in the order you want > crux does sort values natively in the index? When Crux knows how to do the encoding, yes, otherwise it defaults to Nippy encoding (which doesn't provide a natural sort order). These are the supported encodings on master : https://github.com/juxt/crux/blob/master/crux-core/src/crux/codec.clj#L78-L93

refset09:12:26

We can store Integers as sorted Longs, so they shouldn't be a problem, and can be returned lazily in a sorted order 🙂

nivekuil20:12:09

aha, it's using :in [[foo ...]] that breaks the sort order. otherwise open-q is indeed returned sorted by the attribute targeted by the range query

refset21:12:54

Interesting! I don't understand the context, but are you thinking that it seems odd to be happening like that? If so then a small example is the next step. Btw, fixed that txfn stall error bug - PR is awaiting merge 🙂

👍 3
nivekuil21:12:25

I think it's a little weird.. at least I can't think of a way around it, but I get why crux might behave like this. Toy example, first is what I want:

(put {:crux.db/id 1 :rank 5 :foo "e"})   (put {:crux.db/id 2 :rank 4 :foo "c"})   (put {:crux.db/id 3 :rank 3 :foo "b"})   (put {:crux.db/id 4 :rank 2 :foo "a"})   (put {:crux.db/id 5 :rank 1 :foo "f"})  ;; ([1 5] [2 4] [3 3] [4 2] [5 1]) (iterator-seq (open-q `{:find  [?r ?e]                           :where [[(>= ?r 0)]                                   [?e :rank ?r]]}))  ;; ([2 4] [3 3] [4 2] [5 1] [1 5])   (iterator-seq (open-q `{:find  [?r ?e]                           :where [[(>= ?r 0)]                                   [?e :rank ?r]                                   [?e :foo foos]]                           :in    [[foos ...]]}                         ["a" "b" "c" "e" "f"]))

nivekuil21:12:31

I think in SQL it'd be a composite index but I have no idea how that works with crux triples

refset21:12:47

ah okay, so what is happening is that the :in relation is taking precedence in the join order, which is a general built-in heuristic (and may be a poor default choice, I'm not sure...) - to send it back in the queue behind the range constraint you can introduce a little indirection like so, to get the result you want:

;; ([1 5] [2 4] [3 3] [4 2] [5 1])
(iterator-seq (open-q `{:find  [?r ?e]
                        :where [[(>= ?r 0)]
                                [?e :rank ?r]
                                [?e :foo foosx]
                                [(= foos foosx)]]
                        :in    [[foos ...]]}
                               ["a" "b" "c" "e" "f"]))

nivekuil21:12:28

hm, I get ([2 4] [3 3] [4 2] [5 1] [1 5]) from your code

😅 3
nivekuil21:12:15

2020-12-08T21:56:59.295Z machina DEBUG [crux.query:326] - :query {:find [app.db/?r app.db/?e], :where [[(clojure.core/>= app.db/?r 0)] [app.db/?e :rank app.db/?r] [app.db/?e :foo app.db/foosx] [(clojure.core/= app.db/foos app.db/foosx)]], :in {:bindings [[:collection [app.db/foos ...]]]}} 2020-12-08T21:56:59.297Z machina DEBUG [crux.query:326] - :join-order :aev app.db/?e app.db/?r {:e app.db/?e, :a :rank, :v app.db/?r} 2020-12-08T21:56:59.297Z machina DEBUG [crux.query:326] - :join-order :ave app.db/foosx app.db/?e {:e app.db/?e, :a :foo, :v app.db/foosx} 2020-12-08T21:56:59.300Z machina DEBUG [crux.query:326] - :where [[:pred {:pred {:pred-fn #function[clojure.core/>=], :args [app.db/?r 0]}}] [:triple {:e app.db/?e, :a :rank, :v app.db/?r}] [:triple {:e app.db/?e, :a :foo, :v app.db/foosx}] [:pred {:pred {:pred-fn #function[clojure.core/=], :args [app.db/foos app.db/foosx]}}]] 2020-12-08T21:56:59.300Z machina DEBUG [crux.query:326] - :vars-in-join-order [app.db/foosx app.db/?e app.db/foos app.db/?r] 2020-12-08T21:56:59.300Z machina DEBUG [crux.query:326] - :attr-stats {:crux.db/id 19297, :rank 20, :foo 14} 2020-12-08T21:56:59.301Z machina DEBUG [crux.query:326] - :var->bindings {app.db/foos #crux.query.VarBinding{:e-var nil, :var app.db/foos, :attr nil, :result-index 2, :result-name crux.query.value/foos, :type :in-var, :value? true}, app.db/?e #crux.query.VarBinding{:e-var app.db/?e, :var app.db/?e, :attr :crux.db/id, :result-index 1, :result-name app.db/?e, :type :entity, :value? false}, app.db/?r #crux.query.VarBinding{:e-var app.db/?e, :var app.db/?r, :attr :rank, :result-index 3, :result-name app.db/?e, :type :entity, :value? false}, app.db/foosx #crux.query.VarBinding{:e-var app.db/?e, :var app.db/foosx, :attr :foo, :result-index 0, :result-name app.db/?e, :type :entity, :value? false}}

refset21:12:43

weird, are you using the RC version still?

nivekuil21:12:57

yup. maybe my computer's been on too long a bit flipped

refset21:12:17

what happens if you try using clojure.core/= instead:

(iterator-seq (open-q `{:find  [?r ?e]
                        :where [[(>= ?r 0)]
                                [?e :rank ?r]
                                [?e :foo foosx]
                                [(clojure.core/= foos foosx)]]
                        :in    [[foos ...]]}
                               ["a" "b" "c" "e" "f"]))

nivekuil21:12:40

same thing, reliably ([2 4] [3 3] [4 2] [5 1] [1 5])

nivekuil21:12:55

I think syntax quote does that already anyway

refset22:12:19

Actually (= x y) is doing "unification" at the level of Datalog logical variables, which is pretty different to using regular Clojure = as a predicate, I was hoping the Clojure = would force the desired join order in your case but I'm not sure why it didn't. In any case, I think I figured out why you're seeing a different result:

refset22:12:07

The text search PR at the bottom is when we cut that RC version, and since then we've fiddled with join order heuristics some more, e.g. that "Break var dependency cycles" PR two above

nivekuil22:12:52

ah, sweet :) more to look forward to!Wait, so does that mean syntax quote has different semantics than not syntax quoting? The namespace qualification happens at macro expansion time, before crux can do anything about it right?

refset22:12:53

So you could clone Crux and run lein sub install to test with the very latest version of master if you're really keen, or hang on until this next release 🙂

refset22:12:28

> so does that mean syntax quote has different semantics than not syntax quoting? I think so, but to be honest I'm a bit fuzzy on syntax quoting still. I'd need to dig into it tomorrow. There was a thread on the Zulip about it not so long ago that might be useful to you https://juxt-oss.zulipchat.com/#narrow/stream/194466-crux/topic/Simplifying.20documentation.3F.20Syntax.20Quote

refset22:12:56

> The namespace qualification happens at macro expansion time, before crux can do anything about it right? Actually I think it has to stay quoted all the way through, until Crux tries to run it, at which point the query engine uses requiring-resolve. All of clojure.core is available for use as predicates, without a namespace, but everything else you must fully qualify in quoted form. See https://github.com/juxt/crux/blob/master/crux-core/src/crux/query.clj#L63-L65

nivekuil22:12:43

yeah, but it'll be the quoted qualified symbol that gets passed to crux/q, so if you carelessly syntax quote everywhere you'll actually accidentally never do unification, and always clojure.core/=?

nivekuil22:12:52

should probably be mentioned somewhere if true :)

✔️ 3
refset22:12:54

oh I see, hmm, yeah maybe :thinking_face: I guess you can confirm pretty quick with the debug logs turned on

nivekuil22:12:31

2020-12-08T22:17:46.740Z machina DEBUG [crux.query:326] - :where [[:pred {:pred {:pred-fn #function[clojure.core/>=], :args [app.db/?r 0]}}] [:triple {:e app.db/?e, :a :rank, :v app.db/?r}] [:triple {:e app.db/?e, :a :foo, :v app.db/foosx}] [:pred {:pred {:pred-fn #function[clojure.core/=], :args [app.db/foos app.db/foosx]}}]]
vs
2020-12-08T22:18:03.103Z machina DEBUG [crux.query:326] - :where [[:range [[:sym-val {:op >=, :sym ?r, :val 0}]]] [:triple {:e ?e, :a :rank, :v ?r}] [:triple {:e ?e, :a :foo, :v foosx}] [:range [[:sym-sym {:op =, :sym-a foos, :sym-b foosx}]]]]

nivekuil22:12:54

the effect of s/￱`￱/'

refset22:12:10

wow, okay, yes we need to document this! thanks for checking and thinking of it in the first place 🙏

nivekuil22:12:29

my pleasure :) In general unification is "better" to do I assume?

refset22:12:43

definitely yeah, because it's like realising that two virtual indexes are the same virtual index, otherwise it will scan each item in the two virtual indexes and run (e.g.) clojure.core/= on each of them (...or something approximately as inefficient sounding).

nivekuil22:12:34

sounds good. I distinctly remember skipping all my prolog lectures so unfortunately all the resources I'm turning up are going over my head now

😄 3
refset22:12:39

Looking at your previous query debug log above I see now it wasn't even using the >= range predicate (using clojure.core/>= instead), so I'm surprised it ever returned sorted results

refset22:12:34

Oh gosh, sorry about this, I'm telling lies, the predicate constraint that does unification is == not = ...so = is simply doing clojure.core/= anyway

refset22:12:26

The point still stands for > >= < <= though

nivekuil22:12:20

ah, right that's a >= sign. The point would also stand for == as well I take it

nivekuil22:12:59

but I guess you'd never use it in a query directly, probably binding it with a tuple instead

refset22:12:52

Huh, TIL clojure.core/== is a thing! And just to backtrack on what I just said a bit, = actually will be treated as a range predicate in the same way as > >= < <=, I spoke too soon. You can see it here: https://github.com/juxt/crux/blob/master/crux-core/src/crux/query.clj#L861

refset22:12:59

> but I guess you'd never use it in a query directly, probably binding it with a tuple instead Not certain what you mean by binding with a tuple, but it is definitely more efficient to unify than anything else. There are some examples of using [(== ... in here: https://github.com/juxt/crux/blob/master/crux-test/test/crux/query_test.clj#L999-L1067 (and also plenty of using [(= ...)

nivekuil22:12:39

e.g. you could write (p1 :crux.db/id p2) instead of (== p1 p2) -- but actually I like == better

nivekuil22:12:21

ah, I don't think that works for all of them though

nivekuil22:12:25

ctrl+f == on https://opencrux.com/reference/queries.html has 0 hits, so that's the source of my ignorance :)

refset23:12:47

Yes, very fair criticism, the query page needs a lot of work to catch up with everything the engine can do these days! We have a plan to get things in shape soon, particularly before we embark on a big phase of marketing in Q1

nivekuil13:12:01

ok, sorry to bump this thread (probably should have used zulip) but I played around a bit more with the join order, armed with new knowledge of unification. I think the difference in sort order is actually caused by the difference between == and =

nivekuil13:12:10

;; not sorted by rank
(iterator-seq
   (db/open-q
    '{:find  [?rank ?fresh]
      :where [[?e :entry/rank ?rank]
              [(>= ?rank 0)]
              [?e :entry/fresh? ?fresh]
              [(== ?fresh true)]]}))

:query {:find [?rank ?fresh], :where [[?e :entry/rank ?rank] [(>= ?rank 0)] [?e :entry/fresh? ?fresh] [(== ?fresh true)]], :in nil}
:join-order :aev ?e ?rank {:e ?e, :a :entry/rank, :v ?rank}
:join-order :ave ?fresh ?e {:e ?e, :a :entry/fresh?, :v ?fresh}
:where [[:triple {:e ?e, :a :entry/rank, :v ?rank}] [:range [[:sym-val {:op >=, :sym ?rank, :val 0}]]] [:triple {:e ?e, :a :entry/fresh?, :v ?fresh}] [:pred {:pred {:pred-fn ==, :args [?fresh true]}}]]
:vars-in-join-order [?fresh ?e ?rank]
:attr-stats {:crux.db/id 18576, :entry/rank 13178, :entry/fresh? 13178}
:var->bindings {?e #crux.query.VarBinding{:e-var ?e, :var ?e, :attr :crux.db/id, :result-index 1, :result-name ?e, :type :entity, :value? false}, ?rank #crux.query.VarBinding{:e-var ?e, :var ?rank, :attr :entry/rank, :result-index 2, :result-name ?e, :type :entity, :value? false}, ?fresh #crux.query.VarBinding{:e-var ?e, :var ?fresh, :attr :entry/fresh?, :result-index 0, :result-name ?e, :type :entity, :value? false}}

;; sorted by rank
(iterator-seq
   (db/open-q
    '{:find  [?rank ?fresh]
      :where [[?e :entry/rank ?rank]
              [(>= ?rank 0)]
              [?e :entry/fresh? ?fresh]
              [(= ?fresh true)]]}))

:query {:find [?rank ?fresh], :where [[?e :entry/rank ?rank] [(>= ?rank 0)] [?e :entry/fresh? ?fresh] [(= ?fresh true)]], :in nil}
:join-order :ave ?rank ?e {:e ?e, :a :entry/rank, :v ?rank}
:join-order :aev ?e ?fresh {:e ?e, :a :entry/fresh?, :v ?fresh}
:where [[:triple {:e ?e, :a :entry/rank, :v ?rank}] [:range [[:sym-val {:op >=, :sym ?rank, :val 0}]]] [:triple {:e ?e, :a :entry/fresh?, :v ?fresh}] [:range [[:sym-val {:op =, :sym ?fresh, :val true}]]]]
:vars-in-join-order [?rank ?e ?fresh]
:attr-stats {:crux.db/id 18576, :entry/rank 13178, :entry/fresh? 13178}
:var->bindings {?e #crux.query.VarBinding{:e-var ?e, :var ?e, :attr :crux.db/id, :result-index 1, :result-name ?e, :type :entity, :value? false}, ?rank #crux.query.VarBinding{:e-var ?e, :var ?rank, :attr :entry/rank, :result-index 0, :result-name ?e, :type :entity, :value? false}, ?fresh #crux.query.VarBinding{:e-var ?e, :var ?fresh, :attr :entry/fresh?, :result-index 2, :result-name ?e, :type :entity, :value? false}}

nivekuil13:12:58

this is why you ended up getting the sorted result in https://clojurians.slack.com/archives/CG3AM2F7V/p1607464427387900?thread_ts=1607220987.378000&amp;cid=CG3AM2F7V : using = instead of unification made it a [:range instead of a [:pred

nivekuil13:12:38

now, the drawback is that = really is much slower than unification, so I seem to have actually lost the benefits of lazy sorting in the process :p

refset15:12:08

> sorry to bump this thread It's no problem, though yeah Zulip is preferable if only for proper Clojure formatting in snippets 🙂 The idea that == changes the join order for your case makes sense, because symbolically it must have the same logical meaning as:

(iterator-seq
   (db/open-q
    '{:find  [?rank ?fresh]
      :where [[?e :entry/rank ?rank]
              [(>= ?rank 0)]
              [?e :entry/fresh? true]]})) ;; <- `true` can be symbolically substituted in here
...and Crux will almost always join first based on triple clauses where v or e is already bound (in this case true is bound), the exceptions to this being when index statistics suggest that e.g. the cardinality of values under :entry/fresh? is higher, and then the query planner might choose a different path Thinking about this again with a fresher brain, if you want to guarantee both properties (laziness and ordering) across two attributes, then in Crux as things stand today, you have to combine them before submission in a flattened materialization of some kind, e.g. [?e :entry.fresh/rank ?rank] or potentially some sortable byte-array value encoding to achieve something conceptually a bit like this [?e :entry/fresh-rank [true ?rank]] where the user-space tuple values sort naturally (note Crux won't handle vectors like this, it will treat them as independent triples, this is just an illustration). DataScript actually handles this latter option really seamlessly, as long as the value implements ICompare (which vectors do). In Datomic there's an explicit "composite attributes" feature which is more or less the same idea again. That said, neither DataScript or Datomic offer lazy queries so you'd need to roll your own cursor-based paging/batch approach.

nivekuil16:12:09

hm, I like your fresh brain -- actually combining the two might be a better data model to begin with. Thanks!

🙂 3
nivekuil16:12:36

yup, that works great. also a testament to crux (and really clojure, integrant et al) that I was able to refactor my data model and verify it in ￱~￱5 minutes

refset16:12:20

awesome! so you went for the [?e :entry.fresh/rank ?rank] approach?

nivekuil16:12:11

not quite; I actually decided to overload :entry/rank to also imply "not fresh" if it's negative

nivekuil16:12:45

should help the query perform better in general, I think

💯 3
refset16:12:46

perfect ☺️