This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2024-02-17
A query performance experience report on a small/medium-sized database. Some info about the dataset:
;; total accounts 1 407 494
;; total transactions 837 197
;; db size 13G (lots of large edn blobs)
;; :address :db/unique
;; :account :db.type/ref :db.cardinality/many
;; :signature :db/unique
;; :block-time :db.type/long
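For context, the schema comments above might be declared like this in datascript/Datalevin map form (a sketch reconstructed from the comments; the exact uniqueness flavors, e.g. `:db.unique/identity`, are assumptions):

```clojure
;; Sketch only: attribute properties reconstructed from the report's
;; comments; uniqueness flavors are assumed, not stated in the original.
(def schema
  {:account/address        {:db/unique :db.unique/identity}
   :transaction/account    {:db/valueType   :db.type/ref
                            :db/cardinality :db.cardinality/many}
   :transaction/signature  {:db/unique :db.unique/identity}
   :transaction/block-time {:db/valueType :db.type/long}})
```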
The query is for finding the first transaction of an account.
(time
(d/q '[:find (min ?block-time) ?signature
:where
[?a :account/address
"9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]
[?t :transaction/account ?a]
[?t :transaction/block-time ?block-time]
[?t :transaction/signature ?signature]]
@d/db))
;; "Elapsed time: 1696.461059 msecs"
We can make this faster by using pull syntax and not joining
on signature.
(time
(d/q '[:find [(min ?block-time) (pull ?t [:transaction/signature])]
:where
[?a :account/address
"9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]
[?t :transaction/account ?a]
[?t :transaction/block-time ?block-time]]
@d/db))
;; "Elapsed time: 491.879502 msecs"
We can make this even faster by using the account ref directly, instead of binding it with
[?a :account/address "9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]
(time
(d/q '[:find [(min ?block-time) (pull ?t [:transaction/signature])]
:where
[?t :transaction/account
[:account/address
"9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]]
[?t :transaction/block-time ?block-time]]
@d/db))
;; "Elapsed time: 34.356091 msecs"
Passing ?address in is slow
(time
(d/q '[:find
[(min ?block-time) (pull ?t [:transaction/signature])]
:in $ ?address
:where
[?a :account/address ?address]
[?t :transaction/account ?a]
[?t :transaction/block-time ?block-time]]
@d/db
"9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"))
;; "Elapsed time: 6817.759179 msecs"
Passing the lookup ref in as ?account-ref is faster, but still not as fast as
inlining the ref.
(time
(d/q '[:find
[(min ?block-time) (pull ?t [:transaction/signature])]
:in $ ?account-ref
:where
[?t :transaction/account ?account-ref]
[?t :transaction/block-time ?block-time]]
@d/db
[:account/address
"9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]))
;; "Elapsed time: 482.963981 msecs"
Some of this was a little surprising. I guess doing fewer joins is faster.
Using a ref is really fast (fewer joins?), i.e.:
[?t :transaction/account
[:account/address "9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]]
instead of:
[?a :account/address "9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]
[?t :transaction/account ?a]
But I wasn’t expecting passing in a variable to be so much slower than an inline value. What’s the reason for that?
UPDATE: should I be writing queries like this?
(time
(let [address "9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]
(d/q `[:find
[(min ?block-time) (pull ?t [:transaction/signature])]
:where
[?t :transaction/account
[:account/address ~address]]
[?t :transaction/block-time ?block-time]]
@d/db)))
;; "Elapsed time: 42.78259 msecs"
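One way to package that pattern as a function (a sketch; `first-transaction` is a hypothetical name). Note that syntax-quote namespace-qualifies plain symbols, so `~'min` and `~'pull` are used here to keep those call symbols unqualified; the `?vars` appear to be parsed by name, as the working timing above suggests:

```clojure
(defn first-transaction
  "Hypothetical helper: splices the address into the query as an
   inline lookup ref via syntax-quote/unquote."
  [db address]
  (d/q `[:find [(~'min ?block-time) (~'pull ?t [:transaction/signature])]
         :where
         [?t :transaction/account [:account/address ~address]]
         [?t :transaction/block-time ?block-time]]
       db))
```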
Seems to give a 10x speedup vs passing the variable in the normal way.
The current datascript query processing is very simple. It does things the following way:
1. turn each binding in :in into a relation
2. join clauses one by one, by first making a Cartesian product with the existing relations
That's it.
That's why.
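A minimal sketch of that strategy (hypothetical code, not datascript's actual implementation): each relation is a seq of variable→value maps, and two relations are combined by forming the full Cartesian product and keeping rows that agree on shared variables:

```clojure
(require '[clojure.set :as set])

(defn naive-join
  "Cartesian product of two relations (seqs of {var value} maps),
   keeping only the rows that agree on shared variables."
  [rel1 rel2]
  (let [ks (set/intersection (set (keys (first rel1)))
                             (set (keys (first rel2))))]
    (for [r1 rel1, r2 rel2
          :when (= (select-keys r1 ks) (select-keys r2 ks))]
      (merge r1 r2))))

;; An inline value never enters this loop: it constrains the index
;; lookup directly. A bound :in variable first becomes a relation
;; that every subsequent clause is cross-joined against.
```

For example, `(naive-join [{'?a 1}] [{'?a 1, '?t 10} {'?a 2, '?t 20}])` keeps only the row where `?a` agrees, but only after enumerating all pairs.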
Ahh that makes complete sense. No wonder it’s slower.
he doesn't like inelegant code, I guess, but for performance you can't have all-elegant code
So that also explains why pull is so much faster
(time
(d/q
'[:find [?address ...]
:where
[?a :account/owner [:account/address
"meRjbQXFNf5En86FXT2YPz1dQzLj4Yb3xK8u1MVgqpb"]]
[?a :account/address ?address]]
@d/db))
;; "Elapsed time: 2225.817526 msecs"
(time
(d/q
'[:find (pull ?a [:account/address])
:where
[?a :account/owner [:account/address
"meRjbQXFNf5En86FXT2YPz1dQzLj4Yb3xK8u1MVgqpb"]]]
@d/db))
;; "Elapsed time: 266.901356 msecs"
I guess, if you are using datascript on the client in mem you are not dealing with large datasets
so it matters less
Yeah, I expected pull to be slower.
Does that also mean passing in a collection [:id1 :id2 :id3] (the SQL equivalent of IN) won't lead to a crazy Cartesian product? I know Datomic does an optimisation for this; I think it might get rewritten as an (or ...).
ahh, I must have missed that update. I’ve been avoiding collection bindings.
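For reference, a collection binding looks like this (a sketch; the `:in $ [?address ...]` form is standard datascript syntax, and the addresses are placeholders):

```clojure
;; Collection binding: the datalog analogue of SQL's IN.
;; "addr-1" etc. are placeholder addresses, not real data.
(d/q '[:find ?signature
       :in $ [?address ...]
       :where
       [?a :account/address ?address]
       [?t :transaction/account ?a]
       [?t :transaction/signature ?signature]]
     @d/db
     ["addr-1" "addr-2" "addr-3"])
```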
For the initial rewrite, the planner handles the "normal" cases; additional features can be moved into it gradually.
the planner is a classic Selinger query planner, the same as those in Postgres etc. The advantage of a triple store is better cardinality estimation, so we will likely generate better plans.
I am relying on LMDB for raw table-scan performance. The hope is to perform on par with SQLite, Postgres, etc. on complex queries. When that is done, I will call it a success.
the reason pull will be slower than query in the new query engine is this: pull returns datoms, so there's a cost of converting values to datoms, whereas in the new engine we don't do this conversion; values are poured directly into relations.
we rewrite as many operations as possible into range-scan routines. Four range-scan routines cover most of the work; the remainder is done afterwards with hash joins.
The order of these range scans is planned using the classic query planner with dynamic programming. My hope of beating RDBMSs rests on more accurate cardinality estimation: in fact, we will count as much as possible, so the estimates are more accurate. Range counts in a triple store are cheap.
With the new storage format, i.e. dupsort, obtaining most counts is O(1); e.g. the number of datoms matching [?e :attribute ?v] is a single read of an LMDB cursor's count field.
Thank you for the explanation, this is brilliant. Honestly, I’m amazed how performant datalevin already is even with the simple datascript query engine.