#datalevin
2024-02-17
andersmurphy16:02:57

A query performance experience report on a small/medium-sized database. Some info about the dataset:

;; total accounts     1 407 494
;; total transactions   837 197
;; db size 13G (lots of large edn blobs)
;; :address    :db/unique
;; :account    :db.type/ref :db.cardinality/many
;; :signature  :db/unique
;; :block-time :db.type/long
The query is for finding the first transaction of an account.
(time
  (d/q '[:find (min ?block-time) ?signature
         :where
         [?a :account/address
          "9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]
         [?t :transaction/account ?a]
         [?t :transaction/block-time ?block-time]
         [?t :transaction/signature ?signature]]
    @d/db))

;; "Elapsed time: 1696.461059 msecs"
We can make this faster by using pull syntax and not joining on signature.
(time
  (d/q '[:find [(min ?block-time) (pull ?t [:transaction/signature])]
         :where
         [?a :account/address
          "9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]
         [?t :transaction/account ?a]
         [?t :transaction/block-time ?block-time]]
    @d/db))

;; "Elapsed time: 491.879502 msecs"
We can make this even faster by using the account lookup ref [:account/address "9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"] directly in the :transaction/account clause, instead of joining through ?a.
(time
  (d/q '[:find [(min ?block-time) (pull ?t [:transaction/signature])]
         :where
         [?t :transaction/account
          [:account/address
           "9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]]
         [?t :transaction/block-time ?block-time]]
    @d/db))

;; "Elapsed time: 34.356091 msecs"
Passing ?address in is slow:
(time
  (d/q '[:find
         [(min ?block-time) (pull ?t [:transaction/signature])]
         :in $ ?address
         :where
         [?a :account/address ?address]
         [?t :transaction/account ?a]
         [?t :transaction/block-time ?block-time]]
    @d/db
    "9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"))

;; "Elapsed time: 6817.759179 msecs"
Passing the lookup ref in as ?account-ref is faster, but still not as fast as the inlined lookup ref:
(time
  (d/q '[:find
         [(min ?block-time) (pull ?t [:transaction/signature])]
         :in $ ?account-ref
         :where
         [?t :transaction/account ?account-ref]
         [?t :transaction/block-time ?block-time]]
    @d/db
    [:account/address
     "9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]))

;; "Elapsed time: 482.963981 msecs"
Some of this was a little surprising. I guess doing fewer joins is faster. Using a lookup ref is really fast (fewer joins?), i.e.:
[?t :transaction/account
 [:account/address "9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]]
instead of:
[?a :account/address "9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]
[?t :transaction/account ?a]
But I wasn’t expecting passing in a variable to be so much slower than an inline value. What’s the reason for that?

UPDATE: should I be writing queries like this?
(time
  (let [address "9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]
    (d/q `[:find
           [(min ?block-time) (pull ?t [:transaction/signature])]
           :where
           [?t :transaction/account
            [:account/address ~address]]
           [?t :transaction/block-time ?block-time]]
      @d/db)))

;; "Elapsed time: 42.78259 msecs"
Seems to give a 10x speedup vs. passing the variable in the normal way.
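One caveat: syntax-quote namespace-qualifies the symbols in the query (?t becomes user/?t and so on), which the query parser may or may not tolerate. A quoting-free alternative sketch is to splice the runtime value by building the query as plain data:

(time
  (let [address "9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]
    (d/q (into '[:find [(min ?block-time) (pull ?t [:transaction/signature])]
                 :where]
               ;; build the clauses as data so `address` is evaluated
               [['?t :transaction/account [:account/address address]]
                '[?t :transaction/block-time ?block-time]])
         @d/db)))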

Huahai17:02:52

The current datascript query processing is very simple. It does things the following way:
1. Turn each binding in :in into a relation.
2. Join the clauses one by one, first making a Cartesian product with the existing relations.
That's it. That's why.
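A runnable toy (not Datalevin's actual code) of that strategy: the :in binding becomes a one-row relation, the first clause becomes a relation over every :account/address datom (the bound value is never pushed down into an index lookup), and the two are then joined:

;; Toy datoms as [e a v] triples.
(def datoms
  [[1  :account/address "addr-1"]
   [2  :account/address "addr-2"]
   [10 :transaction/account 1]])

(defn join
  "Cartesian product of two relations, keeping rows that agree
   on their shared variables."
  [rel1 rel2]
  (for [r1 rel1, r2 rel2
        :when (= (select-keys r1 (keys r2))
                 (select-keys r2 (keys r1)))]
    (merge r1 r2)))

;; 1. the :in binding becomes a one-row relation ...
(def in-rel [{:?address "addr-1"}])

;; 2. ... and [?a :account/address ?address] is resolved into a
;; relation over ALL :account/address datoms before the join, so
;; the bound value never narrows the scan.
(def addr-rel
  (for [[e a v] datoms :when (= a :account/address)]
    {:?a e, :?address v}))

(join in-rel addr-rel)
;; => ({:?address "addr-1", :?a 1})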

andersmurphy17:02:35

Ahh that makes complete sense. No wonder it’s slower.

Huahai17:02:41

the join method is hash-join, and tonsky refused to merge my optimization for it too

Huahai17:02:35

he doesn't like inelegant code, I guess, but for performance you cannot have all elegant code

andersmurphy17:02:51

So that also explains why pull is so much faster

(time
  (d/q '[:find [?address ...]
         :where
         [?a :account/owner [:account/address
                             "meRjbQXFNf5En86FXT2YPz1dQzLj4Yb3xK8u1MVgqpb"]]
         [?a :account/address ?address]]
    @d/db))

;; "Elapsed time: 2225.817526 msecs"

(time
  (d/q '[:find (pull ?a [:account/address])
         :where
         [?a :account/owner [:account/address
                             "meRjbQXFNf5En86FXT2YPz1dQzLj4Yb3xK8u1MVgqpb"]]]
    @d/db))

;; "Elapsed time: 266.901356 msecs"

andersmurphy17:02:18

I guess if you are using datascript in memory on the client, you are not dealing with large datasets

Huahai17:02:22

In the new query engine, pull will likely be slower than query

andersmurphy17:02:23

so it matters less

andersmurphy17:02:45

Yeah, I expected pull to be slower.

Huahai17:02:28

basically, the new query engine doesn't do joins, it just does table scans.

andersmurphy17:02:22

Does that also mean passing in a collection [:id1 :id2 :id3] (the SQL equivalent of IN) won’t lead to a crazy Cartesian product? I know Datomic does an optimisation for this; I think it might get rewritten as an (or ...).
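For reference, the collection binding in question, and roughly the (or ...) rewrite being described (the ids are hypothetical placeholders):

;; collection binding, the SQL IN equivalent:
(d/q '[:find ?t
       :in $ [?account ...]
       :where [?t :transaction/account ?account]]
     @d/db
     [:id1 :id2 :id3])

;; roughly the rewrite Datomic is said to perform:
'[:find ?t
  :where (or [?t :transaction/account :id1]
             [?t :transaction/account :id2]
             [?t :transaction/account :id3])]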

Huahai17:02:47

right now, I am not handling collection bindings

andersmurphy17:02:36

ahh, I must have missed that update. I’ve been avoiding collection bindings.

Huahai17:02:40

in the future, yes, we will gradually move these features over

🙏 1
Huahai17:02:49

For the initial rewrite, the planner handles the "normal" cases; additional features can be moved over to it gradually.

👍 1
Huahai17:02:51

I need to get the basis right first.

Huahai17:02:38

the planner is a classic Selinger query planner, the same as those in Postgres etc. The advantage of a triple store is better cardinality estimation, so we will likely generate better plans.
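A compact sketch of that Selinger-style dynamic programming (illustrative only, not Datalevin's planner; assumes org.clojure/math.combinatorics is on the classpath):

(require '[clojure.math.combinatorics :as combo])

(defn best-plan
  "Cheapest left-deep join order over `clauses`: DP over subsets,
   keeping only the best plan per subset. `card` maps a set of
   clauses to an estimated row count; cost here is the sum of
   intermediate result sizes."
  [clauses card]
  (let [base (into {} (for [c clauses]
                        [#{c} {:order [c] :cost (card #{c})}]))
        plans (reduce
                (fn [plans n]
                  (reduce (fn [plans subset]
                            (assoc plans subset
                                   ;; best way to reach this subset:
                                   ;; extend the best plan for each
                                   ;; (subset minus one clause)
                                   (apply min-key :cost
                                          (for [c subset
                                                :let [sub (plans (disj subset c))]]
                                            {:order (conj (:order sub) c)
                                             :cost  (+ (:cost sub) (card subset))}))))
                          plans
                          (map set (combo/combinations (seq clauses) n))))
                base
                (range 2 (inc (count clauses))))]
    (plans (set clauses))))

;; Toy cardinalities: the unique-address clause is tiny, so a good
;; plan joins it first.
(best-plan #{:address :account :block-time}
           #(cond (% :address) 1, (% :account) 100, :else 1000))
;; => cost 3, with :address joined first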

Huahai17:02:25

I am relying on LMDB for raw table scan performance. The hope is to be on par with SQLite, Postgres etc. on complex queries. When that is done, I will call it a success.

Huahai17:02:38

at that point, there's really no point in using SQL, which I intend to replace

Huahai17:02:21

So the whole idea is really about ergonomics.

Huahai18:02:51

the reason pull will be slower than query in the new query engine is this: pull returns datoms, so there's a cost of converting values to datoms, whereas in the new engine we don't do this conversion; values are poured directly into relations.

Huahai18:02:26

we rewrite as many operations as possible into range scan routines. Four range scan routines cover most of the work; the remaining work is done afterwards with hash-joins.
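A toy illustration of a clause as a range scan (not Datalevin's API): with a sorted :ave index, all datoms matching [?t :transaction/account <eid>] form one contiguous slice, so no join is needed to resolve the clause:

;; sorted [attr value eid] triples standing in for an :ave index
(def ave-index
  (sorted-set [:transaction/account 1 10]
              [:transaction/account 1 11]
              [:transaction/account 2 12]))

(defn range-scan
  "All datoms for attribute a with value v: a contiguous slice of
   the sorted index (nil sorts lowest, Long/MAX_VALUE bounds eids)."
  [idx a v]
  (subseq idx >= [a v nil] < [a v Long/MAX_VALUE]))

(range-scan ave-index :transaction/account 1)
;; => ([:transaction/account 1 10] [:transaction/account 1 11])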

Huahai18:02:54

The order of these range scans is planned using the classic query planner with dynamic programming. My hope of beating RDBMSs is to have more accurate cardinality estimation. In fact, we will count as much as possible, so the estimates are more accurate. Range counts in a triple store are cheap.

Huahai18:02:23

With the new storage format, i.e. dupsort, obtaining most counts is O(1); e.g. the number of datoms matching [?e :attribute ?v] is a single read of the LMDB cursor's count field
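A sketch of that O(1) count using lmdbjava interop (assumes an open cursor on a DUPSORT dbi inside a transaction, and a ByteBuffer holding the already-encoded key):

(import '[org.lmdbjava Cursor GetOp])

(defn dup-count
  "O(1) datom count for one key in a dupsort sub-database:
   position the cursor on the key, then read LMDB's per-key
   duplicate counter. Returns nil if the key is absent."
  [^Cursor cursor key-buf]
  (when (.get cursor key-buf GetOp/MDB_SET)
    (.count cursor)))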

andersmurphy18:02:40

Thank you for the explanation, this is brilliant. Honestly, I’m amazed how performant datalevin already is even with the simple datascript query engine.

Huahai20:02:33

The current engine does have some obvious optimizations, e.g. reordering the clauses based on their sizes, hash-join doing the right thing, etc.

Huahai20:02:10

These don’t exist in datascript
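As a toy illustration of the first of those optimizations (not the actual implementation), reordering clauses by estimated size just means sorting :where clauses so the most selective one runs first:

(defn reorder-clauses
  "Sort :where clauses by an estimated result count, smallest first."
  [clauses estimate]
  (sort-by estimate clauses))

(reorder-clauses
  '[[?t :transaction/account ?a]
    [?a :account/address "9Pu92fit6R4L7ZQUya6Ux6ovQh3CAdYtYot4jDLFxLeN"]
    [?t :transaction/block-time ?block-time]]
  ;; crude estimator: a clause with a bound value is very selective
  (fn [[_e _a v]] (if (symbol? v) 1000 1)))
;; => the unique-address clause sorts first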