#xtdb
2024-03-12
Hukka 06:03:20

I was talking with @jarohen about perf yesterday, and he already made some improvements with a new snapshot release. Originally I had noticed that doing with-open with a local node on 2.0.0-20240226.102659-7 would throw an Arrow exception and not ingest all the documents that were submitted. On 20240308.120125-8 that didn't happen, but only because the submissions took 5× as long as before, so by the time they had returned, indexing was already done. With 20240311.190148-9 the submission speed is even faster than with -7 (by about 60%), but the same problem recurs: you cannot .close a node that is still indexing without bad things happening. One workaround is to run any kind of query after the submissions, since XTDB reads its own writes. It is the counterpart to the synchronisation issue where a freshly started node does not yet have all the data in it, so queries return stale results (there the workaround is to submit an empty transaction before querying).
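The two workarounds above can be sketched roughly like this. This is only illustrative, assuming the 2.x snapshot Clojure API (`xtdb.api/q` and `xtdb.api/submit-tx`) and a node bound to a `:data` table; the exact shapes varied between those snapshots:

```clojure
(require '[xtdb.api :as xt])

(defn close-safely!
  "Workaround 1: run any query before .close so the node first
  catches up with its own writes (read-your-writes), instead of
  closing while indexing is still in flight."
  [node]
  (xt/q node '(-> (from :data [xt/id]) (limit 1)))
  (.close node))

(defn query-synced
  "Workaround 2: on a freshly started node, submit an empty
  transaction first so the following query doesn't see stale data."
  [node q]
  (xt/submit-tx node [])
  (xt/q node q))
```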

Hukka 06:03:16

On a related perf note, at the moment there's a significant penalty for deeply nested data.

(println (query @node '(-> (from :data [])
                           (aggregate {:count (row-count)}))))

(println "Simple data")
(time (query @node '(-> (from :data [xt/id id1 simple-data])
                        (limit 1))))
(time (query @node '(-> (from :data [xt/id id1 simple-data])
                        (limit 500))))
(time (query @node '(-> (from :data [xt/id id1 simple-data])
                        (where (= id1 "06420618721502")))))
(time (query @node '(-> (from :data [xt/id id1 simple-data])
                        (where (= xt/id "62e2e8cd715db4a7db2cfd5e")))))

(println "Nested data")
(time (query @node '(-> (from :data [id1 nested-data])
                        (limit 1))))
(time (query @node '(-> (from :data [id1 nested-data])
                        (limit 45000))))
(time (query @node '(-> (from :data [id1 nested-data])
                        (where (= id1 "06420618721502")))))
(time (query @node '(-> (from :data [xt/id nested-data])
                        (where (= xt/id "62e2e8cd715db4a7db2cfd5e")))))
; (out) [{:count 264267}]
; (out) Simple data
; (out) "Elapsed time: 47.857703 msecs"
; (out) "Elapsed time: 46.380023 msecs"
; (out) "Elapsed time: 57.310858 msecs"
; (out) "Elapsed time: 31.380848 msecs"
; (out) Nested data
; (out) "Elapsed time: 166.631727 msecs"
; (out) "Elapsed time: 8171.400108 msecs"
; (out) "Elapsed time: 7991.012919 msecs"
; (out) "Elapsed time: 79.18282 msecs"
So fetching nested data is fine when using xt/id or limit, but using a bind spec (or where) on some other simple field is so much slower that it's better to first find just the xt/ids and then use those to fetch the documents. This was with a local node that had been restarted. Starting fresh, inserting the data, and then querying was even slower (about 13 seconds rather than 8 for fetching the nested data). I think the time for the node to sync from disk was also much slower at first, but got pretty fast after some repetitions. Need to investigate that more.
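The two-step workaround might look something like the sketch below. It reuses the same `query` helper and `:data` table as the snippet above, additionally assuming that `query` forwards an options map to the underlying query call; the `$id` / `:args` parameter syntax follows later XTQL docs and may differ on the 2024-03 snapshots:

```clojure
;; Step 1: filter on the cheap column, projecting only xt/id.
(let [ids (->> (query @node '(-> (from :data [xt/id id1])
                                 (where (= id1 "06420618721502"))))
               (map :xt/id))]
  ;; Step 2: fetch the nested data by key lookup, which the
  ;; timings above show stays fast even for nested documents.
  (doall
   (for [id ids]
     (query @node '(-> (from :data [xt/id nested-data])
                       (where (= xt/id $id)))
            {:args {:id id}}))))
```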

Hukka 05:04:51
replied to a thread:

I've been trying to work on this, but can't make heads or tails of it. If I take a single document out of the db I was working on, insert it into a new db, and query, everything works, both with an in-memory and a local file-based node. But the exact same query won't work on the db that has lots of records. I've noticed that I can get a similar error by mistyping the :some-data field name in the first part. But that's not it; if I just run the first part, the one that fetches the xt/id, it works. So it's not a problem of typos, or of a db that hasn't indexed the data yet. I guess I should test whether the number of documents in the store matters. That's the only avenue I can think of.