2023-10-31
Performing the following pull-many query with 3 attr-ids (including :db/id) on a range of 400 or so entities requires ~2,900 ms:
(require '[datahike.api :as d]) ; version 0.6.1531
(d/pull-many @conn [:db/id :book-name :notable?]
             (range 1 400))
Is the slow query an inherent trade-off of EAV databases, or am I failing to optimize in some obvious way?
In this case, would it help to set :db/index to true for attribute :db/id? (I assumed true would be the default for :db/id, but perhaps it isn't.)
In what context is this running? JVM Clojure, JavaScript (dev or prod compilation), babashka, or something else?
At first glance that quite simple query seems quite slow, if you are running it on the JVM or a production build of ClojureScript (it doesn't have to be an optimized build, but the JIT should have had the opportunity to do its job). Datahike is ported from DataScript, but the pull API seems quite different between the two implementations - at first glance Datahike parses the pull expression in a more elaborate way, with a lot more checks than DataScript - probably for the better.
If you want a more production-like answer on that query's performance, use either Criterium (JVM) or maybe Tufte (cljs - haven't tried), or something else under the Benchmarking section in https://www.clojure-toolbox.com/
(In the query you are testing there are probably not enough calls for the JIT to start its work.)
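For example, a minimal Criterium sketch (assuming criterium is on your classpath and conn is the connection from above) could look like:
(require '[criterium.core :as crit])

;; quick-bench warms up the JIT before sampling, so the numbers are closer
;; to production behaviour than a single (time ...) call
(crit/quick-bench
 (d/pull-many @conn [:db/id :book-name :notable?] (range 1 400)))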
In Datomic the query would take much less time than 2900 ms. The query is not that complicated - for each :db/id, check if there is an entity for this db/id (this is what Datahike does; not sure about DataScript or Datomic), then look for the attributes :book-name (I assume this is a string?) and :notable? (I assume this is a boolean) in the EAV index for each db-id. Create a map with the result {:db/id _ :book-name "..." :notable? <bool>}. If these attributes are multi-valued and/or references (which can have isComponent set to true), things are somewhat more complicated, but I assume they're not?
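As a rough illustration (my own sketch, not Datahike's actual implementation), that per-entity work is conceptually something like:
;; naive-pull-many is a hypothetical helper, just to show the shape of the work:
;; for each id, look the entity up in the EAV index and assemble a result map.
(defn naive-pull-many [db ids]
  (for [id ids
        :let [e (d/entity db id)]]   ; EAV lookup for this :db/id
    (cond-> {:db/id id}
      (:book-name e)        (assoc :book-name (:book-name e))
      (some? (:notable? e)) (assoc :notable? (:notable? e)))))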
First question - can you give an idea of how many entities are in the DB overall? And of those, how many have the :book-name and :notable? attributes?
Second question - have you tried writing a d/q query and compared performance? Something like:
(d/q '[:find [?id ?book-name ?notable]
       :in $ [?id ...]
       :where
       [?id :book-name ?book-name]
       [(get-else $ ?id :notable false) ?notable]]
     db (range 1 400))
Wow, pithyless — your pivorak talk is what got me into EAV and datalog in the first place! Superb talk. Bravo.
I don’t have experience with datahike, but even just page scanning 500 entities should not take that long, I think. Let us know if writing a regular query helps - at least we’ll know if it’s index (query) or pull syntax related.
PS. appreciate the kind words - that talk was always intended for those that go and do something with that material. A start of a journey, vs an actual destination. I’m always happy to hear that it helped someone :)
PPS. I’m taking off now, but will happily peruse any data you provide in the meantime tomorrow
I should have included the shape of the data I am trying to form, as it will hint at what I'm trying to build:
[{:db/id 339, :book-name "Notation as a Tool of Thought"}
{:db/id 338, :book-name "The Science of Radio", :notable? true}
{:db/id 337, :book-name "Journey Into Mathematics"}
{:db/id 336, :book-name "Street Fighting Mathematics"}
...]
@UQY3M3F6D Your assumptions are all correct:
> :book-name (I assume this is a string?) and :notable? (I assume this is a boolean)
Correct.
> If these attributes are multi-valued and/or references (which can have isComponent set to true), things are somewhat more complicated, but I assume they're not?
Your assumption is correct: they are not references.
This is also a bit surprising to me. I would look at a sampling CPU profiler, e.g. jvisualvm, to see where most of the time is spent. Maybe something weird is happening in the iteration over entity ids.
Hmm, interesting ... even pithyless's d/q is taking almost 1,000 ms to yield one result:
(time (d/q '[:find [?id ?b ?n]
             :in $ [?id ...]
             :where
             [?id :book-name ?b]
             [(get-else $ ?id :notable? false) ?n]]
           @conn (range 1 400)))
; => "Elapsed time: 945.92125 msecs"
; [320 "paul mockapetris" false]
Strange, I will try to reproduce. Could you make the dataset available to me somehow? It can be via private, confidential PM.
Happy to. I haven't done this before, but it is a Postgres DB, so I will try pg_dump, I believe.
We recently released a JDBC backend with lower latency: https://github.com/replikativ/datahike-jdbc
Thanks to @U8KKDKPG8
Sorry, the :backend is :jdbc (and :dbtype is "postgresql"). Apologies, I am still relatively new to Clojure.
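For reference, the store config looks roughly like this (a sketch based on the datahike-jdbc README; host, dbname, and credentials are placeholders):
(def cfg {:store {:backend  :jdbc
                  :dbtype   "postgresql"
                  :host     "localhost"   ; placeholder
                  :port     5432
                  :dbname   "books"       ; placeholder
                  :user     "alice"       ; placeholder
                  :password "secret"}})   ; placeholder

;; (d/create-database cfg)
;; (def conn (d/connect cfg))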
@U1C36HC6N So I can generate an edn file that will re-create all transactions (including schema transactions)?
It looks like Datahike has not implemented tx-range, which might have been helpful here: https://cljdoc.org/d/io.replikativ/datahike/0.3.6/doc/differences-to-datomic-client-api?q=tx-range#differences-to-datomic-client-api
I should have provided my schema for the attrs in question:
[{:db/ident       :book-name
  :db/valueType   :db.type/string
  :db/cardinality :db.cardinality/one}
 {:db/ident       :notable?
  :db/valueType   :db.type/long
  :db/cardinality :db.cardinality/one}]
:notable? was set to long because I thought I might want to count the number of times a book-name gets "noted". Most entities are not noted; when they are, the value of :notable? is 0. Oddball design, but I don't think that would affect query speed.
Finally, here are the datoms for an entity that is :notable?:
(#datahike/Datom [336 :notable? 0 536871355]
#datahike/Datom [336 :url-link "Street Fighting Mathematics" 536871288])
FYI - I've been able to reproduce this issue by testing datahike vs datomic (mem/file) and datahike vs datomic (pgsql). Two things immediately jump out at me:
First, even in dev mode (writing to file), datahike shows a performance difference between d/q and d/pull-many:
(time
 (datahike/q
  '[:find [?id ...]
    :in $ [?id ...]
    :where
    [?id :book-name]]
  @conn book-ids))
;; "Elapsed time: 4.2875 msecs"

(time
 (datahike/q
  '[:find ?id ?book-name ?notable
    :in $ [?id ...]
    :where
    [?id :book-name ?book-name]
    [(get-else $ ?id :notable? false) ?notable]]
  @conn book-ids))
;; "Elapsed time: 158.987833 msecs"

(time
 (datahike/pull-many
  @conn [:db/id :book-name :notable?]
  book-ids))
;; "Elapsed time: 388.851709 msecs"
In Datomic, these queries - whether from memory or SQL - all hover around 5~15 ms (caching, in-memory paging, etc. mean that the exact numbers aren't important, but they're always under 20 ms, irrespective of query; in fact, pull-many is often slightly faster for this simple kind of index lookup).
Second, there is a much bigger delay (the kind @U02U1T66REZ is reporting) when Datahike is using datahike-jdbc (via postgres running on a local docker instance):
(time
 (datahike/q
  '[:find [?id ...]
    :in $ [?id ...]
    :where
    [?id :book-name]]
  @conn book-ids))
;; "Elapsed time: 32.260542 msecs"

(time
 (datahike/q
  '[:find ?id ?book-name ?notable
    :in $ [?id ...]
    :where
    [?id :book-name ?book-name]
    [(get-else $ ?id :notable? false) ?notable]]
  @conn book-ids))
;; "Elapsed time: 1369.331292 msecs"

(count
 (time
  (datahike/pull-many
   @conn [:db/id :book-name :notable?]
   book-ids)))
;; "Elapsed time: 5431.585291 msecs"
All these tests use 450 "books" and 50 "other" entities, per the OP requirements; the schema is below, and a rough sketch of the data seeding follows it.
[{:db/ident :book-name
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one}
{:db/ident :notable?
:db/valueType :db.type/long
:db/cardinality :db.cardinality/one}
{:db/ident :other
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one}]
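(Something along these lines seeds the test data - an illustrative sketch only, the exact values don't matter:)
;; Illustrative only: 450 "books", every 10th marked :notable?, plus 50 "other" entities.
(d/transact conn
            {:tx-data (vec (concat
                            (for [i (range 450)]
                              (cond-> {:book-name (str "book-" i)}
                                (zero? (mod i 10)) (assoc :notable? 0)))
                            (for [i (range 50)]
                              {:other (str "other-" i)})))})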
There is probably some really low-hanging fruit here that can be addressed, or perhaps a misconfiguration error on my part, because this is a really small amount of data to be choking on. NOTE: I am not familiar with Datahike in practice, so I may have missed something obvious. My first thought would be - are there certain SQL indices the end user needs to migrate?
FYI, this is what I was comparing:
{:datahike {:extra-deps
{io.replikativ/datahike {:mvn/version "0.6.1552"}
io.replikativ/datahike-jdbc {:mvn/version "0.2.46"}
com.github.seancorfield/next.jdbc {:mvn/version "1.3.894"}
org.postgresql/postgresql {:mvn/version "42.6.0"}}}
:datomic {:extra-deps
{com.datomic/peer {:mvn/version "1.0.6735"}
org.postgresql/postgresql {:mvn/version "42.6.0"}}}}
BTW, I took another look at this - an initial CPU profile shows that 18% of the time was spent in konserve-jdbc/read-all, 37% in konserve-jdbc/blob-exists?, and another 18% in konserve-jdbc/read-header.
That's not counting the time to serialize/deserialize Fressian, etc. This is just time spent fanning out and doing a LOT of repeated select 1 ...-style SQL queries to the DB.
To put this in perspective: the logger showed 2145 SQL queries for the d/q example and 6410 queries for the d/pull-many - for a query matching ~400 entity ids.
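(That works out to roughly 2145/400 ≈ 5 store round-trips per matched entity for the d/q, and 6410/400 ≈ 16 per entity for the pull-many.)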
I'm not sure what kind of performance and load you're expecting @U02U1T66REZ - but no matter how much you lower DB latency, that many requests will add up. As I'm not familiar with the internals, I may be missing something, but it feels like konserve-jdbc is missing a strategy to batch and cache data to minimize hitting raw SQL. 🤷 Perhaps something is on the roadmap?
Wow, this is an awesome interrogation of the system — thank you, Norbert — but somewhat discouraging to me! As enamored as I am with the composability of Datalog, I didn't realize serial retrieval and joins of attributes would suffer this much. Now that I'm more familiar, it seems like a more traditional relational engine is best for the kinds of queries I want to run. If I understand right, Datalog is designed to retrieve results when multiple constraints can be supplied, whereas a relational engine is better suited for the kind of serialized map I'm trying to form ("give me all the objects, but tell me when they have a special attribute").
Hey @U05476190 this is such a useful analysis. Lemme give it a shot.
By the way your talk got me into datalog, and I've never looked back. It's what got me spending time on datahike (datomic was unaffordable at the time). Thank you.
Thanks @U8KKDKPG8 - that's quite a compliment, coming from you. I think you're definitely the more entertaining speaker; you've got the natural flow, calm, and stage presence of someone who can captivate an audience. :)
@U02U1T66REZ This is a problem in Datahike, we need to fix this. It is not your fault. I have not worked with the JDBC backend much, but something is going wrong here.
@U02U1T66REZ This is fixed now with https://github.com/replikativ/datahike/pull/653. Thank you for reporting it. I had not yet noticed that the cache was not being hit properly, because I usually run queries through the query engine, which has another cache in front of the store, and my focus until now had been on ensuring that writes are fast and that we can access everything in a distributed way. @U05476190 @U8KKDKPG8 This problem had nothing to do with JDBC, btw; it hit the store just as often for the file backend. You can access cache hit statistics with (:stats (:storage (:store @(:wrapped-atom conn))))
btw. @U05476190 Thanks for looking into the issue!
I will put myself on the firing line a bit with this now, but in general, whenever Datahike is significantly (e.g. more than 10%) slower than DataScript, that should be considered a bug. The main reason for rebasing our storage backend on the persistent sorted set was its fast iterator performance (5-10x faster than the hitchhiker tree). There is still a bit of work needed to reduce transact latency when you write a lot of small values, but other than that we should have performance comparable to DataScript, and in general I would also consider performance gaps to other contenders - xtdb, Datomic, Datalevin, etc. - as noteworthy/potential "bugs". We should already be faster than DataScript and Datalevin in read/write throughput if you use a schema with attribute references.
The persistent sorted set is very carefully constructed by @U050UBKAA to be fast, but I recently learned that https://github.com/lacuna/bifurcan might be faster for certain use cases/index structures (in this case for compiler passes). https://immudb.io/ uses log-structured merge trees to get much faster write throughput. The persistent sorted set can do ~200k inserts/sec on my laptop if I recall correctly; immudb claims to be able to do millions per second, but I am not sure how effectively their trees can be snapshotted in a distributed system (i.e. how much copying is needed), and I have not done any benchmarking myself. I would also say for now that Datahike users should follow the Datomic strategy and shard when possible (and do cross-db joins), which ultimately is the more scalable strategy (e.g. as employed by Nubank).
Having said all this, it would be cool to build a demo for a distributed use case together now that it is possible to use Datahike that way. One of the more challenging things to explore there is how well latency between different backends (potentially hosted in different clouds) can be handled when there are cross-db joins between them.
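To make the cross-db join idea concrete, here is a sketch (assuming the query engine accepts multiple database inputs, Datomic/DataScript style; the connections and :review/* attributes are hypothetical):
;; conn-a holds books, conn-b holds reviews - both are hypothetical databases
(d/q '[:find ?book-name ?review-text
       :in $a $b
       :where
       [$a ?b :book-name ?book-name]
       [$b ?r :review/book-name ?book-name]
       [$b ?r :review/text ?review-text]]
     @conn-a @conn-b)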
🚀 Quick turnaround @U1C36HC6N! Glad it was a low-hanging performance bug. :) BTW, I noticed now the project has benchmarking code (that I had not tried to run). Does it highlight this cache miss problem (or could it be modified to catch it)? If so, perhaps this is worth extending as a future-proof regression test? Thanks also for clarifying the project expectations and intentions related to performance. I'm sure this will be really helpful when people are considering their tech choices.
The benchmark suite tests transactions and queries, but not pull expressions. Queries have their own cache on top of the index iterators, so the problem was not as visible there. The pull expression here basically runs 400 times, which highlights the problem. So yes, the benchmark suite should be extended to cover it. If somebody wants to help with this, I am happy to guide. I am fairly busy right now, but hopefully can get to it later if not.
@U02U1T66REZ Could you update the stackoverflow posting to mention it is fixed once you can verify?
I also tried playing around with index-range. I thought I might be able to do something like this:
(->> (d/index-range @conn {:attrid {:book-name :notable?}
                           :start  "a"
                           :end    "zzzzzzzzzzzzzzz"})
     (map :v))
But I believe :attrid accepts only one attribute name.
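Something like this might work for a single attribute (I haven't verified it against the 0.6.x docs, but :attrid seems to take one attribute keyword):
(->> (d/index-range @conn {:attrid :book-name   ; single attribute
                           :start  "a"
                           :end    "zzzzzzzzzzzzzzz"})
     (map :v))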
I had trouble finding the source code for index-range. The Replikativ documentation points to this: https://github.com/replikativ/datahike/blob/HEAD/src/datahike/api.cljc#L686
If anyone is interested in answering publicly, I have provided more detail on SO: https://stackoverflow.com/questions/77396365/slow-query-using-pull-to-retrieve-attributes-on-400-entities
I added my answers from above as a more condensed post. It is still missing information about your runtime env, though.