2023-10-31 datahike | Clojure Slack Archive

datahike

silian 2023-10-31T12:46:33.156239Z

Performing the following pull-many query with 3 attr-ids (including :db/id) on a range of 400 or so entities requires ~2,900 ms:

(require '[datahike.api :as d]) ; version 0.6.1531
(d/pull-many @conn [:db/id :book-name :notable?] 
  (range 1 400))

Is the slow query an inherent trade-off of EAV databases, or am I failing to optimize in some obvious way? In this case, would it help to set :db/index to true for attribute :db/id? (I assumed true would be the default for :db/id but perhaps it isn't.)

silian 2023-11-01T10:23:44.421769Z

I should have provided my schema for the attrs in question:

[{ :db/ident :book-name
   :db/valueType :db.type/string
   :db/cardinality :db.cardinality/one}
 { :db/ident :notable?
   :db/valueType :db.type/long
   :db/cardinality :db.cardinality/one}]

notable? was set to long because I thought I might want to count the number of times a book-name gets "noted". Most entities are not noted; when they are, the value of notable? is 0. Oddball design but I don't think that would affect query speed.

silian 2023-11-01T10:36:24.731679Z

Finally, here are the datoms for an entity that is :notable?:

(#datahike/Datom [336 :notable? 0 536871355]
 #datahike/Datom [336 :url-link "Street Fighting Mathematics" 536871288])

pithyless 2023-11-01T19:36:29.474229Z

FYI - I've been able to reproduce this issue, by testing datahike vs datomic (mem/file) and datahike vs datomic (pgsql). Two things immediately jump out to me:

pithyless 2023-11-01T19:38:01.874519Z

First, even in dev mode (writing to file), datahike shows a performance difference between d/q and d/pull-many:

(time
   (datahike/q
    '[:find [?id ...]
      :in $ [?id ...]
      :where
      [?id :book-name]]
    @conn book-ids))
  ;; "Elapsed time: 4.2875 msecs"

  (time
   (datahike/q
    '[:find ?id ?book-name ?notable
      :in $ [?id ...]
      :where
      [?id :book-name ?book-name]
      [(get-else $ ?id :notable? false) ?notable]]
    @conn book-ids))
  ;; "Elapsed time: 158.987833 msecs"

  (time
   (datahike/pull-many
    @conn [:db/id :book-name :notable?]
    book-ids))
  ;; "Elapsed time: 388.851709 msecs"

pithyless 2023-11-01T19:40:44.899939Z

In Datomic, these queries either from memory or SQL all hover around 5~15 ms (caching, in-memory paging, etc. means that the exact numbers aren't important.. but they're always under 20ms and irrespective of query. In fact, pull-many often is slightly faster for this simple kind of index lookup).

👀 1

pithyless 2023-11-01T19:42:14.656259Z

Second, there is a much bigger delay (the kind @feedmyinbox02_clojuri is reporting) when Datahike is using datahike-jdbc (via postgres running on a local docker instance):

(time
   (datahike/q
    '[:find [?id ...]
      :in $ [?id ...]
      :where
      [?id :book-name]]
    @conn book-ids))
  ;; "Elapsed time: 32.260542 msecs"

  (time
   (datahike/q
    '[:find ?id ?book-name ?notable
      :in $ [?id ...]
      :where
      [?id :book-name ?book-name]
      [(get-else $ ?id :notable? false) ?notable]]
    @conn book-ids))
  ;; "Elapsed time: 1369.331292 msecs"

  (count
   (time
    (datahike/pull-many
     @conn [:db/id :book-name :notable?]
     book-ids)))
  ;; "Elapsed time: 5431.585291 msecs"

pithyless 2023-11-01T19:43:16.012689Z

All these tests are using ~~450 "books" and~~ 50 "other" data, per the OP requirements.

[{:db/ident :book-name
    :db/valueType :db.type/string
    :db/cardinality :db.cardinality/one}
   {:db/ident :notable?
    :db/valueType :db.type/long
    :db/cardinality :db.cardinality/one}
   {:db/ident :other
    :db/valueType :db.type/string
    :db/cardinality :db.cardinality/one}]

pithyless 2023-11-01T19:45:22.277919Z

There is probably some really low-hanging fruits here that can be addressed or perhaps misconfiguration error on my part, because this is a real small amount of data to be choking on. NOTE: I am not familiar with Datahike in practice, so I may have missed something obvious. My first thought would be - are there certain SQL indices the end user needs to migrate?

pithyless 2023-11-01T19:47:20.761199Z

FYI, this is what I was comparing:

{:datahike {:extra-deps
             {io.replikativ/datahike      {:mvn/version "0.6.1552"}
              io.replikativ/datahike-jdbc {:mvn/version "0.2.46"}
              com.github.seancorfield/next.jdbc {:mvn/version "1.3.894"}
              org.postgresql/postgresql {:mvn/version "42.6.0"}}}
  :datomic  {:extra-deps
             {com.datomic/peer          {:mvn/version "1.0.6735"}
              org.postgresql/postgresql {:mvn/version "42.6.0"}}}}

pithyless 2023-11-01T21:24:30.299369Z

BTW, I took another look at this - and an initial CPU profiler shows that 18% of the time was spent on konserve-jdbc/read-all, 37% on konserve-jdbc/blob-exists? , and another 18% on konserve-jdbc/read-header. That's not counting the time to serialize/deserialize Fressian, etc. This is just time spent fanning out and doing a LOT of repeat select 1 ... style SQL queries to the DB. To put this in perspective: the logger showed 2145 SQL queries for the d/q example and 6410 queries for the d/pull-many . For a query matching ~400 entity ids.

pithyless 2023-11-01T21:26:38.922849Z

I'm not sure what kind of performance and load you're expecting @feedmyinbox02_clojuri - but no matter how much you lower DB latency, that many requests will add up. As I'm not familiar with the topic, I may be missing something. But it feels like the konserve-jdbc is missing a strategy to batch and cache data to minimize hitting raw SQL. 🤷 Perhaps something is on the roadmap?

silian 2023-11-01T22:24:48.936179Z

Wow, this is an awesome interrogation of the system — thank you, Norbert — but somewhat discouraging to me! As enamored as I am with the composability of Datalog, I didn't realize serial retrieval and joins of attributes would suffer this much. Now that I'm more familiar, it seems like a more traditional relational engine is best for the kinds of queries I want to run. If I understand right, Datalog is designed to retrieve results when multiple constraints can be supplied whereas relational is better suited for the kind of serialized map I'm trying to form (“Give me all the objects but tell me when they have a special attribute”).

alekcz 2023-11-02T06:41:10.607679Z

Hey @pithyless this is such a useful analysis. Lemme give it a shot.

alekcz 2023-11-02T06:42:52.281769Z

By the way your talk got me into datalog, and I've never looked back. It's what got me spending time on datahike (datomic was unaffordable at the time). Thank you.

❤️ 2

pithyless 2023-11-05T11:53:35.026109Z

🚀 Quick turnaround @whilo! Glad it was a low-hanging performance bug. :) BTW, I noticed now the project has benchmarking code (that I had not tried to run). Does it highlight this cache miss problem (or could it be modified to catch it)? If so, perhaps this is worth extending as a future-proof regression test? Thanks also for clarifying the project expectations and intentions related to performance. I'm sure this will be really helpful when people are considering their tech choices.

whilo 2023-11-05T18:11:09.459869Z

The benchmark suite tests transactions and query, but not pull expressions. Query has its own cache on top of the index iterators, so the problem was not as visible. This pull expression basically runs 400 times here, highlighting the problem. So yes, the benchmark suite should be extended to it. If somebody wants to help with this, I am happy to guide. I am fairly busy right now, but hopefully can get to it later if not.

whilo 2023-11-05T19:10:56.644659Z

@feedmyinbox02_clojuri Could you update the stackoverflow posting to mention it is fixed once you can verify?

👍 1

Linus Ericsson 2023-10-31T13:31:38.665239Z

In what context is this running? JVM clojure, javascript (dev or prod-compilation) or babashka or something?

Linus Ericsson 2023-10-31T13:40:30.102599Z

At first glance quite simple query seems quite slow, if you run in JVM or a production build of clojurescript (doesn't have to be an optimized build but it should have had the opportunity to have the JIT to make its job). Datahike is ported from datascript, but it seems like the pull_api is quite different between the two implementations - at first glance datahike seems to do the parsing of the pull expression in a more elaborate way etc, and with a lot more checks than datascript - probably for the better.

Linus Ericsson 2023-10-31T13:42:46.777619Z

If you want a maybe more production like answer on that queries performance, use either Critereum (jvm) or maybe Tufte (cljs - haven't tried) or something else under the Benchmarking section in https://www.clojure-toolbox.com/

Linus Ericsson 2023-10-31T13:44:58.183329Z

(In the query you are testing there is probably not enough calls to make JIT start to its work).

Linus Ericsson 2023-10-31T13:49:38.843749Z

In Datomic the query would take much less time than 2900 ms. The query is not that complicated - for each :db/id, check if there is an entity for this db/id (this is what DataHike does, not sure about DataScript, Datomic), then look for the attributes :book-name (I assume this is a string?) and :notable? (I assume this is a boolean) in the EAV-index for each db-id. Create a map with the result {:db/id _ :book-name " ..." :notable? <bool>}. If these attributes is multi-arity and/or references (which can have isComponent set to true), things are somewhat more complicated, but I assume their not?

✅ 1

pithyless 2023-10-31T15:18:18.007359Z

First question - can you provide an idea of how many overall entities are in the DB? And of those, how many of those have attributes :book-name and notable? ? Second question - have you tried writing a d/q query and compared performance? something ala:

(d/q '[:find [?id ?book-name ?notable]
       :in $ [?id ...]
       :where
       [?id :book-name ?book-name]
       [(get-else $ ?id :notable false) ?notable]]
  db (range 1 400))

silian 2023-10-31T18:30:26.708059Z

Wow, pithyless — your pivorak talk is what got me into EAV and datalog in the first place! Superb talk. Bravo.

❤️ 2

silian 2023-10-31T18:34:01.477829Z

There are 506 entities in the DB. 416 have :book-name and 3 have :notable?.

pithyless 2023-10-31T18:36:54.001739Z

I don’t have experience with datahike, but even just page scanning 500 entities should not take that long, I think. Let us know if writing a regular query helps - at least we’ll know if it’s index (query) or pull syntax related.

pithyless 2023-10-31T18:39:48.766889Z

PS. appreciate the kind words - that talk was always intended for those that go and do something with that material. A start of a journey, vs an actual destination. I’m always happy to hear that it helped someone :)

💙 2

pithyless 2023-10-31T18:40:35.172289Z

PPS. I’m taking off now, but will happily peruse any data you provide in the meantime tomorrow

silian 2023-10-31T18:42:04.346419Z

Very kind of you

silian 2023-10-31T18:46:22.538949Z

I should have included the shape of the data I am trying to form, as it will hint at what I'm trying to build:

[{:db/id 339, :book-name "Notation as a Tool of Thought"}
 {:db/id 338, :book-name "The Science of Radio", :notable? true}
 {:db/id 337, :book-name "Journey Into Mathematics"}
 {:db/id 336, :book-name "Street Fighting Mathematics"}
...]

silian 2023-10-31T18:52:11.109329Z

@oscarlinusericsson Your assumptions are all correct: > :book-name (I assume this is a string?) and :notable? (I assume this is a boolean) Correct. > If these attributes is multi-arity and/or references (which can have isComponent set to true), things are somewhat more complicated, but I assume their not? Your assumption is correct: they are not references.

whilo 2023-10-31T19:01:06.440389Z

This is also a bit surprising to me. I would look at the sampling profiler for CPU time, e.g. in jvisualvm to see where most of the time is spent. Maybe something weird is happening in the iteration over entity ids.

👍 1

whilo 2023-10-31T19:03:16.399369Z

I would also be interested in how d/q is performing in comparison.

silian 2023-10-31T19:11:52.122289Z

Hmm, interesting ... even pithyless's d/q is taking almost 1,000 ms to yield one result:

(time (d/q '[:find [?id ?b ?n]
         :in $ [?id ...]
         :where
         [?id :book-name ?b]
         [(get-else $ ?id :notable? false) ?n]]
       @conn (range 1 400))) 
; => "Elapsed time: 945.92125 msecs" 
;     [320 "paul mockapetris" false]

whilo 2023-10-31T19:29:37.302659Z

Strange, I will try to reproduce. Could you make the dataset available to me somehow? Can be private confidential PM.

silian 2023-10-31T19:31:58.524259Z

Happy to. Haven't done this before but it is postgres db so I will try pg-dump, I believe.

whilo 2023-10-31T23:33:12.161589Z

So you use postgres as a backend? The easiest would be an edn file to transact.

whilo 2023-10-31T23:33:46.890599Z

We recently released a jdbc backend with less latency https://github.com/replikativ/datahike-jdbc

whilo 2023-10-31T23:33:55.319339Z

Thanks to @alekcz360

silian 2023-11-01T01:08:10.646899Z

Sorry, the :backend is :jdbc (and :dbtype "postgresql"). Apologies, I am still relatively new to Clojure.

silian 2023-11-01T04:25:36.673939Z

@whilo So I can generate an edn file that will re-create all transactions (including schema transactions)?

silian 2023-11-01T04:41:39.510749Z

It looks like Datahike has not implemented tx-range, which may have been helpful here. https://cljdoc.org/d/io.replikativ/datahike/0.3.6/doc/differences-to-datomic-client-api?q=tx-range#differences-to-datomic-client-api

whilo 2023-11-04T20:22:36.201709Z

@feedmyinbox02_clojuri This is a problem in Datahike, we need to fix this. It is not your fault. I have not worked with the JDBC backend much, but something is going wrong here.

whilo 2023-11-05T03:02:13.767899Z

@feedmyinbox02_clojuri This is fixed now with https://github.com/replikativ/datahike/pull/653. Thank you for reporting it, I had not recognized yet that the cache was not properly hit, because I usually ran queries with the query engine which has another cache in front of the store and was focused until now on ensuring that writes are fast and we can access everything in a distributed way. @pithyless @alekcz360 This problem had nothing to do with JDBC btw. It hit the store always that often, also for the file backend. You can access cache hit statistics with (:stats (:storage (:store @(:wrapped-atom conn)))) btw. @pithyless Thanks for looking into the issue!

whilo 2023-11-05T03:09:23.138319Z

I will put myself a bit on the shooting line with this now, but in general whenever Datahike is significantly (e.g. more than 10%) slower than DataScript that should be considered a bug. The main reason for rebasing our storage backend on the persistent sorted set was its fast iterator performance (5-10x faster than the hitchhiker-tree). There is still a bit of work needed in reducing transact latency when you write a lot of small values, but other than that we should have comparable performance to DataScript and in general I would also consider performance gaps to other contenders, xtdb, Datomic, Datalevin etc. as noteworthy/potential "bugs". We should already be faster than DataScript and Datalevin in read/write throughput if you use a schema with attribute references.

whilo 2023-11-05T03:16:06.714659Z

The persistent sorted set is very carefully constructed by @tonsky to be fast, but I recently learned that https://github.com/lacuna/bifurcan might be faster for certain use cases/index structures (in this case for compiler passes). https://immudb.io/ uses log structured merge trees to have much faster write throughput, The persistent sorted set can do ~200k inserts/sec on my laptop if I recall correctly, immudb claims to be able to do millions per second, but I am not sure how effectively their trees can be snapshotted in a distributed system (i.e. how much copying is needed), and have not done any benchmarking myself. I would also say for now that Datahike users should follow the Datomic strategy and shard when possible (and do cross-db joins), which ultimately is the more scalable strategy (e.g. as employed by nubank).

whilo 2023-11-05T03:43:03.978489Z

Having said all this, it would be cool to build a demo for a distributed use case together now that it is possible to use Datahike that way. One of the more challenging things to explore there is how well latency between different backends (potentially hosted in different clouds) can be handled when there are cross-db joins between them.

pithyless 2023-11-02T07:11:55.203519Z

Thanks @alekcz360 - that's quite a compliment, coming from you. I think you're definitely the more entertaining speaker; you've got the natural flow, calm, and stage presence of someone who can captivate an audience. :)

❤️ 1

silian 2023-10-31T12:57:21.077619Z

I also tried playing around with index-range. I thought I might be able to do something like this:

(->> 
 (d/index-range @conn {:attrid {:book-name :notable?} :start "a" :end "zzzzzzzzzzzzzzz"}) 
 (map :v))

But I believe :attrid accepts only 1 attribute name. I had trouble finding source code for index-range. Replikativ documentation points to this https://github.com/replikativ/datahike/blob/HEAD/src/datahike/api.cljc#L686

silian 2023-10-31T13:33:08.288299Z

If anyone is interested in answering publicly, I have provided more detail on SO: https://stackoverflow.com/questions/77396365/slow-query-using-pull-to-retrieve-attributes-on-400-entities

👍 1

Linus Ericsson 2023-10-31T14:14:17.019659Z

I added my answers above as a more redacted post. Still missing information of you runtime env though,

silian 2023-10-31T18:22:36.574569Z

Thanks Linus. It is running on JVM Clojure (1.10.0). Running it locally.

whilo 2023-10-31T19:28:39.786489Z

It would be better to open issues on github, I think.

🐛 1

silian 2023-10-31T20:13:58.124349Z

I wasn't sure if this was a bug or just my own incompetence and oversight.

Clojurians Log v2

datahike