#datahike
2022-03-31
Björn Ebbinghaus 15:03:32

So: I noticed that my pull performance isn't all that great (with the file backend on my Mac). I have a pull-many call with 9 entities and 8 attributes (including one join), and it takes 40 ms. I pull multiple times to fulfil a request, so the times stack up to about 500 ms per request, and more than 1 s on my server, which has no SSD. Anyway, I looked into it and noticed that pulling by lookup ref is noticeably slower than pulling by eid.

pId                   nCalls        Min      50% ≤      90% ≤      95% ≤      99% ≤        Max       Mean   MAD      Clock  Total

:pull-with-lookup      1,000   681,47μs   707,46μs   787,53μs   846,66μs   928,76μs     1,90ms   730,65μs   ±5%   730,65ms    50%
:lookup->eid           1,000   431,54μs   447,78μs   495,12μs   519,19μs   592,17μs     1,31ms   460,24μs   ±4%   460,24ms    32%
:pull-with-eid         1,000   246,34μs   256,51μs   283,64μs   306,21μs   342,48μs   456,91μs   264,05μs   ±5%   264,05ms    18%
My thoughts about that: It would be great if Datahike kept (parts of) the AVET index in memory.
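;; Assumes datahike.api as `d`, Tufte for `profile`/`p`, and an existing connection `conn`.
(require '[datahike.api :as d]
         '[taoensso.tufte :refer [profile p]])

(profile {}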
  (let [db @conn
        eid-cache (atom {})
        lookup->eid (fn [db lookup] (:e (first (d/datoms db {:index :avet, :components lookup}))))
        pull-with-eid-cache 
        (fn pull-with-eid-cache [db selector eid]
          (d/pull db selector
            (if (integer? eid)
              eid
              (if-let [cached-eid (get-in @eid-cache eid)]
                cached-eid
                (let [e (lookup->eid db eid)]
                  (swap! eid-cache assoc-in eid e)
                  e)))))]
    (doseq [_ (range 1000)]
      (p :lookup->eid
        (lookup->eid db [:decide.models.proposal/id #uuid"6051a9e7-5c78-46b4-90e7-4492c89f4728"]))
      (p :pull-with-lookup
        (d/pull db ['*] [:decide.models.proposal/id #uuid"6051a9e7-5c78-46b4-90e7-4492c89f4728"]))
      (p :pull-with-lookup->eid-cache
        (pull-with-eid-cache db ['*] [:decide.models.proposal/id #uuid"6051a9e7-5c78-46b4-90e7-4492c89f4728"]))
      (p :pull-with-eid
        (d/pull db ['*] 156)))))
pId                              nCalls        Min      50% ≤      90% ≤      95% ≤      99% ≤        Max       Mean   MAD      Clock  Total

:pull-with-lookup                 1,000   691,83μs   722,40μs   873,85μs   911,35μs     1,07ms     1,76ms   764,55μs   ±8%   764,55ms    42%
:lookup->eid                      1,000   434,32μs   454,18μs   545,43μs   569,15μs   671,98μs     1,58ms   479,82μs   ±8%   479,82ms    26%
:pull-with-lookup->eid-cache      1,000   257,03μs   272,01μs   329,56μs   354,27μs   455,49μs     1,25ms   290,05μs  ±10%   290,05ms    16%
:pull-with-eid                    1,000   250,20μs   260,28μs   302,66μs   327,57μs   407,67μs     1,39ms   274,43μs   ±8%   274,43ms    15%

Accounted                                                                                                                      1,81s    100%
Clock                                                                                                                          1,82s    100%
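
The means in the table above can be cross-checked with a little arithmetic. The sketch below only copies the mean values from the second profile run (written with decimal points instead of decimal commas). The names `mean-us`, `resolve-then-pull` and `cache-speedup` are made up for illustration. It suggests that resolving the lookup ref and then pulling by eid costs about as much as pulling by lookup ref directly, and that the eid cache makes the pull roughly 2.6 times faster on this data.

```clojure
;; Mean times in microseconds, copied from the second profile run above.
(def mean-us
  {:pull-with-lookup            764.55
   :lookup->eid                 479.82
   :pull-with-lookup->eid-cache 290.05
   :pull-with-eid               274.43})

;; Cost of resolving the lookup ref and pulling by eid as two separate steps.
(def resolve-then-pull
  (+ (:lookup->eid mean-us) (:pull-with-eid mean-us)))

;; How much faster the cached variant is than a plain pull by lookup ref.
(def cache-speedup
  (/ (:pull-with-lookup mean-us)
     (:pull-with-lookup->eid-cache mean-us)))

(println "resolve + pull by eid (μs):" resolve-then-pull)
(println "speedup with eid cache:" cache-speedup)
```

The two-step sum lands within about 10 μs of the `:pull-with-lookup` mean, which is consistent with the lookup-ref resolution being where the extra time goes.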

Björn Ebbinghaus 15:03:13

With clojure.core.cache

(require '[clojure.core.cache.wrapped :as cw])

(profile {}
  (let [db @conn
        *eid-cache (cw/soft-cache-factory {})
        lookup->eid (fn [db lookup] (:e (first (d/datoms db {:index :avet, :components lookup}))))
        pull-with-eid-cache 
        (fn pull-with-eid-cache [db selector eid]
          (d/pull db selector
            (if (integer? eid)
              eid
              (cw/lookup-or-miss *eid-cache eid #(lookup->eid db %)))))]
    (doseq [_ (range 1000)]
      (p :lookup->eid
        (lookup->eid db [:decide.models.proposal/id #uuid"6051a9e7-5c78-46b4-90e7-4492c89f4728"]))
      (p :pull-with-lookup
        (d/pull db ['*] [:decide.models.proposal/id #uuid"6051a9e7-5c78-46b4-90e7-4492c89f4728"]))
      (p :pull-with-lookup->eid-cache
        (pull-with-eid-cache db ['*] [:decide.models.proposal/id #uuid"6051a9e7-5c78-46b4-90e7-4492c89f4728"]))
      (p :pull-with-eid
        (d/pull db ['*] 156)))))

kkuehne 18:04:57

Yeah, that would indeed be helpful. Currently we only have some caching strategies for queries. The cache should be optional, since we want Datahike to also run on smaller systems. One question is where to add the cache and how it relates to the query cache. Maybe you could open a discussion on GitHub, and we can work out there how to add this to Datahike.
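
For illustration only, here is a minimal sketch of what an optional lookup-ref cache could look like in user code. It is not Datahike's API and not a proposal for where the cache should live. `make-eid-resolver`, its `:cache?`, `:max-entries` and `:db->key` options, and the toy database are all hypothetical names. `resolve-fn` stands in for a function like the `lookup->eid` from the thread. Entries are keyed by database value as well as lookup ref, on the assumption that a lookup ref may resolve differently after a transaction. For a real database, `:db->key` would probably need to return some cheap value that changes with every transaction, rather than the whole db.

```clojure
(defn make-eid-resolver
  "Wraps `resolve-fn`, a function of [db lookup-ref] that returns an entity id
  or nil. With :cache? false (the default) `resolve-fn` is returned unchanged.
  With :cache? true, hits are kept in an atom keyed by
  [(db->key db) lookup-ref]; the cache is emptied once it holds :max-entries
  entries, which keeps memory bounded on small systems."
  [resolve-fn {:keys [cache? max-entries db->key]
               :or   {cache? false, max-entries 1024, db->key identity}}]
  (if-not cache?
    resolve-fn
    (let [cache (atom {})]
      (fn [db lookup-ref]
        (let [k [(db->key db) lookup-ref]]
          (if-some [eid (get @cache k)]
            eid
            (let [eid (resolve-fn db lookup-ref)]
              ;; Only successful lookups are cached.
              (when (some? eid)
                (swap! cache (fn [m]
                               (assoc (if (>= (count m) max-entries) {} m)
                                      k eid))))
              eid)))))))

;; Toy stand-ins for a database and for the AVET lookup from the thread.
(def toy-db {:proposal/id {"a" 156}})
(def lookups (atom 0))

(defn toy-lookup [db [attr value]]
  (swap! lookups inc)                 ; count how often the "index" is hit
  (get-in db [attr value]))

(def resolve-eid (make-eid-resolver toy-lookup {:cache? true}))

(resolve-eid toy-db [:proposal/id "a"]) ; resolved through toy-lookup
(resolve-eid toy-db [:proposal/id "a"]) ; served from the cache
```

Passing `{}` or `{:cache? false}` returns the plain resolver, so a small deployment pays nothing for the feature.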