#datahike
2021-05-17
timo08:05:21

Hi @willier. I thought there was a datahike backend for DynamoDB... :thinking_face: But it is quite straightforward to write a datahike backend https://cljdoc.org/d/io.replikativ/datahike/0.3.6/doc/backend-development; I did it recently for Cassandra: https://github.com/timokramer/datahike-cassandra.
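For context, the linked backend-development guide essentially boils down to wrapping a konserve store and extending a few multimethods in datahike.store. A rough sketch of the shape only, based on the 0.3.x multimethod names; my-konserve/new-store and my-konserve/delete-store are hypothetical placeholders for your own konserve implementation:

(ns datahike-mybackend.core
  (:require [datahike.store :refer [empty-store delete-store connect-store]]
            ;; hypothetical konserve implementation for your backend
            [my-konserve.core :as my-konserve]))

;; create a fresh store for this backend
(defmethod empty-store :mybackend [config]
  (my-konserve/new-store config))

;; remove the underlying storage again
(defmethod delete-store :mybackend [config]
  (my-konserve/delete-store config))

;; connect to an already existing store
(defmethod connect-store :mybackend [config]
  (my-konserve/new-store config))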

willier08:05:45

Hi @timo, what does k/new-your-store do? I don't see the source for that; I assume it creates the tables?

willier11:05:24

ah thanks! @U899JBRPF

👌 3
whilo18:05:43

@brownjoshua490 First of all, the file system backend got some improvements (Java's async NIO, basically) that are not in the official dependencies yet, because we want to provide a seamless migration experience with @konrad.kuehne's work on https://github.com/replikativ/wanderung/tree/8-dh-dh-version-migration, which is almost done. So to get optimal performance you should add [io.replikativ/konserve "0.6.0-alpha3"] first. (The older store had too-small buffer sizes, so you basically hit Java's FileOutputStreams all the time. Other backends should not be affected by this problem.)
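If you are on tools.deps rather than Leiningen, the same override would look roughly like this; the konserve coordinate is the one mentioned above, and the datahike version is just an example:

;; deps.edn: pin the newer konserve alongside datahike
{:deps {io.replikativ/datahike {:mvn/version "0.3.6"}
        io.replikativ/konserve {:mvn/version "0.6.0-alpha3"}}}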

whilo18:05:00

(ns sandbox
  (:require [datahike.api :as d]
            [taoensso.timbre :as t]))

(comment

  (t/set-level! :warn)

  (def schema [{:db/ident       :age
                :db/cardinality :db.cardinality/one
                :db/valueType   :db.type/long}])

  (def cfg {:store  {:backend :file :path "/tmp/datahike-benchmark"}
            :keep-history? false ;; true
            :schema-flexibility :write
            :initial-tx schema})

  (d/delete-database cfg)

  (d/create-database cfg)

  (def conn (d/connect cfg))

  (time
   (do
     (d/transact conn
                 (vec (for [i (range 100000)]
                        [:db/add (inc i) :age i])))
     nil))

  ;; "Elapsed time: 7387.64224 msecs"
  ;; with history: "Elapsed time: 14087.425566 msecs"

  (d/q '[:find (count ?a)
         :in $
         :where [?e :age ?a]]
       @conn) ;; => 100000
  )

whilo18:05:00

This is the behaviour on my machine. It took me around 5 seconds (100k / 5 s = 20k datoms/sec) to transact this the last time I checked this microbenchmark 3 months ago, so we might have introduced a slight performance regression in the last releases, or my machine got slower.

whilo18:05:42

We currently do not write to the history index in parallel; doing that should bring the with-history number closer to the 7 seconds of the non-history run (which is why it is currently roughly double).

whilo18:05:46

Of course this is just a microbenchmark and, as I mentioned, it measures bulk throughput. We know how to add a buffer to the transactor so it saturates at a similar speed for finer transaction granularity, but we have not done that work yet.
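Until such a transactor-side buffer exists, a similar effect can be approximated on the client by batching many small writes into fewer, larger transactions. A minimal sketch of the idea, reusing conn from the example above; the 1000-datom batch size is an arbitrary choice:

;; naive: one transaction per datom, dominated by per-transaction overhead
#_(doseq [i (range 100000)]
    (d/transact conn [[:db/add (inc i) :age i]]))

;; batched: group datoms into larger transactions to approach bulk throughput
(doseq [batch (partition-all 1000
                             (for [i (range 100000)]
                               [:db/add (inc i) :age i]))]
  (d/transact conn (vec batch)))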

whilo18:05:22

@brownjoshua490 You should definitely see better performance, around 4-5k datoms/sec, even on the old file store though. Maybe your datoms contain a lot of data?

Josh19:05:17

@whilo Thanks for the example. Our datoms are string-heavy, lots of strings that are ~100-200 chars. I ran that example code a couple of times; here are my results

Josh19:05:16

I’m still seeing times 2-4x slower than the ones you have. Were you on 0.3.3?

whilo21:05:59

I am on 0.3.6 (current master).

whilo21:05:33

@brownjoshua490 Maybe the string serialization is costing us. It could also be our crypto hashing (which is optional). Can you provide either a data set or a representative workload that I can test?

whilo21:05:33

It definitely looks like a constant overhead here, because the with-history time is now close to the without-history time.

whilo22:05:07

I created 100k shuffled strings of length 300 from https://github.com/replikativ/zufall/blob/master/src/zufall/core.clj#L4, randomized the insertion order, and used a different filesystem, and I still get similar performance (7.3 secs to transact). @brownjoshua490 Not sure what to make of this.
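For reference, a string-heavy variant of the earlier benchmark could look roughly like this. The rand-string helper is a stand-in for the zufall-based generator mentioned above, the :name attribute and 300-char length are assumptions taken from the discussion, and conn is assumed to be connected to a database whose schema includes :name (e.g. via :initial-tx):

(def string-schema [{:db/ident       :name
                     :db/cardinality :db.cardinality/one
                     :db/valueType   :db.type/string}])

(defn rand-string [n]
  ;; stand-in for a zufall-style random-string generator
  (apply str (repeatedly n #(rand-nth "abcdefghijklmnopqrstuvwxyz "))))

(def strings (shuffle (repeatedly 100000 #(rand-string 300))))

(time
 (do
   (d/transact conn
               (vec (map-indexed (fn [i s] [:db/add (inc i) :name s])
                                 strings)))
   nil))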