2024-11-15 datahike | Clojure Slack Archive

datahike

whilo 2024-11-15T02:14:32.505739Z

I get 30 ms ± 10 per assoc and 17 ms ± 5 per get on the store (with sync calls, async is already a bit slower).

whilo 2024-11-15T02:30:04.787039Z

It takes around 100 ms ± 20 for me to transact a map with 4 entries and around 30 ms ± 15 to d/q all the entities (for a small database).
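
For anyone who wants to collect comparable "mean ± spread" figures, here is a minimal sketch in plain Clojure. It is not necessarily how the numbers above were measured; `latency-stats` is a hypothetical helper, and it simply times a zero-argument function `n` times and reports the mean and standard deviation in milliseconds.

```clojure
(defn latency-stats
  "Calls the zero-argument function `f` `n` times and returns the mean and
  (population) standard deviation of the wall-clock time per call, in ms."
  [n f]
  (let [samples  (vec (repeatedly n
                                  #(let [start (System/nanoTime)]
                                     (f)
                                     (/ (- (System/nanoTime) start) 1e6))))
        mean     (/ (reduce + samples) n)
        variance (/ (reduce + (map (fn [x] (let [d (- x mean)] (* d d))) samples))
                    n)]
    {:n n :mean-ms mean :stddev-ms (Math/sqrt variance)}))
```

With the store from the example in this thread, a call could look like `(latency-stats 20 #(k/get-in store ["foo"] nil {:sync? true}))`. The first few samples include JIT and connection warm-up, so they may be worth discarding.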

whilo 2024-11-15T02:32:14.641699Z

;; Testing and usage example:
  (require '[konserve.core :as k]
           '[konserve-dynamodb.core :as kd]
           '[clojure.core.async :refer [<!!]])

  ;; DynamoDB configuration
  (def dynamodb-spec {:region "us-west-2"
                      :table "konserve-dynamodb2"
                      :consistent-read? true})

  ;; Connect to the store
  (def store (<!! (kd/connect-store dynamodb-spec :opts {:sync? false})))

  ;; Test inserting and retrieving data
  (time (k/assoc-in store ["foo"] {:foo "baz"} {:sync? true}))

  (time (k/get-in store ["foo"] nil {:sync? true}))

  ;; Release the store connection
  (kd/release store {:sync? true})

  (kd/delete-store dynamodb-spec :opts {:sync? true})


  ;; datahike.api provides the d/ alias used below
  (require '[datahike.api :as d]
           '[datahike-dynamodb.core])

  (def dynamoc-cfg {:store {:backend :dynamodb
                            :scope ""
                            :region "us-west-2"
                            :table "konserve-dynamodb2"
                            :consistent-read? true}
                    :allow-unsafe-config true})

  (d/create-database dynamoc-cfg)

  (def conn (d/connect dynamoc-cfg))

  (def schema [{:db/ident :screen/id
                :db/valueType :db.type/string
                :db/cardinality :db.cardinality/one}
               {:db/ident :screenshot/path
                :db/valueType :db.type/string
                :db/unique :db.unique/identity
                :db/cardinality :db.cardinality/one}
               {:db/ident :screenshot/transcript
                :db/valueType :db.type/string
                :db/cardinality :db.cardinality/one}
               {:db/ident :screenshot/created
                :db/valueType :db.type/instant
                :db/cardinality :db.cardinality/one}])

  (d/transact conn schema)

   ;; transact some dummy data
  (time
   (d/transact conn [{:screenshot/path (str "/some/fake_path/" (java.util.Date.))
                      :screen/id "454345782"
                      :screenshot/created (java.util.Date.)
                      :screenshot/transcript "This is a test transcript."}]))
  
  (swap! (:wrapped-atom conn) (fn [db] (update db :writer #(assoc % :streaming? false))))

  (time 
   (d/q '[:find ?c
          :where
          [?i :screenshot/created ?c]]
        @conn))

whilo 2024-11-15T02:33:37.022919Z

There are two problems: 1. When you call create-database the first time and the table is not there yet, it fails because it immediately tries to connect. One way to solve this is to wait in the table creation call until we can successfully check that the table exists. Alternatively you can precreate the table in DynamoDB, which I think should work. Solved with the latest release, it will wait until the table is there.

whilo 2024-11-15T02:36:27.486959Z

The 2. thing to be aware of is that you need to manually set (swap! (:wrapped-atom conn) (fn [db] (update db :writer #(assoc % :streaming? false)))) in the query context. In the example here you need to call it before each d/q call, because transact will overwrite it in the same memory context. d/q will then not fetch the latest snapshot first (because transact already keeps it in memory), giving you latencies around 1 ms, but this is not representative of the lambda use case you want to know about. In the lambdas you want the transactor to operate in its own singleton lambda, so you would just have to set it like this https://github.com/viesti/clj-lambda-datahike/blob/main/src/clj_lambda_datahike/core.clj#L36C7-L36C94.
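
A minimal sketch of that step as a helper, so it is harder to forget before a query. `disable-streaming!` is a hypothetical name, not part of Datahike; the body is the same swap! as in the message, and `:wrapped-atom` and `:streaming?` are taken from it as-is.

```clojure
(defn disable-streaming!
  "Sets :streaming? to false on the writer of the connection's current db
  value, using the same swap! as in the message above. Returns the connection
  so the call can be nested inside a query expression."
  [conn]
  (swap! (:wrapped-atom conn)
         (fn [db] (update db :writer #(assoc % :streaming? false))))
  conn)

;; Assumed usage in the benchmark, before each query:
;; (time (d/q '[:find ?c :where [?i :screenshot/created ?c]]
;;            @(disable-streaming! conn)))
```

Because transact resets the flag in the same process, the helper has to run again after every transact call in a single-REPL benchmark like the one above.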

whilo 2024-11-15T02:37:30.556229Z

I cannot say anything about SnapStart and I don't have time atm. to dive into this side of things, but I think the dynamodb backend looks acceptable latency-wise. Lmk if it is not. I assume there are still capacity and memory size limitations that one would have to work around, but as far as I know this is also true for Datomic.

whilo 2024-11-15T02:40:38.786629Z

Maybe one could say we now have a valid open-source solution similar to Datomic Cloud that can actually be hacked and adjusted to your specific use case.

whilo 2024-11-15T03:00:24.077009Z

And you should be able to run it fully in lambdas including the transactor, once we have sorted the startup problem (which might still be tricky).

whilo 2024-11-15T04:47:30.306579Z

I fixed the table creation problem, konserve now waits until it can successfully check that the table exists before the creation function returns. This is covered in the two new versions.

whilo 2024-11-15T04:57:17.767249Z

In comparison an S3 assoc takes 90 ms ± 20 and get takes 45 ms ± 10.

whilo 2024-11-15T05:00:19.548259Z

A transact of the same data costs approximately 400 ms ± 100 and a d/q costs approximately 45 ms ± 20 after a transact call (but with warm caches, so it is not loading the full index in either case).

whilo 2024-11-15T05:01:35.484419Z

I assume we can still speed things up a bit by tweaking dynamo, but the backends are not that far apart. (Note that this is for small test databases and exploration, nothing big yet).

whilo 2024-11-15T05:01:56.115799Z

@mpenet I think S3 might not be as bad for something like datahike as you thought.

whilo 2024-11-15T05:03:27.810149Z

@sasha_bogdanov_dev it would be interesting to know how well Datomic can perform with dynamo in terms of latency. I have zero experience with this.

whilo 2024-11-15T05:13:29.693639Z

I also would like to reiterate that the writer automatically batches transactions, so write latency does not linearly compound if you use d/transact! and send multiple transactions in parallel from different clients.

whilo 2024-11-15T05:30:56.872959Z

@sasha_bogdanov_dev If you can provide examples of direct API calls for GetItem or PutItem that provide lower latency (they claim it can be single digit ms), then we can just fix the konserve backend to use that to speed things up.

grounded_sage 2024-11-15T05:39:27.560889Z

If I have multiple threads doing d/transact does that batch? @whilo When would I use d/transact! over d/transact? Only when I want it async?

whilo 2024-11-15T05:54:19.656219Z

I think you should always use d/transact! if you don't need the result. There are cases where you only want to update other state after transact has returned, but you can do that also asynchronously in a callback or some other async programming paradigm such as core.async or missionary.

grounded_sage 2024-11-15T09:27:24.464079Z

All calls in electric to the db are done inside of e/offload which is a thread. So it seems that it would not make a difference. I’ve modelled things in such a way that I would say all the calls can happen independently without coordination for consistency.

whilo 2024-11-15T09:38:52.172989Z

If you use d/transact! then you shouldn't even need to offload and can just wait for the callback or use missionary's dataflow variable wrapper for core.async.

whilo 2024-11-15T09:45:23.042319Z

Like this https://github.com/whilo/simmis/blob/main/src/is/simm/runtimes/rustdesk.clj#L117

grounded_sage 2024-11-15T11:02:26.814939Z

Interesting that you have the task inside the transact data. I’ve decoupled all my processes. Then have another namespace that composes them. I see what you are speaking to though. Thanks for pointing to the code.

whilo 2024-11-15T16:48:43.743469Z

This is hacky prototype code with a lot of loose ends.

grounded_sage 2024-11-15T16:54:14.494669Z

Make it work stage ✨

whilo 2024-11-15T05:56:02.116999Z

I think d/transact is convenient when you develop interactively though.

whilo 2024-11-15T05:56:54.162239Z

Basically you can also transact in parallel with d/transact but then you need to run each call in its own thread that will be blocked. This is kind of unnecessary.

whilo 2024-11-15T06:00:13.662949Z

If you do a naive loop with d/transact to process otherwise independent requests then you will have to wait for all IO on each transaction. If you instead transact in parallel asynchronously from wherever you need to, then the writer will process at CPU capacity (similar to in-memory write performance), commit to durable storage whenever it can do so again, and then notify all the async requests that were part of the commit.
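
A rough sketch of the difference, assuming that d/transact! returns something derefable (a promise-like value) once the commit is done. `transact-all!` is a hypothetical helper, not part of Datahike, and the transact function is passed in as an argument so the snippet runs without any dependencies.

```clojure
(defn transact-all!
  "Submits every transaction in `txs` through `transact-fn` before waiting on
  any result, so a batching writer can see all of them at once. `transact-fn`
  is assumed to take (conn tx-data) and return a derefable value."
  [transact-fn conn txs]
  (let [pending (mapv #(transact-fn conn %) txs)] ; mapv is eager: all submitted here
    (mapv deref pending)))                        ; only now block for the reports

;; Assumed usage with Datahike (needs datahike.api as d):
;; (transact-all! d/transact! conn
;;                [[{:screen/id "1"}]
;;                 [{:screen/id "2"}]])
;;
;; The naive loop for comparison, which waits for IO on every single call:
;; (mapv #(d/transact conn %) txs)
```

The point of submitting first and dereferencing afterwards is that no caller thread sits blocked per transaction while the writer is committing.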

whilo 2024-11-15T06:00:47.368639Z

@grounded_sage if you have a concrete scenario I can also describe it in its terms.

whilo 2024-11-15T06:10:26.801459Z

Also if you need to do updates of data in the database, where you would first retrieve something, then transact a modified version, then maybe transact another modified thing etc., you can use transactor functions to do this during a single transact call. This will save you from having to do multiple roundtrips at the cost of potentially slowing down the transactor a bit. We have not done a lot with these functions in Datahike yet. Datomic requires you to explicitly install their code through a transact call (but also allows you to install Java etc.), while DataScript just allows you to invoke functions that are loaded in its runtime. So to do, let's say, a compare and swap, or to retrieve the balance of a bank account and then update it atomically, you can use such functions. Otherwise you would have to synchronize on the outside and that would serialize and slow down your process (parallel processes could still run though).
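
A minimal sketch of the bank account case, assuming the DataScript-style `[:db.fn/call f & args]` form, where the function receives the current db value as its first argument and returns transaction data. Whether Datahike accepts exactly this form is an assumption here. `withdraw-tx`, `read-balance` and `:account/balance` are hypothetical names for illustration.

```clojure
(defn withdraw-tx
  "Transactor-function sketch. `db` is the db value at transaction time,
  `read-balance` is a function (db, eid) -> number. Returns tx data that
  lowers the balance, or throws so the whole transaction is rejected."
  [db read-balance account-eid amount]
  (let [balance (read-balance db account-eid)]
    (when (< balance amount)
      (throw (ex-info "Insufficient funds"
                      {:account account-eid :balance balance :amount amount})))
    [[:db/add account-eid :account/balance (- balance amount)]]))

;; Assumed usage (needs datahike.api as d, and :account/balance in the schema):
;; (d/transact conn [[:db.fn/call withdraw-tx
;;                    (fn [db eid] (:account/balance (d/entity db eid)))
;;                    account-eid 50]])
```

Since the read and the write happen inside the writer, no other transaction can change the balance in between, which is what makes the update atomic without outside synchronization.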
