#datomic
2020-04-03
kenny15:04:59

Hi. We are looking into the best practices for loading lots of data into Datomic Cloud. I have seen the documentation on pipelining transactions for higher throughput: https://docs.datomic.com/cloud/best.html#pipeline-transactions. From that section: "Data imports will run significantly faster if you pipeline transactions using the async API, and maintain several transactions in-flight at the same time." The example that follows does not use the Datomic async API. Why is that? Should it use the async API to achieve higher throughput? Are there any additional best practices or things to look out for when loading thousands of entities into Datomic Cloud?
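[A minimal sketch of the pipelining that docs section describes, using core.async around the sync client API; conn, tx-batches, and parallelism are assumed names, not from the docs, and a real import would wrap d/transact in retry logic like the backoff sketch further down:]

(require '[clojure.core.async :as a]
         '[datomic.client.api :as d])

;; Keep `parallelism` transactions in flight. `tx-batches` is assumed
;; to be a seq of tx-data vectors.
(defn pipeline-transact [conn tx-batches parallelism]
  (let [in  (a/to-chan! tx-batches)
        out (a/chan parallelism)]
    (a/pipeline-blocking parallelism
                         out
                         (map (fn [tx-data]
                                (d/transact conn {:tx-data tx-data})))
                         in)
    ;; Block until every batch has landed, returning the tx results.
    (a/<!! (a/into [] out))))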

ghadi16:04:53

one key for batch imports is to always put retries around your transaction calls

kenny16:04:30

I assume the typical exponential backoff + jitter on a retriable anomaly?

ghadi16:04:40

yea that works
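[A sketch of that backoff, assuming the Cloud client surfaces anomalies as ex-info data; the retriable categories, max-retries, and base-ms here are my guesses to tune, not a recommendation from the thread:]

(require '[datomic.client.api :as d])

(def retriable?
  #{:cognitect.anomalies/busy
    :cognitect.anomalies/unavailable
    :cognitect.anomalies/interrupted})

(defn transact-with-retry
  [conn tx-data {:keys [max-retries base-ms]
                 :or   {max-retries 5 base-ms 100}}]
  (loop [attempt 0]
    (let [r (try
              {:ok (d/transact conn {:tx-data tx-data})}
              (catch Exception e
                (if (and (retriable? (:cognitect.anomalies/category (ex-data e)))
                         (< attempt max-retries))
                  ::retry
                  (throw e))))]
      (if (= ::retry r)
        (do
          ;; exponential backoff + jitter
          (Thread/sleep (+ (* base-ms (bit-shift-left 1 attempt))
                           (rand-int base-ms)))
          (recur (inc attempt)))
        (:ok r)))))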

kenny16:04:59

Give up after 3 retries or more?

ghadi16:04:21

can't say without knowing your loads

kenny16:04:55

What function would you use to calculate the number of retries, given a particular datoms/transaction and transactions/second?

ghadi16:04:52

The example of transaction pipelining in the docs does not include backing off on retriable anomalies

ghadi16:04:07

I always do some back-of-the-napkin estimation of the number of transactions and the number of datoms per transaction
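[For instance, with purely illustrative numbers of my own:]

;; Hypothetical napkin math: 10M datoms at ~5,000 datoms/tx is
;; 2,000 transactions; at ~20 tx/sec the import takes ~100 seconds.
(/ (/ 1e7 5000) 20)  ;=> 100.0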

kenny16:04:15

Is there a recommended number of datoms/transaction?

ghadi16:04:14

rough order of 1000-10000 datoms

ghadi16:04:48

this is me talking, not the datomic team

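[One way to land in that range, assuming a seq of entity maps averaging around five datoms each; the numbers and names are illustrative:]

;; ~1,000 maps per batch ≈ ~5,000 datoms per transaction.
(def tx-batches (partition-all 1000 entity-maps))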
kenny16:04:45

Why a whole order of magnitude of difference?

John Leidegren09:04:47

@kenny I'm going to guess compression. Some datoms compress better than others, so if your payload is very compressible, you'd get away with putting more datoms in each log segment. If you have really huge log segments, you may run into limitations in storage. For example, the DynamoDB backend has a limit of 400 KiB per value. A log segment (i.e. a transaction) larger than that cannot be committed into storage.

ghadi16:04:56

good practice to label the transactions themselves with some metadata

ghadi16:04:27

[:db/add "datomic.tx" :db/doc "kenny did this, part 1/15"]

kenny16:04:52

Hmm, yeah. In this case I won't know the denominator of the part fraction there. I can still label them, though.

ghadi16:04:42

or have stronger idempotence markers in the DB metadata
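[One shape such a marker could take, with a hypothetical :import/batch-id attribute installed in the schema; the attribute and helper names are mine:]

;; Label each transaction with its batch id...
(def tx-data
  (conj batch [:db/add "datomic.tx" :import/batch-id "load-0007"]))

;; ...then skip batches that already landed before retransacting.
(defn batch-imported? [db batch-id]
  (boolean
   (seq (d/q '[:find ?tx
               :in $ ?batch-id
               :where [?tx :import/batch-id ?batch-id]]
             db batch-id))))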

kenny16:04:40

Also still curious as to whether I should be using the Datomic Cloud async API for maximal throughput.

Brian22:04:09

Hello! I'm using pull in my Datomic query and want to blend it with an :as statement to change the name of something. My data looks like this: {:category {:db/ident abcd}}. I can use pull to grab it with [{:category [:db/ident]}], which returns a structure like the above. I can rename :db/ident by pulling like this: [{:category [[:db/ident :as :hello]]}]. This returns {:category {:hello abcd}}; however, what I would love to be able to do is have it return {:hello abcd}, essentially renaming that whole path. I'm doing this within a larger query using pull. I tried pulling this specific part out of the query with the :keys option:

(d/q '[:find (pull ?e [:1 :2 :3]) ?ident
       :keys :nums :hello
       ...
       [?cat :db/ident ?ident]
       ...)
but this ends up improperly nesting my return values, because I don't want the :nums part, and I want the :hello part to be in the same map as [:1 :2 :3]. Combining all of the above, I tried something like
(d/q '[:find (pull ?e [:1 :2 :3
                       [{:category [:db/ident]} :as :hello]]) 
       ...)
However, this didn't work, and I suspect it isn't possible because pull doesn't know that I'm guaranteed to have a single value at the very end and not a vector somewhere in there. Am I right that it's impossible to tell pull to drill down to that last value and return only it under a new specific key? Is it possible to do what I want some other way?

favila00:04:40

Pull can rename keys or default/limit values, but it cannot transform map shapes. You have to post-process.
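[For example, a post-processing step along these lines; lift-category is a hypothetical helper, and the sample map mirrors Brian's shape:]

;; Lift the nested ident up to the top level under :hello.
(defn lift-category [m]
  (-> m
      (assoc :hello (get-in m [:category :db/ident]))
      (dissoc :category)))

(lift-category {:1 1 :2 2 :3 3 :category {:db/ident :abcd}})
;;=> {:1 1, :2 2, :3 3, :hello :abcd}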