#datomic
2023-03-21
Clément Ronzon 17:03:39

Hello 👋 I am designing an ETL in Clojure that is intended to run in a Datomic cluster. It will pull entries from a MySQL db located in the same AWS account and push them into the Datomic db. My first goal is to process 9M entries as fast as possible, hopefully in less than 5 hrs.
The strategy I'd like to implement is the following:
• pull a first batch of entries (with offset and limit) from the MySQL db
• process those entries, leveraging parallelism via the pmap function, to push them into the Datomic db
• pull the next batch of entries
• process that next batch
• etc., until the MySQL query returns no more entries
In my first POC I discovered a few things:
• sometimes Datomic returns errors such as :cognitect.anomalies/busy, Busy indexing
• other times it returns errors such as :cognitect.anomalies/fault, :datomic.client-spi/exception java.lang.NullPointerException
If I remove parallelism, changing pmap to map, those issues don't happen, though the ETA is about 30 hrs.
Is it a bad idea to try to parallelize the ingestion? Does anyone have an idea why this could happen? Any suggestions?

rolt 17:03:51

I did the same task a few years ago. From what I remember: transact batches of 1000 datoms, use transact-async (but deref the value), keep 10 or so requests in flight (I think I used claypool), and back off on errors. I think I found some docs on the official website, or maybe on GitHub, about optimising this process. This was a one-shot migration for me so I didn't push it too far; I just wanted low downtime.

Clément Ronzon 00:03:46

nice! tyvm for the hints!
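The advice above (small batches, a bounded number of requests in flight, backoff on errors) can be sketched roughly as follows. This is only an illustration under stated assumptions: `with-backoff`, `load-batches!` and `transact-batch!` are hypothetical names, and the sketch uses a plain `java.util.concurrent` fixed thread pool rather than claypool or `transact-async`. The `retryable?` predicate is left to the caller; with the Datomic client it would presumably inspect the anomaly category (such as `:cognitect.anomalies/busy`) in `ex-data`, but that is an assumption to verify.

```clojure
(import '(java.util.concurrent Executors))

(defn with-backoff
  "Calls (f), retrying with exponential backoff when it throws.
  `retryable?` decides whether an exception is worth retrying."
  [f {:keys [max-retries base-ms retryable?]
      :or   {max-retries 5 base-ms 100 retryable? (constantly true)}}]
  (loop [attempt 0]
    (let [result (try
                   {:ok (f)}
                   (catch Exception e
                     (if (and (< attempt max-retries) (retryable? e))
                       {:retry e}
                       (throw e))))]
      (if (contains? result :ok)
        (:ok result)
        (do
          ;; Sleep base-ms * 2^attempt before trying again.
          (Thread/sleep (long (* base-ms (Math/pow 2 attempt))))
          (recur (inc attempt)))))))

(defn load-batches!
  "Runs (transact-batch! batch) for every batch on a fixed pool, so at most
  `in-flight` calls run at once. Blocks until all are done and returns the
  results in batch order. `transact-batch!` is a hypothetical function that
  wraps the actual Datomic transact call."
  [transact-batch! batches {:keys [in-flight] :or {in-flight 10} :as opts}]
  (let [pool (Executors/newFixedThreadPool in-flight)]
    (try
      (let [tasks   (mapv (fn [batch]
                            (fn [] (with-backoff #(transact-batch! batch) opts)))
                          batches)
            futures (.invokeAll pool tasks)]
        ;; .get rethrows (wrapped) if a batch failed after all retries.
        (mapv #(.get %) futures))
      (finally
        (.shutdown pool)))))
```

A call might look like `(load-batches! my-transact! (partition-all 1000 tx-data) {:in-flight 10})`, where `my-transact!` and `tx-data` are placeholders. Unlike `pmap`, the pool size here is explicit, which makes it easier to tune the load on the transactor.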

frankitox 19:03:22

What's the usual way to get the datoms using a transaction id? I'm thinking of using tx-range

frankitox 19:03:28

Something like (d/tx-range (d/log (db/_conn conn)) tx (inc (d/tx->t tx)))

Gustavo A. 19:03:28

maybe you could use: (d/pull db '[*] your-tx-id)

favila 04:03:01

By “the datoms” do you mean the datoms that were asserted/retracted in that transaction, or the datoms that are asserted/retracted on the transaction entity?

favila 04:03:04

If the former, then use tx-range; if the latter, use (d/datoms db :eavt tx) or d/pull or d/entity on the tx entity id.

favila 04:03:16

The former is datoms where the :tx slot matches the tx; the latter is datoms where the :e slot matches the tx.

frankitox 20:03:46

It was the former. I'll use tx-range then, thank you!
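The :tx slot versus :e slot distinction in this thread can be illustrated with plain Clojure data. This is a toy model, not the Datomic API: the datoms are ordinary maps, the ids and attribute values are made up, and `datoms-in-tx` and `datoms-about-tx` are hypothetical helpers that only mimic what tx-range and an :eavt lookup on the tx entity would respectively return.

```clojure
;; Toy datoms as plain maps; ids and values are invented for illustration.
(def tx-id 1001)

(def datoms
  [{:e 1001 :a :db/txInstant :v #inst "2023-03-21" :tx 1001 :added true}
   {:e 42   :a :user/name    :v "Ada"              :tx 1001 :added true}
   {:e 43   :a :user/name    :v "Bob"              :tx 1000 :added true}])

(defn datoms-in-tx
  "Datoms asserted/retracted in the transaction: the :tx slot matches."
  [datoms tx]
  (filterv #(= tx (:tx %)) datoms))

(defn datoms-about-tx
  "Datoms on the transaction entity itself: the :e slot matches."
  [datoms tx]
  (filterv #(= tx (:e %)) datoms))

;; Entities touched by the transaction (includes the tx entity itself).
(mapv :e (datoms-in-tx datoms tx-id))
;; Attributes stored on the tx entity only.
(mapv :a (datoms-about-tx datoms tx-id))
```

In this model the transaction's own :db/txInstant datom shows up in both results, because its :e and :tx slots both point at the transaction entity.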