datomic 2018-06-14 | Slack Archive

jlmr06:06:38

Hi, I'm trying to set something up so that I can sync Datomic entities to ElasticSearch (with Datomic being the source of truth). I would like it to be able to "catch up" on transactions that may have happened while the process was offline (or for rebuilding the Elastic index at some point) as well as continuously keeping pace with transactions as they occur while the process is online. Right now I'm using tx-range for catch-up and tx-report-queue for keeping pace. I'll get the datoms out of the transactions and use them to pull entities for syncing to Elastic. This does seem to work to some extent, however more entities are indexed in Elastic than there are in Datomic. Some pieces of the solution are apperently still missing. I suspect I need pieces of code that: • Tell which entities where added between two t's: these need to be created. • Tell which entities where changed (attributes changed or entities retracted) between two t's: these need to be updated. • Tell which entities where deleted (ie. all attributes retracted) between two t's: these need to be removed. It would be great to get some tips on how I can do this or pointers to earlier material or solutions for similar problems. Thanks in advance!

val_waeselynck07:06:18

@jlmr I've implemented something like that for our stack. In our case, we do it in a batch (as opposed to streaming) fashion, every 30 min, therefore we use the Log API - but this could be applied to the txReportQueue if you wanted to do streaming.

val_waeselynck07:06:50

For simplicity, we don't make any difference between added and changed; in both cases, the whole document gets recomputed and upserted into ES.

val_waeselynck07:06:47

In our case, we're dealing with Customer entities. We detect additions/changes with a (cust-changed ?e ?customer) Datalog rule, in which ?e is an entity that appears in the Log data, and ?customer is a customer that gets affected by this change to ?e. We register an implementation of this rule for each data path leading to a change to the Customer.

val_waeselynck07:06:53

To detect deletions, we detect datoms of the form [?cust :customer/id _ _ false] - which happen iff the Customer gets deleted.

val_waeselynck07:06:11

About ES management:

jlmr07:06:12

Thanks @val_waeselynck, right now I'm using the same general idea to detect deletions, however I'm still unfamiliar with the idea of Datalog rules.

val_waeselynck07:06:28

1) do the updates in batches

val_waeselynck07:06:59

2) Your ES materialized view needs to maintain an 'offset' t - so that it can pick up where it left off

val_waeselynck07:06:17

@jlmr https://docs.datomic.com/on-prem/query.html#rules

jlmr07:06:56

I opted to use t as the external_version number for documents in elastic. That way newer versions get overwritten

jlmr07:06:38

I'll take a look at the Datalog rules as well

val_waeselynck07:06:19

Rules are not mandatory for doing this (you can also do plain old disjunction with or-join), but they will allow you to decouple your code.

val_waeselynck07:06:51

@jlmr I think you'll need something more than this external_version - you want to know at what t the whole materialized view was last updated, not one of its document (what if the last update consisted only of deletions?)

val_waeselynck07:06:54

I recommend to keep track of this t in a document of a dedicated type

val_waeselynck07:06:21

Also, if you're going to do batching, consider using 2 rolling ES indexes that you put behind an ES index alias - this will give you more consistency, as you'll never query an 'in progress' MV

jlmr07:06:21

good points!

jlmr07:06:45

Is there some code I could take a look at?

val_waeselynck07:06:00

No, sorry, all proprietary

val_waeselynck07:06:04

Do watch this if you haven't already, it will set your ideas straight: https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/

jlmr08:06:29

will take a look at it next week when I have time again for this project

jeroenvandijk09:06:58

Regarding datomic ions, will there be a version that accesses the Ion instances directly (through an AWS ELB) to support applications with higher http throughput requirements than AWS lambda offers by default?

jeroenvandijk09:06:15

My gut feeling says this is possible as it would be some different AWS cloudformation template + some http server functionality on the Ions node (that might already be part of the current setup)

jeroenvandijk09:06:44

The above question could also be me missing the point about the benefit of putting lambda in between

Chris Bidler20:06:26

Does the Peer library expose any logging about its use of Memcached servers?

jaret20:06:06

@chris_johnson there is memcache average, sum, samples.

jaret20:06:08

https://docs.datomic.com/on-prem/monitoring.html#transactor-metrics

donaldball20:06:42

Is there a semantic difference between [(missing? $ ?e :foo/bar)] and (not [?e :foo/bar _]) ?

Chris Bidler20:06:33

@jaret Are those not tracking transactor use of memcache? I’m looking at my peer usage

2018-06-14

Channels