
Hi, I have started experimenting with Datomic Analytics. The intention is to use the bundled Presto server to run ad-hoc report queries, e.g. number of purchases where the time window is between x and y (time of purchase is stored as a fact). My transactor is running on-prem.
• Where do I put the metaschema files?
◦ Do they have to be on the peer or the transactor?
• Does the Presto server need to run on the transactor?
◦ Can I have a separate deployment of the Presto server?
I would greatly appreciate it if anyone could nudge me toward a tutorial/walkthrough for setting up Datomic Analytics, since the documentation is really thin. cc: @dazld


At a very high level, datomic analytics is a presto/trino installation with a datomic client api connector.


it is a separate process: you can run it anywhere that is network-connected to a peer server


> Metaschema files are .edn files in the datomic subdirectory of Trino’s etc-dir. Metaschema files can have any name you find convenient, and Datomic analytics will automatically associate metaschemas with any database that has matching attributes.
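For illustration only (the attribute and file names here are hypothetical, and the exact metaschema shape should be checked against the analytics reference), a metaschema file dropped into `etc/datomic/` might look roughly like:

```clojure
;; etc/datomic/retail.edn -- a hypothetical metaschema sketch
;; :tables maps Datomic attributes to tables exposed through presto/trino;
;; the shape here is an approximation, see the metaschema docs for specifics
{:tables
 {purchase/id    {}   ;; entities carrying :purchase/id become a "purchase" table
  purchase/items {}}}
```

Because association is by matching attributes, the same file can apply to any database in the system that has those attributes.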


From the docs


So you need a “normal” datomic system (cloud or on-prem). If using on-prem, you also need a peer-server running (it’s a peer process that provides the client api; cloud provides the client api out of the box). Then you add datomic analytics (presto/trino) and point it at the thing that provides the client api (the peer-server for on-prem, the cloud service itself for cloud).
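A rough on-prem deployment sketch, with the hostnames, access key/secret, and database URI below all placeholder assumptions:

```shell
# 1. start a peer-server that exposes the client api on top of the
#    existing transactor/storage (key, secret, and URI are placeholders)
bin/run -m datomic.peer-server -h peer-host -p 8998 \
        -a myaccesskey,mysecret \
        -d mydb,datomic:dev://transactor-host:4334/mydb

# 2. on any machine that can reach peer-host:8998,
#    start the bundled presto server
presto-server/bin/launcher run
```

The two processes only need network connectivity to each other; presto does not have to live on the transactor host.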


probably worth mentioning that if datomic's datalog isn't a barrier then setting up presto is just pure overhead


presto can handle much larger queries than datomic’s current datalog implementation, and it has much richer aggregation options
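For the original ad-hoc report (purchase counts in a time window), the analytics side is plain SQL. The catalog/schema/table/column names below are hypothetical; they depend on your catalog properties file and metaschema:

```sql
-- count purchases between two instants; all names are illustrative
SELECT count(*) AS purchase_count
FROM datomic.mydb.purchase
WHERE time_of_purchase
      BETWEEN timestamp '2024-01-01 00:00:00'
          AND timestamp '2024-02-01 00:00:00';
```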


the connector uses memory-efficient divide-and-conquer strategies (using undocumented functions that partition attribute indexes into ranges) so that the intermediate result sets in datalog don’t OOM. An equivalent naive datalog query can easily just take too much memory to complete


of course the connector is implemented with the client api so yes, in theory you are right. In practice however, analytics can handle queries with much bigger intermediate result sets using less memory, and often faster wallclock time because of parallelism and reduced memory pressure


Interesting, I don't remember seeing anything about the performance/efficiency of the connector in the docs (maybe it's included now). So yeah, since it requires the peer server, I assumed that's where the bottleneck would be (and maybe storage, depending on what one uses)


The peer server can still be a bottleneck.


but the real bottleneck is that the datomic query engine is not that smart


got it, do you use it in production for non-analytics?


we use it for non-analytics, but not in production


well, maybe it’s considered analytics uses. We’re not doing it for business purposes but for schema maintenance, checking cardinality, counting, etc


data integrity, histograms


that kind of stuff


anything that isn’t a selective query


the equivalent datalog usually doesn’t work at all. d/datoms or index-pull can often do it, but it’s much more thinking and typing
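As a sketch of that kind of non-selective maintenance query (again with hypothetical table and column names), a quick value histogram for one column is a one-liner in SQL:

```sql
-- hypothetical: value distribution for one column,
-- e.g. for cardinality or integrity checks
SELECT status, count(*) AS n
FROM datomic.mydb.purchase
GROUP BY status
ORDER BY n DESC;
```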


so I hold my nose and type the SQL


datomic can become a heavy operational burden


maybe they will ease it with cloud and provide a pre-configured trino, but it's still another process to monitor


by data integrity you mean using trino to check for corrupt data?


checking invariants


Thanks a ton @U09R86PA4 and @U01KZDMJ411. One question though: the metaschema edn files need to be in datomic-pro-<version>/presto-server/etc/datomic/, right?