#datomic
2021-09-01
Jakub Holý (HolyJak)11:09:49

What is the canonical way to check that an attribute value is in a particular set? Just using a set as a fn, as in:

[:find ?e :where [?e :person/favourite-colour ?val] [(#{:blue :green} ?val)]]
? 🙏

schmee11:09:21

I believe you can use ground for this too 😄

Jakub Holý (HolyJak)13:09:50

Hi, thank you! But what if I want to express something like "find the entity (a person) that is a descendant of #{Charlemagne, Leonardo da Vinci} and is married to one of #{Gandhi, M.L.King}"? I could use ground for both sets, but would that not effectively compute a cartesian product of the values from both sets? I have no idea how it works underneath, so maybe I am not making sense...

favila13:09:19

(comment
 ;; Set filtering cannot be introspected by the query engine
 ;; This can be good if the set is large
 ;; and there's no index datomic could use
 ;; to retrieve matching datoms.
 ;; Evaluation cannot be parallel,
 ;; but the intermediate result set will be smaller
 ;; and none of the unification machinery will get involved. 
 
 ;; As a literal:
 [:find ?e
   :where
   [?e :person/favourite-colour ?val]
   [(#{:blue :green} ?val)]]
 
 ;; As a parameter:
 [:find ?e
   :in $ ?allowed-val-set
   :where
   [?e :person/favourite-colour ?val]
   [(contains? ?allowed-val-set ?val)]]
 #{:green :blue}
 
 ;; Using unification
 ;; If you bind the items you are filtering by to a var
 ;; datalog will perform filtering implicitly via unification.
 ;; This is good if your filter value is indexed,
 ;; because now the query planner can see it
 ;; and possibly use better indexes or parallelize IO.
 ;; However, this may produce larger intermediate result sets
 ;; and consume more memory because of unification.
 
 [:find ?e
  :where
  ;; Could use an index
  [(ground [:green :blue]) [?val ...]]
  [?e :person/favourite-colour ?val]]

 [:find ?e
  :where
  ;; Reverse clause order:
  ;; Now it *probably doesn't* use an index?
  ;; Depends on how smart the planner is.
  ;; Worst-case, it's as bad as a linear exhaustive
  ;; equality check of each val
  ;; which may or may not be worse than a hash-lookup
  ;; depending on the size of the set.
  [?e :person/favourite-colour ?val]
  [(ground [:green :blue]) [?val ...]]]
 
 ;; As a parameter:
 [:find ?e
   :in $ [?val ...]
   :where
   [?e :person/favourite-colour ?val]]
 [:green :blue]
 
 ;; Use a rule with literals
 ;; In most cases this will be the same as the previous approach,
 ;; but without the "maybe"s because you don't need to trust the query planner.
 ;; This is the most explicit and predictable,
 ;; and definitely parallelizable (rules inherently are).
 ;; But you *must* use literal values.
 [:find ?e
  :in $
  :where
  (or [?e :person/favourite-colour :green]
      [?e :person/favourite-colour :blue])]
 
 
;; In any given case I would benchmark all three.

 )
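
One plain-Clojure difference between the literal-set form and the contains? form above may matter if the allowed set can ever contain false or nil: a set called as a function returns the matched element, not a boolean. The sketch below runs in a bare REPL and involves no Datomic. That a Datomic predicate clause would then drop such rows is an assumption, based on predicate clauses filtering on truthiness.

```clojure
;; Plain Clojure, no Datomic required.
(def allowed #{:blue :green})

;; A set invoked as a function returns the matching element, or nil.
(allowed :blue)             ;; :blue
(allowed :red)              ;; nil

;; contains? always returns a boolean.
(contains? allowed :blue)   ;; true

;; Edge case: a falsy member.
(def flags #{false})
(flags false)               ;; false (the element itself, which is falsy)
(contains? flags false)     ;; true
```

For keyword sets like the colours in this thread the two forms behave the same.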

favila13:09:09

Summary: There are three different basic techniques, and they can have dramatically different performance depending on the situation.

Jakub Holý (HolyJak)14:09:38

thank you so much! You are a real treasure. I wish this information was available in the official docs...

souenzzo15:09:00

Using Datomic on-prem in ~2018, I had an issue when using a set as a parameter: if you send both the query and its parameters to a Datomic function running on a real transactor (not in-memory), your set turns into an array and the query throws, but only in production code.

;; As a parameter:
 [:find ?e
   :in $ ?allowed-val-set
   :where
   [?e :person/favourite-colour ?val]
   [(contains? ?allowed-val-set ?val)]]
 #{:green :blue}

favila18:09:13

I think I’ve been doing this in prod for at least 5 years without problems

souenzzo18:09:22

"Send both query and parameters to a datomic function" as in [... tx-data .. [:my-custom-db-fn [.. query ...] [.. args ..]]]. I used to have a db-fn that received a query and args, ran the query, and threw if the result was not empty. A really convenient db-fn for solving race conditions.

favila18:09:08

Ah I see. So sets turned to vectors (maybe arraylists)?
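
If the set really does arrive as a vector or a java.util.ArrayList (an assumption; the thread does not confirm the exact type), plain Clojure behaves as sketched below, which would account for either a throw or a silent mismatch. Coercing with set before the lookup is one defensive option. `allowed?` is a hypothetical helper name, not part of any Datomic API.

```clojure
;; Plain Clojure, no Datomic required.
(def as-vector [:green :blue])
(def as-arraylist (java.util.ArrayList. [:green :blue]))

;; On a vector, contains? checks indices, so a keyword is never found.
(contains? as-vector :green)   ;; false

;; On an ArrayList with a non-numeric key, contains? throws.
(def arraylist-result
  (try
    (contains? as-arraylist :green)
    (catch IllegalArgumentException _ :threw)))

;; Hypothetical helper: coerce whatever arrived back into a set first.
(defn allowed? [allowed-vals v]
  (contains? (set allowed-vals) v))

(allowed? as-arraylist :green) ;; true
```

Rebuilding the set on every call costs time proportional to the collection size, so for large collections it is probably better to coerce once before the query runs.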

Jakub Holý (HolyJak)13:09:29

When I "Use :db/ident for Enumerations" (https://docs.datomic.com/on-prem/schema/schema-modeling.html#enums), the only way to enforce that a :db.type/ref attribute has only one of the enum values I want (imagine :color/green etc.) is to install a :db.attr/preds on that attribute, with a custom predicate function that compares the supplied value against a hardcoded set. Correct?

favila13:09:00

You can’t really use :db.attr/preds because it receives the value and no db

favila13:09:03

You need the more general :db/ensure mechanism

favila13:09:19

or, just trust the application to do the right thing?

Jakub Holý (HolyJak)13:09:28

I could, if I hardcode the values in the predicate, no?

(defn color? [v] (contains? #{:color/green, ...} v))
Even if I had the DB I would not know how to "find all the defined colors" as I can hardly search for all idents starting with :color/ ? Our experience is that the app breaks the trust and having multiple layers of checks is desirable 😅

favila13:09:28

The value will be an entity ID, not a keyword

favila13:09:33

that the predicate receives

favila13:09:51

if you want to use a keyword type instead of an enum, then you can use a predicate
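
To make the point about what the predicate receives concrete: a hardcoded keyword set can only match when the attribute's value really is a keyword. The sketch below is plain Clojure. The colour keywords and the sample entity id are made up for illustration, and wiring the function into :db.attr/preds is not shown.

```clojure
;; Hypothetical enum values, for illustration only.
(def colors #{:color/green :color/blue})

(defn color?
  "True when v is one of the hardcoded colour keywords."
  [v]
  (contains? colors v))

;; Keyword-typed attribute: the predicate sees the keyword itself.
(color? :color/green)     ;; true

;; Ref-typed attribute: the predicate would see an entity id (a long),
;; which can never be a member of a keyword set.
(color? 17592186045418)   ;; false
```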

favila13:09:37

Whether to use a ref or a keyword type for enums is really a tradeoff. Being able to use :db.attr/preds is one factor in that tradeoff.

Jakub Holý (HolyJak)13:09:55

I see. Thanks a lot for the clarifications!

favila13:09:56

others are: can you represent the value as a keyword easily with d/pull?

favila13:09:08

(keyword: yes, ident no)

Jakub Holý (HolyJak)13:09:34

So what pros do these idents have?

favila13:09:53

idents are entities, so you can assert additional information on them

favila13:09:02

and you get a VAET index of them

favila13:09:32

and because of the semantics of ident lookup you can rename them safely

Jakub Holý (HolyJak)13:09:30

it would be awesome if the https://docs.datomic.com/on-prem/schema/schema-modeling.html#enums explained these things 🙏 (so people would stop bothering you with the same questions 😅)

favila13:09:14

you can also add your own higher-level, introspectable schema layer more easily if the enums are idents. You could have the attr itself reference the allowed ranges in a queryable way (vs being locked inside a predicate fn)

favila13:09:17

I think the blanket “use ident entities for enums” advice dates from before d/pull and attribute+entity predicates

favila13:09:36

the d/entity api represents ident entities as keywords when you navigate to them

favila13:09:08

and because there was no native higher-level predicate enforcement mechanism there were really no other considerations

favila13:09:39

in that world the “keyword” choice is strictly less powerful

JohnJ15:09:50

the d/pull thing with enums is definitely annoying 😉

souenzzo15:09:32

On pull, having :my-ref-to-enum {:db/ident :my-enum} is actually better than :my-ref-to-enum :my-enum for UI/Fulcro developers. It allows you to ask for labels in the same query in which you ask for enums: [{:my-ref-to-enum [:db/ident :app.ident/label :app.ident/description]}]

stuarthalloway18:09:42

Datomic Cloud "Backup/Restore" FAQ:

kenny19:09:33

Hi Stuart. Thank you very much for an official response on this topic. We too have been using Cloud for several years and backup & restore has been an ongoing struggle. Since this is such a common topic, it would be awesome to add a page to the Cloud documentation with the information you have laid out below.

stuarthalloway20:09:45

Hi Kenny. Agreed -- We will update the docs once this conversation is complete.

stuarthalloway18:09:35

1. Is my data safe against individual hardware failures? Very much yes -- Datomic Cloud stores data to multiple AWS storage services, each of which is itself redundant.

stuarthalloway18:09:47

2. Does Datomic Cloud have a backup/restore feature that makes a complete second copy of a database? No, but we are looking into options for this.

kenny19:09:35

If Datomic Cloud provided an official backup/restore feature, would it take the same path Tony took?

kenny19:09:54

I saw 🙂 I was hoping for more technical info on how Datomic might approach this from that product space, whatever they can share publicly, ofc.

Daniel Jomphe19:09:56

Thanks for your own questions and comments, btw, kenny!

stuarthalloway20:09:56

@U083D6HK9 Those answers are on the other side of the design process, so I don't know yet.

stuarthalloway18:09:44

3. In the absence of full backup/restore, how can I make a complete second copy of a Datomic database? You can use the log API to get a complete, time ordered copy of the data and save it anywhere, and/or transact it into a second Datomic system as you go. I have not read the code and so cannot comment on correctness, but @tony.kay’s https://github.com/fulcrologic/datomic-cloud-backup demonstrates the idea.

kenny19:09:19

We have a solution conceptually identical to Tony's. I ran it on one of our production databases (100+m datoms). While ours does not hit a crash like Tony has seen, the speed at which it moves is far too slow. We were looking at multiple weeks of 24/7 run time to do a copy from start. Both the source VM and the target Cloud system were running at below 10% CPU utilization. I do not know of a way around this due to the inherently serial nature of a restore via the log API. The total bytes transferred over the network was also incredibly low. Are there any methods to improve the full restore time with this method or is that just how it goes?

stuarthalloway20:09:59

@U083D6HK9 Are you running inside an ion so that you can eliminate one hop of network latency?

kenny20:09:15

That test was not running in an Ion. I was running locally through my REPL to a Cloud system. Although I don't have concrete evidence readily available, I do not think network latency is the bottleneck.

stuarthalloway18:09:34

4. Why can't I just copy Datomic's S3 and/or DDB tables to another region? Datomic uses region-specific KMS keys for encryption, so copying data at the storage level requires (at least in the naive case) decrypt and re-encrypt steps.

Daniel Jomphe19:09:07

I intuit that this would be the most practical solution. With the con that it wouldn't allow to filter anything in the process. Still, might be a great way to provide a feature that would be used by clients. OTOH I'm happy to see what product-space solution you come up with.

stuarthalloway18:09:33

5. Can I create a new Datomic database and guarantee the same :db/id values as an existing database? Not at present. Also something we are looking at.

kenny19:09:32

Most often a complete restore of the history is desirable, however, we do have some use cases where just "restoring" the current state of the db -- no history -- would be sufficient. The only reason to consider such an approach, for our use cases, is restore speed. If a complete restore would be just as fast as a current state restore, then we'd prefer the former. Given that a complete restore is substantially slower than a current state restore, we wrote a process to do the latter. This process, named a "current state restore," is still bottlenecked by the need to map old ids to new ids, preventing us from applying any parallelism. With the guarantee that a new database would have the same :db/id values as an existing database, the bottleneck could be removed, allowing us to parallelize as much as the target system would allow for.

stuarthalloway20:09:18

How does your "current state restore" organize transactions?

kenny20:09:20

At a high level: 1) reads the eavt index into batches, 2) resolves each batch of datoms against the current old-id->new-id mapping, 3) transacts the resolved datoms, and 4) updates that mapping. The process must be serial due to not knowing how to update the mapping until after the transaction is done.
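
The four steps above can be sketched in plain Clojure to show where the serial dependency comes from. This is a toy model under stated assumptions, not the actual restore code: datoms are [e a v] vectors, only entity ids are remapped (ref values are ignored), and `fake-transact!` is a hypothetical stand-in for d/transact that just hands out new ids.

```clojure
;; Sketch only, no Datomic required.

(defn resolve-batch
  "Step 2: rewrite each datom against the mapping known so far.
  Old ids that are not yet mapped become string tempids."
  [old->new batch]
  (mapv (fn [[e a v]] [:db/add (get old->new e (str e)) a v]) batch))

(defn fake-transact!
  "Hypothetical stand-in for d/transact: allocates one new id per
  distinct tempid and reports the allocation under :tempids."
  [next-id tx-data]
  (let [tempids (distinct (filter string? (map second tx-data)))]
    {:tempids (into {} (map (fn [t] [t (swap! next-id inc)])) tempids)}))

(defn current-state-restore
  "Steps 1-4 as a serial reduce: each batch needs the mapping
  produced by transacting the batch before it."
  [datoms batch-size]
  (let [next-id (atom 1000)]
    (reduce (fn [old->new batch]
              (let [tx-data (resolve-batch old->new batch)
                    {:keys [tempids]} (fake-transact! next-id tx-data)]
                ;; Step 4: fold the newly allocated ids into the mapping.
                (into old->new
                      (map (fn [[t id]] [(Long/parseLong (str t)) id]))
                      tempids)))
            {}
            (partition-all batch-size datoms))))

(current-state-restore [[1 :name "a"] [2 :name "b"] [1 :age 3]] 2)
;; => {1 1001, 2 1002}
```

The reduce makes the constraint visible: the second batch can only resolve entity 1 because the first batch has already been transacted and its ids recorded.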

stuarthalloway11:09:06

What is the batch size? Does increasing it speed things up?

kenny15:09:20

500. I'll give it another shot at 1,000.

kenny17:09:18

Just following up here. I don't have any evidence from my previous attempts stored anywhere unfortunately. The current state restore ran for 57 minutes before throwing an exception in a d/datoms call (will ask a question on that in a separate thread). At that final time, 5,490 transactions succeeded and 2,746,671 datoms had been transacted (sum of the count of :tx-data returned from d/transact). I have attached a screenshot of the CW dashboard for the query group that was used to read datoms. Upon revisiting this, it is unclear whether the bottleneck is from reading datoms via d/datoms or transactions.

kenny00:09:26

Following up again... I have rewritten the datom reading portion to run in parallel. I also added monitoring that reports the queue backlog, i.e. the number of datoms waiting to go through the serial transact & mapping process. The queue stays pegged at 100% utilization (a backlog of 20k datoms). So I can now confirm that it is the serial transact & mapping process that is slowing down the current state restore.