Has anybody connected an agent to a readonly copy of datomic to gain the automation benefits that everyone is talking about these days? I’d be interested to hear learnings before I embark on this journey. more in the thread….
E.g. https://youtu.be/B246K_G7mHU they are doing it with postgres, but it’s the same idea
We use datomic cloud, so I’m thinking as an MVP of creating some kind of read only access repl connection to the cloud and then just running Claude on a Mac mini with a connection to a chat group or even just remote control initially
Another idea might be to set up a backup, replication scheduled job and then run a local datomic on the backup to provide the read only access. This would have the added benefit of proving that the backups are valid at all times.
recent feature, relevant to your interests: https://docs.datomic.com/operation/read-only.html
read-only is only on Pro, not cloud, and it is a source of great envy
oic, I didn't read the messages here carefully enough, my mistake.
I'm going to hold my breath until we get read-only Datomic for Cloud. Starting now....
😶
I've got a performance problem I'm struggling to understand. The scenario: I have a unique attribute identifying ~500K entities. These entities (Purchase Invoice Lines) all have a card-many relationship to an intermediary entity (Allocation) and then a card-one relationship to a target entity (Purchase Order Line). The goal is to identify any Purchase Invoice Line that relates to more than one distinct Purchase Order Line. Details in the 🧵 ...
A simple query to identify these looks something like this:
{:find [?pilid]
 :in [$]
 :where [[?pil :st.purchase-invoice.line/allocations ?aA]
         [?pil :st.purchase-invoice.line/allocations ?aB]
         [(not= ?aA ?aB)]
         [?aA :st.allocation/purchase-order-line ?polA]
         [?aB :st.allocation/purchase-order-line ?polB]
         [(not= ?polA ?polB)]
         [?pil :st.purchase-invoice.line/purchase-invoice-id+number ?pilid]]}
This query will time out too easily, as the result set from the first two clauses is on the order of 500K x 500K (not every ?pil has multiple allocations, but a lot do). The first not= is an attempt to cull the result set early; only the second not= is strictly required for my goal. The :st.purchase-invoice.line/purchase-invoice-id+number attr uniquely identifies each Purchase Invoice Line.
To avoid the huge result sets, I adopted a technique espoused earlier by @favila of limiting the query scope by iterating through subsets of the Purchase Invoice Lines found with index-range. The "query" gets tweaked a little to have this shape:
{:find [?pilid]
 :in [$ [?pil ...]]
 :where [[?pil :st.purchase-invoice.line/allocations ?aA]
         [?pil :st.purchase-invoice.line/allocations ?aB]
         [(not= ?aA ?aB)]
         [?aA :st.allocation/purchase-order-line ?polA]
         [?aB :st.allocation/purchase-order-line ?polB]
         [(not= ?polA ?polB)]
         [?pil :st.purchase-invoice.line/purchase-invoice-id+number ?pilid]]}
The input [?pil ...] is found by successive calls to index-range like this:
(d/index-range $ {:attrid :st.purchase-invoice.line/purchase-invoice-id+number :start nil :end nil :limit m})
I have an outer "driver" expression that iterates over successive calls to index-range with new :start values.
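For anyone following along, here is a minimal sketch of that kind of driver. It is not the code that was actually run: `chunk-seq`, `fetch`, and `key-fn` are hypothetical names, and the paging is written against a plain function so it can be tried without a database. It assumes :start is inclusive and that the keys are unique (true for a unique attr), so it asks for one item more than the chunk size and uses that extra item's key as the next :start.

```clojure
(defn chunk-seq
  "Lazily pages through a sorted index in chunks of `limit` items.
  `fetch` is a hypothetical (fn [start n]) returning up to n items whose key
  is >= start, in key order (a nil start means from the beginning)."
  [fetch key-fn limit]
  (letfn [(step [start]
            (lazy-seq
              ;; Ask for one extra item: it is not part of this chunk,
              ;; it only supplies the (inclusive) start of the next one.
              (let [page (vec (fetch start (inc limit)))]
                (when (seq page)
                  (cons (subvec page 0 (min limit (count page)))
                        (when (> (count page) limit)
                          (step (key-fn (nth page limit)))))))))]
    (step nil)))

;; Against Datomic Cloud, `fetch` might look like this (untested assumption):
;; (fn [start n]
;;   (d/index-range db {:attrid :st.purchase-invoice.line/purchase-invoice-id+number
;;                      :start start :limit n}))
;; with key-fn :v, and (map :e chunk) passed as the [?pil ...] input.

;; Stand-in for the index: ten sorted, unique keys.
(def fake-index (vec (range 10)))

(defn fake-fetch [start n]
  (take n (drop-while #(and start (< % start)) fake-index)))

(def chunks (vec (chunk-seq fake-fetch identity 4)))
```

Each element of `chunks` would be one query's input, so a stall can be pinned to a specific chunk by logging its first key before running the query.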
The problem is that no matter what size I pick for :limit, the query hits a point where it totally stalls. Last night I ran this with limits from 128 up to 32K, always with the same result: after N queries that each take on the order of seconds, the (N+1)th query stalls and I get a "service unavailable" exception. Raising the query timeout to 5 minutes did not resolve the issue.
It's worth noting that my Datomic Cloud server is highly performant otherwise.
how many outer results are you getting (?pilid)? Could it be large enough that after N queries it causes memory pressure?
(I'm assuming you are accumulating individual query results in memory)
Very, very few final (?pilid) results. Any result represents a domain failure and I expect zero. My guess is that of the 500K pilids, there are 10 final results.
is this stall introduced by a fixed number of queries or a specific input ?pil?
The intermediary results of ?pil with multiple allocations (?aA/?aB) is probably 50% of the input. So for chunk size of 128, I would expect 64 to proceed past the first not=.
I tried to home in on the question of "what is the stall trigger" but did not get a conclusive answer last night.
I don't think it is an individual ?pil.
(in any case, I struggled to see how a "bad" ?pil could cause the query to timeout).
The last version I ran went through 192 chunks of 1024 before stalling out.
I recorded where it failed (first :v in the index-range datom output). I can start again from there and see if it stalls quickly or moves further. Stand by...
(worth noting perhaps that after ~192K inputs, there were zero accumulated outputs)
Is this query equivalent to finding ?pol with multiple ?a?
No.
?pol with multiple ?a is normal. ?pil with multiple ?a is normal.
?pil with multiple ?pol is not.
because? It's legal to use same ?pol on different ?pil?
(think partial deliveries being invoiced incrementally over time)
Breaking news: restarting the query roughly where it failed last night encounters the stall very quickly. I'm going to try a bisecting search to try to get to the offending ?pil.
(what in #$& could cause that?!)
A pil with a large number of a
Possibly... but domain wise that would not be expected. An average ?pil would be expected to have 10 and the max would be 100. But, that is the most likely explanation.
Is this query equivalent?
'{:find [?pilid]
  :in [$ ?pols]
  :where
  [[(q '{:find [?pil (count ?pol)]
         :with [?a]
         :in [$ [?pol ...]]
         :where [[?a :st.allocation/purchase-order-line ?pol]
                 [?pil :st.purchase-invoice.line/allocations ?a]]}
       $ ?pols) [[?pil ?polcount]]]
   [(> ?polcount 1)]
   [?pil :st.purchase-invoice.line/purchase-invoice-id+number ?pilid]]}
A quick glance says "yes".
FWIW, my attempt to bisect is showing the same results as earlier attempts: the stall point seems to "wander" around.
This one does pil->pol so you don't have to change the outer loop
'{:find [?pilid]
  :in [$ ?pils]
  :where
  [[(q '{:find [?pil (count ?pol)]
         :with [?a]
         :in [$ [?pil ...]]
         :where [[?pil :st.purchase-invoice.line/allocations ?a]
                 [?a :st.allocation/purchase-order-line ?pol]]}
       $ ?pils) [[?pil ?polcount]]]
   [(> ?polcount 1)]
   [?pil :st.purchase-invoice.line/purchase-invoice-id+number ?pilid]]}
Do you have reason to believe that one is "safer"?
more pols than pils
so chunking pols has less chance of amplifying rowcount
OK, I'll give that a try.
nah I think it's wrong
'{:find [?pilid]
  :in [$ ?pils]
  :where
  [[(q '{:find [?pil (count ?a)]
         :with [?pol]
         :in [$ [?pil ...]]
         :where [[?pil :st.purchase-invoice.line/allocations ?a]
                 [?a :st.allocation/purchase-order-line ?pol]]}
       $ ?pils) [[?pil ?acount]]]
   [(> ?acount 1)]
   [?pil :st.purchase-invoice.line/purchase-invoice-id+number ?pilid]]}
you want to find multiple ?a for same ?pil + ?pol
Multiple ?a for same ?pil + ?pol is "normal".
I think your first version was the correct one. Multiple ?pol for ?pil is wrong (what we seek in the query).... you must join via the ?a
(although I do think the :with in the subquery is wrong)
OK, after removing the :with (which, IMO, makes the query "correct" for the goal) it runs to completion in less than 1 minute and returns two results. Totally in line with expectations.
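For reference, the query that ran was presumably the first pil->pol subquery with the :with clause dropped. What follows is a reconstruction from the versions posted above, not a copy of what was executed, and the var name `multi-pol-query` is made up. Without :with, the subquery aggregates over distinct (?pil ?pol) pairs, so (count ?pol) is the number of distinct ?pol per ?pil.

```clojure
(def multi-pol-query
  '{:find [?pilid]
    :in [$ ?pils]
    :where
    [[(q '{:find [?pil (count ?pol)]
           ;; no :with here, so rows are distinct (?pil ?pol) pairs
           :in [$ [?pil ...]]
           :where [[?pil :st.purchase-invoice.line/allocations ?a]
                   [?a :st.allocation/purchase-order-line ?pol]]}
         $ ?pils) [[?pil ?polcount]]]
     ;; keep only the ?pil that reach more than one distinct ?pol
     [(> ?polcount 1)]
     [?pil :st.purchase-invoice.line/purchase-invoice-id+number ?pilid]]})
```

The outer loop stays as it was: each chunk of ?pil entity ids from index-range is passed as ?pils.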
I see the subquery as more efficient since it does not have to do a cross product of ?aA x ?aB BEFORE hitting the first not=. Even after that, it could still be a large cross product if there are many allocations. I did not expect more than ~100 allocations per ?pil, but this is making me rethink that.
First order of business: thanks, @favila, for the insight.
without :with, that is just number of distinct ?pol per ?pil
isn't ?pil -> ?a card-many?
Exactly, and that is the goal.
ok. I guess I expect ?pil -> ?pol directly
if it's a 1:1 or N:1 relationship
I misread your goal as "pol related to pil via different ?a"
widgets get allocated to a ?pol over time. At any point, we might invoice and generate a ?pil with all the (not-yet-invoiced) ?as.
But you are correct: the goal is "> 1 ?pol related to a given ?pil (necessarily via different ?as)".
?pil -> ?a : card many ?a -> ?pol : card one
"pol related to pil via different ?a" specifically disallows [pil a0 pol] [pil a1 pol]
so I misunderstood
what you want to find is [pil a0 pol0] [pil a1 pol1]
Exactly.
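To make the agreed goal concrete, here is a small sketch in plain Clojure (no Datomic) over made-up [pil a pol] triples. The function name and the sample data are invented for illustration; it only mirrors the logic of "more than one distinct pol per pil", not how the query engine evaluates it.

```clojure
(defn pils-with-multiple-pols
  "Given [pil a pol] triples, returns the set of pil that reach more than
  one distinct pol. Several allocations pointing at the same pol are fine."
  [triples]
  (->> triples
       (map (fn [[pil _a pol]] [pil pol]))   ; drop the allocation
       distinct                              ; distinct (pil, pol) pairs
       (group-by first)
       (filter (fn [[_pil pairs]] (> (count pairs) 1)))
       (map first)
       set))

(def sample
  [[:pil1 :a0 :pol0]    ; pil1: two allocations, same pol -> normal
   [:pil1 :a1 :pol0]
   [:pil2 :a2 :pol1]    ; pil2: two allocations, two pols -> what we seek
   [:pil2 :a3 :pol2]
   [:pil3 :a4 :pol0]])  ; pol0 shared across pils -> normal

(def offenders (pils-with-multiple-pols sample))
```

Here `offenders` contains only :pil2.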
Even with the misunderstanding, your intuition of the "volume" of intermediate results (?a X ?a) seems to have been the key. That's the only reason AFAICT that the subquery is "better" -and it undeniably is.
i was really just avoiding a cross-join
BTW, I had pacing enabled (two seconds between queries) since yesterday in an attempt to be "nice" to the DC query server. I'm going to take it out and I'll bet this meta query finishes in less than 30s.
any [... ?a1] [... ?a2][(!= ?a1 ?a2)] is going to hurt
Yep. My mistake was assuming (LOL) that the cardinality of ?a for a given ?pil was small (100 max, average 10). I'm pretty sure I was wrong and now I've started an investigation on that.
BTW, with an input set of ?pil, would you expect the query engine to produce an intermediate result set per ?pil or for all ?pil before winnowing down?
I expect it to run 1 query whose result set size is same as input pils, then filter it before adding rows
so the query is still producing all rows, but they don't get materialized in the outer query
I'm not sure of that though: check query-stats
either way, you will never have more rows than pils
and at least once you must have exactly as many rows as pils
Here's a fact that may help your intuition: the next clause is not executed until the current clause never needs to be executed again.
Digesting....
That is an interesting constraint.
If you didn't do this, you would be supplying an incomplete result set to the next clause
whether that is correct or not requires more complex analysis, which the query doesn't do
And then suddenly you have a process instead of unification.
TIL: we have recently added some extreme outliers in allocation-count-by-pil. While our average is 156, our std deviation is 604 and our max is ~4400, which yields a cross-product of roughly 19 million allocation pairs for that one ?pil. Ouch!
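The arithmetic behind that number, as a quick sketch: for a ?pil with n allocations, the two allocation clauses followed by not= leave n * (n - 1) ordered (?aA, ?aB) pairs. The helper name is made up.

```clojure
(defn ordered-pairs
  "Number of (?aA, ?aB) pairs with ?aA not= ?aB for a pil with n allocations."
  [n]
  (* n (dec n)))

(def pairs-at-assumed-avg (ordered-pairs 10))    ; 90
(def pairs-at-assumed-max (ordered-pairs 100))   ; 9900
(def pairs-at-actual-avg  (ordered-pairs 156))   ; 24180
(def pairs-at-actual-max  (ordered-pairs 4400))  ; 19355600
```

So the growth is quadratic: one outlier at ~4400 allocations produces about 2000 times the intermediate rows of the assumed worst case of 100.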
But, AFAICT, result set in the subquery is still as big as the result set in the original query (IOW, no benefit to subq IIUC)