2026-05-28 datomic | Clojure Slack Archive

datomic 2026-05-28

steveb8n 2026-05-28T03:52:19.870589Z

Has anybody connected an agent to a read-only copy of Datomic to gain the automation benefits that everyone is talking about these days? I’d be interested to hear what you learned before I embark on this journey. More in the thread…

steveb8n 2026-05-28T03:52:57.371169Z

E.g. https://youtu.be/B246K_G7mHU they are doing it with postgres, but it’s the same idea

steveb8n 2026-05-28T03:58:10.851359Z

We use Datomic Cloud, so as an MVP I’m thinking of creating some kind of read-only REPL connection to the cloud and then just running Claude on a Mac mini with a connection to a chat group, or even just remote control initially

🆒 1

steveb8n 2026-05-28T03:59:23.494519Z

Another idea might be to set up a scheduled backup/replication job and then run a local Datomic on the backup to provide the read-only access. This would have the added benefit of proving that the backups are valid at all times.

Harold 2026-05-28T04:09:42.683609Z

recent feature, relevant to your interests: https://docs.datomic.com/operation/read-only.html

👀 1

danieroux 2026-05-28T06:09:59.256659Z

read-only is only on Pro, not cloud, and it is a source of great envy

Harold 2026-05-28T12:23:22.976729Z

oic, I didn't read the messages here carefully enough, my mistake.

cch1 2026-05-28T13:34:56.552669Z

I'm going to hold my breath until we get read-only Datomic for Cloud. Starting now....

danieroux 2026-05-28T14:36:48.760179Z

😶

😆 2

cch1 2026-05-28T13:38:16.434569Z

I've got a performance problem I'm struggling to understand. The scenario: I have a unique attribute identifying ~500K entities. These entities (Purchase Invoice Lines) all have a card-many relationship to an intermediary entity (Allocation) and then a card-one relationship to a target entity (Purchase Order Line). The goal is to identify any Purchase Invoice Line that relates to more than one distinct Purchase Order Line. Details in the 🧵 ...

cch1 2026-05-28T13:39:58.889009Z

A simple query to identify these looks something like this:

{:find [?pilid]
 :in [$]
 :where [[?pil :st.purchase-invoice.line/allocations ?aA]
         [?pil :st.purchase-invoice.line/allocations ?aB]
         [(not= ?aA ?aB)]
         [?aA :st.allocation/purchase-order-line ?polA]
         [?aB :st.allocation/purchase-order-line ?polB]
         [(not= ?polA ?polB)]
         [?pil :st.purchase-invoice.line/purchase-invoice-id+number ?pilid]]}

cch1 2026-05-28T13:42:10.553649Z

This query will time out too easily, as the result set from the first two clauses is on the order of 500K x 500K (not every ?pil has multiple allocations, but a lot do). The first not= is an attempt to cull the result set early; only the second not= is strictly required for my goal. The :st.purchase-invoice.line/purchase-invoice-id+number attr uniquely identifies each Purchase Invoice Line.

cch1 2026-05-28T13:47:01.533949Z

To avoid the huge result sets, I adopted a technique espoused earlier by @favila of limiting the query scope by iterating through subsets of the Purchase Invoice Lines found with index-range. The "query" gets tweaked a little to have this shape:

{:find [?pilid]
 :in [$ [?pil ...]]
 :where [[?pil :st.purchase-invoice.line/allocations ?aA]
         [?pil :st.purchase-invoice.line/allocations ?aB]
         [(not= ?aA ?aB)]
         [?aA :st.allocation/purchase-order-line ?polA]
         [?aB :st.allocation/purchase-order-line ?polB]
         [(not= ?polA ?polB)]
         [?pil :st.purchase-invoice.line/purchase-invoice-id+number ?pilid]]}

The input [?pil ...] is found by successive calls to index-range like this:

(d/index-range $ {:attrid :st.purchase-invoice.line/purchase-invoice-id+number :start nil :end nil :limit m})

cch1 2026-05-28T13:47:48.043829Z

I have an outer "driver" expression that iterates over successive calls to index-range with new :start values.
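
A driver of this shape can be sketched as below. This is a minimal sketch, not the code cch1 ran: `drive-chunks`, `fetch-chunk`, `run-query`, and `fetch-from-ids` are hypothetical names, and a sorted vector stands in for the index so the example runs without a database. Against Datomic, `fetch-chunk` would presumably wrap `d/index-range` and take the next `:start` from the `:v` of a returned datom. Assuming `:start` is inclusive, the sketch fetches one extra item per call and uses it only as the next start, so no item is processed twice.

```clojure
(defn drive-chunks
  "Walks an ordered source in chunks of `limit`, running `run-query` on each
  chunk and accumulating whatever it returns.
  `fetch-chunk` is (fn [start n]) -> up to n items at or after `start`
  (nil means from the beginning)."
  [fetch-chunk run-query limit]
  (loop [start   nil
         results []]
    (let [chunk   (vec (fetch-chunk start (inc limit))) ; one extra item as look-ahead
          batch   (take limit chunk)
          results (into results (run-query batch))]
      (if (> (count chunk) limit)
        (recur (nth chunk limit) results) ; the look-ahead item starts the next chunk
        results))))

;; Stand-in for the index: a sorted vector of ids.
(def ids (vec (range 10)))

(defn fetch-from-ids [start n]
  (take n (if (nil? start)
            ids
            (drop-while #(< % start) ids))))

;; Stand-in for the real query: keep only even ids.
(drive-chunks fetch-from-ids #(filter even? %) 3)
;; => [0 2 4 6 8]
```

Passing the fetch and query steps in as functions keeps the paging logic separate from the database calls, which also makes it easy to log the `start` of each chunk when hunting for a stall.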

cch1 2026-05-28T13:49:43.480759Z

The problem is that no matter what size I pick for :limit, the query hits a point where it totally stalls. Last night I ran this with limits from 128 up to 32K, with the same result: after N queries that each take on the order of seconds, the (N+1)th query stalls and I get a "service unavailable" exception. Raising the query timeout to 5 minutes did not resolve the issue.

cch1 2026-05-28T13:50:17.849599Z

It's worth noting that my Datomic Cloud server is highly performant otherwise.

favila 2026-05-28T13:51:45.861999Z

how many outer results are you getting (?pilid)? Could it be large enough that after N queries it causes memory pressure?

favila 2026-05-28T13:52:56.825029Z

(I'm assuming you are accumulating individual query results in memory)

cch1 2026-05-28T13:53:14.033429Z

Very, very few final (?pilid) results: any result represents a domain failure and I expect zero. My guess is that of the 500K pilids, there are 10 final results.

favila 2026-05-28T13:53:54.364299Z

is this stall introduced by a fixed number of queries or a specific input ?pil?

cch1 2026-05-28T13:54:15.207359Z

The intermediate results (?pil with multiple allocations, ?aA/?aB) are probably 50% of the input. So for a chunk size of 128, I would expect 64 to proceed past the first not=.

cch1 2026-05-28T13:54:41.978349Z

I tried to home in on the question of "what is the stall trigger" but did not get a conclusive answer last night.

cch1 2026-05-28T13:54:56.590999Z

I don't think it is an individual ?pil.

cch1 2026-05-28T13:55:22.006199Z

(in any case, I struggled to see how a "bad" ?pil could cause the query to timeout).

cch1 2026-05-28T13:56:46.305279Z

The last version I ran went through 192 chunks of 1024 before stalling out.

cch1 2026-05-28T13:57:44.948349Z

I recorded where it failed (first :v in the index-range datom output). I can start again from there and see if it stalls quickly or moves further. Stand by...

cch1 2026-05-28T14:01:30.355699Z

(worth noting perhaps that after ~192K inputs, there were zero accumulated outputs)

favila 2026-05-28T14:02:52.977829Z

Is this query equivalent to finding ?pol with multiple ?a?

cch1 2026-05-28T14:03:03.973239Z

No.

cch1 2026-05-28T14:03:42.904509Z

?pol with multiple ?a is normal. ?pil with multiple ?a is normal.

cch1 2026-05-28T14:03:50.511869Z

?pil with multiple ?pol is not.

favila 2026-05-28T14:03:50.913389Z

because? It's legal to use same ?pol on different ?pil?

cch1 2026-05-28T14:04:41.811899Z

(think partial deliveries being invoiced incrementally over time)

cch1 2026-05-28T14:06:14.607529Z

Breaking news: restarting the query roughly where it failed last night encounters the stall very quickly. I'm going to try a bisecting search to try to get to the offending ?pil.
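
A bisecting search over a failing chunk could look like the sketch below. It is only an illustration: `bisect-offender` and `stalls?` are hypothetical names, and it assumes there is exactly one offending input and that the stall is deterministic for a given input set. In practice `stalls?` would run the query on the candidate half with a timeout.

```clojure
(defn bisect-offender
  "Narrows `items` to the single element that makes `stalls?` return true.
  Assumes (stalls? items) is true for the full input and is deterministic."
  [stalls? items]
  (loop [items (vec items)]
    (if (<= (count items) 1)
      (first items)
      (let [mid  (quot (count items) 2)
            left (subvec items 0 mid)]
        (recur (if (stalls? left)
                 left
                 (subvec items mid)))))))

;; Stand-in predicate: a chunk "stalls" when it contains id 42.
(bisect-offender #(boolean (some #{42} %)) (range 100))
;; => 42
```

If the stall depends on something other than the input (server load, for example), the result will not be stable from run to run.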

cch1 2026-05-28T14:06:28.224869Z

(what in #$& could cause that?!)

favila 2026-05-28T14:06:55.629369Z

A pil with a large number of a

cch1 2026-05-28T14:07:43.021519Z

Possibly... but domain-wise that would not be expected. An average ?pil would be expected to have ~10 and the max would be ~100. But, that is the most likely explanation.

favila 2026-05-28T14:12:39.748499Z

Is this query equivalent?

'{:find [?pilid]
  :in   [$ ?pols]
  :where
  [[(q '{:find  [?pil (count ?pol)]
         :with  [?a]
         :in    [$ [?pol ...]]
         :where [[?a :st.allocation/purchase-order-line ?pol]
                 [?pil :st.purchase-invoice.line/allocations ?a]]}
       $ ?pols) [[?pil ?polcount]]]
   [(> ?polcount 1)]
   [?pil :st.purchase-invoice.line/purchase-invoice-id+number ?pilid]]}

cch1 2026-05-28T14:13:34.547109Z

A quick glance says "yes".

cch1 2026-05-28T14:14:01.817489Z

FWIW, my attempt to bisect is showing the same results as earlier attempts: the stall point seems to "wander" around.

favila 2026-05-28T14:14:02.085799Z

This one does pil->pol so you don't have to change the outer loop

favila 2026-05-28T14:14:03.986279Z

'{:find [?pilid]
  :in   [$ ?pils]
  :where
  [[(q '{:find  [?pil (count ?pol)]
         :with  [?a]
         :in    [$ [?pil ...]]
         :where [[?pil :st.purchase-invoice.line/allocations ?a]
                 [?a :st.allocation/purchase-order-line ?pol]]}
       $ ?pils) [[?pil ?polcount]]]
   [(> ?polcount 1)]
   [?pil :st.purchase-invoice.line/purchase-invoice-id+number ?pilid]]}

cch1 2026-05-28T14:14:51.710869Z

Do you have reason to believe that one is "safer"?

favila 2026-05-28T14:15:20.275909Z

more pols than pils

favila 2026-05-28T14:15:57.495499Z

so chunking pols has less chance of amplifying rowcount

cch1 2026-05-28T14:16:25.090899Z

OK, I'll give that a try.

favila 2026-05-28T14:18:56.054289Z

nah I think it's wrong

favila 2026-05-28T14:18:57.645029Z

'{:find [?pilid]
  :in   [$ ?pils]
  :where
  [[(q '{:find  [?pil (count ?a)]
         :with  [?pol]
         :in    [$ [?pil ...]]
         :where [[?pil :st.purchase-invoice.line/allocations ?a]
                 [?a :st.allocation/purchase-order-line ?pol]]}
       $ ?pils) [[?pil ?acount]]]
   [(> ?acount 1)]
   [?pil :st.purchase-invoice.line/purchase-invoice-id+number ?pilid]]}

favila 2026-05-28T14:19:18.472589Z

you want to find multiple ?a for same ?pil + ?pol

cch1 2026-05-28T14:20:25.312749Z

Multiple ?a for same ?pil + ?pol is "normal".

cch1 2026-05-28T14:21:03.197139Z

I think your first version was the correct one. Multiple ?pol for ?pil is wrong (what we seek in the query).... you must join via the ?a

cch1 2026-05-28T14:27:07.911249Z

(although I do think the :with in the subquery is wrong)

cch1 2026-05-28T14:31:50.870789Z

OK, after removing the :with (which, IMO, makes the query "correct" for the goal) it runs to completion in less than 1 minute and returns two results. Totally in line with expectations. I see the subquery as more efficient since it does not have to do a cross product of ?aA x ?aB BEFORE hitting the first not=, and even after that, it could still be a large cross product if there are many allocations. I did not expect more than ~100 allocations per ?pil, but this is making me rethink that.
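
For reference, the query described here was presumably the ?pil-chunked subquery that counts ?pol, with the :with clause removed. The reconstruction below is an assumption, since the final form was not posted in the thread. `multi-pol-query` is a hypothetical name, and the commented call assumes `d` is `datomic.client.api`, `db` is a database value, and `pils` is one chunk of entity ids.

```clojure
(def multi-pol-query
  '{:find [?pilid]
    :in   [$ ?pils]
    :where
    [;; Subquery: number of distinct ?pol reachable from each ?pil.
     [(q '{:find  [?pil (count ?pol)]
           :in    [$ [?pil ...]]
           :where [[?pil :st.purchase-invoice.line/allocations ?a]
                   [?a :st.allocation/purchase-order-line ?pol]]}
         $ ?pils) [[?pil ?polcount]]]
     ;; Keep only invoice lines that touch more than one order line.
     [(> ?polcount 1)]
     [?pil :st.purchase-invoice.line/purchase-invoice-id+number ?pilid]]})

;; Hypothetical call for one chunk of entity ids:
;; (d/q {:query multi-pol-query :args [db pils]})
```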

cch1 2026-05-28T14:32:09.921979Z

First order of business: thanks, @favila, for the insight.

favila 2026-05-28T14:33:06.180379Z

without :with, that is just number of distinct ?pol per ?pil
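
The effect of :with on the count can be illustrated in plain Clojure, with no Datomic involved. This is a toy model under the assumption that query results are sets of tuples over the :find and :with variables; the function names and sample rows are made up for the illustration.

```clojure
;; Sample [pil allocation pol] rows: :pil1 has two allocations to the same pol,
;; :pil2 has allocations to two different pols.
(def rows
  [[:pil1 :a0 :pol0]
   [:pil1 :a1 :pol0]
   [:pil2 :a2 :pol0]
   [:pil2 :a3 :pol1]])

(defn pol-count-without-with
  "Models :find ?pil (count ?pol) with no :with: rows collapse to distinct
  [pil pol] pairs before counting."
  [rows]
  (->> rows
       (map (fn [[pil _a pol]] [pil pol]))
       distinct
       (map first)
       frequencies))

(defn pol-count-with-a
  "Models the same :find with :with [?a]: ?a stays in the tuple, so every
  allocation contributes to the count."
  [rows]
  (->> rows
       distinct
       (map first)
       frequencies))

(pol-count-without-with rows) ;; => {:pil1 1, :pil2 2}
(pol-count-with-a rows)       ;; => {:pil1 2, :pil2 2}
```

Filtering on a count greater than 1 isolates :pil2 only in the version without :with, which matches the goal of finding a ?pil tied to more than one distinct ?pol.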

favila 2026-05-28T14:33:22.743829Z

isn't ?pil -> ?a card-many?

cch1 2026-05-28T14:33:23.577469Z

Exactly, and that is the goal.

favila 2026-05-28T14:35:12.656869Z

ok. I guess I expect ?pil -> ?pol directly

favila 2026-05-28T14:35:18.361769Z

if it's a 1:1 or N:1 relationship

favila 2026-05-28T14:36:07.308859Z

I misread your goal as "pol related to pil via different ?a"

cch1 2026-05-28T14:36:08.117879Z

widgets get allocated to a ?pol over time. At any point, we might invoice and generate a ?pil with all the (not-yet-invoiced) ?as.

cch1 2026-05-28T14:37:06.712209Z

But you are correct: the goal is "> 1 ?pol related to a given ?pil (necessarily via different ?as)".

cch1 2026-05-28T14:37:49.293499Z

?pil -> ?a : card many ?a -> ?pol : card one

favila 2026-05-28T14:39:10.798049Z

"pol related to pil via different ?a" specifically disallows [pil a0 pol] [pil a1 pol]

favila 2026-05-28T14:39:24.061999Z

so I misunderstood

favila 2026-05-28T14:40:03.363639Z

what you want to find is [pil a0 pol0] [pil a1 pol1]

cch1 2026-05-28T14:40:09.974209Z

Exactly.

cch1 2026-05-28T14:41:15.657179Z

Even with the misunderstanding, your intuition of the "volume" of intermediate results (?a X ?a) seems to have been the key. That's the only reason AFAICT that the subquery is "better", and it undeniably is.

favila 2026-05-28T14:42:02.827509Z

i was really just avoiding a cross-join

cch1 2026-05-28T14:42:18.358229Z

BTW, I had pacing enabled (two seconds between queries) since yesterday in an attempt to be "nice" to the DC query server. I'm going to take it out and I'll bet this meta query finishes in less than 30s.

favila 2026-05-28T14:42:28.243679Z

any [... ?a1] [... ?a2][(!= ?a1 ?a2)] is going to hurt

cch1 2026-05-28T14:43:32.254059Z

Yep. My mistake was assuming (LOL) that the cardinality of ?a for a given ?pil was small (~100 max, average ~10). I'm pretty sure I was wrong and now I've started an investigation on that.

cch1 2026-05-28T14:44:31.282589Z

BTW, with an input set of ?pil, would you expect the query engine to produce an intermediate result set per ?pil or for all ?pil before winnowing down?

favila 2026-05-28T14:46:39.327039Z

I expect it to run 1 query whose result set size is same as input pils, then filter it before adding rows

favila 2026-05-28T14:46:52.760549Z

so the query is still producing all rows, but they don't get materialized in the outer query

favila 2026-05-28T14:47:07.207689Z

I'm not sure of that though: check query-stats
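
Requesting query stats is, as far as I know, a matter of adding `:query-stats true` to the arg-map passed to `d/q`, in which case the call returns a map with `:ret` and `:query-stats` rather than the bare result. The helper below is a hypothetical convenience (`with-query-stats` is not part of Datomic), and the commented call assumes `d` is `datomic.client.api` and `db` is a database value.

```clojure
(defn with-query-stats
  "Returns `arg-map` with query stats requested."
  [arg-map]
  (assoc arg-map :query-stats true))

(with-query-stats {:query '[:find ?e :where [?e :db/ident]]
                   :args  []})
;; => {:query [:find ?e :where [?e :db/ident]], :args [], :query-stats true}

;; Hypothetical use against a real database:
;; (let [{:keys [ret query-stats]}
;;       (d/q (with-query-stats {:query some-query :args [db]}))]
;;   query-stats)
```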

favila 2026-05-28T14:47:41.930039Z

either way, you will never have more rows than pils

favila 2026-05-28T14:48:26.571929Z

and at least once you must have exactly as many rows as pils

✅ 1

favila 2026-05-28T14:55:06.681029Z

Here's a fact that may help your intuition: the next clause is not executed until the current clause never needs to be executed again.
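
A toy model of that rule, in plain Clojure: each clause consumes the whole intermediate relation and produces a new one before the next clause starts. This is only an illustration of the behaviour described, not Datomic's implementation, and all names are hypothetical. It shows why the self-join peaks at n x n rows before the not= can remove anything.

```clojure
(defn intermediate-sizes
  "Applies each clause to the whole intermediate relation before moving on,
  returning the row count after every step (including the input)."
  [rows clauses]
  (->> clauses
       (reductions (fn [rel clause] (vec (clause rel))) (vec rows))
       (mapv count)))

(defn bind-allocation
  "Clause that joins every row with each of `allocs` under key `k`."
  [k allocs]
  (fn [rel]
    (for [row rel, a allocs]
      (assoc row k a))))

(defn not-same-allocation
  "Clause that drops rows where both allocation bindings are equal."
  [rel]
  (remove #(= (:aA %) (:aB %)) rel))

;; One ?pil with four allocations:
(intermediate-sizes [{:pil 1}]
                    [(bind-allocation :aA [:a0 :a1 :a2 :a3])
                     (bind-allocation :aB [:a0 :a1 :a2 :a3])
                     not-same-allocation])
;; => [1 4 16 12]
```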

cch1 2026-05-28T14:55:22.592949Z

Digesting....

cch1 2026-05-28T14:55:36.005799Z

That is an interesting constraint.

favila 2026-05-28T14:56:08.727669Z

If you didn't do this, you would be supplying an incomplete result set to the next clause

favila 2026-05-28T14:56:29.579509Z

whether that is correct or not requires more complex analysis, which the query doesn't do

cch1 2026-05-28T14:56:32.636009Z

And then suddenly you have a process instead of unification.

cch1 2026-05-28T15:34:56.381609Z

TIL: we have recently added some extreme outliers in allocation-count-by-pil. While our average is 156, our std deviation is 604 and our max is ~4400, which yields a cross product of roughly 19 million allocation pairs for that one ?pil. Ouch!
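
The arithmetic behind those numbers, as a small sketch (`self-join-rows` is a hypothetical helper): joining one ?pil's n allocations against themselves yields n x n rows, whereas the grouped subquery handles only n allocation rows for the same ?pil.

```clojure
(defn self-join-rows
  "Rows produced by joining one ?pil's n allocations against themselves."
  [n]
  (* n n))

(self-join-rows 156)  ;; => 24336     (an average ?pil)
(self-join-rows 4400) ;; => 19360000  (the largest outlier)
```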

cch1 2026-05-28T14:22:24.830369Z

But, AFAICT, result set in the subquery is still as big as the result set in the original query (IOW, no benefit to subq IIUC)
