2025-09-02 datomic | Clojure Slack Archive

datomic 2025-09-02

cch1 2025-09-02T15:57:07.520439Z

And another obscure error message: Unable to create catalog entry. Any idea what this might mean?

cch1 2025-09-02T15:59:05.044009Z

I wonder if this is relevant: https://clojurians.slack.com/archives/C03RZMDSH/p1722282495559669

cch1 2025-09-02T16:00:08.581289Z

I do run create-database as part of my system startup, so maybe Datomic almost re-created my database -a terrifying thought. Have I been foolish in assuming that create-database was idempotent?

lwhorton 2025-09-02T17:25:30.341169Z

for a given t value can i paginate through the results of a query using limit, offset with guaranteed stability? it seems kind of obvious that this is true, but i can't seem to find documentation that makes this explicit.

Darlei Soares 2025-09-03T13:01:54.788969Z

@lwhorton are you using cloud or datomic pro?

lwhorton 2025-09-03T13:35:59.520569Z

cloud (ions)

👍 1

Darlei Soares 2025-09-03T13:36:40.354159Z

So forget my advice on :limit and :offset, @joe.lane is far better than me on Cloud and can help you better 😄

Joe Lane 2025-09-03T13:39:47.138929Z

How big are "super large result sets"?

lwhorton 2025-09-03T13:40:13.227559Z

maybe 10 to 12k entities, for now, but thatll grow.

Joe Lane 2025-09-03T13:40:32.178879Z

In 5 years, how big could it possibly be if you are wildly successful 🙂

lwhorton 2025-09-03T13:40:56.169119Z

500k to 1M+

Joe Lane 2025-09-03T13:41:58.122299Z

So, is the intent of the query really a scan? is it truly just looking for all entities which have :some/attr and are missing :another/attr or is it more complex than that? Will it change in the future to have more filtering criteria? The details matter here 🙂

lwhorton 2025-09-03T13:47:55.370879Z

its to find a legalperson entity that has no designated-pay-into account (ref) but who has at any point in time been what we call a 'rec-owner'. we have a concept of ownership of a thing through time.

{:find [(pull ?e [*])]
                 :in [$]
                 :where [[?e :legal-person/id]
                         [(missing? $ ?e :legal-person/designated-pay-into)]
                         [?o :ownership/owner ?e]
                         [?g :generator/rec-ownership ?o]]}

lwhorton 2025-09-03T13:50:24.787789Z

logically this is: "find us all legal people who we just can't pay (missing a pay-into), but who we might need to pay (have at some point owned a rec, which is the thing we buy from people)". designated-pay-into is a ref to an account entity ownership/owner is a ref to a legal-person entity generator/rec-ownership is a component (many) to the chronology of owners of recs for a given generator a generator is the thing producing recs so it's not terribly complicated, but it's not trivial imo.

Joe Lane 2025-09-03T13:53:46.149749Z

And this is currently and will continue to be an ongoing business use-case rather than a cleanup operation to find missing? $ ?e :legal-person/designated-pay-into legal-persons and eventually add that attr to all ?es, correct?

lwhorton 2025-09-03T13:58:46.478579Z

solving my business process problems, lol ;). No you are totally right, the intent is that this query serves as step 1 in a process to amend unpayable people. But at some point for a variety of reasons that number of people can get out of hand. Imagine we acquire a few smaller sellers who have already 200k legal people each.

lwhorton 2025-09-03T14:01:35.360379Z

and also, even now when the size is about 10-12k entities, that is far too many for an api consumer. so while we can fit this into memory and leverage some alternative pagination strategy... we'd prefer not even bothering to sort results, and to use offset/limit if that is indeed their intended use case.

Joe Lane 2025-09-03T14:05:48.172749Z

And, realistically, is there any low-latency p99 requirements or anything like that? If the query takes 100ms is that a big deal?

lwhorton 2025-09-03T14:06:41.060169Z

no not at all. it can take 10s i just need to be able to reliably paginate through all the results without missing people for a given basis-t.

Joe Lane 2025-09-03T14:09:58.003239Z

And you don't care about order for the UI, just asking for "order" to support stable pagination to the UI correct?

lwhorton 2025-09-03T14:10:35.246809Z

correct. at this time we see no need to sort the results.

Joe Lane 2025-09-03T14:10:40.027429Z

do :legal-person entitites have an external-id of some kind you can send to the UI?

lwhorton 2025-09-03T14:11:22.254099Z

external as in not a db/id? they have an id which is a v7 uuid.

Joe Lane 2025-09-03T14:12:05.446949Z

~~And I assume it is indexed, yeah?~~ nvm, it's cloud, everything is indexed.

lwhorton 2025-09-03T14:16:30.335229Z

is it possible that the documentation I am looking at is not for cloud/ions? it's a little confusing to me how/why offset would exist if the result sets aren't stably ordered in a cloud/ion deployment.

Joe Lane 2025-09-03T14:20:38.258559Z

I think I would flip this around and paginate via d/datoms or d/index-range over the :legal-person/id , filtered by a predicate of missing-designated-pay-into and was-an-owner-of-a-rec

lwhorton 2025-09-03T14:23:34.525769Z

oh interesting. I'm going to have to go take a look at these. The only thing I've used so far is index pull. if I might ask: what is the thing making you hesitant about stable result sets?

lwhorton 2025-09-03T14:24:36.406119Z

is there something about cloud that is different than on-prem? or is this an undefined behavior and you don't want to over-promise? something else entirely?

Joe Lane 2025-09-03T14:26:43.336799Z

I don't think we ever claim the results of d/query or d/q will be in the same order between 2 calls since the query engine has set-semantics.

lwhorton 2025-09-03T14:28:16.498449Z

i see. what is that chunk in the documentation about offset and limit for d/q? if the results arent ordered between two calls, isn't offset kind of unusable? I feel like I'm just missing something

Joe Lane 2025-09-03T14:30:54.649859Z

Imagine you were consuming the results lazily over a wire. limit can say "only send me 10 results no matter how many you made on the backend" and chunk is "how many to send at a time"

lwhorton 2025-09-03T14:50:40.607979Z

ohh i see. that makes sense. thanks for all your help!

cch1 2025-09-03T14:58:07.472709Z

I'm struggling to understand the value of offset, @joe.lane in the q function of the API. If the order from the db is non-deterministic on every call of the same query and t value, then offset seems like nothing more than a fixed probe into a totally non-deterministic set.

cch1 2025-09-03T14:59:08.826999Z

The limit parameter still has some value, but offset seems ... useless.

Joe Lane 2025-09-03T14:59:26.579249Z

I’ll have to investigate further before I can get back to you

✔️ 1

Darlei Soares 2025-09-02T17:28:22.261749Z

You can paginate using t with guaranteed stability, that's correct. How are you thinking in doing this pagination? We normally don't use limit and offset

lwhorton 2025-09-02T17:33:33.076609Z

that's nice to hear! i'm trying to run a query that's inevitably going to return a very large number of results. i'm ultimately returning those results over an api (paginated). i want to use limit/offset to keep my position in the results set. the query is not a candidate for index-pull, though, given its use of a missing? $ ... clause. this is actually my first time ever trying to use limit/offset not inside an index pull.

Joe Lane 2025-09-02T23:10:55.113759Z

Hey @lwhorton , I’m not convinced query is going to be deterministically ordered across multiple invocations unless you explicitly sort the results. Can you share your use cases and query and we can help you out?

lwhorton 2025-09-02T23:23:50.098899Z

ill see if i can work up a minimum use case and get back to you (probably tomorrow).

👍 2

lwhorton 2025-09-03T02:21:30.163779Z

let me know if this is unclear:

;; normally i would use index-pull for faster results and pagination over
  ;; super large result sets.

  ;; base query 
  (d/q {:query '{:find [(pull ?e sel)]
                 :in [$ sel]
                 :where [[?e :some/attr]
                         [(missing? $ ?e :another/attr)]]}
        :args [db]})

  ;; note that this is not index-pullable due to the use of `missing?`

  ;; i'd like to use offset and limit in succession 
  (d/q {:query '{:find [(pull ?e sel)]
                 :in [$ sel]
                 :where [[?e :some/attr]
                         [(missing? $ ?e :another/attr)]]
                 :offset 0
                 :limit 200}
        :args [db]})

  (d/q {:query '{:find [(pull ?e sel)]
                 :in [$ sel]
                 :where [[?e :some/attr]
                         [(missing? $ ?e :another/attr)]]
                 :offset 200
                 :limit 200}
        :args [(d/as-of db (:t (:db-after previous-db)))]})

  ;; but i'm not positive if the returned results sets are sorted stably, or
  ;; not, in general and with d/as-of. 
  
  ;; this is all to power a paginated api:

  ;; GET /my-resource returns something akin to:

  {:links {:next ".../my-resource?page[limit]=100&page[offset]=100&basis-t=001"
           :self ".../my-resource?page[limit]=100&page[offset]=0&basis-t=001"
           :last ".../my-resource?page[limit]=100&page[offset]=900000&basis-t=001"
           }}

  ;; where the links maintain the basis-t and offset/limit for future requests
  ;; to more pages of my-resource

lwhorton 2025-09-03T02:35:22.176519Z

(query and offset should be at the top level next to :query/:args keys, ignore the typo. though fwiw: https://docs.datomic.com/client/client-api.html#offset-and-limit lists those keys under query.)

cch1 2025-09-11T12:38:09.734069Z

Can you elaborate?

favila 2025-09-11T13:20:10.404009Z

You are repeating all the work of the query

cch1 2025-09-11T13:22:34.503919Z

Depending on the query, that seems like a small price to pay compared to the complexity of the double query approach. At the stage where my company is, DC server load costs very little compared to software complexity.

favila 2025-09-11T13:23:31.861739Z

OP has a “very large number of results”

cch1 2025-09-11T13:25:32.391459Z

Ironically, OP works for my team so I'm pretty comfortable with my statement. But even if he were a stranger, there are certainly complexity benefits to be had with an effective offset feature for some subset of Datomic users.

cch1 2025-09-11T13:26:13.086899Z

That's all hypothetical and I appreciate your insight into the datoms + query approach.

cch1 2025-09-11T13:30:04.304369Z

I think I can see a way to make it work, but how do you manage the variable number of results produced by the query for a given chunk size? It seems as though the offset into the datoms call depends on the number of results returned by the previous query -and given that the query results are not ordered or consistent, I'm struggling to envision a consistent algorithm to get effective pagination.

cch1 2025-09-11T13:33:07.747249Z

Am I seeing this right? If my limit is 50, I need to get a chunk that is bigger than 50 in order to ensure that the query returns at least 50 results. But now I don't know how to construct the next offset into datoms to ensure I don't skip anything.

cch1 2025-09-11T13:34:07.809249Z

If my query returns more than 50 results, I need to "back up" my offset into the datoms in order to not skip anything.

cch1 2025-09-11T13:34:34.125829Z

If my query returns less than 50, I need to iterate another cycle to ensure I get a total of at least 50. And then I am back to not knowing the offset into datoms.

cch1 2025-09-11T13:36:39.705129Z

This seems like it could get really complex if some eid query inputs produce more than one output.

favila 2025-09-11T13:36:57.739619Z

I'm focused on the OP's goal which sounds like "I want to process every single result, but a bit at a time". what i call "query chunking", the limit/offset (really segmenting/scatter-gather) is over the input not the results. I'm assuming no one cares about the page of results, just that they got everything and the query didn't fall over doing it

cch1 2025-09-11T13:37:35.578809Z

I am confident in saying that the OP needs to paginate results into browser.

cch1 2025-09-11T13:37:56.767549Z

I do see how chunking can be effective for processing though.

favila 2025-09-11T13:38:13.405219Z

so the browser's only pagination model is through a limit+offset index into results?

cch1 2025-09-11T13:38:19.923449Z

(where approximately sized chunks are good enough)

cch1 2025-09-11T13:38:24.184479Z

Yes.

favila 2025-09-11T13:38:36.368469Z

that's not a good pagination model for anything large

cch1 2025-09-11T13:38:46.628059Z

pagination model includes a basis t as well.

cch1 2025-09-11T13:39:11.797279Z

And each row in the output includes its offset (index actually, from the first result)... so there is a bit more flexibility than just offset/limit.

favila 2025-09-11T13:40:27.762649Z

can the "offset" in each row be an arbitrary value?

cch1 2025-09-11T13:40:32.638459Z

Yes

favila 2025-09-11T13:41:10.342309Z

so you can get a cursor model

cch1 2025-09-11T13:41:18.051089Z

(I do think comparability is required).

cch1 2025-09-11T13:41:23.400019Z

Yes, it's more of a cursor model.

favila 2025-09-11T13:43:37.227419Z

so the d/datoms call has something which you care about which is ordered (the legal person eid I think in this example). Use that as the offset associated with each result

favila 2025-09-11T13:44:29.671279Z

this makes the offset a semantic, ordered offset into the input, conveyed through the result, so the next offset can start at the next input item

favila 2025-09-11T13:45:45.013189Z

you can just use index-pull with a clojure filter at this point

cch1 2025-09-11T13:46:32.679189Z

The chunks of legal persons need to be processed through a query -it's not a simple matter of pulling them.

favila 2025-09-11T13:47:40.138759Z

very well, either way

cch1 2025-09-11T13:47:40.474809Z

And because the order of the query output is arbitrary (last input eid and it's index), knowing the last legal person on the page does not provide insight as to where to pluck the next chunk.

favila 2025-09-11T13:48:06.212229Z

I'm assuming you sort the input before you give it to the browser

cch1 2025-09-11T13:48:12.451369Z

Yes.

favila 2025-09-11T13:48:30.918689Z

your query result is [?legal-eid-for-offset, (pull ?result)]

favila 2025-09-11T13:48:34.149719Z

sort that

favila 2025-09-11T13:49:04.951389Z

the "offset" for each output row is the corresponding input row

cch1 2025-09-11T13:51:52.086869Z

The query gets a chunk (of ordered and tagged) eids that needs to produce at least N results (page size/limit). The query produces N + m results. The db client limits the output to N... but which eid of the N+m results (in random order) is the right eid to restart the next datoms call?

favila 2025-09-11T13:53:10.381689Z

my impression of this query was that it is a filtering query, so the input items are just filtered

cch1 2025-09-11T13:53:14.315989Z

I think this can only work if the query is simply a filter -no ...

cch1 2025-09-11T13:53:23.814999Z

We're on the same wavelength on this.

cch1 2025-09-11T13:53:33.026349Z

I'll check back with @lwhorton.

favila 2025-09-11T13:53:38.455539Z

I'm not talking about a generic facility, just this problem

cch1 2025-09-11T13:53:45.302649Z

Right.

favila 2025-09-11T13:54:04.607339Z

the "Generic facility" is what all other databases do at the limit: produce the entire result and sort it, then index into it

favila 2025-09-11T13:54:18.595759Z

the only thing datomic doesn't do is the "sort it" part

favila 2025-09-11T13:55:04.206919Z

if you are fine with getting the entire result (query intermediate result set size is not a problem), you can do the same thing yourself in your http handler

cch1 2025-09-11T13:55:13.649299Z

Which wouldn't even be a problem if there were consistency of the output for a given t.

cch1 2025-09-11T13:55:42.953499Z

I think the query set is too big to process it all in the db client in this case.

favila 2025-09-11T13:56:19.469729Z

> I think the query set is too big to process it all in the db client in this case. In which case, a generic limit/offset facility wouldn't save you, unless the query engine was smart enough to process in a particular order and terminate early

😃 1

favila 2025-09-11T13:56:35.949289Z

which some sql engines can do in some circumstances

favila 2025-09-11T13:57:20.480719Z

so my "chunking" scheme is just emulating in user-space the same strategy

favila 2025-09-11T13:58:08.615949Z

the lack of guarantee about result order is inherited from the lack of guarantee that j.u.HashSets give about sort order

favila 2025-09-11T14:01:02.976009Z

at the end of the day, you gotta sort somewhere. You can do it algorithmically via deterministic stepwise processing order (hard to square with datalog's set-orientation, and limits optimizations possible), via using sorted sets (adds an impl tax on every set op, incompatible with :with clause which returns a bag), at the end with a full result (which you can do yourself; having server do it for you is just to reduce bits-over-wire, not overall work done)

cch1 2025-09-11T14:03:16.468089Z

I am starting to the see the extent of the problem and the various workarounds possible. It seems as though it is critical in this "chunking" model that the query results be re-sorted by the same index used in the datoms call before limiting in order to ensure that the next offset does not skip anything or repeat. Now I'm trying to convince myself that ties on that index (that happen to fall on a page boundary) won't result in dupes/omissions....

cch1 2025-09-11T14:05:34.079209Z

I guess not, assuming that the results of a datoms call are consistent for a given basis-t.

favila 2025-09-11T14:06:12.144629Z

they are consistent: it's an inherently sorted order

✔️ 1

favila 2025-09-11T14:07:26.910759Z

This model assumes inputs are unique, ~~sortable~~ ordered, and there's exactly 0 or 1 result per input

favila 2025-09-11T14:08:32.841479Z

if any of those are not true you can maybe tweak it to make it true again

favila 2025-09-11T14:08:41.595899Z

of none of them are true you are back to getting the entire result and sorting

cch1 2025-09-11T14:08:46.401219Z

I can see that. I'm going to try to document this pattern and add a helper or two so that my team (including @lwhorton) don't have to re-learn the limitations and pattern each time.

favila 2025-09-11T14:09:34.826229Z

oh, and that the pagination model treats "offsets" as opaque

favila 2025-09-11T14:09:41.104139Z

that's important

cch1 2025-09-11T14:09:48.195699Z

This pattern is an "upgrade" to our existing pattern which is that once the result set it "too big" you only can support index-pull.

favila 2025-09-11T14:10:13.425239Z

index-pull is just this but there's exactly 1 result per input

✔️ 1

cch1 2025-09-11T14:10:16.873899Z

Our pagination model is pretty good in our (internal UI). I am slightly worried about the JSON:API -Luke's the expert there.

cch1 2025-09-11T14:11:52.267409Z

Thanks for taking the time, @favila, to give me a solid understanding of the limitations inherent in the Datomic limit/offset model, some great workarounds and a solid understanding of when they can be used.

cch1 2025-09-11T14:14:08.027989Z

I will tell you, frankly, that I told my team last week that we could trust that the offset into the q result set (for a given t) was stable because "otherwise it would be useless". I have learned a lot over the past week and while the lack of stability hurts, your workarounds mitigate the problem. I do wish the documentation made this clear.

favila 2025-09-11T14:32:49.801649Z

I just realized I've been saying "d/datoms" for the input when I really meant "seek-datoms"

favila 2025-09-11T14:33:04.747019Z

and cloud doesn't have seek-datoms (although most use cases can be emulated with index-pull)

cch1 2025-09-11T14:34:58.615049Z

I will check the documentation to see if I can see what the implications of this are.

favila 2025-09-11T14:42:52.649129Z

My "chunk input" scheme doesn't use offset on the initial call either. I'm suggesting something like

(->> (d/index-pull db {:index :aevt
                  :selector [:db/id]
                  :start [the-attr eid-offset]
                  :limit limit})
      (mapv :db/id))

or alternatively (if you want your "offsets" to be semantic ids instead of entity ids, or just want a different sort order)

(->> (d/index-range db {:attrid the-unique-attr
                        :start  semantic-id-offset
                        :limit  limit})
     (mapv :e))

This list of eids is the input to your query; and the query :find includes the input's offset value for each result (either the e or the semantic-id, whichever one you chose)

favila 2025-09-11T14:47:12.848959Z

copying query from earlier:

{:find [?e ;; or ?id, depending on which offset scheme you use
        (pull ?e [*])]
 :keys [offset result]
                 :in [$ [?e ...]]
                 :where [[?e :legal-person/id ?id]
                         [(missing? $ ?e :legal-person/designated-pay-into)]
                         [?o :ownership/owner ?e]
                         [?g :generator/rec-ownership ?o]]}

cch1 2025-09-11T14:47:16.981699Z

The second of those two is what I was imagining.

cch1 2025-09-11T14:48:30.102879Z

No sense in pulling the full entity on the index when the it will have to be matched up later with the results from the query anyways.

cch1 2025-09-11T14:48:50.003849Z

Might as well pull in the query.

favila 2025-09-11T14:53:05.366429Z

this query-filter is still simple enough that you could still do all filtering with result of index-pull and clojure code. so that option is still available

favila 2025-09-10T21:00:47.504949Z

offset is useful for apis that have ordered results, and somehow you are restarting consumption, having lost the original channel (async) or call stack (sync)

favila 2025-09-10T21:01:18.823709Z

it is a generic facility, so d/q gets it even though it doesn't make sense

favila 2025-09-10T21:03:32.878949Z

"chunking" query results via constrained inputs is the way to go. Extract your first :where clause, convert to chunks of d/datoms, reissue the query multiple times with a different chunk (set of eids, or eid range) as input each time.

➕ 1

cch1 2025-09-11T01:26:22.133119Z

That's a solution but notably more complex and "unexpectedly" necessary. The unexpected part could be mitigated by a note in the docs along the line that query result orderings are not consistent even for identical queries against the same database and thus limit + offset are not suitable for pagination."

favila 2025-09-11T02:23:54.738489Z

Even if order was stable, limit and offset would not be a good choice for query

Clojurians Log v2

datomic 2025-09-02