And another obscure error message: Unable to create catalog entry. Any idea what this might mean?
I wonder if this is relevant: https://clojurians.slack.com/archives/C03RZMDSH/p1722282495559669
I do run create-database as part of my system startup, so maybe Datomic almost re-created my database -a terrifying thought. Have I been foolish in assuming that create-database was idempotent?
for a given t value can i paginate through the results of a query using limit, offset with guaranteed stability? it seems kind of obvious that this is true, but i can't seem to find documentation that makes this explicit.
@lwhorton are you using cloud or datomic pro?
cloud (ions)
So forget my advice on :limit and :offset, @joe.lane is far better than me on Cloud and can help you better π
How big are "super large result sets"?
maybe 10 to 12k entities, for now, but thatll grow.
In 5 years, how big could it possibly be if you are wildly successful π
500k to 1M+
So, is the intent of the query really a scan? is it truly just looking for all entities which have :some/attr and are missing :another/attr or is it more complex than that? Will it change in the future to have more filtering criteria? The details matter here π
its to find a legalperson entity that has no designated-pay-into account (ref) but who has at any point in time been what we call a 'rec-owner'. we have a concept of ownership of a thing through time.
{:find [(pull ?e [*])]
:in [$]
:where [[?e :legal-person/id]
[(missing? $ ?e :legal-person/designated-pay-into)]
[?o :ownership/owner ?e]
[?g :generator/rec-ownership ?o]]}logically this is: "find us all legal people who we just can't pay (missing a pay-into), but who we might need to pay (have at some point owned a rec, which is the thing we buy from people)". designated-pay-into is a ref to an account entity ownership/owner is a ref to a legal-person entity generator/rec-ownership is a component (many) to the chronology of owners of recs for a given generator a generator is the thing producing recs so it's not terribly complicated, but it's not trivial imo.
And this is currently and will continue to be an ongoing business use-case rather than a cleanup operation to find missing? $ ?e :legal-person/designated-pay-into legal-persons and eventually add that attr to all ?es, correct?
solving my business process problems, lol ;). No you are totally right, the intent is that this query serves as step 1 in a process to amend unpayable people. But at some point for a variety of reasons that number of people can get out of hand. Imagine we acquire a few smaller sellers who have already 200k legal people each.
and also, even now when the size is about 10-12k entities, that is far too many for an api consumer. so while we can fit this into memory and leverage some alternative pagination strategy... we'd prefer not even bothering to sort results, and to use offset/limit if that is indeed their intended use case.
And, realistically, is there any low-latency p99 requirements or anything like that? If the query takes 100ms is that a big deal?
no not at all. it can take 10s i just need to be able to reliably paginate through all the results without missing people for a given basis-t.
And you don't care about order for the UI, just asking for "order" to support stable pagination to the UI correct?
correct. at this time we see no need to sort the results.
do :legal-person entitites have an external-id of some kind you can send to the UI?
external as in not a db/id? they have an id which is a v7 uuid.
And I assume it is indexed, yeah? nvm, it's cloud, everything is indexed.
is it possible that the documentation I am looking at is not for cloud/ions? it's a little confusing to me how/why offset would exist if the result sets aren't stably ordered in a cloud/ion deployment.
I think I would flip this around and paginate via d/datoms or d/index-range over the :legal-person/id , filtered by a predicate of missing-designated-pay-into and was-an-owner-of-a-rec
oh interesting. I'm going to have to go take a look at these. The only thing I've used so far is index pull. if I might ask: what is the thing making you hesitant about stable result sets?
is there something about cloud that is different than on-prem? or is this an undefined behavior and you don't want to over-promise? something else entirely?
I don't think we ever claim the results of d/query or d/q will be in the same order between 2 calls since the query engine has set-semantics.
i see. what is that chunk in the documentation about offset and limit for d/q? if the results arent ordered between two calls, isn't offset kind of unusable? I feel like I'm just missing something
Imagine you were consuming the results lazily over a wire. limit can say "only send me 10 results no matter how many you made on the backend" and chunk is "how many to send at a time"
ohh i see. that makes sense. thanks for all your help!
I'm struggling to understand the value of offset, @joe.lane in the q function of the API. If the order from the db is non-deterministic on every call of the same query and t value, then offset seems like nothing more than a fixed probe into a totally non-deterministic set.
The limit parameter still has some value, but offset seems ... useless.
Iβll have to investigate further before I can get back to you
You can paginate using t with guaranteed stability, that's correct.
How are you thinking in doing this pagination? We normally don't use limit and offset
that's nice to hear! i'm trying to run a query that's inevitably going to return a very large number of results. i'm ultimately returning those results over an api (paginated). i want to use limit/offset to keep my position in the results set. the query is not a candidate for index-pull, though, given its use of a missing? $ ... clause. this is actually my first time ever trying to use limit/offset not inside an index pull.
Hey @lwhorton , Iβm not convinced query is going to be deterministically ordered across multiple invocations unless you explicitly sort the results. Can you share your use cases and query and we can help you out?
ill see if i can work up a minimum use case and get back to you (probably tomorrow).
let me know if this is unclear:
;; normally i would use index-pull for faster results and pagination over
;; super large result sets.
;; base query
(d/q {:query '{:find [(pull ?e sel)]
:in [$ sel]
:where [[?e :some/attr]
[(missing? $ ?e :another/attr)]]}
:args [db]})
;; note that this is not index-pullable due to the use of `missing?`
;; i'd like to use offset and limit in succession
(d/q {:query '{:find [(pull ?e sel)]
:in [$ sel]
:where [[?e :some/attr]
[(missing? $ ?e :another/attr)]]
:offset 0
:limit 200}
:args [db]})
(d/q {:query '{:find [(pull ?e sel)]
:in [$ sel]
:where [[?e :some/attr]
[(missing? $ ?e :another/attr)]]
:offset 200
:limit 200}
:args [(d/as-of db (:t (:db-after previous-db)))]})
;; but i'm not positive if the returned results sets are sorted stably, or
;; not, in general and with d/as-of.
;; this is all to power a paginated api:
;; GET /my-resource returns something akin to:
{:links {:next ".../my-resource?page[limit]=100&page[offset]=100&basis-t=001"
:self ".../my-resource?page[limit]=100&page[offset]=0&basis-t=001"
:last ".../my-resource?page[limit]=100&page[offset]=900000&basis-t=001"
}}
;; where the links maintain the basis-t and offset/limit for future requests
;; to more pages of my-resource(query and offset should be at the top level next to :query/:args keys, ignore the typo. though fwiw: https://docs.datomic.com/client/client-api.html#offset-and-limit lists those keys under query.)
Can you elaborate?
You are repeating all the work of the query
Depending on the query, that seems like a small price to pay compared to the complexity of the double query approach. At the stage where my company is, DC server load costs very little compared to software complexity.
OP has a βvery large number of resultsβ
Ironically, OP works for my team so I'm pretty comfortable with my statement. But even if he were a stranger, there are certainly complexity benefits to be had with an effective offset feature for some subset of Datomic users.
That's all hypothetical and I appreciate your insight into the datoms + query approach.
I think I can see a way to make it work, but how do you manage the variable number of results produced by the query for a given chunk size? It seems as though the offset into the datoms call depends on the number of results returned by the previous query -and given that the query results are not ordered or consistent, I'm struggling to envision a consistent algorithm to get effective pagination.
Am I seeing this right? If my limit is 50, I need to get a chunk that is bigger than 50 in order to ensure that the query returns at least 50 results. But now I don't know how to construct the next offset into datoms to ensure I don't skip anything.
If my query returns more than 50 results, I need to "back up" my offset into the datoms in order to not skip anything.
If my query returns less than 50, I need to iterate another cycle to ensure I get a total of at least 50. And then I am back to not knowing the offset into datoms.
This seems like it could get really complex if some eid query inputs produce more than one output.
I'm focused on the OP's goal which sounds like "I want to process every single result, but a bit at a time". what i call "query chunking", the limit/offset (really segmenting/scatter-gather) is over the input not the results. I'm assuming no one cares about the page of results, just that they got everything and the query didn't fall over doing it
I am confident in saying that the OP needs to paginate results into browser.
I do see how chunking can be effective for processing though.
so the browser's only pagination model is through a limit+offset index into results?
(where approximately sized chunks are good enough)
Yes.
that's not a good pagination model for anything large
pagination model includes a basis t as well.
And each row in the output includes its offset (index actually, from the first result)... so there is a bit more flexibility than just offset/limit.
can the "offset" in each row be an arbitrary value?
Yes
so you can get a cursor model
(I do think comparability is required).
Yes, it's more of a cursor model.
so the d/datoms call has something which you care about which is ordered (the legal person eid I think in this example). Use that as the offset associated with each result
this makes the offset a semantic, ordered offset into the input, conveyed through the result, so the next offset can start at the next input item
you can just use index-pull with a clojure filter at this point
The chunks of legal persons need to be processed through a query -it's not a simple matter of pulling them.
very well, either way
And because the order of the query output is arbitrary (last input eid and it's index), knowing the last legal person on the page does not provide insight as to where to pluck the next chunk.
I'm assuming you sort the input before you give it to the browser
Yes.
your query result is [?legal-eid-for-offset, (pull ?result)]
sort that
the "offset" for each output row is the corresponding input row
The query gets a chunk (of ordered and tagged) eids that needs to produce at least N results (page size/limit). The query produces N + m results. The db client limits the output to N... but which eid of the N+m results (in random order) is the right eid to restart the next datoms call?
my impression of this query was that it is a filtering query, so the input items are just filtered
I think this can only work if the query is simply a filter -no ...
We're on the same wavelength on this.
I'll check back with @lwhorton.
I'm not talking about a generic facility, just this problem
Right.
the "Generic facility" is what all other databases do at the limit: produce the entire result and sort it, then index into it
the only thing datomic doesn't do is the "sort it" part
if you are fine with getting the entire result (query intermediate result set size is not a problem), you can do the same thing yourself in your http handler
Which wouldn't even be a problem if there were consistency of the output for a given t.
I think the query set is too big to process it all in the db client in this case.
> I think the query set is too big to process it all in the db client in this case. In which case, a generic limit/offset facility wouldn't save you, unless the query engine was smart enough to process in a particular order and terminate early
which some sql engines can do in some circumstances
so my "chunking" scheme is just emulating in user-space the same strategy
the lack of guarantee about result order is inherited from the lack of guarantee that j.u.HashSets give about sort order
at the end of the day, you gotta sort somewhere. You can do it algorithmically via deterministic stepwise processing order (hard to square with datalog's set-orientation, and limits optimizations possible), via using sorted sets (adds an impl tax on every set op, incompatible with :with clause which returns a bag), at the end with a full result (which you can do yourself; having server do it for you is just to reduce bits-over-wire, not overall work done)
I am starting to the see the extent of the problem and the various workarounds possible. It seems as though it is critical in this "chunking" model that the query results be re-sorted by the same index used in the datoms call before limiting in order to ensure that the next offset does not skip anything or repeat.
Now I'm trying to convince myself that ties on that index (that happen to fall on a page boundary) won't result in dupes/omissions....
I guess not, assuming that the results of a datoms call are consistent for a given basis-t.
they are consistent: it's an inherently sorted order
This model assumes inputs are unique, sortable ordered, and there's exactly 0 or 1 result per input
if any of those are not true you can maybe tweak it to make it true again
of none of them are true you are back to getting the entire result and sorting
I can see that. I'm going to try to document this pattern and add a helper or two so that my team (including @lwhorton) don't have to re-learn the limitations and pattern each time.
oh, and that the pagination model treats "offsets" as opaque
that's important
This pattern is an "upgrade" to our existing pattern which is that once the result set it "too big" you only can support index-pull.
index-pull is just this but there's exactly 1 result per input
Our pagination model is pretty good in our (internal UI). I am slightly worried about the JSON:API -Luke's the expert there.
Thanks for taking the time, @favila, to give me a solid understanding of the limitations inherent in the Datomic limit/offset model, some great workarounds and a solid understanding of when they can be used.
I will tell you, frankly, that I told my team last week that we could trust that the offset into the q result set (for a given t) was stable because "otherwise it would be useless". I have learned a lot over the past week and while the lack of stability hurts, your workarounds mitigate the problem.
I do wish the documentation made this clear.
I just realized I've been saying "d/datoms" for the input when I really meant "seek-datoms"
and cloud doesn't have seek-datoms (although most use cases can be emulated with index-pull)
I will check the documentation to see if I can see what the implications of this are.
My "chunk input" scheme doesn't use offset on the initial call either. I'm suggesting something like
(->> (d/index-pull db {:index :aevt
:selector [:db/id]
:start [the-attr eid-offset]
:limit limit})
(mapv :db/id))
or alternatively (if you want your "offsets" to be semantic ids instead of entity ids, or just want a different sort order)
(->> (d/index-range db {:attrid the-unique-attr
:start semantic-id-offset
:limit limit})
(mapv :e))
This list of eids is the input to your query; and the query :find includes the input's offset value for each result (either the e or the semantic-id, whichever one you chose)copying query from earlier:
{:find [?e ;; or ?id, depending on which offset scheme you use
(pull ?e [*])]
:keys [offset result]
:in [$ [?e ...]]
:where [[?e :legal-person/id ?id]
[(missing? $ ?e :legal-person/designated-pay-into)]
[?o :ownership/owner ?e]
[?g :generator/rec-ownership ?o]]}The second of those two is what I was imagining.
No sense in pulling the full entity on the index when the it will have to be matched up later with the results from the query anyways.
Might as well pull in the query.
this query-filter is still simple enough that you could still do all filtering with result of index-pull and clojure code. so that option is still available
offset is useful for apis that have ordered results, and somehow you are restarting consumption, having lost the original channel (async) or call stack (sync)
it is a generic facility, so d/q gets it even though it doesn't make sense
"chunking" query results via constrained inputs is the way to go. Extract your first :where clause, convert to chunks of d/datoms, reissue the query multiple times with a different chunk (set of eids, or eid range) as input each time.
That's a solution but notably more complex and "unexpectedly" necessary. The unexpected part could be mitigated by a note in the docs along the line that query result orderings are not consistent even for identical queries against the same database and thus limit + offset are not suitable for pagination."
Even if order was stable, limit and offset would not be a good choice for query