datomic

cch1 2025-07-23T15:57:14.993239Z

I'm seeing some strange behavior in our application that would be easily explained if d/sync's behavior was "unexpected". The docs/contract say that (d/sync conn t) returns a db at or after time t, possibly waiting until such a database is available on that peer/node. The documentation provides no guidance as to whether d/sync biases towards the latest database available on the node or the database at time t. It's easy to assume it biases towards the latest database, but is this always true? More in the 🧵.

cch1 2025-07-24T22:02:40.140159Z

I've tracked down my bug and it is indeed due to d/sync biasing towards an old database. Specifically, when using the Cloud client or the db-local client, I see something like this:

[(-> conn datomic.client.api/db :t) (:t (datomic.client.api/sync conn 1))]
=> [12113236 12113236]
But when using an ion client (from a remote REPL tunneled into my QG instance) I see this:
[(-> conn datomic.client.api/db :t) (:t (datomic.client.api/sync conn 1))]
=> [12113236 1]
This bias towards the old db was unexpected and burned me bad.

cch1 2025-07-24T22:04:18.732689Z

Technically d/sync is meeting the contract of its documentation. But the behavior is unexpected and requires mitigation to get the "bias" towards the "current" db.

favila 2025-07-24T22:05:30.771279Z

is "1" a real basis-t?

cch1 2025-07-24T22:06:14.225289Z

apparently so... because that comes from a real REPL session. Or at least it qualifies as a value for "> or =".

favila 2025-07-24T22:06:29.687449Z

can we see the full db values in the ion case?

cch1 2025-07-24T22:06:48.947869Z

Sure.

favila 2025-07-24T22:07:04.659999Z

I don't believe that the db's basis-t is actually 1

cch1 2025-07-24T22:08:55.545669Z

A screen shot of my remote REPL session just completed:

favila 2025-07-24T22:09:40.045399Z

I'm wanting to see the thing returned from db and sync, not just the :t of it

cch1 2025-07-24T22:10:33.622419Z

OK.

favila 2025-07-24T22:13:31.261449Z

so what can that db see?

cch1 2025-07-24T22:14:04.263479Z

Certainly not what the "current" db sees. And that is how I got burned.

cch1 2025-07-24T22:14:47.244359Z

Of course I used basis-t of 1 to illustrate the problem. Doesn't matter what the value is, the bias is towards that basis-t, not current.

favila 2025-07-24T22:15:11.963259Z

that's not a "bias", it's just returning whatever you put in, right? What if you put in a larger number?

cch1 2025-07-24T22:17:06.614079Z

I call it a bias since the docs leave open the possibility of returning any value from the basis t argument through max t. Anywhere on that spectrum is technically meeting the specs of the doc. Where on that spectrum the actual db's t value lands is what I call the bias. Please give me another word to describe the "leaning" towards one end of the spectrum or the other!

cch1 2025-07-24T22:17:24.529669Z

If I put in a larger number, it returns the db at that larger number.

cch1 2025-07-24T22:17:31.965899Z

(but not current)

cch1 2025-07-24T22:17:37.665949Z

(all this only with the ion client)

cch1 2025-07-24T22:18:47.520079Z

Here's an example with a larger number.

cch1 2025-07-24T22:19:51.513239Z

The cloud client biases towards current. So does db-local. The ion client biases towards the t value argument. I'm not sure about peer.

favila 2025-07-24T22:19:52.531039Z

so it doesn't know the actual current t of the db, it's giving you a handle with a required minimum t. What happens if you try to query with the db with a T that is in the future?

cch1 2025-07-24T22:20:31.120869Z

cch1 2025-07-24T22:20:36.117309Z

Yikes!

favila 2025-07-24T22:20:51.345859Z

exactly, so now query with that 99999999 db

favila 2025-07-24T22:21:02.161509Z

(which I assume is in the far future/not reached yet)

cch1 2025-07-24T22:23:18.558629Z

I'm going to do that... but my point is that the bias is unexpected and inconsistent even if technically correct. It means using sync with the cloud client is equivalent to sync followed by db with the ion client (that's how I'm working around this behavior).

cch1 2025-07-24T22:27:57.906759Z

Using that db yields an exception:

Execution error (ExceptionInfo) at datomic.core.anomalies/throw-if-anom (anomalies.clj:94).
Database does not yet have t={:database-id "e684b9c0-af66-450f-b29b-da7e6f080575", :t 9999999999999}

favila 2025-07-24T22:35:24.578049Z

sync returns a database that includes T, without communication. If client does not know T and cannot communicate about it, what would it do besides this?

cch1 2025-07-24T22:37:16.892509Z

I don't think we are seeing this the same way. The cloud client (not ion client) does the expected thing: returns the latest database available without communication that also includes t. The ion client does not.

cch1 2025-07-24T22:38:18.273099Z

It returns the database at t. Despite having a newer one "on-hand".

favila 2025-07-24T22:40:44.435299Z

I understand that you want sync to return value whose :t is "max(oldest-t-client-might-already-know-about-somehow, sync-t-param)", but that is not what it guarantees.

cch1 2025-07-24T22:41:20.355749Z

I totally agree. This is not a bug per the docs. But it is surprising and inconsistent (across clients).

favila 2025-07-24T22:42:04.554069Z

maybe ion-client could match this (I don't know--I don't know what state it has available), but the behavior has to devolve in at least some cases to "I don't know this T, so I'll echo back what you asked for and let that serve as a precondition for future queries"

favila 2025-07-24T22:42:22.992179Z

and a truly stateless client could never do anything other than echo back the T

cch1 2025-07-24T22:46:14.515109Z

Agreed. It would be more useful if d/sync never returned a database older than what d/db returns (on the same thread). No extra cost, but provides a more useful feature than the current behavior.

favila 2025-07-24T22:47:39.259589Z

d/conn requires communication

cch1 2025-07-24T22:47:43.226789Z

The current ion behavior is akin to as-of.

cch1 2025-07-24T22:48:30.437399Z

I was not aware that d/conn required communication... that does change my perception.

favila 2025-07-24T22:49:16.083659Z

A lot of your expectations sound very peer-api-like

cch1 2025-07-24T22:49:45.394549Z

The irony then is that the cloud client does the "expected" thing.

favila 2025-07-24T22:50:10.754139Z

this should not be the same as "as-of"

favila 2025-07-24T22:50:59.058179Z

:t is a basisT limit; as-of-t is a filter. they are different keys on the db

cch1 2025-07-24T22:51:21.929049Z

Whoops... I realized I mean d/db... you said conn.... not sure if we were talking about the same thing.

cch1 2025-07-24T22:51:28.010659Z

(I'm going to update my comment above)

cch1 2025-07-24T22:52:33.907459Z

To restate: It would be more useful if d/sync never returned a database older than what d/db returns (on the same thread). No extra cost, but provides a more useful feature than the current behavior.

favila 2025-07-24T22:53:13.253409Z

IDK what thread means with a fundamentally async, wire-compatible api

cch1 2025-07-24T22:53:17.571329Z

(sorry for the confusion... I wrote d/conn earlier... which doesn't even exist.)

favila 2025-07-24T22:53:58.003249Z

perhaps modified to "older than what (d/db conn) returned the last time it was called"?

โœ”๏ธ 1
favila 2025-07-24T22:54:04.525649Z

for same object conn

โœ”๏ธ 1
cch1 2025-07-24T22:54:14.556879Z

? Consider this code:

[(d/db conn) (d/sync conn t)]
both expressions execute on the same thread.

favila 2025-07-24T22:54:30.470159Z

the expressions do, but where does the work happen?

cch1 2025-07-24T22:54:35.625259Z

Don't care.

favila 2025-07-24T22:54:42.600309Z

in process? in another thread? on another machine?

favila 2025-07-24T22:55:19.537229Z

this requirement also makes stateless conns impossible to implement

cch1 2025-07-24T22:57:12.370489Z

However it works with db-local and the cloud client would be really really useful for the ion client. If not, I have to hack around the differences with d/db right after d/sync.

cch1 2025-07-24T22:57:39.404149Z

If that means a stateful connection... that is frustrating.

cch1 2025-07-24T22:58:41.508189Z

It's not a huge deal, I can just write my own wrapper around d/sync to get consistent behavior.
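
A minimal sketch of such a wrapper, assuming the workaround described earlier in the thread (call sync, then db, and keep whichever value has the larger :t). The name sync-latest and the function-valued parameters sync-fn and db-fn are hypothetical and exist only so the sketch runs without a Datomic connection; with the client API one would presumably pass d/sync and d/db. It assumes, as in the REPL snippets above, that :t can be read off the returned db value.

```clojure
(defn sync-latest
  "Returns a db that includes t and is never older than what db-fn
  reports for the same conn. sync-fn and db-fn are hypothetical
  stand-ins for d/sync and d/db (assumption, not a Datomic API)."
  [sync-fn db-fn conn t]
  (let [synced  (sync-fn conn t)   ; satisfies the ">= t" contract, may wait
        current (db-fn conn)]      ; whatever the client considers current
    ;; Keep the newer of the two. If t is ahead of current, the synced
    ;; value is returned unchanged, so the precondition on t still holds.
    (if (> (:t current) (:t synced))
      current
      synced)))

;; Stubs imitating the ion-client behavior reported in this thread:
;; sync echoes back the requested t, db reports the latest known t.
(def echo-sync  (fn [_conn t] {:t t}))
(def current-db (fn [conn] {:t (:latest conn)}))

(sync-latest echo-sync current-db {:latest 12113236} 1)
;; => {:t 12113236}
```

With a real connection the call would look like (sync-latest d/sync d/db conn t). This only evens out the difference between clients; it does not replace using :db-after for read-after-transact.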

cch1 2025-07-24T22:59:07.924649Z

A note in the docs would have saved me hours of debugging. Just saying....

favila 2025-07-24T22:59:10.993339Z

It would really be better to use a :db-after

cch1 2025-07-24T23:00:08.737269Z

That's clearly the ultimate and right solution. And I do that in many situations. But in this particular situation it is impractical (still possible, just really impractical).

cch1 2025-07-24T23:00:28.788359Z

Much cheaper to wrap d/sync.

cch1 2025-07-23T16:00:35.026849Z

I have confirmed that d/sync returns the latest database in this simple case:

(let [{conn :datomic/connection :as system} @user/sysref]
  [(-> conn d/db :t) (:t (d/sync conn 11992651))])
[12090046 12090046]
This shows that despite passing in an old t value, sync returns the latest db available on the node. This is the expected (but undocumented) behavior. This ^ test was using the client API against a cloud server. I wonder if the behavior might be different in some extreme cases (maybe t parameter and t of current db are "close"), or when running with an ion connection (the case where I am seeing strange behavior)...

Joe Lane 2025-07-23T16:10:01.300109Z

Per https://docs.datomic.com/transactions/client-synchronization.html#sync > sync takes a basis point t, and it returns a database value that includes point t.

cch1 2025-07-23T16:12:33.486829Z

Right, which leaves open the question of whether the bias is towards the latest t or the t parameter. IOW:
A. ...that includes t and is as recent as possible, or
B. ...that includes t and is as close to t as possible.
I think the B interpretation is weird, but my app's behavior suggests that might be happening. Could also be a bug in my app...

favila 2025-07-23T16:33:32.542049Z

the bias is toward no communication, so it's going to be whatever the node has on-hand unless that's not good enough

favila 2025-07-23T16:34:27.585689Z

IOW sync-with-t is designed to be as cheap as possible while still satisfying the docstring. Any biases are impl details

cch1 2025-07-23T16:38:47.600079Z

Is it true that if a node has db at tA then it "has" all the dbs at tB where tB < tA?

favila 2025-07-23T16:39:35.732749Z

Depends on what you mean by "db"

favila 2025-07-23T16:39:58.436249Z

semantically, the basis t of a db contains all information before it, because that's datomic's time model

cch1 2025-07-23T16:40:49.067639Z

Right. So the sync call can return the db at tA or tB with no communication.

cch1 2025-07-23T16:41:14.740449Z

(assuming it has on hand the later of those two)

favila 2025-07-23T16:42:33.094979Z

> sync takes a basis point t, and it returns a database value that includes point t.

favila 2025-07-23T16:42:52.730169Z

to "include it" means >= t

cch1 2025-07-23T16:43:11.932199Z

I got that. My lack of understanding is around the bias.

cch1 2025-07-23T16:44:02.952839Z

I have some evidence that the bias is towards t, not towards the latest on-hand.

favila 2025-07-23T16:45:06.314099Z

It's not a "bias", you are just seeing relativistic effects of information propagation

favila 2025-07-23T16:45:43.266699Z

nodes aren't holding on to older db values just to satisfy sync calls; they always have only the latest they know about

cch1 2025-07-23T16:48:03.468739Z

I use the term bias loosely... the docs do not provide any information one way or the other and, strictly speaking, the behavior I am seeing meets the contract of the docs. BUT, I would LIKE to know if it is possible that asking for a sync at time t even when the node holds a db well after t might sometimes return something other than the latest.

favila 2025-07-23T16:48:41.200299Z

today it does not; no guarantees about tomorrow, because it still satisfies the contract either way

cch1 2025-07-23T16:49:04.234079Z

That is great to know and implies my problem is elsewhere.

favila 2025-07-23T16:49:34.648229Z

The purpose of sync is to make sure the "light cone" of some other process's causality has reached you.

cch1 2025-07-23T16:49:50.481679Z

That was my understanding as well and that is how I'm using it.

cch1 2025-07-23T16:53:51.003919Z

Some context: ten threads (from ten concurrent Lambdas handling an SQS queue) all do a sync on the same t value and then CAS on a database value to refresh a single OAuth token. One thread succeeds, the other nine fail and are requeued. Some short time later, the remaining nine do the exact same thing. If the sync returns the latest db, then they see the work done by the thread that refreshed the token and just use the fresh token. But what I'm seeing is that the nine threads are trying to refresh the token again -which would be the behavior if the "bias" was towards t instead of the latest (with the already-refreshed token).

cch1 2025-07-23T16:54:38.276729Z

It's probable that I have a bug or some other unexpected competition between threads, ion nodes and the db that is causing the issue, so the hunt goes on.

favila 2025-07-23T16:57:06.714539Z

this seems like a perfectly reasonable thing that could happen?

cch1 2025-07-23T16:57:21.030009Z

Bias towards t instead of latest?

favila 2025-07-23T16:57:33.276799Z

no, you sync on a t, but something happened since then

favila 2025-07-23T16:57:41.634549Z

that invalidates what you want to do

cch1 2025-07-23T16:57:49.412139Z

I don't think I'm being clear.

favila 2025-07-23T16:59:43.389439Z

the 9 failing threads are retrying with the same original t parameter? (which doesn't include the refreshed token)

cch1 2025-07-23T17:00:59.225099Z

In the first round, I expect all ten threads to read (after sync) that the token needs to be refreshed. I then expect one thread to succeed and nine to fail on the CAS (gating the refresh to be serial). The nine that failed the CAS are requeued ... Some time later the nine messages are tried again -probably in a matter of ~10 seconds and probably concurrently. I expect all nine to sync, get the latest db and find a fresh token with no need to refresh. Instead, I'm seeing the nine compete for the CAS to refresh again.

favila 2025-07-23T17:01:31.726199Z

What is the parameter to sync in each case?

cch1 2025-07-23T17:01:33.507529Z

Yes, the nine failing threads are retrying with the same t. But using sync, not as-of!

favila 2025-07-23T17:02:07.884949Z

so, whatever node answered the sync has seen T, but hasn't seen the T with the refreshed token yet ("latest")

cch1 2025-07-23T17:02:52.268629Z

possibly... but my query group (where this is running) only has one instance.

cch1 2025-07-23T17:03:08.383709Z

So concurrency is thread-wise in one node -not across nodes.

favila 2025-07-23T17:04:14.933149Z

so that is suspicious, but regardless in a robust concurrent system you use the transaction failing to refresh as the way to learn latest t and either sync on it or just use the token it returns.

favila 2025-07-23T17:04:38.395599Z

transaction "aborts successfully"--the work you wanted done was already done, here's what it was

favila 2025-07-23T17:04:59.301769Z

the transactor is the only place where there are no relativistic effects possible

cch1 2025-07-23T17:05:00.128369Z

Yes, I could do that. And maybe I need to. But it is a PITA because I need to enqueue a new message instead of simply failing to process the existing message.

favila 2025-07-23T17:05:55.055869Z

the existing message can't use a different token that it learned via refresh attempt?

cch1 2025-07-23T17:06:33.483169Z

The message is to use the token to retrieve some data from an API. First, ensure there is a fresh token and then use it to get the data.

cch1 2025-07-23T17:07:31.170619Z

Ensuring the token is fresh may require a refresh, and that must be gated to prevent token loss.

cch1 2025-07-23T17:09:50.615329Z

So returning the exact same message to the queue (with a t value in the message) is semantically reasonable and very easy. Updating the t value in the message would change the logic quite a bit. FWIW, the t value is in the message to avoid some relativistic problems I have encountered where node A writes to the db and enqueues an SQS message that is handled by node B before the db segments from A have propagated to B.

favila 2025-07-23T17:11:15.135929Z

IDK what to tell you. If token refresh needs to be atomic, you gotta do that work in a transactor. You can't just "sync harder"

cch1 2025-07-23T17:12:14.849909Z

CAS protects against concurrent token refresh nicely. No need for a tx-fn!

cch1 2025-07-23T17:12:31.123669Z

(each token has a fingerprint and I CAS on that)

cch1 2025-07-23T17:12:53.579889Z

[:db/cas attr fingerprint fingerprint]

favila 2025-07-23T17:13:03.077629Z

the pain you are describing is that if CAS fails, you know you are out of date, but you aren't resetting your sync moment

โœ”๏ธ 1
cch1 2025-07-23T17:13:41.830859Z

Right... because the sync moment is a "not before" concept and the token is an "at least" concept. They seem to be at odds ... if the "bias" is not towards the latest db on-hand.

cch1 2025-07-23T17:15:04.759099Z

If the bias is towards latest, this should not be a problem. On retry, the nine threads see the fresh token and just do the work.

favila 2025-07-23T17:15:13.933589Z

again, "latest db on hand" --- of whom? You want the latest db the transactor saw (because your tx failed), not what some node saw

favila 2025-07-23T17:15:39.381469Z

even then, you could still fail, because something happened in between issuing a transaction and the transactor attempting to commit it

cch1 2025-07-23T17:15:47.389099Z

I want to see the latest db that the node saw -the one from the one thread that succeeded in updating the token!

favila 2025-07-23T17:16:23.727919Z

and how will you know what that db is, if not told by the transactor?

cch1 2025-07-23T17:16:32.058199Z

If ion node A succeeds in a transaction (from a thread updating the token), I would hope that other threads doing a sync would get nothing earlier than that db.

cch1 2025-07-23T17:17:20.949989Z

I was under the impression that if node A succeeds in a transaction, its basis t (from d/db) will not be earlier than the db-after of that transaction.

cch1 2025-07-23T17:17:32.336979Z

... for all threads.

favila 2025-07-23T17:20:10.708139Z

even in pro (where that assumption is safer, because I know there's a single in-process atom holding the db and I know if I saw the tx-data that atom has already been updated), I would use the :db-after, not d/db again on the connection. In cloud I have no idea: maybe you hit a different node, maybe each ion has its own db state 🤷

cch1 2025-07-23T17:20:27.468849Z

One node only in this case.

favila 2025-07-23T17:21:27.530969Z

I'm saying it's not correct to write code that makes those assumptions in a distributed system, regardless of circumstances where one reasons it might/should work anyway.

favila 2025-07-23T17:21:52.792749Z

you must always be prepared for a transactor to tell you the world has moved on

favila 2025-07-23T17:23:02.472299Z

normally the window for this is quite small (milliseconds), but always exists. the anomaly here is that you say the window is multiple seconds

favila 2025-07-23T17:23:42.741979Z

that's a performance problem though, not a correctness one

cch1 2025-07-23T17:23:43.958579Z

If that is absolutely true, then you could never use (d/db conn). You would need to store the t value and serialize your work externally. Without some assurance that (d/db conn) returns a db after the last transaction on the same node and thread, (d/db conn) could always return the exact same db... forever.

favila 2025-07-23T17:24:02.262459Z

> (d/db conn) could always return the exact same db... forever.

favila 2025-07-23T17:24:12.844129Z

transactions must always be prepared for that

cch1 2025-07-23T17:24:52.311129Z

I am not talking so much about transacting but rather read-after-transact.

favila 2025-07-23T17:25:05.236309Z

read-after-transact is what :db-after is for 🙂

cch1 2025-07-23T17:27:33.567929Z

I get that -absolutely. But the statistics in my case seem to be particularly unlikely.

cch1 2025-07-23T17:28:42.354299Z

I'm prepared for some "failures" where the SQS messages are processed due to db propagation delays. But in my case, the frequency is disturbing.