datomic

Tobias Sjögren 2025-04-22T11:38:07.142189Z

> "Datomic's current implementations do not have value types suitable for storing large documents, images, audio, or video. It is common practice in Datomic to store such data in a key/value store such as S3 and then store pointers to that data in Datomic." ( https://docs.datomic.com/datomic-overview.html )

I wonder if/how such large documents are involved in Datomic transactions. How do the pointers get stored, using transactions? What do the pointers consist of?

ghadi 2025-04-22T11:50:11.205499Z

simplest thing is a uuid

ghadi 2025-04-22T11:50:42.182879Z

representing an object in S3

Tobias Sjögren 2025-04-22T11:53:38.137339Z

The binary file is stored in S3 with an associated UUID and that UUID is what is transacted instead of the actual file ?

Tobias Sjögren 2025-04-22T12:02:03.344979Z

The binary file itself is not part of the transaction ?

ghadi 2025-04-22T12:02:22.030719Z

Correct

Tobias Sjögren 2025-04-22T12:02:53.730219Z

On both questions ?

✔️ 1
ghadi 2025-04-22T12:03:13.757219Z

put the object to s3, include the uuid with the db transaction

Tobias Sjögren 2025-04-22T12:04:05.092689Z

👍

Tobias Sjögren 2025-04-22T12:15:00.378489Z

At what point is the binary file uploaded to S3 - before or after the transaction?

Hendrik 2025-04-22T12:27:30.694069Z

Depends on the use case. I do it before the transaction. The advantage is that the implementation is simple and your keys (pointers) in the db are always valid. The downside is that you may get some orphaned data in S3.

Tobias Sjögren 2025-04-22T12:27:36.300739Z

Asynchronously, in parallel with the transaction ?

Hendrik 2025-04-22T12:31:31.988129Z

> Asynchronously, in parallel with the transaction ?

I wouldn’t do it in parallel unless you have very good reasons to do so. I can only think of a very tight time budget to persist data, where you cannot wait a few ms to do it sequentially.

Tobias Sjögren 2025-04-22T12:32:12.390169Z

ms applies to smaller binary files I guess - what about a 1 GB file ?

Hendrik 2025-04-22T12:34:01.595319Z

ms applies to persist the transaction in datomic

cch1 2025-04-22T12:35:27.186459Z

I do it in two steps... first transact a db entity to represent the document, then store the doc in S3 (with metadata linking to the DB entity) and, on the S3 async callback, store the s3 coordinates (bucket + key).

Tobias Sjögren 2025-04-22T12:36:03.893899Z

Does every other transaction need to wait for your binary file upload to finish ?

cch1 2025-04-22T12:36:11.026869Z

More complicated, but if something goes wrong with the second transaction, at least the s3 object carries a ref to the db entity.

cch1 2025-04-22T12:36:53.581429Z

Because I update the db entity on the async callback, the other work on the thread continues normally.

cch1 2025-04-22T12:37:07.675719Z

It's a kind of saga pattern, I suppose.

Tobias Sjögren 2025-04-22T12:38:16.680609Z

Do you send an UUID to the S3 as a temp ID or the permanent ID ?

cch1 2025-04-22T12:39:19.637729Z

I send a UUID. Not sure what you mean by the "permanent ID" - certainly not the :db/id because of the difficulty of preservation across decants.

Tobias Sjögren 2025-04-22T12:39:47.737989Z

Datomic sets the UUID, not S3 ?

Tobias Sjögren 2025-04-22T12:40:10.669719Z

(i'm a Datomic outsider...)

cch1 2025-04-22T12:43:49.214389Z

1. My transaction data includes a client-generated UUID and the document entity has a schema attr for it. 2. Then, I store the S3 document with that same UUID as the key name. Asynchronously. 3. On the async callback, I update the same db entity to store S3-sourced metadata, which confirms storage.
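
The three steps above can be sketched in plain Clojure. This is only an illustration under stated assumptions: the atoms `db` and `bucket` are in-memory stand-ins, and every function name and attribute keyword below is hypothetical, not a Datomic or S3 API. A real version would presumably replace the `swap!` calls with transactions and an S3 client.

```clojure
;; In-memory stand-ins (assumptions for illustration, not Datomic or S3 APIs).
(def db (atom {}))      ; uuid -> document entity map
(def bucket (atom {}))  ; object key -> {:bytes ... :metadata ...}

(defn transact-document!
  "Step 1: record the document entity under a client-generated UUID."
  [doc-id]
  (swap! db assoc doc-id {:document/id doc-id :document/status :pending}))

(defn put-object-async!
  "Step 2: store the bytes under the same UUID, off the calling thread.
  The object metadata carries the entity's UUID, so a stray object can be
  traced back to its entity."
  [doc-id content on-done]
  (future
    (swap! bucket assoc (str doc-id)
           {:bytes content :metadata {:document-id doc-id}})
    (on-done {:bucket "example-bucket" :key (str doc-id)})))

(defn confirm-storage!
  "Step 3: second update, run from the upload callback."
  [doc-id coords]
  (swap! db update doc-id merge {:document/s3-bucket (:bucket coords)
                                 :document/s3-key    (:key coords)
                                 :document/status    :stored}))

(defn store-document!
  "Runs steps 1 and 2 and returns the UUID plus the pending upload."
  [content]
  (let [doc-id (java.util.UUID/randomUUID)]
    (transact-document! doc-id)
    {:doc-id doc-id
     :upload (put-object-async! doc-id content #(confirm-storage! doc-id %))}))
```

The caller gets the UUID back immediately and can keep working; dereferencing `:upload` is only needed when something has to wait for confirmation, for example `@(:upload (store-document! (.getBytes "hello")))`.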

🙌 1
Tobias Sjögren 2025-04-22T13:31:49.949239Z

How do you do it @favila?

2025-04-22T14:13:55.023179Z

We’re doing the same 3 steps as above: generate a unique client-side UUID for the S3 document, put it to S3 and then transact. If the transact fails for any reason we can do a manual rollback of the S3 document, or simply let it live there as it's nominally cheap.

2025-04-22T14:14:21.396949Z

we’re doing this almost 500k times per day, and Datomic has good handling of the UUID side if you use squuids

ghadi 2025-04-22T14:16:09.067169Z

it all depends on the application semantics

ghadi 2025-04-22T14:17:34.411949Z

some use-cases it's ok to upload after db transaction, some it's not

👍 1
Tobias Sjögren 2025-04-22T15:38:26.532269Z

@bhurlow From what I understood from @cch1, he performs a separate transaction before uploading the document - that you don't do ?

Tobias Sjögren 2025-04-22T15:39:40.891949Z

@ghadi Do you have two obvious examples of the two options (upload before/after) ?

2025-04-22T15:48:41.617999Z

ah you’re right, we’re doing upload to s3 optimistically first, then transact the “link” to the document subsequently

Tobias Sjögren 2025-04-22T15:50:00.027749Z

A first transaction to create the document entity, then upload to S3, and then a second transaction ?

2025-04-22T17:05:18.649269Z

we’re performing the upload first before any transaction occurs. Once the upload is complete we transact it into Datomic. This made sense for our application, especially since this particular transaction is unlikely to have any conflicts
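
A minimal sketch of the upload-first variant, including the two options mentioned for a failed transaction (roll the object back, or leave it as a cheap orphan). The atoms and function names are hypothetical in-memory stand-ins, not Datomic or S3 APIs, and `java.util.UUID/randomUUID` is used only to keep the example dependency-free where the thread suggests squuids.

```clojure
;; In-memory stand-ins (assumptions for illustration, not Datomic or S3 APIs).
(def bucket (atom {}))  ; object key -> bytes
(def db (atom #{}))     ; set of transacted object keys

(defn put-object! [k content] (swap! bucket assoc k content))
(defn delete-object! [k] (swap! bucket dissoc k))

(defn store-document!
  "Uploads first, then records the pointer by calling `transact!`.
  `transact!` is a caller-supplied function of the key and may throw."
  [transact! content {:keys [rollback?]}]
  (let [k (str (java.util.UUID/randomUUID))]
    (put-object! k content)
    (try
      (transact! k)
      k
      (catch Exception e
        ;; The pointer was never recorded, so the object is an orphan.
        ;; Either remove it, or leave it in place since storage is cheap.
        (when rollback? (delete-object! k))
        (throw e)))))
```

A call such as `(store-document! #(swap! db conj %) (.getBytes "report") {:rollback? true})` returns the new key. Because the pointer is written only after the upload succeeds, a key found in the database always refers to an existing object.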

Tobias Sjögren 2025-04-25T13:19:59.689429Z

If the document uploaded to S3 is considered a component and owned by some other entity, do you delete the S3 document if its parent is deleted ?

cch1 2025-04-25T13:34:20.601219Z

Doing side-effect things in the transactor is not something I would undertake lightly. My approach for this kind of thing is to use something like a transactional outbox that is monitored by a batch job that uses CAS to "claim" the job, do the work and then delete the job. You could use an external work queue as well, but with the additional risk that queuing up the work could fail.

cch1 2025-04-25T13:35:12.650479Z

Whatever approach you take to have near-transactional removal of the document, make sure you don't bog down your transactor.

Tobias Sjögren 2025-04-25T13:36:44.453559Z

Both uploading to and removing from S3 is outside transaction though?

cch1 2025-04-25T13:38:17.580059Z

With some work, you could do it from within a transaction function. But a transactional outbox approach where the actual blocking IO is performed asynchronously from the create/delete transaction is a safer approach.

Tobias Sjögren 2025-04-25T13:40:31.496639Z

If you go with the outbox approach, do you leave the owned file as an orphan, or do you remove it?

favila 2025-04-25T13:41:45.572129Z

Datomic preserves history, including these s3 file names. That suggests that just like datomic the default position should be not to delete the file

👍 1
cch1 2025-04-25T13:41:46.481519Z

With the outbox approach, when the transaction to delete the document entity completes, it also creates an outbox message entity. A batch job picks up such jobs (with CAS), deletes the S3 document and removes the job.
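
The outbox flow described here can be sketched with atoms. Everything below is an assumption for illustration: `db` and `bucket` are in-memory stand-ins, the job shape and function names are hypothetical, and the claim step uses `compare-and-set!` where a Datomic version would presumably use a `:db/cas` transaction.

```clojure
;; In-memory stand-ins (assumptions for illustration, not Datomic or S3 APIs).
(def db (atom {:documents {"doc-1" {:s3-key "doc-1"}}
               :outbox    {}}))
(def bucket (atom {"doc-1" "bytes"}))

(defn delete-document!
  "One atomic update removes the entity and enqueues the follow-up job."
  [doc-id]
  (let [job-id (java.util.UUID/randomUUID)]
    (swap! db (fn [state]
                (let [s3-key (get-in state [:documents doc-id :s3-key])]
                  (-> state
                      (update :documents dissoc doc-id)
                      (assoc-in [:outbox job-id]
                                {:op :delete-object :s3-key s3-key :owner nil})))))
    job-id))

(defn claim-job!
  "Compare-and-set on the job's :owner; true only if it was still unclaimed."
  [job-id worker]
  (loop []
    (let [state @db
          job   (get-in state [:outbox job-id])]
      (cond
        (nil? job)           false
        (some? (:owner job)) false
        (compare-and-set! db state
                          (assoc-in state [:outbox job-id :owner] worker)) true
        :else (recur)))))

(defn run-outbox!
  "Batch job: claim each pending job, do the blocking work, remove the job."
  [worker]
  (doseq [[job-id job] (:outbox @db)
          :when (claim-job! job-id worker)]
    (swap! bucket dissoc (:s3-key job))
    (swap! db update :outbox dissoc job-id)))
```

The delete itself never touches the object store; only `run-outbox!` does, so the slow I/O stays off the path of the deleting transaction, and two workers cannot both claim the same job.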

favila 2025-04-25T13:42:21.583489Z

The file is not orphaned if datomic still has a reference to it

cch1 2025-04-25T13:44:05.181839Z

We use this transactional outbox for lots of things, not just deleting S3 docs: firing off webhooks, uploading reports of events, etc. Since keeping S3 docs around is cheap, @favila's approach of simply leaving it around seems reasonable (as long as you don't have contention for names in a bucket... you don't, right?!)

Tobias Sjögren 2025-04-25T13:48:33.984089Z

Considering S3 files as append-only seems like a good starting point.

cch1 2025-04-25T13:57:16.604979Z

It's certainly the easiest. In our case, we already had the transactional outbox pattern in place so it was easy to add a message to remove the S3 file. Without that infrastructure, I would leave the file alone.

👍 1
Tobias Sjögren 2025-04-25T14:14:53.952089Z

Do you allow the same file to be uploaded to S3 more than once?

cch1 2025-04-25T14:15:17.577439Z

No. We enforce uniqueness on the SHA of the doc contents.
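
A minimal sketch of deduplicating on a content hash. The message does not say which SHA variant is used, so SHA-256 is an assumption here, and the `seen` atom is a hypothetical stand-in for what would presumably be a unique attribute in the database.

```clojure
(import 'java.security.MessageDigest)

(defn sha-256-hex
  "Hex-encoded SHA-256 digest of a byte array."
  [^bytes content]
  (let [digest (.digest (MessageDigest/getInstance "SHA-256") content)]
    (apply str (map #(format "%02x" %) digest))))

;; sha -> object key; an in-memory stand-in for a uniqueness constraint.
(def seen (atom {}))

(defn register-upload!
  "Returns the key already recorded for identical content, or records new-key."
  [^bytes content new-key]
  (let [sha (sha-256-hex content)]
    (get (swap! seen #(if (contains? % sha) % (assoc % sha new-key))) sha)))
```

Calling `register-upload!` twice with the same bytes returns the first key both times, so the caller can skip the second upload.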

cch1 2025-04-22T01:33:43.044349Z

Is it possible within a query to resolve a lookup ref to the db/id of the entity, and to use that resolved db/id scalar value elsewhere in the query?

cch1 2025-04-22T12:31:19.695499Z

Unfortunately, I have no referring attribute to these entities, so there is no way to unify on a V value.

2025-04-22T12:36:40.629609Z

a lookup ref is an attribute+value pair tho?

cch1 2025-04-22T12:54:45.233599Z

Correct. But that doesn't mean that the entity in question is also the target of some parent's ref attribute.

2025-04-22T12:55:34.980849Z

you can use it however you want. that was just an example.

2025-04-22T12:56:03.693379Z

[?e :lookup/attr ?lookup-ref] [?e :other/attr ?v] works

cch1 2025-04-22T12:59:54.558539Z

I don't think I'm explaining the problem adequately. You are presuming there is an entity P that has a ref attribute pointing at my entity C. That is not the case. Therefore, your technique, which relies on some :parent/attr, is unsuitable. FWIW, if there were such a :db.type/ref attr pointing at my entity C, then your technique would work by unifying a V with an E. To perhaps clarify, imagine my DB contains only one schema attribute and it is a :db/unique identity attribute. With a lookup ref in hand, how can I extract the :db/id of the entity within the context of the query? Getting the value outside the query is easy: just pull :db/id on the entity. Inside the query is a much different problem.

cch1 2025-04-22T13:03:45.819369Z

(BTW, this problem occurs in the context of querying across two "databases", where the second database has been contrived to hold the E (:db/id) values but of course there are no cross-db refs.)

2025-04-22T13:09:37.980129Z

ok I see the problem

👍 1
2025-04-22T13:09:59.646369Z

(d/q '[:find ?e
         :in $ ?v
         :where
         [?e :lookup/attr ?v]]
       db
       "id")

2025-04-22T13:10:52.040589Z

(d/q '[:find ?e
         :in $ ?id
         :where
         [?id :some/attr ?o]
         [?e :some/attr ?o]]
       db
       [:lookup/attr "id"])

2025-04-22T13:11:09.909859Z

both of those appear to work (though the latter does feel like a hack, and I'm not certain it's the best way)

cch1 2025-04-22T13:12:12.992849Z

Actually, more like

(d/q '[:find ?e2
         :in $ $p ?e
         :where
         [?e]
         [$p ?e2 :some/attr ?e]]
       db, db2
       [:unique/attr "unique-value"])

2025-04-22T13:13:27.197799Z

yeah the second thing I wrote there has problems. don't do that^

cch1 2025-04-22T13:13:35.538719Z

I think I did not clearly specify an additional requirement: the entity identifier that is a lookup ref might be a db id or a db ident ... IOW, no cheating and knowing it's a lookup ref.

2025-04-22T13:14:54.012809Z

huh, well ya got me for now lol. I gotta think about other stuff.

2025-04-22T13:15:03.387339Z

gl!

cch1 2025-04-22T13:15:03.464919Z

OK, thanks for the input.

2025-04-22T13:15:33.994969Z

I bet @favila would be able to answer this though.

2025-04-22T13:15:53.926409Z

the resident datalog wizard

cch1 2025-04-22T13:15:58.918109Z

I have a nagging feeling he did answer this question for me once before, but in a different context.

2025-04-22T13:16:17.238249Z

probable, given he's the wizard

favila 2025-04-23T20:33:53.091089Z

(d/q '[:find ?e2
       :in % $ $p ?entity-ref
       :where
       ($ resolve-entity-ref ?entity-ref ?e)
       [$p ?e2 :some/attr ?e]]

     '[[(resolve-entity-ref [?entity-ref] ?e)
        [(int? ?entity-ref)]
        [(identity ?entity-ref) ?e]]

       [(resolve-entity-ref [?entity-ref] ?e)
        [(keyword? ?entity-ref)]
        [?e :db/ident ?entity-ref]]

       [(resolve-entity-ref [?entity-ref] ?e)
        [(vector? ?entity-ref)]
        [(untuple ?entity-ref) [?attr ?v]]
        [?e ?attr ?v]]]
     db, db2
     [:unique/attr "unique-value"])

favila 2025-04-23T20:34:00.147579Z

There may be bugs that prevent this from working

favila 2025-04-23T20:34:42.739819Z

but essentially, reimplement datomic.api/entid as a rule (since this is cloud--in pro just call it)
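
The three branches of that rule (entity id, ident keyword, lookup ref) can be illustrated outside Datomic. This is a plain-Clojure sketch over a hypothetical vector of `[e a v]` datoms; it is not `datomic.api/entid` and the sample data is made up.

```clojure
;; A tiny in-memory datom set [e a v]; a stand-in, not a Datomic database.
(def datoms
  [[10 :db/ident    :color/red]
   [42 :unique/attr "unique-value"]])

(defn find-entity
  "First entity id that has attribute `attr` with value `value`, else nil."
  [datoms attr value]
  (some (fn [[e a v]] (when (and (= a attr) (= v value)) e)) datoms))

(defn resolve-entity-ref
  "Returns the entity id for an entity id, an ident keyword, or a lookup ref."
  [datoms entity-ref]
  (cond
    (int? entity-ref)     entity-ref
    (keyword? entity-ref) (find-entity datoms :db/ident entity-ref)
    (vector? entity-ref)  (let [[attr value] entity-ref]
                            (find-entity datoms attr value))))
```

For example, `(resolve-entity-ref datoms [:unique/attr "unique-value"])` and `(resolve-entity-ref datoms 42)` both yield the same entity id, which is the property the query needs when it cannot know in advance which form it was given.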

cch1 2025-04-23T21:08:26.222639Z

Exactly... in pro this is easy. But your solution seems like a good alternative for us cloud users.

favila 2025-04-23T21:09:35.150009Z

Note ident resolution is not exactly the same as what pro does

cch1 2025-04-23T21:10:02.416519Z

Because of negative values for tempids?

favila 2025-04-23T21:10:36.343519Z

no, because entid remembers "old" ident values and resolves them; this won't

cch1 2025-04-23T21:12:08.009849Z

I'm going to try to wrap my head around such a change....

favila 2025-04-23T21:12:34.731229Z

The docs discuss this in the context of changing an attribute name, but it is a facility of ident resolution, not of attributes per se.

cch1 2025-04-23T21:13:47.719579Z

Ah... so an old ident will resolve to the new entid (in pro) but the rules approach won't see that?

favila 2025-04-23T21:14:01.015069Z

right

👍 1
favila 2025-04-23T21:16:45.418409Z

[?e :db/ident ?ident]
no special handling of ident: matches according to datoms in the datasource
(datomic.api/entid db ident)
Resolves to most recent entity that asserted a :db/ident value of ident , regardless of db filtering (history, as-of, etc) or whether any entity asserts it now.

cch1 2025-04-23T21:18:47.248919Z

...thus illustrating why the docs recommend not re-purposing an ident... filtered queries will not see previous holders of the ident.

2025-04-22T02:08:51.770919Z

either one of these should work:

(d/q '[:find ?p
       :in $ ?lookup-ref
       :where
       [?e :lookup/attr ?lookup-ref]
       [?p :parent/attr ?e]]
     db
     "id")


(d/q '[:find ?p
       :in $ ?e
       :where
       [?p :parent/attr ?e]]
     db
     [:lookup/attr "id"])

2025-04-22T02:09:33.133489Z

either use the value of the ref and add a query clause to resolve the entity, or use the lookup ref tuple as if it were an id.