> "Datomic's current implementations do not have value types suitable for storing large documents, images, audio, or video. It is common practice in Datomic to store such data in a key/value store such as S3 and then store pointers to that data in Datomic." ( https://docs.datomic.com/datomic-overview.html ) I wonder if/how such large documents are involved in Datomic transactions. How do the pointers get stored, using transactions? What do the pointers consist of?
simplest thing is a uuid
representing an object in S3
The binary file is stored in S3 with an associated UUID and that UUID is what is transacted instead of the actual file ?
The binary file itself is not part of the transaction ?
Correct
On both questions ?
put the object to s3, include the uuid with the db transaction
👍
At what point is the binary file uploaded to S3 - before or after the transaction?
Depends on the use case. I do it before the transaction. The advantage is that the implementation is simple and your keys (pointers) in the db are always valid. The downside is that you may get some orphaned data in S3.
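A minimal sketch of the upload-first ordering described above. `put-object!` and `transact!` are hypothetical callbacks supplied by the caller (for example, thin wrappers around an S3 client and a Datomic `transact` call), and `:document/id` is an assumed schema attribute, not something named in this thread.

```clojure
(import 'java.util.UUID)

(defn store-document!
  "Uploads `content` under a fresh UUID key, then transacts a pointer to it.
  `put-object!` and `transact!` are hypothetical, caller-supplied functions."
  [put-object! transact! content]
  (let [doc-id (UUID/randomUUID)]
    ;; Step 1: upload first. If this throws, nothing reaches the database.
    (put-object! (str doc-id) content)
    ;; Step 2: transact only the pointer. If this throws, the S3 object is
    ;; orphaned, but the database never holds a dangling pointer.
    (transact! [{:document/id doc-id}])
    doc-id))
```

The binary content never appears in the transaction data; only the UUID does.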
Asynchronously, in parallel with the transaction ?
I wouldn’t do it in parallel unless you have very good reasons to do so. I can only think of some very tight time budget to persist data, where you cannot wait a few ms to do it sequentially.
ms applies to smaller binary files I guess - what about a 1 GB file?
The ms refers to persisting the transaction in Datomic.
I do it in two steps... first transact a db entity to represent the document, then store the doc in S3 (with metadata linking to the DB entity) and, on the S3 async callback, store the s3 coordinates (bucket + key).
Does every other transaction need to wait for your binary file upload to finish ?
More complicated, but if something goes wrong with the second transaction, at least the s3 object carries a ref to the db entity.
Because I update the db entity on the async callback, the other work on the thread continues normally.
It's a kind of saga pattern, I suppose.
Do you send a UUID to S3 as a temp ID or the permanent ID?
I send a UUID. Not sure what you mean by the "permanent ID" - certainly not the :db/id, because of the difficulty of preserving it across decants.
Datomic sets the UUID, not S3 ?
(i'm a Datomic outsider...)
1. My transaction data includes a client-generated UUID and the document entity has a schema attr for it. 2. Then, I store the S3 document with that same UUID as the key name. Asynchronously. 3. On the async callback, I update the same db entity to store S3-sourced metadata, which confirms storage.
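The three steps above could be sketched roughly as follows. This is an illustration under assumptions, not the poster's actual code: `transact!` and `put-object-async!` are hypothetical caller-supplied functions, the `:document/*` attributes are invented names, and `:document/id` is assumed to be a `:db.unique/identity` attribute so that the second transaction upserts onto the same entity.

```clojure
(import 'java.util.UUID)

(defn store-document-saga!
  "Transact first, upload asynchronously, then confirm storage in a second
  transaction from the upload callback. All collaborators are hypothetical."
  [transact! put-object-async! content]
  (let [doc-id (UUID/randomUUID)]
    ;; 1. Create the document entity with a client-generated UUID.
    (transact! [{:document/id doc-id}])
    ;; 2. Upload under the same UUID as the key name. The caller's thread
    ;;    does not wait for the upload when the put is truly asynchronous.
    (put-object-async!
      (str doc-id)
      content
      ;; 3. On completion, record the S3 coordinates on the same entity.
      (fn on-stored [{:keys [bucket object-key]}]
        (transact! [{:document/id        doc-id
                     :document/s3-bucket bucket
                     :document/s3-key    object-key}])))
    doc-id))
```

An entity that has a `:document/id` but no S3 coordinates is then one whose upload has not been confirmed yet.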
How do you do it @favila?
We’re doing the same 3 steps as above: generate a unique client-side UUID for the S3 document, put it to S3 and then transact. If the transact fails for any reason we can do a manual rollback of the S3 document, or simply let it live there as it's nominally cheap
we’re doing this almost 500k times per day, and Datomic has good handling of the UUID side if you use squuids
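For readers unfamiliar with squuids (semi-sequential UUIDs): the idea is a random UUID whose leading bits carry a timestamp, so that values generated close together in time sort close together in the index. The peer library provides `datomic.api/squuid`; the sketch below is only an approximation of the idea for illustration, not Datomic's source, and `squuid-at` is a hypothetical helper name.

```clojure
(import 'java.util.UUID)

(defn squuid-at
  "Returns a random UUID whose top 32 bits are replaced by `millis`
  expressed in whole seconds since the epoch."
  [millis]
  (let [uuid      (UUID/randomUUID)
        secs      (quot millis 1000)
        msb       (.getMostSignificantBits uuid)
        lsb       (.getLeastSignificantBits uuid)
        ;; keep the low 32 bits of the random msb, put the seconds on top
        timed-msb (bit-or (bit-shift-left secs 32)
                          (bit-and 0x00000000ffffffff msb))]
    (UUID. timed-msb lsb)))

(defn squuid
  "Semi-sequential UUID for the current time."
  []
  (squuid-at (System/currentTimeMillis)))
```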
it all depends on the application semantics
some use-cases it's ok to upload after db transaction, some it's not
@ghadi Do you have two obvious examples of the two options (upload before/after) ?
ah you’re right, we’re doing upload to s3 optimistically first, then transact the “link” to the document subsequently
A first transaction to create the document entity, then upload to S3, and then a second transaction ?
we’re performing the upload first before any transaction occurs. Once the upload is complete we transact it into Datomic. This made sense for our application, especially since this particular transaction is unlikely to have any conflicts
If the document uploaded to S3 is considered a component and owned by some other entity, do you delete the S3 document if its parent is deleted ?
Doing side-effect things in the transactor is not something I would undertake lightly. My approach for this kind of thing is to use something like a transactional outbox that is monitored by a batch job that uses CAS to "claim" the job, do the work and then delete the job. You could use an external work queue as well, but with the additional risk that queuing up the work could fail.
Whatever approach you take to have near-transactional removal of the document, make sure you don't bog down your transactor.
Both uploading to and removing from S3 is outside transaction though?
With some work, you could do it from within a transaction function. But a transactional outbox approach where the actual blocking IO is performed asynchronously from the create/delete transaction is a safer approach.
If you go with the outbox approach, do you leave the owned file as an orphan, or do you remove it?
Datomic preserves history, including these s3 file names. That suggests that just like datomic the default position should be not to delete the file
With the outbox approach, when the transaction to delete the document entity completes, it also creates an outbox message entity. A batch job picks up such jobs (with CAS), deletes the S3 document and removes the job.
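As a rough sketch of what that transaction data might look like: `:db/retractEntity` and `:db/cas` are Datomic's built-in transaction functions, while every `:outbox/*` attribute and status keyword below is a made-up name for illustration (with `:outbox/status` assumed to be a keyword-valued attribute).

```clojure
(defn delete-document-tx
  "Tx data that retracts the document and, in the same transaction,
  enqueues an outbox job to delete its S3 object."
  [doc-eid s3-bucket s3-key]
  [[:db/retractEntity doc-eid]
   {:db/id            "job"   ; tempid for the new outbox entity
    :outbox/type      :outbox.type/delete-s3-object
    :outbox/s3-bucket s3-bucket
    :outbox/s3-key    s3-key
    :outbox/status    :outbox.status/pending}])

(defn claim-job-tx
  "Tx data a worker submits to claim a job. The CAS fails, and the whole
  transaction is rejected, if another worker already changed the status."
  [job-eid]
  [[:db/cas job-eid :outbox/status
    :outbox.status/pending :outbox.status/claimed]])

(defn finish-job-tx
  "Tx data that removes the job once the S3 delete has succeeded."
  [job-eid]
  [[:db/retractEntity job-eid]])
```

The blocking S3 call happens in the worker between the claim and finish transactions, so it never runs inside the transactor.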
The file is not orphaned if datomic still has a reference to it
We use this transactional outbox for lots of things, not just deleting S3 docs: firing off webhooks, uploading reports of events, etc. Since keeping S3 docs around is cheap, @favila's approach of simply leaving it around seems reasonable (as long as you don't have contention for names in a bucket... you don't, right?!)
Considering S3 files as append-only seems like a good starting point.
It's certainly the easiest. In our case, we already had the transactional outbox pattern in place so it was easy to add a message to remove the S3 file. Without that infrastructure, I would leave the file alone.
Do you allow the same file to be uploaded to S3 more than once?
No. We enforce uniqueness on the SHA of the doc contents.
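One way this could look, as a sketch under assumptions: the thread does not say which SHA variant is used, so SHA-256 is assumed here, and `:document/sha256` is a hypothetical attribute name. Declaring it `:db.unique/value` makes Datomic reject a transaction that asserts an already-used hash on a different entity.

```clojure
(import 'java.security.MessageDigest)

(defn sha256-hex
  "Lowercase hex SHA-256 digest of a byte array."
  [^bytes content]
  (let [digest (.digest (MessageDigest/getInstance "SHA-256") content)]
    (apply str (map #(format "%02x" %) digest))))

;; Hypothetical schema: a unique-value attribute holding the content hash.
(def document-sha-schema
  [{:db/ident       :document/sha256
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one
    :db/unique      :db.unique/value}])
```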
Is it possible within a query to resolve a lookup ref to the db/id of the entity, and to use that resolved db/id scalar value elsewhere in the query?
Unfortunately, I have no referring attribute to these entities, so there is no way to unify on a V value.
a lookup ref is an attribute+value pair tho?
Correct. But that doesn't mean that the entity in question is also the target of some parent's ref attribute.
you can use it however you want. that was just an example.
[?e :lookup/attr ?lookup-ref] [?e :other/attr ?v] works
I don't think I'm explaining the problem adequately. You are presuming there is an entity P that has a ref attribute pointing at my entity C. That is not the case. Therefore, your technique, which relies on some :parent/attr, is unsuitable. FWIW, if there were such a :db.type/ref attr pointing at my entity C then your technique does work by unifying a V with an E.
To perhaps clarify, imagine my DB contains only one schema attribute and it is a :db/unique identity attribute. With lookup ref in hand, how can I extract the :db/id of an entity within the context of the query? Getting the value outside the query is easy - just pull :db/id on the entity. Inside the query is a much different problem.
(BTW, this problem occurs in the context of querying across two "databases", where the second database has been contrived to hold the E (:db/id) values, but of course there are no cross-db refs.)
ok I see the problem
(d/q '[:find ?e
       :in $ ?v
       :where
       [?e :lookup/attr ?v]]
     db
     "id")
(d/q '[:find ?e
       :in $ ?id
       :where
       [?id :some/attr ?o]
       [?e :some/attr ?o]]
     db
     [:lookup/attr "id"])
both of those appear to work (though the latter does feel like a hack, and I'm not certain it's the best way)
Actually, more like
(d/q '[:find ?e2
       :in $ $p ?e
       :where
       [?e]
       [$p ?e2 :some/attr ?e]]
     db, db2
     [:unique/attr "unique-value"])
yeah the second thing I wrote there has problems. don't do that^
I think I did not clearly specify an additional requirement: the entity identifier that is a lookup ref might be a db id or a db ident ... IOW, no cheating and knowing it's a lookup ref.
huh, well ya got me for now lol. I gotta think about other stuff.
gl!
OK, thanks for the input.
I bet @favila would be able to answer this though.
the resident datalog wizard
I have a nagging feeling he did answer this question for me once before, but in a different context.
probable, given he's the wizard
(d/q '[:find ?e2
       :in % $ $p ?entity-ref
       :where
       ($ resolve-entity-ref ?entity-ref ?e)
       [$p ?e2 :some/attr ?e]]
     ;; rules: one branch per kind of entity identifier
     '[[(resolve-entity-ref [?entity-ref] ?e)
        ;; already an entity id
        [(int? ?entity-ref)]
        [(identity ?entity-ref) ?e]]
       [(resolve-entity-ref [?entity-ref] ?e)
        ;; a :db/ident keyword
        [(keyword? ?entity-ref)]
        [?e :db/ident ?entity-ref]]
       [(resolve-entity-ref [?entity-ref] ?e)
        ;; a lookup ref [attr value]
        [(vector? ?entity-ref)]
        [(untuple ?entity-ref) [?attr ?v]]
        [?e ?attr ?v]]]
     db, db2
     [:unique/attr "unique-value"])
There may be bugs that prevent this from working
but essentially, reimplement datomic.api/entid as a rule (since this is cloud--in pro just call it)
Exactly... in pro this is easy. But your solution seems like a good alternative for us cloud users.
Note ident resolution is not exactly the same as what pro does
Because of negative values for tempids?
no, because entid remembers "old" ident values and resolves them; this won't
I'm going to try to wrap my head around such a change....
https://docs.datomic.com/schema/schema-change.html#changing-db-ident
The docs are in the context of changing an attribute name, but this is a facility of ident resolution, not of attributes per se
Ah... so an old ident will resolve to the new entid (in pro) but the rules approach won't see that?
right
[?e :db/ident ?ident]
no special handling of ident: matches according to datoms in the datasource
(datomic.api/entid db ident)
Resolves to the most recent entity that asserted a :db/ident value of ident, regardless of db filtering (history, as-of, etc) or whether any entity asserts it now.
...thus illustrating why the docs recommend not re-purposing an ident... filtered queries will not see previous holders of the ident.
either one of these should work:
(d/q '[:find ?p
       :in $ ?lookup-ref
       :where
       [?e :lookup/attr ?lookup-ref]
       [?p :parent/attr ?e]]
     db
     "id")
(d/q '[:find ?p
       :in $ ?e
       :where
       [?p :parent/attr ?e]]
     db
     [:lookup/attr "id"])
either use the value of the ref and add a query clause to resolve the entity, or use the lookup ref tuple as if it were an id.