2020-04-16
Channels
- # announcements (1)
- # babashka (23)
- # beginners (157)
- # boot (3)
- # calva (2)
- # chlorine-clover (12)
- # cider (14)
- # clara (5)
- # clj-kondo (6)
- # cljs-dev (61)
- # cljsrn (30)
- # clojure (65)
- # clojure-argentina (8)
- # clojure-berlin (2)
- # clojure-europe (13)
- # clojure-france (9)
- # clojure-germany (2)
- # clojure-italy (4)
- # clojure-nl (6)
- # clojure-portugal (2)
- # clojure-romania (2)
- # clojure-uk (76)
- # clojurescript (56)
- # conjure (52)
- # core-async (37)
- # datomic (209)
- # duct (17)
- # emacs (17)
- # exercism (1)
- # fulcro (26)
- # graalvm (5)
- # instaparse (2)
- # jackdaw (9)
- # jobs-discuss (27)
- # joker (2)
- # juxt (23)
- # leiningen (4)
- # malli (11)
- # midje (3)
- # pedestal (2)
- # quil (2)
- # re-frame (78)
- # reagent (8)
- # reitit (18)
- # remote-jobs (1)
- # ring (2)
- # ring-swagger (1)
- # shadow-cljs (29)
- # sql (11)
- # test-check (12)
- # tools-deps (5)
- # xtdb (16)
- # yada (4)
Hey, does anybody know what to do if datomic cloud returns "Busy indexing" constantly on write attempts (solo env)?
Hi all, thinking about crux/datomic as a solution for my company - is it possible to invalidate a bunch of datoms at once? (Say loading a batch from an external source daily)
if you mean “retract”, you can retract in large batches; but maybe you mean “mark the transactions as invalidated” with transaction metadata? or reify each import job, link imported entities to that, and mark that import entity somehow?
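e.g. a rough sketch of retracting a whole batch in bulk, assuming a hypothetical :entity/batch attribute on every imported entity (peer API; a big batch would need to be split across several transactions):
(let [batch-eids (d/q '[:find [?e ...]
                        :in $ ?batch
                        :where [?e :entity/batch ?batch]]
                      (d/db conn) 1)]
  ;; retract each entity from batch 1 (use :db.fn/retractEntity on older on-prem versions)
  @(d/transact conn (mapv (fn [e] [:db/retractEntity e]) batch-eids)))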
for example i want to input this json:
[{age: 40, name: "john", children: ["joe"]}, ...]
get some id for this batch (e.g. "1")
then tomorrow i want to add a new batch with this info:
[{age: 40, name: "john", children: []}, ...]
and invalidate batch "1"
and for the query "who are john's children" get an empty response
I think I need more information on how you are planning to encode and use this information. datomic’s unit of information (the datom) is quite granular, so it doesn’t make sense to “invalidate” them at a document level
as for document store or facts store - this is all the same to me, I'm eventually modeling a big graph, which both solutions do afaiu. I do however want to be able to take a subgraph and remove it. what I had in mind is that all datoms associated with a batch will have this fact written somewhere, and then use this association to find each of them and retract them, but this sounds like a heavy load on the db (?)
it’s just odd. it makes more sense (to me) to only assert/retract the delta between batches; or alternatively to reify each batch’s data separately so they are present simultaneously alongside each other
either makes more sense to me than retracting everything from a batch and reasserting a new batch
I highlight entity because you can’t make use of unique ids that don’t have batch info in them somehow
the “compute the delta” scenario is to make transaction entities themselves the batch marker
however there’s a lot of assumptions here: batches only replace each other; batches have an order that correspond to tx order; you never backfill batches; you compute the delta of adds/retracts correctly when you add a new batch
so imagine i have my DB on datomic, and there's an external PLOP-style database I'm scraping every day
is that domain-time or audit?
my use case: I have a function that gets a batch # and a query, and it needs to return the answer as if that batch were the only one loaded into the datomic db
so, you will never ever be in a situation where somehow a day’s batch got skipped by accident, and now you need to put that day’s data in so someone can query as if it were there?
yeah, so, if all you are doing is dumping docs from mongo into another db that has better querying, crux may be a better fit
so what they said in the crux channel is that i can actually remove or invalidate a transaction
crux has bi-temporality (which datomic doesn't), but it gives up being a referentially-consistent graph and has a larger unit of truth (the document)
crux doesn’t have references, in your json example, you need to manually know how to make “children” values join to something else
IMO datomic is better as your “source of truth” primary db, and crux is better for dealing with “other people’s” messy data which you may not understand or have a full schema for
i will ask them about it - sounds important
i mean, i do want to be able to identify entities across my json objects
it's not so much other people's data, but more like a scrape of their data that i make, so i'm in control of everything ingested
but in crux if i load an object, which is essentially a lot of triples, does crux automatically assign an id to represent the entity?
{id: '123', age: 40}
then i add {name: "joe", id: '123'}
so i can say in datalog "get me the age and name of things that have id=123"
if you have refs to something other than documents, you have to figure something out yourself
crux will ingest any EDN and decompose it into triples for query purposes, so you can still do arbitrary joins
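a minimal sketch of what that looks like, assuming an already-started crux node bound to node (the document and attribute names are made up):
(require '[crux.api :as crux])

;; put a document; crux decomposes it into triples so every attribute is queryable
(crux/submit-tx node [[:crux.tx/put {:crux.db/id :person/john
                                     :name "john"
                                     :age 40
                                     :children [:person/joe]}]])
;; submit-tx is asynchronous, so a real loader would wait for the tx before querying

(crux/q (crux/db node)
        '{:find [name age]
          :where [[e :crux.db/id :person/john]
                  [e :name name]
                  [e :age age]]})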
honestly this property, though scary for a primary data store, is absolutely freeing for data ingestion
I don’t need to write a complex ETL pipeline before I can use other people’s document-shaped data (and most of it is document-shaped)
but I can always faithfully retain what they said, and transform/normalize/clean-up before moving into a primary datastore that isn’t so sloppy
you can kind of do this by using transaction metadata, but you are subject to the limitations on transactions
other systems like rdf (which datomic is heavily inspired by) have open world assumptions and need complicated reification schemes to use datoms themselves as the subject or object of a predicate
crux takes a different approach by just letting you join on anything you want and working hard to make it fast
I think it’s best suited to cases where the provenance of the data you put into it is not yourself
ideally i would just want to treat transactions as entities themselves and associate them (e.g. with a batch #)
sure, but think through what the loading code would look like for crux vs datomic here
you can go over, it’s fine, but you shouldn’t have tens of thousands of datoms in a tx
you can do this with tuple refs, if each entity has a batch attribute and whatever their native id attribute is
but you have the same problem of needing to ingest the data in a topological-ish order so your refs work
i'm thinking of something maybe simpler - imagine that each datom (not tx) had its own id - i think it's the instant today (?) - then i could say datoms 1, 2 and 7 belong to batch #8, and i would like a higher-order datalog query that first chooses a subset of datoms, then runs the internal query
i mean - again computationally i don't see how you could do that generally, but if you had infinite cpu
{:entity/batch 7 :entity/id "foo" :entity/batch-id [7 "foo"]}
where :entity/batch-id is a tuple attr
you only have to start your query from there; the refs outward should all be references to batch-7 entities anyway
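the schema for that tuple would look roughly like this - a composite tuple, so datomic maintains :entity/batch-id for you (all three attribute names are the hypothetical ones above):
[{:db/ident       :entity/batch
  :db/valueType   :db.type/long
  :db/cardinality :db.cardinality/one}
 {:db/ident       :entity/id
  :db/valueType   :db.type/string
  :db/cardinality :db.cardinality/one}
 ;; composite tuple kept in sync with :entity/batch + :entity/id
 {:db/ident       :entity/batch-id
  :db/valueType   :db.type/tuple
  :db/tupleAttrs  [:entity/batch :entity/id]
  :db/cardinality :db.cardinality/one
  :db/unique      :db.unique/identity}]
and with :db.unique/identity on it, [:entity/batch-id [7 "foo"]] works as a lookup ref when transacting refs within a batch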
in the “transact deltas” approach, you can put the batch onto the tx metadata; then as-of time travel accomplishes the same thing
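roughly like this (peer API; :batch/number is a made-up attribute and batch-7-delta is whatever your delta computation produces):
;; tag the delta transaction with its batch number via the reified tx entity
@(d/transact conn (conj batch-7-delta
                        {:db/id "datomic.tx" :batch/number 7}))

;; later: find the tx of batch 7 and query the db as of that point in time
(let [tx (d/q '[:find ?tx .
                :in $ ?n
                :where [?tx :batch/number ?n]]
              (d/db conn) 7)]
  (d/q '[:find ?child
         :where
         [?e :person/name "john"]
         [?e :person/children ?child]]
       (d/as-of (d/db conn) tx)))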
what if i use this:
> You can add additional attributes to a transaction entity to capture other useful information, such as the purpose of the transaction, the application that executed it, the provenance of the data it added, or the user who caused it to execute, or any other information that might be useful for auditing purposes.
My batch is many transactions, all labeled, and I then use this to retract the previous batch: https://stackoverflow.com/a/25389808/378594
When I want to query on a certain batch, I use the point in time where it was inserted (that's actually the semantics I want - the state of the database beyond my batch at a certain point). Would that work?
If so, this is the same as our each-batch-available-simultaneously scenario discussed earlier, but with the additional unnecessary deletion step
If instead joe is the same entity across batches: when you retract old batches, are you carefully not retracting datoms which are still valid? If so, you aren’t deleting previous batches but transacting the delta between the current db and latest batch.
If you are deleting everything from a batch, this is both not what you want and unnecessary, as you are just replicating the d/since feature
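i.e. a db that only "sees" what came after the tx that loaded the previous batch is just (peer API, batch-1-tx being that tx's id):
(d/since (d/db conn) batch-1-tx)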
Maybe what you are missing is that “reasserting” a datom with a new batch doesn’t add new datoms—the previous datom is kept (it’s still valid!) so it will always have the tx of the first batch where it became true, not the last batch
This was a very interesting conversation!
I'm also ingesting data regularly from a MySQL database and face similar problems to the ones you discussed.
However, is it necessary to persist many earlier batches?
Do the batches reference any other data, which doesn't change over time?
I'm asking because maybe you don't want to put your batches into the same DB.
You can create a new DB for every day maybe.
Alternatively, you can also just keep some of the daily snapshots in memory and, instead of persisting them with d/transact, you can use d/with to virtually combine your batch-of-the-day onto the rest of the data in some kind of base Datomic DB.
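a rough sketch of that with the peer API (todays-batch-tx-data and the attribute names are placeholders):
;; speculatively apply today's batch; nothing is written to storage
(let [db-of-the-day (:db-after (d/with (d/db conn) todays-batch-tx-data))]
  (d/q '[:find ?child
         :where
         [?e :person/name "john"]
         [?e :person/children ?child]]
       db-of-the-day))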
what do you think, @U011WV5VD0V?
very interesting. first of all @U09R86PA4 I see your point. I really do want to keep the same entity id. If my newly added edges never intersect with my base db then retracting everything would work, but this is dangerous and might not be true at some point in the future. @U086D6TBN yes, it would be preferable to keep this foreign info / copy in a separate place and compose the base db (at a certain instant) and a version of the foreign db ad hoc. In memory would work today but is not future proof (near-future...). This is a bit like namespacing I think, but with composition. So I guess these features don't exist yet?
How long would you need to keep older days snapshots? Based on how you described "invalidation" it sounded like you wouldn't need to access yesterday's import even today anymore.
I'm working with ~4 million entities, each with only 2-4 attributes. That takes me around 5 mins to import on an 8-core i9 with 80GB RAM. Not sure which of my java processes is my app and which is my transactor, but none of them consumes more than 16GB RAM
Also, I'm directly querying my data from MySQL with jdbc.next fully into memory and then transacting it from there
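for illustration, roughly like this, using next.jdbc (assuming that's the library meant; the datasource, table and attribute names are placeholders):
(require '[next.jdbc :as jdbc])

;; pull everything into memory, then transact it in chunks
(let [rows (jdbc/execute! mysql-ds ["select id, age, name from people"])]
  (doseq [chunk (partition-all 1000 rows)]
    @(d/transact conn
                 (mapv (fn [{:people/keys [id age name]}]
                         {:entity/id (str id) :entity/age age :entity/name name})
                       chunk))))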
I found that json parsing can have quite a serious performance impact, so it's better if you cut that step out of your data processing pipeline
(I'm not working with clojure so would need another component to handle this ad hoc transacting)
> I really do want to keep the same entity id. If my newly added edges never intersect with my base db then retracting everything would work, but this is dangerous and might not be true at some point in the future. @U011WV5VD0V No, it’s guaranteed not to work because it’s not just edges, it’s every datom. Eg batch 1 transacts [entity :doc-id “joe”] (an identifier not a ref/edge). Batch 2 attempts to transact the same—but since that fact already exists (by definition—it is an identifier) datomic does not add the datom and the tx of [entity :doc-id “joe”] is still a batch 1 tx. If you then delete all batch 1 datoms, you have removed the “joe” doc identifier. The only thing left in the db is whatever datoms were first asserted by batch 2
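you can check this directly by asking for the tx of that datom:
;; ?tx will still be the batch-1 transaction, even after batch 2 "reasserts" the same value
(d/q '[:find ?e ?tx
       :in $ ?v
       :where [?e :doc-id ?v ?tx]]
     (d/db conn) "joe")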
i'm not (yet) - i need a graph database with some versioning features and am evaluating different solutions
(d/q '[:find ?k ?v
       :in $ ?q
       :where
       [(.getClass ?q) ?c]
       [(.getClassLoader ?c) ?cl]
       [(.loadClass ?cl "java.lang.System") ?sys]
       [(.getDeclaredMethod ?sys "getProperties" nil) ?prop]
       [(.invoke ?prop nil nil) [[?k ?v]]]]
     db {})
Or just read-string with read-eval:
(d/q '[:find ?v
       :in $ ?form
       :where
       [(read-string ?form) ?v]]
     db "#=(java.lang.System/getProperties)")
I see the error
Execution error (ExceptionInfo) at datomic.client.api.async/ares (async.clj:58).
Only find-rel elements are allowed in client find-spec, see
when attempting to query a scalar value like
(client/q {:query '[:find ?uid .
                    :in $ ?eid
                    :where
                    [?eid :user/uuid ?uid]]
           :args [(client/db datomic-user-conn)
                  17592186045418]})
is there a way to query datomic client for single values?
is this just fundamentally not possible?
@ben.hammond (ffirst …
result ‘shape’ specifications in the :find clause do not affect the work done by the query
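i.e. keep the find-rel shape the client requires and unwrap the single value yourself:
(ffirst
 (client/q {:query '[:find ?uid
                     :in $ ?eid
                     :where [?eid :user/uuid ?uid]]
            :args [(client/db datomic-user-conn) 17592186045418]}))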
yeah, so I take that as a "not possible"
thanks
I guess I could reduce the chunksize to 1
but I don't think I care all that much
this advice is marked as 'on-prem', but I presume it is equally valid for cloud?
oh I like the look of that
“You should use the `:where` clauses to identify entities of interest, combined with a `pull` expression to navigate to attribute values for those entities. An example:”
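(not the docs' verbatim example, just a sketch in the same spirit, reusing the :user/uuid attribute from above):
(client/q {:query '[:find (pull ?eid [:user/uuid])
                    :in $ ?eid
                    :where [?eid :user/uuid]]
           :args [(client/db datomic-user-conn) 17592186045418]})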
thank you
I never noticed this before, but it seems like there isn't parity in find specs between cloud and on-prem. cloud: https://docs.datomic.com/cloud/query/query-data-reference.html#find-specs on-prem: https://docs.datomic.com/on-prem/query.html Does anything highlight other api differences?
Thanks. I'll have a look.