2020-04-16
Channels
- # announcements (1)
- # babashka (23)
- # beginners (157)
- # boot (3)
- # calva (2)
- # chlorine-clover (12)
- # cider (14)
- # clara (5)
- # clj-kondo (6)
- # cljs-dev (61)
- # cljsrn (30)
- # clojure (65)
- # clojure-argentina (8)
- # clojure-berlin (2)
- # clojure-europe (13)
- # clojure-france (9)
- # clojure-germany (2)
- # clojure-italy (4)
- # clojure-nl (6)
- # clojure-portugal (2)
- # clojure-romania (2)
- # clojure-uk (76)
- # clojurescript (56)
- # conjure (52)
- # core-async (37)
- # datomic (209)
- # duct (17)
- # emacs (17)
- # exercism (1)
- # fulcro (26)
- # graalvm (5)
- # instaparse (2)
- # jackdaw (9)
- # jobs-discuss (27)
- # joker (2)
- # juxt (23)
- # leiningen (4)
- # malli (11)
- # midje (3)
- # pedestal (2)
- # quil (2)
- # re-frame (78)
- # reagent (8)
- # reitit (18)
- # remote-jobs (1)
- # ring (2)
- # ring-swagger (1)
- # shadow-cljs (29)
- # sql (11)
- # test-check (12)
- # tools-deps (5)
- # xtdb (16)
- # yada (4)
Hey, does anybody know what to do if datomic cloud returns "Busy indexing" constantly on write attempts (solo env)?
Hi all, thinking about crux/datomic as a solution for my company - is it possible to invalidate a bunch of datoms at once? (Say loading a batch from an external source daily)
if you mean “retract”, you can retract in large batches; but maybe you mean “mark the transactions as invalidated” with transaction metadata? or reify each import job, link imported entities to that, and mark that import entity somehow?
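e.g. a rough sketch of retracting a whole batch in bulk, assuming a hypothetical :entity/batch attribute on every imported entity (peer API; a big batch would need to be split across several transactions):
(let [batch-eids (d/q '[:find [?e ...]
                        :in $ ?batch
                        :where [?e :entity/batch ?batch]]
                      (d/db conn) 1)]
  ;; retract each entity from batch 1 (use :db.fn/retractEntity on older on-prem versions)
  @(d/transact conn (mapv (fn [e] [:db/retractEntity e]) batch-eids)))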
for example i want to input this json:
[{age: 40, name: "john", children: ["joe"]}, ...]
get some id for this batch (e.g. "1")
then tomorrow i want to add a new batch with this info:
[{age: 40, name: "john", children: []}, ...]
and invalidate batch "1"
and for the query "who are john's children" get an empty response
I think I need more information on how you are planning to encode and use this information. datomic’s unit of information (the datom) is quite granular, so it doesn’t make sense to “invalidate” them at a document level
as for document store or facts store - this is all the same to me, I'm eventually modeling a big graph, which both solutions do afaiu. I do however want to be able to take a subgraph and remove it. what I had in mind is that all datoms associated with a batch will have this fact written somewhere, and then use this association to find each of them and retract them, but this sounds like a heavy load on the db (?)
it’s just odd. it makes more sense (to me) to only assert/retract the delta between batches; or alternatively to reify each batch’s data separately so they are present simultaneously alongside each other
either makes more sense to me than retracting everything from a batch and reasserting a new batch
I highlight entity because you can’t make use of unique ids that don’t have batch info in them somehow
the “compute the delta” scenario is to make transaction entities themselves the batch marker
however there’s a lot of assumptions here: batches only replace each other; batches have an order that correspond to tx order; you never backfill batches; you compute the delta of adds/retracts correctly when you add a new batch
so imagine i have my DB on datomic, and there's an external PLOP-style database I'm scraping every day
is that domain-time or audit?
my use case: I have a function that gets a batch # and a query, and it needs to return the answer as if that batch were the only one loaded into the datomic db
so, you will never ever be in a situation where somehow a day’s batch got skipped by accident, and now you need to put that day’s data in so someone can query as if it were there?
yeah, so, if all you are doing is dumping docs from mongo into another db that has better querying, crux may be a better fit
so what they said in the crux channel is that i can actually remove or invalidate a transaction
crux has bi-temporality (which datomic doesn't), but it gives up being a referentially-consistent graph and has a larger unit of truth (the document)
crux doesn’t have references, in your json example, you need to manually know how to make “children” values join to something else
IMO datomic is better as your “source of truth” primary db, and crux is better for dealing with “other people’s” messy data which you may not understand or have a full schema for
i will ask them about it - sounds important
i mean, i do want to be able to identify entities across my json objects
it's not so much other people's data, but more like a scrape of their data that i make, so i'm in control of everything ingested
but in crux if i load an object, which is essentially a lot of triples, does crux automatically assign an id to represent the entity?
{id: '123', age: 40}
then i add {name: "joe", id: '123'}
so i can say in datalog "get me the age and name of things that have id=123"
if you have refs to something other than documents, you have to figure something out yourself
crux will ingest any EDN and decompose it into triples for query purposes, so you can still do arbitrary joins
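a minimal sketch of what that looks like, assuming an already-started crux node bound to node (the document and attribute names are made up):
(require '[crux.api :as crux])

;; put a document; crux decomposes it into triples so every attribute is queryable
(crux/submit-tx node [[:crux.tx/put {:crux.db/id :person/john
                                     :name "john"
                                     :age 40
                                     :children [:person/joe]}]])
;; submit-tx is asynchronous, so a real loader would wait for the tx before querying

(crux/q (crux/db node)
        '{:find [name age]
          :where [[e :crux.db/id :person/john]
                  [e :name name]
                  [e :age age]]})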
honestly this property, though scary for a primary data store, is absolutely freeing for data ingestion
I don’t need to write a complex ETL pipeline before I can use other people’s document-shaped data (and most of it is document-shaped)
but I can always faithfully retain what they said, and transform/normalize/clean-up before moving into a primary datastore that isn’t so sloppy
you can kind of do this by using transaction metadata, but you are subject to the limitations on transactions
other systems like rdf (which datomic is heavily inspired by) have open world assumptions and need complicated reification schemes to use datoms themselves as the subject or object of a predicate
crux takes a different approach by just letting you join on anything you want and working hard to make it fast
I think it’s best suited to cases where the provenance of the data you put into it is not yourself
ideally i would just want to treat transactions as entities themselves and associate them (e.g. with a batch #)
sure, but think through what the loading code would look like for crux vs datomic here
you can go over, it’s fine, but you shouldn’t have tens of thousands of datoms in a tx
you can do this with tuple refs, if each entity has a batch attribute and whatever their native id attribute is
but you have the same problem of needing to ingest the data in a topological-ish order so your refs work
i'm thinking of something maybe simpler - imagine that each datom (not tx) had its own id - i think it's the instant today (?) - then i could say datoms 1, 2 and 7 belong to batch #8, and i would like a higher-order datalog query that first chooses a subset of datoms, then runs the internal query
i mean - again computationally i don't see how you could do that generally, but if you had infinite cpu
{:entity/batch 7 :entity/id "foo" :entity/batch-id [7 "foo"]}
where :entity/batch-id is a tuple attr
you only have to start your query from there; the refs outward should all be references to batch-7 entities anyway
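the schema for that tuple would look roughly like this - a composite tuple, so datomic maintains :entity/batch-id for you (all three attribute names are the hypothetical ones above):
[{:db/ident       :entity/batch
  :db/valueType   :db.type/long
  :db/cardinality :db.cardinality/one}
 {:db/ident       :entity/id
  :db/valueType   :db.type/string
  :db/cardinality :db.cardinality/one}
 ;; composite tuple kept in sync with :entity/batch + :entity/id
 {:db/ident       :entity/batch-id
  :db/valueType   :db.type/tuple
  :db/tupleAttrs  [:entity/batch :entity/id]
  :db/cardinality :db.cardinality/one
  :db/unique      :db.unique/identity}]
and with :db.unique/identity on it, [:entity/batch-id [7 "foo"]] works as a lookup ref when transacting refs within a batch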
in the “transact deltas” approach, you can put the batch onto the tx metadata; then as-of time travel accomplishes the same thing
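roughly like this (peer API; :batch/number is a made-up attribute and batch-7-delta is whatever your delta computation produces):
;; tag the delta transaction with its batch number via the reified tx entity
@(d/transact conn (conj batch-7-delta
                        {:db/id "datomic.tx" :batch/number 7}))

;; later: find the tx of batch 7 and query the db as of that point in time
(let [tx (d/q '[:find ?tx .
                :in $ ?n
                :where [?tx :batch/number ?n]]
              (d/db conn) 7)]
  (d/q '[:find ?child
         :where
         [?e :person/name "john"]
         [?e :person/children ?child]]
       (d/as-of (d/db conn) tx)))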
what if i use this:
> You can add additional attributes to a transaction entity to capture other useful information, such as the purpose of the transaction, the application that executed it, the provenance of the data it added, or the user who caused it to execute, or any other information that might be useful for auditing purposes.
My batch is many transactions, all labeled, and I then use this to retract the previous batch: https://stackoverflow.com/a/25389808/378594
When I want to query on a certain batch, I use the point in time where it was inserted (that's actually the semantics I want - the state of the database beyond my batch at a certain point). Would that work?
If so, this is the same as our each-batch-available-simultaneously scenario discussed earlier, but with the additional unnecessary deletion step
If instead joe is the same entity across batches: when you retract old batches, are you carefully not retracting datoms which are still valid? If so, you aren’t deleting previous batches but transacting the delta between the current db and latest batch.
If you are deleting everything from a batch, this is both not what you want and unnecessary, as you are just replicating the d/since feature
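i.e. a db that only "sees" what came after the tx that loaded the previous batch is just (peer API, batch-1-tx being that tx's id):
(d/since (d/db conn) batch-1-tx)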
Maybe what you are missing is that “reasserting” a datom with a new batch doesn’t add new datoms—the previous datom is kept (it’s still valid!) so it will always have the tx of the first batch where it became true, not the last batch
This was a very interesting conversation!
I'm also ingesting data regularly from a MySQL database and face similar problems to the ones you discussed.
However, is it necessary to persist many earlier batches?
Do the batches reference any other data, which doesn't change over time?
I'm asking because maybe you don't want to put your batches into the same DB.
You can create a new DB for every day maybe.
Alternatively, you can also just keep some of the daily snapshots in memory and, instead of persisting them with d/transact, you can use d/with to virtually combine your batch-of-the-day onto the rest of the data in some kind of base Datomic DB.
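a rough sketch of that with the peer API (todays-batch-tx-data and the attribute names are placeholders):
;; speculatively apply today's batch; nothing is written to storage
(let [db-of-the-day (:db-after (d/with (d/db conn) todays-batch-tx-data))]
  (d/q '[:find ?child
         :where
         [?e :person/name "john"]
         [?e :person/children ?child]]
       db-of-the-day))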
what do you think, @U011WV5VD0V?
very interesting. first of all @U09R86PA4 I see your point. I really do want to keep the same entity id. If my newly added edges never intersect with my base db then retracting everything would work, but this is dangerous and might not be true at some point in the future. @U086D6TBN yes, it would be preferable to keep this foreign info / copy in a separate place and compose the base db (at a certain instant) and a version of the foreign db ad hoc. In memory would work today but is not future proof (near-future...). This is a bit like namespacing I think, but with composition. So I guess these features don't exist yet?
How long would you need to keep older days snapshots? Based on how you described "invalidation" it sounded like you wouldn't need to access yesterday's import even today anymore.
I'm working with ~4 million entities, each with only 2-4 attributes. That takes me around 5 mins to import on an 8-core i9 with 80GB RAM. Not sure which of my java processes is my app and which is my transactor, but none of them consumes more than 16GB RAM
Also, I'm directly querying my data from MySQL with jdbc.next fully into memory and then transacting it from there
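for illustration, roughly like this, using next.jdbc (assuming that's the library meant; the datasource, table and attribute names are placeholders):
(require '[next.jdbc :as jdbc])

;; pull everything into memory, then transact it in chunks
(let [rows (jdbc/execute! mysql-ds ["select id, age, name from people"])]
  (doseq [chunk (partition-all 1000 rows)]
    @(d/transact conn
                 (mapv (fn [{:people/keys [id age name]}]
                         {:entity/id (str id) :entity/age age :entity/name name})
                       chunk))))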
I found that json parsing can have quite a serious performance impact, so it's better if you cut that step out of your data processing pipeline
(I'm not working with clojure so would need another component to handle this ad hoc transacting)
> I really do want to keep the same entity id. If my newly added edges never intersect with my base db then retracting everything would work, but this is dangerous and might not be true at some point in the future. @U011WV5VD0V No, it’s guaranteed not to work because it’s not just edges, it’s every datom. Eg batch 1 transacts [entity :doc-id “joe”] (an identifier not a ref/edge). Batch 2 attempts to transact the same—but since that fact already exists (by definition—it is an identifier) datomic does not add the datom and the tx of [entity :doc-id “joe”] is still a batch 1 tx. If you then delete all batch 1 datoms, you have removed the “joe” doc identifier. The only thing left in the db is whatever datoms were first asserted by batch 2
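you can check this directly by asking for the tx of that datom:
;; ?tx will still be the batch-1 transaction, even after batch 2 "reasserts" the same value
(d/q '[:find ?e ?tx
       :in $ ?v
       :where [?e :doc-id ?v ?tx]]
     (d/db conn) "joe")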
i'm not (yet) - i need a graph database with some versioning features and am evaluating different solutions
(d/q '[:find ?k ?v
       :in $ ?q
       :where
       [(.getClass ?q) ?c]
       [(.getClassLoader ?c) ?cl]
       [(.loadClass ?cl "java.lang.System") ?sys]
       [(.getDeclaredMethod ?sys "getProperties" nil) ?prop]
       [(.invoke ?prop nil nil) [[?k ?v]]]]
     db {})
Or just read-string with read-eval:
(d/q '[:find ?v
       :in $ ?form
       :where
       [(read-string ?form) ?v]]
     db "#=(java.lang.System/getProperties)")
I see the error
Execution error (ExceptionInfo) at datomic.client.api.async/ares (async.clj:58).
Only find-rel elements are allowed in client find-spec, see
when attempting to query a scalar value like
(client/q {:query '[:find ?uid .
                    :in $ ?eid
                    :where
                    [?eid :user/uuid ?uid]]
           :args [(client/db datomic-user-conn)
                  17592186045418]})
is there a way to query datomic client for single values?
is this just fundamentally not possible?
@ben.hammond (ffirst …
result ‘shape’ specifications in the :find clause do not affect the work done by the query
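i.e. keep the find-rel shape the client requires and unwrap the single value yourself:
(ffirst
 (client/q {:query '[:find ?uid
                     :in $ ?eid
                     :where [?eid :user/uuid ?uid]]
            :args [(client/db datomic-user-conn) 17592186045418]}))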
yeah, so I take that as a "not possible"
thanks
I guess I could reduce the chunksize to 1
but I don't think I care all that much
this advice is marked as 'on-prem', but I presume it is equally valid for cloud?
oh I like the look of that
“You should use the `:where` clauses to identify entities of interest, combined with a `pull` expression to navigate to attribute values for those entities. An example:”
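(not the docs' verbatim example, just a sketch in the same spirit, reusing the :user/uuid attribute from above):
(client/q {:query '[:find (pull ?eid [:user/uuid])
                    :in $ ?eid
                    :where [?eid :user/uuid]]
           :args [(client/db datomic-user-conn) 17592186045418]})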
thank you
I never noticed this before, but it seems like there isn't parity in find specs between cloud and on-prem. cloud: https://docs.datomic.com/cloud/query/query-data-reference.html#find-specs on-prem: https://docs.datomic.com/on-prem/query.html Does anything highlight other api differences?
Thanks. I'll have a look.