2017-06-26
# datomic
greetings! Again about the datoms limit: does it include history datoms as well, or is it just about the current db "snapshot" size?
it’s about how big the roots of the tree get, @misha, which means history too. eventually they’ll get so big that peer ram size can’t contain it + space for queries
thanks, Robert. Is there anything to read about working around this?
I have a 2-fold use case:
1. classic system of record, e.g. brands/food/nutrition info.
2. consumption log of the above
trying to assess how to deal with #2 while keeping it connected with #1 at the same time
@jaret or @marshall or @stuarthalloway may be able to direct you to some literature. all i have is anecdotes from here 🙂
Does that limit include all partitions within the same db? Or is it per partition? Or even per db within a "server" (transactor?)?
all partitions (partitions merely control overall sort order). if you have two 10bn datom databases, you’ll need twice the ram as with one 10bn datom database - in all peers, of which the transactor is one
because the peer is considered part of the database — i.e. it’s ‘inside’, unlike a client, which is ‘outside’
So if I'd want to keep the sys of record in one db, and the log in another to "save the datoms" – it would need to be 2 different transactors, not 2 dbs served by a single transactor, right?
yes — but if you have a peer that connects to both databases, it’ll need capacity for both
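A minimal sketch of that setup (peer API, hypothetical URIs and attributes): one peer process can connect to both databases and even join across them, but it must budget memory for both.

(require '[datomic.api :as d])

;; hypothetical URIs; both dbs count against this peer's RAM
(def records-conn (d/connect "datomic:dev://localhost:4334/records"))
(def log-conn     (d/connect "datomic:dev://localhost:4334/consumption-log"))

;; a cross-database join: pass each db value as its own source
(d/q '[:find ?name ?when
       :in $records $log
       :where [$log ?entry :log/food ?fid]
              [$log ?entry :log/consumed-at ?when]
              [$records ?f :food/id ?fid]
              [$records ?f :food/name ?name]]
     (d/db records-conn)
     (d/db log-conn))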
I am adding Datomic to one of our applications, but so far my :find + pull queries run in an average of 20ms, which is not exactly what I was expecting
The application has 7 CPUs, 8.5G RAM (that's for the whole app, not just for the peer, obviously)
We use Cassandra for storage but everything looks normal here: queries on the table for Datomic are fast
I have had a look at the queries to Datomic: a simple :find is around 1ms, but any pull on the result adds from 15ms to 20ms
I guess the :find manages to use one of the indexes, which is expected, but I guess the pull part does not
Last thing: metrics for the transactor show that queries hit the cache at quite a good rate, 75%+
@gax I started reading about Datomic yesterday so I can’t really help you, but in a talk I watched I heard Datomic attempts to cache data that is “close to your query”. The speaker mentioned “pull” as an example, and said relations marked as “components” would be fetched as well (iirc)
Caching and components are orthogonal concepts.
Just to get an idea, here is my query:
(defn find-model [db subject-type subject-id optimization]
(let [query '[:find ?e .
:in $ [?type ?id ?optim]
:where [?e :model/subject-id ?id]
[?e :model/subject-type ?type]
[?e :model/optimization ?optim]
eid (d/q query db [subject-type subject-id optimization])]
(when eid
(let [pull-res (d/pull db "[*]" eid)
entity (resolve-model-enums db pull-res)]
entity))]))
@danielstockton how so?
@gax is that a direct copy-paste? you don’t seem to have a close bracket on your query
@hmaurer It sounded like you were conflating the two ideas. Caching data 'that is close to your query' just means that whole segments are cached (which contain 1000s of datoms, possibly more than your query requires).
It's always on and shouldn't get in the way of performance.
why not do the pull in the :find ? Also, are you sure the part taking a while is the pull and not the query or the resolve-model-enums call?
@danielstockton Oh. I don’t know, I was just quoting (possibly misquoting) a talk which mentioned that Datomic tried its best to cache data that you might want to access after running your query, and iirc he mentioned components being part of that heuristic
@marshall yup: I timed the :find part separately from the pull, and the pull is really the culprit here
and I have also tried including the pull inside the :find. On my machine – yes, I know this is not perfect – it takes up to 30ms
I pull a "model" that has 6 attributes, but the 6th is a cardinality-many attribute that usually has 150 children with, say, 5 to 10 attributes each
I am currently implementing a version with one :find for the parent model and a second :find on the returned children ids
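For reference, a sketch of the "pull inside the :find" form under discussion, reusing the hypothetical :model/* attributes from the pasted query; the pull then runs as part of query execution instead of as a separate d/pull call.

(d/q '[:find (pull ?e [*]) .
       :in $ ?type ?id ?optim
       :where [?e :model/subject-id ?id]
              [?e :model/subject-type ?type]
              [?e :model/optimization ?optim]]
     db subject-type subject-id optimization)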
EAVT and AEVT contain all datoms; VAET is reference types only. But any query or pull is going to use an index
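Illustrative only: raw index access with d/datoms makes that coverage visible (the entity id and attribute here are hypothetical).

(seq (d/datoms db :eavt 17592186045418))    ; every datom of one entity
(seq (d/datoms db :aevt :model/subject-id)) ; every datom of one attribute
(seq (d/datoms db :vaet 17592186045418))    ; ref datoms only: who points at this entity?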
Question unrelated to the current discussion: is it possible to get multiple Datomic Pro “Starter” licenses? (for multiple systems)
@marshall do I understand correctly that the cache is actually tried before the indexes are?
@gax parts of the index are in the cache; the query engine knows where in the index ‘tree’ it needs to look. it first looks for those segments in the local cache, then memcached, then finally storage
@hmaurer Yes that is possible. Alternatively we have Enterprise licensing options that may make sense for use cases with multiple system requirements
@marshall Thanks for the quick reply! In my case we are considering using Datomic at my company (in which case we would get a Pro license), but I have a few non-profit projects on the side that could make use of Datomic but don’t have the budget for a 5k/year license
Gotcha. Yes, you can certainly get a Starter license for a non-profit side project. As far as multiple individual licenses, it might be best to have a call to discuss - you can shoot me an email and we can set something up (<mailto:[email protected]|[email protected]>)
@marshall related question: let’s say I want to write an infrastructure test which spins up a Datomic system, runs some tests against it, and tears it down. I assume I can use the same license as the “prod” system?
@marshall Also since you are around, I asked a question earlier about backups. I know Datomic has a utility to store backups incrementally to S3 or similar, but I was wondering if backing up the underlying storage would also work
The preferred solution seems to be, unsurprisingly, to use Datomic’s backup procedure
but I am nonetheless curious as to whether backing up the underlying storage would do the job
Last question, to which I also got an answer from a community member but not from a Datomic dev: is it “ok” performance-wise to do a lot of “asOf” / history / “since” queries at arbitrary points in time?
e.g. to provide users with a feature to see the state of a document at any point in the past
yep; depending on how ‘deep’ your history is they may or may not be more expensive than “current”, but generally the performance is quite good and lots of customers use it for exactly that purpose
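A minimal sketch of such point-in-time reads, assuming an existing conn plus a hypothetical doc-eid and :doc/title attribute:

;; the document exactly as it stood on some past date
(let [db-then (d/as-of (d/db conn) #inst "2017-01-01")]
  (d/pull db-then '[*] doc-eid))

;; or the full audit trail of one attribute, via the history db
(d/q '[:find ?title ?tx ?added
       :in $ ?e
       :where [?e :doc/title ?title ?tx ?added]]
     (d/history (d/db conn)) doc-eid)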
@marshall earlier you mentioned you would be curious to have a look at the cache metrics on the peer. Given what we said, would you still be looking towards the cache?
I came up with some code using a callback to send metrics from the peers, so I have some metrics, but unfortunately nothing about the cache – though I have this metric for the transactor.
@gax it might be somewhat illustrative, but those numbers indicate ~ 0.01msec per value retrieved
@marshall Thanks. Last but not least: would you recommend the Client API or the Peer API for a new application? From what I understand the Client API cannot do cross-database (or cross-points-in-time) joins, which seems like a big feature-loss, but I am not quite sure since I haven’t used it yet
@hmaurer Depends on your needs; Your system overall could use both, mixing and matching as necessary : http://docs.datomic.com/clients-and-peers.html
@marshall do you think performing a :find directly on the children would speed up the query?
it might; worth a test certainly. The other option to try would be to get all the children’s entity IDs directly in the query, then do a pull-many on them
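That suggestion sketched out, with :model/children standing in for the real (hypothetical) child attribute: one query collects the ids, then a single pull-many hydrates them.

(let [child-ids (d/q '[:find [?c ...]
                       :in $ ?parent
                       :where [?parent :model/children ?c]]
                     db parent-eid)]
  (d/pull-many db '[*] child-ids))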
Also: as said, I came up with some code to get metrics out of the peer, but it won't show this ObjectCache metric that looks veeeery interesting: is it normal that the peer won't show this metric? Should it?
@gax you mentioned a q-explain earlier. what did you mean by this?
@robert-stuttaford I was referring to this project : https://github.com/dwhjames/datomic-q-explain
neat, hadn’t seen that before, thanks!
@gax hard to assess whether it’s storage latency and/or whether memcached would help without some metrics (i.e. storageGetMsec numbers, cache numbers)
hmm. abandoned
@gax they wouldn’t provide info about the query/pull of interest. all that work happens on the peer
@marshall Hi! Another question… I read that it is highly recommended not to make “breaking” changes in the schema or change the semantics of an attribute. However, it seems you cannot completely exclude the possibility that a poor design decision was made in the past and all the facts of some type X in the database history need to be updated to match a new schema. In those rare cases, is it doable?
Roughly speaking this would mean traverse the whole log and make arbitrary edits to any transaction
Actually now that I think of it, this could be done by re-building a new database and copying everything over, setting “txInstant” manually to keep the timeline
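A rough sketch of that rebuild via the peer Log API; migrate-tx-data and tx-instant-of are hypothetical helpers you would write. Datomic accepts an explicit :db/txInstant assertion on the current transaction as long as each instant is later than every txInstant already in the target db, so the log must be replayed in order.

(doseq [{:keys [data]} (d/tx-range (d/log old-conn) nil nil)]
  @(d/transact new-conn
               ;; pin the new tx to the original wall-clock time
               (cons [:db/add "datomic.tx" :db/txInstant (tx-instant-of data)]
                     ;; hypothetical: rewrite the old datoms to the new schema
                     (migrate-tx-data data))))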
have you read this: http://blog.datomic.com/2017/01/the-ten-rules-of-schema-growth.html
(p.s. I read and understood http://blog.datomic.com/2017/01/the-ten-rules-of-schema-growth.html ; only talking about rare scenarios here)
a ‘less drastic’ option would be something like you suggest, with creating a new attribute (of a different type, say) and migrating the data over
Yet another question…: is it possible to follow the transaction log from a remote service? For example, to keep an Elasticsearch instance in sync
Ah, actually I guess this can be built on top of the Log API: http://docs.datomic.com/log.html
hidden in this post is the fact that we use the tx-report-queue to loosely couple our web services to our worker services. no need for a separate queue at all
everything just talks to / watches storage
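A minimal sketch of that pattern, with a hypothetical index-into-es! function; note tx-report-queue is an in-process peer API, so a genuinely remote service would poll the Log API (d/tx-range) instead.

(let [queue (d/tx-report-queue conn)]
  (future
    (loop []
      ;; blocks until the next transaction report arrives
      (let [{:keys [db-after tx-data]} (.take queue)]
        (index-into-es! db-after tx-data))
      (recur))))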
@robert-stuttaford this is awesome. Sounds like event sourcing without the pain of implementing an event-sourced system from scratch
that’s certainly how we use it
just the other day i had to find out why something went missing. turns out someone wrote an overzealous hand-written transaction and cut 4000ish important datoms from ‘now’. had a ‘revert’ transaction transacted in 10 minutes, via remote repl
immutability is the gift that keeps on giving. it’s actually astonishing how it’s such a given that we should all use source control, when source is actually mostly a liability. but most folks use a forget-by-default database for their data, which is undeniably an asset. no one talks about Big Source, after all 🙂
@robert-stuttaford Yes, immutability is (mostly) a blessing to work with. I can’t complain so far 🙂
@robert-stuttaford either forget-by-default, or try to implement a broken subset of immutability via log tables at great cost!
Is there something like :db.type/edn (a proposal, a workaround, future plans...)?
I have two key cases (storing a graph and querying it) and I don't know exactly how to handle them...
@souenzzo : use string + pr-str / clojure.edn/read-string. works just fine
Yeah, I'm planning on using this. But I wanted to know if there were more people with the same problem and if there is any expectation of having an edn-like type in Datomic
they've promised custom types from the beginning, and fressian is extensible enough to support it, but nothing has materialized
string or binary blob is how we handle it now, or for smaller types encode them into existing types somehow
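A sketch of that string workaround; :thing/edn is a hypothetical :db.type/string attribute.

(require '[clojure.edn :as edn])

;; write: serialize the value to an edn string
@(d/transact conn [{:db/id     "new-thing"
                    :thing/edn (pr-str {:graph {:a #{:b :c}}})}])

;; read: parse it back
(edn/read-string (:thing/edn (d/pull (d/db conn) '[:thing/edn] thing-eid)))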