#datomic
2017-06-26
misha10:06:05

greetings! Again about the datom limit: does it include history datoms as well, or is it just about the current db "snapshot" size?

robert-stuttaford10:06:43

it’s about how big the roots of the tree get, @misha, which means history too. eventually they’ll get so big that peer RAM can’t hold them and still have space for queries

misha10:06:41

thanks, Robert. Is there anything to read about working around this? I have a two-fold use case: 1. a classic system of record, e.g. brands/food/nutrition info; 2. a consumption log of the above. I'm trying to assess how to deal with 2 while keeping it connected to 1 at the same time.

robert-stuttaford10:06:36

@jaret or @marshall or @stuarthalloway may be able to direct you to some literature. all i have is anecdotes from here 🙂

misha10:06:08

Does that limit include all partitions within the same db? Or is it per partition? Or even per db within a "server" (transactor)?

robert-stuttaford10:06:23

all partitions (partitions merely control overall sort order). if you have two 10bn-datom databases, you’ll need twice the RAM as with one 10bn-datom database - in all peers, of which the transactor is one

robert-stuttaford10:06:50

because the peer is considered part of the database — i.e. it’s ‘inside’, unlike a client, which is ‘outside’

misha10:06:56

same for databases, right?

misha10:06:41

So if I wanted to keep the sys of record in one db, and the log in another to "save the datoms" – it would need to be 2 different transactors, not 2 dbs served by a single transactor, right?

robert-stuttaford10:06:37

yes — but if you have a peer that connects to both databases, it’ll need capacity for both
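
For illustration, a single peer can hold connections to both databases and even join across them in one query; a minimal sketch, where the URIs and attributes (e.g. :brand/sku) are hypothetical. Since entity ids are database-local, the cross-db join goes through a shared value:

(require '[datomic.api :as d])

;; hypothetical URIs: one db for the system of record, one for the log
(def record-conn (d/connect "datomic:dev://localhost:4334/system-of-record"))
(def log-conn    (d/connect "datomic:dev://localhost:4334/consumption-log"))

;; the peer must cache segments of BOTH databases to run this
(d/q '[:find ?name ?at
       :in $records $log
       :where
       [$log ?c :consumption/brand-sku ?sku]
       [$log ?c :consumption/at ?at]
       [$records ?b :brand/sku ?sku]
       [$records ?b :brand/name ?name]]
     (d/db record-conn)
     (d/db log-conn))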

misha10:06:17

oh, that's true harold

Galaux14:06:35

Hi everyone!

Galaux14:06:49

I am adding Datomic to one of our applications, but so far my :find + `pull` queries run in an average of 20ms, which is not exactly what I was expecting

Galaux14:06:04

The application has 7 CPUs and 8.5 GB RAM (that's for the whole app, not just for the peer, obviously)

Galaux14:06:18

We use Cassandra for storage, but everything looks normal there: queries on the table for Datomic are fast

Galaux14:06:42

We haven't configured Memcached, as storage does not seem to be the bottleneck

Galaux14:06:01

I have had a look at the queries to Datomic: a simple :find is around 1ms, but any pull on the result adds from 15ms to 20ms

Galaux14:06:38

I guess the :find manages to use one of the indexes, which is expected, but I guess the pull part does not

Galaux14:06:31

Last thing: metrics for the transactor show that queries hit the cache at quite a good rate, 75%+

Galaux14:06:49

(I just can't get cache metrics for the peer unfortunately)

hmaurer14:06:55

@gax I started reading about Datomic yesterday so I can’t really help you, but in a talk I watched I heard Datomic attempts to cache data that is “close to your query”. The speaker mentioned “pull” as an example, and said relations marked as “components” would be fetched as well (iirc)

hmaurer14:06:01

maybe something similar is happening?

Galaux14:06:17

Something similar to what exactly?

hmaurer14:06:44

Datomic over-fetching data

Galaux14:06:06

In my case I have only one "component", and this is precisely what I am looking for

danielstockton14:06:20

Caching and components are orthogonal concepts.

Galaux14:06:49

Just to get an idea, here is my query:

(defn find-model [db subject-type subject-id optimization]
  ;; assumes (require '[datomic.api :as d])
  (let [query '[:find ?e .
                :in $ [?type ?id ?optim]
                :where
                [?e :model/subject-id ?id]
                [?e :model/subject-type ?type]
                [?e :model/optimization ?optim]]
        ;; find-scalar (the trailing `.`) yields a single entity id or nil
        eid (d/q query db [subject-type subject-id optimization])]
    (when eid
      (let [pull-res (d/pull db "[*]" eid)
            entity (resolve-model-enums db pull-res)]
        entity))))

Galaux14:06:02

Pretty standard :find it seems…

Galaux14:06:23

The dataset must be quite small also

Galaux14:06:33

even though metrics say I have 20M datoms

Galaux14:06:32

(was a bit worried by this "[*]" as it sounds like a SELECT * … 🙂 )

marshall14:06:35

@gax is that a direct copy-paste? you don’t seem to have a close bracket on your query

Galaux14:06:55

oops… no: I edited it, but the query runs in PROD

Galaux14:06:08

must have deleted something…

danielstockton14:06:12

@hmaurer It sounded like you were conflating the two ideas. Caching data 'that is close to your query' just means that whole segments are cached (which contain 1000s of datoms, possibly more than your query requires).

Galaux14:06:28

(ah yes : read about the segments being cached)

danielstockton14:06:38

It's always on and shouldn't get in the way of performance.

marshall14:06:38

why not do the pull in the :find ? Also, are you sure the part taking a while is the pull and not the query or the resolve-model-enums call?
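
Folding the pull into the :find, as marshall suggests, might look like this sketch, reusing the attributes from the query above:

(d/q '[:find (pull ?e [*]) .
       :in $ [?type ?id ?optim]
       :where
       [?e :model/subject-id ?id]
       [?e :model/subject-type ?type]
       [?e :model/optimization ?optim]]
     db [subject-type subject-id optimization])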

hmaurer14:06:20

@danielstockton Oh. I don’t know, I was just quoting (possibly misquoting) a talk which mentioned that Datomic tried its best to cache data that you might want to access after running your query, and iirc he mentioned components being part of that heuristic

Galaux14:06:44

@marshall yup: I timed the :find part separately from the pull, and the pull is really the culprit here

Galaux14:06:30

and I have also tried including the pull inside the :find. On my machine – yes, I know this is not perfect – it takes up to 30ms

marshall14:06:35

i would want to look at the cache metrics on the peer

marshall14:06:47

how many attributes are you pulling?

Galaux14:06:52

I pull a "model" that has 6 attributes, but the 6th is cardinality-many and usually has 150 children with, say, 5 to 10 attributes each

marshall14:06:25

so you’re pulling 1500 values

Galaux14:06:26

The children are components

marshall14:06:35

normally a “simple” pull is very very fast

marshall14:06:50

but having pull+component entities will take a bit longer

marshall14:06:04

since it will have to traverse those links and pull their attributes

Galaux14:06:28

so I thought maybe I could directly :find the children

Galaux15:06:17

I am currently implementing a version with a :find for the parent model and a second :find on the returned children ids

Galaux15:06:29

I expect this to use indexes

marshall15:06:35

everything in Datomic uses indexes

Galaux15:06:09

Well… not exactly *everything*, if I understand correctly…?

Galaux15:06:17

For instance, the AVET index only covers datoms whose attributes are marked :db/index or :db/unique

marshall15:06:37

EAVT and AEVT contain all datoms; VAET is reference types only. But any query or pull is going to use an index

Galaux15:06:16

(was about to use q-explain to check that)

marshall15:06:22

in a real sense, Datomic is a set of indexes

marshall15:06:35

there’s no way to get something out of it other than to use an index
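
This can be seen directly with the datoms API; a rough sketch, where some-eid, some-id, and some-ref-target-eid are placeholders and the attribute names come from the query above:

(require '[datomic.api :as d])

;; EAVT: every datom about one entity
(seq (d/datoms db :eavt some-eid))

;; AEVT: every value of one attribute, across entities
(seq (d/datoms db :aevt :model/subject-type))

;; AVET: value-sorted; only populated for :db/index or :db/unique attributes
(seq (d/datoms db :avet :model/subject-id some-id))

;; VAET: reverse lookup, :db.type/ref attributes only
(seq (d/datoms db :vaet some-ref-target-eid))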

Galaux15:06:48

@marshall what is the cache usefull for then?

marshall15:06:59

index segments are immutable

marshall15:06:26

so every ‘chunk’ of the index is a value that can be cached

hmaurer15:06:03

Question unrelated to the current discussion: is it possible to get multiple Datomic Pro “Starter” licenses? (for multiple systems)

Galaux15:06:03

@marshall do I understand correctly that the cache is consulted before the indexes are?

marshall15:06:54

@gax parts of the index are in the cache; the query engine knows where in the index ‘tree’ it needs to look. it first looks for those segments in the local cache, then memcached, then finally storage

marshall15:06:35

@hmaurer Yes that is possible. Alternatively we have Enterprise licensing options that may make sense for use cases with multiple system requirements

marshall15:06:18

@hmaurer are they related systems (i.e. sharding) or totally independent?

hmaurer15:06:34

@marshall Thanks for the quick reply! In my case we are considering using Datomic at my company (in which case we would get a Pro license), but I have a few non-profit projects on the side that could make use of Datomic but don’t have the budget for a 5k/year license

hmaurer15:06:43

Which is why I was asking 🙂

Galaux15:06:23

@marshall ok thanks for that clarification!

marshall15:06:26

Gotcha. Yes, you can certainly get a Starter license for a non-profit side project. As far as multiple individual licenses, it might be best to have a call to discuss - you can shoot me an email and we can set something up ([email protected])

hmaurer15:06:45

@marshall related question: let’s say I want to write an infrastructure test which spins up a Datomic system, runs some tests against it, and tears it down. I assume I can use the same license as the “prod” system?

marshall15:06:13

Yes, all licenses provide unlimited testing/staging/dev instances

hmaurer15:06:22

Thanks! I’ll definitely shoot you an email at some point!

hmaurer15:06:26

Ok, awesome

hmaurer15:06:03

@marshall Also since you are around, I asked a question earlier about backups. I know Datomic has a utility to store backups incrementally to S3 or similar, but I was wondering if backing up the underlying storage would also work

hmaurer15:06:29

The preferred solution seems to be, unsurprisingly, to use Datomic’s backup procedure

marshall15:06:43

it depends on the storage

hmaurer15:06:43

but I am nonetheless curious as to whether backing up the underlying storage would do the job

hmaurer15:06:57

Let’s say SQL or DynamoDB

hmaurer15:06:33

Ah I see; makes sense.

marshall15:06:46

so SQL yes, Dynamo nope

hmaurer15:06:01

Last question, to which I also got an answer from a community member but not from a Datomic dev: is it “ok”, performance-wise, to do a lot of “asOf” / history / “since” queries at arbitrary points in time?

hmaurer15:06:18

e.g. to provide users with a feature to see the state of a document at any point in the past

hmaurer15:06:27

or show them a changelog for that document

marshall15:06:44

yep; depending on how ‘deep’ your history is they may or may not be more expensive than “current”, but generally the performance is quite good and lots of customers use it for exactly that purpose
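
The document-changelog use case described above might be sketched like this, where conn, doc-eid, and the date are placeholders:

(require '[datomic.api :as d])

;; the document as it was at some past instant
(d/pull (d/as-of (d/db conn) #inst "2017-01-15") '[*] doc-eid)

;; every assertion/retraction ever made about it, in time order
(->> (d/q '[:find ?attr ?v ?when ?added
            :in $ ?e
            :where
            [?e ?a ?v ?tx ?added]
            [?a :db/ident ?attr]
            [?tx :db/txInstant ?when]]
          (d/history (d/db conn)) doc-eid)
     (sort-by #(nth % 2)))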

hmaurer15:06:03

Awesome, thank you! 🙂

Galaux15:06:29

@marshall earlier you mentioned you would be curious to have a look at the cache metrics on the peer. Given what we said, would you still be looking at the cache?

Galaux15:06:50

I came up with some code using a callback to send metrics from the peers, so I have some metrics, but unfortunately nothing about the cache – though I do have this metric for the transactor.

marshall15:06:18

@gax it might be somewhat illustrative, but those numbers indicate ~ 0.01msec per value retrieved

hmaurer15:06:21

@marshall Thanks. Last but not least: would you recommend the Client API or the Peer API for a new application? From what I understand the Client API cannot do cross-database (or cross-points-in-time) joins, which seems like a big feature-loss, but I am not quite sure since I haven’t used it yet

marshall15:06:58

@hmaurer Depends on your needs; Your system overall could use both, mixing and matching as necessary : http://docs.datomic.com/clients-and-peers.html

hmaurer15:06:58

@marshall I see. I should read the doc fully before bothering you again. Thanks!

marshall15:06:01

no problem 🙂

Galaux15:06:42

@marshall do you think performing a :find directly on the children would speed up the query?

marshall15:06:03

it might; worth a test certainly. The other option to try would be to get all the children’s entity IDs directly in the query and then do a pull-many on them

marshall15:06:09

not sure whether that’ll be faster or not
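
A sketch of that pull-many variant, assuming a hypothetical :model/children cardinality-many attribute linking parent to children:

(let [child-eids (d/q '[:find [?c ...]
                        :in $ ?parent
                        :where [?parent :model/children ?c]]
                      db parent-eid)]
  (d/pull-many db '[*] child-eids))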

Galaux15:06:17

Ok! I will try this.

Galaux15:06:27

Also: as I said, I came up with some code to get metrics out of the peer, but it won't show this ObjectCache metric that looks veeeery interesting. Is it normal that the peer won't show this metric? Should it?

marshall15:06:52

it should, yes

robert-stuttaford15:06:32

@gax you mentioned a q-explain earlier. what did you mean by this?

marshall15:06:35

also, are you using memcached?

Galaux15:06:19

@marshall no as Cassandra did not seem to be the bottleneck

robert-stuttaford15:06:09

neat, hadn’t seen that before, thanks!

marshall15:06:23

@gax hard to assess whether it’s storage latency and/or whether memcached would help without some metrics (i.e. storageGetMsec numbers, cache numbers)

Galaux15:06:19

@marshall would these metrics from the transactor be ok?

marshall15:06:52

@gax they wouldn’t provide info about the query/pull of interest. all that work happens on the peer

Galaux15:06:28

will try to fix my reporter then

hmaurer16:06:59

@marshall Hi! Another question… I read that it is highly recommended not to make “breaking” changes to the schema or change the semantics of an attribute. However, it seems you cannot completely exclude the possibility that a poor design decision was made in the past, and that all the facts of some type X in the database history need to be updated to match a new schema. In those rare cases, is it doable?

hmaurer16:06:17

E.g. change the type of an attribute and migrate all the data accordingly, etc

hmaurer16:06:24

Roughly speaking this would mean traverse the whole log and make arbitrary edits to any transaction

hmaurer16:06:37

and have Datomic update all its indices accordingly

hmaurer16:06:18

Actually now that I think of it, this could be done by re-building a new database and copying everything over, setting “txInstant” manually to keep the timeline

hmaurer16:06:03

Just wondering if there is a less “nuclear” option for those scenarios

hmaurer16:06:00

(p.s. I read and understood http://blog.datomic.com/2017/01/the-ten-rules-of-schema-growth.html ; only talking about rare scenarios here)

hmaurer16:06:05

ah, you were faster than me

marshall16:06:06

yes, you can definitely rebuild a database if necessary

marshall16:06:11

the way you indicate

marshall16:06:34

a ‘less drastic’ option would be something like you suggest: creating a new attribute (of a different type, say) and migrating the data over

marshall16:06:50

and Datomic does allow you to rename attributes if necessary for that sort of thing
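
Both options might look roughly like the sketches below; the attribute names (:model/optimization, :model/score-str, :model/score) are made up for illustration:

;; rename: assert a new :db/ident on the attribute entity
@(d/transact conn [{:db/id    :model/optimization
                    :db/ident :model/optimisation}])

;; migrate to a new attribute of a different type (string -> long)
(let [db (d/db conn)
      tx (for [[e v] (d/q '[:find ?e ?v
                            :where [?e :model/score-str ?v]]
                          db)]
           [:db/add e :model/score (Long/parseLong v)])]
  @(d/transact conn tx))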

hmaurer16:06:21

Ok, great, thank you

hmaurer16:06:54

Yet another question…: is it possible to follow the transaction log from a remote service? For example, to keep an Elasticsearch instance in sync

hmaurer16:06:08

Ah, actually I guess this can be built on top of the Log API: http://docs.datomic.com/log.html

marshall16:06:37

combination of the log and the tx report queue would be what you want
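
A sketch of that combination: catch up from the Log API, then follow the live queue. Here index-into-elasticsearch! and last-indexed-t are hypothetical:

(require '[datomic.api :as d])

;; 1. catch up on anything missed since the last indexed basis-t
(doseq [{:keys [t data]} (d/tx-range (d/log conn) last-indexed-t nil)]
  (index-into-elasticsearch! t data))

;; 2. then follow live transactions as they commit
(let [queue (d/tx-report-queue conn)]
  (future
    (while true
      (let [{:keys [db-after tx-data]} (.take queue)]
        (index-into-elasticsearch! (d/basis-t db-after) tx-data)))))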

hmaurer16:06:02

@marshall I am thoroughly impressed

robert-stuttaford16:06:16

hidden in this post is the fact that we use the tx-report-queue to loosely couple our web services to our worker services. no need for a separate queue at all

robert-stuttaford16:06:39

everything just talks to / watches storage

hmaurer16:06:57

@robert-stuttaford this is awesome. Sounds like event sourcing without the pain of implementing an event-sourced system from scratch

robert-stuttaford16:06:12

that’s certainly how we use it

robert-stuttaford16:06:19

just the other day i had to find out why something went missing. turns out someone wrote an overzealous hand-written transaction and cut 4000-ish important datoms from ‘now’. had a ‘revert’ transaction transacted within 10 minutes, via remote REPL
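
Such a revert might be sketched as: query the history db for datoms the bad transaction retracted, then re-assert them (bad-tx being the offending transaction's entity id):

(let [hist    (d/history (d/db conn))
      reverts (->> (d/q '[:find ?e ?a ?v
                          :in $ ?tx
                          :where [?e ?a ?v ?tx false]]
                        hist bad-tx)
                   (map (fn [[e a v]] [:db/add e a v])))]
  @(d/transact conn reverts))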

robert-stuttaford16:06:48

immutability is the gift that keeps on giving. it’s actually astonishing how it’s such a given that we should all use source control, when source is actually mostly a liability. but most folks use a forget-by-default database for their data, which is undeniably an asset. no one talks about Big Source, after all 🙂

hmaurer16:06:18

@robert-stuttaford Yes, immutability is (mostly) a blessing to work with. I can’t complain so far 🙂

spieden18:06:21

@robert-stuttaford either forget-by-default, or try to implement a broken subset of immutability via log tables at great cost!

souenzzo20:06:59

Is there something like :db.type/edn (a proposal, a workaround, future plans...)? I have two key use cases (storing graphs and queries) and I don't know exactly how to handle them...

robert-stuttaford20:06:10

@souenzzo : use string + pr-str / clojure.edn/read-string. works just fine
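
A sketch of that round-trip, with a hypothetical :model/graph-edn string attribute:

(require '[clojure.edn :as edn]
         '[datomic.api :as d])

;; write: serialize the value to a string with pr-str
@(d/transact conn [{:db/id           entity-id
                    :model/graph-edn (pr-str {:nodes #{:a :b} :edges [[:a :b]]})}])

;; read: parse it back with clojure.edn/read-string
(-> (d/pull (d/db conn) [:model/graph-edn] entity-id)
    :model/graph-edn
    edn/read-string)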

souenzzo20:06:49

Yeah, I'm planning on using this. But I wanted to know if more people have the same problem, and if there is any expectation of having an edn-like type in Datomic

favila20:06:09

they've promised custom types from the beginning, and fressian is extensible enough to support it, but nothing has materialized

favila20:06:48

string or binary blob is how we handle it now, or for smaller types encode them into existing types somehow

hmaurer23:06:34

> publicly display or communicate the results of internal performance testing or other benchmarking or performance evaluation of the Software;

hmaurer23:06:47

May I ask why this is a clause in the T&Cs?