#datomic
2016-08-18
danielstockton07:08:55

@cezar I'm not sure the hash index is WAL logged so it might not be reliable enough

val_waeselynck07:08:35

@cezar: just out of curiosity, what is the issue you see with BTree indexes?

danielstockton07:08:17

i think it's a bit faster for KV type lookups and quite significantly faster for writes

danielstockton07:08:39

unless you're really concerned with writes, I don't think it's an issue, just cache reads as much as possible so you never have to go to storage

danielstockton07:08:51

or infrequently at least

danielstockton07:08:02

if writes are the bottleneck, datomic probably isn't the best fit anyway

val_waeselynck08:08:50

@danielstockton: ah ok. I agree, I would add that data transfer time will likely dominate the btree read lookup time, and that indexing time will likely dominate the btree write time 🙂

danielstockton08:08:49

true, log index is also a b-tree though and needs to be written to before a transaction is committed (not via background indexing)

cezar12:08:11

My concern is not so much with speed but with data volume. If I have a bunch of Datomic databases managed by a single transactor, the sole datomic_kvs table will become massive and the corresponding B-Tree index will be very slow for new inserts. In my experience, anything over 100M entries in a B-Tree is just not performant for most applications. Again, I'm more concerned about inserts than reads. Also, to preempt some suggestions: yes, I realize there is an option to use Dynamo, Couchbase etc, but within an organization it's always easier to deploy on infrastructure that's already in place

val_waeselynck12:08:27

@cezar: I doubt you'll reach 100M entries (would mean 100M segments, each of which contains from 1000 to 20000 datoms according to the docs - http://docs.datomic.com/capacity.html#sec-6), whereas we know the practical limit of Datomic is 10G datoms.

cezar12:08:53

@val_waeselynck: the limit is per database not per transactor

val_waeselynck12:08:57

So theoretically, you'll stay 1 order of magnitude below the 100M limit I guess

cezar12:08:06

I want to blow way past that limit by having many databases

val_waeselynck12:08:17

hmm I see what you mean

robert-stuttaford12:08:27

10billion datoms is the theoretical upper limit

robert-stuttaford12:08:38

due to the size of the index roots in peer memory

cezar12:08:30

ok let's work with my actual numbers: ~1000 databases (only a handful used at any one time) 1 transactor up to 500M datoms per database

robert-stuttaford12:08:36

at 64k datoms per segment, that gives you 156 250 segments

cezar12:08:01

there is not 64,000 datoms per segment

cezar12:08:08

each segment is about 64Kbytes

cezar12:08:11

that's far fewer datoms

robert-stuttaford12:08:26

oh, you're right! feel the learn!

cezar12:08:49

usually a couple of thousand in my experience

robert-stuttaford12:08:07

at 20k a seg, that's a lower bound of half a million segments

val_waeselynck12:08:14

pessimistically, assuming you have 1000 datoms per segment, that's about 500k segments per database

cezar12:08:04

yeah, but times 1000 databases total I'm looking at 500M+ rows in postgres

cezar12:08:18

not to mention we have to remember there are at least three indexes
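As a back-of-envelope check of the row count being discussed (the per-segment figure is the thread's pessimistic assumption, not a measured value):

```clojure
;; ~500M datoms per database, ~1000 databases, and (pessimistically)
;; ~1000 datoms per 64KB segment, per the discussion above.
(let [datoms-per-db      500e6
      databases          1000
      datoms-per-segment 1000]
  (long (/ (* datoms-per-db databases) datoms-per-segment)))
;; => 500000000 segment rows in datomic_kvs, before counting
;;    the multiple covering indexes
```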

robert-stuttaford12:08:20

and you can't use more than one transactor?

cezar12:08:35

I can... I just don't want to because my traffic isn't very concurrent

robert-stuttaford12:08:38

license limits only apply at the txor level. you can run as many as you want

cezar12:08:05

i will have spikes of heavy writes to a couple of database at a time and then they go dormant for a long time

cezar12:08:27

but I can't excise or archive them. they have to be theoretically accessible due to SLA

robert-stuttaford12:08:48

then perhaps psql isn't the right storage for you

val_waeselynck12:08:49

@cezar: If you don't want your BTrees to get too deep you could maybe create several datomic_kvs tables

cezar12:08:16

@val_waeselynck: but how do I set up the transactor to write to a bunch of tables vs just one?

cezar12:08:23

or do you mean use N transactors?

val_waeselynck12:08:57

I'm not sure you can share a transactor between several databases actually

cezar12:08:36

yes you can... I tested that with no ill effects

cezar12:08:44

I believe it is officially supported

val_waeselynck12:08:15

oh you're right, that's even how dev storage works

robert-stuttaford12:08:18

multiple databases on a transactor? definitely

robert-stuttaford12:08:46

multiple storages on a transactor? nope

val_waeselynck12:08:01

I think for this kind of advanced stuff I should definitely leave you in the good hands of Cognitect support 🙂

cezar12:08:44

I hope they could pipe in here 🙂 I don't have a contract with them yet (though we are currently 90% committed to Datomic for this project)

cezar12:08:15

but I do have to resolve the BTree growth issue or get the buy in to use a proper KV store like Cassandra or Couchbase

robert-stuttaford12:08:21

@marshall and @bkamphaus can likely offer useful info

stuarthalloway12:08:53

@cezar Datomic does not currently provide any way to remove the “dormant” dbs from a transactor, you would have to fail over to another transactor to do that

cezar12:08:22

@stuarthalloway: that's not really my issue. Scroll up for the start of this conversation

stuarthalloway12:08:51

@cezar: maybe it is an issue you don’t know about yet 🙂

stuarthalloway12:08:20

transactors are intended to manage a small number of dbs

stuarthalloway12:08:30

if you have lots of dbs, you will need lots of transactors

stuarthalloway12:08:51

separately from any storage implications

cezar12:08:13

why is this? What if I only access a handful of dbs at any one time?

stuarthalloway12:08:28

they stay in memory forever

stuarthalloway12:08:04

I am not saying it has to be that way — nobody has requested a use case like yours before

cezar12:08:20

how do folks like Nubank, who eventually plan to use datomic at scale, manage? tons of transactors?

stuarthalloway12:08:41

s/tons/tens 🙂

cezar12:08:50

I see 🙂

cezar12:08:19

well, I was hoping to neatly subdivide my data into a database per customer account (we have a couple of thousand customers)

cezar12:08:29

and they will rarely access that data concurrently

stuarthalloway12:08:41

@cezar how quickly do you need to bring up a db for a mostly dormant customer?

stuarthalloway12:08:34

e.g. you could make a process manager that spins up an appropriate peer/transactor pair on demand, and then have your own external logic to spin them back down

cezar12:08:34

depends on the request, but usually if it's dormant then a couple of minutes might be OK... but I'd have to consult product managers on this

robert-stuttaford12:08:23

you'd have to shard storages with that approach, right, @stuarthalloway ?

stuarthalloway12:08:28

rough guess, you should be able to spin up a system in about a minute

cezar12:08:57

operationally that may be a hard sell for me

cezar12:08:30

to have all this infrastructure around standing up/shutting down transactors

stuarthalloway12:08:17

@robert-stuttaford: transactors cannot share the same storage, but can cohabit in the same storage engine under different table names

robert-stuttaford12:08:44

that's what i thought, which does solve the original problem cezar mentioned

cezar12:08:47

@stuarthalloway: how many dbs per transactor are "reasonable"?

cezar12:08:10

more? fewer?

stuarthalloway12:08:02

Datomic is fundamentally a Cloud architecture, built for a world where processes are cheap and isolation is a Good Thing

stuarthalloway12:08:21

@cezar a transactor should only handle a tiny number of dbs that are both (a) large and (b) have ongoing write volume, and by tiny I mean <10, probably closer to 1-3

stuarthalloway12:08:46

lots of customers at scale shard by time, so only 1 db has ongoing write volume

cezar12:08:48

what if only a couple are written to concurrently

stuarthalloway12:08:04

right, see above

stuarthalloway12:08:06

in e.g. the AWS cloud, the answer is clear — you just do 1 transactor per db and be done with it

stuarthalloway12:08:49

for people running their own data centers, this can be more of a challenge because they lack something as polished as CloudFormation, ASGs, etc.

cezar12:08:13

this is my use case unfortunately 😕

robert-stuttaford12:08:18

i'd love to know how they accomplish that time based sharding

cezar12:08:21

ie internal data center

stuarthalloway12:08:33

@robert-stuttaford it is easy if your queries are time-scoped by the nature of the domain. Just start a new transactor+db on each domain time boundary

stuarthalloway12:08:49

it is not easy otherwise 🙂

robert-stuttaford12:08:16

i guess i'm more curious about the boundary between the shards and the control database

robert-stuttaford12:08:25

but i suppose i could figure it out if i thought it through!

robert-stuttaford12:08:37

i hear you, though. it has to make sense for the domain

stuarthalloway12:08:45

@cezar I understand, and Datomic may not be a great fit. What was your aggregate data size across all customers, in datoms?

robert-stuttaford12:08:55

@stuarthalloway do you guys have any tools for rebuilding databases? what was mentioned as 'decanting' on the last cognicast episode. i'm gearing up to do so at the moment, and i'd love to leverage any shortcuts that may exist, if you have any 🙂

robert-stuttaford12:08:39

reason is to get rid of all the accumulated cruft over 4 years - badly named schema, unwanted data (in the 100,000s datoms range), no-op transactions, etc

cezar13:08:25

@stuarthalloway: Datomic is a very good fit otherwise. Plus we already started building on it. It never occurred to us that the limit of DBs per transactor was so small. We might still manage somehow but it's certainly making our lives a lot harder. I don't have an "aggregate" figure now but the data (like most data) will be cumulative over time. I forecast about 100B datoms per year (spread across many separate DBs)

cezar13:08:52

very rough estimate

stuarthalloway13:08:52

@robert-stuttaford: several customers have written tools, some with our help. Some planned to open source but not sure any have.

stuarthalloway13:08:36

@cezar we should have @marshall give you a call and talk through options

cezar13:08:59

I'll PM him with my phone number

robert-stuttaford13:08:35

@stuarthalloway forgive my cheekiness, but is it perhaps possible for you to put me in touch with those who planned to open source theirs? it's a big job i'm tackling, and i'd love an independent perspective on this, as i may save myself some time and effort

robert-stuttaford13:08:56

totally cool, if not possible

stuarthalloway13:08:10

@robert-stuttaford: I will defer to @marshall to sort that out

pesterhazy13:08:29

@robert-stuttaford I have a tool like that for making a subset db, based on your work

pesterhazy13:08:52

seems like everyone has a tool like that

robert-stuttaford13:08:41

yeah. this time i care about maintaining the transaction order, and not losing the original timestamps. the one i shared with you before is just a 'now' snapshot, which is a lot simpler to produce

marshall14:08:49

@robert-stuttaford I’ll do a bit of asking around

robert-stuttaford14:08:44

thank you, sir 🙂

kvlt15:08:11

Hey all... I've got kind of a difficult query. The issue is that the data set is rather large. I have an event of a specific type that I'm trying to tie back to another entity based on related refs they each have. I'm finding that I'm running out of memory before this query completes. I was wondering if anyone had any tips

robert-stuttaford15:08:16

@petr datalog queries are set based; the whole result needs to fit in ram

kvlt15:08:46

So you'd suggest breaking them up?

robert-stuttaford15:08:06

you could instead use d/datoms -- which is lazy -- to walk one entity kind and use pull / entity to discover the rest

robert-stuttaford15:08:17

this at least allows you to do partial query or batched query

robert-stuttaford15:08:08

if you need to do this query often, you could write cache refs into the database that shorten the path from one to the other

kvlt15:08:13

Wouldn't that require there to be indexes?

kvlt15:08:18

This is a once off query

robert-stuttaford15:08:27

everything's indexed already 🙂

kvlt15:08:31

The idea is to create this cache ref

kvlt15:08:42

That's the reasoning behind the query

robert-stuttaford15:08:59

have you used d/datoms before?

kvlt15:08:10

I have but I think only with the :avet index

kvlt15:08:25

It's been a while

robert-stuttaford15:08:34

right. you can use it with the others too: :eavt, :vaet, :aevt

kvlt15:08:48

I know 🙂 Just trying to refresh my memory

robert-stuttaford15:08:33

the event type - is that expressed as an attr?

robert-stuttaford15:08:48

perhaps as a ref to an enum?

kvlt15:08:49

I can show you the query I've written

robert-stuttaford15:08:58

e.g. one of several entities with db/ident values?

kvlt15:08:18

(d/q '[:find ?ue
       :in $
       :where
       [?ue :user-event/type :user-event.type/create-share]
       [?ue :user-event/asset ?asset]
       [?share :share/assets ?asset]
       [?share :share/created-at ?t]
       [?ue :user-event/occurred-at ?t]
       [?user :user/events ?ue]
       [?user :user/shares ?share]]
     db)

kvlt15:08:54

The first line refers to an enum

robert-stuttaford15:08:56

if so, then you can cheat: (seq (d/datoms db :vaet (d/entid db :user-event.type/create-share) :user-event/type)) all the :e values on this seq will give you ?ue

kvlt15:08:57

The others are refs

robert-stuttaford15:08:16

you can then craft a pull spec which expresses the rest of your clauses, or perhaps several normal clojure operations
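A minimal sketch of this approach, assuming the schema from the pasted query (the function names here are illustrative, and :share/assets and :user/shares are assumed to be cardinality-many refs):

```clojure
(require '[datomic.api :as d])

;; Lazily walk every create-share event via the VAET index (the
;; "cheat" above), then verify the remaining join clauses per event
;; with entity navigation, instead of one big set-based query.
(defn create-share-event-ids [db]
  (map :e (d/datoms db :vaet
                    (d/entid db :user-event.type/create-share)
                    :user-event/type)))

(defn matching-event-ids [db]
  (for [ue-id (create-share-event-ids db)
        :let  [ue    (d/entity db ue-id)
               asset (:user-event/asset ue)
               t     (:user-event/occurred-at ue)]
        user  (:user/_events ue)        ; reverse ref: the owning user
        share (:user/shares user)
        :when (and (= t (:share/created-at share))
                   (some #(= (:db/id %) (:db/id asset))
                         (:share/assets share)))]
    ue-id))
```

Because everything here is lazy, you can run the result through e.g. (partition-all 1000 ...) and transact the cache refs batch by batch.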

robert-stuttaford15:08:24

see where i'm going with this?

kvlt15:08:35

Yep I do

kvlt15:08:04

I don't see how this would be much better though

robert-stuttaford15:08:06

not sure if you know, but pull does support reverse ref traversal

kvlt15:08:14

I did know that

robert-stuttaford15:08:08

well, this way, you have the option of batching results and transacting those cache refs every so often

kvlt15:08:09

At some point I'm going to evaluate this expression anyway. Now I could chunk it

robert-stuttaford15:08:26

which allows you to do a small portion of the work at a time

kvlt15:08:02

Yep, I was originally just thinking of doing the first line and then using partitions to chunk the data into smaller pieces

kvlt15:08:54

Thanks rob

danielstockton18:08:44

Did Datomic used to have attributes which were later removed? I'm wondering why there seem to be gaps (e.g. no entities 5-7) and nil entries in (:elements db)

jdkealy21:08:40

Does anyone here use Datomic in tests with Circle CI ? I can't seem to figure out if this is possible

zentrope22:08:44

Say you regularly receive a broadcast entity (say, a User profile) that probably hasn’t changed.

zentrope22:08:02

If you save it to the DB every time, you get a bunch of empty transactions.

zentrope22:08:37

If you make a db/tx function to check if it needs to actually be transacted, you also get a bunch of empties (because you return []) most of the time.

zentrope22:08:06

If you query the DB to resolve the entity, then compare it to the one coming over the wire, you’re not transactionally safe.

zentrope22:08:12

Is there a story for this?

zentrope22:08:12

If you’ve only got one client, you can serialize I guess.

zentrope22:08:02

And if you have more than one, perhaps the occasional empty if the “value of the db” you’re querying is updated elsewhere. Hm.

zentrope22:08:30

Or you could have your db/tx throw a specific “short-circuit” exception you don’t have to log as an error.
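The short-circuit idea could be sketched as a transaction function along these lines (the :app/upsert-if-changed and :app/no-op names are made up for illustration, and the simple = comparison assumes non-ref attributes):

```clojure
(require '[datomic.api :as d])

;; Install a transaction function that compares the incoming attrs to
;; the current value inside the transactor, so the check is
;; transactionally safe. If nothing changed it throws, aborting the
;; would-be empty transaction; the :app/no-op marker lets the peer
;; distinguish this from a real failure.
(def upsert-if-changed
  {:db/ident :app/upsert-if-changed
   :db/fn
   (d/function
    {:lang   "clojure"
     :params '[db eid new-attrs]
     :code   '(let [current (select-keys (datomic.api/entity db eid)
                                         (keys new-attrs))]
                (if (= current new-attrs)
                  (throw (ex-info "no change" {:app/no-op true}))
                  [(merge {:db/id eid} new-attrs)]))})})
```

On the peer, deref'ing the transact future throws an exception wrapping the ex-info; check its ex-data for :app/no-op before logging it as an error.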

val_waeselynck22:08:54

@zentrope you could also batch them to reduce the number of empty transactions

zentrope22:08:25

Hm. Makes sense. Or even put a cache/memoize in there somewhere. Store incoming message checksums.

val_waeselynck22:08:16

@zentrope or a Bloom filter or whatnot, but you may run into cache invalidation issues

val_waeselynck22:08:09

you can also serialize externally even with several peers, using e.g. HornetQ with Message Grouping

zentrope23:08:08

Yeah. All techniques outside of datomic itself. Perhaps the “throw a special exception” idea is the least amount of work.

val_waeselynck23:08:03

whatever floats your boat 🙂 what's the frequency?

zentrope23:08:46

Well, I’m working on a POC, so it’s about every 10, 15, 30 seconds or so.

zentrope23:08:24

Regardless, I was mainly interested in if there was an obvious Datomic answer.

zentrope23:08:58

For instance, with RDBMS, you can use a .rollback if you discover things don’t need to be done. That kind of thing.

zentrope23:08:50

Even if I do the naive thing and just query the database right before deciding to write, if I do overwrite something, I've always got the history ;)

val_waeselynck23:08:31

hmm I guess in your case problems arise if you decide not to write

zentrope23:08:37

Oops. Yep. Right.