This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2016-08-18
Channels
- # alda (6)
- # architecture (1)
- # bangalore-clj (3)
- # beginners (39)
- # boot (292)
- # braveandtrue (1)
- # cider (7)
- # clara (2)
- # cljs-dev (20)
- # cljsjs (9)
- # cljsrn (42)
- # clojure (127)
- # clojure-chennai (1)
- # clojure-dev (96)
- # clojure-india (1)
- # clojure-russia (175)
- # clojure-spec (56)
- # clojure-uk (11)
- # clojureindia (1)
- # clojurescript (82)
- # core-async (7)
- # cursive (21)
- # data-science (1)
- # datomic (173)
- # funcool (4)
- # hoplon (8)
- # instaparse (1)
- # jobs (7)
- # jobs-discuss (1)
- # jobs-rus (30)
- # lambdaisland (1)
- # lein-figwheel (8)
- # off-topic (5)
- # om (51)
- # onyx (79)
- # other-languages (7)
- # planck (8)
- # re-frame (95)
- # reagent (6)
- # rum (8)
- # specter (4)
- # untangled (54)
- # yada (5)
@cezar I'm not sure the hash index is WAL logged so it might not be reliable enough
https://www.postgresql.org/docs/9.5/static/indexes-types.html I'm looking at the warning here
@cezar: just out of curiosity, what is the issue you see with BTree indexes?
i think it's a bit faster for KV type lookups and quite significantly faster for writes
unless you're really concerned with writes, I don't think it's an issue, just cache reads as much as possible so you never have to go to storage
or infrequently at least
if writes are the bottleneck, datomic probably isn't the best fit anyway
@danielstockton: ah ok. I agree, I would add that data transfer time will likely dominate the btree read lookup time, and that indexing time will likely dominate the btree write time 🙂
true, log index is also a b-tree though and needs to be written to before a transaction is committed (not via background indexing)
My concern is not so much with speed but with data volume. If I have a bunch of Datomic databases managed by a single transactor the sole datomic_kvs table will become massive and the corresponding B-Tree index will be very slow for new inserts. In my experience anything over 100M entries in a BTree is just not performant for most applications. Again, I'm more concerned over inserts than reads. Also to preempt some, yes, I realize there is an option to use Dynamo, Couchbase etc but within an organization it's always easier to deploy on infrastructure that's already in place
@cezar: I doubt you'll reach 100M entries (would mean 100M segments, each of which contains from 1000 to 20000 datoms according to the docs - http://docs.datomic.com/capacity.html#sec-6), whereas we know the practical limit of Datomic is 10G datoms.
@val_waeselynck: the limit is per database not per transactor
So theoretically, you'll stay 1 order of magnitude below the 100M limit I guess
hmm I see what you mean
10billion datoms is the theoretical upper limit
due to the size of the index roots in peer memory
ok let's work with my actual numbers: ~1000 databases (only a handful used at any one time) 1 transactor up to 500M datoms per database
at 64k datoms per segment, that gives you 156 250 segments
oh, you're right! feel the learn!
at 20k a seg, that's a lower bound of half a million segments
pessimistically, assuming you have 1000 datoms per segment, that's about 500k segments per database
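The back-of-envelope arithmetic above can be checked in a few lines of Clojure (a sketch; the 1,000-20,000 datoms-per-segment range comes from the capacity docs linked earlier):

```clojure
;; Rough number of storage segments for a given datom count.
(defn segments [total-datoms datoms-per-segment]
  (long (/ total-datoms datoms-per-segment)))

;; At the ~10 billion datom practical limit, best case (20k datoms/segment):
(segments 10e9 20000)  ;; => 500000

;; For one 500M-datom database, pessimistically (1,000 datoms/segment):
(segments 500e6 1000)  ;; => 500000
```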
and you can't use more than one transactor?
license limits only apply at the txor level. you can run as many as you want
i will have spikes of heavy writes to a couple of databases at a time and then they go dormant for a long time
but I can't excise or archive them. they have to be theoretically accessible due to SLA
then perhaps psql isn't the right storage for you
@cezar: If you don't want your BTrees to get too deep you could maybe create several datomic_kvs tables
@val_waeselynck: but how do I set up the transactor to write to a bunch of tables vs just one?
I'm not sure you can share a transactor between several databases actually
interesting!
oh you're right, that's even how dev storage works
multiple databases on a transactor? definitely
my mistake
multiple storages on a transactor? nope
I think for this kind of advanced stuff I should definitely leave you in the good hands of Cognitect support 🙂
I hope they could pipe in here 🙂 I don't have a contract with them yet (though we are currently 90% committed to Datomic for this project)
but I do have to resolve the BTree growth issue or get the buy in to use a proper KV store like Cassandra or Couchbase
@marshall and @bkamphaus can likely offer useful info
@cezar Datomic does not currently provide any way to remove the “dormant” dbs from a transactor, you would have to fail over to another transactor to do that
@stuarthalloway: that's not really my issue. Scroll up for the start of this conversation
or Stu 🙂
@cezar: maybe it is an issue you don’t know about yet 🙂
transactors are intended to manage a small number of dbs
if you have lots of dbs, you will need lots of transactors
separately from any storage implications
they stay in memory forever
I am not saying it has to be that way — nobody has requested a use case like yours before
how do folks like Nubank, who eventually plan to use datomic at scale, handle this? tons of transactors?
s/tons/tens 🙂
well, I was hoping to neatly subdivide my data into a database per customer account (we have a couple of thousand customers)
@cezar how quickly do you need to bring up a db for a mostly dormant customer?
e.g. you could make a process manager that spins up an appropriate peer/transactor pair on demand, and then have your own external logic to spin them back down
depends on the request, but usually if it's dormant then a couple of minutes might be OK... but I'd have to consult product managers on this
you'd have to shard storages with that approach, right, @stuarthalloway ?
rough guess, you should be able to spin up a system in about a minute
@robert-stuttaford: transactors cannot share the same storage, but can cohabit in the same storage engine under different table names
that's what i thought, which does solve the original problem cezar mentioned
@stuarthalloway: how many dbs per transactor are "reasonable"?
Datomic is fundamentally a Cloud architecture, built for a world where processes are cheap and isolation is a Good Thing
@cezar a transactor should only handle a tiny number of dbs that are both (a) large and (b) have ongoing write volume, and by tiny I mean <10, probably closer to 1-3
lots of customers at scale shard by time, so only 1 db has ongoing write volume
right, see above
in e.g. the AWS cloud, the answer is clear — you just do 1 transactor per db and be done with it
for people running their own data centers, this can be more of a challenge because they lack something as polished as CloudFormation, ASGs, etc.
i'd love to know how they accomplish that time based sharding
@robert-stuttaford it is easy if your queries are time-scoped by the nature of the domain. Just start a new transactor+db on each domain time boundary
is it not easy otherwise 🙂
i guess i'm more curious about the boundary between the shards and the control database
but i suppose i could figure it out if i thought it through!
i hear you, though. it has to make sense for the domain
@cezar I understand, and Datomic may not be a great fit. What was your aggregate data size across all customers, in datoms?
@stuarthalloway do you guys have any tools for rebuilding databases? what was mentioned as 'decanting' on the last cognicast episode. i'm gearing up to do so at the moment, and i'd love to leverage any shortcuts that may exist, if you have any 🙂
reason is to get rid of all the accumulated cruft over 4 years - badly named schema, unwanted data (in the 100,000s datoms range), no-op transactions, etc
@stuarthalloway: Datomic is a very good fit otherwise. Plus we already started building on it. It never occurred to us that the limit of DBs per transactor was so small. We might still manage somehow but it's certainly making our lives a lot harder. I don't have an "aggregate" figure now but the data (like most data) will be cumulative over time. I forecast about 100B datoms per year (spread across many separate DBs)
@robert-stuttaford: several customers have written tools, some with our help. Some planned to open source but not sure any have.
@cezar we should have @marshall give you a call and talk through options
thanks Stu
@stuarthalloway forgive my cheekiness, but is it perhaps possible for you to put me in touch with those who planned to open source theirs? it's a big job i'm tackling, and i'd love an independent perspective on this, as i may save myself some time and effort
totally cool, if not possible
@robert-stuttaford: I will defer to @marshall to sort that out
thank you 🙂
@robert-stuttaford I have a tool like that useful to make a subset db, based on your work
seems like everyone has a tool like that
yeah. this time i care about maintaining the transaction order, and not losing the original timestamps. the one i shared with you before is just a 'now' snapshot, which is a lot simpler to produce
@robert-stuttaford I’ll do a bit of asking around
thank you, sir 🙂
Hey all... I've got kind of a difficult query. The issue is that the data set is rather large. I have an event of a specific type that I'm trying to tie back to another entity based on related refs they each have. I'm finding that I'm running out of memory before this query completes. I was wondering if anyone had any tips
@petr datalog queries are set based; the whole result needs to fit in ram
you could instead use d/datoms -- which is lazy -- to walk one entity kind and use pull / entity to discover the rest
this at least allows you to do partial query or batched query
if you need to do this query often, you could write cache refs into the database that shorten the path from one to the other
everything's indexed already 🙂
have you used d/datoms before?
right. you can use it with the other indexes too: :eavt, :vaet, :aevt
the event type - is that expressed as an attr?
perhaps as a ref to an enum?
e.g. one of several entities with db/ident values?
(d/q '[:find ?ue
       :in $
       :where
       [?ue :user-event/type :user-event.type/create-share]
       [?ue :user-event/asset ?asset]
       [?share :share/assets ?asset]
       [?share :share/created-at ?t]
       [?ue :user-event/occurred-at ?t]
       [?user :user/events ?ue]
       [?user :user/shares ?share]]
     db)
if so, then you can cheat: (seq (d/datoms db :vaet (d/entid db :user-event.type/create-share) :user-event/type))
all the :e
values on this seq will give you ?ue
you can then craft a pull spec which expresses the rest of your clauses, or perhaps several normal clojure operations
see where i'm going with this?
not sure if you know, but pull does support reverse ref traversal
well, this way, you have the option of batching results and transacting those cache refs every so often
which allows you to do a small portion of the work at a time
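Putting the pieces above together, a sketch of the lazy, batched approach (assumes a live Datomic peer; the attribute names mirror the query pasted earlier and are illustrative):

```clojure
(require '[datomic.api :as d])

;; Lazily walk the :vaet index for every event of the given type.
(defn events-of-type [db type-ident]
  (d/datoms db :vaet (d/entid db type-ident) :user-event/type))

;; Pull each event entity in batches of n, so only a small portion
;; of the result is realized in memory at a time.
(defn process-in-batches! [db n process-fn]
  (doseq [batch (partition-all n (events-of-type db :user-event.type/create-share))]
    (process-fn
     (map #(d/pull db
                   [:user-event/occurred-at
                    {:user-event/asset [:db/id]}
                    ;; reverse ref traversal, as mentioned above:
                    {:user/_events [:db/id]}]
                   (:e %))
          batch))))
```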
Yep, I was originally just thinking of doing the first line and then using partitions to chunk the data into smaller pieces
good luck 🙂
Did Datomic used to have attributes which were later removed? I'm wondering why there seem to be gaps (e.g. no entities 5-7) and nil entries in (:elements db)
Does anyone here use Datomic in tests with CircleCI? I can't seem to figure out if this is possible
Say you regularly receive a broadcast entity (say, a User profile) that probably hasn’t changed.
If you make a db/tx function to check if it needs to actually be transacted, you also get a bunch of empties (because you return []) most of the time.
If you query the DB to resolve the entity, then compare it to the one coming over the wire, you’re not transactionally safe.
And if you have more than one, perhaps the occasional empty if the “value of the db” you’re querying is updated elsewhere. Hm.
Or you could have your db/tx throw a specific “short-circuit” exception you don’t have to log as an error.
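One possible shape for the transaction function being discussed (a sketch; the entity and attribute names are made up for illustration). Because it runs inside the transactor against the current db value, the comparison is transactionally safe:

```clojure
(require '[datomic.api :as d])

;; Install as a database function, e.g. under :user/upsert-profile.
(def upsert-profile
  (d/function
   '{:lang   :clojure
     :params [db eid new-profile]
     :code   (let [current (select-keys (datomic.api/pull db '[*] eid)
                                        (keys new-profile))]
               (if (= current new-profile)
                 []  ;; unchanged: return a no-op instead of a redundant write
                 [(merge {:db/id eid} new-profile)]))}))
```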
@zentrope you could also batch them to reduce the number of empty transactions
Hm. Makes sense. Or even put a cache/memoize in there somewhere. Store incoming message checksums.
@zentrope or a Bloom filter or whatnot, but you may run into cache invalidation issues
you can also serialize externally even with several peers using e.g HornetQ with Message Grouping
Yeah. All techniques outside of datomic itself. Perhaps the “throw a special exception” idea is the least amount of work.
whatever floats your boat 🙂 what's the frequency ?
For instance, with RDBMS, you can use a .rollback if you discover things don’t need to be done. That kind of thing.
Even if I do the naive thing and just query the database right before deciding to write, if I do overwrite something, I've always got the history. ;)
hmm, i guess in your case problems arise if you decide not to write