This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2016-01-19
@yenda: No, it's not a bad idea. You can use a component to handle the Datomic connection if you want.
we use it with trapperkeeper. it uses http://github.com/rkneufeld/conformity on start to update schema
see https://github.com/robert-stuttaford/tk-app-dev/blob/master/src/tkad/services/datomic.clj (explained on http://www.stuttaford.me/2014/09/24/app-and-dev-services-with-trapperkeeper/)
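A minimal sketch of the conformity-on-startup pattern described above (assumes the Datomic peer library and io.rkn/conformity are on the classpath; the norm and attribute names are hypothetical):

```clojure
(require '[datomic.api :as d]
         '[io.rkn.conformity :as c])

;; Norms map: each named norm carries the transactions that establish it.
;; :my-app/base-schema and :user/email are illustrative names only.
(def norms
  {:my-app/base-schema
   {:txes [[{:db/id                 #db/id[:db.part/db]
             :db/ident              :user/email
             :db/valueType          :db.type/string
             :db/cardinality        :db.cardinality/one
             :db.install/_attribute :db.part/db}]]}})

(defn start! [uri]
  (d/create-database uri)
  (let [conn (d/connect uri)]
    ;; ensure-conforms is idempotent: norms already recorded in the
    ;; database are skipped, so this is safe to run on every startup.
    (c/ensure-conforms conn norms)
    conn))
```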
quick question, I think I'm missing something with Datomic. I see people everywhere saying how it is horizontally read-scalable, especially when used with something like Dynamo or Infinispan, because all the querying is done on the client (peer). But this seems contradictory. With large datasets, sending say 80GB of data over the wire from say 50 machines to a single client, for that one machine to process down to whatever tiny subset is needed, is going to be very slow even if you could manage to make it not OOM. This puts a relatively small absolute limit on dataset size. Whereas if the query were pushed down to the servers, it could be distributed across the nodes, allowing it to actually scale as nodes are added. Scale this up to multi-terabyte datasets and it very quickly becomes obvious this is impossible. What am I missing here? Can Datomic actually handle large datasets; how much is actually done on the servers? Because other than scale concerns it sounds perfect for us.
@dexter: I'm not entirely sure what you mean by "sending 80GB of data from 50 machines to a single client" in the context of read scaling.
sending data from 50 machines sounds like putting data in, which is the job of the transactor (and so single-threaded), but that addresses write scalability.
if I create a query and push it to the servers a-la h-base server side filters, SQL stored procedures etc then the task is subdivided across all 50 machines and each one conducts data-local ops where possible only sending the minimal amount needed over the wire to the client
this means as your data scales you simply divide it more and the overall data needing to be returned to the client as the answer remains constant
if however the client has a snapshot of all relevant data so it can perform the query client-side, that is fine at small scale, but with large datasets you will be limited by a) network bandwidth (a cluster of 50 machines throwing data at you to process will surely saturate the network and cause a serious bottleneck; moving 80GB of data to a single machine takes a loooong time) and b) CPU: once you finally get all the data you still have to perform the query, and one machine will hit its limits long before it can efficiently process that much data
most horizontally read-scalable systems rely on pushing as much of the work as possible out for divide and conquer, so logically a system that does the inverse cannot scale to large datasets: it will take hours just to fetch the source data for a query, before the query itself even runs
ok, I think I see... A couple of thoughts: 1) Peers need to see the transactor and storage. If that's not local, that's a huge bottleneck. (80GB is of course a lot on anything other than very fast networks.)
but you are still limited processing-wise: even just streaming over 80GB on one machine will take hours
if you co-locate your peer with part of the dataset that bit can be fast but you still need to get data from the rest of the cluster and process it
1) Peer: Peer Library! Find me all redheads who know how to play the ukulele! 2) Peer Library: Yes yes! Right away sir! Database - give me your data! 3) (Peer Library retrieves the database and performs query on it) 4) Peer Library: Here are all ukulele-playing redheads! 5) Peer: Hooraaaaaaaay!
step 3 is a massive bottleneck that limits any datomic dataset to whatever a single machine can practically get its hands on in time
I think the Peer is generally smarter than that, and will only load the chunks it needs to get those datoms
true, but that's still the source, so if I read from my 30T dataset: 80G datoms -> some filter -> 20G datoms -> some transform -> 20G datoms -> some filter -> 1G datoms -> some join -> 100 datoms
I only needed the 1G for the join, but I have to send all 80G and do all that parallelisable work on one very beefy machine
in particular consider http://internetmemory.org/en/index.php/synapse/on_the_power_of_hbase_filters
based on my understanding, most of what you would do with HBase filters is handled by the indexes in datomic.
doesn't that heavily limit flexibility, in that you need to know every query in advance?
so to find "all the redheads" it just needs to look at the AVET index, to get all the Entities with the Attribute :hair/color
and the value :hair.color/red
(hypothetically)
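A sketch of that hypothetical lookup (the :hair/color attribute and :hair.color/red value are the illustrative names from the example above; walking the AVET index directly assumes the attribute is indexed or unique):

```clojure
(require '[datomic.api :as d])

;; Datalog form: the peer lazily pulls only the index segments that
;; cover :hair/color = :hair.color/red, not the whole database.
(d/q '[:find ?e
       :where [?e :hair/color :hair.color/red]]
     db)

;; Or walk the AVET index directly for the same entities.
(map :e (d/datoms db :avet :hair/color :hair.color/red))
```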
(disclaimer: I'm relatively new to datomic myself, so I'm only speaking from my understanding)
basically we have a large transaction dataset that we could never fit on a single box, and we want very fast reads for complex matrix manipulation / modelling purposes. People submit queries asking for models to be generated over large swathes of data, and they want them sub-second. The problem with things like HBase is that they are too slow with their clunky HDFS backing stores. Datomic seems ideal, but we really need to do the work data-local on the servers, because no one machine could process all the data fast enough; more to the point, you couldn't even send the source data to one box in time. Fortunately the problem can be sub-divided, so as long as each section completes fast enough we should be fine.
if I'm understanding right, to use datomic in this context we may want a proxy process to conduct the queries data-local on subsets of data, then manually collect the results from the delegates?
these queries also are heavily time-series oriented hence why something like datomic is better than just say postgres
so you might collect all the transactions for one transactor using an index and shard by transactor
there is one transactor, and it runs single-threaded. period. (well, HA failover excepted)
so in this context, you could definitely partition the data set by legal entity. partitions are all about index locality
so in a datomic 'cluster' I could partition by legal entity while retaining the ability to join across partitions (after filtering to reduce the cost of course)
that said, the peers need to be sized for their working set, so if they really do need 80GB to answer a query, then they need that much space.
the thing is it's divisible, so ideally the part that works on 80G would be farmed out to many machines; it pares down to small sizes for actual result-sets
yeah, we are currently evaluating a whole bunch of things, if it could handle the scale datomic is the best fit by far
I mean, you could do it manually and do a join across results from a bunch of Peers, but that seems like re-inventing the wheel.
without knowing more about what the data & access patterns look like, it's hard to say.
my confusion here is just because people say reads scale horizontally, but it seems like they very definitely don't: the number of clients is irrelevant to the datomic structure, but the scale is limited by how much one machine can do
@dexter: yeah, what the Datomic folk mean is that in the traditional centralized OLTP database world, your read and write scalability is tightly bound.
if you have heavy read traffic- again, OLTP-like traffic- you can marry your load to your capacity very easily
for your problem, have you looked at Onyx? http://www.onyxplatform.org/
read scalability is always limited to what a single node can do... in a cluster/map-reduce database, there are mechanisms for breaking up small bits of work and putting them together. Datomic is not a map-reduce system, and using it that way (on its own) you're gonna have a bad time.
yeah, Onyx looks interesting; it's actually what reminded me of Datomic. The question there is that you still need to store and get access to the data somehow
indeed, our feeling is that without an MR option our load cannot be handled but all the MR options are slow or 100% in memory which would cost a lot. VoltDB looks promising too
here's one Onyx+Datomic solution: http://yuppiechef.github.io/cqrs-server/
@dexter at what rate is your dataset changing, or is it fixed?
it varies: small data (100s of MB) changes slowly but constantly, maybe 10M updates at a time. Big data updates, say a minimum of 30G/day, come in overnight via our batch process (grab giant files from clients and normalise). Then people can submit arbitrary queries across almost any field, which do e.g. Monte Carlo simulations on the streams of transactional data
really tiny data changes via users inputting on the site, fairly constantly, which is tiny data but causes many changes
apparently competitors manage thousands of these sorts of things in massive oracle deployments
but the lack of ability to pre-calc just due to the variance of the queries is a real pain
yes. a number of tricky problems in that.
at present we have basically bunged it all in Redis, but well, it's creaking, we've hit the limit
yes. you're definitely at a scale, in terms of data and potential licenses, that you should be talking to Cognitect directly. but a couple of other things come to mind
datomic has datom count limits within a single database
and likely peer count limits within a "cluster"
as long as you can join cross db that's not necessarily an issue, in that there are a lot of partitions, at least a few hundreds
like I say atm this is exploration work, we have a few options to consider, none are exactly perfect so we need to find what's best or maybe even divide the issue and have two solutions
are you able to partition your model executions into groups based on the amount or type of data they'd need to churn through?
amount no, type not quite, we can partition via which top level entity the data is about
right. cpu consumption per model may vary greatly as well, no?
#onyx can certainly be used to scale out the reads. We've just shipped a new scheduler implementation. One feature we're hoping to achieve soon is better scheduling constraints, to prefer that Datomic peers be collocated on nodes. http://www.onyxplatform.org/jekyll/update/2016/01/16/Onyx-0.8.4-Colocation.html is a decent discussion of how the scheduler can be used for performance optimisation purposes
@dexter: looking at your requirements above, onyx may be a good fit. Sounds like you have some interesting issues at scale though. @michaeldrogalis and I would be happy to have a chat with you about it if you're interested.
@dexter I think you could likely scale out your reads OK, but it's worth mentioning that all writes must go through a single transactor so you may have write scalability issues, depending on your needs
@lucasbradstreet: onyx does indeed sound like a good fit
though I'm concerned about the likely stability of onyx, both operationally and API-wise, given its sub-v1 state
I don't think writes should be too much of an issue though we'd need to ensure the batch insert did not affect read performance
I think data being a bit behind at 2AM is less of an issue given that it just came out of FTP -> Hadoop
if datomic/onyx stays in our top options I'll take you up on that chat with some concrete examples of actual queries
Yup. I understand your position on the sub v1 state given where you’re at and what you’re looking for. We’re quickly stabilising with recent work, including jepsen testing, but it sounds like you need something very solid now.
batch datomic writes don’t affect read performance as far as I know, because the peers don’t interact with the transactor at all
I guess it could affect index creation, but I don’t understand the subtleties there
it really sounds like a good option still, if it's stable enough. What's the roadmap for stability like?
IIRC batch transactions might need some tuning (and back pressure). There's also at least a theoretical upper bound on bandwidth to Peers for the index updates. But again, the performance-at-scale questions are better answered by Cognitecticians. 😉
@dexter: It's in a pretty stable state right now. Datomic itself is < 1.0, for a comparison.
Anyhow - I don't want to encroach on the Datomic conversation. Over to #C051WKSP3 if you have more questions.
Anyone have any pointers for the best way to deal with transact errors?
@dexter Datomic does not try to position itself as a "big data" system. It's in a completely different category from something like HBase. "Horizontally-scalable reads" is to contrast to traditional relational databases where both queries and transactions are executed on a single machine.
Datomic centralizes the transaction load on one machine to get efficient ACID transactions, but each Peer can execute its own queries without affecting transaction throughput or queries on other Peers.
Each Peer may be interested in a different subset of the whole database, but a single query is limited in scope to what a single Peer can hold in RAM.
Say, if I'm creating my own partition for some related entities, is it good style to use a :db.part/my-partition identifier for it, or should I just use :my-partition and then pass that keyword to (d/tempid)? Trying to decide between this:
{:db/id #db/id[:db.part/db]
:db/ident :db.part/notifications
:db.install/_partition :db.part/db}
...and this:
{:db/id #db/id[:db.part/db]
:db/ident :notifications
:db.install/_partition :db.part/db}
Parenthetically, I wish the datomic docs around this had a few more examples, specifically ones where new data is asserted in the created partitions
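For illustration, a minimal sketch of asserting new data into the custom partition created above (the first schema variant; the :notification/message attribute is hypothetical and would need to be installed first):

```clojure
(require '[datomic.api :as d])

;; d/tempid takes the partition's :db/ident, so the entity's datoms
;; land in :db.part/notifications rather than :db.part/user.
@(d/transact conn
   [{:db/id (d/tempid :db.part/notifications)
     :notification/message "Build finished"}])
```

With the second variant, the partition ident is plain :notifications, and the tempid call becomes (d/tempid :notifications).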