2017-01-12
Channels
- # aws (21)
- # aws-lambda (8)
- # beginners (53)
- # boot (56)
- # braveandtrue (1)
- # cider (49)
- # cljs-dev (8)
- # cljsjs (1)
- # cljsrn (57)
- # clojure (403)
- # clojure-austin (17)
- # clojure-dusseldorf (10)
- # clojure-greece (9)
- # clojure-spec (57)
- # clojure-uk (144)
- # clojurescript (60)
- # datomic (149)
- # docker (1)
- # emacs (1)
- # hoplon (23)
- # humor (1)
- # jobs (1)
- # leiningen (2)
- # luminus (1)
- # off-topic (1)
- # om (24)
- # om-next (15)
- # onyx (23)
- # protorepl (2)
- # re-frame (58)
- # reagent (90)
- # remote-jobs (1)
- # ring-swagger (4)
- # slackpocalypse (1)
- # spacemacs (2)
- # specter (18)
- # untangled (4)
- # vim (1)
- # yada (27)
Would it be appropriate to call Datomic a graph database?
@sova Yes absolutely, you can walk along the edges to other entities from an entity object.
@rauh cool. thanks.
I never thought to call it that, but you're totally right
@sova FWIW: That's how I use datascript on the client. Get an entry point somewhere and then let my components walk along the graph (with entities) and let them decide what they need. Almost no queries necessary this way.
Could you tell me more about that?
Well I just get an entity out at some point, let's say (d/entity-by-av :post/id post-id), then pass this to a react component. I can then get out anything it wants, :post/title etc, or pass it to children (mapv display-comments (:post/comments post)), which again could pass it on to child components (display-user-small (:comment/user comment)), etc etc.
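A minimal ClojureScript sketch of that entity-walking style, assuming DataScript and Reagent-style hiccup; the component names, attributes, and the unique :post/id attribute are illustrative, not taken from the conversation:

(ns example.ui
  (:require [datascript.core :as d]))

;; Each component receives an entity and walks refs, reading only the
;; attributes it renders. No Datalog queries needed on the UI side.
(defn display-user-small [user]
  [:span.user (:user/name user)])

(defn display-comment [c]
  [:div.comment
   [display-user-small (:comment/user c)]
   [:p (:comment/text c)]])

(defn display-post [db post-id]
  ;; entry point via a lookup ref; assumes :post/id is :db.unique/identity
  (let [post (d/entity db [:post/id post-id])]
    [:article
     [:h1 (:post/title post)]
     (into [:div.comments]
           (for [c (:post/comments post)]
             ^{:key (:db/id c)} [display-comment c]))]))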
oh very cool! @rauh are you using Reagent?
Okay. Maybe it's time I take another look at rum.
I just wrote some pseudo-code for what I would want for ideal component+query handling on the ui side.. and that looks pretty close to what i've got going on
dayum github just went down x.x
at least in my neck of the woods
What do people do to return paginated, sorted large result sets?
For example, suppose the SQL statement
select order_number, product_name, price from orders order by order_date desc limit 50
should be translated to Datomic. Also suppose there are 200,000 orders in the database.
The first approach would be
[:find ?order ?date :where [?order :order/date ?date]]
, followed by (->> results (sort-by second) (map first) (take 50) (d/pull-many db '[my-pull-spec]))
However, with 200,000 results, this is already relatively slow at >1000ms, with the equivalent SQL query taking <10ms.
What's more, query time will grow quickly along with the size of the result set.
Has anyone developed any patterns for this sort of use case?
i wouldn’t use Datalog, if you can generate the initial set on a single attribute. i’d use d/datoms, which at least would only cause me to seek to the end of the intended page, rather than realise the whole set
this assumes you can lean on the prevalent sort of the index for your order. if you need an alternate sort, you’d have to get the full set. there’s an open feature request to allow performant index traversal in reverse order, which will help with this sort of thing
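A rough sketch of that d/datoms approach, assuming the peer API (datomic.api) and that :order/date is an indexed attribute; the function and argument names here are mine, not from the thread:

(require '[datomic.api :as d])

;; Walk the AVET index for :order/date instead of realising the whole
;; result set with Datalog. d/datoms iterates in ascending order, so this
;; yields the *oldest* orders first; getting the newest page needs a
;; reversed sort key or the reverse-seek feature mentioned above.
(defn orders-page [db offset page-size]
  (->> (d/datoms db :avet :order/date)
       (drop offset)
       (take page-size)
       (map (fn [[e _ date]] {:order e :order/date date}))))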
prevalent sort for java.util.Date would be ascending I assume?
so I'd need to come up with some sort of "reverse date" (e.g. negative Unix timestamp)
two other problems
- I'd need a separate attribute for each entity type (`:order/sort-key`, `:product/sort-key`)
- there's no easy way to further restrict the search (e.g. give me the last 50 orders that have ":order/status :shipped")
a separate "pagination" key for each entity isn't too bad I guess
any way to do filtering though?
@pesterhazy Please also vote for reverse seek on https://my.datomic.com/account -> "Suggest features" if you want this
all prevalent sorting is ascending, yes, @pesterhazy
you’d have to put filtering into your d/datoms processing pipeline ahead of drop + take
still going to be faster than realising the whole set with Datalog
can you elaborate on how to do filtering in the datoms pipeline, @robert-stuttaford ?
(->> (d/datoms) (filter (fn [[e _ v]] <go to pull or entity or datalog with e and/or v>)) (drop) (take))
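A concrete, purely illustrative version of that pipeline, filtering on a hypothetical :order/status attribute via the entity API before paging (continuing with the datomic.api alias d from the sketch above):

(defn shipped-orders-page [db offset page-size]
  ;; Lazily scans the :order/date index, keeping only shipped orders;
  ;; still cheaper than realising the full Datalog result, but the filter
  ;; touches every datom up to the end of the requested page.
  (->> (d/datoms db :avet :order/date)
       (filter (fn [[e _ _]]
                 (= :shipped (:order/status (d/entity db e)))))
       (drop offset)
       (take page-size)
       (map (fn [[e _ date]] {:order e :order/date date}))))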
@pesterhazy One pragmatic approach is to keep a client-side datastructure that stores the date at (last - 100), (last - 200), etc. This would allow you to seek quicker at the end. Looks like order_date is append-only and immutable? And then just iterate to the end and take the last n datoms.
Then refresh that data structure when you iterate above (or below, if they can be removed) a threshold
yeah, this is taking the linked-list approach. you may be able to use d/seek-datoms to iterate through the raw index from some mid point
> Note that, unlike the datoms function, there need not be an exact match on the supplied components. The iteration will begin at or after the point in the index where the components would reside. …
@rauh, I voted for the "reverse index" feature
thanks for the pointer
@robert-stuttaford, ah I see, using entity or pull makes sense
I could even grab batches of, say, 1024 and run d/q on each 🙂
you could 🙂
this is very much a do-it-yourself part of Datomic, though (which is great, because you’re in control) but i agree it would be good to establish some patterns. it’s very similar to yesterday’s discussion about arbitrary sort; the linked-list vs array CS question
@rauh, so basically the idea would be to have the api send back a next token, rather than a page number?
I'd call it a sketch of the index. A sorted-map like {100 "some-date" 200 "some-date" 300 "some-date" ...} which "approximately" seeks into the datoms
a (d/datoms db :avet :order/date (get sketch 100)) and then seek to the end. The result, unless you removed orders, should be >= 100 datoms. Then just take the last 50 for your pagination. Then update the 100 key of the map to (nth datoms (- (count datoms) 100))
kinda like google maps does when you zoom in. starts with a rough zoomed out blurred view, and fills boxes in with detail as you focus in
First time you do the seek, you won't have any info, so you have to iterate all datoms. Then keep that "approximate sketch" in a map.
i said that without thinking about it too much, i may be way off 😊
Obviously don't store all 200,000 / 100, but only {last-10,000 ... last}, since people seldom paginate further than that
The whole thing becomes much more complicated (== breaks down) if the dates are edited/removed a lot and queried very infrequently.
And if you add a lot of new orders between queries, then the datoms call will also be more expensive and might return a very large result. Though you could listen to the Datomic transaction report and keep the sketch updated all the time... Lots of room for optimizations.
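Pulling the bookmark idea together, a hedged sketch (the names and shapes are mine, not from the gist linked below): keep a sorted-map from "roughly n datoms before the end of the :order/date index" to the date stored there, and use d/seek-datoms to jump near the requested page instead of iterating all 200,000 datoms.

(require '[datomic.api :as d])

;; bookmarks is e.g. {100 #inst "2017-01-10", 200 #inst "2017-01-08", ...}
(defn recent-orders-page
  [db bookmarks n-from-end page-size]
  ;; Seek close to the end of the ascending index, realise the small tail,
  ;; and keep the last page-size entity ids. Assumes :order/date is indexed
  ;; and effectively immutable.
  (->> (d/seek-datoms db :avet :order/date
                      ;; fall back to a full scan when no bookmark exists yet
                      (get bookmarks n-from-end #inst "1970-01-01"))
       (map :e)
       (take-last page-size)))

;; After a full scan (e.g. the first request), rebuild the bookmarks from
;; the realised tail so later requests can seek instead of scanning.
(defn refresh-bookmarks [datoms]
  (into (sorted-map)
        (for [n [100 200 300]
              :let [i (- (count datoms) n)]
              :when (pos? i)]
          [n (:v (nth datoms i))])))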
so, if i understand correctly, you’re using the initial seek to set up bookmarks in the dataset to seek from via e.g. seek-datoms, for the data that users paginate infrequently
Yeah, order/date is immutable
Can't say I understand the approach completely but it sounds intriguing
I think there's room for a datomic patterns library that provides auxiliary indexes and other helpers as functions
i agree, but we’d have to preface it with a big disclaimer: warning—here be opinions 🙂
It seems there's room for general solutions here
totally agree
I like the idea of bookmarks for pagination
Yeah I think a general lib could be useful to create such bookmarks/sketches. If the data is immutable you can even cover the entire index and just append to it at the end as data grows.
@pesterhazy @robert-stuttaford : Brainstorming: https://gist.github.com/rauhs/aa58d748abf851543d57ef3403f23edb
great!
"We keep a datastructure on the client code" -- what do you mean by "client" here?
Turns out my datomic backup db script was failing because of wrong java version. Oh well...
how are you supposed to backup to s3 from localhost? I thought the S3 backup used AMI
you can access S3 from your local machine, no problem
@rauh, I see. Can't we store the index itself in datomic?
@pesterhazy That's a great idea! That would get rid of a lot of the problems from the last section
Now I wonder how noHistory behaves when you have multiple db's from different points in time of the connection
noHistory means no history, meaning there are no guarantees that you will ever see anything but the most recent value of the attribute, even if you're looking in the "past"
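For context, noHistory is declared as part of the attribute's schema; a minimal example with a hypothetical attribute:

;; Past values of :user/session-token may be discarded during indexing,
;; so nothing (including d/history or an older db value) is guaranteed to
;; see anything but the most recent value.
{:db/ident       :user/session-token
 :db/valueType   :db.type/string
 :db/cardinality :db.cardinality/one
 :db/noHistory   true}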
bin/maven-install
% bin/run -m datomic.peer-server -p 8998 -a myaccesskey,mysecret -d firstdb,datomic:
Exception in thread "main" java.io.FileNotFoundException: Could not locate datomic/peer_server__init.class or datomic/peer_server.clj on classpath. Please check that namespaces with dashes use underscores in the Clojure file name.
at clojure.lang.RT.load(RT.java:456)
at clojure.lang.RT.load(RT.java:419)
at clojure.core$load$fn__5677.invoke(core.clj:5893)
at clojure.core$load.invokeStatic(core.clj:5892)
at clojure.core$load.doInvoke(core.clj:5876)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at clojure.core$load_one.invokeStatic(core.clj:5697)
at clojure.core$load_one.invoke(core.clj:5692)
at clojure.core$load_lib$fn__5626.invoke(core.clj:5737)
at clojure.core$load_lib.invokeStatic(core.clj:5736)
at clojure.core$load_lib.doInvoke(core.clj:5717)
at clojure.lang.RestFn.applyTo(RestFn.java:142)
at clojure.core$apply.invokeStatic(core.clj:648)
at clojure.core$load_libs.invokeStatic(core.clj:5774)
at clojure.core$load_libs.doInvoke(core.clj:5758)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at clojure.core$apply.invokeStatic(core.clj:648)
at clojure.core$require.invokeStatic(core.clj:5796)
at clojure.main$main_opt.invokeStatic(main.clj:314)
at clojure.main$main_opt.invoke(main.clj:310)
at clojure.main$main.invokeStatic(main.clj:421)
at clojure.main$main.doInvoke(main.clj:384)
at clojure.lang.RestFn.invoke(RestFn.java:805)
at clojure.lang.Var.invoke(Var.java:455)
at clojure.lang.AFn.applyToHelper(AFn.java:216)
at clojure.lang.Var.applyTo(Var.java:700)
at clojure.main.main(main.java:37)
@stuartsierra Just to clarify: So a
(let [db (d/db conn)]
(change-no-hist-attr!)
(:acces-no-hist-attr (d/entity db some-ent)))
might see the new value?
@rauh I'm not certain, but I would not be surprised if you saw the new value in that case.
@stuartsierra Could you find out? That would change some things for me.
@rauh I do not have a way to prove it without running a lot of tests. But I do know that noHistory is defined to mean "I only ever care about the most recent value of this attribute."
@rauh you’re calling d/entity on a value of the DB from before you transacted a change
@marshall Well the use-case was from the above sketches (not sure if you read the conversation?). So it might not be like the code above
so, here's a question I get when I explain datomic's peer/transactor model: "So you mean that as our data grows, we have to scale the memory on every single one of our running application nodes, instead of just our data nodes?"
Part of the problem is that I haven't actually done the work of ETLing our domain data into datomic to be able to intelligently respond to the "that's too much data in memory" objections. But in general, is that statement accurate?
@tjtolton Your peer memory needs to account for its copy of the memory index, its own object cache, and whatever it needs to run its application. This does not necessarily scale as your data grows, but if you have a peer running a huge full-system report and your data grows, then you need to account for that in peer memory; you would have to make that consideration in any system, though. Granted I am a little biased as I work at Cognitect, but being able to scale your peer memory is where you get bang for your buck specific to the peers. Being forced to scale your entire system for one use case is lame. So I am not sure if that helps you when you get this question, but I see this as a strength of Datomic's model.
Interesting. the memory index. right, i occasionally forget that datomic isn't just storing a naive list of datoms (or several of them)
@tjtolton @jaret is totally right - your peers (application nodes) have to scale if your queries scale (i.e. if you say give me 20% of the db and your DB is growing), but that’s arguably true of any database
one cool thing about the peer model, though, is that you can horizontally scale instead: add a peer and you’ve just increased both your system cache size and your compute (query) power
and if you want to further optimize, you can shard traffic to specific peers. So users 1-5 always hit peer A and users 6-10 always hit peer B. This means that peer A’s cache is ‘hotter’ for users 1-5
That indicates that I don't fully understand the way datomic memory works. I thought that the entire database was locally cached on each peer
also, @marshall, isn't what you just said only applicable to peer servers, and not applications that are themselves peers?
if you happen to be running a small database that fits completely in your peer’s memory, then you can effectively have it all cached locally, but that isn't required
when a peer is answering a query it first looks in local cache, then in memcached, then goes to backend storage
the peer needs to have enough JVM heap space to hold the final query result in memory (as would any application realizing a full query result from any DB), but that’s the only ‘requirement'
interesting. and the transactor streams changes to only the parts of the data that are cached on each peer?
periodically, the transactor incorporates that novelty into the persistent disk index
if a datom in that segment is updated via a transaction, the resulting “new value” is either in mem index or a new segment
and like clojure data structures datomic uses structural sharing for all of this stuff, so a lot of the “new” index tree may still be the same segments you’ve already cached
huh, interesting. So, to review:
- peer starts off "cold", doesn't have much of the database locally cached
- queries ask the peer for info that it doesn't have; it pulls info from the storage service (or memcached) and caches the new info
- subsequent queries for the same data are served from that peer's warm cache
- the transactor knows about new information, and pushes it to all peers
- subsequent queries will fold in that knowledge when serving from their cache
so a lot of use cases might ask for some data, then based on that ask about some related data and that second query may only need to use the same segment