#datomic
2017-01-12
sova-soars-the-sora07:01:40

Would it be appropriate to call Datomic a graph database?

rauh08:01:17

@sova Yes absolutely, you can walk along the edges to other entities from an entity object.

sova-soars-the-sora08:01:44

I never thought to call it that, but you're totally right

rauh08:01:02

@sova FWIW: That's how I use datascript on the client. Get an entry point somewhere and then let my components walk along the graph (with entities) and let them decide what they need. Almost no queries necessary this way.

sova-soars-the-sora08:01:39

Could you tell me more about that?

rauh08:01:21

Well I just get an entity out at some point, let's say (d/entity-by-av :post/id post-id), then pass this to a react component. I can then get out anything it wants (:post/title etc.), or pass it to children (mapv display-comments (:post/comments post)), which again could pass it on to child components (display-user-small (:comment/user comment)), etc.
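A minimal sketch of this entity-walking pattern, assuming DataScript with :post/id marked :db.unique/identity (so a plain lookup ref stands in for the entity-by-av helper); all attribute and component names here are illustrative, not from the original:

(require '[datascript.core :as d])

(def schema {:post/id       {:db/unique :db.unique/identity}
             :post/comments {:db/valueType :db.type/ref
                             :db/cardinality :db.cardinality/many}
             :comment/user  {:db/valueType :db.type/ref}})

(defn display-user-small [user]
  [:span.user (:user/name user)])

(defn display-comment [comment]
  [:div.comment
   (display-user-small (:comment/user comment)) ; walk comment -> user
   [:p (:comment/text comment)]])

(defn display-post [db post-id]
  ;; single entry point via a lookup ref; children walk the graph themselves
  (let [post (d/entity db [:post/id post-id])]
    [:article
     [:h1 (:post/title post)]
     (mapv display-comment (:post/comments post))]))

With rum (or Reagent) these would just be components instead of plain functions; the graph walking itself is the same.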

sova-soars-the-sora08:01:46

oh very cool! @rauh are you using Reagent?

rauh08:01:59

No, rum.

sova-soars-the-sora08:01:10

Okay. Maybe it's time I take another look at rum.

sova-soars-the-sora08:01:52

I just wrote some pseudo-code for what I would want for ideal component+query handling on the ui side.. and that looks pretty close to what i've got going on

sova-soars-the-sora08:01:11

dayum github just went down x.x

sova-soars-the-sora08:01:31

at least in my neck of the woods

rauh08:01:48

Works fine here. The graph walking should also work for reagent the same way

pesterhazy10:01:02

What do people do to return paginated, sorted large result sets?

pesterhazy10:01:49

For example, suppose the SQL statement

select order_number, product_name, price from orders order by order_date desc limit 50

pesterhazy10:01:20

should be translated to Datomic. Also suppose there are 200,000 orders in the database.

pesterhazy10:01:29

The first approach would be

[:find ?order ?date :where [?order :order/date ?date]]
, followed by
(->> results (sort-by second) (map first) (take 50) (d/pull-many db '[my-pull-spec]))

pesterhazy10:01:12

However, with 200,000 results, this is already relatively slow at >1000ms, with the equivalent SQL query taking <10ms.

pesterhazy10:01:04

What's more, query time will grow quickly along with the size of the result set.

pesterhazy10:01:31

Has anyone developed any patterns for this sort of use case?

robert-stuttaford11:01:52

i wouldn’t use Datalog, if you can generate the initial set on a single attribute. i’d use d/datoms, which at least would only cause me to seek to the end of the intended page, rather than realise the whole set
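A sketch of that approach, assuming :order/date is indexed (:db/index true) so it appears in AVET, db comes from (d/db conn), and the pull pattern simply mirrors the SQL columns above:

(require '[datomic.api :as d])

(defn orders-page [db offset page-size]
  ;; lazily walk the AVET index on :order/date (ascending); only the datoms
  ;; up to the end of the requested page are realised
  (->> (d/datoms db :avet :order/date)
       (drop offset)
       (take page-size)
       (map :e)
       (d/pull-many db '[:order/number :order/product-name :order/price])))

Note this walks the index in ascending date order; getting the newest 50 still means iterating to the end, which is what the reverse-iteration feature request mentioned below addresses.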

robert-stuttaford11:01:45

this assumes you can lean on the prevalent sort of the index for your order. if you need an alternate sort, you’d have to get the full set. there’s an open feature request to allow performant index traversal in reverse order, which will help with this sort of thing

pesterhazy11:01:25

prevalent sort for java.util.Date would be ascending I assume?

pesterhazy11:01:54

so I'd need to come up with some sort of "reverse date" (e.g. negative Unix timestamp)
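A sketch of that workaround, with :order/sort-key as a hypothetical indexed :db.type/long attribute transacted alongside each order:

(defn order-tx-data [order-id ^java.util.Date order-date]
  ;; negated epoch millis: ascending AVET order on :order/sort-key
  ;; is then newest-first
  [{:order/id       order-id
    :order/date     order-date
    :order/sort-key (- (.getTime order-date))}])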

pesterhazy11:01:09

two other problems

pesterhazy11:01:42

- I'd need a separate attribute for each entity type (`:order/sort-key`, `:product/sort-key`)

pesterhazy11:01:01

- there's no easy way to further restrict the search (e.g. give me the last 50 orders that have ":order/status :shipped")

pesterhazy11:01:02

a separate "pagination" key for each entity isn't too bad I guess

pesterhazy11:01:12

any way to do filtering though?

rauh11:01:55

@pesterhazy Please also vote for reverse seek on https://my.datomic.com/account -> "Suggest features" if you want this

robert-stuttaford12:01:08

all prevalent sorting is ascending, yes, @pesterhazy

robert-stuttaford12:01:54

you’d have to put filtering into your d/datoms processing pipeline ahead of drop + take

robert-stuttaford12:01:04

still going to be faster than realising the whole set with Datalog

pesterhazy12:01:10

can you elaborate on how to do filtering in the datoms pipeline, @robert-stuttaford ?

robert-stuttaford12:01:11

(->> (d/datoms ...) (filter (fn [[e _ v]] <go to pull or entity or datalog with e and/or v>)) (drop ...) (take ...))
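One concrete (hypothetical) version of that pipeline, filtering on the :order/status example from earlier, assuming it is a keyword-valued attribute:

(defn shipped-orders-page [db offset page-size]
  (->> (d/datoms db :avet :order/date)
       ;; go to entity with the :e of each datom and keep only shipped orders
       (filter #(= :shipped (:order/status (d/entity db (:e %)))))
       (drop offset)
       (take page-size)
       (map :e)))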

rauh12:01:45

@pesterhazy One pragmatic approach is to keep a client-side data structure that stores the date of the (last - 100), (last - 200) etc. datom. That would let you seek more quickly near the end. It looks like order_date is append-only and immutable? Then just iterate to the end and take the last n datoms.

rauh12:01:32

Then refresh that data structure when you've iterated past (or below, if entries can be removed) a threshold

robert-stuttaford12:01:28

yeah, this is taking the linked-list approach. you may be able to use d/seek-datoms to iterate through the raw index from some mid point

robert-stuttaford12:01:46

> Note that, unlike the datoms function, there need not be an exact match on the supplied components. The iteration will begin at or after the point in the index where the components would reside. …
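So, assuming a remembered :order/date value, a resumption point could look roughly like:

(def resume-from-date #inst "2017-01-01") ; e.g. a previously seen :order/date

;; start iterating at (or just after) the remembered date,
;; instead of from the beginning of the index
(take 50 (d/seek-datoms db :avet :order/date resume-from-date))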

pesterhazy12:01:51

@rauh, I voted for the "reverse index" feature

pesterhazy12:01:38

thanks for the pointer

pesterhazy12:01:44

@robert-stuttaford, ah I see, using entity or pull makes sense

pesterhazy12:01:10

I could even grab batches of, say, 1024 and run d/q on each 🙂

robert-stuttaford12:01:08

this is very much a do-it-yourself part of Datomic, though (which is great, because you’re in control) but i agree it would be good to establish some patterns. it’s very similar to yesterday’s discussion about arbitrary sort; the linked-list vs array CS question

pesterhazy12:01:11

@rauh, so basically the idea would be to have the api send back a next token, rather than a page number?

rauh12:01:15

I'd call it a sketch of the index. A sorted-map like {100 "some-date" 200 "some-date" 300 "some-date" ...} which "approximately" seeks into the datoms via (d/datoms db :avet :order/date (get sketch 100)), and then you seek to the end. The result, unless you removed orders, should be >= 100 datoms. Then just take the last 50 for your pagination. Then update the 100 key of the map to (nth datom-result (- (count datoms) 100))

rauh12:01:33

Obviously, lots of details missing here. Rounding etc.
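One way that sketch map could look in code, heavily simplified: no rounding, no refresh threshold, and it assumes order dates are never retracted. All names are illustrative:

(def sketch (atom (sorted-map))) ; position-from-the-end -> the :order/date seen there

(defn last-n-orders [db n]
  (let [bookmark (get @sketch (* 2 n))
        tail     (vec (if bookmark
                        (d/seek-datoms db :avet :order/date bookmark)
                        ;; first call: no bookmark yet, walk the whole index
                        (d/datoms db :avet :order/date)))]
    ;; remember roughly where "2n from the end" was, for the next call
    (when (>= (count tail) (* 2 n))
      (swap! sketch assoc (* 2 n) (:v (nth tail (- (count tail) (* 2 n))))))
    (map :e (take-last n tail))))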

robert-stuttaford12:01:13

kinda like google maps does when you zoom in. starts with a rough zoomed out blurred view, and fills boxes in with detail as you focus in

rauh12:01:15

First time you do the seek, you won't have any info, so you have to iterate all datoms. Then keep that "approximate sketch" in a map.

robert-stuttaford12:01:23

i said that without thinking about it too much, i may be way off 😊

rauh12:01:00

Obviously don't store all 200,000 / 100 but only {last-10,000 ... last}, since people seldom paginate further than that

rauh12:01:13

That way your memory usage is bounded above.

rauh12:01:45

Though, come to think of it, a map with 2k entries is probably tiny.

rauh12:01:31

The whole thing becomes much more complicated (== breaks down) if the dates are edited/removed a lot and queried very infrequently.

rauh12:01:11

If you end up implementing it, make sure to share some code 🙂

rauh12:01:09

And if you add a lot of new orders between queries, then the datoms call will also be more expensive and the result might be very large. Though, you could listen on the Datomic transactions and keep it really up to date all the time... Lots of room for optimizations.
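A sketch of listening on transactions via the peer API's tx-report-queue; the attribute name and what you do with the novelty are placeholders:

(defn watch-order-dates! [conn on-new-date]
  ;; blocking queue of tx reports ({:db-before .. :db-after .. :tx-data ..})
  (let [queue (d/tx-report-queue conn)]
    (future
      (loop []
        (let [{:keys [db-after tx-data]} (.take queue)
              date-attr (d/entid db-after :order/date)]
          (doseq [datom tx-data
                  :when (and (:added datom) (= date-attr (:a datom)))]
            ;; e.g. push (:v datom) into the sketch/bookmark structure
            (on-new-date (:v datom))))
        (recur)))))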

robert-stuttaford12:01:16

so, if i understand correctly, you’re using the initial seek to set up bookmarks in the dataset to seek from via e.g. seek-datoms, for the data that users paginate infrequently

pesterhazy12:01:06

Yeah, order/date is immutable

pesterhazy12:01:44

Can't say I understand the approach completely but it sounds intriguing

pesterhazy12:01:42

I think there's room for a datomic patterns library that provides auxiliary indexes and other helpers as functions

robert-stuttaford12:01:20

i agree, but we’d have to preface it with a big disclaimer: warning—here be opinions 🙂

pesterhazy12:01:32

It seems there's room for general solutions here

pesterhazy12:01:17

I like the idea of bookmarks for pagination

rauh13:01:08

Yeah I think a general lib could be useful to create such bookmarks/sketches. If the data is immutable you can even cover the entire index and just append to it at the end as data grows.

rauh13:01:36

Though, if things get deleted, only START and END bookmarks remain usable.

pesterhazy13:01:48

"We keep a datastructure on the client code" -- what do you mean by "client" here?

rauh13:01:43

Just a (def histogram (sorted-map))

rauh13:01:03

Or maybe clojure/data.int-map

rauh13:01:31

Or maybe java.util.HashMap/etc

rauh14:01:06

In an atom.

jdkealy14:01:20

Turns out my datomic backup db script was failing because of the wrong Java version. Oh well...

jdkealy14:01:10

how are you supposed to back up to S3 from localhost? I thought the S3 backup used an AMI

pesterhazy14:01:45

you can access S3 from your local machine, no problem

pesterhazy14:01:58

@rauh, I see. Can't we store the index itself in datomic?

rauh14:01:56

@pesterhazy That's a great idea! That would get rid of a lot of the problems from the last section
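A hypothetical schema for that, so the bookmarks live in the database and are shared by every peer. The attribute names are invented here, and this uses the newer map-style schema without explicit :db/id; noHistory because only the latest value of a bookmark matters, which is exactly what the next question is about:

(def bookmark-schema
  [{:db/ident       :bookmark/position   ; e.g. 100, 200, ... from the end
    :db/valueType   :db.type/long
    :db/cardinality :db.cardinality/one
    :db/unique      :db.unique/identity}
   {:db/ident       :bookmark/date       ; the :order/date value seen there
    :db/valueType   :db.type/instant
    :db/cardinality :db.cardinality/one
    :db/noHistory   true}])

;; @(d/transact conn bookmark-schema)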

rauh14:01:02

Now I wonder how :db/noHistory behaves when you have multiple db values from different points in time of the connection

Lambda/Sierra14:01:44

noHistory means no history, meaning there are no guarantees that you will ever see anything but the most recent value of the attribute, even if you're looking in the "past"

plexus14:01:52

hi everyone, I'm trying to get datomic running, and I'm completely stuck

plexus14:01:19

I registered and downloaded the "Datomic Pro Starter Edition", and unzipped it

plexus14:01:38

now according to the docs I should do

plexus14:01:50

bin/maven-install
bin/run -m datomic.peer-server -p 8998 -a myaccesskey,mysecret -d firstdb,datomic:

plexus14:01:07

which results in

plexus14:01:20

% bin/run -m datomic.peer-server -p 8998 -a myaccesskey,mysecret -d firstdb,datomic:
Exception in thread "main" java.io.FileNotFoundException: Could not locate datomic/peer_server__init.class or datomic/peer_server.clj on classpath. Please check that namespaces with dashes use underscores in the Clojure file name.
        at clojure.lang.RT.load(RT.java:456)
        at clojure.lang.RT.load(RT.java:419)
        at clojure.core$load$fn__5677.invoke(core.clj:5893)
        at clojure.core$load.invokeStatic(core.clj:5892)
        at clojure.core$load.doInvoke(core.clj:5876)
        at clojure.lang.RestFn.invoke(RestFn.java:408)
        at clojure.core$load_one.invokeStatic(core.clj:5697)
        at clojure.core$load_one.invoke(core.clj:5692)
        at clojure.core$load_lib$fn__5626.invoke(core.clj:5737)
        at clojure.core$load_lib.invokeStatic(core.clj:5736)
        at clojure.core$load_lib.doInvoke(core.clj:5717)
        at clojure.lang.RestFn.applyTo(RestFn.java:142)
        at clojure.core$apply.invokeStatic(core.clj:648)
        at clojure.core$load_libs.invokeStatic(core.clj:5774)
        at clojure.core$load_libs.doInvoke(core.clj:5758)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at clojure.core$apply.invokeStatic(core.clj:648)
        at clojure.core$require.invokeStatic(core.clj:5796)
        at clojure.main$main_opt.invokeStatic(main.clj:314)
        at clojure.main$main_opt.invoke(main.clj:310)
        at clojure.main$main.invokeStatic(main.clj:421)
        at clojure.main$main.doInvoke(main.clj:384)
        at clojure.lang.RestFn.invoke(RestFn.java:805)
        at clojure.lang.Var.invoke(Var.java:455)
        at clojure.lang.AFn.applyToHelper(AFn.java:216)
        at clojure.lang.Var.applyTo(Var.java:700)
        at clojure.main.main(main.java:37)

plexus14:01:36

ok never mind... now it seems to be running

plexus15:01:08

I just spent a couple hours on this... but ok 😆

rauh15:01:28

@stuartsierra Just to clarify: So a

(let [db (d/db conn)] 
  (change-no-hist-attr!)
  (:access-no-hist-attr (d/entity db some-ent)))
might see the new value?

Lambda/Sierra15:01:57

@rauh I'm not certain, but I would not be surprised if you saw the new value in that case.

rauh15:01:18

@stuartsierra Could you find out? That would change some things for me.

Lambda/Sierra15:01:53

@rauh I do not have a way to prove it without running a lot of tests. But I do know that noHistory is defined to mean "I only ever care about the most recent value of this attribute."

marshall16:01:44

@rauh you’re calling d/entity on a value of the DB from before you transacted a change

marshall16:01:57

that database value is immutable

rauh16:01:09

@marshall Well the use-case was from the sketches above (not sure if you read the conversation?). So it might not be exactly like the code above

rauh16:01:38

But more nested... The db value might be a few hundred ms old.

marshall16:01:24

The DB value is immutable

tjtolton19:01:42

so, here's a question I get when I explain datomic's peer/transactor model: "So you mean that as our data grows, we have to scale the memory on every single one of our running application nodes, instead of just our data nodes?"

tjtolton19:01:54

Part of the problem is that I haven't actually done the work of ETLing our domain data into datomic to be able to intelligently respond to the "that's too much data in memory" objections. But in general, is that statement accurate?

tjtolton19:01:14

That application nodes have to scale in memory as the database grows

tjtolton19:01:45

i guess logically it is

jaret19:01:18

@tjtolton Your peer memory needs to account for its copy of the memory index, its own object cache, and whatever it needs to run its application. This does not necessarily scale as your data grows, but if you have a peer running a huge full-system report and your data grows, then you need to account for that in peer memory; you would have to make that consideration in any system. Granted, I am a little biased as I work at Cognitect, but being able to scale your peer memory is where you get bang for your buck specific to the peers. Being forced to scale your entire system for one use case is lame. So I am not sure if that helps you when you get this question, but I see this as a strength of Datomic's model.

tjtolton19:01:43

Interesting. the memory index. right, i occasionally forget that datomic isnt just storing a naive list of datoms (or several of them)

tjtolton19:01:03

I'll take a look at that info, thanks!

marshall19:01:41

@tjtolton @jaret is totally right - your peers (application nodes) have to scale if your queries scale (i.e. if you say give me 20% of the db and your DB is growing), but that’s arguably true of any database

marshall19:01:40

one cool thing about the peer model, though, is that you can horizontally scale instead: add a peer and you’ve just increased both your system cache size and your compute (query) power

marshall19:01:35

and if you want to further optimize, you can shard traffic to specific peers. So users 1-5 always hit peer A and users 6-10 always hit peer B. This means that peer A’s cache is ‘hotter’ for users 1-5

marshall19:01:55

or use Peer A for the web app and Peer B for an analytics process

marshall19:01:04

each cache automatically tunes itself for that workload

tjtolton19:01:16

That indicates that I don't fully understand the way datomic memory works. I thought that the entire database was locally cached on each peer

marshall19:01:06

ah. no, datomic is a “larger than memory” database

tjtolton19:01:11

also, @marshall, isnt what you just said only applicable to peer servers, and not applications that are themselves peers?

marshall19:01:52

if you happen to be running a small database that fits completely in your peer’s memory, then you can effectively have it all cached locally, but that isnt required

marshall19:01:09

Datomic caches segments of the DB that it uses to answer queries

marshall19:01:34

when a peer is answering a query it first looks in local cache, then in memcached, then goes to backend storage

marshall19:01:27

the peer needs to have enough JVM heap space to hold the final query result in memory (as would any application realizing a full query result from any DB), but that’s the only ‘requirement'

tjtolton19:01:28

interesting. and the transactor streams changes to only the parts of the data that are cached on each peer?

marshall19:01:52

almost. it’s a bit subtle. The transactor streams novelty to all connected peers

marshall19:01:59

specifically, that’s the memory index

marshall19:01:18

periodically, the transactor incorporates that novelty into the persistent disk index

marshall19:01:23

via an indexing job

marshall19:01:31

which happens as a separate process

marshall19:01:41

and when it finishes that, it notifies the peers that there is a new disk index

marshall19:01:57

so they know where to go if they need to retrieve segments

tjtolton19:01:10

thats pretty slick

marshall19:01:16

but the segments that are cached are never updated

marshall19:01:22

all segments are immutable

marshall19:01:37

so once a segment is cached on a peer it’s always that same value

tjtolton19:01:48

right, update wasn't the right word

marshall19:01:58

if a datom in that segment is updated via a transaction, the resulting “new value” is either in mem index or a new segment

tjtolton19:01:08

i suppose upsert

tjtolton19:01:08

is the new canonical term

marshall19:01:39

and like clojure data structures datomic uses structural sharing for all of this stuff, so a lot of the “new” index tree may still be the same segments you’ve already cached

tjtolton20:01:57

huh, interesting. So, to review:
* peer starts off "cold", doesn't have much of the database locally cached
* queries ask the peer for info that it doesn't have; it pulls info from the storage service (or memcached) and caches the new info
* subsequent queries for the same data are served from that peer's warm cache
* the transactor knows about new information, and pushes it to all peers
* subsequent queries will fold in that knowledge when serving from their cache

marshall20:01:23

and caching / retrieval from storage happens at the segment level

marshall20:01:29

so it doesnt just fetch the datom you ask for

marshall20:01:37

it’s the whole segment that contains that datom

marshall20:01:08

where segments contain 100s or 1000s of datoms

marshall20:01:15

(i.e. a chunk of the index tree)

marshall20:01:45

so a lot of use cases might ask for some data, then based on that ask about some related data and that second query may only need to use the same segment

marshall20:01:03

the idea is that you effectively amortize the n+1 problem away

marshall20:01:25

so you don’t have to do everything in a single query the way you would with a traditional client/server db

tjtolton20:01:20

gotcha. The n+1 problem is a term I've heard many times