datalog 2020-08-03 | Slack Archive

Asami is also still work in progress, Paula is still planning to add a storage layer for instance.

Am I doing something wrong here or are Asami queries really 100x faster than with Datomic/DataScript?

https://github.com/lambdaisland/datalog-benchmarks/blob/main/src/datalog_benchmarks/scratch.clj#L128-L150

The first query is literally just a get-in operation followed by a projection. The second query does the planning (at a guess, it will likely pick the provided ordering of the query) and it will then turn into a set of get operations in a loop for however many people there are named “Ivan”. I would it expect it to be trivially fast as well

plexus14:08:48

I'll try to find some more interesting queries :)

plexus08:08:38

I know one has to be careful with microbenchmarks, I'd absolutely love for people to challenge these results. PRs to that repo are also welcome, would be nice if it could grow into a comprehensive comparison of the different implementations.

jumar08:08:40

I don't know much about either of these implementations but you could check where the time is spent e.g. via clj-async-profiler

quoll11:08:51

It would be unfair to compare Asami to Datomic, because that system is designed for its durable provisioning. There’s a lot going on. Datascript is more comparable. Asami indexing is simpler, and it doesn’t really keep things like deletion datoms. That may help Asami to be faster here? Also, a micro benchmark will be hitting the cache on Asami’s query planner. I don’t believe that the others have a query optimizer, while Asami does. There is a small cost to using it, but if you keep running the same query without updating the database then it will just return the same query plan without getting re-executed.

plexus11:08:16

that explains a lot, because this benchmark does a warmup first

quoll16:08:56

I just realized… the query planner doesn’t execute on the first query, because there’s nothing to plan. I’m honestly confused why Datascript would be so slow for that query

quoll22:08:19

I just looked at the code that would be executed for the first query. It’s several if /`cond` statements that figure out the structure of the query, followed by a get-in from the index. Then comes the expensive bit… For every result, it does a reduce over the 1 column defined in the :find clause, doing a map get to find the offset of the bound variable ?e (which is 0) and doing: (assoc '[?e] 0 result) to create each result row. Literally, that’s it. I can’t see why it would take multiple milliseconds?

Huahai20:08:33

what’s the index of asami looks like?

Huahai20:08:19

because I think the performance difference mostly comes from the the differences in index data structures

quoll20:08:19

3 maps

Huahai20:08:15

of?

quoll20:08:55

The indexes are: {entity {attribute #{value}}} {attribute {value #{entity}}} {value {entity #{attribute}}}

Huahai20:08:11

so it’s all nested maps?

quoll20:08:18

yes

Huahai20:08:39

where’s datascript has flat maps

Huahai20:08:06

so it has to join much more data

quoll20:08:51

The indexes are pluggable. I’m building a new one at the moment, which is quite different (very similar to the Mulgara indices)

Huahai20:08:15

what’s the new one look like?

quoll20:08:28

It’s a tree backed by a buffer. In ClojureScript I want that buffer to be stored in the browser’s persistent storage. In Clojure it’s backed by a MappedByteBuffer on a file

quoll20:08:54

same indexing as above though

Huahai20:08:19

i see

quoll20:08:03

As I said, it’s based on Mulgara. I have not maintained it for some time, but the last I heard it was still being used commercially: http://mulgara.org/

quoll20:08:35

I stopped working on Mulgara because I wanted to redo a lot of it in Clojure (it’s a Java project)

Huahai20:08:46

cool

Huahai20:08:53

wouldn’t it be easier to just use it as backend rather than re-implement in clojure?

Huahai20:08:37

performance wise, it would be better too, guess you want it to be usable in bowser?

quoll20:08:06

I want to revisit the architecture. We made decisions about it in 2001 that involved tradeoffs that we don’t need to make anymore.

quoll20:08:22

And yes… I want it to run on both platforms

quoll20:08:50

The Clojure code is dramatically simpler

quoll20:08:43

and the main overhead for the trees is I/O, so the speed improvement gained from Java isn’t a big deal

Huahai20:08:37

right

Huahai20:08:24

very cool, looking forward to it

quoll20:08:39

Still working on it 🙂

quoll20:08:59

I have the memory mapping working (it’s tricky to manage files over 2GB)

quoll21:08:10

Still building the trees

Huahai21:08:58

what’s your plan for asami? compete with datomic?

quoll21:08:26

quoll21:08:09

It’s a hybrid plan… • We use it at Cisco. I’ve been expanding on its capabilities as needed. • It has supplanted Mulgara as my personal project. I’m trying to get back to what Mulgara can do

quoll21:08:29

But, to a certain extend, the fact that Datomic was not open source has been frustrating. So I thought it would be nice to have an option. It will fill a different niche though, since it isn’t peer-based, and it’s based on an OWA like RDF/SPARQL

quoll21:08:05

There are new architectures that I want to try with triple indices for different use-cases. Mulgara’s complexity has been too hard to make this change, and I’m hoping that Asami gives me the flexibility to do what I’ve wanted to there

quoll21:08:07

It’s been slow going, because until yesterday no one else had ever submitted a PR 🙂

Huahai21:08:20

I think people are just now noticing asami

Huahai21:08:42

it’s good to have more options, so far i counted 7 options that is doing datomic-like queries in the clojure world

Huahai21:08:01

Asami the fastest in term of query speed, so people are starting to notice

Huahai21:08:18

Datomic, Datascript, Asami, Crux, Datahike, Eva and mine Datalevin

Huahai21:08:24

7 in total

quoll21:08:57

Well, I’m aiming for Datomic-like, rather than exactly copying Datomic 🙂

Huahai21:08:14

i think these 7 are all different

Huahai21:08:01

Asami is the most different, actually, for it’s not a

Huahai21:08:38

Datomic inspired, whereas all others are

quoll21:08:29

Well, it was partly inspired by Datomic. But also RDF

Huahai21:08:17

datomic is insperied by RDF too, but yous is earlier than that

Huahai21:08:33

the biggest difference is the indexing structure, and I think you got it right

Huahai21:08:32

you seems to have more experience in RDF like databases than Datomic people

Huahai21:08:42

are you open to alternative drualable storage than what you are working on?

quoll21:08:12

Well one thing I haven’t done is to keep the full history of assertions and retractions. There’s a lot of complexity if I want to fully index that

quoll21:08:18

definitely

quoll21:08:22

This is why it’s pluggable

quoll21:08:35

It just needs to follow a simple protocol

quoll21:08:38

https://github.com/threatgrid/asami/blob/main/src/asami/graph.cljc#L9

quoll21:08:05

And the graph-diff function in there is optional

Huahai21:08:48

graph here means what?

Huahai21:08:53

a set of triples?

quoll21:08:59

yes

quoll22:08:11

That protocol may get tweaked (around transactions), but that’s basically the level of abstraction I need

Huahai22:08:13

what does resolve-triple do?

quoll22:08:49

It takes a graph and 3 elements (that come from a constraint pattern). Each element can either be a scalar value (e.g. a keyword, string, long, etc), or else a symbol that starts with ? . These symbols are treated as variables. It then returns bindings for all of those variables, based on the triples that match the provided scalar values.

quoll22:08:22

The simple index (there’s another index as well) does this using asami.index/get-from-index https://github.com/threatgrid/asami/blob/main/src/asami/index.cljc#L33

quoll22:08:18

There’s a simplify function that converts values into :v and variables into ?. That’s used to determine the dispatch

quoll22:08:07

You can see there are 8 variations, each with its own code to return the appropriate result

quoll22:08:45

The lettering of SPO is based on RDF: Subject-Predicate-Object. This is equivalent to EAV

Huahai22:08:46

so instead of returning all the matched full datoms like in datascript, this returns the bound values only?

quoll22:08:56

yes

quoll22:08:21

The other columns are fixed by the request, so they’re redundant

Huahai22:08:57

right

Huahai22:08:17

that’s the biggest differences in term of performance though

quoll22:08:32

Well, I thought about returning all 3 columns, and at the time I thought that I would just need to project them away anyway, so why would I?

quoll22:08:40

It would be creating work for myself

Huahai22:08:21

so the indexing has to be more complex than a simple set of datoms, like in datomic or datascript

quoll22:08:18

Well, have a look at that file

quoll22:08:23

how complex does it look?

100 lines :0

🙂

I am thinking of storage on disk

quoll22:08:21

there’s an index-add function, an index-delete function and then just a bit of code in the graph to call them

quoll22:08:37

storage on disk is trickier. I won’t be using maps

Huahai22:08:03

so what kind of trees you are implementing?

quoll22:08:04

instead, trees. They’ll be sorting triples based on the 3 orderings

quoll22:08:07

AVL

quoll22:08:45

each node will refer to a block of triples. From memory, 512 of them

quoll22:08:26

B-trees are nice too, but I’ve had great success with AVL for both upload speed and read performance

quoll22:08:54

Because the nodes are small, the whole top of the tree (down many levels) gets cached

quoll22:08:11

So even though it’s a deeper tree, it’s not as slow to search as it seems at first glance. Also, each node handles a range, which means it’s not purely AVL

Huahai22:08:11

right, but how do you maintain the nested nature of the maps in that tree?

quoll22:08:05

by the sorting

Huahai22:08:20

i see, so it sounds to me the storage could be the exactly the same as datascript, only need add a step to filter to return the bunded values only

Huahai22:08:25

then your query engine can be bring to bear

quoll22:08:28

consider these triples:

a m 1
a m 2
a n 1
a o 3
b m 2
b m 3
c m 2

This is equivalent to:

The second is the nested-map view

Huahai22:08:39

right

quoll22:08:02

on disk you find the boundaries of what you’re looking for by tree searches O(log(n)). When you have maps, then it’s O(1), which is obviously nicer.

Huahai22:08:07

so you are only storing [abc] [momm] …, not [[am1]…]?

Huahai22:08:09

or you still storing the later, but use the bounds to avoid scan the whole range? instead, scan a small range, then move to the next?

quoll22:08:26

I’m sorry. I didn’t quite follow that

Huahai22:08:23

i am trying to figure what the on-disk storage looks like in your design

quoll22:08:05

quoll22:08:29

definitely storing the full triples. So there is redundancy

Huahai22:08:38

my question is do you store [a m 2] in your example?

quoll22:08:58

There are ways to reduce the amount being stored, but given the tradeoffs, I’m going with the simplest approach first

Huahai22:08:02

oh i see, so you still store the full triples, but also store the bounds

quoll22:08:10

yes

quoll22:08:22

No. Store just the triples

Huahai22:08:24

that’s what’s missing from datascript, it doesn’t store bounds

quoll22:08:41

If I’m looking for [a ?x ?y] then I need to search twice. I’ll find the start, and then find the point between [a o 3] and [b m 2]

Huahai22:08:30

i see, bounds are not stored, but found by seeking

quoll22:08:03

searching… to me, “seeking” implies a linear search forward

Huahai22:08:26

sure, you have a binary tree

Huahai22:08:36

so my understanding is the exact same storage that datascript uses can be used in your query engine

quoll22:08:51

yes

quoll22:08:37

You can even just store flat triples and do everything via filter. The performance would be terrible, but it would work fine

Huahai22:08:54

right

Huahai22:08:39

i think it is possible for these projects to be merged

Huahai22:08:53

what I did in Datalevin is to port datascript to LMDB, which becomes faster than Datascript, even though datalevin is on disk and datascript is in memory

Huahai22:08:32

the same thing can be done with Asami, the only thing missing is the filtering step we discussed above

Huahai22:08:01

once resolve-triple is implemented as your api requires, your query engine can be used, i think this combination will yield the best possible implementation

quoll22:08:32

It’s open source 🙂

quoll22:08:01

sorry, I’ve been distracted. It seems that ClojureScript does not support find on transient maps

quoll22:08:04

sigh

Huahai22:08:41

i don’t care about clojurescript, i only care about clojure, so it’s good to have choices

Huahai22:08:15

thank you for talking with me, I admire what you did

quoll22:08:26

Thank you! 🙂

quoll22:08:12

@U07FP7QJ0 created the channel #asami so feel free to ask things in there

Huahai22:08:55

Yes, I am in that channel 🙂

Huahai22:08:14

Back to work. Talk to your later!

quoll11:08:27

It’s also a very simple conjunction, so it will be naturally fast

quoll12:08:13

There’s not a lot for the optimizer to do 🙂

lilactown16:08:56

asami looks really cool

❤️ 3

lilactown16:08:53

how does transacting references to other entities work?

quoll16:08:50

May be better to ask in #asami? But it also depends on what you’re referring to specifically. In general, it does a lookup to find the entity and get it’s internal node value

2020-08-03

Channels