Ben Sless 11:01:44

Any advice on choosing an appropriate datalog-ish store? I'm indexing external data, so I had no control over defining the initial data model. It can get quite nested at times, and there are different kinds of entities in the system which can be related and have some hierarchical structure

Ben Sless 11:01:16

Entities can be IDed based on unique identifiers or an additional index and the unique identifier of their parent

Ben Sless 11:01:52

Leaning towards Asami at the moment, because I won't need to fight with the data to flatten it (which would require familiarity with all the possible shapes), and I won't have to fight with the schema when things become polymorphic

Ben Sless 11:01:03

Would also be interested in tooling for schema inference


the main problem with asami is that work on it will not be continuing anytime soon, but that is something you'd have to ask @quoll about directly


but as it is in clojure, it will probably work until the end of the world without any problems


if clj, I can't recommend datalevin enough, which is being developed all the time at an amazing pace, and @huahaiy is doing an amazing job ❤️

Ben Sless 12:01:15

clj. I mostly need to deal with nested data I don't own, the best tool for the job will be derived from there


basically all datalog dbs normalize the data, so you probably need to choose a different selection criterion, because that one doesn't help much 🙃

Ben Sless 13:01:10

and nested maps always expand to references?

Ben Sless 13:01:43

I think xtdb doesn't expand maps


I'm still working on Asami, it's just not my job. That's OK… it hasn't been my job for most of its life anyway. That said, I haven't been looking at a computer for the past 2 weeks. I plan to this week though (I still have 1 more week of vacation)


@ben.sless datascript automatically unfolds nested maps, and so do all the things that are based on it, like datalevin and datahike
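A minimal sketch of that unfolding, assuming DataScript's transaction API (the attribute names are invented for the example; nested maps expand only for attributes declared as refs):

```clojure
(require '[datascript.core :as d])

;; Nested maps only unfold for attributes declared as refs
(def conn (d/create-conn {:person/address {:db/valueType :db.type/ref}}))

(d/transact! conn
  [{:person/name    "Ada"
    :person/address {:address/city "London"}}])

;; The nested map became its own entity, linked by a reference:
(d/q '[:find ?city
       :where
       [?p :person/name "Ada"]
       [?p :person/address ?a]
       [?a :address/city ?city]]
     @conn)
;; => #{["London"]}
```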


xtdb is a document data store and allows an attribute value to be a map


This reminds me… I need to expose this feature in Asami. The (intentional) lack of schemas can be awkward sometimes!


One possible criterion is whether you need the db to store arrays. I've found Datomic-likes constraining when modelling certain kinds of data, but not Asami.


Yes, this is the same thing I was discussing in a thread earlier.


Should I focus on exposing this soon?


Storage for it works fine, but arrays and maps are usually interpreted into triples, which is why this feature is not exposed
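As a toy illustration of that interpretation (attribute names and node ids here are invented; Asami's actual internal encoding differs in detail), a vector can be flattened into cons-cell-style triples:

```clojure
;; Toy encoding of a vector as a linked list of triples.
;; :list/first, :list/rest and the :n0/:n1 node ids are invented
;; for illustration only.
(defn vector->triples [v]
  (let [nodes (mapv #(keyword (str "n" %)) (range (count v)))]
    (mapcat (fn [node val nxt]
              [[node :list/first val]
               [node :list/rest (or nxt :list/nil)]])
            nodes v (concat (rest nodes) [nil]))))

(vector->triples ["a" "b"])
;; => ([:n0 :list/first "a"] [:n0 :list/rest :n1]
;;     [:n1 :list/first "b"] [:n1 :list/rest :list/nil])
```

Storing the array as a single value would instead emit just one triple whose value is the whole vector.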


It would certainly be useful, but I wouldn't want to take time away from a faster entity/`pull` implementation (which I know you're working towards) 🙂


Well I’m working on a dozen things, and for the past few weeks: nothing 😳


I was caught up trying to write the re:clojure talk, and I was procrastinating over that so I wrote cljs-math, and then I went on a Christmas break… my Asami coding is falling behind. That's why I was hoping to spend this week coding. Although, I have 2 other (hopefully short) projects I want to work on this week as well

Ben Sless 13:01:54

And I happen to need arrays


Just “arrays”, or “arrays as single values”?

👍 1
Ben Sless 13:01:27

Arrays as single values; the sequence of data in them has no meaning in isolation


OK. Then I’d better get to it 🙂 Meanwhile, other stores have this already

Ben Sless 14:01:17

I thought it was working fine?


There are some subtleties: whether the array gets stored as a single value or converted into Asami's linked list.

Ben Sless 14:01:59

Thank you :)

🙂 1

datalevin has no problem storing an array as a single value. In fact, if you have single values of huge size, regardless of type, datalevin is probably your best option among the alternatives, because LMDB is faster than the file system when dealing with large blobs: it doesn't incur the context-switch cost of system calls. That's why machine learning people use it for storing images for computer vision training.
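A minimal sketch of storing a byte array as a single value, assuming Datalevin's DataScript-compatible API (the path, schema, and attribute names are my own invention; check the Datalevin docs for exact value-type support):

```clojure
(require '[datalevin.core :as d])

;; Hypothetical schema: :img/bytes holds an opaque byte array
(def conn
  (d/get-conn "/tmp/imgdb"
              {:img/id    {:db/unique    :db.unique/identity}
               :img/bytes {:db/valueType :db.type/bytes}}))

;; The blob is stored as one value, not expanded into triples
(d/transact! conn
  [{:img/id    "cat-001"
    :img/bytes (byte-array (range 256))}])
```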


Less important, but other differences that come to mind:
• Asami and Datahike keep histories; Datascript and Datalevin do not.
• `pull` for extracting entities/subtrees – Asami has `entity`, but it is not as powerful and – as of DataScript 1.3.0 – not as fast as `pull`. (But @quoll is working on it!)
• Asami doesn't support namespaced ids – there are only `:db/id` and `:id` (you can certainly include an attribute like `:person/id`, but it won't be treated as an identifier).
• Datalevin includes (or should soon) full-text search.
• If I'm not mistaken, Asami's query is fastest. And it supports some graph features.
• Some considerations apply if you need durable storage.
You can also look at the comparison table. (Not sure how up-to-date it is. See comment below.)
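A quick illustration of the `pull` point, using DataScript (the entities and attribute names are invented for the example): one call extracts a whole subtree.

```clojure
(require '[datascript.core :as d])

(def conn (d/create-conn {:person/friend {:db/valueType :db.type/ref}}))

(d/transact! conn
  [{:person/name   "Ada"
    :person/friend {:person/name "Babbage"}}])

;; Find the entity id, then pull a nested selection in one call
(let [eid (d/q '[:find ?e . :where [?e :person/name "Ada"]] @conn)]
  (d/pull @conn [:person/name {:person/friend [:person/name]}] eid))
;; => {:person/name "Ada", :person/friend {:person/name "Babbage"}}
```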


I’ve not been a huge fan of that table. It focuses on the features of XTDB (hence why it's green on most features) and doesn't consider features that are only supported by other DBs. Then again, I would say that, given that Asami doesn't intersect greatly with XTDB's feature set.

👍 2
Ben Sless 16:01:12

But asami supposedly has the best performance characteristics :catjam:


if nothing has changed, no


it is worth noting when testing that asami returns a lazy-seq


practically no one who has tested the datalog dbs has noticed this

👍 1

I don't know how, but doing a doall on the asami results makes it no longer the fastest
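The effect is easy to reproduce in plain Clojure: timing a call that returns a lazy seq measures almost nothing until the result is realized (`slow-step` here is just a stand-in for per-row query work):

```clojure
;; Stand-in for work done per result row
(defn slow-step [x] (Thread/sleep 1) (inc x))

;; Stand-in for a query that returns a lazy seq, as Asami's does
(defn fake-query []
  (map slow-step (range 100)))

;; Misleading: returns almost instantly, since nothing is realized yet
(time (fake-query))

;; Honest: realize the whole result before stopping the clock
(time (doall (fake-query)))
```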

Ben Sless 17:01:17

I think it's time for a PR


asami is faster than datascript just as map is faster than mapv 🙃


you can fork this bench and add asami


I did check it in the past, but while cleaning up I deleted the repo



|                      |   q1 |   q2 |   q3 |    q4 | qpred1 | qpred2 |
|----------------------|-----:|-----:|-----:|------:|-------:|-------:|
| latest-datascript    | 1.30 | 3.60 | 5.10 |  7.80 |   5.50 |  11.70 |
| latest-datalevin     | 0.57 | 2.40 | 2.90 |  4.70 |   5.30 |   6.60 |
| latest-asami         | 2.20 | 9.00 | 9.50 | 12.80 |  34.20 |  46.40 |
| latest-datahike-mem  | 0.74 | 3.00 | 4.20 |  7.60 |  18.40 |  18.40 |
| latest-datahike-file | 0.80 | 3.10 | 4.30 |  7.20 |  18.20 |  18.20 |

gratitude 1

if I remember correctly, datahike was previously the slowest, but we can see the guys have made a lot of progress


which is quite interesting anyway, and it's surprising that datalevin is so much faster, given that both solutions have datascript underneath


How are results for in-memory and on-disk Datahike so close?


I have no idea, note that datalevin uses lmdb and yet is the fastest


That might be due to different caching implementations, etc. But supposedly in the Datahike examples everything else is equal except the type of storage?


here is all the code, I just added asami


in general, I don't know how datahike works; it probably keeps some data in memory, and the benchmark generates so little data that everything fits in memory


Not familiar with Datahike either. BTW, is this in-memory or on-disk Asami?


Quite surprised to see Asami last. I remember conversations discussing its superior algorithm, for example, in @huahaiy's plans for


I myself used to tell everyone that asami is the fastest


it repeatedly came out on top in many benchmarks, which I posted at the beginning


but somehow no one, including me, noticed that asami returns a lazy-seq


the query engine itself may be fastest, but its implementation may not be


writing in clojure, you can very quickly get bogged down using functions that are horribly slow and kill all performance


probably if @ben.sless sits down and does some PR, asami will be 10x faster than all the rest 🙃

Ben Sless 20:01:42

If Paula would like me to take a poke at it


The implementation of asami is mostly idiomatic clojure, so there is large room for improvement. In general, all the existing datalog offerings in the clojure world have large room for performance improvement. I plan to finish datalevin's query engine rewrite this year, hopefully to address some of the performance issues so it performs similarly to a row store (i.e. any of the SQL dbs) while still retaining the flexibility of an EAV store. Stay tuned.

gratitude 1

Datalevin does implement some simple optimizations within the current framework of Datascript, which does make a difference.


Datahike is working on query performance optimization as well. We hope to get to that soon. Help is always appreciated. Chime in on the discussion on GitHub if you are interested in Datahike's features:


@U4GEXTNGZ I am impressed with the progress datahike has made, congratulations to you guys

❤️ 1

less than a year ago datahike was on average 3x slower than datascript; now it is faster, if only marginally


Thanks @U0BBFDED7. This kind of comment makes it worth it. More and more people and companies are relying on Datahike, and that encourages going the extra mile. Another great sign of the growing interest is the minor and major contributions coming in.


@U4GEXTNGZ, any thoughts on this? > How are results for in-memory and on-disk Datahike so close?


Afaik the queries are memoized, so repeatedly querying the same thing is essentially served from memory. It seems to me that is what the benchmark does.

gratitude 2

For the record, Datalevin now has a cost based query optimizer

catjam 1