Any advice on choosing an appropriate datalog-ish store? I'm indexing external data, so I had no control over defining the initial data model. It can get quite nested at times, and there are different kinds of entities in the system which can be related and have some hierarchical structure
Entities can be IDed based on unique identifiers or an additional index and the unique identifier of their parent
Leaning towards Asami at the moment because I don't need to fight with the data to flatten it (which would require familiarity with all possible options), and I won't have to fight with the schema when things become polymorphic
Would also be interested in tooling for schema inference
@ben.sless clj or cljs ?
the main problem with asami is that work on it will not be continuing anytime soon, but you'd have to ask @quoll directly about that
but as it is in clojure, it will probably work until the end of the world without any problems
if clj, I can't recommend datalevin enough, which is being developed all the time at an amazing pace, and @huahaiy is doing an amazing job ❤️
clj. I mostly need to deal with nested data I don't own, the best tool for the job will be derived from there
basically all datalog db's normalize the data, so you probably need to choose a different selection criterion, because that doesn't help much 🙃
and nested maps always expand to references?
I think xtdb doesn't expand maps
I'm still working on Asami, it's just not my job. That's OK… it hasn't been my job for most of its life anyway. That said, I haven't been looking at a computer for the past 2 weeks. I plan to this week though (I still have 1 more week of vacation)
@ben.sless datascript automatically unfolds nested maps, and so all the things that are based on it, like datalevin and datahike
XTDB is a document data store and allows attr/val to be a map
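To make the "unfolding" concrete, here is a minimal sketch in Python (chosen only as a neutral language; this is not the API of DataScript, Datalevin, Datahike, or XTDB). It assumes a hypothetical scheme where each nested map gets a generated integer id and the parent holds a reference to it:

```python
from itertools import count


def to_triples(entity, ids=None):
    """Flatten a nested dict into (entity, attribute, value) triples.

    Each nested dict becomes its own entity, referenced from its parent.
    The integer id scheme is hypothetical and only for illustration.
    """
    ids = ids if ids is not None else count(1)
    eid = next(ids)
    triples = []
    for attr, value in entity.items():
        if isinstance(value, dict):
            # A nested map turns into a reference to a new entity.
            child_id, child_triples = to_triples(value, ids)
            triples.append((eid, attr, child_id))
            triples.extend(child_triples)
        else:
            triples.append((eid, attr, value))
    return eid, triples


doc = {"name": "Ann", "address": {"city": "Oslo", "geo": {"lat": 59.9}}}
root, triples = to_triples(doc)
for triple in triples:
    print(triple)
```

A document store would instead keep the whole `address` map as the value of a single attribute, so nothing is normalized into separate entities.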
This reminds me… I need to expose this feature in Asami. The (intentional) lack of schemas can be awkward sometimes!
@quoll great!
One possible criterion is whether you need the db to store arrays. I've found Datomic-likes constraining when modelling certain kinds of data, but not Asami.
Yes, this is the same thing I was threading about earlier.
Should I focus on exposing this soon?
Storage for it works fine, but arrays and maps are usually interpreted into triples, which is why this feature is not exposed
It would certainly be useful, but I wouldn't want to take time away from a faster entity/`pull` implementation (which I know you're working towards) 🙂
Well I’m working on a dozen things, and for the past few weeks: nothing 😳
I was caught up trying to write the re:clojure talk, and I was procrastinating over that so I wrote cljs-math, and then I went on a Christmas break… my Asami coding is falling behind. That's why I was hoping to spend this week coding. Although, I have 2 other (hopefully short) projects I want to work on this week as well
And I happen to need arrays
Just “arrays”, or “arrays as single values”?
Arrays as single value, the sequence of data in them has no meaning in isolation
OK. Then I’d better get to it 🙂 Meanwhile, other stores have this already
I thought it was working fine?
There are some subtleties around whether the array will get stored as a single value or converted into Asami's linked list.
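A rough Python sketch of the difference between the two representations. The `first`/`rest` attribute names and the `node-N` ids are hypothetical placeholders, not Asami's actual internal names:

```python
from itertools import count


def list_to_triples(owner, attr, items, ids):
    """Encode a list as a chain of nodes, one node per element.

    "first" holds the element and "rest" points at the next node.
    Both attribute names are made up for this illustration.
    """
    triples = []
    prev, link = owner, attr
    for item in items:
        node = f"node-{next(ids)}"
        triples.append((prev, link, node))
        triples.append((node, "first", item))
        prev, link = node, "rest"
    return triples


def single_value_triple(owner, attr, items):
    """Store the whole array as one opaque value."""
    return [(owner, attr, tuple(items))]


pixels = [0.1, 0.5, 0.9]
as_list = list_to_triples("img-1", "pixels", pixels, count(1))
as_value = single_value_triple("img-1", "pixels", pixels)

print(len(as_list), "triples as a linked list")
print(len(as_value), "triple as a single value")
```

When the elements have no meaning in isolation, the single-value form avoids creating queryable structure that nobody will query.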
There was a discussion on this https://clojurians.slack.com/archives/C018H97E02D/p1636986148121300.
Thank you :)
datalevin has no problem storing an array as a single value. In fact, if you have a single value of huge size, regardless of type, datalevin is probably your best option among the alternatives, because LMDB is faster than the file system when dealing with large blobs, as it doesn't incur the cost of context switches for system calls. That's why machine learning people use it for storing images for computer vision training.
Less important, but other differences that come to mind:
• Asami and Datahike keep histories, Datascript and Datalevin do not.
• pull for extracting entities/subtrees – Asami has entity but it is not as powerful, and – as of DataScript 1.3.0 – not as fast as pull. (But @quoll is working on it!)
• Asami doesn't support namespaced ids – there are only :db/id and :id (you can certainly include an attribute like :person/id, but it won't be treated as an identifier).
• Datalevin includes (or should soon) full-text search.
• If I'm not mistaken, Asami's query is fastest. And supports some graph features.
• Some considerations if you need durable storage.
You can also look at the comparison table at https://clojurelog.github.io/ (not sure how up-to-date it is; see the comment below).
I’ve not been a huge fan of that table. It focuses on the features of XTDB (hence, why it's green on most features) and doesn't consider features that are only supported by other DBs. Then again, I would say that, given that Asami doesn't intersect greatly with XTDB features
For the record, Datalevin now has a cost based query optimizer
@timok, any thoughts on this? > How are results for in-memory and on-disk Datahike so close?
Afaik the queries are memoized and repeatedly querying the same thing is essentially from memory. It seems to me that is what the benchmark does.
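A small Python sketch of why memoization would hide the storage backend in a benchmark that repeats the same query. This is an assumption-laden illustration, not Datahike's implementation; `read_from_storage` and `cached_query` are hypothetical names:

```python
from functools import lru_cache

calls = {"storage_reads": 0}


def read_from_storage(query):
    """Stand-in for the expensive path (disk or memory backend)."""
    calls["storage_reads"] += 1
    return sum(range(1000))


@lru_cache(maxsize=None)
def cached_query(query):
    """Memoized query: only the first call per query reaches storage."""
    return read_from_storage(query)


# A benchmark loop that repeats one query mostly measures the cache.
for _ in range(100):
    cached_query("q1")

print(calls["storage_reads"])
```

If that is what happens, in-memory and on-disk timings would differ only on the first run of each query.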
But asami supposedly has best performance characteristics
if nothing has changed, no
it is worth noting when testing that asami returns lazy-seq
https://github.com/lambdaisland/datalog-benchmarks/blob/main/src/datalog_benchmarks/scratch.clj
practically no one who has tested the datalog dbs has noticed this
How
I don't know how, but doing a doall on the asami results makes it no longer the fastest
I think it's time for a PR
asami is faster than datascript just as map is faster than mapv 🙃
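The same pitfall can be reproduced in any language with lazy sequences. A minimal Python sketch, where a generator plays the role of Clojure's lazy `map` and a list plays the role of `mapv` (or `doall`); all function names here are hypothetical:

```python
import time

work_done = 0


def transform(x):
    """Stand-in for per-row query work; counts how often it runs."""
    global work_done
    work_done += 1
    return x * x


def lazy_query(n):
    # Returns immediately; no row is computed until someone iterates.
    return (transform(i) for i in range(n))


def eager_query(n):
    # Computes every row before returning.
    return [transform(i) for i in range(n)]


def timed(thunk):
    start = time.perf_counter()
    result = thunk()
    return result, time.perf_counter() - start


n = 200_000
_, lazy_time = timed(lambda: lazy_query(n))
work_after_lazy = work_done  # still zero: nothing was realized
_, eager_time = timed(lambda: eager_query(n))
_, realized_time = timed(lambda: list(lazy_query(n)))

print(f"lazy, unrealized: {lazy_time:.6f}s")
print(f"eager:            {eager_time:.6f}s")
print(f"lazy + realized:  {realized_time:.6f}s")
```

A fair benchmark forces the result (the `list(...)` call here, `doall` in Clojure) inside the timed region for every database under test.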
you can fork this bench https://github.com/joinr/datalevinbench and add asami
I did check it in the past, but while cleaning up I deleted the repo
😞
| | q1 | q2 | q3 | q4 | qpred1 | qpred2 |
|----------------------+------+------+------+-------+--------+--------|
| latest-datascript | 1.30 | 3.60 | 5.10 | 7.80 | 5.50 | 11.70 |
| latest-datalevin | 0.57 | 2.40 | 2.90 | 4.70 | 5.30 | 6.60 |
| latest-asami | 2.20 | 9.00 | 9.50 | 12.80 | 34.20 | 46.40 |
| latest-datahike-mem | 0.74 | 3.00 | 4.20 | 7.60 | 18.40 | 18.40 |
| latest-datahike-file | 0.80 | 3.10 | 4.30 | 7.20 | 18.20 | 18.20 |
if I remember correctly, previously datahike was the slowest, but we can see guys have made a lot of progress
which is quite interesting anyway, and surprising that datalevin is so much faster, and both solutions have datascript underneath
How are results for in-memory and on-disk Datahike so close?
I have no idea, note that datalevin uses lmdb and yet is the fastest
That might be due to different caching implementations, etc. But, supposedly in Datahike examples everything else is equal but type of storage?
here is all the code, I just added asami
in general, I don't know how datahike works, it probably keeps some data in memory and the benchmark generates so little data that everything fits in memory
Not familiar with Datahike either. BTW, is this in-memory or on-disk Asami?
in memory
Quite surprised to see Asami last. I remember conversations discussing its superior algorithm, for example, in @huahaiy's plans for https://github.com/juji-io/datalevin/issues/11.
I myself used to tell everyone that asami is the fastest
it repeatedly came out on top in many benchmarks, like the one I posted at the beginning
but not knowing why, no one, including me, has noticed that asami returns lazy-seq
the query engine itself may be fastest, but its implementation may not be
writing in clojure, you can very quickly get bogged down using functions that are horribly slow and kill all performance
probably if @ben.sless sits down and does some PR, asami will be 10x faster than all the rest 🙃
If Paula would like me to take a poke at it
The implementation of asami is mostly idiomatic clojure, so there is a lot of room for improvement. In general, all the existing datalog offerings in the clojure world have a lot of room for performance improvement. I plan to finish datalevin’s query engine rewrite this year, hopefully to address some of the performance issues so it performs similarly to a row store (i.e. any of the sql dbs), and still retains the flexibility of an eav store. Stay tuned.
Datalevin does implement some simple optimizations within the current framework of Datascript, which does make a difference.
Datahike is working on query performance optimization as well. We hope to get to that soon. Help always appreciated. Chime in on the discussion on GitHub if you are interested in Datahike's features: https://github.com/replikativ/datahike/discussions/categories/ideas
@timok I am impressed with the progress datahike has made, congratulations to you guys
less than a year ago datahike was on average 3x slower than datascript, now it is marginally faster, but faster
Thanks @huxley. This kind of comment makes it worth it. More and more people and companies are relying on Datahike and that encourages going the extra mile. And another great thing of extended interest are the minor and major contributions coming in.
😊