datalog

Ben Sless 2022-01-02T11:46:44.044700Z

Any advice on choosing an appropriate datalog-ish store? I'm indexing external data, so I had no control over defining the initial data model. It can get quite nested at times, and there are different kinds of entities in the system which can be related and have some hierarchical structure

Ben Sless 2022-01-02T11:47:16.045300Z

Entities can be IDed based on unique identifiers or an additional index and the unique identifier of their parent

Ben Sless 2022-01-02T11:48:52.046700Z

Leaning towards Asami at the moment because I don't need to fight with the data to flatten it, requiring familiarity with all possible options, and I won't have to fight with the schema when things become polymorphic

Ben Sless 2022-01-02T11:51:03.047300Z

Would also be interested in tooling for schema inference

2022-01-02T12:20:51.047700Z

@ben.sless clj or cljs ?

2022-01-02T12:28:01.048300Z

the main problem with asami is that work on it will not be continuing anytime soon, but you would have to ask @quoll directly

2022-01-02T12:28:35.048500Z

but as it is in clojure, it will probably work until the end of the world without any problems

2022-01-02T12:30:33.049100Z

if clj, I can't recommend datalevin enough, which is being developed all the time at an amazing pace, and @huahaiy is doing an amazing job ❤️

Ben Sless 2022-01-02T12:34:15.049900Z

clj. I mostly need to deal with nested data I don't own, the best tool for the job will be derived from there

2022-01-02T12:49:56.050300Z

basically all datalog dbs normalize the data, so you probably need to choose a different selection criterion, because that one doesn't help much 🙃

Ben Sless 2022-01-02T13:02:10.050600Z

and nested maps always expand to references?

Ben Sless 2022-01-02T13:02:43.050900Z

I think xtdb doesn't expand maps

quoll 2022-01-02T13:08:58.054500Z

I'm still working on Asami, it's just not my job. That's OK… it hasn't been my job for most of its life anyway. That said, I haven't been looking at a computer for the past 2 weeks. I plan to this week though (I still have 1 more week of vacation)

2022-01-02T13:11:29.055200Z

@ben.sless datascript automatically unfolds nested maps, and so do all the things that are based on it, like datalevin and datahike
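
To make "unfolds nested maps" concrete, here is a minimal Python sketch of the general idea: each nested map becomes its own entity, and the parent keeps a reference to it. This is an illustration only, not DataScript's actual algorithm; the `flatten` helper, the integer entity IDs, and the sample document are all made up for the example.

```python
from itertools import count

def flatten(entity, ids=None, triples=None):
    """Turn a nested dict into (entity-id, attribute, value) triples.

    Hypothetical helper: nested dicts get their own entity id and the
    parent stores that id as the attribute's value (a reference).
    """
    ids = ids if ids is not None else count(1)
    triples = triples if triples is not None else []
    eid = next(ids)
    for attr, value in entity.items():
        if isinstance(value, dict):
            # Recurse first, then link the parent to the child entity.
            child_id, _ = flatten(value, ids, triples)
            triples.append((eid, attr, child_id))
        else:
            triples.append((eid, attr, value))
    return eid, triples

# Made-up sample document with two levels of nesting.
doc = {"name": "Ada", "address": {"city": "London", "geo": {"lat": 51.5}}}
root, triples = flatten(doc)
for t in triples:
    print(t)
```

A real store would also need to tell references apart from plain integer values (for example via a schema); the sketch skips that.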

2022-01-02T13:12:36.055600Z

xtdb is a document data store and allows attr/val to be a map

quoll 2022-01-02T13:14:35.057700Z

This reminds me… I need to expose this feature in Asami. The (intentional) lack of schemas can be awkward sometimes!

2022-01-02T13:12:52.055800Z

@quoll great!

zeitstein 2022-01-02T13:50:19.058800Z

One possible criterion is whether you need the db to store arrays. I've found Datomic-likes constraining when modelling certain kinds of data, but not Asami.

quoll 2022-01-02T13:51:37.059500Z

Yes, this is the same thing I was threading about earlier.

quoll 2022-01-02T13:51:50.060100Z

Should I focus on exposing this soon?

quoll 2022-01-02T13:53:05.062300Z

Storage for it works fine, but arrays and maps are usually interpreted into triples, which is why this feature is not exposed

zeitstein 2022-01-02T14:12:19.064900Z

It would certainly be useful, but I wouldn't want to take time away from a faster entity/`pull` implementation (which I know you're working towards) 🙂

quoll 2022-01-02T16:58:30.078200Z

Well I’m working on a dozen things, and for the past few weeks: nothing 😳

quoll 2022-01-02T17:00:51.081900Z

I was caught up trying to write the re:clojure talk, and I was procrastinating over that so I wrote cljs-math, and then I went on a Christmas break… my Asami coding is falling behind. That's why I was hoping to spend this week coding. Although, I have 2 other (hopefully short) projects I want to work on this week as well

Ben Sless 2022-01-02T13:51:54.060400Z

And I happen to need arrays

quoll 2022-01-02T13:54:36.063300Z

Just “arrays”, or “arrays as single values”?

👍 1
Ben Sless 2022-01-02T13:55:27.063600Z

Arrays as single value, the sequence of data in them has no meaning in isolation

quoll 2022-01-02T13:57:14.064600Z

OK. Then I’d better get to it 🙂 Meanwhile, other stores have this already

Ben Sless 2022-01-02T14:27:17.070900Z

I thought it was working fine?

zeitstein 2022-01-02T14:39:19.075300Z

There are some subtleties around whether the array will get stored as a single value or converted into Asami's linked list.
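
The difference between the two storage shapes can be sketched in Python as follows. This is a rough illustration under stated assumptions, not Asami's actual encoding: the `first`/`rest` attribute names, the `node-N` identifiers, and both helper functions are hypothetical.

```python
from itertools import count

def as_single_value(eid, attr, items):
    """Store the whole array as one opaque value: a single triple."""
    return [(eid, attr, tuple(items))]

def as_linked_list(eid, attr, items, ids):
    """Store the array as a chain of list nodes (hypothetical encoding).

    Each node holds one element under "first" and points to the next
    node under "rest".
    """
    triples = []
    prev, link = eid, attr
    for item in items:
        node = f"node-{next(ids)}"
        triples.append((prev, link, node))      # link from owner/previous node
        triples.append((node, "first", item))   # the element itself
        prev, link = node, "rest"
    return triples

single = as_single_value("e1", "scores", [10, 20, 30])
linked = as_linked_list("e1", "scores", [10, 20, 30], count(1))
print(len(single), len(linked))
```

With the single-value shape the elements cannot be queried individually, which matches the "no meaning in isolation" use case; the linked-list shape costs two triples per element but keeps each element addressable.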

zeitstein 2022-01-02T14:41:03.075600Z

There was a discussion on this https://clojurians.slack.com/archives/C018H97E02D/p1636986148121300.

Ben Sless 2022-01-02T14:47:59.076100Z

Thank you :)

🙂 1
Huahai 2022-01-03T15:15:05.116400Z

datalevin has no problem storing an array as a single value. In fact, if you have a single value of huge size, regardless of the type, datalevin is probably your best option among the alternatives, because LMDB is faster than the file system when dealing with large blobs: it doesn't incur the cost of context switches for system calls. That's why machine learning people use it for storing images for computer vision training.

zeitstein 2022-01-02T14:26:43.070800Z

Less important, but other differences that come to mind:
• Asami and Datahike keep histories, Datascript and Datalevin do not.
• pull for extracting entities/subtrees: Asami has entity, but it is not as powerful and (as of DataScript 1.3.0) not as fast as pull. (But @quoll is working on it!)
• Asami doesn't support namespaced ids: there are only :db/id and :id (you can certainly include an attribute like :person/id, but it won't be treated as an identifier).
• Datalevin includes (or should soon) full-text search.
• If I'm not mistaken, Asami's query is fastest. And supports some graph features.
• Some considerations if you need durable storage.
You can also look at the https://clojurelog.github.io/. (Not sure how up-to-date it is. See comment below.)

quoll 2022-01-02T14:34:08.075Z

I’ve not been a huge fan of that table. It focuses on the features of XTDB (hence, why it's green on most features) and doesn't consider features that are only supported by other DBs. Then again, I would say that, given that Asami doesn't intersect greatly with XTDB features

👍 2
Huahai 2024-04-06T00:05:12.754339Z

For the record, Datalevin now has a cost based query optimizer

zeitstein 2022-01-04T12:57:17.124100Z

@timok, any thoughts on this?
> How are results for in-memory and on-disk Datahike so close?

timo 2022-01-04T13:15:21.124300Z

Afaik the queries are memoized and repeatedly querying the same thing is essentially from memory. It seems to me that is what the benchmark does.
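
If that is what happens, the effect on a benchmark can be sketched in Python like this. It is a toy model, not Datahike's actual caching: `read_from_storage`, `run_query`, and the query string are hypothetical stand-ins.

```python
from functools import lru_cache

storage_reads = 0

def read_from_storage(query):
    """Stand-in for the slow path (memory or disk backend)."""
    global storage_reads
    storage_reads += 1
    return ("result-for", query)

@lru_cache(maxsize=None)
def run_query(query):
    # Only the first call per distinct query reaches the storage layer.
    return read_from_storage(query)

# A benchmark loop that repeats the same query many times.
for _ in range(100):
    result = run_query("[:find ?e :where [?e :name]]")

print(storage_reads)
```

Under this model only the first iteration pays the storage cost, so a benchmark that repeats identical queries would mostly measure the cache and show little difference between backends.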

Ben Sless 2022-01-02T16:08:12.077100Z

But asami supposedly has best performance characteristics :catjam:

2022-01-02T17:01:31.082100Z

if nothing has changed, no

2022-01-02T17:01:57.082300Z

it is worth noting when testing that asami returns a lazy-seq

2022-01-02T17:06:19.082800Z

practically no one who has tested the datalog dbs has noticed this

👍 1
Ben Sless 2022-01-02T17:07:22.083Z

How

2022-01-02T17:09:49.083200Z

I don't know how, but doing a doall on the asami results makes it no longer the fastest

Ben Sless 2022-01-02T17:10:17.083400Z

I think it's time for a PR

2022-01-02T17:10:39.083600Z

asami is faster than datascript just as map is faster than mapv 🙃
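
The map vs mapv point can be shown with a small Python analogue, where a generator plays the role of a lazy seq and a list the role of an eager vector. This is only an analogy (Clojure lazy seqs are chunked and cached, Python generators are not), and `expensive`, `lazy_query`, and `eager_query` are made-up names.

```python
calls = 0

def expensive(x):
    """Stand-in for per-row query work; counts how often it runs."""
    global calls
    calls += 1
    return x * x

def lazy_query(rows):
    # Like Clojure's map: returns immediately, does no work yet.
    return (expensive(r) for r in rows)

def eager_query(rows):
    # Like mapv or (doall ...): all the work happens before returning.
    return [expensive(r) for r in rows]

rows = range(1000)

lazy_result = lazy_query(rows)
calls_after_lazy = calls        # nothing computed so far

eager_result = eager_query(rows)
calls_after_eager = calls       # all 1000 rows computed

realized = list(lazy_result)    # the equivalent of doall
calls_after_realizing = calls   # the lazy work is only paid here

print(calls_after_lazy, calls_after_eager, calls_after_realizing)
```

A timer wrapped around `lazy_query` alone would measure almost nothing, so a fair comparison has to realize the result inside the timed region.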

Ben Sless 2022-01-02T17:10:54.083800Z

:clojure-spin:

2022-01-02T17:11:41.084Z

you can fork this bench https://github.com/joinr/datalevinbench and add asami

2022-01-02T17:12:24.084300Z

I did check it in the past, but while cleaning up I deleted the repo

2022-01-02T17:12:27.084500Z

😞

2022-01-02T17:33:21.084700Z

@ben.sless

|                      |   q1 |   q2 |   q3 |    q4 | qpred1 | qpred2 |
|----------------------+------+------+------+-------+--------+--------|
| latest-datascript    | 1.30 | 3.60 | 5.10 |  7.80 |   5.50 |  11.70 |
| latest-datalevin     | 0.57 | 2.40 | 2.90 |  4.70 |   5.30 |   6.60 |
| latest-asami         | 2.20 | 9.00 | 9.50 | 12.80 |  34.20 |  46.40 |
| latest-datahike-mem  | 0.74 | 3.00 | 4.20 |  7.60 |  18.40 |  18.40 |
| latest-datahike-file | 0.80 | 3.10 | 4.30 |  7.20 |  18.20 |  18.20 |

2022-01-02T17:37:25.086100Z

if I remember correctly, previously datahike was the slowest, but we can see the guys have made a lot of progress

2022-01-02T17:38:04.086300Z

which is quite interesting anyway, and it is surprising that datalevin is so much faster, given that both solutions have datascript underneath

zeitstein 2022-01-02T18:40:20.086500Z

How are results for in-memory and on-disk Datahike so close?

2022-01-02T18:58:50.086800Z

I have no idea, note that datalevin uses lmdb and yet is the fastest

zeitstein 2022-01-02T19:03:55.087Z

That might be due to different caching implementations, etc. But, supposedly in Datahike examples everything else is equal but type of storage?

2022-01-02T19:09:25.087200Z

https://github.com/joinr/datalevinbench

2022-01-02T19:09:47.087500Z

here is all the code, I just added asami

2022-01-02T19:12:32.087700Z

in general, I don't know how datahike works, it probably keeps some data in memory and the benchmark generates so little data that everything fits in memory

zeitstein 2022-01-02T19:40:10.087900Z

Not familiar with Datahike either. BTW, is this in-memory or on-disk Asami?

2022-01-02T19:40:19.088100Z

in memory

zeitstein 2022-01-02T19:42:39.088300Z

Quite surprised to see Asami last. I remember conversations discussing its superior algorithm, for example, in @huahaiy's plans for https://github.com/juji-io/datalevin/issues/11.

2022-01-02T19:50:06.088700Z

I myself used to tell everyone that asami is the fastest

2022-01-02T19:50:26.088900Z

it repeatedly came out that way in many benchmarks, which I posted at the beginning

2022-01-02T19:51:17.089100Z

but, I don't know why, no one, including me, noticed that asami returns a lazy-seq

2022-01-02T19:52:15.089300Z

the query engine itself may be fastest, but its implementation may not be

2022-01-02T19:52:52.089500Z

writing in clojure, you can very quickly get bogged down using functions that are horribly slow and kill all performance

2022-01-02T19:53:42.089700Z

probably if @ben.sless sits down and does some PR, asami will be 10x faster than all the rest 🙃

Ben Sless 2022-01-02T20:13:42.089900Z

If Paula would like me to take a poke at it

Huahai 2022-01-03T03:10:48.099300Z

The implementation of asami is mostly idiomatic clojure, so there is large room for improvement. In general, all the existing datalog offerings in the clojure world have large room for performance improvement. I plan to finish datalevin’s query engine rewrite this year, hopefully to address some of the performance issues so it performs similarly to a row store (i.e. any of the sql dbs), and still retains the flexibility of an eav store. Stay tuned.

Huahai 2022-01-03T03:20:50.102600Z

Datalevin does implement some simple optimizations within the current framework of Datascript, which does make a difference.

timo 2022-01-03T18:22:18.122100Z

Datahike is working on query performance optimization as well. We hope to get to that soon. Help always appreciated. Chime in on the discussion on GitHub if you are interested in Datahike's features: https://github.com/replikativ/datahike/discussions/categories/ideas

2022-01-03T19:46:02.123Z

@timok I am impressed with the progress datahike has made, congratulations to you guys

❤️ 1
2022-01-03T19:47:28.123300Z

less than a year ago datahike was on average 3x slower than datascript, now it is marginally faster, but faster

timo 2022-01-03T19:55:54.123500Z

Thanks @huxley. This kind of comment makes it worth it. More and more people and companies are relying on Datahike and that encourages going the extra mile. And another great thing of extended interest are the minor and major contributions coming in.

2022-01-03T19:56:26.123700Z

😊