This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2022-01-02
Any advice on choosing an appropriate datalog-ish store? I'm indexing external data, so I had no control over defining the initial data model. It can get quite nested at times, and there are different kinds of entities in the system which can be related and have some hierarchical structure.
Entities can be IDed based on unique identifiers, or on an additional index plus the unique identifier of their parent.
Leaning towards Asami at the moment: I don't need to fight with the data to flatten it (which would require familiarity with all possible attributes), and I won't have to fight with the schema when things become polymorphic.
@ben.sless `clj` or `cljs`?
The main problem with `asami` is that work on it will not be continuing anytime soon, but that would have to be asked of @quoll directly. But as it is in Clojure, it will probably work until the end of the world without any problems.
If `clj`, I can't recommend `datalevin` enough; it is being developed all the time at an amazing pace, and @huahaiy is doing an amazing job ❤️
`clj`. I mostly need to deal with nested data I don't own; the best tool for the job will be derived from there.
Basically all Datalog DBs normalize the data, so you probably need to choose a different selection criterion, because that doesn't help much 🙃
I'm still working on Asami, it's just not my job. That's OK… it hasn't been my job for most of its life anyway. That said, I haven't been looking at a computer for the past 2 weeks. I plan to this week though (I still have 1 more week of vacation)
@ben.sless `datascript` automatically unfolds nested maps, and so do all the things based on it, like `datalevin` and `datahike`.
This reminds me… I need to expose this feature in Asami. The (intentional) lack of schemas can be awkward sometimes!
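To make the nested-map point concrete, here is a minimal Datascript sketch. Note one caveat: Datascript only expands a nested map when the containing attribute is declared as a ref in the schema.

```clojure
(require '[datascript.core :as d])

;; :friend must be a ref attribute for nested maps to be unfolded
(def schema {:friend {:db/valueType :db.type/ref}})
(def conn (d/create-conn schema))

;; The nested map becomes its own entity, linked by a :friend datom
(d/transact! conn [{:name "Alice"
                    :friend {:name "Bob"}}])

(d/q '[:find ?n ?fn
       :where [?e :name ?n]
              [?e :friend ?f]
              [?f :name ?fn]]
     @conn)
;; => #{["Alice" "Bob"]}
```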
One possible criterion is whether you need the db to store arrays. I've found Datomic-likes constraining when modelling certain kinds of data, but not Asami.
Storage for it works fine, but arrays and maps are usually interpreted into triples, which is why this feature is not exposed
It would certainly be useful, but I wouldn't want to take time away from a faster `entity`/`pull` implementation (which I know you're working towards) 🙂
I was caught up trying to write the re:clojure talk, and I was procrastinating over that so I wrote cljs-math, and then I went on a Christmas break… my Asami coding is falling behind. That's why I was hoping to spend this week coding. Although, I have 2 other (hopefully short) projects I want to work on this week as well
There are some subtleties, such as whether the array will get stored as a single value or converted into Asami's linked list.
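A rough sketch of why this distinction matters, shown in Datascript (behavior may differ across the forks and in Asami): with a cardinality-many attribute, each element becomes a separate datom, while a cardinality-one attribute keeps the whole vector as one opaque value.

```clojure
(require '[datascript.core :as d])

(def conn (d/create-conn {:tags {:db/cardinality :db.cardinality/many}}))

;; cardinality-many: each element becomes its own datom,
;; so duplicates collapse and the original order is not preserved
(d/transact! conn [{:db/id 1 :tags ["b" "a" "b"]}])
(:tags (d/pull @conn [:tags] 1))      ; e.g. ["a" "b"]

;; a plain (cardinality-one) attribute stores the whole vector
;; as a single value, keeping order and duplicates intact
(d/transact! conn [{:db/id 2 :ordered-tags ["b" "a" "b"]}])
(:ordered-tags (d/pull @conn [:ordered-tags] 2))   ; ["b" "a" "b"]
```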
There was a discussion on this https://clojurians.slack.com/archives/C018H97E02D/p1636986148121300.
Datalevin has no problem storing an array as a single value. In fact, if you have single values of huge size, regardless of type, Datalevin is probably your best option among the alternatives, because LMDB is faster than the file system when dealing with large blobs: it doesn't incur the context-switch cost of system calls. That's why machine learning people use it for storing images for computer vision training.
Less important, but other differences that come to mind:
• Asami and Datahike keep histories, Datascript and Datalevin do not.
• `pull` for extracting entities/subtrees – Asami has `entity` but it is not as powerful, and – as of DataScript 1.3.0 – not as fast as `pull`. (But @quoll is working on it!)
• Asami doesn't support namespaced ids – there are only `:db/id` and `:id` (you can certainly include an attribute like `:person/id`, but it won't be treated as an identifier).
• Datalevin includes (or should soon) full-text search.
• If I'm not mistaken, Asami's query engine is the fastest. And it supports some graph features.
• There are also some considerations if you need durable storage.
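As a concrete illustration of the `pull` point above, a minimal Datascript sketch: one declarative pattern extracts an entity together with its subtree.

```clojure
(require '[datascript.core :as d])

(def conn (d/create-conn {:children {:db/valueType :db.type/ref
                                     :db/cardinality :db.cardinality/many}}))

(def report
  (d/transact! conn [{:db/id -1
                      :name "root"
                      :children [{:name "left"} {:name "right"}]}]))

;; resolve the tempid, then pull the whole subtree in one call
(d/pull @conn '[:name {:children [:name]}]
        (get (:tempids report) -1))
;; => {:name "root", :children [{:name "left"} {:name "right"}]}
```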
You can also look at the comparison table at https://clojurelog.github.io/. (Not sure how up-to-date it is. See comment below.)
I’ve not been a huge fan of that table. It focuses on the features of XTDB (hence why XTDB is green on most features) and doesn't consider features that are only supported by other DBs. Then again, I would say that, given that Asami doesn't intersect greatly with XTDB's features.
https://github.com/lambdaisland/datalog-benchmarks/blob/main/src/datalog_benchmarks/scratch.clj
I don't know how, but doing a `doall` on the Asami results makes it no longer the fastest.
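A generic illustration of why `doall` changes benchmark numbers (not Asami-specific; `lazy-result` is just a stand-in for a query that returns a lazy sequence): without forcing realization, the timed region only measures constructing the seq, not producing the rows.

```clojure
;; stand-in for a lazy query result
(defn lazy-result []
  (map inc (range 1000000)))

(time (lazy-result))              ; near-zero: nothing realized yet
(time (doall (lazy-result)))      ; pays the full cost of every row
```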
You can fork this bench https://github.com/joinr/datalevinbench and add Asami.
| | q1 | q2 | q3 | q4 | qpred1 | qpred2 |
|----------------------+------+------+------+-------+--------+--------|
| latest-datascript | 1.30 | 3.60 | 5.10 | 7.80 | 5.50 | 11.70 |
| latest-datalevin | 0.57 | 2.40 | 2.90 | 4.70 | 5.30 | 6.60 |
| latest-asami | 2.20 | 9.00 | 9.50 | 12.80 | 34.20 | 46.40 |
| latest-datahike-mem | 0.74 | 3.00 | 4.20 | 7.60 | 18.40 | 18.40 |
| latest-datahike-file | 0.80 | 3.10 | 4.30 | 7.20 | 18.20 | 18.20 |
If I remember correctly, datahike was previously the slowest, but we can see the guys have made a lot of progress.
Which is quite interesting anyway; it is surprising that `datalevin` is so much faster, since both solutions have `datascript` underneath.
That might be due to different caching implementations, etc. But supposedly, in the Datahike entries, everything else is equal except the type of storage?
In general, I don't know how `datahike` works; it probably keeps some data in memory, and the benchmark generates so little data that everything fits in memory.
Quite surprised to see Asami last. I remember conversations discussing its superior algorithm, for example, in @huahaiy's plans for https://github.com/juji-io/datalevin/issues/11.
Writing in Clojure, you can very quickly get bogged down using functions that are horribly slow and kill performance.
Probably if @ben.sless sits down and does some PRs, `asami` will be 10x faster than all the rest 🙃
The implementation of Asami is mostly idiomatic Clojure, so there is large room for improvement. In general, all the existing Datalog offerings in the Clojure world have large room for performance improvement. I plan to finish Datalevin's query engine rewrite this year, hopefully addressing some of the performance issues so it performs similarly to a row store (i.e., any of the SQL DBs) while retaining the flexibility of an EAV store. Stay tuned.
Datalevin does implement some simple optimizations within the current framework of Datascript, which does make a difference.
Datahike is working on query performance optimization as well. We hope to get to that soon. Help is always appreciated. Chime in on the discussion on GitHub if you are interested in Datahike's features: https://github.com/replikativ/datahike/discussions/categories/ideas
@U4GEXTNGZ I am impressed with the progress `datahike` has made, congratulations to you guys.
Less than a year ago `datahike` was on average 3x slower than `datascript`; now it is faster, if only marginally – but faster.
Thanks @U0BBFDED7. This kind of comment makes it worth it. More and more people and companies are relying on Datahike, and that encourages going the extra mile. Another great sign of extended interest is the minor and major contributions coming in.
@U4GEXTNGZ, any thoughts on this? > How are results for in-memory and on-disk Datahike so close?