#datahike
2022-12-23
whilo20:12:52

A little Christmas present and teaser for our pending 1.0 version: Datahike can now be compiled as a shared library with libdatahike/compile-libdatahike, and we now provide experimental Python bindings for it at https://github.com/replikativ/pydatahike. There is also a fairly complete command line client that can be built with bin/build-native-image, which should also be easy to use from babashka/libsci (no pod needed, since it supports different edn serializations itself). We have tested the functionality so far on Linux, but would be curious to learn whether it works on other platforms as well (e.g. Mac, Windows, Android, iOS). The shared library is intended to be the default way to export our API to non-JVM/JS runtimes in the future, so help with it (bindings for other languages, stress testing, bug reporting, etc.) is generally appreciated. Also, as always, you can speed things up by supporting us on https://opencollective.com/datahike or https://github.com/sponsors/replikativ.

👍 4
🎉 2
whilo17:05:04

@U4GEXTNGZ did you see this discussion? we can write a pod, but the datahike command line binary itself already provides a similar interface and should be easy to wrap nicely with pure babashka code

borkdude20:12:35

@whilo note that while a tool can dump EDN, it's still useful to have it implemented as a pod, as this improves the UX in babashka. See e.g. how datalevin can be used as a pod. No additional install of datalevin is necessary; it happens automatically when you execute this code: https://github.com/babashka/pod-registry/blob/master/examples/datalevin.clj The pod and the CLI can be (and usually are) the same binary. E.g. clj-kondo is also able to run as a pod. But of course this work can be done later. Great progress :)

whilo20:12:31

@borkdude I think it would be better to just have a library (file) of sci code that wraps the CLI. It would be equivalent to sending messages back and forth with a pod, as Datahike is in effect stateless between RPCs.
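
A minimal sketch of the wrap-the-CLI idea, written in Python rather than babashka purely for illustration. The helper `call_cli` is hypothetical, and it assumes the wrapped binary prints a single JSON document to stdout. The real Datahike CLI's arguments and output formats are not shown here, so the command line is supplied by the caller and a child Python process stands in for the database binary.

```python
import json
import subprocess
import sys


def call_cli(argv):
    """Run one CLI invocation and parse its stdout as JSON.

    Hypothetical helper: every call spawns a fresh process, which is
    acceptable only if the tool keeps no state between calls.
    """
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


# Stand-in for a database binary: a child Python process that prints JSON.
fake_query = [sys.executable, "-c", "print('[[1, \"alice\"], [2, \"bob\"]]')"]
rows = call_cli(fake_query)
print(rows)
```

Each call pays the full process startup cost, which is the trade-off discussed in the following messages.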

borkdude20:12:20

A pod is just started once during the lifetime of a program, shelling out would start datahike for every query

borkdude20:12:43

Also it has to be installed separately

whilo20:12:03

Starting Datahike every time is not necessarily a problem, unless you need warm caches. The startup time itself is a few milliseconds, not much more than the serialization overhead, I guess.

whilo20:12:45

I am happy to support pod development and care about babashka/libsci support, but I cannot prioritize it right now.

whilo20:12:18

I think what would be nicer though is to compile Datahike directly into libsci without the pod middleware.

borkdude20:12:38

libsci and babashka are two totally different things

whilo20:12:00

Ok, I thought babashka uses libsci.

borkdude20:12:26

it uses SCI, but it doesn't talk to external shared libs. libsci is just a demo of how you can make a shared library with sci

whilo20:12:43

Right, sorry, my bad. I meant sci.

whilo20:12:30

Marshalling data at the interface into edn/transit when the database is implemented in the same language and has the same memory model as your interpreter is suboptimal in my understanding.

borkdude20:12:31

datalevin has also done that (they have a REPL which runs SCI code, for example), but exposing it to babashka via the pod interface doesn't force people to write scripts in a different environment with a different set of libs. I understand it's not a priority, that's ok. This work has already been done by datalevin and can be more or less copied. The pod can also be developed externally from datahike proper.

whilo20:12:14

Datalevin is not fully implemented in Clojure, but rather a stateful blackbox with a different memory model.

whilo20:12:40

Anyway, this is just my intuition as a language designer/compiler builder. I need to look into the details.

borkdude20:12:09

> Marshalling data at the interface into edn/transit when the database is implemented in the same language and has the same memory model as your interpreter is suboptimal in my understanding.
Sure, it's a trade-off to still get some features which you otherwise wouldn't, and this is still better than wrapping a CLI with repeated startup. Pod calls are sub-millisecond, whereas startup of a graal binary is 10-20 ms.
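
A rough cost model for the numbers quoted above, as a Python sketch. The function `total_ms` is hypothetical; 15 ms is simply the midpoint of the quoted 10-20 ms startup range, and 0.5 ms is an assumed stand-in for "sub-millisecond". It ignores cache warm-up and query execution time, which are the same in both setups.

```python
def total_ms(n_queries, startup_ms, per_call_ms, persistent):
    """Total overhead for n_queries calls.

    A persistent process (pod-style) pays startup once; a wrapped CLI
    pays it again on every query.
    """
    startups = 1 if persistent else n_queries
    return startups * startup_ms + n_queries * per_call_ms


# Assumed figures: 15 ms startup, 0.5 ms serialization per call.
cli_total = total_ms(1000, startup_ms=15.0, per_call_ms=0.5, persistent=False)
pod_total = total_ms(1000, startup_ms=15.0, per_call_ms=0.5, persistent=True)
print(cli_total, pod_total)  # 15500.0 515.0
```

Under these assumptions the gap only matters for scripts that issue many queries; for a single query the two setups cost the same.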

borkdude20:12:22

It's sweet that datahike is pure clojure, but pods are language agnostic: it could be implemented in go or haskell, it doesn't matter. If you want to remove this intermediate layer, you would have to build datahike as a built-in library into babashka.

whilo20:12:33

I understand, I looked into the examples. libdatahike has a similar message passing and error handling RPC interface from C, so I guess it could also be used for a pod, with our Python bindings for example.

whilo20:12:06

Would this be a problem?

borkdude20:12:35

not a problem if it compiles with graal, but it's a matter of 1) how much extra strain does it put on CI, 2) how much extra binary size, 3) how often do people ask for this, etc. It can be done as an optional build flag for sure: https://github.com/babashka/babashka/blob/master/doc/build.md#feature-flags

whilo20:12:32

I see, you already provide support for DataScript there, nice! I guess it would be similar.

whilo20:12:18

The nice thing for instance would be that you could also get lazy index iterators over the Datoms in this case.

whilo20:12:37

It is fairly annoying to code that up with RPC.
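
To illustrate why lazy iteration is awkward over RPC, here is a Python sketch of a paged iterator. Everything in it is hypothetical: `fetch_page` stands in for one RPC round trip, the in-memory `INDEX` stands in for a Datahike index, and datoms are simplified to (entity, attribute, value) tuples. It is not how Datahike or the pod protocol actually expose indices.

```python
from typing import Callable, Iterator, List, Tuple

Datom = Tuple[int, str, object]  # (entity, attribute, value), simplified


def remote_datoms(
    fetch_page: Callable[[int, int], List[Datom]], page_size: int = 2
) -> Iterator[Datom]:
    """Lazily yield datoms, fetching one page per round trip."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)  # one RPC call per page
        if not page:
            return
        yield from page
        offset += len(page)


# In-memory stand-in for the remote side.
INDEX = [(1, ":name", "alice"), (1, ":age", 30), (2, ":name", "bob")]
calls = []


def fake_fetch(offset, limit):
    calls.append(offset)  # record each simulated round trip
    return INDEX[offset:offset + limit]


all_datoms = list(remote_datoms(fake_fetch))
print(all_datoms, calls)
```

Even this simple version needs paging state on the client, and a real one would also need a stable snapshot or server-side cursor. An in-process library could hand over the lazy sequence directly.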

borkdude20:12:52

I noticed that datahike DataScript was quite intensive to compile in graal. Sometimes dynamic requires / resolve etc. can make compilation a lot slower and the binaries bigger. Does datahike have a dependency on datascript or does it use its own copy of it?

whilo20:12:29

It is a fork, so no dependency.

whilo20:12:54

But compilation is somewhat heavy and not optimized yet.

borkdude20:12:43

@whilo it's worth trying this out before you load any other namespaces to see if you have any of these dynamic things: https://github.com/babashka/babashka/blob/68a6e2451624be3640b43c6c2f4ad915b9ba423f/src/aaaa_this_has_to_be_first/because_patches.clj#L6-L30

whilo14:12:30

I removed one of the resolves in the transactor, and there is another one in the query namespace (which is somewhat convenient, but not necessary). When I remove this as well, payload size and compilation time indeed shrink significantly, but weirdly the binary does not find clojure.core/init or something anymore. I think I should look more closely into how babashka handles initialization to fix things, lmk if you have any quickstart pointers. We still have "--initialize-at-build-time" set as a build flag for the native image as well.

whilo14:12:45

This is the exact error:

whilo14:12:49

christian@dyson:~/Development/datahike$ ./dhi --help
Exception in thread "main" java.lang.ExceptionInInitializerError
	at clojure.lang.Namespace.<init>(Namespace.java:34)
	at clojure.lang.Namespace.findOrCreate(Namespace.java:176)
	at clojure.lang.Var.internPrivate(Var.java:156)
	at datahike.cli.<clinit>(Unknown Source)
Caused by: java.io.FileNotFoundException: Could not locate clojure/core__init.class, clojure/core.clj or clojure/core.cljc on classpath.
	at clojure.lang.RT.load(RT.java:462)
	at clojure.lang.RT.load(RT.java:424)
	at clojure.lang.RT.<clinit>(RT.java:338)
	... 4 more

borkdude15:12:27

@whilo Have you tried using https://github.com/clj-easy/graal-build-time? And do you use any requiring-resolve or so in the cli namespace?

whilo11:01:15

Not yet. Thank you for the reference, I will check it out.

whilo21:12:04

Thank you!

whilo21:12:31

Did you derive a patch for DataScript from that?

borkdude22:12:31

No, I didn't, but I use it to scan stuff if I want to include a dependency in bb and the compilation takes much longer than necessary (and the binary grows from 80 to 110mb)

borkdude22:12:08

and occasionally I have to monkey-patch a library, like clojure.pprint which uses find-var which has a similar problem

whilo05:12:14

Ok, that makes sense. Thanks!

borkdude14:12:12

I haven't thought deeply about it, but it would be kind of cool if babashka had a C extension API like Ruby. E.g. this is an extension for Ruby written in zig: https://github.com/katafrakt/zig-ruby/blob/main/ext/zig_rb/src/main.zig#L2 Then we would maybe be able to talk more efficiently without serializing things.

whilo14:12:19

Yes, similar to CPython, which makes interacting with the interpreter from native code easy. It is not that I dislike the pod protocol per se; it is just that it reduces the shared semantics between both systems to imperative object-oriented programming with RPC for method invocation, while Datahike shares the same memory semantics as Clojure/Babashka, and that is the main reason for its existence from my pov.