#datahike
2022-11-15
nyor.tr21:11:48

Hi, I would like some suggestions on the following. I have a data structure similar to this:

```clojure
[{:db/id 10 :record/rows [20 30 40 50]}
 {:db/id 60 :record/rows [70 80]}
 {:db/id 20 :row/name "name1" :row/valid true :row/value "value1"}
 {:db/id 30 :row/name "name2" :row/valid false :row/value "value2"}
 {:db/id 40 :row/name "name3" :row/valid false :row/value "value3"}
 {:db/id 70 :row/name "name4" :row/valid true :row/value "value4"}
 {:db/id 80 :row/name "name5" :row/valid false :row/value "value5"}]
```
There are several hundred thousand maps like these. I need to query this data structure to obtain, for example: "Get `:row/name` of rows that have `:row/valid` `false`, grouped by `:record/rows`" or "All rows that are in the same record as the row with `:row/value` `value4`". I can do this with a Datalog query:
```clojure
(d/q '[:find (pull ?r [* {:record/rows [:row/name]}])
       :where
       [?e :row/valid false]
       [?r :record/rows ?e]]
     @conn)

;; => [[{:db/id 60, :record/rows [#:row{:name "name4"} #:row{:name "name5"}]}]
;;     [{:db/id 10, :record/rows [#:row{:name "name1"} #:row{:name "name2"} #:row{:name "name3"}]}]]

(d/q '[:find (pull ?r [* {:record/rows [:row/name]}])
       :where
       [?e :row/value "value4"]
       [?r :record/rows ?e]]
     @conn)

;; => [[{:db/id 60, :record/rows [#:row{:name "name4"} #:row{:name "name5"}]}]]
```
But unfortunately Datahike's insert performance is not great: around 4 minutes per 100,000 rows. The intention is to use this as an internal data structure for a migration tool (from XML files to an SQL database), so the Datahike data is not meant to be permanent; I'm currently using it with an in-memory store. Is it worth using Datahike for this purpose, or should I instead use something more lightweight, like pure Clojure (by manipulating maps), [specter](https://github.com/redplanetlabs/specter), or [meander](https://github.com/noprompt/meander)? With pure Clojure it might take several functions to replicate the queries, and the other two libraries have a learning curve. I like that the Datahike queries are concise. Any suggestions would be appreciated.
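For comparison, here is a rough sketch of what the plain-Clojure version might look like, assuming the data keeps the vector-of-maps shape above (`data`, `by-id`, and the function names are hypothetical):

```clojure
(def data
  [{:db/id 10 :record/rows [20 30 40 50]}
   {:db/id 60 :record/rows [70 80]}
   {:db/id 20 :row/name "name1" :row/valid true :row/value "value1"}
   {:db/id 30 :row/name "name2" :row/valid false :row/value "value2"}
   {:db/id 40 :row/name "name3" :row/valid false :row/value "value3"}
   {:db/id 70 :row/name "name4" :row/valid true :row/value "value4"}
   {:db/id 80 :row/name "name5" :row/valid false :row/value "value5"}])

;; Index every map by :db/id for constant-time row lookup.
(def by-id (into {} (map (juxt :db/id identity)) data))

(def records (filter :record/rows data))

;; Equivalent of the first query: records with at least one invalid row,
;; each row narrowed to :row/name (dangling ids like 50 are skipped).
(defn invalid-row-records []
  (for [{id :db/id rows :record/rows} records
        :when (some #(false? (:row/valid (by-id %))) rows)]
    {:db/id id
     :record/rows (into [] (comp (keep by-id)
                                 (map #(select-keys % [:row/name])))
                        rows)}))

;; Equivalent of the second query: all rows in the same record as the
;; row whose :row/value equals v.
(defn rows-in-same-record [v]
  (for [{rows :record/rows} records
        :when (some #(= v (:row/value (by-id %))) rows)]
    (into [] (keep by-id) rows)))
```

Each new question needs another hand-written function like these, which is the trade-off against the one-line Datalog queries.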

respatialized21:11:52

Datahike evolved out of a purely in-memory Datalog DB, DataScript: https://github.com/tonsky/datascript. You could see if it performs well enough for your migration; with no persistent storage, transactions have less overhead.
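For what it's worth, the queries from the question should run nearly unchanged on DataScript, since Datahike keeps its query API close to it. An untested sketch, with `data` being the vector of maps from the question:

```clojure
(require '[datascript.core :as d])

;; :record/rows must be declared as a many-cardinality ref so the
;; pull expression can walk it.
(def schema {:record/rows {:db/valueType   :db.type/ref
                           :db/cardinality :db.cardinality/many}})

(def conn (d/create-conn schema))
(d/transact! conn data)

(d/q '[:find (pull ?r [* {:record/rows [:row/name]}])
       :where
       [?e :row/valid false]
       [?r :record/rows ?e]]
     @conn)
```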

respatialized21:11:52

There is also https://github.com/djjolicoeur/datamaps, but I don't have much experience with it.

respatialized21:11:11

You may also want to ask in #CQT1NFF4L and #CJ322KHNX.

nyor.tr21:11:01

@UFTRLDZEW Thanks for the suggestions! I'll take a look at those.

timo04:11:06

Hey @U0U2W7B71. The work on the persistent-sorted-set is almost done and should speed up performance quite a bit. Maybe you want to try it out as a beta tester?

nyor.tr19:11:04

@U4GEXTNGZ Sure, I could try that too. Which version of Datahike should I use?

whilo20:11:16

By default the persistent-sorted-set index is now used; you should be able to see this when checking `(:config @conn)`. Write performance should be much better now. Just try to transact in big batches rather than many small chunks. We will also improve this in the future.

👍 2
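A minimal sketch of such batching (`all-maps` is hypothetical for the full collection, and the 10,000 chunk size is just a starting point to tune):

```clojure
(require '[datahike.api :as d])

;; Check which index the connection uses:
(:index (:config @conn))
;; => e.g. :datahike.index/persistent-set on recent versions

;; Transact in large batches instead of row by row:
(doseq [batch (partition-all 10000 all-maps)]
  (d/transact conn (vec batch)))
```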
whilo01:11:28

Also we are looking into improving query engine performance, so please point out issues you run into so we can consider what people need.

nyor.tr21:11:25

Reporting on my performance tests of Datahike 0.6.1523 vs. 0.5.1517, using Datahike configured with a memory store. Inserting two collections, one with 37,210 maps and a second with 2,423,563 maps, of 9 items each (tested 3 times on each version):

Datahike version 0.6.1523:

"inserted" 37210
"inserted" 2423563
"Elapsed time: 520334.781115 msecs"

"inserted" 37210
"inserted" 2423563
"Elapsed time: 535225.000643 msecs"

"inserted" 37210
"inserted" 2423563
"Elapsed time: 541459.163161 msecs"
Datahike version 0.5.1517:
"inserted" 37210
"inserted" 2423563
"Elapsed time: 667409.285893 msecs"

"inserted" 37210
"inserted" 2423563
"Elapsed time: 697995.542002 msecs"

"inserted" 37210
"inserted" 2423563
"Elapsed time: 697827.084664 msecs"
Performance has definitely improved.

bananadance 1
nyor.tr21:11:26

But unfortunately, with version 0.6.1523 I cannot use `datahike-jdbc.core`. I get an error: `Execution error (IllegalAccessError) at datahike-jdbc.core/eval78578$loading (core.clj:1). scheme->index does not exist`

1
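For context, the failure happens while loading the namespace, before any store is touched; a datahike-jdbc setup along these lines (connection values hypothetical, config shape as in the datahike-jdbc README) never gets that far:

```clojure
(require '[datahike-jdbc.core])
;; => Execution error (IllegalAccessError) ... scheme->index does not exist

(def cfg {:store {:backend  :jdbc
                  :dbtype   "postgresql"
                  :host     "localhost"
                  :port     5432
                  :user     "alice"
                  :password "secret"
                  :dbname   "migration"}})
```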
timo10:11:50

Ohoh, datahike-jdbc has a dependency on the hitchhiker-tree, which means we need to put some work into it. Thanks for bringing this up.

👍 1
whilo06:11:24

Can we just bump versions?

timo15:11:27

No, unfortunately not. The konserve parts need to be changed as well.

whilo22:12:28

I think https://github.com/replikativ/konserve-jdbc/ already implements the current konserve interface. Is there something missing from it still? My understanding is that datahike-jdbc just wraps it and should be fine then.

whilo23:12:06

I have opened a PR that bumps the versions here: https://github.com/replikativ/konserve-jdbc/pull/7.

👍 1