This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2022-11-15
Channels
- # aleph (24)
- # announcements (8)
- # babashka (27)
- # beginners (55)
- # biff (4)
- # calva (32)
- # cider (5)
- # clj-kondo (11)
- # clojure (59)
- # clojure-android (3)
- # clojure-australia (1)
- # clojure-belgium (6)
- # clojure-dev (21)
- # clojure-europe (26)
- # clojure-nl (1)
- # clojure-norway (17)
- # clojurescript (19)
- # css (1)
- # data-science (10)
- # datahike (17)
- # events (3)
- # figwheel-main (4)
- # honeysql (1)
- # hugsql (5)
- # hyperfiddle (1)
- # jobs (1)
- # leiningen (3)
- # lsp (6)
- # malli (5)
- # meander (4)
- # nbb (6)
- # off-topic (87)
- # pathom (19)
- # portal (2)
- # re-frame (4)
- # reitit (6)
- # releases (1)
- # remote-jobs (3)
- # shadow-cljs (29)
- # sql (8)
- # tools-deps (6)
- # xtdb (7)
Hi, I would like some suggestions on the following. I have a data structure similar to this:
```clojure
[{:db/id 10 :record/rows [20 30 40 50]}
 {:db/id 60 :record/rows [70 80]}
 {:db/id 20 :row/name "name1" :row/valid true :row/value "value1"}
 {:db/id 30 :row/name "name2" :row/valid false :row/value "value2"}
 {:db/id 40 :row/name "name3" :row/valid false :row/value "value3"}
 {:db/id 70 :row/name "name4" :row/valid true :row/value "value4"}
 {:db/id 80 :row/name "name5" :row/valid false :row/value "value5"}]
```
With several hundred thousand maps. I need to query this data structure to obtain, for example: "Get the :row/name of rows that have :row/valid false, grouped by :record/rows", or "All rows that are in the same record as the row whose :row/value is "value4"".
I can do this with a datalog query:
```clojure
(d/q '[:find (pull ?r [* {:record/rows [:row/name]}])
       :where
       [?e :row/valid false]
       [?r :record/rows ?e]]
     @conn)
;; => [[{:db/id 60, :record/rows [#:row{:name "name4"} #:row{:name "name5"}]}]
;;     [{:db/id 10, :record/rows [#:row{:name "name1"} #:row{:name "name2"} #:row{:name "name3"}]}]]

(d/q '[:find (pull ?r [* {:record/rows [:row/name]}])
       :where
       [?e :row/value "value4"]
       [?r :record/rows ?e]]
     @conn)
;; => [[{:db/id 60, :record/rows [#:row{:name "name4"} #:row{:name "name5"}]}]]
```
But unfortunately Datahike's performance when inserting thousands of rows is not great (around 4 minutes per 100,000 rows).
The intention is to use this as an internal data structure for a migration tool (from XML files to an SQL database), so the Datahike data is not meant to be permanent; I'm currently using it with an in-memory store.
Is it worth using Datahike for this purpose, or should I instead use something more lightweight, like pure Clojure (by manipulating maps), [specter](https://github.com/redplanetlabs/specter), or [meander](https://github.com/noprompt/meander)?
With pure Clojure, it might require several functions to replicate the queries, and the other two libraries have a learning curve.
I like the fact that with Datahike the queries are concise. Any suggestions would be appreciated.
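For what it's worth, the pure-Clojure route may be less work than it looks once the maps are indexed by :db/id. A rough sketch of both queries against the sample data from above (the function names here are made up for illustration, not from any library):

```clojure
(def data
  [{:db/id 10 :record/rows [20 30 40 50]}
   {:db/id 60 :record/rows [70 80]}
   {:db/id 20 :row/name "name1" :row/valid true :row/value "value1"}
   {:db/id 30 :row/name "name2" :row/valid false :row/value "value2"}
   {:db/id 40 :row/name "name3" :row/valid false :row/value "value3"}
   {:db/id 70 :row/name "name4" :row/valid true :row/value "value4"}
   {:db/id 80 :row/name "name5" :row/valid false :row/value "value5"}])

;; Index every map by :db/id once; lookups are then O(1).
(def by-id (into {} (map (juxt :db/id identity)) data))

(def records (filter :record/rows data))

;; Names of invalid rows, grouped per record id.
(defn invalid-names-by-record []
  (into {}
        (keep (fn [{:keys [db/id record/rows]}]
                (let [names (->> rows
                                 (keep by-id)       ; skip dangling row ids
                                 (remove :row/valid)
                                 (mapv :row/name))]
                  (when (seq names)
                    [id names]))))
        records))

;; All rows that share a record with the row whose :row/value is v.
(defn rows-in-same-record [v]
  (for [{:keys [record/rows]} records
        :when (some #(= v (:row/value (by-id %))) rows)
        row-id rows
        :let [row (by-id row-id)]
        :when row]
    row))
```

On the sample data, (invalid-names-by-record) yields {10 ["name2" "name3"], 60 ["name5"]}, matching the first datalog query's names.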
Datahike evolved out of a purely in-memory datalog DB, datascript (https://github.com/tonsky/datascript). You could see if it performs well enough for your migration; with no persistent storage, transactions will have less overhead.
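A minimal sketch of what trying datascript would involve, assuming its standard API; the query carries over unchanged, and the only extra piece is a schema declaring :record/rows as a cardinality-many ref so the join and pull can walk it (entity ids abbreviated from the example above):

```clojure
(require '[datascript.core :as d])

;; :record/rows must be a many-valued ref for the join and pull to work.
(def schema
  {:record/rows {:db/valueType   :db.type/ref
                 :db/cardinality :db.cardinality/many}})

(def conn (d/create-conn schema))

(d/transact! conn
             [{:db/id 10 :record/rows [20 30]}
              {:db/id 20 :row/name "name1" :row/valid true}
              {:db/id 30 :row/name "name2" :row/valid false}])

;; Same datalog query as with Datahike.
(d/q '[:find (pull ?r [* {:record/rows [:row/name]}])
       :where
       [?e :row/valid false]
       [?r :record/rows ?e]]
     @conn)
```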
There is also https://github.com/djjolicoeur/datamaps, but I don't have much experience with it.
You may also want to ask in #CQT1NFF4L and #CJ322KHNX.
@UFTRLDZEW Thanks for the suggestions! I'll take a look at those.
Hey @U0U2W7B71. The work on the persistent-sorted-set index is almost done and should speed up performance quite a bit. Maybe you want to try it out as a beta tester?
@U4GEXTNGZ, sure, I could try that too. Which version of Datahike should I use?
@U0U2W7B71 0.6.1523
By default the persistent-sorted-set index is now used; you should be able to see this when checking (:config @conn). Write performance should be much better now; just try to transact in big batches rather than many small chunks. We will improve this further in the future.
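The batching advice could look something like this in practice (a sketch; transact-in-batches! and the 10,000 chunk size are my own inventions, to be tuned for the data):

```clojure
(require '[datahike.api :as d])

;; Transact in large batches instead of one transaction per map.
;; The 10,000 chunk size is an arbitrary starting point; tune it.
(defn transact-in-batches! [conn maps]
  (doseq [batch (partition-all 10000 maps)]
    (d/transact conn {:tx-data (vec batch)})))
```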
We are also looking into improving query engine performance, so please point out any issues you run into so we know what people need.
Reporting on my performance tests of Datahike 0.6.1523 vs. 0.5.1517, using Datahike configured with a memory store, inserting two collections of 9-entry maps (one with 37210 maps and a second with 2423563 maps, tested 3 times on each version).
Datahike version 0.6.1523:
```
"inserted" 37210
"inserted" 2423563
"Elapsed time: 520334.781115 msecs"
"inserted" 37210
"inserted" 2423563
"Elapsed time: 535225.000643 msecs"
"inserted" 37210
"inserted" 2423563
"Elapsed time: 541459.163161 msecs"
```
Datahike version 0.5.1517:
```
"inserted" 37210
"inserted" 2423563
"Elapsed time: 667409.285893 msecs"
"inserted" 37210
"inserted" 2423563
"Elapsed time: 697995.542002 msecs"
"inserted" 37210
"inserted" 2423563
"Elapsed time: 697827.084664 msecs"
```
Performance has definitely improved. But unfortunately, with version 0.6.1523 I cannot use datahike-jdbc.core.
I get an error:
```
Execution error (IllegalAccessError) at datahike-jdbc.core/eval78578$loading (core.clj:1).
scheme->index does not exist
```
Oh, datahike-jdbc has a dependency on hitchhiker-tree; that means we need to put some work into it. Thanks for bringing this up.
I think https://github.com/replikativ/konserve-jdbc/ already implements the current konserve interface. Is there something missing from it still? My understanding is that datahike-jdbc just wraps it and should be fine then.
I have opened a PR that bumps the versions here: https://github.com/replikativ/konserve-jdbc/pull/7.