datalevin

xlfe 2025-02-21T00:31:21.069129Z

Thank you for datalevin 🙏 - I'm using it as a KV store of entity registrations (in Australia: business numbers and business names), and I'm exploring the search engine. Is there a way to speed up (add-doc), i.e. is it safe to run pmap or something similar to parallelise this? I'm trying to add about 18 million documents. Thanks for your advice

Huahai 2025-02-21T00:43:07.141619Z

You can run pmap or whatever to try to parallelize, but it won't do any good. This is a single-writer DB. What you can do is turn on :nosync to load your docs, and then turn it off after you are done. Indexing 18 million documents will take a long time no matter what tool you use.

Huahai 2025-02-21T00:44:41.090659Z

Look at the search benchmark to get an idea https://github.com/juji-io/datalevin/tree/master/benchmarks/search-bench

Huahai 2025-02-21T00:45:16.855939Z

You should also allocate a large enough DB size upfront to save yourself from dynamic resizing, which will be very slow.

xlfe 2025-02-21T00:46:36.654759Z

thank you - I'll have a look at the search benchmark, and allocating the right DB size upfront - very helpful 🎉

Huahai 2025-02-21T00:48:50.115479Z

The Wikipedia data set is only 4.1 million articles, and it took almost 40 minutes and produced a 60GB file. You should expect a much longer time, as the B-tree becomes slower as data volume increases.

xlfe 2025-02-21T00:49:46.159419Z

ok. the documents in my index are very small - just entity names (so typically < 200 characters), perhaps 4 words on average

Huahai 2025-02-21T00:49:48.553679Z

:nosync is your best bet

Huahai 2025-02-21T00:49:57.408959Z

ok

Huahai 2025-02-21T00:50:55.483869Z

If you are on Linux, :writemap and :mapasync together is faster, but if you are on Mac or Windows, :nosync is best

xlfe 2025-02-21T00:52:27.277499Z

yep linux - so (d/set-env-flags db :writemap true :mapasync true) before indexing, and

(d/set-env-flags db :writemap false :mapasync false)
    (d/commit search-writer)
    (d/close-kv db)))
when done?

Huahai 2025-02-21T00:52:41.919809Z

wrong syntax

xlfe 2025-02-21T00:53:14.008629Z

oh sorry

(d/set-env-flags db #{:writemap :mapasync} true)

Huahai 2025-02-21T00:53:29.726509Z

(d/set-env-flags db [:writemap :mapasync] true)

xlfe 2025-02-21T00:53:39.710489Z

ok thank you

Huahai 2025-02-21T00:53:54.775739Z

vector or set doesn't matter

Huahai 2025-02-21T00:55:06.720429Z

commit before setting the flags back

Huahai 2025-02-21T00:55:34.145489Z

because that flag affects commit
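
A minimal sketch of that ordering, using :nosync as the toggled flag. The function name bulk-index! and the shape of docs (a sequence of [doc-ref text] pairs) are made up for illustration; the d/ calls are the ones mentioned in this thread plus d/search-index-writer and d/write, which are assumed to behave as in the Datalevin search docs.

```clojure
(require '[datalevin.core :as d])

;; Hypothetical helper: `db` is an already opened KV store,
;; `docs` is a seq of [doc-ref text] pairs (assumed shape).
(defn bulk-index!
  [db docs]
  (let [writer (d/search-index-writer db)]
    ;; Relax durability only for the duration of the bulk load.
    (d/set-env-flags db #{:nosync} true)
    ;; Single writer: write sequentially, no pmap.
    (doseq [[doc-ref text] docs]
      (d/write writer doc-ref text))
    ;; Commit while the fast flag is still on ...
    (d/commit writer)
    ;; ... and only then restore the default behaviour.
    (d/set-env-flags db #{:nosync} false)))
```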

Huahai 2025-02-21T00:56:34.104369Z

I think :writemap has to be specified when opening the DB, it's not something changeable dynamically

Huahai 2025-02-21T00:57:30.301559Z

the file is either opened as a read-only mmap, or as a writable map

xlfe 2025-02-21T00:57:55.455499Z

ah ok

xlfe 2025-02-21T00:58:44.141059Z

I close the db after indexing, then open it again later for (read only) usage, so perhaps I'll just use the flags in open-kv and then I don't need to use them with set-env-flags
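
A rough sketch of that plan, not a tested recipe. The path, the :mapsize value (taken here to be in MiB) and the sample names are placeholders, and passing the flags as a set under :flags in the open-kv options is an assumption about the options map. It may also replace the default flag set rather than add to it.

```clojure
(require '[datalevin.core :as d])

;; Placeholder data: [doc-ref text] pairs standing in for entity names.
(def sample-docs
  [[1 "ACME PTY LTD"]
   [2 "EXAMPLE TRADING CO"]])

(let [db     (d/open-kv "/tmp/entity-search"            ; placeholder path
                        {:mapsize 50000                  ; placeholder size, allocated upfront
                         :flags   #{:writemap :mapasync}}) ; Linux only, per the advice above
      writer (d/search-index-writer db)]
  ;; Plain sequential loop: this is a single-writer DB.
  (doseq [[doc-ref text] sample-docs]
    (d/write writer doc-ref text))
  (d/commit writer)
  ;; Closing here means the flags never need to be reset with set-env-flags.
  (d/close-kv db))
```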

xlfe 2025-02-21T01:00:11.768589Z

{:clojure.main/message
 "Execution error (ArrayIndexOutOfBoundsException) at java.lang.System/arraycopy (System.java:-2).\narraycopy: length -7 is negative\n",
 :clojure.main/triage
 {:clojure.error/class java.lang.ArrayIndexOutOfBoundsException,
  :clojure.error/line -2,
  :clojure.error/cause "arraycopy: length -7 is negative",
  :clojure.error/symbol java.lang.System/arraycopy,
  :clojure.error/source "System.java",
  :clojure.error/phase :execution},
 :clojure.main/trace
 {:via
  [{:type java.util.concurrent.ExecutionException,
    :message
    "java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: length -7 is negative",
    :at [java.util.concurrent.FutureTask report "FutureTask.java" 122]}
   {:type java.util.concurrent.ExecutionException,
    :message
    "java.lang.ArrayIndexOutOfBoundsException: arraycopy: length -7 is negative",
    :at [java.util.concurrent.FutureTask report "FutureTask.java" 122]}
   {:type java.lang.ArrayIndexOutOfBoundsException,
    :message "arraycopy: length -7 is negative",
    :at [java.lang.System arraycopy "System.java" -2]}],
  :trace
  [[java.lang.System arraycopy "System.java" -2]
   [datalevin.utl.GrowingIntArray insert "GrowingIntArray.java" 42]
   [datalevin.sparselist.SparseIntArrayList set "sparselist.clj" 38]
   [datalevin.search.IndexWriter write "search.clj" 1147]
   [datalevin.search$eval14045$fn__14046$G__14035__14050
    invoke
    "search.clj"
    1106]
   [datalevin.search$eval14045$fn__14046$G__14034__14055
    invoke
    "search.clj"
    1106]
   [psithur.database_abn.core$update_tables$fn__23583
    invoke
    "core.clj"
    119]
   [clojure.core$pmap$fn__8552$fn__8553 invoke "core.clj" 7089]
   [clojure.core$binding_conveyor_fn$fn__5823 invoke "core.clj" 2047]
   [clojure.lang.AFn call "AFn.java" 18]
   [java.util.concurrent.FutureTask run "FutureTask.java" 317]
   [java.util.concurrent.ThreadPoolExecutor
    runWorker
    "ThreadPoolExecutor.java"
    1144]
   [java.util.concurrent.ThreadPoolExecutor$Worker
    run
    "ThreadPoolExecutor.java"
    642]
   [java.lang.Thread run "Thread.java" 1583]],
  :cause "arraycopy: length -7 is negative"}}

xlfe 2025-02-21T01:00:16.470179Z

Hmm

xlfe 2025-02-21T01:01:15.543759Z

(trying without pmap now)

xlfe 2025-02-21T01:07:25.453249Z

that looks to be working 👍

Huahai 2025-02-21T20:21:31.202889Z

Nevertheless, filed an issue for this exception. https://github.com/juji-io/datalevin/issues/315