Thank you for datalevin 🙏 - I'm using it for a KV store of entity registrations (in Australia, business numbers and business names) and I'm exploring using the search engine, I'm wondering if there's a way to speed up (add-doc) - ie is it safe to run a pmap or something to parallelise this? I'm trying to add about 18 million documents. Thanks for your advice
You can run pmap or whatever to try to parallelize, but it won't do any good. This is a single-writer DB. What you can do is turn on :nosync to load your docs, and then turn it off after you are done. Indexing 18 million documents will take a long time no matter what tool you use.
Look at the search benchmark to get an idea https://github.com/juji-io/datalevin/tree/master/benchmarks/search-bench
You should also allocate a large enough DB size upfront to save yourself from dynamic resizing, which will be very slow.
thank you - I'll have a look at the search benchmark, and allocating the right DB size upfront - very helpful 🎉
The Wikipedia data set is only 4.1 million articles, and it took almost 40 minutes and produced a 60GB file. You should expect a much longer time, as the B-tree becomes slower as data volume increases.
ok. the documents in my index are very small - just entity names (so typically < 200 characters), perhaps 4 words on average
:nosync is your best bet
ok
If you are on Linux, :writemap and :mapasync together are faster, but if you are on Mac or Windows, :nosync is best
yep linux - so (d/set-env-flags db :writemap true :mapasync true) before indexing, and
(d/set-env-flags db :writemap false :mapasync false)
(d/commit search-writer)
(d/close-kv db)))
when done?
wrong syntax
oh sorry
(d/set-env-flags db #{:writemap :mapasync} true)
(d/set-env-flags db [:writemap :mapasync] true)
ok thank you
vector or set doesn't matter
commit before setting the flags back
because that flag affects commit
I think :writemap has to be specified when opening the DB, it's not something changeable dynamically
the file is either open as a read only mmap, or as a writable map
ah ok
I close the db after indexing, then open it again later for (read only) usage, so perhaps I'll just use the flags in open-kv and then I don't need to use them with set-env-flags
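To pull the advice above together, here is a minimal sketch of a single-threaded bulk load using the :nosync route (the one that can be toggled with set-env-flags). The helper name `bulk-index!`, the `mapsize-mib` parameter and the shape of `docs` are hypothetical; it also assumes `open-kv` takes `:mapsize` in MiB in its option map, so check that against the version you are running.

```clojure
(require '[datalevin.core :as d])

(defn bulk-index!
  "Hypothetical helper: writes [doc-ref text] pairs to the search index
  stored in `dir`, from a single thread, then closes the DB."
  [dir mapsize-mib docs]
  ;; Assumption: :mapsize is in MiB. Allocating it upfront avoids
  ;; dynamic resizing during the load.
  (let [db     (d/open-kv dir {:mapsize mapsize-mib})
        writer (d/search-index-writer db)]
    (try
      ;; Turn :nosync on for the duration of the load.
      (d/set-env-flags db #{:nosync} true)
      ;; Plain doseq, not pmap: the writer is used from one thread only.
      (doseq [[doc-ref text] docs]
        (d/write writer doc-ref text))
      ;; Commit first, then set the flag back, since the flag affects commit.
      (d/commit writer)
      (d/set-env-flags db #{:nosync} false)
      (finally
        (d/close-kv db)))))
```

Usage would look something like `(bulk-index! "/data/abn-search" 100000 [[1 "Acme Plumbing Pty Ltd"] ...])`, where the path and size are made up. For the :writemap and :mapasync route on Linux, the flags would instead go into the `:flags` option of `open-kv`, since :writemap cannot be changed after opening.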
{:clojure.main/message
"Execution error (ArrayIndexOutOfBoundsException) at java.lang.System/arraycopy (System.java:-2).\narraycopy: length -7 is negative\n",
:clojure.main/triage
{:clojure.error/class java.lang.ArrayIndexOutOfBoundsException,
:clojure.error/line -2,
:clojure.error/cause "arraycopy: length -7 is negative",
:clojure.error/symbol java.lang.System/arraycopy,
:clojure.error/source "System.java",
:clojure.error/phase :execution},
:clojure.main/trace
{:via
[{:type java.util.concurrent.ExecutionException,
:message
"java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: length -7 is negative",
:at [java.util.concurrent.FutureTask report "FutureTask.java" 122]}
{:type java.util.concurrent.ExecutionException,
:message
"java.lang.ArrayIndexOutOfBoundsException: arraycopy: length -7 is negative",
:at [java.util.concurrent.FutureTask report "FutureTask.java" 122]}
{:type java.lang.ArrayIndexOutOfBoundsException,
:message "arraycopy: length -7 is negative",
:at [java.lang.System arraycopy "System.java" -2]}],
:trace
[[java.lang.System arraycopy "System.java" -2]
[datalevin.utl.GrowingIntArray insert "GrowingIntArray.java" 42]
[datalevin.sparselist.SparseIntArrayList set "sparselist.clj" 38]
[datalevin.search.IndexWriter write "search.clj" 1147]
[datalevin.search$eval14045$fn__14046$G__14035__14050
invoke
"search.clj"
1106]
[datalevin.search$eval14045$fn__14046$G__14034__14055
invoke
"search.clj"
1106]
[psithur.database_abn.core$update_tables$fn__23583
invoke
"core.clj"
119]
[clojure.core$pmap$fn__8552$fn__8553 invoke "core.clj" 7089]
[clojure.core$binding_conveyor_fn$fn__5823 invoke "core.clj" 2047]
[clojure.lang.AFn call "AFn.java" 18]
[java.util.concurrent.FutureTask run "FutureTask.java" 317]
[java.util.concurrent.ThreadPoolExecutor
runWorker
"ThreadPoolExecutor.java"
1144]
[java.util.concurrent.ThreadPoolExecutor$Worker
run
"ThreadPoolExecutor.java"
642]
[java.lang.Thread run "Thread.java" 1583]],
:cause "arraycopy: length -7 is negative"}}
Hmm
(trying without pmap now)
that looks to be working 👍
Nevertheless, filed an issue for this exception. https://github.com/juji-io/datalevin/issues/315