hi there, I'm trying to batch ingest a collection of tag strings into a standalone search engine, but I'm running into what looks like an LMDB error:
Execution error (ExceptionInfo) at datalevin.binding.cpp.CppLMDB/transact_kv (cpp.clj:1070).
Fail to transact to LMDB: #error {
:cause "MDB_BAD_VALSIZE: Unsupported size of key/DB name/data, or wrong DUPFIXED size"
:via
[{:type datalevin.cpp.Util$DTLVException
:message "MDB_BAD_VALSIZE: Unsupported size of key/DB name/data, or wrong DUPFIXED size"
:at [datalevin.cpp.Util checkRc "Util.java" 63]}]
can I please get some help debugging this, as I'm not familiar with lmdb? i'll add more details in 🧵
this is the code I'm running in a local repl:
(let [temp-dir (str (System/getProperty "java.io.tmpdir") "/datalevin-test-" (System/currentTimeMillis))
      lmdb (d/open-kv temp-dir {:flags
                                (conj c/default-env-flags :nosync)})
      engine (s/new-search-engine
              lmdb {:analyzer autocomplete-analyzer
                    :query-analyzer autocomplete-query-time-analyzer
                    :include-text? true})
      idxed-tags (->> (mapcat :recipe/tags converted-recipes)
                      (mapv :tag/name)
                      (map-indexed vector))]
  (doseq [[tag-idx tag] idxed-tags]
    (try
      (d/add-doc engine tag-idx tag false)
      (catch Exception e
        (println (str "error trying to add-doc tag-idx: " tag-idx ", tag: " tag ", err msg: " (.getMessage e)))
        (throw e))))
  (d/close-kv lmdb))
this is the full error I'm getting when trying to ingest the tags:
error trying to add-doc tag-idx: 46516, tag: american, err msg: Fail to transact to LMDB: #error {
:cause "MDB_BAD_VALSIZE: Unsupported size of key/DB name/data, or wrong DUPFIXED size"
:via
[{:type datalevin.cpp.Util$DTLVException
:message "MDB_BAD_VALSIZE: Unsupported size of key/DB name/data, or wrong DUPFIXED size"
:at [datalevin.cpp.Util checkRc "Util.java" 63]}]
:trace
[[datalevin.cpp.Util checkRc "Util.java" 63]
[datalevin.cpp.Dbi put "Dbi.java" 56]
[datalevin.binding.cpp.DBI put "cpp.clj" 314]
[datalevin.binding.cpp.DBI put "cpp.clj" 315]
[datalevin.binding.cpp$put_tx invokeStatic "cpp.clj" 633]
[datalevin.binding.cpp$put_tx invoke "cpp.clj" 626]
[datalevin.binding.cpp$transact_STAR_ invokeStatic "cpp.clj" 660]
[datalevin.binding.cpp$transact_STAR_ invoke "cpp.clj" 653]
[datalevin.binding.cpp.CppLMDB transact_kv "cpp.clj" 1054]
[datalevin.binding.cpp.CppLMDB transact_kv "cpp.clj" 1037]
[datalevin.binding.cpp.CppLMDB transact_kv "cpp.clj" 1035]
[datalevin.search$add_doc_STAR_ invokeStatic "search.clj" 930]
[datalevin.search$add_doc_STAR_ invoke "search.clj" 887]
[datalevin.search.SearchEngine add_doc "search.clj" 633]
[datalevin.interface$eval46479$fn__46528$G__46461__46536 invoke "interface.clj" 274]
[datalevin.interface$eval46479$fn__46528$G__46460__46545 invoke "interface.clj" 274]
[datalevin_db.main$eval73633$fn__73641 invoke "main.clj" 376]
[datalevin_db.main$eval73633 invokeStatic "main.clj" 375]
[datalevin_db.main$eval73633 invoke "main.clj" 364]
[clojure.lang.Compiler eval "Compiler.java" 7757]
[cider.nrepl.middleware.util.eval$eval_dispatcher$fn__42725 invoke "eval.clj" 15]
[nrepl.middleware.interruptible_eval$evaluator$run__35316$fn__35330 invoke "interruptible_eval.clj" 130]
[nrepl.middleware.interruptible_eval$evaluator$run__35316 invoke "interruptible_eval.clj" 128]
[nrepl.middleware.session$session_exec$session_loop__35421 invoke "session.clj" 251]
[nrepl.SessionThread run "SessionThread.java" 21]]}
this is a sample of the entities that I'm extracting the tags from (some recipes scraped from the web):
(#:recipe{:author #:author{:name "G. Stephen Jones", :normalizedName "g. stephen jones"},
:cookTime 15,
:date-published "2006/05/10",
:description "",
:domain #:domain{:brand "reluctantgourmet"},
:ingredients ["1 egg" "2 teaspoons Worcestershire sauce" "¼ teaspoon dry mustard" "2 tablespoons mayonnaise"
"1 teaspoon lemon juice" "1 tablespoon Dijon mustard" "1 tablespoon olive oil"
"1 teaspoon dried parsley flakes" "1 teaspoon Old Bay Seasoning" "¾ cup breadcrumbs"
"16 oz. lump crab meat"],
:name "Crab Cakes Recipe",
:normalizedName "crab cakes recipe",
:num-ingredients 11,
:prepTime 15,
:ratingCount 5,
:ratingValue 4.8,
:reviewCount 4,
:tags [#:tag{:name "american"} #:tag{:name "main course"}],
:tags-group "american main course",
:totalTime 30,
:url " "}
#:recipe{:cookTime 25,
:date-published "2022/08/19",
:description
"Need something really simple for a party? Try these easy, cheesy, crispy Greek mini feta cheese pies (AKA tiropitakia!) made with just filo pastry, feta cheese and a few other simple ingredients. Make them ahead, stash them in the freezer, and you're ready to go!",
:domain #:domain{:brand "scrummylane"},
:ingredients
["10.5 ounces feta cheese"
"½ cup Greek yogurt (Substitute with ricotta cheese, more cream cheese, cottage cheese or even sour cream. )"
"½ cup cream cheese"
"β
cup parmesan cheese (Grated, about a large handful. Substitute with any grated cheese (strong flavored is best). )"
"1 egg (Lightly whisked)" "¼ teaspoon nutmeg" "¼ teaspoon black pepper"
"2 tablespoons fresh mint (chopped, or dill, or 2 teaspoons dried oregano (optional))"
"9 ounces filo pastry" "β
cup olive oil"
"sesame seeds (Optional, for sprinkling over just before baking.)"],
:name "Tiropitakia (Mini Greek Feta Cheese Pies)",
:normalizedName "tiropitakia (mini greek feta cheese pies)",
:num-ingredients 11,
:prepTime 15,
:tags [#:tag{:name "greek"} #:tag{:name "appetizer"} #:tag{:name "cheese pies"} #:tag{:name "filo pies"}
#:tag{:name "tiropita"} #:tag{:name "tiropitakia"}],
:tags-group "greek appetizer cheese pies filo pies tiropita tiropitakia",
:totalTime 40,
:url " "})
this is a small sample of the raw tag strings formatted in a way to use with add-doc:
([0 "american"]
[1 "main course"]
[2 "greek"]
[3 "appetizer"]
[4 "cheese pies"]
[5 "filo pies"]
[6 "tiropita"]
[7 "tiropitakia"]
[8 "canadian"]
[9 "dessert"]
[10 "snack"]
[11 "no bake puffed wheat squares"]
[12 "puffed wheat squares"]
[13 "dinner"]
[14 "american"]
[15 "dessert"]
[16 "cupcake"])
these are the analyzers that i've attached to the search engine:
(def autocomplete-analyzer (sut/create-analyzer {:tokenizer (inter-fn [s] (a/en-analyzer s))
                                                 :token-filters [sut/prefix-token-filter]}))
(def autocomplete-query-time-analyzer (sut/create-analyzer
                                       {:tokenizer (sut/create-regexp-tokenizer #"[\s:/\.;,!=?\"'()\[\]{}|<>&@#^*\\~`\-]+")
                                        :token-filters [sut/lower-case-token-filter
                                                        sut/unaccent-token-filter]}))
i wasn't able to reproduce the error on a smaller scale; it seems to only happen after ingesting 20k or more tag strings. Here's a gist with enough tags to reach the idx where the err was thrown on my local machine: https://gist.github.com/nkad129/61c344f72d5a5d79026cd34fe56c470a. I'm not sure whether something is getting corrupted in the search store
this is in the context of a similar problem I've been narrowing down: I've been trying to ingest some recipe entities into a dtlv datalog store that has search enabled on some of its fields, and the lmdb errs start happening during a batch ingest. I narrowed it down to fields that have :db/fulltext true, especially ones that use the autocomplete/prefix analyzers. When I remove all the fulltext attributes the batch ingest goes well. So I've been trying to get a batch ingest working for a standalone search engine first, then I'll try again with the datalog store
i'm currently using datalevin 0.10.5 if that helps
Please file a GitHub issue.
filed: https://github.com/datalevin/datalevin/issues/355, lmk how/if I can help
This looks like a prefix-compression issue again.
A fix will be in next release.
The fix is in the master branch. Please let me know if it is not fixed.
yep the fix worked, thank you for the fast turnarounds! I was trying to read the fix you made in C but haven't touched that language since college
LMDB is not the easiest code base to work in, but the performance is hard to beat. Thank you for the bug report!
i'm running into a different issue now. i'm able to batch ingest and create a new datalog db with the tags field using prefix tokenization, but having trouble reading it in using get-conn without a jvm oom crash in my repl, specifically:
Execution error (OutOfMemoryError) at org.roaringbitmap.RoaringArray/deserialize (RoaringArray.java:601).
Java heap space
the size of the db using du -sh is ~5.8G while ls -s gives me:
ls -s
total 950960
950816 data.mdb 136 lock.mdb 0 snapshots 0 txlog 8 VERSION
The jvm ooms kept happening until I increased JVM heap allocation from 8G -> 20G like:
:jvm-opts ["--add-opens" "java.management/sun.management=ALL-UNNAMED" "-Xmx20G" "-Xms2G"]
is that expected and I should stick with large JVM heaps? I can file that as another issue
Interesting, needing 20G heap for such a small DB is excessive.
You are using prefix-token-filter, right? https://cljdoc.org/d/datalevin/datalevin/0.10.5/api/datalevin.search-utils#prefix-token-filter says that "This is useful for producing efficient autocomplete engines, provided this filter is NOT applied at query time."
You may want to exclude single-letter prefixes, as they will have huge posting lists that can blow up the heap like that.
Use something like
create-min-length-token-filter
so it will not blow up your heap.
yeah just using prefix-token-filter, these ingest and search time analyzers:
(def autocomplete-analyzer (sut/create-analyzer {:tokenizer (inter-fn [s] (a/en-analyzer s))
:token-filters [sut/prefix-token-filter]}))
(def autocomplete-query-time-analyzer (sut/create-analyzer
{:tokenizer (sut/create-regexp-tokenizer #"[\s:/\.;,!=?\"'()\[\]{}|<>&@#^*\\~`\-]+")
:token-filters [sut/lower-case-token-filter
sut/unaccent-token-filter]}))
"You may want to exclude single letter prefix, as they will have huge post list that can blow up heap like that"
ok let me try that. In a real app i wouldn't send autocomplete requests on 1 letter queries
"Interesting, needing 20G heap for such a small DB is excessive"
yeah i was surprised by this too. I'm not familiar with roaring bitmaps but it's possible that it's trying to deserialize really large objects on the heap? edit: oh yeah that'll make sense with the huge post list detail you added
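To make the posting-list point concrete, here is a rough sketch in plain Clojure (no Datalevin calls) of what prefix expansion does to a handful of the sample tags. The helpers `prefixes` and `prefix-doc-counts` are hypothetical names made up for illustration, and counting docs per prefix is only a stand-in for posting-list size, not a description of how Datalevin stores anything.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: every leading substring of a token,
;; e.g. "greek" -> ("g" "gr" "gre" "gree" "greek").
(defn prefixes [token]
  (map #(subs token 0 %) (range 1 (inc (count token)))))

;; Hypothetical helper: for each prefix of at least `min-len` characters,
;; count how many docs contain a token starting with it.
(defn prefix-doc-counts [docs min-len]
  (->> docs
       (mapcat (fn [doc]
                 (->> (str/split doc #"\s+")
                      (mapcat prefixes)
                      (filter #(>= (count %) min-len))
                      distinct)))
       frequencies))

(def sample-tags
  ["american" "main course" "greek" "appetizer" "cheese pies"
   "filo pies" "tiropita" "tiropitakia" "canadian" "dessert"])

(def all-prefixes (prefix-doc-counts sample-tags 1))
(def no-singles   (prefix-doc-counts sample-tags 2))

(get all-prefixes "a") ; => 2 ("american" and "appetizer")
(get no-singles "a")   ; => nil, single-letter prefixes are dropped
```

With tens of thousands of tags, a one-letter prefix like "a" would match a large share of all docs, which fits the huge posting lists described above; a minimum length of 2 removes those entries but still leaves fairly broad two-letter prefixes.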
i updated the ingest analyzer to take out the single letters:
(def autocomplete-analyzer (sut/create-analyzer {:tokenizer (inter-fn [s] (a/en-analyzer s))
:token-filters [sut/prefix-token-filter
(sut/create-min-length-token-filter 2)]}))
resulting general file size dropped:
du -sh
3.4G .
ls -s also shows a bit lower:
ls -sh
total 918192
918048 data.mdb 136 lock.mdb 0 txlog 8 VERSION
not sure if these are that useful
8G heap didn't work, got
Execution error (OutOfMemoryError) at me.lemire.integercompression.IntCompressor/uncompress (IntCompressor.java:56).
Java heap space
but 12G heap did work
Taking out the 2-letter prefixes as well would probably do. Yes, the posting list of a term is stored as a roaring bitmap and a compressed integer list; these are fetched as a binary blob in one shot and deserialized in memory. For a huge posting list, that needs a lot of memory. We can probably switch to an off-heap deserialization path to reduce heap pressure.
I will file this as an enhancement.
darn, identical err still happening when I re-run the code. Just to check that I pulled the commit the master branch is pointing to, I set in my deps.edn:
datalevin/datalevin {:git/url ""
:git/sha "c0803e7de905cf528cb23cfd4153d9162f8932fa"}
then ran
clj -X:deps prep
Checking out: at c0803e7de905cf528cb23cfd4153d9162f8932fa
Prepping datalevin/datalevin in /Users/nkad129/.gitlibs/libs/datalevin/datalevin/c0803e7de905cf528cb23cfd4153d9162f8932fa
./src/java/datalevin/cpp/UnsafeAccess.java:6: warning: sun.misc.Unsafe is internal proprietary API and may be removed in a future release
import sun.misc.Unsafe;
^
./src/java/datalevin/cpp/UnsafeAccess.java:12: warning: sun.misc.Unsafe is internal proprietary API and may be removed in a future release
static Unsafe UNSAFE = null;
^
./src/java/datalevin/cpp/UnsafeAccess.java:18: warning: sun.misc.Unsafe is internal proprietary API and may be removed in a future release
final Field u = Unsafe.class.getDeclaredField("theUnsafe");
^
./src/java/datalevin/cpp/UnsafeAccess.java:20: warning: sun.misc.Unsafe is internal proprietary API and may be removed in a future release
UNSAFE = (Unsafe) u.get(null);
^
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Then ran the same code mentioned above with the same strings collection in a fresh Calva REPL, and received the same err
i'll re-open the issue or file another issue if that looks correct
ok, I will reopen and dig more
The master branch should have a fix. Please let me know if there are other problems. Thanks.
awesome, thank you for all the help
Hi. I am trying to upgrade to 0.10.* from 0.9.27 and consistently seeing an MDB_PAGE_FULL error. I'm on Java 25, Clojure 1.12.4, datalevin 0.10.5. The same code, with the same data works fine with 0.9.27.
Root: clojure.lang.ExceptionInfo - Fail to transact to LMDB: #error {
:cause "MDB_PAGE_FULL: Internal error - page has no more space"
:via
[{:type datalevin.cpp.Util$DTLVException
:message "MDB_PAGE_FULL: Internal error - page has no more space"
:at [datalevin.cpp.Util checkRc "Util.java" 63]}]
:trace
[[datalevin.cpp.Util checkRc "Util.java" 63]
[datalevin.cpp.Dbi put "Dbi.java" 56]
[datalevin.binding.cpp.DBI put "cpp.clj" 314]
[datalevin.binding.cpp.DBI put "cpp.clj" 315]
[datalevin.binding.cpp$put_tx invokeStatic "cpp.clj" 633]
[datalevin.binding.cpp$put_tx invoke "cpp.clj" 626]
[datalevin.binding.cpp$transact1_STAR_ invokeStatic "cpp.clj" 651]
[datalevin.binding.cpp$transact1_STAR_ invoke "cpp.clj" 648]
[datalevin.binding.cpp.CppLMDB transact_kv "cpp.clj" 1053]
[datalevin.interface$eval8059$fn__8969$G__8025__8984 invoke "interface.clj" 90]
[datalevin.interface$eval8059$fn__8969$G__8024__9000 invoke "interface.clj" 90]
Root stack trace:
datalevin.binding.cpp.CppLMDB/transact_kv at cpp.clj:1070
datalevin.interface$eval8059$fn__8969$G__8025__8984/invoke at interface.clj:90
datalevin.interface$eval8059$fn__8969$G__8024__9000/invoke at interface.clj:90
The calling code looks like:
(d/transact-kv *kv*
index
[[:put (key->bytes k) (value->bytes v)]]
:bytes :bytes)
Any ideas?
Thanks. I'll try to see if I can put together a minimal example.
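One possible starting point for a minimal example, sketched under assumptions: `replay-puts` is a hypothetical helper (not part of Datalevin) that replays the same single-put `transact-kv` call shown above against a fresh directory and reports the first pair that fails, along with its byte lengths. It assumes `kvs` is a sequence of `[key-bytes value-bytes]` pairs captured from the real data; nothing here is known to trigger the error by itself.

```clojure
(require '[datalevin.core :as d])

;; Hypothetical helper, not part of Datalevin: replays puts one at a time
;; and rethrows on the first failure with the index and byte lengths attached.
(defn replay-puts
  [dir dbi-name kvs]
  (let [kv (d/open-kv dir)]
    (try
      (d/open-dbi kv dbi-name)
      (doseq [[i [k v]] (map-indexed vector kvs)]
        (try
          ;; Same call shape as in the snippet above: raw byte keys and values.
          (d/transact-kv kv dbi-name [[:put k v]] :bytes :bytes)
          (catch Exception e
            (throw (ex-info "put failed"
                            {:index       i
                             :key-bytes   (count k)
                             :value-bytes (count v)}
                            e)))))
      (finally
        (d/close-kv kv)))))
```

If the failure only shows up after many writes, the reported `:index` at least gives a cut-off point for trimming the data set before sharing it.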
Please file a github issue with a link to the DB.
Please test if the master branch fixes this.
If you can share a minimal DB that produces this error, we can take a look. It looks like an edge case in prefix compression.
If not, the next release may have a fix.
Hi @huahaiy Yes! The master branch fixes the issue. Thank you very much!