datalevin

niko 2026-02-28T18:03:50.480439Z

👋 hi there, I'm trying to batch ingest a collection of tag strings into a standalone search engine, but am running into what looks like an lmdb error:

Execution error (ExceptionInfo) at datalevin.binding.cpp.CppLMDB/transact_kv (cpp.clj:1070).
Fail to transact to LMDB: #error {
 :cause "MDB_BAD_VALSIZE: Unsupported size of key/DB name/data, or wrong DUPFIXED size"
 :via
 [{:type datalevin.cpp.Util$DTLVException
   :message "MDB_BAD_VALSIZE: Unsupported size of key/DB name/data, or wrong DUPFIXED size"
   :at [datalevin.cpp.Util checkRc "Util.java" 63]}]
can I please get some help debugging this, as I'm not familiar with lmdb? i'll add more details in 🧵

niko 2026-02-28T18:04:15.201809Z

this is the code I'm running in a local repl:

(let [temp-dir (str (System/getProperty "java.io.tmpdir") "/datalevin-test-" (System/currentTimeMillis))
      lmdb (d/open-kv temp-dir {:flags
                                (conj c/default-env-flags :nosync)})
      engine (s/new-search-engine
              lmdb {:analyzer       autocomplete-analyzer
                    :query-analyzer autocomplete-query-time-analyzer
                    :include-text?  true})
      idxed-tags (->> (mapcat :recipe/tags converted-recipes)
                      (mapv :tag/name)
                      (map-indexed vector))]
  (doseq [[tag-idx tag] idxed-tags]
    (try
      (d/add-doc engine tag-idx tag false)
      (catch Exception e
        (println (str "error trying to add-doc tag-idx: " tag-idx ", tag: " tag ", err msg: " (.getMessage e)))
        (throw e))))
  (d/close-kv lmdb))

niko 2026-02-28T18:04:33.791079Z

this is the full error I'm getting when trying to ingest the tags:

error trying to add-doc tag-idx: 46516, tag: american, err msg: Fail to transact to LMDB: #error {
 :cause "MDB_BAD_VALSIZE: Unsupported size of key/DB name/data, or wrong DUPFIXED size"
 :via
 [{:type datalevin.cpp.Util$DTLVException
   :message "MDB_BAD_VALSIZE: Unsupported size of key/DB name/data, or wrong DUPFIXED size"
   :at [datalevin.cpp.Util checkRc "Util.java" 63]}]
 :trace
 [[datalevin.cpp.Util checkRc "Util.java" 63]
  [datalevin.cpp.Dbi put "Dbi.java" 56]
  [datalevin.binding.cpp.DBI put "cpp.clj" 314]
  [datalevin.binding.cpp.DBI put "cpp.clj" 315]
  [datalevin.binding.cpp$put_tx invokeStatic "cpp.clj" 633]
  [datalevin.binding.cpp$put_tx invoke "cpp.clj" 626]
  [datalevin.binding.cpp$transact_STAR_ invokeStatic "cpp.clj" 660]
  [datalevin.binding.cpp$transact_STAR_ invoke "cpp.clj" 653]
  [datalevin.binding.cpp.CppLMDB transact_kv "cpp.clj" 1054]
  [datalevin.binding.cpp.CppLMDB transact_kv "cpp.clj" 1037]
  [datalevin.binding.cpp.CppLMDB transact_kv "cpp.clj" 1035]
  [datalevin.search$add_doc_STAR_ invokeStatic "search.clj" 930]
  [datalevin.search$add_doc_STAR_ invoke "search.clj" 887]
  [datalevin.search.SearchEngine add_doc "search.clj" 633]
  [datalevin.interface$eval46479$fn__46528$G__46461__46536 invoke "interface.clj" 274]
  [datalevin.interface$eval46479$fn__46528$G__46460__46545 invoke "interface.clj" 274]
  [datalevin_db.main$eval73633$fn__73641 invoke "main.clj" 376]
  [datalevin_db.main$eval73633 invokeStatic "main.clj" 375]
  [datalevin_db.main$eval73633 invoke "main.clj" 364]
  [clojure.lang.Compiler eval "Compiler.java" 7757]
  [cider.nrepl.middleware.util.eval$eval_dispatcher$fn__42725 invoke "eval.clj" 15]
  [nrepl.middleware.interruptible_eval$evaluator$run__35316$fn__35330 invoke "interruptible_eval.clj" 130]
  [nrepl.middleware.interruptible_eval$evaluator$run__35316 invoke "interruptible_eval.clj" 128]
  [nrepl.middleware.session$session_exec$session_loop__35421 invoke "session.clj" 251]
  [nrepl.SessionThread run "SessionThread.java" 21]]}

niko 2026-02-28T18:05:47.594609Z

this is a sample of the entities that I'm extracting the tags from (some recipes scraped from the web):

(#:recipe{:author #:author{:name "G. Stephen Jones", :normalizedName "g. stephen jones"},
          :cookTime 15,
          :date-published "2006/05/10",
          :description "",
          :domain #:domain{:brand "reluctantgourmet"},
          :ingredients ["1  egg" "2 teaspoons Worcestershire sauce" "¼ teaspoon dry mustard" "2 tablespoons mayonnaise"
                        "1 teaspoon lemon juice" "1 tablespoon Dijon mustard" "1 tablespoon olive oil"
                        "1 teaspoon dried parsley flakes" "1 teaspoon Old Bay Seasoning" "¾ cup breadcrumbs"
                        "16 oz. lump crab meat"],
          :name "Crab Cakes Recipe",
          :normalizedName "crab cakes recipe",
          :num-ingredients 11,
          :prepTime 15,
          :ratingCount 5,
          :ratingValue 4.8,
          :reviewCount 4,
          :tags [#:tag{:name "american"} #:tag{:name "main course"}],
          :tags-group "american main course",
          :totalTime 30,
          :url ""}
 #:recipe{:cookTime 25,
          :date-published "2022/08/19",
          :description
            "Need something really simple for a party? Try these easy, cheesy, crispy Greek mini feta cheese pies (AKA tiropitakia!) made with just filo pastry, feta cheese and a few other simple ingredients. Make them ahead, stash them in the freezer, and you're ready to go!",
          :domain #:domain{:brand "scrummylane"},
          :ingredients
            ["10.5 ounces feta cheese"
             "½ cup Greek yogurt (Substitute with ricotta cheese, more cream cheese, cottage cheese or even sour cream. )"
             "½ cup cream cheese"
             "⅓ cup parmesan cheese (Grated, about a large handful. Substitute with any grated cheese (strong flavored is best). )"
             "1  egg (Lightly whisked)" "¼ teaspoon nutmeg" "¼ teaspoon black pepper"
             "2 tablespoons fresh mint (chopped, or dill, or 2 teaspoons dried oregano (optional))"
             "9 ounces filo pastry" "⅓ cup olive oil"
             "sesame seeds (Optional, for sprinkling over just before baking.)"],
          :name "Tiropitakia (Mini Greek Feta Cheese Pies)",
          :normalizedName "tiropitakia (mini greek feta cheese pies)",
          :num-ingredients 11,
          :prepTime 15,
          :tags [#:tag{:name "greek"} #:tag{:name "appetizer"} #:tag{:name "cheese pies"} #:tag{:name "filo pies"}
                 #:tag{:name "tiropita"} #:tag{:name "tiropitakia"}],
          :tags-group "greek appetizer cheese pies filo pies tiropita tiropitakia",
          :totalTime 40,
          :url ""})

niko 2026-02-28T18:06:45.206279Z

this is a small sample of the raw tag strings formatted in a way to use with add-doc :

([0 "american"]
 [1 "main course"]
 [2 "greek"]
 [3 "appetizer"]
 [4 "cheese pies"]
 [5 "filo pies"]
 [6 "tiropita"]
 [7 "tiropitakia"]
 [8 "canadian"]
 [9 "dessert"]
 [10 "snack"]
 [11 "no bake puffed wheat squares"]
 [12 "puffed wheat squares"]
 [13 "dinner"]
 [14 "american"]
 [15 "dessert"]
 [16 "cupcake"])

niko 2026-02-28T18:07:18.932229Z

these are the analyzers that i've attached to the search engine:

(def autocomplete-analyzer (sut/create-analyzer {:tokenizer (inter-fn [s] (a/en-analyzer s))
                                                 :token-filters [sut/prefix-token-filter]}))
(def autocomplete-query-time-analyzer (sut/create-analyzer
                                       {:tokenizer (sut/create-regexp-tokenizer #"[\s:/\.;,!=?\"'()\[\]{}|<>&@#^*\\~`\-]+")
                                        :token-filters [sut/lower-case-token-filter
                                                        sut/unaccent-token-filter]}))

niko 2026-02-28T18:14:09.895179Z

i wasn't able to reproduce the error on a smaller scale; it seems to only happen after ingesting 20k or more tag strings. Here's a gist (https://gist.github.com/nkad129/61c344f72d5a5d79026cd34fe56c470a) with enough tags to get to the idx where the err was thrown on my local. I'm confused about whether something is getting corrupted in the search store 🤔

niko 2026-02-28T18:19:35.518089Z

this is in context of a similar problem I've been narrowing down: I've been trying to ingest some recipes entities into a dtlv datalog store that has search enabled on some of its fields, where the lmdb errs start happening during a batch ingest. I narrowed it down to an issue with fields that have :db/fulltext true , especially ones where it tries to use the autocomplete/prefix analyzers. When I remove all the fulltext attributes the batch ingest goes well. So I've been trying to see if I can get a batch ingest just for a standalone search engine to work first then try again with the datalog store

niko 2026-02-28T18:21:58.370179Z

i'm currently using datalevin 0.10.5 if that helps

Huahai 2026-03-01T00:48:09.534599Z

Please file a GitHub issue.

✅ 1
niko 2026-03-01T03:29:34.650679Z

filed: https://github.com/datalevin/datalevin/issues/355, lmk how/if I can help

❤️ 1
Huahai 2026-03-01T04:05:15.404869Z

This looks like a prefix-compression issue again.

Huahai 2026-03-01T04:43:41.848399Z

A fix will be in next release.

Huahai 2026-03-01T06:32:56.288569Z

Fix is in master branch. Please let me know if it is not fixed.

👀 1
niko 2026-03-02T16:55:25.037159Z

yep the fix worked, thank you for the fast turnarounds! I was trying to read the fix you made in C but haven't touched that language since college 😂

Huahai 2026-03-02T17:10:09.503319Z

LMDB is not the easiest code base to work in, but the performance is hard to beat. Thank you for the bug report!

🫡 1
niko 2026-03-02T17:59:07.093399Z

i'm running into a different issue now. i'm able to batch ingest and create a new datalog db with the tags field using prefix tokenization, but having trouble reading it in using get-conn without a jvm oom crash in my repl, specifically:

Execution error (OutOfMemoryError) at org.roaringbitmap.RoaringArray/deserialize (RoaringArray.java:601).
Java heap space
the size of the db using du -sh is ~5.8G while ls -s gives me:
ls -s
total 950960
950816 data.mdb		   136 lock.mdb		     0 snapshots	     0 txlog		     8 VERSION
The jvm ooms kept happening until I increased JVM heap allocation from 8G -> 20G like:
:jvm-opts ["--add-opens" "java.management/sun.management=ALL-UNNAMED" "-Xmx20G" "-Xms2G"]
is that expected, and should I stick with large JVM heaps? 🤔 I can file that as another issue
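
When comparing heap sizes like this, it can help to confirm that the `-Xmx` value actually reached the REPL's JVM, since editor-launched REPLs do not always pick up every alias. A minimal sketch using only the standard `java.lang.Runtime` API; the helper name `max-heap-gib` is made up for illustration:

```clojure
;; Hypothetical helper: maximum heap the running JVM may grow to, in GiB.
;; This reflects -Xmx only if the option was applied to this JVM.
(defn max-heap-gib []
  (/ (.maxMemory (Runtime/getRuntime)) 1024.0 1024.0 1024.0))

(max-heap-gib)
```

If the number printed is far below the configured `-Xmx`, the options are probably not being passed to the process that runs the REPL.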

Huahai 2026-03-02T23:27:25.540129Z

Interesting, needing 20G heap for such a small DB is excessive.

Huahai 2026-03-02T23:57:46.925599Z

You are using prefix-token-filter, right? https://cljdoc.org/d/datalevin/datalevin/0.10.5/api/datalevin.search-utils#prefix-token-filter says that "This is useful for producing efficient autocomplete engines, provided this filter is NOT applied at query time."

Huahai 2026-03-03T00:00:16.178439Z

You may want to exclude single letter prefix, as they will have huge post list that can blow up heap like that.

Huahai 2026-03-03T00:01:19.802149Z

Use something like

create-min-length-token-filter
so it will not blow up your heap.

niko 2026-03-03T01:54:26.427599Z

yeah just using prefix-token-filter, these ingest and search time analyzers:

(def autocomplete-analyzer (sut/create-analyzer {:tokenizer (inter-fn [s] (a/en-analyzer s))
                                                 :token-filters [sut/prefix-token-filter]}))
(def autocomplete-query-time-analyzer (sut/create-analyzer
                                       {:tokenizer (sut/create-regexp-tokenizer #"[\s:/\.;,!=?\"'()\[\]{}|<>&@#^*\\~`\-]+")
                                        :token-filters [sut/lower-case-token-filter
                                                        sut/unaccent-token-filter]}))

niko 2026-03-03T01:56:05.356899Z

"You may want to exclude single letter prefix, as they will have huge post list that can blow up heap like that"
ok let me try that. In a real app i wouldn't send autocomplete requests on 1 letter queries

niko 2026-03-03T01:57:05.883419Z

"Interesting, needing 20G heap for such a small DB is excessive"
yeah i was surprised by this too. I'm not familiar with roaring bitmaps but it's possible that it's trying to deserialize really large objects on the heap? edit: oh yeah that'll make sense with the huge post list detail you added

niko 2026-03-03T02:14:55.207359Z

i updated the ingest analyzer to take out the single letters:

(def autocomplete-analyzer (sut/create-analyzer {:tokenizer (inter-fn [s] (a/en-analyzer s))
                                                 :token-filters [sut/prefix-token-filter
                                                                 (sut/create-min-length-token-filter 2)]}))
resulting general file size dropped:
du -sh
3.4G	.
ls -s also shows a bit lower:
ls -sh
total 918192
918048 data.mdb	   136 lock.mdb	     0 txlog	     8 VERSION
not sure if these are that useful. 8G heap didn't work, got
Execution error (OutOfMemoryError) at me.lemire.integercompression.IntCompressor/uncompress (IntCompressor.java:56).
Java heap space
but 12G heap did work

Huahai 2026-03-03T06:15:04.995979Z

Taking out 2-letter prefixes too would probably do it. Yes, the post list of a term is stored as a roaring bitmap and a compressed integer list, and these are fetched as a binary blob in one shot and deserialized in memory. For a huge post list, that needs a lot of memory. We can probably switch to an off-heap deserialization path to reduce heap pressure.
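
A rough illustration of why short prefixes end up with the largest post lists, in plain Clojure with no Datalevin calls. The helpers `prefixes`, `posting-lists` and `largest` are hypothetical stand-ins written for this sketch (the first mimics what a prefix token filter emits, e.g. vault -> v, va, vau, vaul, vault), and the tags are the first ten from the sample earlier in the thread:

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for a prefix token filter: every prefix of a token.
(defn prefixes [token]
  (map #(subs token 0 %) (range 1 (inc (count token)))))

;; First ten tag strings from the sample earlier in the thread.
(def tags ["american" "main course" "greek" "appetizer" "cheese pies"
           "filo pies" "tiropita" "tiropitakia" "canadian" "dessert"])

;; Map each prefix (of at least min-len characters) to the set of
;; document ids that contain a word starting with that prefix.
(defn posting-lists [tags min-len]
  (reduce (fn [acc [doc-id tag]]
            (reduce (fn [a p] (update a p (fnil conj #{}) doc-id))
                    acc
                    (for [word  (str/split tag #"\s+")
                          p     (prefixes word)
                          :when (>= (count p) min-len)]
                      p)))
          {}
          (map-indexed vector tags)))

;; The n prefixes with the most documents, as [prefix doc-count] pairs.
(defn largest [postings n]
  (->> postings
       (map (fn [[p ids]] [p (count ids)]))
       (sort-by (juxt (comp - second) first))
       (take n)))

(largest (posting-lists tags 1) 3) ; single-letter prefixes included
(largest (posting-lists tags 2) 3) ; single-letter prefixes dropped
```

Every word starting with a given letter lands in that letter's post list, so over hundreds of thousands of tags the one- and two-letter prefixes each cover a large share of all documents, while longer prefixes stay small.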

Huahai 2026-03-03T06:25:10.763349Z

I will file this as an enhancement.

🙌 1
Huahai 2026-03-03T06:28:50.640149Z

https://github.com/datalevin/datalevin/issues/356

niko 2026-03-01T16:39:20.566549Z

darn, identical err still happening when I re-run the code. Just to check that I pulled the commit the master branch is pointing to, I set in my deps.edn:

datalevin/datalevin {:git/url ""
                             :git/sha "c0803e7de905cf528cb23cfd4153d9162f8932fa"}
then ran
clj -X:deps prep
Checking out:  at c0803e7de905cf528cb23cfd4153d9162f8932fa
Prepping datalevin/datalevin in /Users/nkad129/.gitlibs/libs/datalevin/datalevin/c0803e7de905cf528cb23cfd4153d9162f8932fa
./src/java/datalevin/cpp/UnsafeAccess.java:6: warning: sun.misc.Unsafe is internal proprietary API and may be removed in a future release
import sun.misc.Unsafe;
               ^
./src/java/datalevin/cpp/UnsafeAccess.java:12: warning: sun.misc.Unsafe is internal proprietary API and may be removed in a future release
    static Unsafe UNSAFE = null;
           ^
./src/java/datalevin/cpp/UnsafeAccess.java:18: warning: sun.misc.Unsafe is internal proprietary API and may be removed in a future release
            final Field u = Unsafe.class.getDeclaredField("theUnsafe");
                            ^
./src/java/datalevin/cpp/UnsafeAccess.java:20: warning: sun.misc.Unsafe is internal proprietary API and may be removed in a future release
            UNSAFE = (Unsafe) u.get(null);
                      ^
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Then ran the same code mentioned above with the same strings collection in a fresh Calva REPL, and received the same err

niko 2026-03-01T16:40:45.030029Z

i'll re-open the issue or file another issue if that looks correct

Huahai 2026-03-01T18:04:25.142419Z

ok, I will reopen and dig more

πŸ™ 1
Huahai 2026-03-01T23:03:50.117989Z

The master branch should have a fix. Please let me know if there are other problems. Thanks.

niko 2026-03-03T16:36:45.102399Z

awesome, thank you for all the help 🙏

amar 2026-02-28T03:02:50.548439Z

Hi. I am trying to upgrade to 0.10.* from 0.9.27 and consistently seeing an MDB_PAGE_FULL error. I'm on Java 25, Clojure 1.12.4, datalevin 0.10.5. The same code, with the same data works fine with 0.9.27.

Root: clojure.lang.ExceptionInfo - Fail to transact to LMDB: #error {
 :cause "MDB_PAGE_FULL: Internal error - page has no more space"
 :via
 [{:type datalevin.cpp.Util$DTLVException
   :message "MDB_PAGE_FULL: Internal error - page has no more space"
   :at [datalevin.cpp.Util checkRc "Util.java" 63]}]
 :trace
 [[datalevin.cpp.Util checkRc "Util.java" 63]
  [datalevin.cpp.Dbi put "Dbi.java" 56]
  [datalevin.binding.cpp.DBI put "cpp.clj" 314]
  [datalevin.binding.cpp.DBI put "cpp.clj" 315]
  [datalevin.binding.cpp$put_tx invokeStatic "cpp.clj" 633]
  [datalevin.binding.cpp$put_tx invoke "cpp.clj" 626]
  [datalevin.binding.cpp$transact1_STAR_ invokeStatic "cpp.clj" 651]
  [datalevin.binding.cpp$transact1_STAR_ invoke "cpp.clj" 648]
  [datalevin.binding.cpp.CppLMDB transact_kv "cpp.clj" 1053]
  [datalevin.interface$eval8059$fn__8969$G__8025__8984 invoke "interface.clj" 90]
  [datalevin.interface$eval8059$fn__8969$G__8024__9000 invoke "interface.clj" 90]

Root stack trace:
  datalevin.binding.cpp.CppLMDB/transact_kv at cpp.clj:1070
  datalevin.interface$eval8059$fn__8969$G__8025__8984/invoke at interface.clj:90
  datalevin.interface$eval8059$fn__8969$G__8024__9000/invoke at interface.clj:90
The calling code looks like:
(d/transact-kv *kv*
               index
               [[:put (key->bytes k) (value->bytes v)]]
               :bytes :bytes)
Any ideas?

amar 2026-02-28T13:39:40.420569Z

Thanks. I'll try to see if I can put together a minimal example.

Huahai 2026-03-01T00:58:36.297909Z

Please file a github issue with a link to the DB.

Huahai 2026-03-01T06:33:23.877829Z

Please test if the master branch fixes this.

Huahai 2026-02-28T05:20:26.461499Z

If you can share a minimal DB that produces this error, we can take a look. It looks like an edge case in prefix compression.

Huahai 2026-02-28T06:52:58.536719Z

If not, the next release may have a fix.

amar 2026-03-01T15:32:42.448619Z

Hi @huahaiy Yes! The master branch fixes the issue. Thank you very much!