2026-02-28 datalevin | Clojure Slack Archive

niko 2026-02-28T18:03:50.480439Z

👋 hi there, I'm trying to batch ingest a collection of tag strings into a standalone search engine, but am running into what looks like an LMDB error:

Execution error (ExceptionInfo) at datalevin.binding.cpp.CppLMDB/transact_kv (cpp.clj:1070).
Fail to transact to LMDB: #error {
 :cause "MDB_BAD_VALSIZE: Unsupported size of key/DB name/data, or wrong DUPFIXED size"
 :via
 [{:type datalevin.cpp.Util$DTLVException
   :message "MDB_BAD_VALSIZE: Unsupported size of key/DB name/data, or wrong DUPFIXED size"
   :at [datalevin.cpp.Util checkRc "Util.java" 63]}]

can I please get some help debugging this, as I'm not familiar with lmdb? i'll add more details in 🧵

niko 2026-02-28T18:04:15.201809Z

this is the code I'm running in a local repl:

(let [temp-dir (str (System/getProperty "java.io.tmpdir") "/datalevin-test-" (System/currentTimeMillis))
      lmdb (d/open-kv temp-dir {:flags
                                (conj c/default-env-flags :nosync)})
      engine (s/new-search-engine
              lmdb {:analyzer       autocomplete-analyzer
                    :query-analyzer autocomplete-query-time-analyzer
                    :include-text?  true})
      idxed-tags (->> (mapcat :recipe/tags converted-recipes)
                      (mapv :tag/name)
                      (map-indexed vector))]
  (doseq [[tag-idx tag] idxed-tags]
    (try
      (d/add-doc engine tag-idx tag false)
      (catch Exception e
        (println (str "error trying to add-doc tag-idx: " tag-idx ", tag: " tag ", err msg: " (.getMessage e)))
        (throw e))))
  (d/close-kv lmdb))

niko 2026-02-28T18:04:33.791079Z

this is the full error I'm getting when trying to ingest the tags:

error trying to add-doc tag-idx: 46516, tag: american, err msg: Fail to transact to LMDB: #error {
 :cause "MDB_BAD_VALSIZE: Unsupported size of key/DB name/data, or wrong DUPFIXED size"
 :via
 [{:type datalevin.cpp.Util$DTLVException
   :message "MDB_BAD_VALSIZE: Unsupported size of key/DB name/data, or wrong DUPFIXED size"
   :at [datalevin.cpp.Util checkRc "Util.java" 63]}]
 :trace
 [[datalevin.cpp.Util checkRc "Util.java" 63]
  [datalevin.cpp.Dbi put "Dbi.java" 56]
  [datalevin.binding.cpp.DBI put "cpp.clj" 314]
  [datalevin.binding.cpp.DBI put "cpp.clj" 315]
  [datalevin.binding.cpp$put_tx invokeStatic "cpp.clj" 633]
  [datalevin.binding.cpp$put_tx invoke "cpp.clj" 626]
  [datalevin.binding.cpp$transact_STAR_ invokeStatic "cpp.clj" 660]
  [datalevin.binding.cpp$transact_STAR_ invoke "cpp.clj" 653]
  [datalevin.binding.cpp.CppLMDB transact_kv "cpp.clj" 1054]
  [datalevin.binding.cpp.CppLMDB transact_kv "cpp.clj" 1037]
  [datalevin.binding.cpp.CppLMDB transact_kv "cpp.clj" 1035]
  [datalevin.search$add_doc_STAR_ invokeStatic "search.clj" 930]
  [datalevin.search$add_doc_STAR_ invoke "search.clj" 887]
  [datalevin.search.SearchEngine add_doc "search.clj" 633]
  [datalevin.interface$eval46479$fn__46528$G__46461__46536 invoke "interface.clj" 274]
  [datalevin.interface$eval46479$fn__46528$G__46460__46545 invoke "interface.clj" 274]
  [datalevin_db.main$eval73633$fn__73641 invoke "main.clj" 376]
  [datalevin_db.main$eval73633 invokeStatic "main.clj" 375]
  [datalevin_db.main$eval73633 invoke "main.clj" 364]
  [clojure.lang.Compiler eval "Compiler.java" 7757]
  [cider.nrepl.middleware.util.eval$eval_dispatcher$fn__42725 invoke "eval.clj" 15]
  [nrepl.middleware.interruptible_eval$evaluator$run__35316$fn__35330 invoke "interruptible_eval.clj" 130]
  [nrepl.middleware.interruptible_eval$evaluator$run__35316 invoke "interruptible_eval.clj" 128]
  [nrepl.middleware.session$session_exec$session_loop__35421 invoke "session.clj" 251]
  [nrepl.SessionThread run "SessionThread.java" 21]]}

niko 2026-02-28T18:05:47.594609Z

this is a sample of the entities that I'm extracting the tags from (some recipes scraped from the web):

(#:recipe{:author #:author{:name "G. Stephen Jones", :normalizedName "g. stephen jones"},
          :cookTime 15,
          :date-published "2006/05/10",
          :description "",
          :domain #:domain{:brand "reluctantgourmet"},
          :ingredients ["1  egg" "2 teaspoons Worcestershire sauce" "¼ teaspoon dry mustard" "2 tablespoons mayonnaise"
                        "1 teaspoon lemon juice" "1 tablespoon Dijon mustard" "1 tablespoon olive oil"
                        "1 teaspoon dried parsley flakes" "1 teaspoon Old Bay Seasoning" "¾ cup breadcrumbs"
                        "16 oz. lump crab meat"],
          :name "Crab Cakes Recipe",
          :normalizedName "crab cakes recipe",
          :num-ingredients 11,
          :prepTime 15,
          :ratingCount 5,
          :ratingValue 4.8,
          :reviewCount 4,
          :tags [#:tag{:name "american"} #:tag{:name "main course"}],
          :tags-group "american main course",
          :totalTime 30,
          :url ""}
 #:recipe{:cookTime 25,
          :date-published "2022/08/19",
          :description
            "Need something really simple for a party? Try these easy, cheesy, crispy Greek mini feta cheese pies (AKA tiropitakia!) made with just filo pastry, feta cheese and a few other simple ingredients. Make them ahead, stash them in the freezer, and you're ready to go!",
          :domain #:domain{:brand "scrummylane"},
          :ingredients
            ["10.5 ounces feta cheese"
             "½ cup Greek yogurt (Substitute with ricotta cheese, more cream cheese, cottage cheese or even sour cream. )"
             "½ cup cream cheese"
             "⅓ cup parmesan cheese (Grated, about a large handful. Substitute with any grated cheese (strong flavored is best). )"
             "1  egg (Lightly whisked)" "¼ teaspoon nutmeg" "¼ teaspoon black pepper"
             "2 tablespoons fresh mint (chopped, or dill, or 2 teaspoons dried oregano (optional))"
             "9 ounces filo pastry" "⅓ cup olive oil"
             "sesame seeds (Optional, for sprinkling over just before baking.)"],
          :name "Tiropitakia (Mini Greek Feta Cheese Pies)",
          :normalizedName "tiropitakia (mini greek feta cheese pies)",
          :num-ingredients 11,
          :prepTime 15,
          :tags [#:tag{:name "greek"} #:tag{:name "appetizer"} #:tag{:name "cheese pies"} #:tag{:name "filo pies"}
                 #:tag{:name "tiropita"} #:tag{:name "tiropitakia"}],
          :tags-group "greek appetizer cheese pies filo pies tiropita tiropitakia",
          :totalTime 40,
          :url ""})

niko 2026-02-28T18:06:45.206279Z

this is a small sample of the raw tag strings, formatted for use with add-doc:

([0 "american"]
 [1 "main course"]
 [2 "greek"]
 [3 "appetizer"]
 [4 "cheese pies"]
 [5 "filo pies"]
 [6 "tiropita"]
 [7 "tiropitakia"]
 [8 "canadian"]
 [9 "dessert"]
 [10 "snack"]
 [11 "no bake puffed wheat squares"]
 [12 "puffed wheat squares"]
 [13 "dinner"]
 [14 "american"]
 [15 "dessert"]
 [16 "cupcake"])

niko 2026-02-28T18:07:18.932229Z

these are the analyzers that i've attached to the search engine:

(def autocomplete-analyzer (sut/create-analyzer {:tokenizer (inter-fn [s] (a/en-analyzer s))
                                                 :token-filters [sut/prefix-token-filter]}))
(def autocomplete-query-time-analyzer (sut/create-analyzer
                                       {:tokenizer (sut/create-regexp-tokenizer #"[\s:/\.;,!=?\"'()\[\]{}|<>&@#^*\\~`\-]+")
                                        :token-filters [sut/lower-case-token-filter
                                                        sut/unaccent-token-filter]}))

niko 2026-02-28T18:14:09.895179Z

i wasn't able to reproduce the error on a smaller scale; it seems to only happen after ingesting 20k or more tag strings. Here's a gist (https://gist.github.com/nkad129/61c344f72d5a5d79026cd34fe56c470a) with enough tags to reach the idx where the err was thrown on my local machine. I'm not sure whether something is getting corrupted in the search store 🤔

niko 2026-02-28T18:19:35.518089Z

this is in the context of a similar problem I've been narrowing down: I've been trying to ingest some recipe entities into a dtlv datalog store that has search enabled on some of its fields, and the lmdb errs start happening during a batch ingest. I narrowed it down to fields that have :db/fulltext true, especially ones that use the autocomplete/prefix analyzers. When I remove all the fulltext attributes, the batch ingest goes well. So I've been trying to get a batch ingest working for a standalone search engine first, and will then try again with the datalog store

niko 2026-02-28T18:21:58.370179Z

i'm currently using datalevin 0.10.5 if that helps

Huahai 2026-03-01T00:48:09.534599Z

Please file a GitHub issue.

✅ 1

niko 2026-03-01T03:29:34.650679Z

filed: https://github.com/datalevin/datalevin/issues/355, lmk how/if I can help

❤️ 1

Huahai 2026-03-01T04:05:15.404869Z

This looks like a prefix-compression issue again.

Huahai 2026-03-01T04:43:41.848399Z

A fix will be in next release.

Huahai 2026-03-01T06:32:56.288569Z

The fix is in the master branch. Please let me know if it is not fixed.

👀 1

niko 2026-03-02T16:55:25.037159Z

yep the fix worked, thank you for the fast turnarounds! I was trying to read the fix you made in C but haven't touched that language since college 😂

Huahai 2026-03-02T17:10:09.503319Z

LMDB is not the easiest code base to work in, but the performance is hard to beat. Thank you for the bug report!

🫡 1

niko 2026-03-02T17:59:07.093399Z

i'm running into a different issue now. i'm able to batch ingest and create a new datalog db with the tags field using prefix tokenization, but having trouble reading it in using get-conn without a jvm oom crash in my repl, specifically:

Execution error (OutOfMemoryError) at org.roaringbitmap.RoaringArray/deserialize (RoaringArray.java:601).
Java heap space

the size of the db using du -sh is ~5.8G while ls -s gives me:

ls -s
total 950960
950816 data.mdb		   136 lock.mdb		     0 snapshots	     0 txlog		     8 VERSION

The jvm ooms kept happening until I increased JVM heap allocation from 8G -> 20G like:

:jvm-opts ["--add-opens" "java.management/sun.management=ALL-UNNAMED" "-Xmx20G" "-Xms2G"]

is that expected, and should I stick with large JVM heaps? 🤔 I can file that as another issue

Huahai 2026-03-02T23:27:25.540129Z

Interesting, needing 20G heap for such a small DB is excessive.

Huahai 2026-03-02T23:57:46.925599Z

You are using prefix-token-filter, right? https://cljdoc.org/d/datalevin/datalevin/0.10.5/api/datalevin.search-utils#prefix-token-filter says that "This is useful for producing efficient autocomplete engines, provided this filter is NOT applied at query time."

Huahai 2026-03-03T00:00:16.178439Z

You may want to exclude single-letter prefixes, as they will have huge post lists that can blow up the heap like that.

Huahai 2026-03-03T00:01:19.802149Z

Use something like create-min-length-token-filter so it will not blow up your heap.

niko 2026-03-03T01:54:26.427599Z

yeah just using prefix-token-filter, these ingest and search time analyzers:

(def autocomplete-analyzer (sut/create-analyzer {:tokenizer (inter-fn [s] (a/en-analyzer s))
                                                 :token-filters [sut/prefix-token-filter]}))
(def autocomplete-query-time-analyzer (sut/create-analyzer
                                       {:tokenizer (sut/create-regexp-tokenizer #"[\s:/\.;,!=?\"'()\[\]{}|<>&@#^*\\~`\-]+")
                                        :token-filters [sut/lower-case-token-filter
                                                        sut/unaccent-token-filter]}))

niko 2026-03-03T01:56:05.356899Z

> You may want to exclude single-letter prefixes, as they will have huge post lists that can blow up the heap like that

ok let me try that. In a real app i wouldn't send autocomplete requests on 1 letter queries

niko 2026-03-03T01:57:05.883419Z

> Interesting, needing 20G heap for such a small DB is excessive

yeah i was surprised by this too. I'm not familiar with roaring bitmaps but it's possible that it's trying to deserialize really large objects on the heap? edit: oh yeah that'll make sense with the huge post list detail you added

niko 2026-03-03T02:14:55.207359Z

i updated the ingest analyzer to take out the single letters:

(def autocomplete-analyzer (sut/create-analyzer {:tokenizer (inter-fn [s] (a/en-analyzer s))
                                                 :token-filters [sut/prefix-token-filter
                                                                 (sut/create-min-length-token-filter 2)]}))

resulting general file size dropped:

du -sh
3.4G	.

ls -s also shows a bit lower:

ls -sh
total 918192
918048 data.mdb	   136 lock.mdb	     0 txlog	     8 VERSION

not sure if these are that useful. 8G heap didn't work, got:

Execution error (OutOfMemoryError) at me.lemire.integercompression.IntCompressor/uncompress (IntCompressor.java:56).
Java heap space

but 12G heap did work

Huahai 2026-03-03T06:15:04.995979Z

Taking out 2-letter prefixes would probably do. Yes, the post list of a term is stored as a roaring bitmap and a compressed integer list; these are fetched as a binary blob in one shot and deserialized in memory. For a huge post list, that needs a lot of memory. We can probably switch to an off-heap deserialization path to reduce heap pressure.

Huahai 2026-03-03T06:25:10.763349Z

I will file this as an enhancement.

🙌 1

Huahai 2026-03-03T06:28:50.640149Z

https://github.com/datalevin/datalevin/issues/356

niko 2026-03-01T16:39:20.566549Z

darn, identical err still happening when I re-run the code. Just to check that I pulled the commit the master branch is pointing to, I set in my deps.edn:

datalevin/datalevin {:git/url ""
                     :git/sha "c0803e7de905cf528cb23cfd4153d9162f8932fa"}

then ran

clj -X:deps prep
Checking out:  at c0803e7de905cf528cb23cfd4153d9162f8932fa
Prepping datalevin/datalevin in /Users/nkad129/.gitlibs/libs/datalevin/datalevin/c0803e7de905cf528cb23cfd4153d9162f8932fa
./src/java/datalevin/cpp/UnsafeAccess.java:6: warning: sun.misc.Unsafe is internal proprietary API and may be removed in a future release
import sun.misc.Unsafe;
               ^
./src/java/datalevin/cpp/UnsafeAccess.java:12: warning: sun.misc.Unsafe is internal proprietary API and may be removed in a future release
    static Unsafe UNSAFE = null;
           ^
./src/java/datalevin/cpp/UnsafeAccess.java:18: warning: sun.misc.Unsafe is internal proprietary API and may be removed in a future release
            final Field u = Unsafe.class.getDeclaredField("theUnsafe");
                            ^
./src/java/datalevin/cpp/UnsafeAccess.java:20: warning: sun.misc.Unsafe is internal proprietary API and may be removed in a future release
            UNSAFE = (Unsafe) u.get(null);
                      ^
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

Then ran the same code mentioned above with the same strings collection in a fresh Calva REPL, and received the same err

niko 2026-03-01T16:40:45.030029Z

i'll re-open the issue or file another issue if that looks correct

Huahai 2026-03-01T18:04:25.142419Z

ok, I will reopen and dig more

🙏 1

Huahai 2026-03-01T23:03:50.117989Z

The master branch should have a fix. Please let me know if there are other problems. Thanks.

niko 2026-03-03T16:36:45.102399Z

awesome, thank you for all the help 🙏

amar 2026-02-28T03:02:50.548439Z

Hi. I am trying to upgrade to 0.10.* from 0.9.27 and consistently seeing an MDB_PAGE_FULL error. I'm on Java 25, Clojure 1.12.4, datalevin 0.10.5. The same code, with the same data works fine with 0.9.27.

Root: clojure.lang.ExceptionInfo - Fail to transact to LMDB: #error {
 :cause "MDB_PAGE_FULL: Internal error - page has no more space"
 :via
 [{:type datalevin.cpp.Util$DTLVException
   :message "MDB_PAGE_FULL: Internal error - page has no more space"
   :at [datalevin.cpp.Util checkRc "Util.java" 63]}]
 :trace
 [[datalevin.cpp.Util checkRc "Util.java" 63]
  [datalevin.cpp.Dbi put "Dbi.java" 56]
  [datalevin.binding.cpp.DBI put "cpp.clj" 314]
  [datalevin.binding.cpp.DBI put "cpp.clj" 315]
  [datalevin.binding.cpp$put_tx invokeStatic "cpp.clj" 633]
  [datalevin.binding.cpp$put_tx invoke "cpp.clj" 626]
  [datalevin.binding.cpp$transact1_STAR_ invokeStatic "cpp.clj" 651]
  [datalevin.binding.cpp$transact1_STAR_ invoke "cpp.clj" 648]
  [datalevin.binding.cpp.CppLMDB transact_kv "cpp.clj" 1053]
  [datalevin.interface$eval8059$fn__8969$G__8025__8984 invoke "interface.clj" 90]
  [datalevin.interface$eval8059$fn__8969$G__8024__9000 invoke "interface.clj" 90]

Root stack trace:
  datalevin.binding.cpp.CppLMDB/transact_kv at cpp.clj:1070
  datalevin.interface$eval8059$fn__8969$G__8025__8984/invoke at interface.clj:90
  datalevin.interface$eval8059$fn__8969$G__8024__9000/invoke at interface.clj:90

The calling code looks like:

(d/transact-kv *kv*
               index
               [[:put (key->bytes k) (value->bytes v)]]
               :bytes :bytes)

Any ideas?

amar 2026-02-28T13:39:40.420569Z

Thanks. I'll try to see if I can put together a minimal example.

Huahai 2026-03-01T00:58:36.297909Z

Please file a GitHub issue with a link to the DB.

Huahai 2026-03-01T06:33:23.877829Z

Please test if the master branch fixes this.

Huahai 2026-02-28T05:20:26.461499Z

If you can share a minimal DB that produces this error, we can take a look. It looks like an edge case in prefix compression.

Huahai 2026-02-28T06:52:58.536719Z

If not, the next release may have a fix.

amar 2026-03-01T15:32:42.448619Z

Hi @huahaiy, yes! The master branch fixes the issue. Thank you very much!
