2026-04-14 datalevin | Clojure Slack Archive

datalevin

beetleman 2026-04-14T08:21:15.799129Z

I’m using the new vector feature from master and wondering how to use a different model as an embedding provider in the datalog store. I want to use a model from Hugging Face. How can I do this? I know this isn’t the final release, so I don’t have any expectations :D Just want to validate some ideas.

euccastro 2026-04-15T09:56:16.874379Z

I have since switched to a model run externally and served via HTTP, so I could make better use of my particular Intel NPU. Happy to give details on that too, but it boils down to implementing IEmbeddingProvider (see datalevin.embedding in the source). I can give you my implementation if you want, but a good LLM will probably figure out something better adapted to your needs if you want to go this route.

Huahai 2026-04-16T02:23:41.093689Z

You don't even need to do that. Just creating an embedding service with a base URL, model, and API key is enough.

👍 1

Huahai 2026-04-16T02:26:01.618599Z

This is to support other languages that can't implement a Clojure protocol.

euccastro 2026-04-14T08:38:26.384329Z

I've done exactly this and I asked Claude to describe how it works (sorry for the formatting, but hopefully it's usable):

Hey! I'm doing exactly this: using a Hugging Face GGUF model (Qwen3-Embedding-8B) as the embedding provider in a Datalog store on master. Here's what works:

1. Get a GGUF-format model

You need the GGUF version of whatever HF model you want. Many popular embedding models have community-provided GGUF quantizations on HF (search for <model-name> gguf).

2. Configure the embedding opts when creating your connection

    (d/create-conn "/tmp/my-db" my-schema
      {:embedding-opts
       {:provider    :default   ;; uses the built-in llama.cpp provider
        :model       "/path/to/your-model.gguf"
        :dimensions  4096       ;; must match the model's n_embd
        :ctx-size    512        ;; token context window
        :batch-size  512
        :threads     6          ;; number of CPU threads
        :metric-type :cosine}})

The key is :model. Point it at your GGUF file and Datalevin will use it instead of the default multilingual-e5-small. You also need :dimensions matching your model's output dimensionality.

3. Schema: mark attributes for embedding

    {:my/content {:db/valueType            :db.type/string
                  :db/embedding            true
                  :db.embedding/autoDomain true}}

Then embedding-neighbors in Datalog queries just works with your model.

4. Important: :ctx-size for large models

If your model is large (e.g. 8B params), you must set :ctx-size explicitly. The default (0, which defers to native defaults) can cause an enormous memory allocation. I hit a 215 GB allocation attempt with Qwen3-Embedding-8B before setting :ctx-size 512 :batch-size 512.

5. Matryoshka limitation

One thing to be aware of: the native LlamaEmbedder.embed() always returns the full n_embd dimensions from the model. There's no truncation support, so if your model supports Matryoshka (variable-dimension) embeddings, you can't use a reduced dimensionality through Datalevin; you'll get the full output. For something like Qwen3-Embedding-8B that means all 4096 dims. At small-to-medium scale the storage/latency cost is negligible, but worth knowing if you were planning to use a lower-dimensional slice.

6. Optional: .edn manifest file

You can place a your-model.gguf.edn file next to the model with metadata that Datalevin will pick up automatically (pooling strategy, prefixes, etc.):

    {:embedding/provider {:kind :local, :model-id "your-org/model-name"}
     :embedding/output   {:dimensions      1024
                          :pooling         :mean
                          :normalize?      true
                          :query-prefix    "query: "
                          :document-prefix "passage: "}}

This is optional. Things work without it, but it makes metadata-compatibility checks more robust if you later swap models.
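
For readers unsure what the "lower-dimensional slice" in point 5 would involve: Matryoshka-style truncation is usually described as keeping the first k components of the full vector and rescaling to unit length. The sketch below is plain Clojure and is not part of Datalevin; the function name truncate-embedding is hypothetical, and whether truncated vectors stay useful depends on how the model was trained.

```clojure
(defn truncate-embedding
  "Keeps the first k components of v and rescales them to unit L2 norm.
  If the kept components are all zero, returns them unscaled."
  [v k]
  (let [head (vec (take k v))
        norm (Math/sqrt (reduce + (map #(* % %) head)))]
    (if (zero? norm)
      head
      (mapv #(/ % norm) head))))

;; Example: keep 2 of 3 components; the kept part [3.0 4.0] has norm 5.0.
(truncate-embedding [3.0 4.0 12.0] 2)
```

Something like this would have to run outside the built-in provider, for example inside a custom provider, since the thread says the native embedder always returns the full n_embd output.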

euccastro 2026-04-14T08:39:23.127079Z

Note that with my particular model, embeddings take up a lot of space. I don't care, but typically you want a smaller model, especially until Matryoshka is supported.
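
A rough back-of-the-envelope for the space cost mentioned here, assuming each component is stored as a 4-byte float and ignoring any index overhead (both are assumptions; the thread does not say how Datalevin stores vectors). The helper name raw-embedding-bytes is hypothetical. The 384 figure is the output size of multilingual-e5-small, the default model named earlier in the thread.

```clojure
(defn raw-embedding-bytes
  "Raw size in bytes of n vectors with dims components each,
  at bytes-per-component bytes per component."
  [n dims bytes-per-component]
  (* n dims bytes-per-component))

;; One 4096-dimensional vector at 4 bytes per component.
(raw-embedding-bytes 1 4096 4)

;; One million vectors: 4096 dimensions versus 384 dimensions.
(raw-embedding-bytes 1000000 4096 4)
(raw-embedding-bytes 1000000 384 4)
```

Under these assumptions a 4096-dimensional vector is 16 KiB of raw data, a bit more than ten times the size of a 384-dimensional one.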

Huahai 2026-04-14T20:04:01.662699Z

❤️

Samuel Ludwig 2026-04-14T16:28:11.417339Z

I notice open-tx-log seems to be only for KV databases. If I wanted to record changes to my Datalog DB for future reference/debugging, would my best alternative be a listen! callback writing to a log? I'm currently trying to understand how best to get 'diffs' from the tx-results that get propagated to the callback (there are datoms in :tx-data, but no clear indication that I can see that I may have invoked a :db.fn/retractEntity).

Huahai 2026-04-14T20:08:49.307639Z

Right, the WAL is implemented at the KV level. :tx-data has everything you need: deleted datoms are in there too, and each datom has an added field that tells you whether it was added or retracted.
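
A minimal sketch of turning :tx-data into a diff, assuming that (:added datom) returns true for assertions and false for retractions, as the reply above suggests. The function name split-tx-data is hypothetical, and the example uses plain maps in place of real datoms so it runs without a database.

```clojure
(defn split-tx-data
  "Partitions a sequence of datom-like values into assertions and
  retractions, based on the truthiness of their :added field."
  [tx-data]
  (let [{added true retracted false} (group-by (comp boolean :added) tx-data)]
    {:added     (vec added)
     :retracted (vec retracted)}))

;; Plain maps standing in for datoms: one value replaced on entity 1.
(def sample-tx-data
  [{:e 1 :a :user/name :v "old" :added false}
   {:e 1 :a :user/name :v "new" :added true}])

(split-tx-data sample-tx-data)

;; Hypothetical wiring, assuming `d` aliases datalevin.core and `conn`
;; is an open Datalog connection:
;; (d/listen! conn :audit-log
;;            (fn [report] (println (split-tx-data (:tx-data report)))))
```

Note that this only recovers the effect of a transaction. A :db.fn/retractEntity call would presumably show up as a group of retracted datoms sharing one entity id rather than as a named operation.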

⭐ 1
