I’m using the new vector feature from master and wondering how to use a different model as an embedding provider in the datalog store. I want to use a model from Hugging Face. How can I do this? I know this isn’t the final release, so I don’t have any expectations :D Just want to validate some ideas.
I have since switched to a model run externally and served via HTTP, so I could make better use of my particular Intel NPU. Happy to give details on that too, but it boils down to implementing IEmbeddingProvider (see datalevin.embedding in the source). I can give you my implementation if you want, but a good LLM will probably figure out something better adapted to your needs if you want to go this route.
You don't even need to do that. Just creating an embedding service with a base URL, model, and API key is enough.
This is to support other languages that can't implement a Clojure protocol.
I've done exactly this and I asked Claude to describe how it works (sorry for the formatting, but hopefully it's usable):

Hey! I'm doing exactly this: using a Hugging Face GGUF model (Qwen3-Embedding-8B) as the embedding provider in a Datalog store on master. Here's what works:

1. Get a GGUF-format model

You need the GGUF version of whatever HF model you want. Many popular embedding models have community-provided GGUF quantizations on HF (search for <model-name> gguf).

2. Configure the embedding opts when creating your connection

    (d/create-conn "/tmp/my-db" my-schema
      {:embedding-opts
       {:provider    :default                 ;; uses the built-in llama.cpp provider
        :model       "/path/to/your-model.gguf"
        :dimensions  4096                     ;; must match the model's n_embd
        :ctx-size    512                      ;; token context window
        :batch-size  512
        :threads     6                        ;; number of CPU threads
        :metric-type :cosine}})

The key is :model. Point it at your GGUF file and Datalevin will use it instead of the default multilingual-e5-small. You also need :dimensions matching your model's output dimensionality.

3. Schema: mark attributes for embedding

    {:my/content {:db/valueType            :db.type/string
                  :db/embedding            true
                  :db.embedding/autoDomain true}}

Then embedding-neighbors in Datalog queries just works with your model.

4. Important: :ctx-size for large models

If your model is large (e.g. 8B params), you must set :ctx-size explicitly. The default (0, which defers to native defaults) can cause an enormous memory allocation. I hit a 215 GB allocation attempt with Qwen3-Embedding-8B before setting :ctx-size 512 :batch-size 512.

5. Matryoshka limitation

One thing to be aware of: the native LlamaEmbedder.embed() always returns the full n_embd dimensions from the model. There's no truncation support, so if your model supports Matryoshka (variable-dimension) embeddings, you can't use a reduced dimensionality through Datalevin; you'll get the full output. For something like Qwen3-Embedding-8B that means all 4096 dims. At small-to-medium scale the storage/latency cost is negligible, but worth knowing if you were planning to use a lower-dimensional slice.

6. Optional: .edn manifest file

You can place a your-model.gguf.edn file next to the model with metadata that Datalevin will pick up automatically (pooling strategy, prefixes, etc.):

    {:embedding/provider {:kind :local, :model-id "your-org/model-name"}
     :embedding/output   {:dimensions 1024, :pooling :mean, :normalize? true
                          :query-prefix "query: ", :document-prefix "passage: "}}

This is optional. Things work without it, but it makes metadata-compatibility checks more robust if you later swap models.
Note that with my particular model, embeddings take up a lot of space. I don't care, but typically you want a smaller model, esp. until Matryoshka is supported.
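To put a rough number on "a lot of space", here is a back-of-the-envelope sketch. It assumes each vector component is stored as a 4-byte float, which is an assumption for illustration and not something confirmed in this thread; `embedding-bytes` is a hypothetical helper, not part of Datalevin.

```clojure
;; Rough raw size of a single embedding vector.
;; ASSUMPTION: 4 bytes per component (float32); the actual on-disk
;; format may differ.
(defn embedding-bytes
  [dimensions bytes-per-component]
  (* dimensions bytes-per-component))

;; Full 4096-dim output (e.g. Qwen3-Embedding-8B) vs. a 1024-dim model.
(def full-size  (embedding-bytes 4096 4))
(def small-size (embedding-bytes 1024 4))

;; How many times larger each full-size vector is.
(def size-ratio (/ full-size small-size))
```

Under that assumption a 4096-dim vector is four times the size of a 1024-dim one, before any index overhead.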
❤️
I notice open-tx-log seems to be only for KV databases. If I wanted to record changes to my Datalog DB for future reference/debugging, would my best alternative be a listen! callback writing to a log? I'm currently trying to understand how best to get 'diffs' from the tx-results that get propagated to the callback (there are datoms in :tx-data, but no clear indication, that I can see, that I may have invoked a :db.fn/retractEntity).
Right, WAL is implemented at the KV level. Tx-data has everything you need: deleted datoms are in there, and each datom has an added field.
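A minimal sketch of turning :tx-data into a diff by splitting on the added field. `split-tx-data` is a hypothetical helper, and plain maps stand in for datoms so the snippet runs on its own; it assumes real datoms answer keyword lookups for :e, :a, :v and :added, which is worth checking against your build.

```clojure
(defn split-tx-data
  "Splits a seq of datoms into {:added [...] :retracted [...]},
  keeping each datom as an [e a v] triple."
  [tx-data]
  (let [{added true retracted false} (group-by (comp boolean :added) tx-data)
        ->eav (fn [d] [(:e d) (:a d) (:v d)])]
    {:added     (mapv ->eav added)
     :retracted (mapv ->eav retracted)}))

;; Stand-in data: every attribute of entity 1 is retracted
;; and entity 2 is asserted.
(def sample-tx-data
  [{:e 1 :a :user/name :v "Ann" :added false}
   {:e 1 :a :user/age  :v 30    :added false}
   {:e 2 :a :user/name :v "Bob" :added true}])

(def sample-diff (split-tx-data sample-tx-data))

;; Possible wiring into a listener (untested sketch; `conn` and the
;; :audit-log key are placeholders):
;; (d/listen! conn :audit-log
;;            (fn [tx-report]
;;              (println (split-tx-data (:tx-data tx-report)))))
```

With this shape, an entity whose attributes all land under :retracted in one transaction is the closest signal that a whole-entity retraction happened, since the datoms themselves do not name the transaction function that produced them.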