Announcing Proximum - Persistent Vector Database with Git-like Versioning
We're excited to release Proximum, an embeddable vector database for Clojure and Java that brings persistent data structure semantics to vector search.
Key Features:
• Git-like versioning - branches, commits, time-travel queries
• Zero-cost branching - fork indices for experiments without copying data
• Clojure collection protocols - use assoc, dissoc, get on your index
• SIMD-accelerated - ~50% of native C++ hnswlib performance, pure JVM
• Spring AI & LangChain4j integrations included
(require '[proximum.core :as prox])
(def idx (prox/create-index {:type :hnsw :dim 384 :capacity 10000
                             :store-config {:backend :memory :id (random-uuid)}}))
;; Works like a Clojure map
(def idx2 (assoc idx "doc-1" (float-array (repeatedly 384 rand))))
;; Git-like operations
(prox/sync! idx2)
(def experiment (prox/branch! idx2 :experiment))
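The "works like a Clojure map" idea above can be illustrated with an ordinary map as a stand-in. This is only a sketch of persistent-value semantics in plain Clojure; it does not call Proximum, and the names v1, v2, base, forked and pruned are made up for illustration. How Proximum implements this internally is not shown here.

```clojure
;; Stand-in only: a plain Clojure map plays the role of the index.
;; Nothing below calls Proximum.
(def v1 (float-array [0.1 0.2 0.3]))
(def v2 (float-array [0.4 0.5 0.6]))

(def base   (assoc {} "doc-1" v1))     ; first version
(def forked (assoc base "doc-2" v2))   ; derived version, base is untouched
(def pruned (dissoc forked "doc-1"))   ; another derived version

(count base)                 ; => 1
(count forked)               ; => 2
(contains? base "doc-2")     ; => false, the older version never sees later writes
(get pruned "doc-1")         ; => nil
;; Both versions point at the same array object; the vector was not copied.
(identical? (get base "doc-1") (get forked "doc-1")) ; => true
```

Each assoc or dissoc returns a new value and leaves the earlier one readable, which is the property that makes keeping several versions side by side cheap.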
Perfect for RAG applications where you need reproducible results, A/B testing embeddings, or audit trails.
Install:
org.replikativ/proximum {:mvn/version "0.1.2"}
Links:
• GitHub: https://github.com/replikativ/proximum
• Product page: https://datahike.io/proximum/
📋 Help us prioritize! Please fill out our 2-min feedback survey: https://docs.google.com/forms/d/e/1FAIpQLSeUQuw5SPyIx661e1pwZiX0100bP-DPpF2Zfpptg1h6k14OTA/viewform
Requires Java 22+. This is an early beta - feedback welcome!
Integration into #datahike as a secondary index is planned. Here are examples of how it can already be manually integrated into persistent databases like Datomic, Datahike, DataScript or XTDB: https://github.com/replikativ/einbetten/blob/main/docs/datalog-semantic-search-patterns.md
That's dope. I have a project in the planning stages that's using Datahike as the DB, and I thought I was going to have to lean on my python service to manage a separate Vector DB 🙂
Nice 🙂. What application of a vector db do you need? If you have a moment you could also fill out the form or lmk what I should add to it to get a clearer picture of the needs.
I'm not a paying customer, just doing a random personal project. But, if you are curious, it's basically a big RAG DB of podcasts. I listen to a lot, and I'm building a tool to download them all to a self-hosted s3 bucket, transcribe them with whisper, then index them for future searchability.
Right, the replikativ libraries always were and will be open source projects, and I don't expect people to pay to use them. I need to gauge commercial interest though to grow it. Yes, I am curious. That sounds very cool and makes a lot of sense. I used whisper, too, in my academic work. What embedder do you use? I found fastembed, which works on CPU only and promises to be somewhat competitive recall-wise, but I am a bit skeptical. I guess the Qwen embedding models are maybe a good compromise right now, but then you need a GPU or an external provider (including the latency that induces during searches).
For sure, I just don't want to steer you, because my use case most likely doesn't end in payment 🙂. For embedding, I think I was using multi-qa-MiniLM-L6-cos-v1 via Python SentenceTransformer, though don't take that as a recommendation - I have never loaded enough data into it to see what the recall quality is for my use case.
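One way to check recall on a small sample is to compare an index's top-k against an exact brute-force top-k. The sketch below is plain Clojure and makes no assumption about any particular index. The helper names (dot, cosine, exact-top-k, recall-at-k) and the toy data are hypothetical, and the "approximate" ids are hard-coded where a real index's results would go.

```clojure
(require '[clojure.set :as set])

(defn dot [a b] (reduce + (map * a b)))

(defn cosine
  "Cosine similarity of two equal-length numeric vectors."
  [a b]
  (/ (dot a b)
     (* (Math/sqrt (dot a a)) (Math/sqrt (dot b b)))))

(defn exact-top-k
  "Brute-force top-k ids by cosine similarity; docs is a map of id -> vector."
  [docs query k]
  (->> docs
       (sort-by (fn [[_ v]] (- (cosine query v))))
       (take k)
       (map first)))

(defn recall-at-k
  "Fraction of the exact top-k ids that also appear in the approximate result."
  [approx-ids exact-ids]
  (/ (count (set/intersection (set approx-ids) (set exact-ids)))
     (double (count exact-ids))))

;; Toy data, purely illustrative.
(def docs {"a" [1.0 0.0] "b" [0.9 0.1] "c" [0.0 1.0]})

(exact-top-k docs [1.0 0.0] 2)       ; => ("a" "b")
;; Pretend an index returned ["a" "c"] for the same query:
(recall-at-k ["a" "c"] ["a" "b"])    ; => 0.5
```

Brute force is only practical on a sample, but a few hundred queries are usually enough to see whether an embedder and index combination is losing relevant results.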