datahike

whilo 2026-01-17T06:05:35.833519Z

set the channel topic: https://datahike.io/, join the conversation at GitHub: https://github.com/replikativ/datahike/discussions and take a look at our docs on cljdoc: https://cljdoc.org/d/org.replikativ/datahike

whilo 2026-01-17T06:05:42.306729Z

set the channel description: https://datahike.io/, join the conversation at GitHub: https://github.com/replikativ/datahike/discussions and take a look at our docs on cljdoc: https://cljdoc.org/d/org.replikativ/datahike

whilo 2026-01-17T06:20:05.372369Z

Heads-up: I have done another big refactoring of https://github.com/replikativ/datahike/ and all its dependencies. All artifacts now live under org.replikativ instead of io.replikativ. As annoying as this is, it is necessary because we unfortunately lost the domain. But it also suits the project better, as replikativ will not just be about "IO" but about building replicating systems/organisations in general (think AI systems etc.). It also captures the open source nature of the replikativ project better. http://datahike.io got an overhaul, too, and will hopefully help attract more users who are not yet part of our little community.

Regarding the refactoring, I have moved our API specification (https://github.com/replikativ/datahike/blob/main/src/datahike/api/specification.cljc#L80) to malli, and now also expand the language bindings (https://github.com/replikativ/datahike/blob/main/doc/README.md#-language-bindings-beta) from it (instead of maintaining independent manual mappings). There is also a Java example project, which should make it easier to get started for people who miss out on (()). I also improved the docs overall, aiming for a more hierarchical structure and a shorter README.

Please let me know if you run into any problems, as I have probably missed something, or if you have feedback in general. I aim to get the bindings stable now that we have a consistent JS/Java/Python -> edn mapping, but I am not yet willing to commit to avoiding breaking changes in the bindings.

whilo 2026-01-17T06:28:07.156099Z

The last thing that I find somewhat annoying in the current code, and that can slow down development a bit, is the test suite. It would be nice to refactor and unify it across clj and cljs, for instance, or to unify the fixtures/utilities. Not critical though, and there is some benefit in keeping the tests fixed, as I have tried to avoid any backwards compatibility breakage.

whilo 2026-01-17T06:31:44.591899Z

I would be curious what people here are interested in and what they need from Datahike. Better performance is an obvious target. Another possibility is to provide precreated databases that you can download, or directly join with through S3/GCS etc. This is what Datahike's Distributed Index Space enables in general, and what is now much more accessible through the different distributed access patterns + language bindings. This is very different from traditional databases, which cannot share their index data with their consumers, but rather need to silo it locally in an MVCC context.

Coby Tamayo 2026-01-17T08:32:01.724689Z

I'd personally be very interested in some kind of support for full-text search like https://v1-docs.xtdb.com/extensions/1.24.3/full-text-search/. A dedicated search index I can query with Datalog is extremely appealing. Bread is going to need a search story sooner or later and I'd love to have something that's not just bolted on

whilo 2026-01-17T09:15:32.121619Z

hold my beer 😉

🍻 1
Coby Tamayo 2026-01-17T16:18:03.225189Z

wow, ask and ye shall receive 😅

whilo 2026-01-17T19:06:29.323879Z

hehe. lmk what kind of fulltext search you need. feedback on this is very helpful.

Coby Tamayo 2026-01-18T04:24:04.461289Z

I am pretty far away from knowing what I need tbh. Not least of all because I need to brush up on search tech in general to be able to talk about it intelligently 😛

alekcz 2026-01-17T08:48:08.953049Z

It would be cool to add a frontend cache for performance using konserve's new tiered model, e.g.

(def cfg {:store {:backend :file 
                  :id #uuid "550e8400-e29b-41d4-a716-446655440000"
                  :path "/tmp/example"}
          :frontend {:backend :redis 
                     :id #uuid "776684bb-e29b-41d4-a716-446655440000"
                     :uri ""}})
Then I have near-infinite storage on S3 but I still get microsecond retrievals for recent data.

🚀 1
whilo 2026-01-17T09:15:07.380149Z

that should already work with datahike

alekcz 2026-01-17T09:24:41.339419Z

Oh I didn't know that. I'll have a look

whilo 2026-01-17T09:55:18.441839Z

you can even stack them; this has limited use ofc., but the abstraction is compositional

whilo 2026-01-17T09:55:30.411449Z

the tricky bit is the write-through/sync-on-load logic

whilo 2026-01-17T09:55:51.746879Z

i also recommend checking out konserve-sync, which is used by the kabel writer

alekcz 2026-01-17T09:57:05.155999Z

I was thinking more of using redis as a cache. With tiered store do you need to be able to fit the whole thing?

alekcz 2026-01-17T09:57:22.743759Z

Or does it just cache a few datoms?

whilo 2026-01-17T10:17:40.323729Z

you can decide the strategy, but if you don't replicate everything then you have potential latency cliffs on reads because you need to hit the backend store
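
A rough sketch of the tradeoff described here, using plain Clojure atoms as stand-ins for the two tiers. This is not konserve's tiered-store API; `backend`, `frontend`, `backend-reads`, `tiered-get` and `tiered-put!` are all hypothetical names. A partial cache serves repeated reads quickly, but every key that is not cached yet costs a trip to the backend, which is where the latency cliff comes from.

```clojure
;; Hypothetical stand-ins: `backend` plays the large, slow store (e.g. S3),
;; `frontend` plays the small, fast cache (e.g. Redis).
(def backend (atom {:k1 "v1" :k2 "v2"}))
(def frontend (atom {}))
(def backend-reads (atom 0)) ; counts how often we had to hit the slow tier

(defn tiered-get
  "Read-through lookup: serve from the frontend if present, otherwise
  fetch from the backend and populate the frontend on the way back."
  [k]
  (if-let [hit (find @frontend k)]
    (val hit)
    (let [v (get @backend k)]
      (swap! backend-reads inc)
      (swap! frontend assoc k v)
      v)))

(defn tiered-put!
  "Write-through: update the backend first, then the frontend."
  [k v]
  (swap! backend assoc k v)
  (swap! frontend assoc k v)
  v)
```

Calling `(tiered-get :k1)` twice only increments `backend-reads` once; the first read of any uncached key is the slow one.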

whilo 2026-01-17T10:10:47.607909Z

Announcing Proximum - Persistent Vector Database with Git-like Versioning

We're excited to release Proximum, an embeddable vector database for Clojure and Java that brings persistent data structure semantics to vector search.

Key Features:
• Git-like versioning - branches, commits, time-travel queries
• Zero-cost branching - fork indices for experiments without copying data
• Clojure collection protocols - use assoc, dissoc, get on your index
• SIMD-accelerated - ~50% of native C++ hnswlib performance, pure JVM
• Spring AI & LangChain4j integrations included

(require '[proximum.core :as prox])

(def idx (prox/create-index {:type :hnsw :dim 384 :capacity 10000
                             :store-config {:backend :memory :id (random-uuid)}}))

;; Works like a Clojure map
(def idx2 (assoc idx "doc-1" (float-array (repeatedly 384 rand))))

;; Git-like operations
(prox/sync! idx2)
(def experiment (prox/branch! idx2 :experiment))
Perfect for RAG applications where you need reproducible results, A/B testing embeddings, or audit trails.

Install:
org.replikativ/proximum {:mvn/version "0.1.2"}

Links:
• GitHub: https://github.com/replikativ/proximum
• Product page: https://datahike.io/proximum/

📋 Help us prioritize! Please fill out our 2-min feedback survey: https://docs.google.com/forms/d/e/1FAIpQLSeUQuw5SPyIx661e1pwZiX0100bP-DPpF2Zfpptg1h6k14OTA/viewform

Requires Java 22+. This is an early beta - feedback welcome!

🙌 2
🚀 1
whilo 2026-01-17T11:11:42.271209Z

Here is an example I created from Wikipedia with embeddings to try (no GPU needed): https://github.com/replikativ/einbetten

fmjrey 2026-01-17T15:13:45.840039Z

> Zero-cost branching - fork indices for experiments without copying data

That single sentence totally explained your storage model and the elusive insistence on indexes in your proximum product page, and your earlier comment. I think this sentence/fact deserves to be highlighted earlier in the narrative. Most people do not necessarily understand how indexes can become so much more in databases using structural sharing to keep the history. IIRC datomic indexes store the actual datoms, they're not pointing to another storage area, meaning data is actually duplicated in each index, thus blurring the distinction between data and indexes. Now I understand that in datahike indexes are separated from datoms. Even though they're in the same hitchhiker tree, they're in different nodes, with datoms being in leaf nodes, is that right?

For the record, and to ensure we're not pushing the git analogy too far, git commits are stored in objects under .git/objects, while .git/index is confusingly not an index but a staging area that tracks the current working tree. The actual git history is stored with each commit with a parent attribute, meaning index and data are mingled.

whilo 2026-01-17T19:09:37.480579Z

Datomic has the same storage model, it just exposes it less. Our indices also store facts/datoms, eavt, aevt, avet in https://github.com/replikativ/persistent-sorted-set (pss). Same for the vector index (but for vectors and an id/metadata binding). I have deprecated the hitchhiker-tree btw., because its baseline mechanics were slower and we pay for this on every read. The ideas in there are interesting and worth revisiting, but it was better to optimize for reads with the pss.

fmjrey 2026-01-17T19:30:20.052069Z

OIC, I was reading https://medium.com/@csm/datahiking-into-the-cloud-ae9cb8619748#edf1, which is 6 years old indeed, and tbh I did not continuously follow datahike. So are datoms duplicated in each index, or is there some structural sharing in datahike and proximum?

whilo 2026-01-17T20:01:41.699899Z

Yes, I am kind of rebooting the Datahike project with a deeper scope atm. I always kept working on the GitHub repo, also during my PhD, but the original company (lambdaforge) that supported it is unfortunately not active anymore. Datahike is used at scale in the Swedish government though, and we made a lot of improvements to make Datahike work well there, most importantly query engine and indexing optimizations.

whilo 2026-01-17T20:01:56.001159Z

They are duplicated in indices and that is what you want, because fast retrieval requires the things you need to be accessible through as few memory hops as possible. Each index is good for a different access pattern, and you directly have the datom when you do the lookup in it. You can break out large values (and you should), e.g. if you store large strings/byte arrays, and then they can be structurally shared between the indices. But many datoms are just a few bytes large and in the indices you can do fast range scans to get a whole bunch of them in one go. Does this make sense?
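
To make the access-pattern point concrete, here is a toy sketch using Clojure's built-in sorted sets. It is not Datahike's or persistent-sorted-set's actual implementation; the datom layout `[e a v]`, the sample data, and the names `eavt`, `avet`, `entity-1` and `ages-from-40` are assumptions for illustration. The same datoms are stored twice, once per sort order, and each order makes a different range scan cheap.

```clojure
;; Toy datoms as [e a v] vectors (hypothetical layout, cardinality one).
(def datoms
  [[1 :user/name "Ada"]
   [1 :user/age 36]
   [2 :user/name "Bob"]
   [2 :user/age 41]])

;; EAVT-like order: equal-length vectors compare element-wise,
;; so the default comparator sorts by entity, then attribute.
(def eavt (into (sorted-set) datoms))

;; AVET-like order: sort by attribute, then value, then entity.
(def avet
  (into (sorted-set-by
         (fn [[e1 a1 v1] [e2 a2 v2]]
           (compare [a1 v1 e1] [a2 v2 e2])))
        datoms))

;; "Everything about entity 1": one contiguous range in EAVT.
;; nil sorts before any other value, so [1 nil nil] is a lower bound.
(def entity-1
  (subseq eavt >= [1 nil nil] < [2 nil nil]))

;; "All ages >= 40": one contiguous range in AVET.
(def ages-from-40
  (take-while #(= :user/age (second %))
              (subseq avet >= [nil :user/age 40])))
```

Each scan returns the full datoms directly from the index it ran over, with no second lookup into a separate data area.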

whilo 2026-01-17T20:02:42.526029Z

I am happy to add documentation to explain this better, I just need to make sure people find what they are looking for. There are already quite a few docs now and lately I tried to condense this more.

fmjrey 2026-01-17T20:11:13.427989Z

I don't know if documentation is a priority right now, unless you've reached a stable plateau. My reaction was more about making sense of these sentences regarding proximum:
> Zero-cost branching - fork indices for experiments without copying data
> • Snapshots - Immutable index snapshots in O(1) (structural sharing).
So I suppose the structural sharing mentioned is in memory then, not in storage, right?

fmjrey 2026-01-17T20:13:45.431429Z

Also from a point of view of a person that does not know much about datomic or datahike storage model, it can be surprising to see indexes being the center of attention. So in that regard it may help to have a paragraph provide more context.

fmjrey 2026-01-17T20:16:19.334179Z

And it would help to understand why indexes are loaded by the client, while other DB keep them internal as you said.

whilo 2026-01-17T20:28:58.902679Z

My value proposition is that the main benefit of immutable data structures is that they can be shared and cached much better. For instance, Datahike could have a konserve store backend that directly hooks into Cloudflare (which is maybe the best cache solution out there), and no matter where you are, most of the database would be just a few milliseconds' fetch away. The same holds locally on your machine (where it is): whenever any of the persistent index data structures is updated, only a logarithmically sized index delta (the path, see https://hypirion.com/musings/understanding-persistent-vector-pt-1) is written and needs to be sent/fetched; the rest you still have at hand. Does this make sense? This is also the reason why git deltas are very efficient and fast to sync, btw.
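
A small sketch of the "only the path is new" idea, using an ordinary nested Clojure map as a stand-in for an index tree. The shape of `index-v1` is hypothetical and much shallower than a real index; the point is only that an update copies the nodes on the path to the changed leaf and reuses every other subtree as-is.

```clojure
;; A toy two-level "index"; the structure is made up for illustration.
(def index-v1
  {:left  {:a [1 2 3]  :b [4 5 6]}
   :right {:c [7 8 9]  :d [10 11 12]}})

;; "Update" one leaf. index-v1 itself is unchanged; index-v2 gets new
;; nodes only along the path root -> :left -> :a.
(def index-v2 (update-in index-v1 [:left :a] conj 99))

;; Untouched subtrees are the very same objects in both versions,
;; so they would not need to be written or fetched again.
(def right-shared?
  (identical? (:right index-v1) (:right index-v2)))

(def left-b-shared?
  (identical? (get-in index-v1 [:left :b])
              (get-in index-v2 [:left :b])))

;; The nodes on the changed path are new.
(def left-copied?
  (not (identical? (:left index-v1) (:left index-v2))))
```

All three flags evaluate to `true`: the delta between the two versions is the changed path, and everything else is shared.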

fmjrey 2026-01-17T20:39:37.204869Z

It definitely makes sense and I see where you want to go, I'm all in. What I think would help the reader of https://datahike.io/proximum/ is to have a paragraph like this early on:
The Proximum/Datahike persistence model stores data within indexes. The latter are shared with clients because immutability makes them easy to cache. On the client, structural sharing allows for branching into new local forks without duplicating the unchanged data.

fmjrey 2026-01-17T20:51:56.133449Z

Am I right in saying that what Proximum brings to the table is to make these forks also persistent and shareable like git forks?

whilo 2026-01-17T21:05:59.218389Z

Yes, exactly.

fmjrey 2026-01-17T21:06:37.844299Z

I guess it's probably better to say branches because a fork implies a new remote, in git-speak

whilo 2026-01-17T21:08:47.472619Z

The general sport is to take a useful data structure and carefully introduce copy-on-write during changes to make the git-model possible. I was not sure how generally feasible this is, and there might be cases where it is difficult, but proximum gave me confidence that it can be pulled off very widely. ZFS has this on file system level btw. This is very general and powerful, but you still need to implement the actual in-memory data structure this way to benefit properly (i.e. you can snapshot postgres all the time, but it will still not work well with its connection management, you will have to start processes/containers etc. and it will create much more overhead).

whilo 2026-01-17T21:10:00.318339Z

For AI agents people now try to use Docker/OS processes etc., but this is very clunky. It is a reasonable fallback, but not nearly as good.

whilo 2026-01-17T21:10:26.831149Z

I think this is where Clojure could actually break through, this is why I am working on this (in part).

fmjrey 2026-01-17T21:19:18.025999Z

Makes sense. So now the introductory paragraph that I imagine would help set the scene for the Proximum page would look like this: Proximum uses Datahike, whose persistence model stores data within indexes. Thanks to their immutability, indexes are easy to share with clients. On the client, they can be used as a cache, enabling local querying, while structural sharing allows for local branches without duplicating the unchanged data. Proximum makes these branches durable and shareable like git branches.

Coby Tamayo 2026-01-17T17:38:32.247569Z

At some point I plan on implementing revisions, i.e. pretty much exactly this use-case:
> • Collaborative editing where changes need review before merging

The rough plan was to model a revision (defined as a sequence of one or more diffs of arbitrary content, stored as edn, with some metadata such as authorship, revision notes, merged-at time etc.) explicitly in the db schema. A "preview revision" feature would then work like this: pull revision diffs, apply them with https://github.com/juji-io/editscript or similar, and render the revised content (optionally offering a "diff view"). I like that this is very direct and that you can query revision metadata just by traversing a ref. However, db-level branching seems much more powerful in its ability to capture arbitrary changes, and I'm sure it solves many of the problems I would need to solve at a deeper level.

My questions are:
1. Are there facilities for dealing with merge conflicts? I assume merge conflicts are possible since you mention Git but not Pijul 😉 ...in which case, I expect merge! just throws an exception?
2. How would I represent/where would I store revision metadata? In the UI, I would want to be able to list arbitrary revisions, so presumably references to unmerged revisions would be reified with their metadata in the prod branch. Looking at the staging environment example (https://cljdoc.org/d/io.replikativ/datahike/0.7.1624/doc/core-features/versioning-beta-#staging-environment-for-data-review), I imagine I could just store the branch name? So:

;; Editor creates an arbitrary branch
  (let [branch-name (generate-branch-name)
        staging-conn (d/connect (assoc cfg :branch branch-name))]

    ;; Make draft changes
    (d/transact staging-conn [{:article/title "New Article"
                               :article/status :draft
                               :article/content "..."}])

    ;; Save reference to the new revision branch. For review UI,
    ;; simply query for open revisions
    (d/transact prod-conn [{:revision/author (:db/id current-user)
                            :revision/status :open
                            :revision/created-at (Date.)
                            :revision/notes "check out my mew article uwu"
                            :revision/branch branch-name}])

    ;; Reviewers can read staging branch without affecting production
    ;; ... review process ...

    ;; merge...or reject!
    (d/transact prod-conn [{:db/id revision-id
                            :revision/reviewed-at (Date.)
                            :revision/status (if accepted? :merged :closed)}])
    (when accepted? (merge! ...)))

whilo 2026-01-17T19:19:39.360709Z

TLDR; you can use Datalog itself to extract the data you want to merge.
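
A minimal sketch of what that could look like, assuming you have already pulled `[e a v]` tuples out of each branch (for example with a Datalog query along the lines of `[:find ?e ?a ?v :where [?e ?a ?v]]` run against each branch's db value). The tuples below are made-up sample data and the names `prod-datoms`, `staging-datoms`, `additions`, `retractions` and `tx-data` are hypothetical; only `clojure.set` is used, so this shows the diffing step rather than Datahike's merge API.

```clojure
(require '[clojure.set :as set])

;; Hypothetical query results from the two branches.
(def prod-datoms
  #{[1 :article/title "Old Title"]
    [1 :article/status :draft]})

(def staging-datoms
  #{[1 :article/title "New Title"]
    [1 :article/status :draft]})

;; Facts only in staging need to be added to prod;
;; facts only in prod need to be retracted.
(def additions   (set/difference staging-datoms prod-datoms))
(def retractions (set/difference prod-datoms staging-datoms))

;; Express the diff as ordinary transaction data.
(def tx-data
  (vec (concat
        (for [[e a v] retractions] [:db/retract e a v])
        (for [[e a v] additions]   [:db/add e a v]))))
```

Because the diff is computed against the current state of the target branch, you decide in your own code what to do when both sides touched the same attribute. This sketch also assumes entity ids line up across branches, which may not hold for entities created independently on each side.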

whilo 2026-01-17T19:21:50.167539Z

I still need to merge the git API in datahike into the main API. This is on my agenda next.

Coby Tamayo 2026-01-17T19:45:25.168999Z

Yeah I get the diffing part and that is straightforward enough. And I guess that side-steps the merge conflict question in that you can always just diff against the target branch. 👍 #2 is the more interesting question though. Does that approach seem reasonable? And if I wanted an audit log of past revisions, it means I would have to either keep all branches around indefinitely or denormalize them as serialized diffs. Right?

whilo 2026-01-17T22:51:21.341839Z

Good point. This is like branch metadata. You can track this in the same database, we might also want to standardize this. I am thinking about this atm. The one issue you have with transacting in the same database is that the transaction will be separate from the actual commit, but that is not necessarily a problem.

Coby Tamayo 2026-01-17T23:33:48.320669Z

Separate transaction/commit is not a problem for my use-case I don't think. What would standardizing look like in your mind? I could see it going in lots of directions depending on how faithful to the git model you want to be. I don't have a strong opinion, as long as storing branch-name like this is not against the grain in some way.

Coby Tamayo 2026-01-17T23:44:05.584379Z

I guess I'd be wary of introducing a whole "committer" concept that could diverge from the application-level concept of "user." For example Git has the user.email setting; Bread has :user/emails. So if by standardizing you mean introducing special attributes akin to :db/doc for branches/commits and their metadata, I would definitely keep any opinionated schema as minimal as possible for that reason. It is already "standardized" in the sense that generic Datalog already lets you express metadata in whatever schema you want!

Coby Tamayo 2026-01-17T23:56:01.149999Z

I guess having a clearer idea about branch/commit semantics would clarify my thinking. In Datahike, commits are root pointers, OK...so are branches pointers to pointers? sorry read that wrong...but what exactly are commits then? Is there the same "plumbing vs. porcelain" distinction as in Git? And are commits reified in a similar way to transactions, i.e. :db/txInstant?

whilo 2026-01-18T01:40:13.084729Z

yes, similar to transactions; the txInstant is an internal value in the index, while datahike itself also keeps track of parent pointers https://github.com/replikativ/datahike/blob/main/src/datahike/writing.cljc#L161 that can be walked to get the commit graph; this is a newer addition, after we covered the datomic equivalents for history search
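
For intuition, walking parent pointers to recover a commit graph can be sketched like this. The `commits` map and its `:parents` key are a hypothetical stand-in, not the format Datahike actually stores; `ancestors-of` is an illustrative helper.

```clojure
;; Hypothetical commit store: commit id -> metadata with parent ids.
;; :c4 is a merge commit with two parents.
(def commits
  {:c1 {:parents #{}}
   :c2 {:parents #{:c1}}
   :c3 {:parents #{:c1}}
   :c4 {:parents #{:c2 :c3}}})

(defn ancestors-of
  "Returns the set of all commit ids reachable from `head` by following
  parent pointers, including `head` itself."
  [commits head]
  (loop [seen #{}
         todo [head]]
    (if-let [id (peek todo)]
      (if (contains? seen id)
        (recur seen (pop todo))
        (recur (conj seen id)
               (into (pop todo) (get-in commits [id :parents]))))
      seen)))
```

`(ancestors-of commits :c4)` visits all four commits, while `(ancestors-of commits :c3)` only reaches `:c3` and `:c1`.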

whilo 2026-01-18T01:46:04.380299Z

datahike's history index, for instance, provides the ability to query across all transactions (by retaining an index with all changes); this is not automatically given by just writing snapshots with commit ids; the latter is more the basic memory model we have now. as-of is a bit in between: we construct it from the history index for historical reasons, and that is probably the right thing to do, but we could also just retrieve the snapshot of that point, which would be more efficient than projecting the history index.

Coby Tamayo 2026-01-18T03:29:24.709489Z

That all makes sense, I think.
> datahike's history index for instance provides the ability to query across all transactions (by retaining an index with all changes)
right, the t in eavt if I understand correctly
> writing snapshots with commit ids...is more the basic memory model we have now;
I think you are just describing "place-oriented programming"? But what I'm asking is are the commits reified in schema the way :db/txInstant is? It looks like they are just checkpoints stored in the metadata though. They don't have relations and can't be queried in the same way.

Coby Tamayo 2026-01-18T03:38:26.671809Z

What I'm getting at is, in my example above, the :revision/branch attr is of type keyword. The db doesn't know it happens to represent a branch or that it has any special meaning at all. But if there were, say, a :db.type/commitRef type, maybe there would be some advantages to that. I don't know if that would work, it could be utter nonsense, that is just where my brain went when you mentioned standardizing

whilo 2026-01-18T03:40:42.269179Z

We could add the commit-ids into the database, yes. The one thing to keep in mind, though, is that you can gc them if you don't want to keep all these snapshots around, in which case they would be gone. That can be fine; then the commitRef points to nowhere, which will indicate it was gc'ed. I am just saying this because we don't gc things inside the indices, only snapshots.

Coby Tamayo 2026-01-18T03:41:21.340119Z

ah ok that makes sense

whilo 2026-01-18T03:42:40.154879Z

The snapshots are a more fundamental memory model that is different to Datomic or DataScript (although the storage support I helped add there makes it similar), because we keep the snapshots around for distributed/remote readers and sync them as needed in parallel. Datomic needs a transactor overlay which is the most complicated piece in its memory model and a central point of failure. Datahike always writes the full index to storage and any process can read it, even if the writer (transactor) is down.

whilo 2026-01-18T03:43:20.051059Z

E.g. you can give somebody only access to your S3 bucket, and they can join your db from S3 with Datahike against their own databases.

🔥 1
whilo 2026-01-18T03:45:05.561279Z

Most people don't want to be too permissive with their DB, but this should not be conflated with a limited memory model which forces all participants to meet in an MVCC context (e.g. SQL dbs, LMDB etc.). Clojure actually tried to get away from this, but Datomic still bought back into it a little, which in the end made it very complex, I think.

Coby Tamayo 2026-01-18T03:46:00.123179Z

yeah totally agree

whilo 2026-01-18T03:46:46.121569Z

DataScript also keeps a transaction log, which I thought about, too. The upside is less write amplification and overhead. The downside is that on every db deref you need to retransact it and load potentially many index fragments from storage for that. I think the write overhead is not that bad, you can just do gc regularly and reclaim the storage and the upside is that you always have optimal indices available for each snapshot you want to use.

whilo 2026-01-18T03:47:29.872549Z

We don't have problems with write throughput in general, so this is an ok price to pay in my opinion.

whilo 2026-01-18T03:51:41.855249Z

It might be annoying if you are on DynamoDB and have a lot of small transactions.

whilo 2026-01-18T03:52:05.686819Z

Because of the pricing model.

whilo 2026-01-18T03:52:24.053699Z

It is always possible to add a transaction log though.

whilo 2026-01-18T03:52:58.505789Z

Anyway, this is a bit of a digression, but maybe it helps to understand what I am prioritizing.

Coby Tamayo 2026-01-18T04:07:22.629939Z

It seems like a good tradeoff in general. Appreciate the context! For my use-case I will probably just keep branches around indefinitely by default, and store the branch names as keywords. That seems like the simplest way, and I don't think it incurs much storage overhead above and beyond what you'd incur for keeping history around generally? Also, denormalizing would be bad for GDPR, and would probably mean a bunch of other weird edge cases down the line! At some point Bread could optionally GC open db branches after some time, and/or denormalize on merge if data excision doesn't matter to the user.