announcements

lukasz 2026-02-09T15:23:23.526889Z

I've been playing around with creating MCP servers that can be shipped as native binaries, and at the same time I needed MCP support that works out of the box - so here's MCP2000XL: https://github.com/lukaszkorecki/mcp2000xl
• wraps official MCP SDK
• but ignores its heavy transport layers
• perfect for creating stateless (ish) MCP servers with STDIO or HTTP transport
It started as a fork of Latacora's MCP SDK wrapper, but it very quickly spiraled out of control and I ended up removing most of it.

Felipe 2026-02-09T22:57:44.321789Z

great name

❤️ 1
whilo 2026-02-09T05:41:37.649739Z

Announcing Scriptum - Copy-on-Write Branching for Apache Lucene
We're releasing Scriptum, a library that brings value semantics to Lucene indices.
Key Features:
• Zero-cost forking - Branch any index in 3-5ms regardless of size (copies metadata, not data)
• Structural sharing - Branches share immutable Lucene segments via COW overlay directories
• Time travel - Open readers at any historical commit point
• Full Lucene 10.x - Text search, KNN vectors, facets - all branch-aware
• Yggdrasil integration - Implements Snapshotable, Branchable, Graphable, Mergeable protocols

(require '[scriptum.core :as sc])

(def writer (sc/create-index "/tmp/search-idx"))
(sc/add-doc writer {:title {:type :text :value "Hello World"}})
(sc/commit! writer "Initial commit")

;; Fork in milliseconds
(def experiment (sc/fork writer "experiment"))
(sc/add-doc experiment {:title {:type :text :value "Branch only"}})
(sc/commit! experiment "Experimental change")

;; Main unchanged, branch has new doc
(sc/search writer {:match-all {}} 10)      ;; => 1 result
(sc/search experiment {:match-all {}} 10)  ;; => 2 results

;; Merge back when ready
(sc/merge-from! writer experiment)
How it works: Scriptum extends Lucene with four components: BranchedDirectory (COW overlay), BranchDeletionPolicy (retains all commits), BranchAwareMergePolicy (protects shared segments), and BranchIndexWriter (main API). See docs/LUCENE_EXTENSION.md for the technical deep-dive.
Install:
org.replikativ/scriptum {:mvn/version "0.1.11"}
Requirements: Java 21+, Lucene 10.3.2
Links:
• GitHub: https://github.com/replikativ/scriptum
• Technical docs: docs/LUCENE_EXTENSION.md
Feedback appreciated!
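
To illustrate the copy-on-write overlay idea behind the BranchedDirectory component named above, here is a toy model in plain Clojure. It is only a sketch under stated assumptions: a "directory" is modelled as a map of file names to contents, and every name in it (`make-dir`, `fork`, `write-file`, `read-file`, `all-files`) is hypothetical and not part of Scriptum's API.

```clojure
;; Toy model of a copy-on-write overlay directory. Hypothetical names throughout;
;; this is not Scriptum's BranchedDirectory.
;; A "directory" is {:base {file-name contents} :overlay {file-name contents}}.

(defn make-dir
  "Directory whose files all live in the shared base layer."
  [files]
  {:base files :overlay {}})

(defn all-files
  "Merged view of a directory: overlay entries shadow base entries."
  [dir]
  (merge (:base dir) (:overlay dir)))

(defn fork
  "Branch a directory. File contents are shared, not copied;
  the branch starts with an empty private overlay."
  [dir]
  {:base (all-files dir) :overlay {}})

(defn write-file
  "Writes go to the overlay only, so other branches never see them."
  [dir fname contents]
  (assoc-in dir [:overlay fname] contents))

(defn read-file
  "Reads check the overlay first and fall through to the shared base."
  [dir fname]
  (get (:overlay dir) fname (get (:base dir) fname)))

;; Usage: the branch sees the shared file plus its own write; main is unchanged.
(def main-dir (make-dir {"_0.seg" "doc-1"}))
(def experiment-dir (write-file (fork main-dir) "_1.seg" "doc-2"))

(sort (keys (all-files main-dir)))        ;; => ("_0.seg")
(sort (keys (all-files experiment-dir)))  ;; => ("_0.seg" "_1.seg")
```

In this model, forking costs the same no matter how large the file contents are, because only references are carried over. That matches the "copies metadata, not data" claim in spirit, though the real implementation works on Lucene segment files rather than in-memory maps.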

👀 10
🎉 7
Joe Lane 2026-02-09T15:54:58.685009Z

Is there any notion of domain-provided "T" so users could know they have ingested up to at least some transaction T basis?

whilo 2026-02-09T18:53:31.889339Z

Yes, I have put a demo here: https://github.com/replikativ/scriptum/blob/main/dev/scriptum/experiment/datahike_integration.clj. Datahike integration is not done yet, so this is a manual binding; it should be very similar for Datomic. I am happy to help with Datomic integration and with exploring and fixing any potential limitations. Full-text search has always been a bit of a pain since I started using Datomic/DataScript over ten years ago.

whilo 2026-02-09T18:59:44.168579Z

Maybe you mean the history index itself, though; that would require a deeper investigation. I can take a look into this, too.

Joe Lane 2026-02-09T19:01:18.393719Z

In my experience, sophisticated users of IR tools (anything Lucene-backed) always need to bring their own tokenizers and filters, which was never something Datomic supported (or planned to). Creating secondary indexes in user-space that can participate with database-as-a-value queries is a great direction to head towards as a community. Fun fact: I actually put a Lucene secondary index inside of Datomic Cloud as a Datomic Ion long before I ever joined the team, back when I was a customer. AFAIK that application is still humming along, performing IR queries w/ BM25 against Datomic EIDs returned from complex joins done with the datalog engine.

🚀 1
whilo 2026-02-09T19:05:08.558839Z

Yes, I also think that people want to directly use all the facilities of Lucene in many cases. No need to reinvent the wheel here. Scriptum is a fairly thin/minimal modification of the writers only. All Lucene features should still work as is. (If not then I would consider this a bug.)

whilo 2026-02-09T19:08:32.285229Z

Btw., this is a bit more niche, but I am also working on a persistent HNSW implementation https://github.com/replikativ/proximum/, which I plan to use for AI workloads/machine learning more directly. Lucene also has KNN search, but proximum is much faster in my benchmarks and stands alone independently of all the fulltext machinery. Maybe less interesting for you, but any feedback is appreciated.

🆒 1
Joe Lane 2026-02-09T19:45:20.836869Z

One important property of Datomic's Lucene indexes is that they scale beyond a single box (I believe; I didn't write them and it has been a minute since I looked at their impl). For Proximum, when building the index, is the intention to support data larger than a single box can hold? Lots of the other immutable DBs being created in Clojure are constrained to a single box for their indexes, which is why I'm inquiring. Hoping to catch you early enough in the design process that, if the answer is "No, not yet", you still have time to think about how to incorporate that property into the design :)

whilo 2026-02-09T19:46:41.049959Z

Yes, I want to go beyond single machine, but it would be cool to have pilots or ways to work on the pieces at that scale, because synthetic benchmarks only get you so far.

👍 1
Joe Lane 2026-02-09T19:46:58.383649Z

Also, looking at the repo, this has all sorts of things that I'm interested in like SIMD.

whilo 2026-02-09T19:47:16.544109Z

I am very sensitive to memory locality, so if sharding is going to happen it needs to minimize communication.

Joe Lane 2026-02-09T19:47:51.071729Z

Say more? Re: "memory locality", by which I assume you mean NUMA or page fragmentation?

whilo 2026-02-09T19:48:37.694179Z

I mean that if shards in an HNSW implementation need to search across machines, performance will collapse.

whilo 2026-02-09T19:51:39.232319Z

Right now I can mmap files and have a thin persistent data structure on top with array chunks that can do copy-on-write efficiently (following Clojure, based on https://github.com/replikativ/persistent-sorted-set [still talking to @tonsky on how to merge the modifications back to his original work]). HNSW is fundamentally non-local: you scan the graph neighborhood at multiple scales and can hit other nodes in the graph more or less at random (depending on insertion order). Top vector DBs such as Qdrant or JVector introduce quantization so that an approximate index fits fully in local memory, and then correct the results with the full-precision one.
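
The "array chunks that can do copy-on-write" idea above can be sketched roughly as follows. This is a minimal illustration, not Proximum's or persistent-sorted-set's actual code: the names `chunk-size`, `make-chunked`, `cget`, and `cset` are hypothetical, the chunk size is tiny for readability, and plain on-heap `long` arrays stand in for mmapped storage.

```clojure
;; Sketch of a chunked array with copy-on-write updates (hypothetical names).
;; A logical array is a persistent vector of fixed-size primitive chunks.
;; An update clones only the one chunk it touches; all others are shared.

(def chunk-size 4)

(defn make-chunked
  "Zero-filled logical array with room for at least n longs."
  [n]
  (let [n-chunks (quot (+ n chunk-size -1) chunk-size)]
    {:chunks (vec (repeatedly n-chunks #(long-array chunk-size)))}))

(defn cget
  "Read element i by locating its chunk and the offset inside it."
  [c i]
  (aget ^longs (nth (:chunks c) (quot i chunk-size)) (rem i chunk-size)))

(defn cset
  "Return a new logical array with element i set to v.
  Only the affected chunk is copied; the original is left untouched."
  [c i v]
  (let [ci (quot i chunk-size)
        ^longs fresh (aclone ^longs (nth (:chunks c) ci))]
    (aset fresh (rem i chunk-size) (long v))
    (assoc-in c [:chunks ci] fresh)))

;; Usage: a and b are both valid values; they share chunk 0 and differ in chunk 1.
(def a (make-chunked 8))
(def b (cset a 5 42))

(cget a 5) ;; => 0
(cget b 5) ;; => 42
(identical? (nth (:chunks a) 0) (nth (:chunks b) 0)) ;; => true
```

The cost of an update is one chunk copy plus a path copy in the top-level vector, so larger chunks trade more copying per write for better locality on reads.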

whilo 2026-02-09T19:52:04.595199Z

I have looked into this, but not explored quantization yet. I think it will be important at the >10M scale.
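
For readers unfamiliar with the quantize-then-correct approach mentioned above, here is a rough sketch. It makes several assumptions that are not from the discussion: vector components lie in [-1, 1], quantization is a simple scalar rounding to the range -127..127, similarity is a dot product, and the coarse pass is a brute-force scan rather than an HNSW graph walk. The names `quantize`, `dot`, and `two-phase-search` are hypothetical and do not reflect how Qdrant, JVector, or Proximum implement it.

```clojure
;; Two-phase nearest-neighbour search sketch (hypothetical names).
;; Phase 1 ranks all vectors using small quantized copies.
;; Phase 2 re-ranks only the best candidates with the full-precision vectors.

(defn quantize
  "Scalar-quantize a vector with components in [-1, 1] to integers in -127..127."
  [v]
  (mapv #(Math/round (* 127.0 (double %))) v))

(defn dot
  "Dot product of two equal-length vectors."
  [a b]
  (reduce + (map * a b)))

(defn two-phase-search
  "Return the indices of the k best matches for query.
  oversample controls how many coarse candidates reach the exact re-ranking."
  [vectors query k oversample]
  (let [quantized  (mapv quantize vectors) ; would be precomputed in practice
        q-query    (quantize query)
        candidates (->> (range (count vectors))
                        (sort-by #(- (dot (nth quantized %) q-query)))
                        (take (* k oversample)))]
    (->> candidates
         (sort-by #(- (dot (nth vectors %) query)))
         (take k)
         vec)))

;; Usage with three 2-d vectors.
(def vectors [[1.0 0.0] [0.0 1.0] [0.7 0.7]])

(two-phase-search vectors [1.0 0.1] 1 2) ;; => [0]
(two-phase-search vectors [1.0 0.1] 2 1) ;; => [0 2]
```

The point of the split is that only the small quantized copies need to stay in fast local memory, while the full-precision vectors are touched for a handful of candidates per query.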

Joe Lane 2026-02-09T19:52:40.111249Z

> >10M scale 10M what?

whilo 2026-02-09T19:53:15.675289Z

10 million vectors. This is just from internet research I did, not my own experiments. You need to get embedding datasets at that scale in the first place.

Joe Lane 2026-02-09T19:54:09.280849Z

Imagine you had a robust RDMA setup with placement groups set so you are on the same rack, would perf still "collapse"? What is the latency threshold for "collapse"?

whilo 2026-02-09T19:54:13.003979Z

I thought about maybe embedding Wikipedia and sharing it openly, would be kind of cool. Unfortunately the industry seems to be greedy and has not done this already. I need to check how Lucene does it internally when I index Wikipedia.

Joe Lane 2026-02-09T19:55:04.902129Z

If only I knew of an "AI First" company with a great relationship with AWS...

whilo 2026-02-09T19:55:52.093669Z

I don't have experience with RDMA yet, but proximum right now basically does direct Java array lookups, which are fastest if they are in L2 cache. It might still be fine if the remote memory latency is low enough to be similar to main memory.

whilo 2026-02-09T19:57:29.939909Z

My PhD supervisor, Frank Wood, has an AI company here in Vancouver, http://inverted.ai; he seems to like AWS (probably because he got some free compute). I mostly worked with scientific clusters and AWS for our data collection on https://plaicraft.ai/, but have not done machine learning on AWS yet.

whilo 2026-02-09T19:58:42.645069Z

Anyway, I would be happy to help with this, it requires direct experimentation and adjustments to the memory architecture.

whilo 2026-02-09T20:02:57.769489Z

I have done a small and quick Wikipedia demo here, but it will also depend on the embeddings you use, I think. https://github.com/replikativ/einbetten

whilo 2026-02-09T05:45:29.419849Z

This is planned to become a secondary index for #datahike; it could also benefit Datomic, XTDB, etc.