I've been playing around with creating MCP servers that can be shipped as native binaries, and at the same time I needed MCP support that works out of the box, so here's MCP2000XL: https://github.com/lukaszkorecki/mcp2000xl
• wraps the official MCP SDK
• but ignores its heavy transport layers
• perfect for creating stateless(ish) MCP servers with STDIO or HTTP transport
It started as a fork of Latacora's MCP SDK wrapper, but it very quickly spiraled out of control and I ended up removing most of it.
great name
Announcing Scriptum - Copy-on-Write Branching for Apache Lucene
We're releasing Scriptum, a library that brings value semantics to Lucene indices.
Key Features:
• Zero-cost forking - Branch any index in 3-5ms regardless of size (copies metadata, not data)
• Structural sharing - Branches share immutable Lucene segments via COW overlay directories
• Time travel - Open readers at any historical commit point
• Full Lucene 10.x - Text search, KNN vectors, facets - all branch-aware
• Yggdrasil integration - Implements Snapshotable, Branchable, Graphable, Mergeable protocols
(require '[scriptum.core :as sc])
(def writer (sc/create-index "/tmp/search-idx"))
(sc/add-doc writer {:title {:type :text :value "Hello World"}})
(sc/commit! writer "Initial commit")
;; Fork in milliseconds
(def experiment (sc/fork writer "experiment"))
(sc/add-doc experiment {:title {:type :text :value "Branch only"}})
(sc/commit! experiment "Experimental change")
;; Main unchanged, branch has new doc
(sc/search writer {:match-all {}} 10) ;; => 1 result
(sc/search experiment {:match-all {}} 10) ;; => 2 results
;; Merge back when ready
(sc/merge-from! writer experiment)
How it works:
Scriptum extends Lucene with four components: BranchedDirectory (COW overlay), BranchDeletionPolicy (retains all commits), BranchAwareMergePolicy (protects shared segments), and BranchIndexWriter (main API). See docs/LUCENE_EXTENSION.md for the technical deep-dive.
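To make the COW overlay idea concrete, here is a minimal sketch in plain Clojure. It is not Scriptum's implementation and does not touch Lucene: a "directory" is assumed to be just a map of file name to contents, and `fork`, `write-file`, `delete-file`, `read-file` and `list-files` are hypothetical names made up for this illustration. The point it shows is that forking copies only metadata, while reads fall through to the shared base.

```clojure
(require '[clojure.set :as set])

;; Hypothetical model: the base directory is an immutable map
;; of file name -> contents. A branch keeps a reference to that
;; base plus its own additions and deletions.

(defn fork
  "Create a branch over `base` without copying any file contents."
  [base]
  {:base base :added {} :deleted #{}})

(defn write-file
  "Record a new or changed file in the branch overlay only."
  [branch fname contents]
  (-> branch
      (assoc-in [:added fname] contents)
      (update :deleted disj fname)))

(defn delete-file
  "Hide a file in this branch; the base is left untouched."
  [branch fname]
  (-> branch
      (update :added dissoc fname)
      (update :deleted conj fname)))

(defn read-file
  "Look in the overlay first, then fall through to the shared base."
  [{:keys [base added deleted]} fname]
  (cond
    (contains? added fname)   (get added fname)
    (contains? deleted fname) nil
    :else                     (get base fname)))

(defn list-files
  "All file names visible from this branch."
  [{:keys [base added deleted]}]
  (-> (set (keys base))
      (set/difference deleted)
      (set/union (set (keys added)))))

;; Usage: the file names below are only illustrative.
(def main-dir {"_0.cfs" "segment-0" "segments_1" "commit-1"})

(def branch
  (-> (fork main-dir)
      (write-file "_1.cfs" "segment-1")
      (delete-file "segments_1")))
```

The branch sees the shared segment plus its own new one, while `main-dir` is unchanged. A real overlay also has to deal with commit points and with merges deleting segments that another branch still needs, which is what the deletion and merge policies above are described as handling.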
Install:
org.replikativ/scriptum {:mvn/version "0.1.11"}
Requirements: Java 21+, Lucene 10.3.2
Links:
• GitHub: https://github.com/replikativ/scriptum
• Technical docs: docs/LUCENE_EXTENSION.md
Feedback appreciated!
Is there any notion of domain-provided "T" so users could know they have ingested up to at least some transaction T basis?
Yes, I have put a demo here https://github.com/replikativ/scriptum/blob/main/dev/scriptum/experiment/datahike_integration.clj. Datahike integration is not done yet, so this is manual binding here, it should be very similar for Datomic. I am happy to help with Datomic integration and exploring and fixing any potential limitations. Full-text was always a bit of a pain since I started using Datomic/DataScript over ten years ago.
Maybe you mean history index itself though, this would require a deeper investigation. I can take a look into this, too.
In my experience, sophisticated users of IR tools (anything Lucene-backed) always need to bring their own tokenizers and filters, which was never something Datomic supported (or planned to). Creating secondary indexes in user-space that can participate with database-as-a-value queries is a great direction to head towards as a community. Fun fact, I actually put a Lucene secondary index inside of Datomic Cloud as a Datomic Ion long before I ever joined the team, back when I was a customer. AFAIK that application is still humming along, performing IR queries w/ BM25 against Datomic EIDs returned from complex joins done with the datalog engine.
Yes, I also think that people want to directly use all the facilities of Lucene in many cases. No need to reinvent the wheel here. Scriptum is a fairly thin/minimal modification of the writers only. All Lucene features should still work as is. (If not then I would consider this a bug.)
Btw., this is a bit more niche, but I am also working on a persistent HNSW implementation https://github.com/replikativ/proximum/, which I plan to use for AI workloads/machine learning more directly. Lucene also has KNN search, but proximum is much faster in my benchmarks and stands alone independently of all the fulltext machinery. Maybe less interesting for you, but any feedback is appreciated.
One important property of Datomic's lucene indexes is that they scale beyond a single box (I believe, I didn't write them and it has been a minute since I looked at their impl). For Proximum, when building the index, is the intention to support data larger than a single box can support? For lots of the other immutable dbs being created in clojure, they are often constrained to a single box for their indexes, which is why I'm inquiring. Hoping to catch you early enough in the design process that, if the answer is "No, not yet", you still have time to think about how to incorporate that property into the design :)
Yes, I want to go beyond single machine, but it would be cool to have pilots or ways to work on the pieces at that scale, because synthetic benchmarks only get you so far.
Also, looking at the repo, this has all sorts of things that I'm interested in like SIMD.
I am very sensitive to memory locality, so if sharding is going to happen it needs to minimize communication.
Say more? RE: "memory locality", which I assume you mean NUMA or page fragmentation?
I mean that if shards in an HNSW implementation need to search across machines, performance will collapse.
Right now I can mmap files and have a thin persistent data structure on top with array chunks that can do copy-on-write efficiently (following Clojure, based on https://github.com/replikativ/persistent-sorted-set [still talking to @tonsky on how to merge the modifications back to his original work]). HNSW is fundamentally non-local: you scan the graph neighborhood at multiple scales and can hit other nodes in the graph kind of at random (depending on insertion order). Top vector DBs such as Qdrant or JVector introduce quantization: they keep an approximate index fully in local memory and then correct the results with the full-precision one, so that the search itself stays in local memory.
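As a rough illustration of the "array chunks with copy-on-write" idea, here is a minimal sketch. It is not Proximum's or persistent-sorted-set's actual data structure: `chunk-size`, `make-chunked`, `cget` and `cset` are hypothetical names, and the chunks are plain heap arrays rather than mmapped files. An update copies only the one chunk it touches and shares all the others.

```clojure
;; Tiny chunk size for illustration; a real one would be much larger.
(def chunk-size 4)

(defn make-chunked
  "A persistent vector of fixed-size double arrays holding n zeros
   (rounded up to a whole number of chunks)."
  [n]
  (vec (repeatedly (quot (+ n chunk-size -1) chunk-size)
                   #(double-array chunk-size))))

(defn cget
  "Read element i: pick the chunk, then index into it."
  [chunks i]
  (aget ^doubles (nth chunks (quot i chunk-size)) (rem i chunk-size)))

(defn cset
  "Return a new chunked array with element i set to v.
   Only the touched chunk is cloned; all other chunks are shared."
  [chunks i v]
  (let [ci           (quot i chunk-size)
        ^doubles cpy (aclone ^doubles (nth chunks ci))]
    (aset cpy (rem i chunk-size) (double v))
    (assoc chunks ci cpy)))

;; Usage
(def v1 (make-chunked 10))
(def v2 (cset v1 5 42.0))
```

Here `v1` still reads `0.0` at index 5 and `v2` reads `42.0`, and the untouched chunks are the very same array objects in both versions. Reads stay plain array lookups, which is the property that matters for the cache-locality concerns in this thread.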
I have looked into this, but not explored quantization yet. I think it will be important at the >10M scale.
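For anyone following along, the quantize-then-rescore pattern mentioned above can be sketched like this. It is a toy under stated assumptions, not how Qdrant, JVector or Proximum do it: the coarse pass here is a brute-force scan rather than an HNSW traversal, components are assumed to lie in [-1, 1], and `quantize`, `sq-dist`, `build-index` and `search` are hypothetical names.

```clojure
(defn quantize
  "Scalar quantization: map each component in [-1, 1]
   to an integer in [-127, 127]."
  [v]
  (mapv #(Math/round (* 127.0 (double %))) v))

(defn sq-dist
  "Squared Euclidean distance between two equal-length vectors."
  [a b]
  (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))

(defn build-index
  "Precompute the quantized form of every [id vector] pair."
  [id-vectors]
  (mapv (fn [[id v]] {:id id :full v :quant (quantize v)}) id-vectors))

(defn search
  "Coarse pass: rank by quantized distance and keep n-candidates.
   Fine pass: re-rank those by full-precision distance, return top k ids."
  [index query k n-candidates]
  (let [q-query    (quantize query)
        candidates (->> index
                        (sort-by #(sq-dist (:quant %) q-query))
                        (take n-candidates))]
    (->> candidates
         (sort-by #(sq-dist (:full %) query))
         (take k)
         (mapv :id))))

;; Usage with three made-up 2-d vectors
(def index
  (build-index [[:a [0.9 0.1]]
                [:b [0.1 0.9]]
                [:c [0.8 0.2]]]))

(search index [1.0 0.0] 1 2) ;; => [:a]
```

The design point is that only the small quantized vectors are needed for the first pass, so they can stay in local memory, while the full-precision vectors are read only for the few surviving candidates.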
> >10M scale
10M what?
10 million vectors, this is just from internet research I did, not my own experiments. You need to get embedding datasets at that scale in the first place.
Imagine you had a robust RDMA setup with placement groups set so you are on the same rack, would perf still "collapse"? What is the latency threshold for "collapse"?
I thought about maybe embedding Wikipedia and sharing it openly, would be kind of cool. Unfortunately the industry seems to be greedy and has not done this already. I need to check how Lucene does it internally when I index Wikipedia.
If only I knew of an "AI First" company with a great relationship with AWS...
I don't have experience with RDMA yet, but proximum right now basically does direct Java array lookups which are fastest if they are in L2 cache. It might still be fine if memory latency is low enough and it is similar to main memory.
My PhD supervisor, Frank Wood, has an AI company here in Vancouver, http://inverted.ai, and he seems to like AWS (probably because he got some free compute). I mostly worked with scientific clusters and AWS for our data collection on https://plaicraft.ai/, but have not done machine learning on AWS yet.
Anyway, I would be happy to help with this, it requires direct experimentation and adjustments to the memory architecture.
I have done a small and quick Wikipedia demo here, but it will also depend on the embeddings you use, I think. https://github.com/replikativ/einbetten
This is planned to become a secondary index for #datahike, it could also benefit Datomic, XTDB, etc.