Would also love to get some feedback on the new homepage and framing: https://datahike.io/. I also added a "Try" section with TypeScript.
I like the techy, clean style, and Datahike in the browser is awesome, especially for a demo.
I'll be completely honest with a completely opposite perspective on the "techy clean style": there are a ton of AI-assisted writing clichés in this landing page (for example: a pithy phrase followed by an em dash, then adjective, adjective, adjective). The more of them I see, the faster my eyes glaze over. The generic phrasing increases the ratio of noise to signal in the writing.
I know from my prior familiarity with Datahike that these libraries all have extremely compelling value propositions for non-AI use cases, but the website being entirely framed around AI, RAG, and agents makes it even more buzzwordy. Maybe that's a cost of doing business these days, which is regrettable but not really under your control. Maybe serve an AI-friendly version of the site to AI agents scraping it and a more human-written one to humans reading it?
I think that brings me to the business case: when I see a project described with this much AI-assisted writing, I then inevitably wonder about the code itself. I can't see in any of the repos any disclosures about any of the code being generated with AI assistance. Whether it's fair or not, this makes me hesitant.
Regarding the system descriptions themselves, quoting from the FAQ:
> • Production-ready? - We'll share a concrete readiness checklist and help you evaluate quickly.
This reads like a "contact us for pricing" pitch. Does it belong in an open source project description?
> • Upgrades? - We optimize for explicit versioning and reproducible state transitions.
> • Security? - Offline-friendly, minimal dependencies, and transparent benchmarking methodology.
Sounds great! But without details or examples, it's hard to know whether this is a description of your desired state or something I can verify right now. I might be missing something, but I'm not entirely sure how your benchmarking methodology gives me assurances about security.
Speaking of security/compliance: the two big questions with "a database that remembers everything" are "how do I remove data that shouldn't be there?" and "how do I prevent people from seeing data that they shouldn't?" What happens when I try to purge data from a database that 50+ AI agents have a fork of? Does that change propagate quickly? How do I ensure that an AI agent only sees data that it should via a mechanism like row-level security in SQL? How can I quickly ensure permission updates roll out across a widely forked DB? A front-page answer to the questions of purging and access control would be useful. I know purging is in the Datahike README, but it's a cross-system question because of the history and branching model common to all of them. A lot of AI stuff is playing completely fast and loose with security and data privacy right now, attempting to duct tape access controls on to fundamentally insecure systems. A compelling differentiating factor for you can be putting a serious approach to data governance front and center.
Thank you @afoltzm for taking the time to give such detailed feedback! I have improved the writing on the website. Yes, I use LLMs: my writing in the past was not always very accessible. I just wrote a whole PhD thesis and it was a problem there. I tend to write densely and talk too much about very technical or abstract bits, and using LLMs helps me express my ideas in an easy-to-read common ground. I also use coding assistants now, because I cannot write as fast as they do and their quality has been improving considerably; I think this is now unavoidable. Sometimes I still program manually to sort out my own thoughts (e.g. for language/interface design), but it really depends on the context.
Datahike has had excision for a while. The new additional indices don't have explicit history indices, but work more like git (i.e. as-of is a true snapshot and not a filter on the history database). In this case the data is simply deleted when the snapshot is freed by gc (Proximum, Scriptum, and Stratum work this way; for the other external Yggdrasil backends I would have to check first. My understanding is that git's gc works like this, too). It is up to the user to define a cutoff date and delete branches; if the data is not in the newer snapshots, then it is gone. I plan to write a note on the website about this. Thank you for asking for it explicitly.
Datahike is and will remain an open-source project, but I will also have to figure out how to finance it, and enabling AI workflows is honestly a very compelling value proposition, not just for Datahike but for Clojure in general. In my mind it is maybe the best language for this; just very few people know it yet.
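To make the snapshot-based deletion described in this reply concrete, here is a minimal sketch in plain Clojure. It is not the Datahike or Yggdrasil API: the store layout, the `:time` and `:chunk-ids` keys, and the `gc` function are all hypothetical names chosen for illustration. It only shows the idea that once every snapshot older than a cutoff is dropped, any chunk not referenced by a surviving snapshot is gone.

```clojure
;; Hypothetical store layout, NOT the Datahike or Yggdrasil API:
;;   :chunks    map of chunk id -> data
;;   :snapshots vector of {:time t, :chunk-ids #{...}}
(defn gc
  "Drops snapshots older than `cutoff`, then drops every chunk that no
  surviving snapshot references."
  [{:keys [chunks snapshots]} cutoff]
  (let [kept (filterv #(>= (:time %) cutoff) snapshots)
        live (into #{} (mapcat :chunk-ids) kept)]
    {:snapshots kept
     :chunks    (select-keys chunks live)}))

(def store
  {:chunks    {:a "alice" :b "record to purge" :c "carol"}
   :snapshots [{:time 1 :chunk-ids #{:a :b}}    ; old snapshot still holds :b
               {:time 2 :chunk-ids #{:a :c}}]}) ; newer snapshot no longer does

(def collected (gc store 2))

(contains? (:chunks collected) :b) ; false: only the old snapshot referenced :b
```

In a forked setup the same reasoning would apply per branch: a chunk survives as long as any retained branch head or snapshot still references it, which is why deleting branches is part of the purge.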
https://datahike.io/stratum/: a columnar SQL engine for the JVM with git-like branching
We're releasing Stratum, a SIMD-accelerated columnar SQL engine built on the JVM, with copy-on-write branching semantics baked into the storage layer.
What it does:
• PostgreSQL wire protocol: connect with psql, DBeaver, JDBC, psycopg2
• Full analytical SQL: CTEs, window functions, correlated subqueries, PERCENTILE_CONT, APPROX_QUANTILE, full DML
• Fork a dataset in O(1) via structural sharing: no data copied, just a new root pointer
• Time-travel and named branch persistence via konserve
• Datasets implement IPersistentCollection / IEditableCollection, so tablecloth and tech.ml.dataset work directly
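For readers unfamiliar with structural sharing, the O(1) fork in the list above can be illustrated with plain Clojure data. This is a sketch of the general technique only, not Stratum's storage layer: the dataset layout (a root map of column name to a vector of chunks) and the `fork` and `update-chunk!` helpers are hypothetical.

```clojure
;; Hypothetical layout: a branch is an atom holding a root map of
;; column name -> vector of chunks. NOT Stratum's actual representation.
(defn fork
  "Creates a new branch by copying one root pointer; no chunk is copied."
  [branch]
  (atom @branch))

(defn update-chunk!
  "Replaces a single chunk on `branch`; all other chunks stay shared."
  [branch column chunk-idx f]
  (swap! branch update-in [column chunk-idx] f))

(def orders
  (atom {:price [[10.0 20.0] [30.0 40.0]]
         :qty   [[1 2] [3 4]]}))

(def experiment (fork orders))

;; Double the prices in the first chunk on the fork only.
(update-chunk! experiment :price 0 (fn [chunk] (mapv #(* 2 %) chunk)))

(get-in @orders [:price 0])      ; [10.0 20.0], original untouched
(get-in @experiment [:price 0])  ; [20.0 40.0]

;; Unchanged chunks are the very same objects in both branches.
(identical? (get-in @orders [:price 1])
            (get-in @experiment [:price 1])) ; true
```

Only the path from the root to the modified chunk is rebuilt, so the cost of a write is proportional to that path and not to the dataset size.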
Why faster than DuckDB on many queries:
Each column chunk carries pre-computed min, max, sum, and count statistics. Unfiltered aggregates like COUNT(*) or AVG(price) on 10M rows are answered from metadata, with no row data touched. Predicates and accumulation are also fused into a single SIMD loop via the Java Vector API, eliminating intermediate arrays and second passes.
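A rough sketch of both ideas in plain Clojure, under stated assumptions: the chunk layout and the function names (`make-chunk`, `count-all`, `avg-all`, `sum-where-greater`) are hypothetical and not Stratum's internals, and the scalar loop stands in for what the post describes as a SIMD loop over the Java Vector API.

```clojure
;; Hypothetical chunk layout: raw values plus pre-computed statistics.
(defn make-chunk [values]
  {:values (double-array values)
   :min    (apply min values)
   :max    (apply max values)
   :sum    (reduce + 0.0 values)
   :count  (count values)})

(defn count-all
  "COUNT(*) without a filter: adds up per-chunk counts, reads no rows."
  [chunks]
  (reduce + (map :count chunks)))

(defn avg-all
  "AVG(col) without a filter: total of per-chunk sums over total count."
  [chunks]
  (/ (reduce + (map :sum chunks)) (count-all chunks)))

(defn sum-where-greater
  "SUM(col) WHERE col > threshold in one pass: the predicate and the
  accumulation share a single loop, so no intermediate array of matches
  is built and no second pass is needed."
  [chunks threshold]
  (reduce
   (fn [total chunk]
     (let [^doubles values (:values chunk)]
       (loop [i 0, acc total]
         (if (< i (alength values))
           (let [v (aget values i)]
             (recur (inc i) (if (> v threshold) (+ acc v) acc)))
           acc))))
   0.0
   chunks))

(def chunks [(make-chunk [1.0 2.0 3.0]) (make-chunk [4.0 5.0])])

(count-all chunks)             ; 5
(avg-all chunks)               ; 3.0
(sum-where-greater chunks 2.5) ; 12.0
```

The unfiltered aggregates touch only one small map per chunk, which is why their cost depends on the number of chunks and not on the number of rows.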
Single-threaded, 10M rows, JDK 25, Intel Core Ultra 7:
Query                              Stratum    DuckDB     Speedup
TPC-H Q6 (filter + sum-product)    13 ms      28 ms      2.2x
H2O Q3 (100K string groups)        71 ms      362 ms     5.1x
H2O Q10 (10M groups, 6 cols)       832 ms     7056 ms    8.5x
LIKE '%search%'                    47 ms      240 ms     5.1x
Wins 35 of 46 benchmark queries. Full results: https://github.com/replikativ/stratum/blob/main/doc/benchmarks.md
The branching part:
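```clojure
(require '[clojure.core.async :refer [<!!]])  ; <!! blocks until sync! completes
;; `st` is the Stratum API alias; `orders` is an existing dataset, `store` a konserve store
(def experiment (st/fork orders))        ; O(1), shares all unchanged chunks
(<!! (st/sync! experiment store "exp"))  ; persist as named branch
(st/q "SELECT SUM(price*qty) FROM t" {"t" (st/columns orders)})
(st/q "SELECT SUM(price*qty) FROM t" {"t" (st/columns experiment)})
; compare results: original untouched
```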
Getting started:
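```clojure
{:deps {org.replikativ/stratum {:mvn/version "0.1.7"}}
 ;; deps.edn reads :jvm-opts from an alias; the alias name :stratum is arbitrary
 :aliases {:stratum {:jvm-opts ["--add-modules=jdk.incubator.vector"
                                "--enable-native-access=ALL-UNNAMED"]}}}
```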
Requires JDK 21+.
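A quick way to confirm from a REPL that the JVM meets these requirements is sketched below. It uses only standard JDK calls (`Runtime.version()` and `ModuleLayer.boot().findModule(...)`); the function names are made up for this example and are not part of Stratum.

```clojure
(defn jdk-feature-version
  "Major version of the running JDK, e.g. 21."
  []
  (.feature (Runtime/version)))

(defn vector-module-loaded?
  "True when the JVM was started with --add-modules=jdk.incubator.vector."
  []
  (.isPresent (.findModule (java.lang.ModuleLayer/boot)
                           "jdk.incubator.vector")))

(defn check-runtime
  "Summarises whether the JDK version and the Vector API module look right."
  []
  {:jdk-ok?    (>= (jdk-feature-version) 21)
   :vector-ok? (vector-module-loaded?)})

(check-runtime)
```

If `:vector-ok?` comes back false, the JVM options were most likely not picked up, for example because the REPL was started without the alias that carries them.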
Source: https://github.com/replikativ/stratum
Full write-up: https://datahike.io/notes/stratum-analytics-engine
Part of the replikativ ecosystem alongside Datahike, Proximum, Scriptum, and Yggdrasil: the same CoW branching model across Datalog, SQL, vector search, and full-text.
Would love feedback, especially on the branching use cases and the benchmark methodology.