datahike

whilo 2026-02-15T06:15:16.020179Z

https://github.com/replikativ/persistent-sorted-set 0.4.119 released Major release of our B-tree based persistent sorted set (org.replikativ/persistent-sorted-set). This is the backing data structure for Datahike's indices. Subtree Counts - count is now O(1) (was O(n)) — each node tracks its subtree size - New count-slice for O(log n) range counting: how many elements between A and B without scanning - Works across both CLJ and CLJS get-nth & Rank Statistics - O(log n) positional access into the sorted set via get-nth - Weighted navigation: nodes can carry custom weights (e.g. element count per leaf entry) for rank-based lookups - Enables efficient percentile/median queries over indexed data Measure Support - Pluggable IMeasure interface for user-defined aggregate metadata on tree nodes - Measures propagate up the tree and are maintained incrementally on conj/disj/replace - Supports lazy computation (try-compute-measure) and eager (force-compute-measure) modes - Use case: min/max/sum/count statistics per subtree for zone-map pruning ILeafProcessor Hook - New ILeafProcessor interface lets you post-process leaf entries after add/remove operations - Supports both compaction (merging small entries) and expansion (splitting large ones) - Generalized Branch.add to handle N children from processor output - modifiedIndex parameter for targeted O(1) processing instead of full leaf scans - Enabled on both persistent and transient paths Other Improvements - Namespace migrated from me.tonsky to org.replikativ - B-tree structural invariant tests and diagnostics API (tree-stats, validate-full, fill factor histograms) - compact operation for storage-backed sets - Transient corruption fix (shared _keys array between persistent and transient views) - Full CLJ + CLJS parity for all new features

jumraiya 2026-02-15T13:03:34.755839Z

Hi, I see that this is forked off Tonsky's persistent sorted set which I have been using in one of my projects. what is different from Tonsky's version?

whilo 2026-02-15T17:49:24.238429Z

Mostly what is in mentioned above, and additionally also the ability to mark freed addresses on the go which enables online gc such as https://github.com/replikativ/datahike/blob/main/doc/gc.md#online-garbage-collection-incremental-gc.

jumraiya 2026-02-15T18:38:17.745799Z

nice, thank you!

whilo 2026-02-15T06:17:15.123779Z

I will bump the Datahike dependency to this soon, it would be cool if people could do son their existing test databases and report to me whether there were any problems @sasha_bogdanov_dev @alekcz360 @hoertlehner or anybody who is able to help a bit with this :)

alekcz 2026-02-15T07:32:08.514229Z

@whilo I can do that next we weekend

👍 1
whilo 2026-02-15T08:01:07.167579Z

Perfect thanks!