data-science

whilo 2026-02-27T04:34:00.599849Z

I have built a highly optimized columnar query engine that can work with tech.ml.dataset and tablecloth: https://github.com/replikativ/stratum. I have written a Clay notebook (https://replikativ.github.io/stratum/stratum_intro.html), but this can probably still be communicated better. I would love to get some feedback to help me make this more accessible and useful to the Clojure data science community.

👀 2
whilo 2026-02-27T04:35:11.302539Z

If you are interested in the broader agenda, I have worked on the Datahike website to make it clearer https://datahike.io/, and am also discussing this in #datahike.

phronmophobic 2026-02-27T05:24:09.254599Z

I've been looking for a db where I can host https://github.com/phronmophobic/dewey data as a public dataset. I want to make the data as easy to query as possible. Ideally, I could store the data somewhere cheap like s3 and have a web interface that can run queries from the browser without having to download the whole dataset. Is this something that stratum could support?

whilo 2026-02-27T05:26:06.910189Z

Yes, it stores the persistent-sorted-set indices in konserve (same as Datahike and proximum), and konserve has these backends: https://github.com/replikativ/konserve?tab=readme-ov-file#available-external-backends. I haven't done this yet with stratum, but I am happy to help.

whilo 2026-02-27T05:27:57.720109Z

Stratum is good for numerical data that fits well in a columnar array layout. It can also deal with strings, but depending on your use case Datahike might be better suited. Datahike can also be stored in S3 alone; readers don't need any infrastructure besides storage access.

phronmophobic 2026-02-27T05:31:20.042389Z

The data are primarily strings, although there are a few numbers like line numbers.

phronmophobic 2026-02-27T05:31:55.156749Z

does datahike run in the browser?

whilo 2026-02-27T05:32:35.170979Z

Yes. The only current caveat is that you need to hold all data in memory for querying; it can be backed up by IndexedDB.

whilo 2026-02-27T05:34:18.784079Z

It can also be autosynced/replicated from a Datahike server (writer), which allows reactive apps. https://github.com/replikativ/datahike/tree/main/doc#clojurescript and https://github.com/replikativ/datahike/blob/main/doc/distributed.md#streaming-writer-kabel

whilo 2026-02-27T05:34:34.855119Z

But for your use case you may want something different.

phronmophobic 2026-02-27T05:35:16.719399Z

Ah, ok. Yea, I'm looking for something that can run queries and download only small parts of the dataset.

whilo 2026-02-27T05:36:24.223619Z

That will require an async query engine, which is doable, but has performance tradeoffs. The transaction logic and index are already both sync and async in cljs; it is mostly a matter of porting the query stack on top to support both.

whilo 2026-02-27T05:36:59.933879Z

Depending on how you want to query it, a simpler solution might also work.

whilo 2026-02-27T05:39:43.763589Z

Another alternative on AWS is Lambda functions; we did some experiments in this direction in the past. But they might be too expensive. Not sure.