I have built a highly optimized columnar query engine that works with tech.ml.dataset and tablecloth: https://github.com/replikativ/stratum. I have written a Clay notebook for it, but it could probably still be communicated better: https://replikativ.github.io/stratum/stratum_intro.html. I would love some feedback to help me make this more accessible and useful to the Clojure data science community.
If you are interested in the broader agenda, I have reworked the Datahike website to make it clearer (https://datahike.io/), and I am also discussing this in #datahike.
I've been looking for a db where I can host https://github.com/phronmophobic/dewey data as a public dataset. I want to make the data as easy to query as possible. Ideally, I could store the data somewhere cheap like S3 and have a web interface that can run queries from the browser without having to download the whole dataset. Is this something that Stratum could support?
Yes, Stratum stores its persistent-sorted-set indices in konserve (same as Datahike and Proximum), and konserve has these backends: https://github.com/replikativ/konserve?tab=readme-ov-file#available-external-backends. I haven't done this with Stratum yet, but I am happy to help.
Stratum is good for numerical data that fits well in a columnar array layout. It can also deal with strings, but depending on your use case, Datahike might be better suited. Datahike can also be stored in S3 alone; readers don't need any infrastructure besides storage access.
The data are primarily strings, although there are a few numbers like line numbers.
Does Datahike run in the browser?
Yes. The only current caveat is that you need to hold all data in memory for querying; it can be persisted to IndexedDB.
It can also be auto-synced/replicated from a Datahike server (the writer), which enables reactive apps: https://github.com/replikativ/datahike/tree/main/doc#clojurescript and https://github.com/replikativ/datahike/blob/main/doc/distributed.md#streaming-writer-kabel
But for your use case, you may want something different.
Ah, ok. Yeah, I'm looking for something that can run queries and download only small parts of the dataset.
That will require an async query engine, which is doable but has performance tradeoffs. The transaction logic and the index already support both sync and async in cljs; what remains is mostly porting the query stack on top to support both.
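To illustrate why partial downloads push the query engine toward async: here is a minimal sketch, in Python rather than cljs, of a lookup in a tree index whose nodes are stored as separate blobs in a remote store. Everything in it (`FakeRemoteStore`, `BLOBS`, `lookup`, the node layout) is hypothetical and made up for illustration; it is not the Stratum, Datahike, or konserve API. The point is only that each step down the tree is a network fetch that has to be awaited, and that a lookup touches just the nodes on one root-to-leaf path.

```python
import asyncio
from bisect import bisect_right


class FakeRemoteStore:
    """Hypothetical stand-in for an object store such as S3; records each fetch."""

    def __init__(self, blobs):
        self._blobs = blobs
        self.fetched = []

    async def get(self, key):
        await asyncio.sleep(0)  # placeholder for a network round trip
        self.fetched.append(key)
        return self._blobs[key]


# Assumed node layout: every node is stored under its own key.
# Branch nodes hold separator keys and child node keys; leaves hold entries.
BLOBS = {
    "root": {"keys": ["m"], "children": ["leaf-a", "leaf-b"]},
    "leaf-a": {"keys": ["apple", "kiwi"], "values": [1, 2]},
    "leaf-b": {"keys": ["mango", "pear"], "values": [3, 4]},
}


async def lookup(store, root_key, key):
    """Walk from the root to a leaf, fetching only the nodes on that path."""
    node = await store.get(root_key)
    while "children" in node:
        # Keys >= a separator live in the child to its right.
        idx = bisect_right(node["keys"], key)
        node = await store.get(node["children"][idx])
    if key in node["keys"]:
        return node["values"][node["keys"].index(key)]
    return None


if __name__ == "__main__":
    store = FakeRemoteStore(BLOBS)
    print(asyncio.run(lookup(store, "root", "mango")))
    print(store.fetched)  # only the root and one leaf were downloaded
```

A synchronous engine can assume every node is already in memory; once nodes arrive over the network on demand, every layer above the index has to be able to wait for them, which is where the porting effort and the performance tradeoff come from.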
Depending on how you want to query it, a simpler solution might also work.
Another alternative on AWS is Lambda functions; we ran a few experiments in this direction in the past. But they might be too expensive. Not sure.