data-science

pieterbreed 2026-05-29T07:50:58.336569Z

I have a query about the parallelism of dataset/sort-by-column and dataset/dataset. If this is off-topic, kindly point me in the right direction. I am using these fn's in a server backend that is also doing a lot of other "kafka things", so predictable resource consumption is important for this application. I've found that these fn's consume the CPU completely, which starves the other threads to the point of brokenness. Has anybody else encountered this behaviour as a problem? Is there any advice available for customizing this?

pieterbreed 2026-05-29T08:27:17.403909Z

I've been reading the code and think I was wrong about dataset/dataset. Also, it seems like sort-by-column (https://cnuernber.github.io/tmdjs/tech.v3.dataset.html#var-sort-by-column) has a :parallel? flag. I'll try this in the meantime.

Harold 2026-05-29T15:08:35.779399Z

Sounds interesting. If no one replies here, you can also try asking on the Zulip, which is another place for such discussions: https://clojurians.zulipchat.com/#narrow/channel/236259-tech.2Eml.2Edataset.2Edev

pieterbreed 2026-05-29T15:10:09.446289Z

I've confirmed that it works as expected: setting :parallel? to false (the default is true) does indeed serialize the sorting. It is documented in the docstring of those fn's, just not in the published docs (that I linked to). Thanks for the link to the Zulip.

👍 1
Harold 2026-05-29T15:12:31.661439Z

Without looking (and I could be wrong), my guess would be that the TMD parallelization bottoms out at the fork-join commonPool - if other code in the project does the same, they can cooperate, and make good use of available system resources.
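
If that guess holds (it is unverified here), the relevant knob is the JVM-wide fork-join common pool, whose size can be inspected at runtime and capped at startup with the standard system property `-Djava.util.concurrent.ForkJoinPool.common.parallelism=N`. Below is a minimal Java sketch using only `java.util.concurrent`. The class name `CommonPoolInfo` and its methods are hypothetical helpers for illustration, not part of TMD, and `Arrays.parallelSort` stands in for "some other code that also uses the common pool".

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;

// Hypothetical helper for illustration only; not part of TMD or any library.
class CommonPoolInfo {

    // Target parallelism of the JVM-wide fork-join common pool.
    // By default this is derived from the number of available processors;
    // it can be capped with
    //   -Djava.util.concurrent.ForkJoinPool.common.parallelism=N
    static int commonParallelism() {
        return ForkJoinPool.getCommonPoolParallelism();
    }

    // Arrays.parallelSort is documented to run its parallel tasks on the
    // common pool, so it competes for (and shares) the same worker threads
    // as any other common-pool user in the process.
    static long[] sortedCopy(long[] input) {
        long[] copy = Arrays.copyOf(input, input.length);
        Arrays.parallelSort(copy);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println("available processors:    "
                + Runtime.getRuntime().availableProcessors());
        System.out.println("common pool parallelism: " + commonParallelism());
        System.out.println("sorted: "
                + Arrays.toString(sortedCopy(new long[] {3L, 1L, 2L})));
    }
}
```

Capping the common pool affects every library in the process that uses it, so it is a coarser tool than passing :parallel? false to a single call.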