#data-science
2022-11-01
SK18:11:51

Is there a framework or tool that allows breaking down my data processing pipeline into small tasks, that are run in parallel? And the framework/tool knows what are the dependencies between these small tasks and allows passing data between the tasks (hopefully with additional caching). Plus there is some management interface that shows me how much time the tasks use on average, estimated time to completion.. And the framework/tool is implemented or can be interfaced with Clojure 🙂

SK19:11:33

Currently I'm using a bunch of pmaps sprinkled here and there, but it's not very manageable
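
For illustration, a sketch of what that ad-hoc pmap style can look like, with a made-up `expensive-step`. Each pmap sizes its parallelism from the core count, so nesting compounds the in-flight work, and the dependency structure lives only in the nesting:

```
;; Hypothetical per-item work
(defn expensive-step [item]
  (Thread/sleep 100)
  (inc item))

(defn process-batch [batch]
  (pmap expensive-step batch))   ;; inner pmap

(defn process-all [batches]
  (pmap process-batch batches))  ;; outer pmap: parallelism compounds

;; Force the nested lazy seqs so the work actually runs
(doall (map doall (process-all [[1 2 3] [4 5 6]])))
```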

metasoarous19:11:37

Oz's "notebook mode" does this!

metasoarous19:11:22

It evaluates the code for dependencies between var forms and tries to run things in parallel when it can. It also displays elapsed execution time in the result blocks in the live view.
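
For context, a hypothetical notebook file of the kind this mode consumes (the file name and vars are made up): `char-count` and `line-count` each depend only on `raw-data`, so they can be evaluated in parallel, while `summary` must wait for both. Notebook mode is started with something like `(oz/live-reload! "my/notebook.clj")`.

```
(ns my.notebook
  (:require [clojure.string :as str]))

(def raw-data                   ;; hypothetical input file
  (slurp "data.csv"))

(def char-count                 ;; depends only on raw-data
  (count raw-data))

(def line-count                 ;; also depends only on raw-data
  (count (str/split-lines raw-data)))

(def summary                    ;; depends on both, so runs last
  {:chars char-count :lines line-count})
```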

otfrom19:11:30

Cool

😊 1
metasoarous20:11:11

Note that this is available in the 2.0.0-alphaX releases, which have some bugs with static website building and misc other things, but the notebook functionality is fairly solid.

metasoarous20:11:33

It's been taking me much longer than hoped to wrap that work up, but I hope to have more time for this as the winter weather sets in here.

SK22:11:09

Thanks, I'll check it out. Though I'm looking for something more programmatic than a notebook. E.g. I'd like to submit tasks in code dynamically, have control over the dependencies between tasks, etc.

👍 1
Rupert (All Street)10:11:40

@U01UPFK1M29
> pmaps sprinkled here and there, but it's not very manageable
• Since pmap is automatically configured relative to the number of cores, try to have only one pmap at a time and let that pmap fully utilise your machine (i.e. be careful not to nest pmaps inside of pmaps inside of pmaps).
• https://github.com/clj-commons/claypoole is fantastic for giving the ease of pmap with a bit more control. Make sure you use the lazy namespace versions (a minimal sketch follows this message).
> Is there a framework or tool that allows breaking down my data processing pipeline into small tasks
• I would usually recommend integrant for this. It is single threaded, but you can use futures and claypoole/upmap to run steps in parallel with each other, and you can visualise the graph (see the integrant sketch below).
  ◦ At https://sevva.ai/ we process huge datasets every day with integrant.
• Alternatively, https://www.lambda.cd/ (an open-source Clojure tool) is pretty good at this and also provides nice UIs for starting and stopping jobs.
> I'd like to submit tasks in code dynamically
• Check whether you really need this: it might be better to define your pipeline declaratively in a config file (e.g. with integrant) than to build it dynamically in memory. Graph structures that change in memory are harder to reason about (e.g. we don't normally change production code in memory).
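
Two hedged sketches of the suggestions above. First, claypoole with one explicitly sized pool, using the lazy namespace as recommended; `fetch` and `transform` are made-up step functions:

```
(require '[com.climate.claypoole :as cp]
         '[com.climate.claypoole.lazy :as lazy])

(defn fetch [id] {:id id})                   ;; hypothetical I/O-bound step
(defn transform [row] (assoc row :ok true))  ;; hypothetical CPU-bound step

;; One explicitly sized pool for the whole pipeline, instead of
;; nested pmaps each assuming they own the machine.
(cp/with-shutdown! [pool (cp/threadpool 8)]
  (->> (range 100)
       (lazy/upmap pool fetch)     ;; unordered: results as they finish
       (lazy/pmap pool transform)  ;; lazy: realizes only what's consumed
       (into [])))                 ;; realize before the pool shuts down
```

Second, a minimal integrant sketch of the "futures for parallel steps" idea, with made-up keys and step functions. The two sources load concurrently because their init-keys return futures immediately, and integrant resolves the evaluation order from the declared refs:

```
(require '[integrant.core :as ig])

(defn read-rows [path] [path])       ;; hypothetical reader
(defn join-rows [a b] (concat a b))  ;; hypothetical join

(def config
  {::source-a {:path "a.csv"}
   ::source-b {:path "b.csv"}
   ::joined   {:a (ig/ref ::source-a)   ;; declared dependencies
               :b (ig/ref ::source-b)}})

;; Returning a future lets the next key start while this one works.
(defmethod ig/init-key ::source-a [_ {:keys [path]}]
  (future (read-rows path)))

(defmethod ig/init-key ::source-b [_ {:keys [path]}]
  (future (read-rows path)))

(defmethod ig/init-key ::joined [_ {:keys [a b]}]
  ;; both sources are loading concurrently; deref waits for each
  (join-rows @a @b))

(ig/init config)  ;; integrant works out the dependency order
```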

metasoarous17:11:53

My goal is for the parts of Oz that handle notebook evaluation to be sufficiently abstracted for programmatic use. But it may be a while longer before that's the case.