2021-09-29
I am developing what is, in my view, an interesting approach to a pure functional way of evaluating / selecting ML models.
The code is here:
https://github.com/scicloj/metamorph.ml/blob/main/src/scicloj/metamorph/ml.clj#L121
and a tutorial here:
https://scicloj.github.io/scicloj.ml-tutorials/userguide-advanced.html
in chapter "Pipeline evaluation and selection".
At this point in time the code is "working for me and my use cases", but I would like to get a fresh view on it.
Especially regarding two questions about the existing function:
1. Is it generic enough to cover the most important forms of model evaluation and model selection?
2. Are the usability / performance trade-offs and the control over memory usage acceptable?
There is quite a lot of tutorial material available here: https://github.com/scicloj/scicloj.ml
If somebody here wants to take a look at it, I would be very grateful.
I'll tackle some ML tasks in the coming months, but I'm not there yet. I'll gladly give it a spin though. Thanks for your work and the user guide, very intriguing stuff! FWIW you'll have a bigger audience in the Zulip channel.
Thanks, I am usually on Zulip; I just tried here as well. It seems that the ML-in-Clojure community is a tiny group of people.
I count a lot on interop as well. For this you might be interested in: https://github.com/behrica/clj-py-r-template This allows you to "start in Clojure" and continue in Python / R if certain ML features are missing.
You need Docker, though.
Regarding point 2: the function needs to balance/support two very different use cases:
1. Exploratory evaluation of a few models and a few pipelines (= sets of hyperparameters) -> this requires returning all possible information on the model evaluation process, including the data passed into every model, the model instances and so on (not just the evaluation metrics).
2. Systematic searches for the best model across a large number of hyperparameter combinations and cross-validations -> this will not work if the function returns all evaluation information as in case 1. We could easily have thousands or hundreds of thousands of model evaluations, so returning full data for each would quickly exhaust the heap. Ideally the heap space used would be only a single float per model evaluation (= the performance metric). Even that is not really needed, as we only have to keep track of the "best" so far, but this becomes awkward to do in a functional style (see the sketch below). Lazy evaluation is not an option either, as model training is slow and can take hours in the case of deep learning models.
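To make the "keep only the best" point concrete: a constant-memory selection can be written as a plain reduce. This is only a sketch; `train-and-eval` is a hypothetical stand-in for one full pipeline evaluation, not a metamorph.ml function.

```clojure
;; Sketch: constant-memory model selection via reduce.
;; `train-and-eval` (hypothetical) runs one full pipeline evaluation
;; and returns a map like {:pipeline-spec ... :metric 0.87}.
(defn select-best
  "Evaluates every pipeline spec but retains only the best result,
   so heap usage stays bounded by a single evaluation."
  [train-and-eval pipeline-specs]
  (reduce (fn [best spec]
            (let [result (train-and-eval spec)] ; slow step: trains a model
              (if (or (nil? best)
                      (> (:metric result) (:metric best)))
                result
                best)))
          nil
          pipeline-specs))
```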
My proposed solution is to make it configurable, largely via function parameters, exactly which information the function returns per model evaluation. This allows the user to balance between "few evaluations with lots of (debug) information per evaluation" and "lots of models with very little information per evaluation".
Is this reasonable to do?
A single model evaluation is a rather complex map with a fixed structure, and the user can basically configure which parts of that map should be removed for each model evaluation. Additionally there are two boolean flags which decide whether only the best models (per cross-validation / per hyperparameter combination) get returned. A rough sketch of this trimming is below.
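As an illustration only (the helper names and option paths here are hypothetical, not the library's confirmed API), stripping configured paths from a nested result map could look like this:

```clojure
(defn dissoc-in
  "Removes the value at `path` in a nested map (not in clojure.core)."
  [m path]
  (if (next path)
    (update-in m (butlast path) dissoc (last path))
    (dissoc m (first path))))

(defn trim-result
  "Strips the configured paths from one evaluation-result map."
  [result paths-to-remove]
  (reduce dissoc-in result paths-to-remove))

;; Hypothetical usage, e.g. for a large search where only metrics matter:
;; (trim-result evaluation-result [[:fit-ctx] [:train-transform :data]])
```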
Another very interesting "use case" I could add to the function would be a kind of "write all to disk" mode. A "model evaluation result" is currently just a map, which I could serialize to disk after each evaluation. In this way the function would use zero heap space (better said: only the heap space needed to evaluate a single model), since after writing a result to disk we could discard it completely.
It would also be more suitable for "running for days", as a crash of the JVM would not result in losing all evaluation results.
This would require some additional functions to "analyse" what was written to disk and do the model selection that way, as in the sketch below.
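A minimal sketch of that idea, assuming each result map is EDN-serializable (real model objects would likely need a binary serializer such as nippy); `train-and-eval` is again a hypothetical stand-in:

```clojure
(require '[clojure.java.io :as io]
         '[clojure.edn :as edn])

(defn eval-and-spill!
  "Runs one evaluation, writes the full result to disk, and keeps
   only the metric on the heap."
  [train-and-eval spec out-dir idx]
  (let [result (train-and-eval spec)
        file   (io/file out-dir (str "eval-" idx ".edn"))]
    (spit file (pr-str result)) ; full result goes to disk
    (:metric result)))          ; only a number stays in memory

(defn read-results
  "Later analysis step: lazily reads the stored results back."
  [out-dir]
  (->> (file-seq (io/file out-dir))
       (filter #(.isFile %))
       (map (comp edn/read-string slurp))))
```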