'http://metamorph.ml' has the namespace of scicloj.metamorph.ml.design-matrix which is supposed to address this.
see tests:
https://github.com/scicloj/metamorph.ml/blob/main/test/scicloj/metamorph/design_matrix_test.clj
It is clearly not the "precise same syntax", as Clojure has no native support for formulae the way R has (what Julia's @formula feature is, I don't know).
To me, "more compact" is impossible in Clojure, given Clojure syntax (unless going for a String-based implementation, which we clearly did not want to embark on, as this would have meant writing a complex parser for a "proper language").
In R, formulae are a base language feature, for which Clojure has no equivalent. I suggest trying `design-matrix`; we are open to enhancement requests. I think it can handle "automatically" what you ask for.
Hello Carsten, yes that is exactly what I meant. It is clearly possible. The urls I provided document it if someone is interested in doing so.
Yes, but not in Java / Clojure, no? You are pointing to R packages.
Just providing the urls to R packages if it is desired to understand the dynamic way that formulas in R can convey expression, including terms, interactions, groupings, hierarchies, and mgcv style functions (see https://cran.r-project.org/web/packages/mgcv/index.html also)
Think of it as a standardized "syntax", often called lme4 formula syntax, that has become a bit of a standard as a DSL. Python bambi also adopts this standard. It is akin to the way SQL has been adopted to query many different backend databases, the formula syntax that originated in R and was extended into the R lme4 package, is now supported in other R packages like brms, but also non-R packages too. If this info is not helpful, I'll pipe down ;)
Not sure what you mean. It is clearly possible to develop in Clojure a "parser" for a string-based DSL, which in the end does something similar to the R formulae in linear regression. But as far as I know, nobody has done this so far. There is something similar for Python as well: https://patsy.readthedocs.io/en/latest/index.html or Java: https://haifengl.github.io/api/java/smile/data/formula/Formula.html (GPL licensed)
Base R formulas (see https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html ) and the lme4/brms extensions (see https://paulbuerkner.com/brms/reference/brmsformula.html ) can likely be parsed from a text string and added to the Clojure ecosystem as desired.
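To make the "parse it from a text string" idea concrete, here is a minimal sketch in plain Clojure. `parse-formula` is a hypothetical helper, not part of metamorph.ml or any existing library, and it assumes a tiny subset of the R syntax: column names, `a:b` interactions, and two-way `a*b` crossings. Grouping terms, functions and everything lme4/brms add are out of scope.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper (an assumption for illustration, not an existing API).
;; Supports "response ~ term + term", where a term is a column name,
;; an interaction "a:b", or a two-way crossing "a*b" (expanded to a + b + a:b).
(defn parse-formula [formula]
  (let [[lhs rhs] (map str/trim (str/split formula #"~" 2))
        expand    (fn [term]
                    (let [vars (map str/trim (str/split term #"\*"))]
                      (if (= 1 (count vars))
                        [term]
                        ;; only two-way crossings are handled in this sketch
                        (let [[a b] vars]
                          [a b (str a ":" b)]))))
        terms     (->> (str/split rhs #"\+")
                       (map str/trim)
                       (mapcat expand)
                       distinct)]
    {:response (keyword lhs)
     ;; each term becomes a vector of the columns it multiplies together
     :terms    (mapv (fn [t] (mapv (comp keyword str/trim) (str/split t #":")))
                     terms)}))

(parse-formula "mpg ~ cyl * disp")
;; => {:response :mpg, :terms [[:cyl] [:disp] [:cyl :disp]]}
```

The resulting data structure could then be translated into whatever column specs a design-matrix function expects; that translation step is not shown here.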
I think @daslu has used this for linear regression and there is a tutorial somewhere.
Thanks for the answer. I'm guessing it's this one: https://scicloj.github.io/noj/noj_book.interactions_ols.html? The guide uses the design-matrix function:
(def dm
  (dm/create-design-matrix
   preprocessed-data
   [:sales] ;; target (response) column
   [[:youtube '(identity :youtube)] ;; youtube stays as-is
    [:facebook '(identity :facebook)] ;; facebook stays as-is
    [:youtube*facebook '(* :youtube :facebook)]])) ;; new interaction term is created
I'll try to work through it!
It is also integrated in the plotting library 'tableplot': https://github.com/scicloj/tableplot/blob/cfdb497909a66e80c6829cff22b6ea9bcd9e38f5/notebooks/tableplot_book/plotly_reference.clj#L1094
We automatically transform categorical variables to "numbers", but not in one-hot fashion; not sure if this makes a difference for your use case. I do remember that while developing the feature we wondered whether a "dataset" needs an agreed way to mark the statistical data type of columns: https://en.wikipedia.org/wiki/Statistical_data_type R has this as well, we don't have it (apart from "categorical" yes/no, which might not be enough for certain things to happen automatically).
There are also some opinionated experiments in the experimental Tablemath library: https://scicloj.github.io/tablemath/tablemath_book.reference.html https://github.com/scicloj/tablemath Maybe you'll find its source useful.
also this notebook: https://scicloj.github.io/noj/noj_book.linear_regression_intro.html
First of all: thank you both for help and links.
So, I have some cell-health data. I want to predict vb_percent_live (cont. variable) by two categorical variables (`treatment` with three groups, cell_line with three groups).
I can one-hot encode by either taking:
(def cell-data (-> (tc/dataset "data/cell_health_part1.tsv")
                   (tc/rename-columns keyword)))
(def treatment-one-hot
  (ds-cat/fit-one-hot cell-data :treatment ["Control" "PSMA1" "ORC4"]))
(def cell-line-one-hot
  (ds-cat/fit-one-hot cell-data :cell_line ["A549" "HCC44" "ES2"]))
(def cell-data-one-hot
  (-> cell-data
      (ds-cat/transform-one-hot treatment-one-hot)
      (ds-cat/transform-one-hot cell-line-one-hot)))
which gives me something like:
| :vb_percent_live | :treatment-Control | :treatment-PSMA1 | :treatment-ORC4 | :cell_line-A549 | :cell_line-HCC44 | :cell_line-ES2 |
|-----------------:|-------------------:|-----------------:|----------------:|----------------:|-----------------:|---------------:|
| 0.24570297 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0.08859482 | 1 | 0 | 0 | 0 | 0 | 1 |
| -0.08032597 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0.16655826 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0.02126276 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0.09568241 | 1 | 0 | 0 | 0 | 0 | 1 |
| -0.01535644 | 1 | 0 | 0 | 0 | 0 | 1 |
| -0.14647677 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0.07914471 | 1 | 0 | 0 | 0 | 0 | 1 |
Alternatively, I can use tm/design:
(tm/design
 cell-data
 [:vb_percent_live]
 ['(tm/one-hot treatment)
  '(tm/one-hot cell_line)])
which gives me something like:
| :vb_percent_live | :treatment=ORC4 | :treatment=PSMA1 | :cell_line=A549 | :cell_line=HCC44 |
|-----------------:|----------------:|-----------------:|----------------:|-----------------:|
| 0.24570297 | 0 | 0 | 0 | 0 |
| 0.08859482 | 0 | 0 | 0 | 0 |
| -0.08032597 | 0 | 0 | 0 | 0 |
| 0.16655826 | 0 | 0 | 0 | 0 |
| 0.02126276 | 0 | 0 | 0 | 0 |
| 0.09568241 | 0 | 0 | 0 | 0 |
| -0.01535644 | 0 | 0 | 0 | 0 |
| -0.14647677 | 0 | 0 | 0 | 0 |
| 0.07914471 | 0 | 0 | 0 | 0 |
So - in the first method, I'd need to drop the reference group column for each variable before doing the regression.
I can't figure out how to do a regression with interaction terms after one-hot encoding, without having to specify each treatment-cell_line pair individually. I can only find continuous variable by continuous variable interactions in the notebooks.
Thanks for looking into this. Tablemath has been a draft that was never continued much, and this can be an opportunity to think about it again. I hope to look soon, but I am unsure when exactly I'll get to that.
I am also too little of a statistician to really figure out what to improve. It is clear that the current Clojure tooling around "http://metamorph.ml" first assumes that a user creates the "numeric design matrix" by himself. That is a "big difference / burden", I guess, compared to R and its built-in formula language, which I guess was designed as a language feature to solve exactly the type of issue you are facing. With the design-matrix ns, we make an effort at "automating this".
But very likely we only support a "subset" of the needed functionality. What could help, maybe, is to see from a use case:
• a real dataset
• the real formula you want to do regression with
• the R output of "model.matrix"
I suggest discussing this in more detail in the form of an "issue" in , so we have all the information together.
Please open one, and we will see if we can get further by possibly improving which "things" the design-matrix function can handle.
In case you cannot share your data, we can of course use any available open dataset and look at which forms of regression we don't support easily.
FYI: metamorph.ml.rdatasets gives access to all datasets here:
https://vincentarelbundock.github.io/Rdatasets/articles/data.html
Just to see how compact "R" is, but showing as well that we can do the same in Clojure:
; R: model_matrix(mtcars, mpg ~ cyl * disp)
; Clojure:
(-> (dm/create-design-matrix
     mtcars
     []
     [[:cyl '(clojure.core/identity :cyl)]
      [:disp '(clojure.core/identity :disp)]
      [:cyl:disp '(* :cyl :disp)]])
    (tc/add-column :intercept [1] :cycle)
    (tc/drop-columns [:mpg]))
1 line vs 10 lines. But they produce an identical result.
@sorenpost we are discussing some directions on Zulip, in the topic 'shorter syntax of "columns as-is" is ml/design-matrix': https://clojurians.zulipchat.com/#narrow/channel/321125-noj-dev/topic/shorter.20syntax.20of.20.22columns.20as-is.22.20is.20ml.2Fdesign-matrix/with/585052769
I now understand the basic limitation of the design-matrix function you refer to
It has no support whatsoever for interactions of categorical variables, in the sense of "automatically expanding all levels of interacting categorical vars and auto-creating columns for them".
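For reference, the kind of automatic expansion meant here can be sketched in plain Clojure on row maps. The helpers `dummy-columns`, `interaction-columns` and `design-row` are hypothetical, not part of design-matrix or Tablemath. The sketch assumes treatment coding with the first listed level as reference (Control and ES2, matching the tm/design output shown earlier) and the `name=level` column naming used there.

```clojure
;; Hypothetical helpers on plain row maps (assumptions for illustration only).

(defn dummy-columns
  "0/1 indicators for column k, skipping the first (reference) level."
  [row k levels]
  (into {}
        (for [level (rest levels)]
          [(keyword (str (name k) "=" level))
           (if (= level (get row k)) 1 0)])))

(defn interaction-columns
  "Products of every pair of indicator columns taken from two dummy maps."
  [dummies-a dummies-b]
  (into {}
        (for [[ka va] dummies-a
              [kb vb] dummies-b]
          [(keyword (str (name ka) ":" (name kb))) (* va vb)])))

(defn design-row
  "Intercept, main effects and all treatment x cell_line interaction columns."
  [row]
  (let [t (dummy-columns row :treatment ["Control" "PSMA1" "ORC4"])
        c (dummy-columns row :cell_line ["ES2" "A549" "HCC44"])]
    (merge {:intercept 1} t c (interaction-columns t c))))

(def example-row
  (design-row {:treatment "PSMA1" :cell_line "A549" :vb_percent_live 0.1}))

(get example-row (keyword "treatment=PSMA1:cell_line=A549"))
;; => 1
```

With three levels per variable this yields 1 intercept, 2 + 2 main-effect columns and 2 x 2 interaction columns, so no treatment-cell_line pair has to be written out by hand.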
I've added my 2c on Zulip regarding design-matrix. The main issue I see with this approach is that you create such a matrix before regression, making predictions much harder. lm in fastmath has a transformer parameter, which accepts a function translating original data into a design matrix row automatically before regression and before prediction.
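The transformer idea can be illustrated without calling fastmath at all. The sketch below only shows the principle: one function from a raw row to a numeric design-matrix row, reused unchanged for training and for prediction. `make-transformer` and the sample rows are hypothetical, and the exact shape of input that fastmath's transformer receives is not assumed here.

```clojure
;; Hypothetical sketch; it does not call fastmath.
(defn make-transformer
  "Returns a function from a raw row map to a numeric design-matrix row.
  `levels` is an ordered seq of [column ordered-levels] pairs; the first
  level of each column is the reference and gets no indicator."
  [levels]
  (fn [row]
    (vec (for [[col lvls] levels
               level      (rest lvls)]
           (if (= level (get row col)) 1.0 0.0)))))

(def transform
  (make-transformer [[:treatment ["Control" "PSMA1" "ORC4"]]
                     [:cell_line ["ES2" "A549" "HCC44"]]]))

;; Made-up rows, for illustration only.
(def training-rows
  [{:treatment "Control" :cell_line "ES2"}
   {:treatment "PSMA1"   :cell_line "A549"}])

;; The same function encodes the training rows...
(def train-xs (mapv transform training-rows))

;; ...and any unseen row at prediction time, with identical column order.
(transform {:treatment "ORC4" :cell_line "HCC44"})
;; => [0.0 1.0 0.0 1.0]
```

Because the level order is fixed when the transformer is built, a new row is always encoded into the same columns the model was fitted on, which is the difficulty with building the matrix once up front.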