data-science

Søren 2026-04-11T09:18:31.495599Z

2026-04-11T10:28:32.196769Z

metamorph.ml has the namespace scicloj.metamorph.ml.design-matrix, which is supposed to address this. See the tests: https://github.com/scicloj/metamorph.ml/blob/main/test/scicloj/metamorph/design_matrix_test.clj

2026-04-11T10:32:22.464399Z

It is clearly not the "precise same syntax", as Clojure has no native support for formulae the way R has (I don't know what Julia's @formula feature is). To me, "more compact" is impossible in Clojure, given Clojure syntax, unless we go for a string-based implementation, which we clearly did not want to embark on, as this would have meant writing a complex parser for a "proper language".

2026-04-11T10:38:19.781349Z

In R, "formulae" are a base language feature, for which Clojure has no equivalent. I suggest trying `design-matrix`, and we are open to enhancement requests. I think it can handle "automatically" what you ask for.

2026-04-21T11:37:28.365019Z

Hello Carsten, yes that is exactly what I meant. It is clearly possible. The urls I provided document it if someone is interested in doing so.

2026-04-21T17:38:38.799569Z

Yes, but not in Java / Clojure, no? You are pointing to R packages.

2026-04-21T18:26:11.010499Z

Just providing the urls to R packages if it is desired to understand the dynamic way that formulas in R can convey expression, including terms, interactions, groupings, hierarchies, and mgcv style functions (see https://cran.r-project.org/web/packages/mgcv/index.html also)

2026-04-21T18:39:19.245969Z

Think of it as a standardized "syntax", often called lme4 formula syntax, that has become a bit of a standard as a DSL. Python's bambi also adopts this standard. It is akin to the way SQL has been adopted to query many different backend databases: the formula syntax that originated in R and was extended in the R lme4 package is now supported in other R packages like brms, and in non-R packages too. If this info is not helpful, I'll pipe down ;)

2026-04-20T19:51:09.596789Z

Not sure what you mean. It is clearly possible to develop a "parser" in Clojure for a string-based DSL, which in the end does something similar to the R formulae in linear regression. But as far as I know, nobody has done this so far. There is something similar for Python as well: https://patsy.readthedocs.io/en/latest/index.html or Java: https://haifengl.github.io/api/java/smile/data/formula/Formula.html (GPL licenced)

2026-04-19T14:31:32.402909Z

Base R formula syntax (see https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html ) and its lme4 / brms extensions (see https://paulbuerkner.com/brms/reference/brmsformula.html ) can likely be parsed from a text string and added to the Clojure ecosystem as desired.

2026-04-11T10:40:20.701339Z

I think @daslu has used this for linear regression and there is a tutorial somewhere.

Søren 2026-04-11T10:45:08.053769Z

Thanks for the answer. I'm guessing it's this one: https://scicloj.github.io/noj/noj_book.interactions_ols.html? The guide uses the design-matrix function:

(def dm
  (dm/create-design-matrix
   preprocessed-data
   [:sales]                                         ;; target (response) column
   [
    [:youtube '(identity :youtube)]                  ;; youtube stays as-is
    [:facebook '(identity :facebook)]                ;; facebook stays as-is
    [:youtube*facebook '(* :youtube :facebook)]       ;; new interaction term is created
    ]))
I'll try to work through it!

2026-04-11T10:48:37.601509Z

It is also integrated into the plotting library 'tableplot': https://github.com/scicloj/tableplot/blob/cfdb497909a66e80c6829cff22b6ea9bcd9e38f5/notebooks/tableplot_book/plotly_reference.clj#L1094

2026-04-11T10:56:06.452189Z

We automatically transform categorical variables to "numbers", but not in one-hot fashion; not sure if this makes a difference for your use case. I do remember that, while developing the feature, we wondered whether "dataset" needs an agreed way to mark the statistical data type of columns: https://en.wikipedia.org/wiki/Statistical_data_type R has this as well; we don't have it (apart from "categorical" yes/no, which might not be enough for certain things to happen automatically).
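To make the difference concrete, here is a minimal sketch in plain Clojure. The helpers `integer-encode` and `dummy-encode` are hypothetical names for illustration and are not part of metamorph.ml or any other library. Integer coding treats the levels as ordered and equally spaced in a linear model, while dummy coding gives each non-reference level its own coefficient.

```clojure
;; Hypothetical helpers for illustration only; not a library API.
(defn integer-encode
  "Maps each value to the index of its level in `levels`."
  [levels values]
  (let [index (zipmap levels (range))]
    (mapv index values)))

(defn dummy-encode
  "Returns one 0/1 column per non-reference level.
  The first entry of `levels` is taken as the reference and dropped."
  [levels values]
  (into {}
        (for [level (rest levels)]
          [level (mapv #(if (= % level) 1 0) values)])))

(def treatments ["Control" "PSMA1" "ORC4" "Control"])

(integer-encode ["Control" "PSMA1" "ORC4"] treatments)
;; => [0 1 2 0]

(dummy-encode ["Control" "PSMA1" "ORC4"] treatments)
;; => {"PSMA1" [0 1 0 0], "ORC4" [0 0 1 0]}
```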

Daniel Slutsky 2026-04-11T11:28:41.446629Z

There are also some opinionated experiments in the experimental Tablemath library: https://scicloj.github.io/tablemath/tablemath_book.reference.html https://github.com/scicloj/tablemath Maybe you'll find its source useful.

Daniel Slutsky 2026-04-11T11:29:40.096319Z

also this notebook: https://scicloj.github.io/noj/noj_book.linear_regression_intro.html

Søren 2026-04-11T14:56:57.067159Z

First of all: thank you both for the help and links. So, I have some cell-health data. I want to predict `vb_percent_live` (a continuous variable) from two categorical variables (`treatment` with three groups, `cell_line` with three groups). I can one-hot encode with either of two approaches. The first:

;; Presumed aliases for the snippet below
(require '[tablecloth.api :as tc]
         '[tech.v3.dataset.categorical :as ds-cat])

(def cell-data (-> (tc/dataset "data/cell_health_part1.tsv")
                   (tc/rename-columns keyword)))

(def treatment-one-hot
  (ds-cat/fit-one-hot cell-data :treatment ["Control" "PSMA1" "ORC4"]))

(def cell-line-one-hot
  (ds-cat/fit-one-hot cell-data :cell_line ["A549" "HCC44" "ES2"]))

(def cell-data-one-hot
  (-> cell-data
      (ds-cat/transform-one-hot treatment-one-hot)
      (ds-cat/transform-one-hot cell-line-one-hot)))
which gives me something like:
| :vb_percent_live | :treatment-Control | :treatment-PSMA1 | :treatment-ORC4 | :cell_line-A549 | :cell_line-HCC44 | :cell_line-ES2 |
|-----------------:|-------------------:|-----------------:|----------------:|----------------:|-----------------:|---------------:|
|       0.24570297 |                  1 |                0 |               0 |               0 |                0 |              1 |
|       0.08859482 |                  1 |                0 |               0 |               0 |                0 |              1 |
|      -0.08032597 |                  1 |                0 |               0 |               0 |                0 |              1 |
|       0.16655826 |                  1 |                0 |               0 |               0 |                0 |              1 |
|       0.02126276 |                  1 |                0 |               0 |               0 |                0 |              1 |
|       0.09568241 |                  1 |                0 |               0 |               0 |                0 |              1 |
|      -0.01535644 |                  1 |                0 |               0 |               0 |                0 |              1 |
|      -0.14647677 |                  1 |                0 |               0 |               0 |                0 |              1 |
|       0.07914471 |                  1 |                0 |               0 |               0 |                0 |              1 |
Alternatively, I can use tm/design:
(tm/design
 cell-data
 [:vb_percent_live]
 ['(tm/one-hot treatment)
  '(tm/one-hot cell_line)])
which gives me something like:
| :vb_percent_live | :treatment=ORC4 | :treatment=PSMA1 | :cell_line=A549 | :cell_line=HCC44 |
|-----------------:|----------------:|-----------------:|----------------:|-----------------:|
|       0.24570297 |               0 |                0 |               0 |                0 |
|       0.08859482 |               0 |                0 |               0 |                0 |
|      -0.08032597 |               0 |                0 |               0 |                0 |
|       0.16655826 |               0 |                0 |               0 |                0 |
|       0.02126276 |               0 |                0 |               0 |                0 |
|       0.09568241 |               0 |                0 |               0 |                0 |
|      -0.01535644 |               0 |                0 |               0 |                0 |
|      -0.14647677 |               0 |                0 |               0 |                0 |
|       0.07914471 |               0 |                0 |               0 |                0 |
So - in the first method, I'd need to drop the reference group column for each variable before doing the regression. I can't figure out how to do a regression with interaction-terms after one-hot encoding, without having to specify each treatment-cell_line pair individually. I can only find continuous variable by continuous variable interactions in the notebooks.
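A rough sketch of what "expanding all pairs automatically" could look like in plain Clojure, without any library support. `indicator` and `interaction-columns` are hypothetical helper names, and the four-row sample columns are made up for illustration; they are not the real cell-health data. With the reference level of each variable dropped, two three-level variables yield 2 × 2 = 4 interaction columns.

```clojure
;; Hypothetical helpers for illustration only; not a library API.
(defn indicator
  "0/1 vector marking the rows where `column` equals `level`."
  [column level]
  (mapv #(if (= % level) 1 0) column))

(defn interaction-columns
  "Builds one 0/1 column per pair of non-reference levels of two
  categorical columns. The first level in each levels vector is
  treated as the reference and is dropped."
  [a-name a-levels a-column b-name b-levels b-column]
  (into {}
        (for [a (rest a-levels)
              b (rest b-levels)]
          [(str (name a-name) "=" a ":" (name b-name) "=" b)
           ;; the product of two indicators is 1 only where both levels occur
           (mapv * (indicator a-column a) (indicator b-column b))])))

;; Made-up sample rows, not real data.
(def treatment ["Control" "PSMA1" "ORC4" "PSMA1"])
(def cell-line ["ES2" "A549" "HCC44" "ES2"])

(def interactions
  (interaction-columns :treatment ["Control" "PSMA1" "ORC4"] treatment
                       :cell_line ["ES2" "A549" "HCC44"] cell-line))

(get interactions "treatment=PSMA1:cell_line=A549")
;; => [0 1 0 0]
```

The resulting columns could then be added to the one-hot encoded dataset before fitting, instead of writing out each treatment/cell_line pair by hand.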

Daniel Slutsky 2026-04-11T22:33:56.391979Z

Thanks for looking into this. Tablemath has been a draft that never progressed much, and this can be an opportunity to think about it again. I hope to look soon, but I am unsure about when exactly I'll get to that.

2026-04-12T15:11:46.061059Z

I am also too little of a statistician to really figure out what to improve. It is clear that the current Clojure tooling around metamorph.ml assumes, first of all, that users create the "numeric design matrix" themselves. That is a "big difference / burden", I guess, compared to R and its built-in formula language, which I guess was designed as a language feature to solve exactly the type of issues you are facing. With the design-matrix ns, we make an effort at "automating this".

2026-04-12T15:19:18.496749Z

But very likely we only support a "subset" of the needed functionality. What could help is to see, from a real use case:
• a real dataset
• the real formula you want to run the regression with
• the R output of "model.matrix"

2026-04-12T15:22:17.116529Z

I suggest discussing this in more detail in the form of an "issue" in , so we have all the information together. Please open one, and we will see if we can get further, possibly by improving which "things" the design-matrix function can handle.

2026-04-12T15:28:35.224699Z

In case you cannot share your data, we can of course use any available open dataset and look at "which forms of regression" we don't support easily. FYI: metamorph.ml.rdatasets gives access to all the datasets here: https://vincentarelbundock.github.io/Rdatasets/articles/data.html

2026-04-12T16:48:56.388899Z

Just to see how compact R is, but also to show that we can do the same in Clojure:

; R: model_matrix(mtcars, mpg ~ cyl * disp)
; Clojure:
(-> 
 (dm/create-design-matrix mtcars
                          []
                          [
                           [:cyl '(clojure.core/identity :cyl)]
                           [:disp '(clojure.core/identity :disp)]
                           [:cyl:disp '(* :cyl :disp)]
                           ]
                          
                          )
 (tc/add-column :intercept [1] :cycle)
 (tc/drop-columns [:mpg])
 )

2026-04-12T16:49:50.066199Z

1 line vs. 10 lines, but they produce an identical result.

2026-04-12T17:59:09.252899Z

@sorenpost we are discussing some directions on Zulip, in the topic shorter syntax of "columns as-is" is ml/design-matrix: https://clojurians.zulipchat.com/#narrow/channel/321125-noj-dev/topic/shorter.20syntax.20of.20.22columns.20as-is.22.20is.20ml.2Fdesign-matrix/with/585052769

2026-04-12T21:21:14.007699Z

I now understand the basic limitation of the design-matrix function you refer to. It has no support whatsoever for interactions of categorical variables, in the sense of "automatically expanding all levels of interacting categorical vars and auto-creating columns for them".

genmeblog 2026-04-12T21:30:00.357969Z

I've added my 2c on Zulip regarding design-matrix. The main issue I see with this approach is that you create such a matrix before regression, which makes predictions much harder. lm in fastmath has a transformer parameter which accepts a function translating the original data into a design matrix row automatically, both before regression and before prediction.
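The idea of a per-row transformer can be sketched in plain Clojure as follows. This is only an illustration of the concept: `row->design-row` is a hypothetical name, and the sketch does not show fastmath's actual function signature. The sample rows use `cyl`, `disp` and `mpg` in the style of mtcars.

```clojure
;; Hypothetical sketch; `row->design-row` is an illustrative name,
;; not the fastmath API.
(defn row->design-row
  "Translates one raw observation into a numeric design-matrix row:
  cyl, disp, and their interaction."
  [{:keys [cyl disp]}]
  [cyl disp (* cyl disp)])

(def training-rows [{:cyl 6 :disp 160.0 :mpg 21.0}
                    {:cyl 4 :disp 108.0 :mpg 22.8}])

;; Before fitting: transform every training row.
(def training-matrix (mapv row->design-row training-rows))
;; => [[6 160.0 960.0] [4 108.0 432.0]]

;; Before predicting: the very same function handles a new observation,
;; so no separately prepared design matrix is needed.
(row->design-row {:cyl 8 :disp 360.0})
;; => [8 360.0 2880.0]
```

Because the same function runs at fit time and at prediction time, the derived columns (here the interaction term) cannot drift out of sync between the two.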