#data-science
2022-06-23
markaddleman13:06:24

Hi folks - I'd like to take a dataset with a dependent variable and extract a data representation of a decision tree that explains the dependent variable. I am interested in explaining the data and I have no interest in predicting results. The data representation of the decision tree will be used for a variety of use cases such as producing human descriptions of rules, editing the rules by applying human wisdom so that an improved set of rules can be executed in the future, etc. In my review of existing decision tree libraries, I cannot find any that make this task straightforward. I'm hoping that I'm just missing something. Ideally, I'd like to find a way to back-out a data-representation of decision trees from an XGBoost model. Any thoughts on a good library to explore?
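To make the "data representation" in this question concrete, here is a minimal sketch in Python (chosen only for illustration; the thread does not name a language for this). It assumes a tree stored as nested dicts. The key names (`feature`, `threshold`, `yes`, `no`, `label`), the `tree_to_rules` helper, and the toy tree itself are all hypothetical and do not come from any library.

```python
# Hypothetical nested-dict tree: internal nodes test `feature < threshold`,
# leaves carry a `label`. None of these key names come from a real library.
tree = {
    "feature": "years_experience",
    "threshold": 5,
    "yes": {"label": "junior band"},
    "no": {
        "feature": "team_size",
        "threshold": 10,
        "yes": {"label": "senior band"},
        "no": {"label": "lead band"},
    },
}


def tree_to_rules(node, conditions=()):
    """Return one human-readable rule string per leaf."""
    if "label" in node:
        premise = " and ".join(conditions) or "always"
        return [f"if {premise} then {node['label']}"]
    test = f"{node['feature']} < {node['threshold']}"
    negated = f"{node['feature']} >= {node['threshold']}"
    return (
        tree_to_rules(node["yes"], conditions + (test,))
        + tree_to_rules(node["no"], conditions + (negated,))
    )


rules = tree_to_rules(tree)
for rule in rules:
    print(rule)

# Because the tree is plain data, "applying human wisdom" is an ordinary edit.
tree["no"]["threshold"] = 8
edited_rules = tree_to_rules(tree)
```

The point of the sketch is only that once a tree is plain data, describing it and editing it are both simple traversals. Getting a trained model into such a shape is the library-specific part.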

Carsten Behring15:06:26

which wraps Smile can do that. You can "train" on your data, and some models (decision trees / random forest) allow you to ask for variable importance or for the tree itself. (XGBoost, which is usable via scicloj.ml.xgboost, does the same.) This happens by extracting the (trained) Smile Java class from the training result and calling the appropriate Smile Java methods on it via Java interop. Examples for this are here: https://scicloj.github.io/scicloj.ml-tutorials/userguide-models.html#:smile.classification/ada-boost https://scicloj.github.io/scicloj.ml-tutorials/userguide-models.html#:smile.classification/decision-tree
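For the XGBoost side of the question, a hedged sketch in Python rather than the Clojure/Smile route described above: XGBoost's Python API can dump each trained tree as JSON text via `Booster.get_dump(dump_format="json")`. The sample string below is hand-written in that shape, not produced by a real model, and `split_counts` is a hypothetical helper. It counts how often each feature is used in a split, which is a crude form of variable importance.

```python
import json
from collections import Counter

# Hand-written sample in the shape of one XGBoost JSON tree dump;
# the feature names and numbers are made up for illustration.
sample_dump = """
{"nodeid": 0, "depth": 0, "split": "age", "split_condition": 30.0,
 "yes": 1, "no": 2, "missing": 1,
 "children": [
   {"nodeid": 1, "leaf": -0.4},
   {"nodeid": 2, "depth": 1, "split": "income", "split_condition": 50000.0,
    "yes": 3, "no": 4, "missing": 3,
    "children": [
      {"nodeid": 3, "leaf": 0.1},
      {"nodeid": 4, "leaf": 0.6}
    ]}
 ]}
"""


def split_counts(node, counts=None):
    """Count how many split nodes use each feature in one dumped tree."""
    counts = Counter() if counts is None else counts
    if "leaf" not in node:  # internal node: record the split, then recurse
        counts[node["split"]] += 1
        for child in node["children"]:
            split_counts(child, counts)
    return counts


tree = json.loads(sample_dump)
counts = split_counts(tree)
print(dict(counts))
```

With a real booster, the same function could be applied to every string returned by the dump and the counters summed. That part is left out here because it needs a trained model.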

markaddleman17:06:26

Thanks for the pointer!

val_waeselynck20:06:00

So, people keep wondering why Clojure comes up as the top-paying lang in the Stack Overflow Developer Survey. Of course, we'd all like to find a causal link from Clojure adoption to high pay 😉 but there might be a lot of confounding factors (experience, geographic adoption patterns, etc.). Maybe it's time for a little causal analysis of the data? Wouldn't that be a nice data science project for trying out the tools of the community?

👍 3
Daniel Slutsky21:06:19

Sounds great. @U06GS6P1N if you'd find it useful to turn that into a community session of some kind, please tell -- I'd love to help.

metasoarous03:06:43

I love this idea. I have a feeling it might be extremely hard to tease apart all of the various factors, but it's worth a shot! And either way it will be a fun example project to point people to (with not so subtle undertones of why folks should assimilate).

borg 1
👍 1
val_waeselynck10:06:22

I'm just throwing the idea around, as I don't really have time for this right now ^^

🙏 1
val_waeselynck11:06:35

It's not like I'm very familiar with the Clojure statistical stack anyway

val_waeselynck11:06:28

I'd tackle this by making some sort of regression on salary, in which country, seniority and language would be covariates.

👍 1
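A minimal sketch of that regression idea, in plain Python with no libraries. Everything here is an assumption made for illustration: the rows are synthetic (salary is constructed as 50 + 10·clojure + 20·country_a + 2·years, with no noise), country is reduced to a single made-up indicator, and `solve` and `ols` are hypothetical helpers. It is not survey data and not a recommended estimator.

```python
# Synthetic rows: (is_clojure, is_country_a, years, salary_k). Entirely made up.
# Salary is built as 50 + 10*clojure + 20*country_a + 2*years, with no noise.
rows = []
for clj, country_a, years in [
    (1, 1, 12), (1, 0, 10), (1, 1, 8), (1, 0, 15),
    (0, 1, 3), (0, 0, 2), (0, 1, 5), (0, 0, 4), (0, 0, 9), (0, 1, 1),
]:
    rows.append((clj, country_a, years, 50 + 10 * clj + 20 * country_a + 2 * years))


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(xs, ys):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    return solve(xtx, xty)


# Design matrix: intercept, language indicator, country indicator, seniority.
xs = [[1.0, clj, country_a, years] for clj, country_a, years, _ in rows]
ys = [salary for *_, salary in rows]
intercept, clojure_coef, country_coef, years_coef = ols(xs, ys)

# Naive comparison that ignores the covariates.
clj_pay = [s for c, _, _, s in rows if c == 1]
other_pay = [s for c, _, _, s in rows if c == 0]
naive_gap = sum(clj_pay) / len(clj_pay) - sum(other_pay) / len(other_pay)

print(f"naive gap: {naive_gap:.1f}k, adjusted Clojure coefficient: {clojure_coef:.1f}k")
```

In this made-up data the Clojure rows also have more years of experience, so the naive gap comes out larger than the language coefficient. That is the confounding concern raised earlier in the thread, shown by construction rather than discovered.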
aaelony16:06:55

if you are using libpython-clj, you could use the DoWhy library https://github.com/py-why/dowhy

Daniel Gerson11:07:32

I like this thread. My 2p: either it's the kinds of people who do Clojure who can charge a premium, or it's the nature of Clojure work (and the supply and demand constraints that come with it) that commands the premium. A dataset that includes fees from consultants who work in more than one language could tell you whether these individuals charge more across all the languages they consult in. If not, it's likely the latter.

💡 1
👍 1
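The comparison proposed above can be sketched in a few lines of Python. The fee table is invented for illustration (names, languages, and day rates are all hypothetical), and the two "gaps" are just one simple way to operationalize the idea, not an established method.

```python
from statistics import mean

# Made-up day rates: (consultant, language, rate). Not real data.
fees = [
    ("ana", "clojure", 900), ("ana", "java", 880),
    ("ben", "clojure", 950), ("ben", "python", 940),
    ("cy", "java", 600), ("cy", "python", 620),
    ("dee", "java", 640),
]

clojure_people = {who for who, lang, _ in fees if lang == "clojure"}

# Within-person gap: the same consultant's Clojure rate minus their other rates.
within_gaps = []
for who in sorted(clojure_people):
    clj = mean(r for w, l, r in fees if w == who and l == "clojure")
    others = [r for w, l, r in fees if w == who and l != "clojure"]
    if others:
        within_gaps.append(clj - mean(others))

# Between-person gap, measured on non-Clojure work only.
clj_folk_other_rates = [r for w, l, r in fees if w in clojure_people and l != "clojure"]
non_clj_folk_rates = [r for w, l, r in fees if w not in clojure_people]
between_gap = mean(clj_folk_other_rates) - mean(non_clj_folk_rates)

print("within-person Clojure premium:", mean(within_gaps))
print("premium of Clojure consultants on other languages:", between_gap)
```

A small within-person gap together with a large between-person gap would point toward "the kinds of people" explanation; the reverse pattern would point toward the nature of the work. The toy numbers here were picked to show the first pattern and say nothing about real consultants.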