#data-science
2018-05-10
whilo08:05:08

gorilla is nice, but it is hardly maintained, lacks many features of jupyter and is less beginner-friendly (jupyter has a nice UI with a menu, is known to many people in the data science community and has a lot of documentation everywhere).

whilo08:05:59

i have used gorilla and also contributed a patch, but i think the clojure strategy in general is to use a powerful hosted environment where it is compatible with clojure's value proposition and functional nature, instead of reinventing the wheel.

whilo08:05:11

the same holds for plotting. gorilla and incanter are not enough to produce scientific plots. i really tried to use clojure+gorilla in competition with R and Python, and it is not worth it with the pure clojure approach.

whilo09:05:05

even the JVM has few good plotting options imo. compare them to plotly, for example, to which i have converged for now, mostly because it has good examples of how to do scientific plots.

whilo09:05:32

@justalanm what are you working on?

alan09:05:18

Yeah, unfortunately I noticed as well that Gorilla is not much alive (Incanter is even worse...). I'm a data scientist at an insurance company, and I'm introducing Clojure at work for data engineering tasks (ETL, pipelines, batch processing, and so on). I feel that what we lack is not an environment such as Incanter; working with sequences of maps is easy, straightforward and pretty fast (easy parallelization is what really sold me on Clojure for data engineering). What we lack is something such as scikit-learn for Python
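A rough illustration of the records-as-maps ETL style alan describes, sketched here in Python with dicts since the channel keeps comparing against the Python stack; the field names are invented for the example, and this mirrors the Clojure map/pmap pattern rather than any actual code from the discussion:

```python
from concurrent.futures import ThreadPoolExecutor

# A batch of records, each a plain map/dict -- the shape alan describes
# for ETL work. The field names here are made up for illustration.
records = [
    {"policy_id": 1, "premium": 100.0},
    {"policy_id": 2, "premium": 250.0},
    {"policy_id": 3, "premium": 80.0},
]

def enrich(record):
    # A pure transformation over one record: return a new map with a
    # derived field, leaving the input untouched.
    return {**record, "premium_with_tax": record["premium"] * 1.2}

# Sequential map over the sequence of maps...
transformed = list(map(enrich, records))

# ...and the "easy parallelization" version: same pure function, same
# data, only the execution strategy changes (compare map vs. pmap in
# Clojure).
with ThreadPoolExecutor() as pool:
    transformed_parallel = list(pool.map(enrich, records))

assert transformed == transformed_parallel
```

Because `enrich` is pure, swapping the sequential map for the parallel one cannot change the result, which is the point alan is making.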

whilo09:05:45

what do you use of scikit-learn?

whilo09:05:59

i am atm working on Anglican, the probabilistic programming language

whilo09:05:54

i think we should outsource side-effects like plotting, worksheets and maybe even some data formats (HDF5) and so on to standards, and focus on core processing

whilo09:05:00

in that sense you are right

whilo09:05:19

for the core control flow and algorithms we would need to have something in clojure/java

alan09:05:26

I didn't know about Anglican, I'll take a look at it. We use more or less 50% of scikit-learn's facilities: many classifiers, metrics, decomposition and regression algos

alan09:05:02

When I have some more time I'd like to experiment with https://neanderthal.uncomplicate.org/

alan09:05:10

We're throwing a lot of XGBoost and other ensembles at problems, but I totally dislike its API (when it works, because we have many issues with it)

alan09:05:34

Anyway, I don't much like jupyter notebooks; people tend to use them as IDEs, and converting them to standalone scripts is non-trivial, while Gorilla's worksheets are just .clj files with comments. A much better idea in my opinion

whilo09:05:27

true, but they can also become huge

whilo09:05:34

i could not always easily load them in an editor

whilo09:05:35

i think it would be easy to have a one-way extraction of the jupyter JSON into a clj file
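The extraction whilo suggests is mostly mechanical, since .ipynb files are plain JSON with a list of cells. A minimal sketch of the idea; the notebook content below is a made-up stand-in, and real notebooks carry extra metadata this ignores:

```python
import json

# Stand-in for an .ipynb file: JSON with a "cells" list, where each
# cell has a type and a list of source lines.
notebook_json = """
{"cells": [
  {"cell_type": "markdown", "source": ["# Load the data\\n"]},
  {"cell_type": "code", "source": ["(def xs [1 2 3])\\n", "(reduce + xs)\\n"]}
]}
"""

def notebook_to_clj(text):
    """One-way extraction: code cells become code, markdown cells
    become ;; line comments, mirroring Gorilla's worksheets-as-.clj
    idea."""
    nb = json.loads(text)
    chunks = []
    for cell in nb["cells"]:
        source = "".join(cell["source"])
        if cell["cell_type"] == "code":
            chunks.append(source)
        else:
            chunks.append("".join(";; " + line + "\n"
                                  for line in source.splitlines()))
    return "\n".join(chunks)

print(notebook_to_clj(notebook_json))
```

As whilo says, this is one-way: outputs and execution state are dropped, which is exactly the "not sure whether this is good enough" caveat.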

whilo09:05:45

but i am not sure whether this is good enough

whilo09:05:54

i agree that the worksheet approach has limits

whilo09:05:22

cider also seems to support images in the REPL now, which might be good enough for plots and is actually a pretty powerful environment

whilo09:05:37

similar to proto REPL

whilo09:05:09

i have worked on a core.matrix wrapper around neanderthal: https://github.com/cailuno/denisovan

whilo09:05:48

neanderthal provides some nice low-level APIs and primitives, but it is not like numpy or scikit-learn

whilo09:05:12

btw, i hate this threading interface of slack

whilo09:05:18

(and i hate slack)

alan09:05:39

Nice! That's a very good idea. Anyway, the strong part of scikit-learn is not all the algos, but a common interface and all the "tooling" like metrics, plotting and persistence facilities
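The "common interface" alan points at is essentially scikit-learn's fit/predict/score convention. A deliberately toy estimator sketching the shape of that contract; this is illustrative code following the convention, not anything from scikit-learn itself:

```python
class MeanRegressor:
    """A trivial model that follows the scikit-learn estimator
    contract: configure in __init__, learn in fit, apply in predict,
    evaluate in score. Generic tooling (cross-validation, grid search,
    pipelines) can then drive any model the same way."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)  # learned state gets a trailing _
        return self                   # fit returns self, so calls chain

    def predict(self, X):
        # Ignore the features and always predict the training mean.
        return [self.mean_ for _ in X]

    def score(self, X, y):
        # Mean absolute error, negated so that higher is better.
        preds = self.predict(X)
        return -sum(abs(p - t) for p, t in zip(preds, y)) / len(y)

model = MeanRegressor().fit([[0], [1], [2]], [1.0, 2.0, 3.0])
print(model.predict([[5]]))
```

The value is not in any one model but in the uniformity: swap `MeanRegressor` for any other estimator honoring the same three methods and the surrounding tooling keeps working.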

alan09:05:54

I agree 🙂

alan09:05:38

If you prefer you can fire me an email at [email protected]

whilo09:05:43

or we keep discussing in the main slack channel

whilo09:05:17

i think this is maybe interesting to others as well

whilo09:05:29

i just don't like slack because of the paywall to my own content

whilo09:05:38

i cannot look up stuff i discussed with others later

alan10:05:26

Let's continue on the main thread then

whilo09:05:52

alternatively, this might be the way to go if we would like to stay in clojure: https://github.com/cloudkj/lambda-ml

alan12:05:13

Yeah, I know about both of them, but as you just said Weka is Java, and anyway both lack the scope of scikit-learn. I feel like this can't be outsourced (as with Weka), and though I started with R and still use it extensively, I can perfectly understand that having one coherent API to perform maybe 90% of the modeling tasks most of us need is a huge benefit

alan12:05:56

Most of my colleagues use Python because it's mainstream and because it's a one-stop shop. And when I say "use" I mean it: most of them have never bothered with Python internals, how to solve difficult and real problems, and so on

alan12:05:09

So I would say:
- One consistent API (or DSL in the Clojure case, it doesn't make much difference)
- Good performance out of the box (better if GPU-enabled)
- All the tooling required to perform an analysis from start to finish (metrics, plotting facilities, etc.)
- Nice to have: XGBoost and maybe even https://github.com/catboost/catboost implemented within the same API, and some deployment facilities

These are the factors that would make Clojure viable as a mainstream machine-learning language. We can bring to the table immutability, easy parallelization, probably very good performance out of the box, and the JVM (with all its pros and cons)

alan12:05:26

Oh, and of course piping stuff and layers (for neural nets) is much better in Clojure than in other languages. I'm aware of Cortex and I kinda like it, but it's easy to beat TensorFlow's clunky syntax...
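The "piping layers" point is at bottom just function composition; a sketch in Python (Clojure's `->` threading macro or `comp` expresses the same thing more tersely). The layer functions and names here are invented for illustration:

```python
from functools import reduce

# Each "layer" is just a function from input to output.
def scale(xs):
    return [2.0 * x for x in xs]

def shift(xs):
    return [x + 1.0 for x in xs]

def relu(xs):
    return [max(0.0, x) for x in xs]

def pipeline(*layers):
    """Compose layers left-to-right, the way Clojure's -> threads a
    value through successive forms."""
    return lambda x: reduce(lambda acc, layer: layer(acc), layers, x)

net = pipeline(scale, shift, relu)
print(net([-3.0, 0.0, 2.0]))
```

Because each layer is an ordinary function, reordering, inserting, or testing layers in isolation is trivial, which is the ergonomics alan is contrasting with TensorFlow's graph-building syntax of the time.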

whilo13:05:11

do you know pytorch? it is extremely good syntax-wise

whilo13:05:17

it feels like python

whilo14:05:35

i will be working on autograd next; i already did some preliminary work last autumn: https://github.com/whilo/clj-autograd and will take a look at whether this is mergeable with flare

chunsj02:05:49

@U1C36HC6N I am trying to adopt your autograd in my TH library binding for Common Lisp. Can you provide me with references on the design of your autograd code? I've managed to write a converted version for Lisp (https://bitbucket.org/chunsj/th/src/master/ad/) and I'd like to extend it as in the case of pytorch. Thank you.

whilo13:05:24

this is a general website for the autodiff community: http://autodiff.org/

whilo13:05:45

this is reverse-mode automatic differentiation; in neural networks it is often called backpropagation

whilo13:05:00

effectively all deep learning libraries do this nowadays

whilo13:05:06

(to my knowledge)
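A minimal scalar sketch of the reverse-mode idea whilo describes: build the computation graph during the forward pass, then accumulate gradients backward via the chain rule. This is generic illustrative code, not taken from clj-autograd, Anglican, or any deep learning library:

```python
class Var:
    """A scalar node in a computation graph: it stores its value, a
    gradient accumulator, and links to its parents together with the
    local partial derivative along each link."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # list of (parent_var, local_gradient)

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def __add__(self, other):
        # d(a+b)/da = d(a+b)/db = 1
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def backward(self, upstream=1.0):
        # Accumulate the upstream gradient, then push it to parents,
        # scaled by each local partial (the chain rule). Naive
        # recursion: correct for these graphs, but a real library
        # processes nodes in topological order for efficiency.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

x, y = Var(3.0), Var(4.0)
z = x * y + x      # dz/dx = y + 1, dz/dy = x
z.backward()
print(x.grad, y.grad)
```

Note how `x` appears twice in the expression and its `.grad` correctly accumulates both contributions (4 from the product, 1 from the sum), which is why "effectively all deep learning libraries do this".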

whilo14:05:23

@justalanm what are you missing from lambda-ml?

alan15:05:31

As I already said, the whole tooling part: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics for instance, and a common API
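The metrics module alan links is a grab-bag of scoring functions sharing a uniform `(y_true, y_pred)` signature. Hand-rolled equivalents of two of them, to show how small but valuable that tooling layer is; these reimplementations are for illustration and are not scikit-learn's code:

```python
def accuracy(y_true, y_pred):
    """Fraction of exact matches, like sklearn.metrics.accuracy_score."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred):
    """Binary F1, the harmonic mean of precision and recall for the
    positive class 1, like sklearn.metrics.f1_score."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred), f1(y_true, y_pred))
```

Each metric is a few lines, but having dozens of them behind one consistent signature is exactly the kind of "tooling" a Clojure equivalent of scikit-learn would need to provide.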

alan15:05:49

And docs, a lot of docs and tutorials