#data-science
2016-03-14
blueberry00:03:28

Mike, I explained many times in detail what's wrong with core.matrix, and I think it is a bit funny that you jump in every time Neanderthal is mentioned with the same dreams about core.matrix, without even trying Neanderthal, or discussing the issues that I raised. Every time your answer is that core.matrix is fine for YOUR use cases. That's fine with me and I support your choice, but core.matrix fell short for MY use cases, and after detailed inspection I decided it was unsalvageable. If I thought I could improve it, it would have been easier for me to do that than to spend my time fiddling with JNI and GPU minutes.

blueberry00:03:22

I understand your emotions about core.matrix, and I empathize with you. I support your contributions to the Clojure open-source space, and am glad if core.matrix is a fine solution for a number of people. Please also understand that it is not a solution to every problem, and that it can also be an obstacle when it falls short in a challenge.

blueberry00:03:54

I support you in developing a separate BLAS implementation for core.matrix. If you do a good job, I might even learn something new and improve Neanderthal.

blueberry00:03:24

I am happy that you think that it is not much work, since it will be easy for you or someone else to implement it 😉 Contrary to what you said on Slack, I am not against it. I said that many times. Go for it. The only thing I said is that I do not have time for that, nor do I have any use for core.matrix. Regarding Windows - Neanderthal works on Windows. I know this because a student of mine compiled it (he's experimenting with an alternative GPU backend for Neanderthal and prefers to work on Windows). As I explained to you in the issue that you raised on GitHub last year, you have to install ATLAS on your machine, and Neanderthal has nothing un-Windowsy in its code. There is nothing Neanderthal-specific there; it is all about compiling ATLAS. Follow any ATLAS, NumPy + ATLAS, or R + ATLAS guide for instructions. Many people have done that installation, so I doubt it'd be a real obstacle for you.

stephenmhopper00:03:49

I have an Incanter question. I have a dataset and I'm trying to create multiple aggregate statistics simultaneously. Basically, I need to group by two columns and then create five different summary statistics for those two columns (mostly just simple mean, standard deviation, median type of operations). I know that I can do something like this for a single function:

(i/$rollup stats/mean :column-name-goes-here [:group-by-column-1 :group-by-column-2] data)

stephenmhopper00:03:59

Is there a way to do that for multiple functions at once?

stephenmhopper00:03:47

Right now, I'm doing each rollup calculation separately, and then bringing them all together with conj-cols

stephenmhopper00:03:54

And that feels inefficient to me
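
A minimal sketch of the "separate rollups, then conj-cols" pattern described above, assuming incanter.core and incanter.stats are aliased as i and stats and that data is the dataset in question; each $rollup groups the data again, which is the inefficiency being discussed:

(require '[incanter.core :as i]
         '[incanter.stats :as stats])

;; One rollup per statistic over the same grouping, then glue the results together.
;; The rolled-up column keeps its original name in each result, so the joined
;; dataset may need its columns renamed afterwards.
(let [groups  [:group-by-column-1 :group-by-column-2]
      means   (i/$rollup stats/mean   :column-name-goes-here groups data)
      sds     (i/$rollup stats/sd     :column-name-goes-here groups data)
      medians (i/$rollup stats/median :column-name-goes-here groups data)]
  (i/conj-cols means sds medians))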

mikera02:03:41

@blueberry I chime in on this because I'm trying to create a good data science ecosystem for Clojure. It will be better if people develop systems with dependencies on standard APIs, and core.matrix is the standard API for array programming in Clojure

mikera02:03:36

I have nothing against Neanderthal as an implementation, I'm sure it works well for your use cases. But it should be an implementation, rather than the focus of a separate API

mikera02:03:05

I'm seriously encouraging you to become a valuable contributor to the ecosystem

mikera02:03:20

If you have any issues with core.matrix the process is simple: file an issue. I haven't yet heard any valid technical objections from you; happy to discuss if you have any and consider how they might be addressed

mikera02:03:30

The worst case is if we end up with a situation like in Java, where everyone develops separate incompatible matrix libraries. We really don't want that for Clojure

blueberry09:03:49

@mikera well, in the words of Yoda: wishes do not a standard make. I am curious why you think core.matrix is a standard? It is the most popular API in a tiny niche, but it is far from a standard. BLAS, on the other hand, is. Why do you think BLAS was not right? If the only reason was that you wanted NDArrays, why didn't you base the 1D and 2D parts on BLAS, and extend only the nD stuff?

mikera09:03:32

core.matrix is the only library in the Clojure ecosystem that provides a general purpose abstraction for n-dimensional array programming (analogous to NumPy). To my knowledge it is the only such library for any language that provides truly pluggable implementations thanks to Clojure protocols

mikera09:03:50

If you want a BLAS-type API in core.matrix, it is actually pretty trivial to do: I actually added a few functions recently because someone requested them: https://github.com/mikera/core.matrix/blob/develop/src/main/clojure/clojure/core/matrix/blas.cljc

mikera09:03:14

Having said that, I don't really like BLAS-style APIs in Clojure for a few reasons:

mikera09:03:37

- It depends on mutability. I don't think that is a good "default" API, even if you might want it sometimes for performance reasons

mikera09:03:21

- There are a bunch of unnecessary arguments. All the LDA / LDB stuff in gemm for example is tied to a particular array representation. Not well enough abstracted from the memory layout IMO

mikera09:03:54

- It is fundamentally 1D / 2D. That's a big restriction, a lot of data science stuff uses higher dimensional arrays
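
To make the contrast concrete, here is a sketch of the two styles side by side; clojure.core.matrix is assumed to be aliased as m, and the dgemm signature in the comment is the standard Fortran BLAS one:

;; Raw BLAS matrix multiplication carries storage details in its signature:
;;   dgemm(transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc)
;; where lda/ldb/ldc are leading dimensions of the underlying memory layout,
;; and C is mutated in place. The core.matrix call hides all of that:
(require '[clojure.core.matrix :as m])

(def a (m/matrix [[1 2] [3 4]]))
(def b (m/matrix [[5 6] [7 8]]))

(m/mmul a b) ;;=> [[19 22] [43 50]]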

mikera09:03:50

There's nothing to stop an implementation in core.matrix sending the 1D / 2D stuff to BLAS though, and using an alternative technique for higher dimensions

mikera09:03:43

The vectorz-clj implementation actually does something like that anyway, there is optimised code for the 1D / 2D case and more generic code for the ND cases. My point is that the user shouldn't have to care, it is an implementation detail.

mikera09:03:03

I do like the work you have done with Neanderthal, I think it has some great implementation ideas. I'd even like to try it out on one of my machine learning projects where the mmul would be really helpful

mikera09:03:01

Hence my offer to help you implement the core.matrix protocols still stands. Would like to work with you on this

mikera09:03:23

If I could only get it to build, I could probably have the implementation up and running in a couple of hours (most of the protocols are optional... you only need to implement them if you need to for performance reasons)

blueberry10:03:39

@mikera Neanderthal completely covers lda, strides, offsets and such stuff, and the API is completely independent of that. The user only has to say (dge m n) and there is a matrix. (mm a b) - they are multiplied!
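
A rough usage sketch of the calls mentioned here, assuming the Neanderthal namespaces of that era (dge from uncomplicate.neanderthal.native, mm from uncomplicate.neanderthal.core); exact namespaces and arities may differ between versions:

(require '[uncomplicate.neanderthal.core :refer [mm]]
         '[uncomplicate.neanderthal.native :refer [dge]])

;; dge creates a dense double-precision matrix (filled column by column);
;; mm multiplies two matrices and returns a new one.
(def a (dge 2 3 [1 2 3 4 5 6]))   ; 2x3
(def b (dge 3 2 [1 2 3 4 5 6]))   ; 3x2
(mm a b)                          ; 2x2 result - no lda/offset bookkeeping in user code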

mikera10:03:34

That's cool... but that's not the BLAS API any more 🙂

mikera10:03:56

What I propose is that the Neanderthal core.matrix implementation for mmul would just delegate to dge etc. as necessary. Would be a pretty lightweight wrapper

mikera10:03:05

And the wrapper would handle coercions, so that stuff like (mmul neanderthal-array clojure-vector) would "just work"

mikera10:03:37

i.e. you can focus on just making a great implementation, and let core.matrix handle the messy integration stuff. Does that make sense?
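
A sketch of what such a wrapper might look like, using core.matrix's real protocol namespace (clojure.core.matrix.protocols) but a hypothetical NeanderthalMatrix type and hypothetical helper functions; this illustrates the proposal rather than any existing code:

(require '[clojure.core.matrix.protocols :as mp])

;; NeanderthalMatrix, neanderthal-mm, neanderthal-element-mul and
;; coerce-to-neanderthal are all placeholders for illustration.
(extend-protocol mp/PMatrixMultiply
  NeanderthalMatrix
  (matrix-multiply [a b]
    (neanderthal-mm a (coerce-to-neanderthal b)))
  (element-multiply [a b]
    (neanderthal-element-mul a (coerce-to-neanderthal b))))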

otfrom10:03:22

ah, so this is where all the data people hang out

otfrom10:03:36

I'd been looking in #C08384015 and #C08PKSV2L

otfrom10:03:53

👋

otfrom10:03:42

mikera: is it worth pinning a message on those other channels pointing people to here?

mikera10:03:16

Could be... though I guess the other channels are still valuable for more specific topics. Quite a bit of discussion happens on #C0533TY12 as well

otfrom10:03:21

mikera: I found it to be a bit ghost-town-like until I started chatting with @rickmoynihan

otfrom10:03:21

though I will admit that I'm more of a glommer than a splitter when it comes to channels

otfrom10:03:31

👋 eleonore

eleonore10:03:37

👋 otfrom

blueberry10:03:27

@mikera but that is precisely my main objection: core.matrix silently does silly stuff that I want to avoid. People are happily multiplying sequences with arrays in core.matrix, and then scratching their heads trying to figure out why it has the speed of a snail.

blueberry10:03:06

I deliberately implemented checks in Neanderthal that stop you when you try to shoot yourself in the foot.

mikera10:03:02

I see all that as implementation detail. Mostly you don't care about performance and convenience is more important. If you do care, you should profile and figure out which hotspot you need to optimise.

mikera10:03:52

Nothing to stop core.matrix implementations doing the same checks, of course

rickmoynihan10:03:38

@mikera: regarding core.matrix protocols - specifically for Dataset (not Matrix)... am I right that the core.matrix implementation is eagerly loaded into memory? I'd like to build a lazy/reducible (and possibly transducible) implementation - for ETL tasks.... I'm curious if you have any thoughts on that?

mikera11:03:33

@rickmoynihan: The current implementation is eager, correct. There's nothing to stop a lazy implementation.... though I'm not entirely sure how you would want the semantics to work. A lot of operations would need to realise the whole dataset anyway. For data loading / ETL I would probably just use a combination of regular functions / transducers / reducing functions on a lazy sequence of input data and accumulate the results into an array / dataset of the appropriate shape.

rickmoynihan11:03:47

@mikera: for a lot of the ETL we do, we don't need (or want) to hold the whole dataset in memory - because it's too big - though it does all need to be consumed eventually, of course... We do this already in grafter - which uses the row-oriented incanter.core.Dataset representation, but our custom functions, where possible, prefer to put a lazy sequence in the :rows... and we tend to avoid most of Incanter because of its eagerness preference... not that there aren't problems with this approach too, of course... I think the ideal from my perspective would be to be able to use a transducer inside the dataset - and then leave the decision of lazy-seq/channel/reducible to the outermost process... I'm quite interested in the idea of using a reducer to reduce into the target without paying the costs of laziness - but avoiding (where possible) holding everything in RAM

rickmoynihan11:03:15

though I haven't explored this idea in any depth yet

mikera11:03:54

I think reducing into the target is the way to go - assuming your target will fit in memory (the common case?). That way you can just discard the lazily loaded data after it is processed (being careful not to hold onto the head, of course 🙂 ) If your target itself is bigger than memory it's going to get tricky whatever you do, you'll maybe have to look at things like Spark RDDs etc.
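
A small sketch of "reduce into the target" over lazily loaded rows, where the input file name, parse-row and valid-row? are hypothetical stand-ins for whatever per-row work the ETL actually does:

(require '[clojure.java.io :as io])

(with-open [rdr (io/reader "rows.csv")]          ; hypothetical input file
  (transduce
    (comp (map parse-row)                        ; parse-row: hypothetical per-row parser
          (filter valid-row?))                   ; valid-row?: hypothetical cleansing check
    conj []                                      ; accumulate straight into an in-memory vector
    (line-seq rdr)))                             ; read the lines lazily rather than all at once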

rickmoynihan11:03:40

@mikera: that's what we currently do

rickmoynihan11:03:33

but lazy sequences are pretty expensive in terms of object allocations compared to say reducers - plus reducers are easier to close properly

mikera11:03:33

It would surprise me if the object allocations are the real bottleneck, assuming you are doing some non-trivial work with each data item

otfrom11:03:25

rickmoynihan: not sure if iota actually helps any of that. It would allow you to use reducers over something larger than memory (tho not larger than disk). Never been quite sure about the perf though.

rickmoynihan11:03:10

@mikera: well I don't know for sure - but I do know the garbage collector gets hit pretty hard

rickmoynihan11:03:40

@otfrom: we don't need larger than disk just now

otfrom11:03:07

rickmoynihan: ah, what I like to call "annoying size data" 😄

rickmoynihan11:03:32

I've been meaning to play with iota and read its code a bit more thoroughly - because it's pretty close to what I was wanting - though it might benefit from more generality... also most of it seems to be in the Java classes

mikera11:03:38

I guess I don't quite yet see the benefit of having a Dataset-type implementation for your source data if all you are going to do is reduce / transduce over its rows. That would mean your dataset wouldn't be able to hold onto its head, which would be an odd implementation....

otfrom11:03:14

mikera: so scrubbing in transducers and calcs in core.matrix?

otfrom11:03:18

is that the pattern?

mikera11:03:30

Yup I think so.

otfrom11:03:31

cool. That feels like something I could work with
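
A sketch of that pattern, with the scrubbing done by a transducer and the calculations handed to core.matrix afterwards; clean-row and raw-rows are hypothetical placeholders:

(require '[clojure.core.matrix :as m])

(let [rows (into [] (comp (map clean-row)        ; clean-row: hypothetical scrubbing fn
                          (filter some?))        ; drop rows the scrubber rejected
                 raw-rows)                       ; raw-rows: hypothetical input sequence
      mat  (m/matrix rows)]                      ; realise the cleaned rows as an array
  (m/mmul (m/transpose mat) mat))                ; e.g. X'X as a downstream calculation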

otfrom11:03:10

I've been thinking about where dplyr-like things go too. I think that some are data scrubbing and some are more transformation as part of calcs, but I think I need to think about it more.

rickmoynihan12:03:25

@mikera: isn't holding onto the head only useful when you need to return to previous rows though? If you're just mapping a function over the rows and outputting them somewhere - without needing to aggregate/rollup etc...

rickmoynihan12:03:46

@otfrom: by scrubbing - do you mean data cleansing?

mikera12:03:00

@rickmoynihan: yes that is exactly right. But then I don't see what data structure makes sense for your dataset: a record with :rows and :column-names would be holding on to the head, for example if :rows was a lazy seq

rickmoynihan12:03:06

@mikera: I've thought about that too - and I think that is only an issue if you realise the dataset with the dataset still bound... I think if you use the dataset merely as a vehicle for keeping the :rows computation and the :column-names together - you can be careful to ensure locals clearing still kicks in

mikera12:03:51

Makes sense. But then I wonder why bother having the Dataset in the first place..... is it just for column name tracking etc?

rickmoynihan12:03:13

it holds the order of columns

rickmoynihan12:03:20

which can be important

rickmoynihan12:03:48

but keeping them in sync is a bit of a pain - as every dataset operation has to worry about it

rickmoynihan12:03:20

@mikera: part of what we want to do with grafter is build user interfaces for building transformations.... openrefine style... e.g. this is a prototype interface we helped a project partner build: https://www.youtube.com/watch?v=zAruS4cEmvk

rickmoynihan12:03:31

it's something we'd do ourselves if we weren't resource constrained - and one of our FP7 project partners needed something to do - so we had them build that on top of grafter

rickmoynihan12:03:12

but those kinda interfaces require columns to be tracked - and maintain a stable/predictable order

otfrom13:03:49

rickmoynihan: scrubbing == cleansing

otfrom13:03:27

though munging is still something that might be done w/the dataset, (munging being rearranging more than cleansing)

otfrom13:03:42

which is why I'm trying to think of where dplyr like things might go

rickmoynihan13:03:51

@otfrom: Yeah - in my experience data cleaning is usually more amenable to row/stream processing... you don't normally need multiple passes over the data... but clearly the numerical use cases and rearranging ones do need that

rickmoynihan13:03:49

what are you using currently for dplyr? We've used a tweaked version of this in the past: https://github.com/tomfaulhaber/split-apply-combine

rickmoynihan13:03:29

but it's pretty much unmaintained - I'm curious where the inertia is now...

otfrom14:03:00

rickmoynihan: just straight up clojure, but I have some R converts who are missing things

rickmoynihan14:03:14

yeah - we have one of those too 🙂

otfrom14:03:11

3 from R (including our CEO Francine) and 1 from Python.

otfrom14:03:16

I'm working on them slowly

rickmoynihan14:03:16

I think the split apply combine approach is pretty interesting - it's obviously got some interesting parallels to map/reduce, fork/join/reducers etc...

rickmoynihan14:03:31

I know next to nothing about R - but I skimmed over the R split apply combine paper a while back - and it seemed that it could be a lot more general in clojure...
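
The split-apply-combine idea itself is easy to sketch in plain Clojure, which is part of why it generalises well; the aggregation below is just an inline mean over hypothetical row maps:

;; split: group rows by a key; apply: aggregate each group; combine: collect the results.
(defn split-apply-combine [key-fn aggregate rows]
  (into {}
        (map (fn [[k group]] [k (aggregate group)]))
        (group-by key-fn rows)))

(split-apply-combine :category
                     (fn [group]
                       (let [xs (map :value group)]
                         (/ (reduce + xs) (count xs))))
                     [{:category :a :value 1} {:category :a :value 3}
                      {:category :b :value 10}])
;;=> {:a 2, :b 10}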

otfrom14:03:32

am I being a bit dumb in that I can't find a good relational operator to join 2 datasets according to shared keys? (inner, left or right outer)

otfrom14:03:06

rickmoynihan: I think some of the limitations around data frames make some of the things that fit well there a bit easier as you don't need to worry as much about data structure shape

otfrom14:03:16

but IANA R coder

rickmoynihan14:03:58

yeah that's certainly true - but then part of that is perhaps because clojure doesn't really like encapsulation

rickmoynihan14:03:22

record based implementations have to worry quite a lot about the types stored in their keys etc... because otherwise equality breaks - that kinda thing - if you have a stronger protocol/type system you don't need to do so much runtime checking

otfrom16:03:30

bumping my earlier silly question: am I being a bit dumb in that I can't find a good relational operator to join 2 datasets according to shared keys? (inner, left or right outer)

otfrom16:03:49

particularly in clojure.core.matrix.dataset ?
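
One workaround while looking for that: clojure.set/join does a natural (inner) join on whatever keys two relations share, though it works on sets of plain row maps rather than core.matrix datasets, and it doesn't cover left or right outer joins:

(require '[clojure.set :as set])

;; Inner join on the shared :id key; only rows present on both sides survive.
(set/join #{{:id 1 :name "a"} {:id 2 :name "b"}}
          #{{:id 1 :score 10} {:id 3 :score 7}})
;;=> #{{:id 1 :name "a" :score 10}}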

jsa-aerial18:03:27

Re: "I am curious why you think core.matrix is a standard?" Because it is the de facto standard, written up in several books and other documentation.

blueberry18:03:47

So, if I include Neanderthal examples in a book, or a scientific article, would that make it a standard? (Neanderthal has already appeared in a publication, and there is other documentation too.)

blueberry18:03:35

Of course, this is not how standards (de jure or de facto) are made, but if core.matrix works well for many people, I am happy that it does.

jsa-aerial18:03:50

You also need the traction - to some extent this is a case of 'first one with a plan wins'. Chances are you won't get that now, because core.matrix is settled into the ecosystem. 'Standards' become that because of people using them and that is exactly what is happening with core.matrix. But it also helps that the concepts behind core.matrix (especially the transparent pluggable impls) are really good.

blueberry18:03:46

But I am not after that. I created Neanderthal because I needed it as a base for my other projects, and open-sourced it because: why not. Many people don't need it since they are happy with core.matrix, and that's OK. Some other people experiment with Neanderthal, and I suppose that is also OK.

jsa-aerial18:03:02

You don't seem to realize that core.matrix is really an API, an interface if you will (which as mikera points out is not set in concrete), that can sit over many implementations. This enables user code to take advantage of new ('better' in certain ways - say performance) implementations without the need to change the user code. So, if Neanderthal (or perhaps a lower impl level of it) were another implementation of the c.m API (which maybe would require a tweak or two as well), a user could take advantage of its apparent performance advantage in their current code. This seems like an obvious clear win for all concerned. I'm at a loss to understand why you feel that this is somehow undesirable.

blueberry18:03:15

Again, I said that many times: I think that would be desirable. Please make it happen.