This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2016-03-14
Mike, I explained many times in detail what's wrong with core.matrix, and I think it is a bit funny that you jump in every time Neanderthal is mentioned with the same dreams about core.matrix, without even trying Neanderthal, or discussing the issues that I raised. Every time your answer is that core.matrix is fine for YOUR use cases. That's fine with me and I support your choice, but core.matrix fell short for MY use cases, and after detailed inspection I decided it was unsalvageable. If I thought I could improve it, it would have been easier for me to do that than to spend my time fiddling with JNI and GPU minutiae.
I understand your emotions about core.matrix, and I empathize with you. I support your contributions to the Clojure open-source space, and am glad if core.matrix is a fine solution for a number of people. Please also understand that it is not a solution to every problem, and that it can also be an obstacle when it falls short in a challenge.
I support you in developing separate BLAS for core.matrix. If you do a good job, I might even learn something new and improve Neanderthal.
I am happy that you think that it is not much work, since it will be easy for you or someone else to implement it 😉 Contrary to what you said on slack, I am not against it. I said that many times. Go for it. The only thing that I said is that I do not have time for that, nor do I have any use for core.matrix. Regarding Windows - Neanderthal works on Windows. I know this because a student of mine compiled it (he's experimenting with an alternative GPU backend for Neanderthal and prefers to work on Windows). As I explained to you in the issue that you raised on GitHub last year, you have to install ATLAS on your machine, and Neanderthal has nothing un-Windowsy in its code. There is nothing Neanderthal-specific there, it is all about compiling ATLAS. Follow any ATLAS, NumPy + ATLAS, or R + ATLAS guide for instructions. Many people did that installation, so I doubt it'd be a real obstacle for you.
I have an Incanter question. I have a dataset and I'm trying to create multiple aggregate statistics simultaneously. Basically, I need to group by two columns and then create five different summary statistics for those two columns (mostly just simple mean, standard deviation, median type of operations). I know that I can do something like this for a single function:
(i/$rollup stats/mean :column-name-goes-here [:group-by-column-1 :group-by-column-2] data)
Is there a way to do that for multiple functions at once?
Right now, I'm doing each rollup calculation separately, and then bringing them all together with conj-cols
And that feels inefficient to me
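One way to avoid the repeated rollups, sketched here under the assumption that Incanter's `$group-by`, `$`, and `to-dataset` behave as documented, is to group once and compute all the statistics per group. `multi-rollup` and the column names below are placeholders, not Incanter API:

```clojure
(require '[incanter.core :as i]
         '[incanter.stats :as stats])

(defn multi-rollup
  "Group `data` by `group-cols` and compute every [name f] pair in
  `aggregates` over `value-col`, returning a single dataset."
  [data group-cols value-col aggregates]
  (->> (i/$group-by group-cols data)
       ;; $group-by returns a map of {group-key-map sub-dataset}
       (map (fn [[group-keys sub-ds]]
              (reduce (fn [row [agg-name f]]
                        (assoc row agg-name (f (i/$ value-col sub-ds))))
                      group-keys
                      aggregates)))
       i/to-dataset))

;; Hypothetical usage, with the question's placeholder column names:
;; (multi-rollup data
;;               [:group-by-column-1 :group-by-column-2]
;;               :column-name-goes-here
;;               {:mean   stats/mean
;;                :sd     stats/sd
;;                :median stats/median})
```

This does one grouping pass instead of five, at the cost of stepping outside `$rollup`.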
@blueberry I chime in on this because I'm trying to create a good data science ecosystem for Clojure. It will be better if people develop systems with dependencies on standard APIs, and core.matrix is the standard API for array programming in Clojure
I have nothing against Neanderthal as an implementation, I'm sure it works well for your use cases. But that should be an implementation, rather than focusing on a separate API
If you have any issues with core.matrix the process is simple: file an issue. I haven't yet heard any valid technical objections from you, happy to discuss if you have any and consider how they might be addressed
The worst case is if we end up with a situation like in Java, where everyone develops separate incompatible matrix libraries. We really don't want that for Clojure
@mikera well, in the words of Yoda: wishes do not a standard make. I am curious why you think core.matrix is a standard? It is the most popular api in a tiny niche, but it is far away from standard. BLAS, on the other hand, is. Why do you think BLAS was not right? If the only reason was that you wanted NDArrays, why didn't you base the 1D and 2D parts on BLAS, and extend only the nD stuff?
core.matrix is the only library in the Clojure ecosystem that provides a general purpose abstraction for n-dimensional array programming (analogous to NumPy). To my knowledge it is the only such library for any language that provides truly pluggable implementations thanks to Clojure protocols
If you want a blas-type API in core.matrix, it is actually pretty trivial to do: I added a few functions recently because someone requested these: https://github.com/mikera/core.matrix/blob/develop/src/main/clojure/clojure/core/matrix/blas.cljc
- It depends on mutability. I don't think that is a good "default" API, even if you might want it sometimes for performance reasons
- There are a bunch of unnecessary arguments. All the LDA / LDB stuff in gemm for example is tied to a particular array representation. Not well enough abstracted from the memory layout IMO
- It is fundamentally 1D / 2D. That's a big restriction, a lot of data science stuff uses higher dimensional arrays
There's nothing to stop an implementation in core.matrix sending the 1D / 2D stuff to BLAS though, and using an alternative technique for higher dimensions
The vectorz-clj implementation actually does something like that anyway, there is optimised code for the 1D / 2D case and more generic code for the ND cases. My point is that the user shouldn't have to care, it is an implementation detail.
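The "user shouldn't have to care" point can be illustrated with a small sketch against the core.matrix API (using the real `set-current-implementation`, `matrix`, and `mmul` functions; the `:vectorz` keyword assumes vectorz-clj is on the classpath):

```clojure
(require '[clojure.core.matrix :as m])

;; Pick a backend once; user code below never mentions it again.
(m/set-current-implementation :vectorz)

(def a (m/matrix [[1 2]
                  [3 4]]))

;; The same mmul call works whether the backend dispatches the 2D
;; case to optimised code, BLAS, or a generic ND routine.
(m/mmul a a)
```

Swapping `:vectorz` for another registered implementation changes the storage and the fast paths, but not a line of the calling code.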
By the way @blueberry
I do like the work you have done with Neanderthal, I think it has some great implementation ideas. I'd even like to try it out on one of my machine learning projects where the mmul would be really helpful
Hence my offer to help you implement the core.matrix protocols still stands. Would like to work with you on this
If I could only get it to build, I could probably have the implementation up and running in a couple of hours (most of the protocols are optional... you only need to implement them if you need to for performance reasons)
@mikera Neanderthal completely covers lda, strides, offset and such stuff, and the API is completely independent from that. The user only has to say (dge m n) and there is a matrix. (mm a b) - they are multiplied!
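Expanding the two calls mentioned above into a runnable sketch (the namespaces follow Neanderthal's documentation of that era; treat the exact requires as an assumption):

```clojure
(require '[uncomplicate.neanderthal.core :refer [mm]]
         '[uncomplicate.neanderthal.native :refer [dge]])

;; dge creates a native double-precision dense matrix; the source
;; sequence fills it in column-major order.
(def a (dge 2 3 [1 2 3 4 5 6]))   ; 2x3 matrix
(def b (dge 3 2 [1 2 3 4 5 6]))   ; 3x2 matrix

;; mm multiplies, producing a 2x2 result; lda/strides/offsets are
;; handled internally and never appear in user code.
(mm a b)
```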
What I propose is that the Neanderthal core.matrix implementation for mmul would just delegate to dge etc. as necessary. Would be a pretty lightweight wrapper
And the wrapper would handle coercions, so that stuff like (mmul neanderthal-array clojure-vector) would "just work"
i.e. you can focus on just making a great implementation, and let core.matrix handle the messy integration stuff. Does that make sense?
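The proposed wrapper could be sketched as a core.matrix protocol extension. `PMatrixMultiply` is a real core.matrix protocol, but `NeanderthalMatrix`, `neand-mm`, and `coerce-arg` below are hypothetical placeholder names for illustration only:

```clojure
(require '[clojure.core.matrix.protocols :as mp])

;; Hypothetical: make core.matrix's mmul delegate to Neanderthal's
;; native multiply for Neanderthal-backed matrices.
(extend-protocol mp/PMatrixMultiply
  NeanderthalMatrix                        ; placeholder type name
  (matrix-multiply [a b]
    ;; coerce-arg would convert Clojure vectors etc. into a
    ;; Neanderthal matrix, so (mmul neanderthal-array clojure-vector)
    ;; "just works", as described above.
    (neand-mm a (coerce-arg b)))
  (element-multiply [a b]
    (neand-element-mul a (coerce-arg b)))) ; placeholder
```

Most other protocols are optional, so a wrapper like this could start small and add fast paths only where profiling justifies them.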
mikera: is it worth pinning a message on those other channels pointing people to here?
Could be... though I guess the other channels are still valuable for more specific topics. Quite a bit of discussion happens on #C0533TY12 as well
mikera: I found it to be a bit ghost townlike until I started chatting with @rickmoynihan
though I will admit that I'm more of a glommer than a splitter when it comes to channels
@mikera but that is precisely my main objection: core.matrix silently does silly stuff that I want to avoid. People are happily multiplying sequences with arrays with core.matrix, and then scratching their head trying to figure out why it has the speed of a snail.
I deliberately implemented checks in Neanderthal that stop you when you try to shoot yourself in the foot.
I see all that as implementation detail. Mostly you don't care about performance and convenience is more important. If you do care, you should profile and figure out which hotspot you need to optimise.
@mikera: regarding core.matrix protocols - specifically for Dataset (not Matrix)... am I right that the core.matrix implementation is eagerly loaded into memory? I'd like to build a lazy/reducible (and possibly transducible) implementation - for ETL tasks.... I'm curious if you have any thoughts on that?
@rickmoynihan: The current implementation is eager, correct. There's nothing to stop a lazy implementation.... though I'm not entirely sure how you would want the semantics to work. A lot of operations would need to realise the whole dataset anyway. For data loading / ETL I would probably just use a combination of regular functions / transducers / reducing functions on a lazy sequence of input data and accumulate the results into an array / dataset of the appropriate shape.
@mikera: for a lot of the ETL we do - we don't need (or want) to hold the whole dataset in memory - because it's too big - it does all need to be consumed eventually of course...
We do this already in grafter - which uses the row oriented incanter.core.Dataset
representation but our custom functions where possible prefer to put a lazy sequence in the :rows
... and we tend to avoid most of incanter because of its eagerness preference... not that there aren't problems with this approach too of course...
I think the ideal from my perspective would be to be able to use a transducer inside the dataset - and then leave the decision of lazy-seq/channel/reducible to the outermost process... I'm quite interested in the idea of using a reducer to reduce into the target without paying the costs of laziness - but avoiding (where possible) holding everything in RAM
though I haven't explored this idea in any depth yet
I think reducing into the target is the way to go - assuming your target will fit in memory (the common case?). That way you can just discard the lazily loaded data after it is processed (being careful not to hold onto the head, of course ) If your target itself is bigger than memory it's going to get tricky whatever you do, you'll maybe have to look at things like Spark RDDs etc.
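The "reduce into the target" idea above can be sketched in plain Clojure with a transducer: rows are consumed once and discarded, and only the accumulated target is retained. `parse-row` and `valid-row?` are placeholder functions standing in for real ETL steps:

```clojure
;; A minimal sketch: transduce pulls each row through the pipeline
;; exactly once, so no intermediate lazy sequences are built and
;; nothing holds onto the head of the input.
(defn load-into-target
  [rows]
  (transduce (comp (map parse-row)      ; placeholder: parse one row
                   (filter valid-row?)) ; placeholder: drop bad rows
             conj
             []
             rows))
```

The same `(comp (map ...) (filter ...))` transducer could later be fed to a channel or a reducible source without rewriting the transformation itself, which is the "decide at the outermost process" property mentioned earlier.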
@mikera: that's what we currently do
but lazy sequences are pretty expensive in terms of object allocations compared to say reducers - plus reducers are easier to close properly
It would surprise me if the object allocations are the real bottleneck, assuming you are doing some non-trivial work with each data item
rickmoynihan: not sure if iota actually helps any of that. It would allow you to use reducers over something larger than memory (tho not larger than disk). Never been quite sure about the perf though.
@mikera: well I don't know for sure - but I do know the garbage collector gets hit pretty hard
@otfrom: we don't need larger than disk just now
I've been meaning to play with iota and read its code a bit more thoroughly - because it's pretty close to what I was wanting - though it might benefit from more generality... also most of it seems to be in the Java classes
I guess I don't quite yet see the benefit of having a Dataset-type implementation for your source data if all you are going to do is reduce / transduce over its rows. That would mean your dataset wouldn't be able to hold onto its head, which would be an odd implementation....
I've been thinking about where dplyr like things go too. I think that some are datascrubbing and some are more transformation as a part of calcs, but I think I need to think about it more.
@mikera: isn't holding onto the head only useful when you need to return to previous rows though? If you're just mapping a function over the rows and outputting them somewhere - without needing to aggregate/rollup etc...
@otfrom: by scrubbing - do you mean data cleansing?
@rickmoynihan: yes that is exactly right. But then I don't see what data structure makes sense for your dataset: a record with :rows and :column-names would be holding on to the head, for example if :rows was a lazy seq
@mikera: I've thought about that too - and I think that is only an issue if you realise the dataset with the dataset still bound... I think if you use the dataset merely as a vehicle for keeping the :rows computation and the :column-names together - you can be careful to ensure locals clearing still kicks in
Makes sense. But then I wonder why bother having the Dataset in the first place..... is it just for column name tracking etc?
it holds the order of columns
which can be important
but keeping them in sync is a bit of a pain - as every dataset operation has to worry about it
@mikera: part of what we want to do with grafter is build user interfaces for building transformations.... openrefine style... e.g. this is a prototype interface we helped a project partner build: https://www.youtube.com/watch?v=zAruS4cEmvk
it's something we'd do ourselves if we weren't resource constrained - and one of our FP7 project partners needed something to do - so we had them build that on top of grafter
but those kinda interfaces require columns to be tracked - and maintain a stable/predictable order
though munging is still something that might be done w/the dataset, (munging being rearranging more than cleansing)
@otfrom: Yeah - in my experience data cleaning is usually more amenable to row/stream processing... you don't normally need multiple passes over the data... but clearly the numerical use cases and rearranging ones do need that
what are you using currently for dplyr? We've used a tweaked version of this in the past: https://github.com/tomfaulhaber/split-apply-combine
but it's pretty much unmaintained - I'm curious where the inertia is now...
rickmoynihan: just straight up clojure, but I have some R converts who are missing things
yeah - we have one of those too
I think the split apply combine approach is pretty interesting - it's obviously got some interesting parallels to map/reduce, fork/join/reducers etc...
I know next to nothing about R - but I skimmed over the R split apply combine paper a while back - and it seemed that it could be a lot more general in clojure...
am I being a bit dumb in that I can't find a good relational operator to join 2 datasets according to shared keys? (inner, left or right outer)
rickmoynihan: I think some of the limitations around data frames make some of the things that fit well there a bit easier as you don't need to worry as much about data structure shape
yeah that's certainly true - but then part of that is perhaps because clojure doesn't really like encapsulation
record based implementations have to worry quite a lot about the types stored in their keys etc... because otherwise equality breaks - that kinda thing - if you have a stronger protocol/type system you don't need to do so much runtime checking
bumping my earlier silly question: am I being a bit dumb in that I can't find a good relational operator to join 2 datasets according to shared keys? (inner, left or right outer)
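For plain seqs of row maps (as opposed to Incanter datasets), `clojure.set/join` does a natural inner join on the shared keys out of the box; Incanter also has `incanter.core/$join` for datasets, but to my knowledge left/right outer joins have to be hand-rolled:

```clojure
(require '[clojure.set :as set])

;; Natural inner join on the shared key :id.
(set/join #{{:id 1 :name "a"}
            {:id 2 :name "b"}}
          #{{:id 1 :score 9}})
;; => #{{:id 1 :name "a" :score 9}}

;; An explicit key mapping can be given when the column names differ:
;; (set/join xrel yrel {:id :other-id})
```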
Re: I am curious why you think core.matrix is a standard? Because it is the de facto standard, written up in several books and other documentation.
So, if I include Neanderthal examples in a book, or a scientific article, that would make it a standard? (Neanderthal has already appeared in a publication, and there is also other documentation).
Of course, this is not how standards (de jure or de facto) are made, but if core.matrix works well for many people, I am happy that it does.
You also need the traction - to some extent this is a case of 'first one with a plan wins'. Chances are you won't get that now, because core.matrix is settled into the ecosystem. 'Standards' become that because of people using them and that is exactly what is happening with core.matrix. But it also helps that the concepts behind core.matrix (especially the transparent pluggable impls) are really good.
But I am not after that. I created Neanderthal because I needed it as a base for my other projects, and open-sourced it because: why not. Many people don't need it since they are happy with core.matrix, and that's OK. Some other people experiment with Neanderthal, and I suppose that is also OK.
You don't seem to realize that core.matrix is really an API, an interface if you will (which as mikera points out is not set in concrete), that can sit over many implementations. This enables user code to take advantage of new ('better' in certain ways - say performance) implementations without the need to change the user code. So, if Neanderthal (or perhaps a lower impl level of it) were another implementation of the c.m api (which maybe would require a tweak or two as well), a user could take advantage of its apparent performance advantage in their current code. This seems like an obvious clear win for all concerned. I'm at a loss to understand why you feel that this is somehow undesirable.