This page is not created by, affiliated with, or supported by Slack Technologies, Inc.

## 2018-07-08

## Channels

- # aws-lambda (3)
- # beginners (27)
- # boot (4)
- # cljs-dev (26)
- # clojure (7)
- # clojure-spec (8)
- # clojure-uk (17)
- # clojurescript (1)
- # core-typed (2)
- # data-science (53)
- # datomic (24)
- # emacs (1)
- # fulcro (7)
- # luminus (1)
- # off-topic (2)
- # onyx (3)
- # pedestal (5)
- # planck (2)
- # portkey (50)
- # re-frame (15)
- # reagent (5)
- # reitit (2)
- # shadow-cljs (19)
- # tools-deps (15)
- # vim (2)

perhaps comparing performance to sklearn's pca is useful as well... https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/decomposition/pca.py#L107

@aaelony Neanderthal does not come with an explicit function to do PCA, since that's out of scope. PCA is a statistical technique and belongs in a statistical library (based on Neanderthal) 🙂 On top of that, there are many ways of doing PCA; it is not one algorithm.

@justalanm good work, btw!

Hopefully Neanderthal will still perform faster when we just compare the matrix multiplication.

That's correct, PCA may be too high-level. I noticed Alan's PCA implementation, though, and thought to mention sklearn.

@aaelony the issue with sklearn is that it is a huge framework. If you look at that implementation, you'll see that most of the code is if conditions and other checks; even my simple NumPy implementation will outperform that one.
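
For reference, a bare-bones NumPy PCA along the lines of what is being discussed might look like the sketch below. This is not Alan's actual code, just a minimal illustration of the "no checks, just math" approach being contrasted with sklearn's heavily validated implementation.

```python
import numpy as np

def pca(x, n_components):
    """Minimal PCA via eigendecomposition of the covariance matrix.

    No input validation, dtype juggling, or solver selection -- the
    opposite of the check-heavy sklearn implementation linked above.
    """
    centered = x - x.mean(axis=0)           # center each column
    cov = np.cov(centered, rowvar=False)    # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # re-sort descending by variance
    components = eigvecs[:, order[:n_components]]
    return centered @ components            # project onto top components
```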

@justalanm that's part of the slowdown, but I am sure that the majority of the slowdown, at least with larger matrices, is dtype (unless NumPy does something incredibly silly, which I have to guess it doesn't) 🙂 Thank you for checking this! (BTW, there is no need to include chained matrix multiplication in the benchmarks.)
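
To make the dtype point concrete: NumPy arrays default to double precision, so if one side of a benchmark runs in single precision and the other in double, the dtype alone can account for roughly a 2x gap in memory traffic and BLAS throughput. A minimal sketch of the pitfall:

```python
import numpy as np

a64 = np.ones((256, 256))                     # NumPy defaults to float64
a32 = np.ones((256, 256), dtype=np.float32)   # explicit single precision

# float32 halves the memory traffic, and single-precision GEMM is
# typically about twice as fast as double-precision for large matrices,
# so an apples-to-apples benchmark must pin the dtype on both sides.
```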

@blueberry results updated, now they are basically even

@alan, that's right, and that should be emphasized. A lot of that sklearn PCA Python code, IMHO, is there to overcome the limitations of Python (for example, see incremental PCA). I certainly think that using Neanderthal is superior, and wanted to highlight that any PCA performance benchmark should include timings for one of the most popular Python implementations, i.e. sklearn, to clearly illustrate the benefits of using Neanderthal.

@aaelony that's exactly the reason why I'm implementing PCA in Neanderthal: I wanted to compare a real-life scenario with somewhat realistic workloads.

After that, and depending on the results, it might be a good idea to benchmark sklearn as well.

This might even turn into a first step towards a full-fledged machine-learning framework based on Neanderthal 😅

I think that's great. How did you come to pick PCA, though? Why not compare SVD first, or something like that?

My colleagues and I use PCA and similar reduction techniques all the time, and we do it mostly with sklearn. Since I'm trying to "sell" Clojure to some of them, I thought it would be interesting.

In case you didn't know, Incanter implements it like this: https://github.com/incanter/incanter/blob/master/modules/incanter-core/src/incanter/stats.clj#L2689

@justalanm @whilo @aaelony Do you know if/how NumPy can offload work onto the GPU? Is it built in, supported through third-party add-ons, or something else? That's actually where I expect Neanderthal to be competitive. Having performance equal to the most used state-of-the-art library on the CPU is also not bad 🙂

And this is only matrix multiplication. Neanderthal's strong feature is that it offers fine-grained memory-level control (useful for minimizing copying and other overhead when implementing higher-level algorithms such as PCA).

We'll see. I have a good feeling that we can demonstrate that Clojure is competitive when it comes to performance now.

It recently became memory-compatible with NumPy (0.4.0), so you only have to copy the object to CPU memory and can then access it with NumPy functions as well.

So, if I understand correctly, NumPy has no way of GPU computing. What's being done in the Python world is that there are other libraries (PyTorch etc.) that have their own GPU computing, know how to transfer memory from NumPy, and also offer a NumPy-compatible API?

PyTorch sends tensors to the GPU not just "because it can", but to work with cuDNN and so on.

There are https://mathema.tician.de/software/pycuda/ and https://numba.pydata.org/, which give access to the GPU from Python.

@justalanm pycuda and Numba are general GPU programming libraries, if I understand correctly? They do not offer ready-made NumPy functions that work on the GPU, but give you an environment for GPU programming, so you might be able to implement these functions yourself? And then there are libraries like PyTorch that come with GPU implementations for the stuff that they need (DNN etc.) but do not cover NumPy completely? Or do they come with a complete, transparent implementation of (a large part of) the NumPy API?

Or, to make my question a bit less fuzzy: the NumPy PCA code that you posted today, can it run on the GPU when you include some third-party Python GPU-enabling library?

@blueberry got it. I can speak to Numba, which I've used; it covers part of NumPy: https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html

But not everything, and it is really likely that pure NumPy code won't run as fast as possible unless it is changed according to the Numba philosophy.

OK, but at least we can compare Neanderthal-on-the-GPU code with NumPy + Numba code, and it will be an adequate comparison? (Meaning, it's the best way to do NumPy programming on the GPU.)

Meh... the issue is that there is JIT compilation on LLVM, and benchmarking it might be a hassle.

Anyway, full benchmark results: https://gitlab.com/alanmarazzi/numpy-vs-neanderthal/tree/master

It is really likely I did something stupid, especially with Neanderthal, so if there is something to fix just let me know 😄

@justalanm I would not say that it's stupid, but you are making the gravest of mistakes: using pure functions in Neanderthal. They basically make a fresh copy of the data every time they are evaluated! Use functions with the ! suffix (mm! instead of mm, svd! instead of svd, etc.) whenever possible, which is practically always when implementing an algorithm like this one.
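
The same copy-avoidance principle exists on the NumPy side of the benchmark, which makes it easy to sketch here: the `out=` parameter writes the result into a preallocated buffer instead of allocating a fresh array on every call, roughly analogous to Neanderthal's bang functions.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.random((200, 200))
b = rng.random((200, 200))
out = np.empty((200, 200))

c = a @ b                 # pure: allocates a fresh 200x200 result array
np.matmul(a, b, out=out)  # destructive: reuses the preallocated buffer
```

In a tight benchmark loop, the destructive form avoids one large allocation (and the resulting GC pressure) per iteration.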

@justalanm also, you can compute means much more efficiently. See, for example, how I did a similar thing in Bayadera for plain matrices (and there is an even faster custom implementation, but only for the GPU in Bayadera): https://github.com/uncomplicate/bayadera/blob/5d4344a3480b5ba5a0a7488e5a9904611163a76d/src/clojure/uncomplicate/bayadera/internal/extensions.clj#L50-L55

And here is how it is used for the pure variant (when needed): https://github.com/uncomplicate/bayadera/blob/5d4344a3480b5ba5a0a7488e5a9904611163a76d/src/clojure/uncomplicate/bayadera/internal/extensions.clj#L75-L80

Not to mention that you don't need slow Clojure code like (fmap inc (fv n)): use functions from the vect-math package for that. Besides, this code does not do what you intend: it creates a vector of zeros, and then increments them one by one into a new vector of ones! If you need to create a vector of ones, use, for example, (entry! (raw (row a 0)) 1.0), which will be much, much faster.
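
The same anti-pattern is easy to show in NumPy terms, since that's the other side of the benchmark: bumping zeros one element at a time in an interpreted loop versus a single vectorized fill (the rough analogue of entry!).

```python
import numpy as np

n = 10_000

# naive: start from zeros and bump each entry in a Python-level loop
slow = np.zeros(n)
for i in range(n):
    slow[i] += 1.0

# vectorized: one call, no per-element interpreter overhead
fast = np.full(n, 1.0)
```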

1) Don't create new objects unnecessarily. Create a few working matrices, and call destructive functions on them.

2) Don't write naive iterations over the elements of a vector when you have vectorized functions in vect-math (they work on both vectors and matrices in one sweep).

3) Don't write naive iterations over the columns and rows of a matrix if you can achieve the same with linear algebra operations (such as mv! in the example of calculating means).
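
Tip 3, sketched in NumPy notation since that's what the benchmark compares against: column means can be computed as a single matrix-vector product, the same trick as the linked Bayadera code, instead of looping over columns.

```python
import numpy as np

a = np.arange(12.0).reshape(4, 3)  # 4 observations x 3 variables
ones = np.ones(a.shape[0])

# one GEMV call instead of a loop over columns:
# means[j] = (1/n) * sum_i a[i, j]  ==  (a^T @ 1) / n
means = (a.T @ ones) / a.shape[0]
```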

Maybe there are more details, but these are the huge ones. Usually such speedups are one or more orders of magnitude.

Also, you don't need an extra copy of m at the beginning. Calculate the covariance; it will give you a fresh copy, and then you can call destructive functions on that matrix. Your original matrix stays untouched, and the whole function can be pure...

The inner let should be with-release. Each call creates a lot of off-heap garbage. It is eventually cleaned up by Java's GC anyway, but with-release will do it explicitly and keep memory use minimal. Plus, things will be faster that way, especially with tight loops in the benchmark.