This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2018-07-07
Channels
- # beginners (19)
- # cider (20)
- # cljs-dev (2)
- # cljsjs (2)
- # clojure (61)
- # clojure-spec (13)
- # clojure-uk (12)
- # clojurescript (12)
- # core-typed (1)
- # cursive (16)
- # data-science (30)
- # datomic (37)
- # fulcro (8)
- # hoplon (7)
- # jobs-discuss (1)
- # onyx (21)
- # planck (18)
- # protorepl (1)
- # re-frame (2)
- # reagent (1)
- # shadow-cljs (6)
- # tools-deps (4)
At worst, you can always "shell out" calls to R scripts with Conch (https://github.com/Raynes/conch/blob/master/README.md) as well.
@blueberry I've put benchmarks in the repo: https://gitlab.com/alanmarazzi/numpy-vs-neanderthal/tree/master. Anyway, I need your (or someone else) help to implement PCA in Neanderthal
@justalanm how would you implement it in an environment that you're familiar with? please post that code and I could give you hints about how to approach it in neanderthal.
@justalanm are you sure that Numpy uses MKL as a backend there? These results look great for Clojure and Neanderthal, since it is 2x - 20x faster across the board (for straightforward in-place matrix multiplication).
@justalanm does numpy in your code use single-precision 32-bit floats, or double precision (64 bits)? It seems you're using the default doubles. Can you also try it with dtype=np.float32?
I suspect that with float32 Neanderthal will still be faster, but not 2x for huge matrices.
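The dtype switch being suggested can be sketched like this (a minimal illustration, with a hypothetical matrix size; the actual benchmark repo sweeps a range of sizes):

```python
import numpy as np

n = 256  # hypothetical size for illustration

a64 = np.random.rand(n, n)    # NumPy's default dtype is float64
a32 = a64.astype(np.float32)  # single-precision copy for the comparison

# The result dtype follows the operand dtype, so timing a32 @ a32
# measures single-precision BLAS, directly comparable to float matrices
# on the Neanderthal side.
c64 = a64 @ a64
c32 = a32 @ a32

print(c64.dtype)  # float64
print(c32.dtype)  # float32
```

Passing dtype=np.float32 to the array constructor (e.g. np.random.rand(...).astype(np.float32), or np.zeros(shape, dtype=np.float32)) achieves the same thing without the intermediate double-precision array.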
@blueberry yes Numpy uses MKL by default when installed with conda (https://github.com/conda/conda/issues/2032)
This is how I implemented PCA with Numpy:
import numpy as np

def pca(m, n, n_components=2):
    """Takes `m`, `n` dimensions and the
    `n_components` number of principal
    components and returns the reduction of
    the generated matrix in `n_components`"""
    mat = np.random.rand(m, n)
    x = mat.copy()
    mat -= np.mean(mat, axis=0)
    cov = np.cov(mat, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    idx = np.argsort(evals)[::-1]
    evals = evals[idx]
    evecs = evecs[:, idx]
    evecs = evecs[:, :n_components]
    return np.dot(evecs.T, x.T).T
The relevant comparison is double vs double and float vs float for both frameworks. These results are Neanderthal float vs Numpy double, and single precision alone is expected to be roughly twice as fast...
I am not sure what each of these Numpy functions does. The first question: is this a naive PCA or one of the better-performing ones?
For which parts of this code do you find it difficult to work out the equivalent in neanderthal?
Basically I center the matrix, then get the covariance matrix, calculate eigenvals & eigenvecs (np.linalg.eigh, which works on non-square matrices as well), then it is just sorting, selecting n_components and take the transpose of the dot product between eigen vectors and the matrix (np.dot(evecs.T, x.T).T)
I'm good up to covariance (though I'm not sure whether my implementation is going to be the best for performance) but then I have issues getting eigen values and vectors
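For the covariance step specifically: on an already-centered matrix, np.cov(mat, rowvar=False) reduces to a single scaled matrix multiply, which maps directly onto a BLAS mm call in any backend. A minimal sketch (sizes are hypothetical):

```python
import numpy as np

m, n = 50, 5
mat = np.random.rand(m, n)
mat -= mat.mean(axis=0)  # center each column (variable)

# np.cov with rowvar=False treats columns as variables; on centered data
# it is just the Gram matrix of the columns divided by (m - 1).
cov_np = np.cov(mat, rowvar=False)
cov_manual = (mat.T @ mat) / (m - 1)

print(np.allclose(cov_np, cov_manual))  # True
```

Expressed this way, the covariance step is one transpose-multiply, so performance comes down to the underlying BLAS rather than the wrapper.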
Did you check out the documentation? https://neanderthal.uncomplicate.org/codox/uncomplicate.neanderthal.linalg.html#var-ev.21 and https://neanderthal.uncomplicate.org/codox/uncomplicate.neanderthal.linalg.html#var-es.21
Also, there are a ton of raw orthogonal factorizations available for cases where you don't need the full ev or es algorithm...
Also this blog post: https://dragan.rocks/articles/17/Clojure-Linear-Algebra-Refresher-Eigenvalues-and-Eigenvectors
I'm actually going through that series to refresh everything algebra related and to get a better idea about Neanderthal
@justalanm you are also creating the matrices in the benchmark, right? https://gitlab.com/alanmarazzi/numpy-vs-neanderthal/blob/master/numpy_bench.py#L11
in the neanderthal benchmark you only call the benchmark code after allocating this memory: https://gitlab.com/alanmarazzi/numpy-vs-neanderthal/blob/master/neanderthal-bench/src/neanderthal_bench/core.clj#L25