#data-science
2018-07-07
aaelony15:07:37

at worst, you can always "shell out" calls to R scripts with Conch (https://github.com/Raynes/conch/blob/master/README.md) as well.

alan17:07:16

@blueberry I've put benchmarks in the repo: https://gitlab.com/alanmarazzi/numpy-vs-neanderthal/tree/master. Anyway, I need your (or someone else) help to implement PCA in Neanderthal

alan17:07:02

😅 I'm very new to Neanderthal and to the JVM in general

blueberry22:07:57

@justalanm how would you implement it in an environment that you're familiar with? please post that code and I could give you hints about how to approach it in neanderthal.

blueberry23:07:10

@justalanm are you sure that Numpy uses MKL as a backend there? These results look great for Clojure and Neanderthal, since Neanderthal is 2x - 20x faster across the board (for straightforward in-place matrix multiplication).

blueberry23:07:38

@whilo @aria42 see the last few messages in data-science.

blueberry23:07:42

@justalanm does numpy in your code use single-precision 32-bit floats, or double-precision (64-bit)? It seems you're using the default doubles. Can you also try it with dtype=np.float32?
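A minimal sketch of that suggestion (array sizes and names here are arbitrary, not taken from the benchmark repo): allocating the NumPy matrices in single precision instead of the default float64, so the comparison is float vs float.

```python
import numpy as np

n = 256
a64 = np.random.rand(n, n)                     # default dtype is float64
a32 = np.random.rand(n, n).astype(np.float32)  # single-precision copy
b32 = np.random.rand(n, n).astype(np.float32)

# the product stays in float32, matching Neanderthal's single-precision matrices
c32 = a32 @ b32
```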

blueberry23:07:17

only the fast tests. Odd tests are not relevant here.

blueberry23:07:53

I suspect that with float32 Neanderthal will still be faster, but not 2x for huge matrices.

alan23:07:08

@blueberry yes Numpy uses MKL by default when installed with conda (https://github.com/conda/conda/issues/2032)

alan23:07:38

I'll try with float32 as well

alan23:07:11

This is how I implemented PCA with Numpy:

import numpy as np

def pca(m, n, n_components=2):
    """Generates a random `m` x `n` matrix and
    returns its projection onto the first
    `n_components` principal components."""

    mat = np.random.rand(m, n)
    x = mat.copy()
    # center each column at zero
    mat -= np.mean(mat, axis=0)
    # covariance matrix of the columns
    cov = np.cov(mat, rowvar=False)
    # eigendecomposition of the symmetric covariance matrix
    evals, evecs = np.linalg.eigh(cov)
    # sort eigenvectors by descending eigenvalue
    idx = np.argsort(evals)[::-1]
    evals = evals[idx]
    evecs = evecs[:, idx]
    # keep the leading n_components eigenvectors
    evecs = evecs[:, :n_components]

    # project the data onto the selected components
    return np.dot(evecs.T, x.T).T
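A quick sanity check of the function above (condensed and redefined here so the snippet runs on its own; the dimensions are illustrative): with m=100 rows, n=5 columns, and 2 components, the reduction keeps all 100 rows but only 2 columns.

```python
import numpy as np

def pca(m, n, n_components=2):
    mat = np.random.rand(m, n)
    x = mat.copy()
    mat -= np.mean(mat, axis=0)
    cov = np.cov(mat, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    idx = np.argsort(evals)[::-1]
    evecs = evecs[:, idx][:, :n_components]
    return np.dot(evecs.T, x.T).T

reduced = pca(100, 5, 2)
```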

blueberry23:07:53

The relevant comparison is double vs double and float vs float for both frameworks. These results are Neanderthal float vs Numpy double, and float is expected to be roughly twice as fast to begin with...

blueberry23:07:10

I am not sure what each of these Numpy functions does. The first question: is this a naive PCA or one of the better-performing ones?

blueberry23:07:37

for which parts of this code do you find it difficult to figure out what to do in Neanderthal?

alan23:07:55

Basically I center the matrix, then get the covariance matrix, calculate eigenvalues & eigenvectors (np.linalg.eigh, which expects a symmetric matrix, such as the covariance matrix), then it is just sorting, selecting n_components, and taking the transpose of the dot product between the eigenvectors and the matrix (np.dot(evecs.T, x.T).T)

blueberry23:07:01

Cool. Which parts do you know how to do in Neanderthal, and which parts don't you?

alan23:07:25

I'm good up to covariance (though I'm not sure whether my implementation is going to be the best for performance) but then I have issues getting eigen values and vectors
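For what it's worth, on centered data np.cov reduces to a single matrix product, Xcᵀ Xc / (n - 1), which is exactly the kind of matrix-multiply primitive a Neanderthal port could express in one call. A NumPy sketch with illustrative random data (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((100, 5))
xc = x - x.mean(axis=0)  # center the columns

# np.cov(x, rowvar=False) is just Xc^T Xc / (n - 1):
# one matrix multiplication on the centered matrix
cov_manual = xc.T @ xc / (x.shape[0] - 1)
cov_numpy = np.cov(x, rowvar=False)
```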

alan23:07:46

Well, I get them, but they're not as I'm used to 😄

blueberry23:07:20

Also, there are a ton of raw orthogonal factorizations available for cases where you don't need the full ev or es algorithm...
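One such route, sketched here in NumPy on illustrative random data (not the benchmark's): PCA via the SVD of the centered matrix avoids forming the covariance matrix at all, and the squared singular values scaled by 1/(n-1) agree with the eigh eigenvalues, while the right singular vectors match the eigenvectors up to sign.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.random((100, 5))
xc = x - x.mean(axis=0)

# eigendecomposition route (as in the pca function above)
cov = np.cov(xc, rowvar=False)
evals, evecs = np.linalg.eigh(cov)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# SVD route: no covariance matrix needed
u, s, vt = np.linalg.svd(xc, full_matrices=False)
svd_evals = s ** 2 / (x.shape[0] - 1)
```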

blueberry23:07:52

To get a feel for how they are called, check out the tests. There are lots of them.

alan23:07:30

I'm actually going through that series to refresh everything algebra-related and to get a better idea of Neanderthal

alan23:07:37

😄

alan23:07:23

In the next few days I'll try to implement it

alan23:07:35

I'll let you know how it goes!

whilo23:07:11

in the neanderthal benchmark you only call the benchmark code after allocating this memory: https://gitlab.com/alanmarazzi/numpy-vs-neanderthal/blob/master/neanderthal-bench/src/neanderthal_bench/core.clj#L25