#uncomplicate
2017-10-08
whilo13:10:47

Ok. I couldn't get the OpenCL namespace to load for [uncomplicate.neanderthal core opencl] before adding mkl.

whilo13:10:43

I think a pure opencl version might be useful in the longer run, as one could then also port neanderthal to ClojureScript with WebCL.

whilo13:10:40

This is not a feature request, just a thought I had. 🙂

blueberry13:10:27

Ah, I see. The CLBlast and cuBLAS engines load the MKL engine, but that could (potentially) easily be decoupled so that any default native engine is loaded. However, porting to ClojureScript would require much more than just WebCL. WebCL gives access to CL, but what about the actual linear algebra operations? Someone would have to implement those, which is practically a full-time commitment given the knowledge and time required. Even once implemented, the user would need those resources in the browser, plus hardware with appropriate drivers. In practical terms, I do not see how porting the computation engines to ClojureScript would be more useful than using them on the JVM. What would be useful in ClojureScript is supporting the transfer of data and poking around matrix structures for display and plotting, which is what some other libraries with ClojureScript ports are commonly used for. But that is available today: turn any Neanderthal matrix into a Clojure sequence by calling seq, or move it to an array or a vector by calling transfer!, and voila.
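
A minimal sketch of that last suggestion (dv and transfer! are part of the public API; the values are made up):

(require '[uncomplicate.neanderthal.core :refer [transfer!]]
         '[uncomplicate.neanderthal.native :refer [dv]])

(def a (dv 1 2 3))

(seq a)                        ;; => (1.0 2.0 3.0), a plain Clojure sequence
(transfer! a (double-array 3)) ;; copies the entries into a Java double array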

blueberry13:10:00

Sure, I didn't take it as a request. Feel free to openly discuss any idea that you might find interesting.

whilo13:10:48

I am playing around with autograd on top of neanderthal atm.

whilo13:10:49

The research group that I am in mostly uses pytorch, and so do I; I really like it. I will try to follow along and have a bit of a look into Stalingrad (a Scheme compiler for autograd).

whilo13:10:47

I am at the very beginning though, not sure yet how to handle in-place ops in an immutable compute graph. I will postpone that until later.

whilo13:10:48

I think I have to wrap some common linear algebra operations and make them polymorphic for convenience, similar to pytorch, which follows numpy. I am not sure whether this aligns well with the perf. focus of neanderthal.

blueberry13:10:56

Hm, aren't they polymorphic already?

whilo13:10:06

I also thought a bit about the tradeoffs between core.matrix (which has a cljs port that I used for my Bayesian inference playground: http://replikativ.io/sghmc/) and neanderthal.

whilo13:10:16

I mean also for scalars including broadcasting.

whilo13:10:55

But I try to keep it simple for now. This is something that pytorch provides.

whilo13:10:20

(independent of numpy, they implement their own tensor library as they originally did for lua)

whilo13:10:51

I think the problem with core.matrix, as far as I understood you as well, is that it is also polymorphic in all the internal low-level operations used to implement the linear algebra routines, instead of just exposing a convenient polymorphic high-level API for users who do not tamper with the linear algebra internals.

whilo13:10:58

Is it possible in neanderthal to globally switch the implementation that libraries are using, so that everybody uses the same backend?

blueberry13:10:58

I do not understand the question. The whole API is backend-agnostic, if that is what you mean. However, the catch with switching backends blindly (which you CAN do) is that the CPU and GPU have their own strengths and weaknesses, so they should be used together. Of course, if the machine has only a CPU, then it will be used for everything, but I have yet to encounter a machine that has a GPU, but no CPU, and runs a JVM.

blueberry13:10:43

Use the factory that you want in the core namespace, per call, or set it as a global binding...
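
A hedged sketch of the per-call style (vctr, ge, and the native-double/native-float factories are in the public API; an OpenCL or CUDA factory would slot into the same position, setup omitted):

(require '[uncomplicate.neanderthal.core :refer [vctr ge]]
         '[uncomplicate.neanderthal.native :refer [native-double native-float]])

;; the first argument picks the backend; the rest of the code stays the same
(def x (vctr native-double 1 2 3))       ;; double-precision CPU vector
(def y (vctr native-float 1 2 3))        ;; single-precision CPU vector
(def a (ge native-double 2 3 (range 6))) ;; 2x3 CPU matrix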

whilo13:10:02

Ok, good. So when I develop and release against the CPU version, a GPU user will automatically create all tensors on the GPU?

blueberry13:10:19

The thing is, of course, that you have to move data somehow from the main memory to the GPU memory. Currently the MKL engine serves that purpose, but in the extreme case, you could provide your own "dummy" native engine when you construct the OpenCL engine... Probably even nil will work (it did last year) but I'm not sure now.

blueberry14:10:00

That depends on how you set everything up. Everything in neanderthal is pluggable and configurable - it is up to you how you assemble it. There is a default configuration for convenience (the native, opencl and cuda namespaces), but you do not have to use it.

whilo14:10:41

Ok, I think I need to have a closer look.

whilo14:10:55

What is your take on automatic broadcasting?

blueberry14:10:28

Automatic in what sense?

whilo14:10:41

In the sense that (+ scalar vector) will automatically broadcast the scalar to the size of the vector.

blueberry14:10:42

If (when) I really need such an operation, I am more inclined to create a separate broadcast method with well-defined semantics. As for that particular example, it is already supported in Neanderthal 0.17.0-SNAPSHOT by the linear-frac method, which can do the shift by a scalar and works for both vectors and matrices, like this: (linear-frac a 3.333). I don't plan to pollute +, -, and the other scalar operations in Neanderthal, but I will probably add general broadcasting for tensors, when I add them. For vectors and matrices, I think it does much more harm than good, and it can easily be achieved with existing functions.
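
A small sketch of that shift (assuming linear-frac lives in the vect-math namespace, as in later releases):

(require '[uncomplicate.neanderthal.native :refer [dv]]
         '[uncomplicate.neanderthal.vect-math :refer [linear-frac]])

;; adds the scalar to every entry, without overloading +
(linear-frac (dv 1 2 3) 3.333) ;; => entries 4.333, 5.333, 6.333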

whilo14:10:36

What happens fairly regularly when implementing deep learning architectures is that you need to add a row vector row-wise over a matrix. But I will not do automatic broadcasting for now, and will just focus on autograd.

blueberry14:10:10

I know that, but IMHO it is an NN-specific requirement that 1) can be achieved with existing operations with some performance penalty, 2) can easily be implemented in a simple OpenCL/CUDA kernel with no penalty using clojurecl/clojurecuda, and 3) is not that well-defined in linear algebra textbooks (so it is easy to misuse).
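
For 1), a hedged sketch using a rank-1 update (rk! in current Neanderthal; the name may differ in older releases):

(require '[uncomplicate.neanderthal.core :refer [rk! entry!]]
         '[uncomplicate.neanderthal.native :refer [dv dge]])

;; add the row vector b to every row of a: a := 1.0 * ones * b' + a
(let [a    (dge 3 4)            ;; 3x4 matrix of zeros
      b    (dv 1 2 3 4)         ;; the row to broadcast
      ones (entry! (dv 3) 1.0)] ;; column of ones
  (rk! 1.0 ones b a))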

whilo14:10:06

Yes, I agree.

whilo14:10:16

How are your Bayesian Inference things going?

whilo14:10:51

I am mentioning it because they are my goal as well, hopefully in the form of an Anglican inference method for Bayesian neural networks.

whilo14:10:39

like SGHMC

blueberry15:10:51

I haven't had time to work on it due to other things, but I plan to implement the CUDA engine in the mid-term.

blueberry15:10:47

As for usability, it has been rather functional since 2015. I just didn't put it in Clojars...

whilo15:10:15

Ok, cool.

whilo15:10:30

Is there a reason why Vectors and Matrices do not print into a readable format?

blueberry15:10:34

You mean readable by the Clojure reader?

blueberry15:10:50

Since I think they are definitely readable by a human user in the REPL:

#RealGEMatrix[double, mxn:3x4, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ┓
   →       1.0     9.9     2.56E+2 1.18E+4
   →       1.8    27.      8.70E+2 4.67E+4
   →       4.0    80.      3.13E+3 1.92E+5
   ┗                                       ┛

whilo15:10:00

Yes, I meant literal printing.

whilo15:10:16

I think they already print all the information needed to reconstruct them.

blueberry15:10:17

As for Clojure, I think transfer! is much more appropriate. Transfer the data into whatever you want without the string conversion and parsing.

whilo15:10:58

Well, I often inline expressions from my REPL into my buffer, e.g. in tests. It is really handy if you have readable expressions then.

blueberry15:10:17

use seq in such cases

blueberry15:10:35

or (seq (view-vctr a))
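
A small sketch of that round trip (view-vctr is in the core namespace; the matrix is made up):

(require '[uncomplicate.neanderthal.core :refer [view-vctr]]
         '[uncomplicate.neanderthal.native :refer [dge]])

(def a (dge 2 3 (range 6)))
(seq (view-vctr a))           ;; => (0.0 1.0 2.0 3.0 4.0 5.0), column-major
(dge 2 3 (seq (view-vctr a))) ;; pasting that seq back rebuilds an equal matrix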

whilo15:10:48

Then I have to walk my whole nested data structure to transform the matrices into a different type.

blueberry15:10:18

Can you give me an example?

whilo15:10:41

Right now I have a graph of an inner product:

blueberry16:10:24

How would Neanderthal matrices ideally be printed to help you here?

whilo16:10:43

The atoms won't work here anyway, since they are also printed in a non-readable fashion.

whilo16:10:01

I think you can stick to your current way of printing; I would just avoid "n:1" and "offset:0", use either a map or "n 1 offset 0 stride 1", and put the whole matrix into the one expression after RealBlockVector.

whilo16:10:26

Ah, for the matrices the special characters won't work, ofc.

whilo16:10:42

But you can still print newlines.

whilo16:10:02

Well, if the matrices get large, do you only print edges then?

whilo16:10:17

numpy does this to avoid screwing up your environment.

blueberry16:10:03

But why? This is the printing format optimized for human consumption (not perfect, perhaps, and I am open to suggestions). What's intended for computers is transfer!. I understand your challenge here, but the actual issue is in printing your data structure, not the nested elements. In your case, I would implement print-method for your defrecord so that it prints the nested elements in whatever way you find useful for your workflow. You can actually do this even for Neanderthal: just redefine print-method in your program.
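
A hedged sketch of that suggestion, for a hypothetical Node record holding Neanderthal structures (not Neanderthal API, just a workflow idea):

(defrecord Node [value grad])

;; print the nested vectors/matrices as plain seqs, so the whole record
;; can be pasted back into a buffer
(defmethod print-method Node [n ^java.io.Writer w]
  (.write w (str "#Node{:value " (pr-str (seq (:value n)))
                 " :grad " (pr-str (seq (:grad n))) "}")))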

whilo16:10:18

Yes, I am just wondering why people in general do not print readable things. Even if you print a map that just describes what was there, you can at least still read your buffer and access all the other parts.

whilo16:10:41

But "n:1" is not parseable.

blueberry16:10:57

But how would you print and read a 3000 x 1000 matrix?

blueberry16:10:11

How would you print (and read) a symmetric matrix?

blueberry16:10:17

Or a symmetric packed?

blueberry16:10:32

How to handle wildly different magnitudes of data?

whilo16:10:57

If you don't want to print the data itself, you can still print something that is edn and machine-readable.

blueberry16:10:19

I don't understand.

whilo16:10:27

I mean a placeholder.

blueberry16:10:48

look, (map seq a) gives you something like this: ((1 2 3) (4 5 5) (7 8 9))

whilo16:10:08

Something like #RealBlockVector{:stride 1, :n 1, :offset 0, :value [[1 2 :... 999 1000] :... [1000 :... 2000]]}

whilo16:10:28

If I have no read handler for it, the edn parser will just leave the map in its place.

whilo16:10:29

You can still use newline formatting to make it human readable.

blueberry16:10:56

It was like this before, but I never need to read it with the edn parser, and I always need to read it in the REPL (where the actual data is displayed better than you suggest).

blueberry16:10:17

But, anyway, why doesn't the seq approach work for you?

whilo16:10:00

Well, it is a non-critical issue; I can ofc. work around it and fix it myself. I think the data-driven approach of Clojure is very often sacrificed here, and it makes copying and pasting values a lot harder, like in Python, where almost nothing is automatically readable and people pickle everything.

blueberry16:10:35

That use case is handled by the seq.

whilo16:10:42

How close is neanderthal to the low-level BLAS operations in general?

whilo16:10:11

It is as close as possible, right?

blueberry16:10:54

When it makes sense, yes.

whilo16:10:29

Do you have any requirements or desirables for autograd?

blueberry16:10:50

Something I'd like to see?

whilo16:10:57

Yes. You mentioned that you thought about adding something in this direction as well.

blueberry16:10:30

Sure. I'll first tackle (human-coded only) gradients and things related to them. Only then will I know enough to form any strong opinion about auto-gradients. Until then, my primary concern is that they actually work, and are competitive with whatever is happening in other environments...

whilo16:10:24

Reverse-mode autograd, in the form of theano, tensorflow, and pytorch, is very close to manually backpropagated gradients nowadays. Doing it by hand is error-prone and often obscures the code. Working in pytorch is really cool, on the other hand, as the gradient follows Python's control flow. So you can calculate loops with variable iteration length, e.g. for LSTMs.

whilo16:10:08

In fact the way I am doing it right now is similar to both pytorch and a manual NN implementation that I have in numpy.

whilo16:10:21

linear regression works now 🙂

whilo16:10:42

A problem is that I have to box all things.

blueberry16:10:05

I'd love to see a comparison with those libraries.

whilo16:10:18

Sure, first of all I need to get some reasonable example code 🙂

whilo16:10:10

This is how it looks right now.

blueberry16:10:52

It looks simple enough.

blueberry16:10:08

Is there a reason you use loop and not map/reduce?

whilo16:10:39

It is destructive; gd! applies the gradients in place.
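
The chat doesn't show gd! itself; a hedged guess at such a destructive step, built on axpy! from the core namespace:

(require '[uncomplicate.neanderthal.core :refer [axpy!]])

;; w := w - lr * g, mutating the weights in place
(defn gd! [lr w g]
  (axpy! (- lr) g w))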

blueberry16:10:47

But I didn't primarily mean the comparison of code; I meant the comparison of performance.

blueberry16:10:32

Fluokitten's fmap! offers a destructive map, and it works on all types of vectors and matrices (if that helps here).
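
For reference, a minimal fmap! sketch (fmap! comes from Fluokitten's core namespace; note the primitive type hints Neanderthal expects):

(require '[uncomplicate.fluokitten.core :refer [fmap!]]
         '[uncomplicate.neanderthal.native :refer [dv]])

(def v (dv 1 2 3))
(fmap! (fn ^double [^double x] (* x x)) v) ;; squares v in place: (1.0 4.0 9.0)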

whilo16:10:26

Right. Well, I have to think about it. Usually I only use map/reduce in functional code; otherwise I use doseq or loop, to make it clear (for me).

whilo16:10:44

I just did it because this is how it would look in Python.

whilo16:10:09

Not exactly like pytorch: I have made the compute graph lazy, so you have to call forward first.

whilo16:10:41

This allows transforming the graph before applying it, but pytorch does not normally do this, as it is more intuitive not to mix in lazy computations.

whilo16:10:15

So maybe I drop that.

whilo16:10:52

grads is a full graph decorated with the gradients, which are only applied by gd!.

whilo16:10:35

I think I will drop laziness for now. This will allow the control flow to depend on the calculated values, which is much better.

whilo18:10:49

@blueberry What will the tensor support look like?

whilo18:10:45

I miss something like shape, which gives me all the dimensions in a vector, e.g. [2 3] for a matrix with 2 rows and 3 columns.

whilo18:10:00

dim just yields the product of the dimensions

blueberry18:10:37

dim is exactly how dimension is defined in mathematics. You can get what you want with mrows and ncols for matrices. [2 3] would have a huge performance penalty (more than 10x). As for tensors, I am not sure yet. shape as in your example will probably be supported, but I'll try to find an option that is also performant...

blueberry18:10:18

Of course, presently you can implement your own shape protocol that calls dim for vectors, and [(.mrows a) (.ncols a)] for matrices...
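
A sketch of that idea as a plain function over the public predicates (matrix? and vctr? are in the core namespace), rather than a full protocol:

(require '[uncomplicate.neanderthal.core :refer [dim mrows ncols matrix? vctr?]])

(defn shape
  "Dimensions as a Clojure vector. Boxes its results, so keep it out of hot loops."
  [a]
  (cond
    (matrix? a) [(mrows a) (ncols a)]
    (vctr? a)   [(dim a)]))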

whilo19:10:51

Good to know that it is expensive.

blueberry19:10:25

It is obvious and inevitable. You have the construction of a vector and two boxings.

whilo19:10:25

I need broadcasting now.

blueberry19:10:51

Not to mention the penalty when you actually read those numbers back

blueberry19:10:09

In itself it is not that much, but if you call it in a loop it adds up

whilo19:10:25

I mean I need broadcasting for scalars, so the gradient is projected to them correctly.

blueberry19:10:26

Sometimes it is important, sometimes it is not

whilo19:10:39

I see, yes.

blueberry19:10:09

If you call it once per matrix, then of course it is negligible

whilo19:10:54

Yes, but it could slip. I try to vectorize as much as possible though.

whilo19:10:05

Logistic regression also works.

whilo19:10:36

Have you done random initialization of matrices yet? For GPU ones, doing it on the device is better than copying from main memory (e.g. from Java).

whilo19:10:00

I only need standard normals.

whilo19:10:06

Or maybe uniform.

blueberry19:10:14

I do that in Bayadera, and generate random samples of various distributions in GPU memory directly.

blueberry19:10:38

The challenge regarding random numbers is to use a quality random generator that is also parallelizable.

blueberry19:10:06

This is solved in Bayadera, but it would be overkill to introduce it in Neanderthal just to support testing.

blueberry19:10:56

My general idea is that Neanderthal is a general-purpose vectorization and linear algebra library, while Bayadera is for statistics and random stuff

blueberry19:10:06

randomized stuff

blueberry19:10:22

I might add random matrix data generation on the CPU in neanderthal, though (feel free to open an issue, and I will think about it)

blueberry19:10:00

With testing-grade generation quality, and intended only for generating random testing data
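
In the meantime, a testing-quality sketch of CPU-side random init (java.util.Random is nowhere near MCMC quality, which is exactly the point below):

(require '[uncomplicate.fluokitten.core :refer [fmap!]]
         '[uncomplicate.neanderthal.native :refer [dge]])
(import 'java.util.Random)

;; fill a matrix with standard normals; good enough for tests only
(let [rng (Random.)]
  (fmap! (fn ^double [^double _] (.nextGaussian rng)) (dge 3 4)))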

whilo19:10:48

How bad are the random generators you are worried about?

blueberry19:10:36

For MCMC, probably all the generators you'd encounter in Java are poor, including the Mersenne Twister

blueberry19:10:26

There are exceptions, but they are not that widely used in common Java libraries.

blueberry19:10:00

Not to mention that the default random() is unusable 🙂

whilo19:10:40

Ok, I am a bit green here. Do you have any pointers?

blueberry19:10:52

I do. Use Bayadera 🙂

whilo19:10:57

I mean links.

whilo19:10:01

Ok 🙂