#uncomplicate
2017-08-24
whilo14:08:42

@blueberry I guess you don't want to announce the MC method you use for bayadera publicly? It is mentioned neither in the slides nor in the source code. I am implementing SGHMC at the moment, which does not need branching, but rather works like momentum SGD. I implement it with PyTorch's autograd. I think something like it would fly on neanderthal + autograd.
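
For reference, the discretized update from the original SGHMC paper (Chen, Fox & Guestrin, 2014) really is momentum SGD plus injected Gaussian noise, with $\tilde{U}$ the minibatch estimate of the negative log posterior, $\epsilon$ the step size, $\alpha$ the friction term, and $\hat{\beta}$ an estimate of the gradient-noise variance (often simply set to 0):

```latex
\theta_{t+1} = \theta_t + v_t, \qquad
v_{t+1} = (1 - \alpha)\, v_t
          - \epsilon\, \nabla \tilde{U}(\theta_t)
          + \mathcal{N}\!\bigl(0,\; 2(\alpha - \hat{\beta})\,\epsilon\bigr)
```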

whilo14:08:23

It works for very high-dimensional problems like NNs over natural images: https://arxiv.org/abs/1705.09558

blueberry14:08:06

I would like to see some numbers before I can form an opinion.

whilo14:08:13

The paper has some. I will let you know once I can reproduce the results. What particular numbers are you interested in?

whilo14:08:55

I am mentioning it because I think autograd functionality would be very helpful as an intermediate abstraction on top of neanderthal for building Bayesian statistics toolboxes.

whilo14:08:30

I don't have as much time as I would like, but at the moment I am exploring some Clojure autodiff stuff in https://github.com/log0ymxm/clj-auto-diff

blueberry14:08:26

I've skimmed through the paper, but this is the problem: most of those papers require familiarity with concrete problems (vision/classification in this case) to judge the results. There is no anchor I can use to judge this paper. I cannot see the simplest data I'm interested in when I hear about ANY MCMC implementation: how many steps it needs to converge for some problems that are easy to understand and compare, and how much time one step takes. For easy problems. If it works poorly for easy problems, I cannot see how it can work great for harder problems. If it works OK for easy problems, then I can look at harder ones and see how it does there. There is so much hand-waving in general that it usually does not surprise me that 99% of these turn out to be vaporware.

blueberry14:08:54

I was quite surprised when I saw the Anglican MCMC hello world at EuroClojure. The most basic thing you can use MCMC for took 239 SECONDS for the world's simplest beta-binomial.
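
For context, the beta-binomial model has a closed-form conjugate posterior, which is what makes 239 seconds so striking: with a Beta prior and $k$ successes in $n$ trials, the answer is available immediately:

```latex
p \sim \mathrm{Beta}(a, b), \quad k \mid p \sim \mathrm{Binomial}(n, p)
\;\Longrightarrow\;
p \mid k \sim \mathrm{Beta}(a + k,\; b + n - k)
```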

whilo14:08:56

Generating such artificial images is very, very difficult, even if they are far from perfect, so you definitely need a good method to get there. The original SGHMC paper has more explanation, but there is a whole string of literature related to this.

whilo14:08:11

I mention it to you because I can imagine that you don't have the time to dive into it.

whilo14:08:04

I understand your concerns. I, on the other hand, am interested in using Clojure again at some point for my optimization stuff, especially since Anglican represents fairly sophisticated tooling compared to Python's libraries on top of TensorFlow or Theano.

blueberry14:08:20

That might be very useful for that particular problem (or not, I don't know, since I don't do computer vision), but it does not tell me whether SGHMC is worth exploring for the general MCMC that I'm interested in.

whilo14:08:52

SGHMC samples the whole weight matrices of a 5-layer CNN in their case.

whilo14:08:27

That is a very high-dimensional problem. In MCMC there are in general only convergence proofs for toy problems, so I cannot tell you how well it explores the distribution.

blueberry14:08:27

In what sense? Are those matrices unknown?

whilo14:08:00

Unlike stochastic gradient descent, the routine does not just find an optimum but samples from the posterior of the weight matrices given the data.

blueberry14:08:37

I'm not aware of ANY method that guarantees MCMC convergence, toy or non-toy! That is the trickiest part of MCMC.

blueberry14:08:12

So, each sample is the whole CNN?

blueberry14:08:18

How does their method compare to the state of the art? You know, the models that Google/Facebook/DeepMind or whoever else is the leader publishes?

blueberry15:08:02

How much memory does one sample typically take?

blueberry15:08:37

And how many samples do they consider enough to fairly represent the posterior?

whilo15:08:06

They subsample from the chain, but I don't know much about these specifics yet. A sample probably takes a few megabytes.

whilo15:08:54

In this paper they managed to compete with deep learning GANs and exploit Bayesian features like multiple samples from the posterior.

blueberry15:08:02

Hmmm. I'm afraid I do not have enough knowledge to fairly comment on what they do.

whilo15:08:44

Sure. I thought you might be interested in scalable high-dimensional sampling methods. So far I just wanted to talk about it a bit with you.

whilo15:08:56

I can tell you more once I can get it to run on more than toy problems.

blueberry15:08:14

Do they compete on results only, or also on speed? Because getting the same results at 1000x the cost is a bit underwhelming (if that is the case, of course).

whilo15:08:47

No, that is the cool thing. It is a bit slower than SGD, but not by orders of magnitude.

whilo15:08:00

You always pay for a Bayesian approach.

whilo15:08:23

Estimating a full distribution vs. a MAP or MLE estimate is a lot more expensive in general.
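
Roughly, the contrast here, for weights $W$ and data $D$: MLE and MAP are single optimization problems, while the full Bayesian treatment has to represent the whole posterior and, for predictions, integrate over it, which is where the extra cost usually comes from:

```latex
W_{\mathrm{MLE}} = \arg\max_W\, p(D \mid W), \qquad
W_{\mathrm{MAP}} = \arg\max_W\, p(D \mid W)\, p(W)

p(W \mid D) = \frac{p(D \mid W)\, p(W)}{\int p(D \mid W')\, p(W')\, dW'}, \qquad
p(y \mid x, D) = \int p(y \mid x, W)\, p(W \mid D)\, dW
```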

blueberry15:08:27

Not necessarily 🙂

blueberry15:08:04

That's why you do it only when you know you'll get qualitatively better answers

blueberry15:08:31

If it's only "we got slightly better X" then why bother?

blueberry15:08:14

I understand that it can be quite challenging in machine vision because the DL people really pushed the state of the art in the last decade.

whilo15:08:22

I agree. I think a statistical approach can help you make informed optimization decisions, though, even if you go toward an MLE estimate in the end.

whilo15:08:23

But if you can incorporate the strengths of deep nets, then you can also improve statistical methods. This is what a lot of people try to do at the moment.

whilo15:08:44

They use neural networks to approximate internal functions to make their samplers faster, or to get a better variational approximation.

blueberry15:08:06

That's what they hope for, at least 😉

blueberry15:08:35

But it may also be a case of "what a nice hammer I've got".

whilo15:08:41

I agree with your emphasis on performance. I am exploring it at the moment because I am doing research. In a practical project I would probably stick to a CNN for these problems.

whilo15:08:15

Yes, I have this internal struggle with the Bayesian approach. But so far it has helped me to stretch in this direction.

whilo15:08:38

The math is sound and you can borrow a lot of intuition from past experience.

whilo15:08:09

Have you thought about autograd at some point?

whilo15:08:43

Because that is really strong in Python, especially with PyTorch. I am really happy with it, despite Python being slow and a big mess under the hood.

whilo15:08:47

cortex builds directly on layers, and I don't understand why, except for business reasons. Autograd has a very strong history in Lisp, and it is ironic that Python is so much better at it than Clojure.

whilo15:08:22

Theano, TensorFlow, and PyTorch are all autograd libs to different degrees.

blueberry15:08:20

I'll add something more pragmatic (and more effective, IMO): vectorized gradients for all standard math functions on vector/matrix/tensor structures in neanderthal, but no general Clojure code gradients.
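
A minimal sketch of the kind of thing that would enable, assuming nothing beyond neanderthal operations that already exist (mv, trans, axpy, dv, dge); the least-squares example and the function names here are purely illustrative, not a proposed API:

```clojure
(require '[uncomplicate.neanderthal.core :refer [mv trans axpy]]
         '[uncomplicate.neanderthal.native :refer [dv dge]])

;; loss(x) = 1/2 * ||A x - b||^2  has the closed-form, fully vectorized
;; gradient grad(x) = A^T (A x - b); no walking of general Clojure code needed.
(defn least-squares-grad [a b x]
  (mv (trans a) (axpy -1.0 b (mv a x))))

;; one plain gradient-descent step built from that hand-written gradient
(defn gd-step [a b x step]
  (axpy (- step) (least-squares-grad a b x) x))

(comment
  (let [a (dge 2 2 [1 0 0 1])   ; 2x2 identity, column-major
        b (dv [1 2])
        x (dv [0 0])]
    (gd-step a b x 0.1)))       ; x moves toward the solution, here [0.1 0.2]
```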

whilo15:08:11

That is probably sufficient. I agree that general autograd might be too much, but there is a very rich literature and there are implementations in Scheme, and they have tried hard to make it efficient.

whilo15:08:31

E.g. reverse-mode autograd is like backprop.
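
A toy illustration of that point in plain Clojure (nothing here is from clj-auto-diff or any of the libraries mentioned): each operation records a closure that pushes its output adjoint back to its inputs, and the backward sweep replays the tape in reverse, which is exactly the structure of backprop:

```clojure
(def ^:dynamic *tape* nil)   ; an atom holding a vector of backward closures

(defn tracked [x] (atom {:value x :adjoint 0.0}))

(defn mul [a b]
  (let [out (tracked (* (:value @a) (:value @b)))]
    (swap! *tape* conj
           (fn []
             (swap! a update :adjoint + (* (:adjoint @out) (:value @b)))
             (swap! b update :adjoint + (* (:adjoint @out) (:value @a)))))
    out))

(defn add [a b]
  (let [out (tracked (+ (:value @a) (:value @b)))]
    (swap! *tape* conj
           (fn []
             (swap! a update :adjoint + (:adjoint @out))
             (swap! b update :adjoint + (:adjoint @out))))
    out))

(defn gradients
  "Partial derivatives of (apply f inputs) with respect to each input."
  [f & inputs]
  (binding [*tape* (atom [])]
    (let [xs  (mapv tracked inputs)
          out (apply f xs)]
      (swap! out assoc :adjoint 1.0)          ; seed: d out / d out = 1
      (doseq [back (rseq @*tape*)] (back))    ; reverse sweep = backprop
      (mapv #(:adjoint (deref %)) xs))))

;; f(x, y) = x*y + x  =>  df/dx = y + 1, df/dy = x
;; (gradients (fn [x y] (add (mul x y) x)) 3.0 4.0)  ;=> [5.0 3.0]
```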

whilo15:08:02

Do you have pointers on how you would do it in neanderthal?

blueberry15:08:26

I prefer to talk about things only once I am sure I can do them properly. I have a pretty good idea of how to do it, but I am not sure whether the results would be amazing, so I prefer to shut up for now 🙂

blueberry15:08:02

and, of course, there are quite a few things with higher priority now.