data-science 2018-12-04 | Slack Archive

otfrom09:12:31

@henrygarner where is the best place to ask kixi.stats questions? (I was wondering about doing bootstrapping (random sampling w/replacement) with kixi.stats but wanted any question to happen in a reasonable forum.)

henrygarner10:12:55

@otfrom this is a great forum now that I have blown the layer of dust off my Slack app!

otfrom10:12:16

🙂

henrygarner10:12:43

It would be excellent to know what your bootstrapping objective is for context: which statistics in particular are you inspecting via bootstrapping?

henrygarner10:12:22

And what features would you like over and above e.g. (take n (repeatedly #(rand-nth coll))). e.g. determinism?
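For readers following along, here is a minimal sketch of what "determinism" could mean for the one-liner above: the same resampling with replacement, but driven by a seeded `java.util.Random` so that a given seed always reproduces the same draw. The function name `bootstrap-sample` and the `seed` argument are assumptions made for illustration; they are not part of kixi.stats.

```clojure
(import 'java.util.Random)

(defn bootstrap-sample
  "Draws n items from coll uniformly at random, with replacement.
  Hypothetical helper: a fixed seed makes the draw reproducible."
  [coll n seed]
  (let [v   (vec coll)                 ; vector gives constant-time nth
        rng (Random. (long seed))]     ; seeded generator, not clojure.core/rand
    ;; vec forces the lazy sequence so every draw happens here, in order
    (vec (repeatedly n #(nth v (.nextInt rng (count v)))))))

;; Usage: the same seed yields the same resample.
;; (= (bootstrap-sample [1 2 3 4 5] 10 42)
;;    (bootstrap-sample [1 2 3 4 5] 10 42))
```

The unseeded `(take n (repeatedly #(rand-nth coll)))` remains the simplest option when reproducibility does not matter.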

otfrom11:12:00

I'm not sure we need much more than that. I'd have to get @mattford @michael.ford and @seb231 to really get into the details. I think it is mostly about building up probability distributions as we go into MCMC simulation to show more uncertainty (esp when projecting into the future). You'll not be surprised to hear that this is to do w/witan.send. 😉

otfrom11:12:43

https://github.com/mastodonc/witan.send

otfrom11:12:49

now that you put it that way (take n,,,) it sounds like a pretty simple solution (just sample w/replacement from the historical data using that to build up the betas)

henrygarner11:12:51

Bootstrapping isn't a great fit for transducer-ification, so the above is broadly how it would look in kixi.stats too. But that's not an argument against inclusion: there are already several plain-old Clojure functions there too.

henrygarner11:12:49

(Might be worth my noting that it wouldn't make much sense to calculate params to a beta distribution from a bootstrapped sample, as the beta already measures the uncertainty present in it. More data would only produce unjustified confidence)

mattford15:12:36

@henrygarner do you mean we would effectively be double dipping?

henrygarner15:12:38

@mattford in a manner of speaking! They're different strategies for achieving the same end, and combining them invalidates the assumptions. If you've got a Bernoulli process (i.e. anything like a coin flip: a binary outcome with some probability p) then a given sample of n outcomes, m of which are true, can be modelled by the Binomial distribution. Given a single Binomial sample, i.e. given n and m, we can infer likely values of p. We have 3 options: 1) take a point estimate based on the sample, i.e. m / n (naive, because the sample may be small and therefore the point estimate is unjustifiably exact), 2) bootstrap to empirically measure the variance in m / n across a variety of bootstrapped samples, which gives us an empirical measure of uncertainty in p (better), or 3) use the beta distribution, which gives us the analytic equivalent of 2 directly (best?).
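The three options can be put side by side in a rough sketch. Everything here is illustrative: the function names, the seed, and the sample of 30 successes in 100 trials are made up, and the beta parameters assume a uniform Beta(1, 1) prior, which gives Beta(m + 1, n - m + 1). The thread does not say which prior is used in practice. The beta variance uses the standard closed form αβ / ((α + β)² (α + β + 1)).

```clojure
(import 'java.util.Random)

;; Hypothetical sample: n = 100 Bernoulli outcomes, m = 30 of them true.
(def outcomes (concat (repeat 30 true) (repeat 70 false)))

;; Option 1: point estimate m / n.
(defn point-estimate [xs]
  (/ (count (filter true? xs)) (double (count xs))))

;; Option 2: bootstrap. Resample with replacement `reps` times and
;; collect m / n for each resample (seeded for reproducibility).
(defn bootstrap-estimates [xs reps seed]
  (let [v   (vec xs)
        n   (count v)
        rng (Random. (long seed))]
    (vec (repeatedly reps
                     (fn [] (point-estimate
                             (repeatedly n #(nth v (.nextInt rng n)))))))))

(defn variance
  "Unbiased sample variance."
  [xs]
  (let [n    (count xs)
        mean (/ (reduce + xs) n)]
    (/ (reduce + (map #(let [d (- % mean)] (* d d)) xs)) (dec n))))

;; Option 3: beta distribution, assuming a uniform Beta(1, 1) prior.
(defn beta-params [m n]
  {:alpha (inc m) :beta (inc (- n m))})

(defn beta-mean [{:keys [alpha beta]}]
  (/ alpha (double (+ alpha beta))))

(defn beta-variance [{:keys [alpha beta]}]
  (let [s (+ alpha beta)]
    (/ (* alpha beta) (double (* s s (inc s))))))

;; Usage:
;; (point-estimate outcomes)
;; (variance (bootstrap-estimates outcomes 2000 42))
;; (beta-variance (beta-params 30 100))
```

With a sample of this size the bootstrap variance and the beta variance should come out close to each other, which is the sense in which option 3 is the analytic counterpart of option 2.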

mattford15:12:03

In each tick of the simulation, though, we don't combine the variance of the successive applications of the beta.

mattford15:12:29

So the variance doesn't grow as we predict into the future.

mattford15:12:53

That's the real crux of what we are trying to solve.

henrygarner15:12:07

Bootstrapping is a way of measuring variance rather than injecting it. Maybe you're actually trying to add noise?

mattford15:12:37

I'm not doing a good job here: the "confidence interval" we see at, say, 2050 is pretty much the same as the one we see at 2030. We use the same set of probability distributions for each predicted year, but we don't combine the variances in any way going forward. I hope that explains, in my simplistic layman's terms, what I mean by combining the successive applications of the beta.
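A toy illustration of the effect described above, and explicitly not a description of how witan.send works: if each projected year is an independent draw around the same baseline, the spread across runs is the same in every year, whereas if each year's draw is added to the previous year's value, the variances of the independent steps sum and the spread widens with the horizon. The Gaussian noise, the baseline of 100, and all function names are assumptions chosen only to make the contrast visible.

```clojure
(import 'java.util.Random)

(defn sample-variance [xs]
  (let [n    (count xs)
        mean (/ (reduce + xs) n)]
    (/ (reduce + (map #(let [d (- % mean)] (* d d)) xs)) (dec n))))

(defn independent-path
  "Each year is a fresh draw around the same baseline; nothing carries over."
  [^Random rng baseline years]
  (vec (repeatedly years #(+ baseline (.nextGaussian rng)))))

(defn cumulative-path
  "Each year's draw is added to the previous year's value, so noise accumulates."
  [^Random rng baseline years]
  (vec (rest (reductions + baseline
                         (repeatedly years #(.nextGaussian rng))))))

(defn variance-by-year
  "Simulates `runs` paths and returns the across-run variance for each year."
  [path-fn runs years seed]
  (let [rng   (Random. (long seed))
        paths (vec (repeatedly runs #(path-fn rng 100.0 years)))]
    (mapv (fn [t] (sample-variance (map #(nth % t) paths)))
          (range years))))

;; Usage: compare the first and last year of a 30-year projection.
;; ((juxt first peek) (variance-by-year independent-path 2000 30 1))
;; ((juxt first peek) (variance-by-year cumulative-path 2000 30 1))
```

In the independent case the first and last values should be roughly equal; in the cumulative case the last should be roughly 30 times the first, since the variance of a sum of independent steps is the sum of their variances.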

gigasquid17:12:11

Some talks from #NeurIPS have been posted https://www.facebook.com/pg/nipsfoundation/videos/

👍 8

gigasquid17:12:32

There are 9000 people attending this year!
