#data-science
2023-02-20
p-himik19:02:09

What would be a good way to compute a PDF of a continuous distribution given a sample of numbers? I've completely forgotten my statistics and am now using fastmath.random/distribution and fastmath.random/pdf for it without thinking too hard, but it's flaky. Very occasionally the underlying apache/commons-math3 throws TooManyEvaluationsException and running it again resolves the issue. Maybe there's something simpler, with more stable behavior?

respatialized20:02:41

I've never encountered this issue with fastmath. How big is the sample of numbers you're using to generate this distribution?

p-himik21:02:50

Pretty small actually, here's an example:

[24.662205 26.22636 28.7975 30.13393 27.76407 27.53007 27.3551 28.18512 29.27688 26.55993 26.18226 27.63607 24.98422 29.0125 28.63367 29.17 30.09404 26.916 31.31946 28.2223 30.086 28.13168 30.823 30.51522 29.10788 26.03864 26.66382 28.87991 28.81368 28.154 29.31557 28.8736 26.68266 28.69051 25.9755 29.87031 26.53113 26.88417 29.66284 25.27433 25.8366 26.48693 31.09369 30.50727 26.20031 27.20536 29.59743 26.916 26.88279 29.33172 26.89102 30.71332 29.29106 26.43015 33.43071 28.95731 29.04467 33.07451 26.62252 28.157 28.85739 29.25335 26.53265 26.67003 28.52562 23.56133 26.68511 22.26829 30.16807 24.38851 28.77972 25.42692 25.69079 24.92336 26.92622 23.48808 26.57914 27.48241 28.68939 25.80033 23.99235 28.77972 25.90417 25.22066 27.23775 27.15951 25.18368 23.94114 28.73562 28.82026 28.65881 27.42978 26.22352 27.9369 23.30742 24.84015 20.39061 26.59407 25.80815 25.17016 28.11044 23.16944 26.50303 27.82452 28.36578 28.45895 25.33019 28.116 29.89495 26.449 28.51798 27.77474 26.834 24.56845 24.91341 28.29181 25.84656 23.29391 27.769 24.56845 27.289 27.47815 25.57202 27.79252 23.79035 26.94153 27.10403 26.07558 26.689 28.356 27.724 25.83944 28.249 25.92906 26.45254 25.12322 24.62535 27.36457 24.738 26.03432 30.44576 24.39064 27.29963 23.68355 28.678 27.914 23.46674 25.59762 26.29322 30.225 26.127 21.75148 26.714 23.36574 25.16803 27.47103 24.742 26.01512 27.15 26.531 30.11688 26.81154 28.19691 26.40956 26.30815 27.9707 28.33691 29.51762 29.1009 26.67003 27.95808 24.39071 28.02753 27.04887 30.48365 26.65109 27.472 28.716 30.07956 30.01 26.96679 29.12615 28.63998 24.74911 27.09977 26.68898 27.16883 26.62394 28.27377 26.17755 30.14901 23.38048 23.89191 27.8318 28.10961 30.13638 30.65412 28.90516 27.46559 27.44665 28.35585 27.16883 25.96287 27.39614 27.611 23.98808 27.074 27.308 28.09698 27.59818 29.23349 28.84203 26.95416 29.42291 26.00076 27.106 27.25723 30.124 27.09307 26.14598 28.8294 26.22175 27.26354 27.78129 30.27529 29.63127 28.29271 29.555 27.73077 28.31797 27.85652 
27.40245 28.034 29.50499 27.31405 27.95808 25.508 28.05278 26.758 26.304 29.53833 29.429 30.06693 30.42051 28.11592 27.36145 29.20051 27.67395 24.17603 27.54767 28.33691 29.79269 30.32203 28.18538 27.75603 25.69311 30.692 28.51849 27.67367 30.0627 27.60561 26.44273 29.20764 29.18873 29.37865 31.88457 28.29154 32.50926 27.34013 29.94366 28.52968 26.68852 28.42383 27.35627 27.67488 27.5098 29.0865 28.03705 30.01995 29.02651 28.29866 30.53022 28.09405 29.40799 31.33371 28.0778 28.46762 26.46388 28.94999 28.83506 29.18002 28.66519 26.41967 30.12537 25.94756 29.32459 25.41854 29.62382 28.80559 30.09702 28.67527 27.11273 30.01852 26.85449 28.69968 29.50367 30.6755 28.72613 29.10245 28.96537 27.26172 31.20992 29.09454 29.21773 30.90404 29.8903 29.13071 29.01318 30.41744 30.33348 31.40007 26.60825 27.91378 28.15818 28.44011 27.23905 30.49302 30.56921 26.91842 30.96443 31.13551 30.18451 29.68795 31.11906 29.10878 28.45946 27.38545 28.60632 28.1348 27.36145 29.26169 30.36261 30.37127 28.79738 30.76594 29.59891 29.02719 30.9556 30.27216 28.69763 28.26707 31.42145 28.19691 29.85181 29.81906 30.17739 29.90881 31.15116 31.30788 29.46289 31.82531 30.64263 29.30826 29.23614 28.44115 30.19379 26.49804 27.17492 28.387 26.00659 27.42139 28.95568 29.49236 29.52198 28.20432 26.44273 26.64387 28.87991 26.25963 28.08781 27.554 31.38135 24.561 27.491 28.08292 26.733 27.01099 28.58947 27.43402 32.17924 30.2816 28.14905 28.22029 29.83331 29.537 28.173 28.59578 29.24 29.25243 27.32668 26.15229 28.73469 25.805 29.34457 31.22462 31.55682 26.30383 27.35825 30.5973 25.33944 28.103 27.90125 27.19409 27.91129 27.6019 22.96176 25.38567 25.79606 22.27398 25.56135 27.68085 25.58624 27.80023 27.27617 26.32908 27.90756 28.09698 28.84203 30.26897 30.35105 26.9731 28.75363 29.70072 30.93194 29.1729 27.04474 27.92305 31.72522 27.92421 31.57595 32.30747 30.12752 29.11086 29.284 29.2822 28.6386 30.72594 30.16676 28.54599 27.44176 26.92362 26.99679 25.269028 26.594 29.24 27.62975 27.289 27.28249 27.13726 
29.45996 27.25654 28.457 28.76585 28.766 26.1613 29.82544 26.803 28.18538 29.08858 28.52155 29.8567 32.16499 30.59771 25.62465 24.35508 25.80175]

p-himik21:02:13

(count data) => 476

p-himik21:02:18

Maybe it has something to do with multithreading. Here https://clojurians.slack.com/archives/C03S1KBA2/p1676926759317809?thread_ts=1676924932.649369&cid=C03S1KBA2 I tried to come up with a way to circumvent the issue, but it made things worse: out of 3 concurrent attempts to compute the pdf, 1 returns immediately and the other 2 always wait for a long time and fail.

respatialized21:02:39

Yeah, it seems like size is probably not the issue if the data are that small. I was just guessing because of the TooManyEvaluationsException.

p-himik21:02:09

Ah yeah, RombergIntegrator doesn't seem to be thread-safe... :(

respatialized21:02:54

https://github.com/generateme/fitdistr I think if you're looking for estimation methods that will produce a parameterized continuous distribution given a set of discrete samples, you might look at this library which accompanies fastmath.random

p-himik21:02:34

Given that I use just 2 functions from fastmath.random - how exactly will fitdistr help me?

respatialized21:02:38

You mentioned in your original question that you wanted a PDF of a continuous distribution that was based on a finite sample of data - that means you have to make an assumption about which type of continuous distribution best represents your data (normal, log-normal, gamma, Pareto, etc.). Once you have an idea, you can use fitdistr to estimate the parameters for the distribution you choose.

p-himik21:02:51

Damn, the amount of info I forgot is probably larger than I remember. I think that when I wrote that code, I pretty much replicated what some R package was doing that my statistician co-workers were using at the time.
> that means you have to make an assumption about which type of continuous distribution best represents your data
But I don't know that: the data is external, and visually, different samples look completely different. What does (fastmath.random/distribution :continuous-distribution ...) do then? Does it just smooth things over?

genmeblog07:02:40

The issue with RombergIntegrator is fixed in a SNAPSHOT (in previous versions the integrator was a global var, which was an obvious mistake).
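
For anyone stuck on a release where the integrator is still shared, one generic workaround is to serialize the calls behind a single lock. This is only a minimal sketch: `integration-lock` and `serialized` are hypothetical names, not part of fastmath, and whether it avoids the exception depends on every code path that reaches the shared integrator going through the wrapper. Upgrading to the fixed version is the more reliable route.

```clojure
;; Hypothetical helper, not part of fastmath: one process-wide lock
;; that every wrapped call must acquire before running.
(defonce integration-lock (Object.))

(defn serialized
  "Wraps f so that only one thread at a time can be inside it."
  [f]
  (fn [& args]
    (locking integration-lock
      (apply f args))))

;; Assumed usage (requires fastmath on the classpath, so left commented out):
;; (def safe-distribution (serialized fastmath.random/distribution))
;; (def safe-pdf          (serialized fastmath.random/pdf))
```

The cost is that concurrent callers queue up instead of running in parallel, which is usually acceptable for samples of a few hundred points.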

genmeblog07:02:51

:continuous-distribution uses the kernel density estimation (KDE) method to provide the pdf. The cdf and icdf are calculated through integration.
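
To make the idea concrete, here is a hand-rolled sketch of kernel density estimation in plain Clojure. It is not fastmath's implementation: the Gaussian kernel, the normal-reference rule-of-thumb bandwidth, and all function names are assumptions chosen for illustration. The sample is the first six values from the data posted above.

```clojure
(defn gaussian-kernel
  "Standard normal density at u."
  [u]
  (/ (Math/exp (* -0.5 u u)) (Math/sqrt (* 2.0 Math/PI))))

(defn rule-of-thumb-bandwidth
  "Normal-reference rule of thumb: 1.06 * sd * n^(-1/5)."
  [xs]
  (let [n        (count xs)
        mean     (/ (reduce + xs) n)
        variance (/ (reduce + (map #(let [d (- % mean)] (* d d)) xs))
                    (dec n))]
    (* 1.06 (Math/sqrt variance) (Math/pow n -0.2))))

(defn kde-pdf
  "Returns a function x -> estimated density, built from the sample xs.
  h is the bandwidth; it defaults to the rule of thumb above."
  ([xs] (kde-pdf xs (rule-of-thumb-bandwidth xs)))
  ([xs h]
   (let [n (count xs)]
     (fn [x]
       (/ (reduce + (map #(gaussian-kernel (/ (- x %) h)) xs))
          (* n h))))))

(defn trapezoid
  "Integrates f over [a, b] using `steps` equal-width trapezoids."
  [f a b steps]
  (let [dx (/ (- b a) (double steps))]
    (* dx (+ (* 0.5 (+ (f a) (f b)))
             (reduce + (map #(f (+ a (* % dx))) (range 1 steps)))))))

(def sample [24.662205 26.22636 28.7975 30.13393 27.76407 27.53007])
(def pdf (kde-pdf sample))

(pdf 27.5)                 ; density estimate near the middle of the sample
(trapezoid pdf 10 45 2000) ; total area under the estimate, close to 1.0
```

The pdf itself is a plain sum over the sample and needs no integrator. Numerical integration only enters when a cdf or icdf is derived from it, as `trapezoid` does here for the total area.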

genmeblog08:02:27

KDE produces a custom distribution according to your data. If you know your data come from a known distribution, you can use fitdistr to estimate its parameters.
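
For contrast with KDE, the parametric route can be sketched in plain Clojure for the simplest case, a normal distribution fitted by maximum likelihood. This does not use fitdistr's API: `fit-normal` and `normal-pdf` are hypothetical names, and the choice of a normal distribution is itself the kind of assumption discussed above.

```clojure
(defn fit-normal
  "Maximum-likelihood estimates for a normal distribution.
  Note that the MLE of the variance divides by n, not by n - 1."
  [xs]
  (let [n  (count xs)
        mu (/ (reduce + xs) n)
        s2 (/ (reduce + (map #(let [d (- % mu)] (* d d)) xs)) n)]
    {:mu mu :sd (Math/sqrt s2)}))

(defn normal-pdf
  "Returns a function x -> density of the normal distribution
  with the given parameters."
  [{:keys [mu sd]}]
  (fn [x]
    (let [z (/ (- x mu) sd)]
      (/ (Math/exp (* -0.5 z z))
         (* sd (Math/sqrt (* 2.0 Math/PI)))))))

(def params (fit-normal [24.662205 26.22636 28.7975 30.13393 27.76407 27.53007]))
(def fitted-pdf (normal-pdf params))

(fitted-pdf (:mu params)) ; peak density, equal to 1 / (sd * sqrt(2π))
```

The result is fully described by two numbers, whereas the KDE estimate carries the whole sample around. That compactness is what the distributional assumption buys.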

p-himik08:02:55

Thanks!

👍 2