off-topic 2022-02-14 | Slack Archive

Stuart16:02:40

Quick math question: I have a series of dates (my x-axis) and a series of values for each date (my y-axis); the dates are 15-minute segments. What do I want to DDG to figure out the trend of the values and see when that trend will cut the x-axis?

Stuart16:02:47

I mean what is the name of the math 'thing' that does this ?

dpsutton16:02:48

i think the term is linear regression

📈 1

javahippie16:02:50

Linear Regression?

Stuart16:02:32

Thanks, I'll look that up!

Stuart16:02:09

Say my points look something like this (sorry for the terrible MS Paint). Would linear regression have a way to ignore the big spike that is so far outside the normal range? Or would I have to figure out a way to remove it first, before doing the linear regression part?

p-himik16:02:40

The second. Linear regression just finds a, well, line, that fits your data better than any other. To filter out such spikes, you have to do it yourself - either by manually filtering for outliers or by using some statistical tool with some cut-offs.

Stuart16:02:33

Thanks. What I basically have is data recorded from a device, but it contains some rare spikes. However, if the values outside those spikes are trending down, I'd like to be able to work out, after each recording, a rough prediction of when it will reach 0

Stuart16:02:56

To within a couple of days, assuming samples are taken every 15 minutes

Stuart16:02:51

Does removing the spikes then linear regression seem like a reasonable way to achieve this ?

p-himik16:02:02

The easiest way is, no joke, to use Excel for that. :D Or LibreOffice Calc or whatever else spreadsheet software you might have. Assuming the amount of data is not huge.

p-himik16:02:31

> Does removing the spikes then linear regression seem like a reasonable way to achieve this ? Yes. Without any model behind the data, it's pretty much the way to do it.
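
A minimal sketch of the regression step discussed above, assuming the spikes have already been removed. It fits a least-squares line and solves for the x value where that line reaches 0. The function names `linear-fit` and `zero-crossing` and the sample readings are hypothetical, chosen only for illustration.

```clojure
(defn linear-fit
  "Ordinary least-squares fit of y = slope*x + intercept.
   xs and ys are equal-length sequences of numbers."
  [xs ys]
  (let [n      (count xs)
        mean-x (/ (reduce + xs) n)
        mean-y (/ (reduce + ys) n)
        ;; sum of products of deviations, and sum of squared x deviations
        sxy    (reduce + (map #(* (- %1 mean-x) (- %2 mean-y)) xs ys))
        sxx    (reduce + (map #(let [d (- % mean-x)] (* d d)) xs))
        slope  (/ sxy sxx)]
    {:slope     slope
     :intercept (- mean-y (* slope mean-x))}))

(defn zero-crossing
  "x at which the fitted line reaches y = 0, or nil when the line is flat."
  [{:keys [slope intercept]}]
  (when-not (zero? slope)
    (- (/ intercept slope))))

;; Hypothetical readings: x is hours since the first sample, 15 minutes apart.
(def xs [0.0 0.25 0.5 0.75 1.0])
(def ys [10.0 9.5 9.0 8.5 8.0])

(zero-crossing (linear-fit xs ys))
;; => 5.0
```

The result is in the same units as x (hours here). A crossing that lies before the latest sample means the fitted trend is rising rather than falling, so an alert would probably only use crossings that lie in the future.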

Stuart16:02:48

I'd like to somehow automate it, the data coming in is automated, and I'd like to somehow get an alert to me that I need to do something

dpsutton16:02:59

data science is always a two step process: clean the data, analyze the data

Stuart16:02:16

So I was thinking of doing this in the service that is pulling the data off a RabbitMQ queue I have set up

p-himik16:02:31

Well, I'd also add "create a model for your data" as the step 0 above - otherwise, you can't really clean your data.

quoll18:02:21

One possibility is to determine the mean μ and standard deviation σ, and then filter out (or clip) all data that differs from the mean by more than, say, 2σ. So filtering:

(filter #(< (abs (- % mean)) (* 2 std-dev)) data)

or to clip:

(map #(let [std-dev2 (* 2 std-dev)
            diff (- % mean)]
        (if (> 0 diff)
          (max % (- mean std-dev2))
          (min % (+ mean std-dev2))))
     data)

That tends to clean up most extreme outliers. It’s not perfect. If outliers are really extreme, they can shift the mean enough to push some good data outside the 2σ range (this is incredibly unlikely). If that were to happen, a two-step process can help: calculate the standard deviation once, filter the data with some large multiple of σ, and then repeat the process with a more modest multiple of σ. (I’ve never seen this needed)
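
A rough sketch of that two-step filter, including the mean and standard deviation that the snippets above take as given. The helper names and the multiples 4 and 2 are assumptions for illustration; the message only says "some large multiple" followed by "a more modest multiple".

```clojure
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn std-dev
  "Population standard deviation of xs."
  [xs]
  (let [m (mean xs)]
    (Math/sqrt (mean (map #(let [d (- % m)] (* d d)) xs)))))

(defn within-sigmas
  "Keeps the values that lie within k standard deviations of the mean.
   Uses <= so that constant data (σ = 0) is kept rather than dropped."
  [k xs]
  (let [m     (mean xs)
        limit (* k (std-dev xs))]
    (filter #(<= (Math/abs (double (- % m))) limit) xs)))

(defn two-pass-filter
  "Coarse pass first (4σ, an assumed value), then a tighter pass (2σ)
   with the mean and σ recomputed from what survived the first pass."
  [xs]
  (->> xs
       (within-sigmas 4)
       (within-sigmas 2)))

;; Hypothetical data: nineteen steady readings and one large spike.
(def readings (concat (repeat 19 10.0) [1000.0]))

(count (two-pass-filter readings))
;; => 19
```

Recomputing the statistics between passes is the point of the second step: the spike inflates both the mean and σ in the first pass, so the tighter threshold is only meaningful once it is gone.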

👍 1

Stuart18:02:00

Thanks that's brilliant!

phronmophobic18:02:19

As a bit of trivia, clipping extreme values is called https://en.wikipedia.org/wiki/Winsorizing

slipset20:02:28

@U0K1RLM99 might just have some info on this.

jasonbell21:02:12

Hello. I’m with @U2FRKM4TW on this, Excel/LibreOffice is the perfect starting point for any sort of Linear Regression. In the book I used Apache Commons Math to put together a 2D array of double values, that will output LR as well. Easy enough to retrain from RabbitMQ as well but you’ll have to persist the data somewhere obviously. I’m only skim reading the thread, so apologies if it’s already been covered.

josef moudrik22:02:05

Or you can also use a linear model, but fitted with a different error function. Normally one uses least-squares error, but then outliers (e.g. your spike) can have a disproportionate effect. So you might want to check out Huber loss. Other keywords are outlier detection/removal, as @U051N6TTC suggests. You can also do that by "distance from the fitted line", i.e. remove the points that are farthest from the fit (instead of from the data mean).
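
A small sketch of the "distance from the fitted line" variant, not of Huber loss: fit a least-squares line, drop the points with the largest absolute residuals, then refit. The function names, the `[x y]` point format, and the number of points to drop are all hypothetical choices for illustration.

```clojure
(defn fit-line
  "Least-squares line through points, each point a vector [x y]."
  [points]
  (let [n     (count points)
        mx    (/ (reduce + (map first points)) n)
        my    (/ (reduce + (map second points)) n)
        sxy   (reduce + (map (fn [[x y]] (* (- x mx) (- y my))) points))
        sxx   (reduce + (map (fn [[x _]] (* (- x mx) (- x mx))) points))
        slope (/ sxy sxx)]
    {:slope slope :intercept (- my (* slope mx))}))

(defn residual
  "Vertical distance (signed) from the point to the fitted line."
  [{:keys [slope intercept]} [x y]]
  (- y (+ intercept (* slope x))))

(defn refit-without-farthest
  "Fits a line, removes the n-drop points with the largest absolute
   residual against that first fit, and returns a fit of the rest."
  [n-drop points]
  (let [first-fit (fit-line points)
        kept      (->> points
                       (sort-by #(Math/abs (double (residual first-fit %))))
                       (drop-last n-drop))]
    (fit-line kept)))

;; Hypothetical data on the line y = 10 - 2x, with one spike at x = 5.
(def points
  (for [x (range 10)]
    [(double x) (if (= x 5) 100.0 (- 10.0 (* 2.0 x)))]))

(refit-without-farthest 1 points)
```

With the spike removed, the refit recovers a slope close to -2 and an intercept close to 10. Dropping a fixed count is the simplest rule; a residual threshold would be an alternative when the number of spikes is not known in advance.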

👍 2

genekim04:02:21

+1 on @U051N6TTC's technique. I'd plot the values on a histogram, and if it looks reasonably bell-curvy, clipping out values 2 or 3 standard deviations from the mean might work well. But convince yourself that you’re genuinely trimming the tails/outliers.
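
For a quick look at the distribution without a plotting library, bucket counts may be enough. This is only a sketch; the `histogram` name and the bucket width are assumptions for illustration.

```clojure
(defn histogram
  "Counts values per bucket of the given width, keyed by each bucket's
   lower bound (as a double), in ascending order."
  [width xs]
  (->> xs
       (map #(* width (Math/floor (/ (double %) width))))
       frequencies
       (into (sorted-map))))

(histogram 5 [1 2 7 8 9 42])
;; => {0.0 2, 5.0 3, 40.0 1}
```

An isolated bucket far from the rest, like the one at 40.0 here, is the kind of tail that clipping or filtering would remove.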

mal02:11:43

I’d filter it out - you know unless you’ve just discovered another particle past the Higgs.

borkdude22:02:29

Anyone here got this watch - any good? https://shop.espruino.com/banglejs2

borkdude22:02:15

It seems like the thing I've been after for a while: good battery life, nothing fancy: time + pedometer, that's it. But this thing is even programmable.

Darin Douglass00:02:43

I have the original bangle but haven't used it much; it was too big. This looks much more manageable. I thought about making a bangle-cljs wrapper lib back when I first got it :p

adi04:02:40

that would be funny because it would wrap over your ... caREPL tunnel

🅱️ 2

🅾️ 2

0️⃣ 2

Stefan11:03:17

@U04V15CAJ Did you end up ordering one? Looks quite interesting.
