#data-science
2022-04-07
Cameron Kingsbury 20:04:46

Goal: linear interpolation of missing values while avoiding lookahead (using the scicloj/techascent stack). Inspiration for my attempt (but tell me if there is something simpler): tech.v3.dataset.rolling has mean, and I'm trying to create an extrapolate reducer in the same style to use with rolling... Broken impl:

(defn extrapolator
  "double extrapolation of data"
  (^double [data options]
   (if (== 0 (tech.v3.datatype.base/ecount data))
     Double/NaN
     (let [diffs (map - (rest data) data)
           {:keys [n-elems sum]} (tech.v3.datatype.reductions/staged-double-consumer-reduction
                                  :tech.numerics/+ options diffs)
           mean-diff
           (com.github.ztellman.primitive-math// (double sum)
                                                 (double n-elems))]
       (->> data
            (filter some?)
            last
            (+ mean-diff)))))
  (^double [data]
   (extrapolator data nil)))
(Basically the mean reducer with higher-level Clojure injected to apply the average slope to the last non-nil value.) It fails with "fn__90967 cannot be cast to class clojure.lang.Associative", so I probably have to provide a map somewhere instead of a function...
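For comparison, here is a minimal plain-Clojure sketch of the calculation described above: the mean difference between consecutive non-nil values, added to the last non-nil value. It does not use the tech.v3 reducer machinery at all. The name extrapolate-next is hypothetical, and the sketch assumes that nils are simply dropped before differencing, so a gap counts as a single step.

```clojure
(defn extrapolate-next
  "Last non-nil value plus the mean difference between consecutive
  non-nil values. Returns NaN when fewer than two non-nil values exist.
  Assumption: nils are dropped first, so a gap counts as one step."
  [data]
  (let [xs (vec (remove nil? data))]
    (if (< (count xs) 2)
      Double/NaN
      (let [diffs     (map - (rest xs) xs)
            mean-diff (/ (double (reduce + diffs)) (count diffs))]
        (+ (double (peek xs)) mean-diff)))))

;; usage
(extrapolate-next [1.0 2.0 nil 4.0])
```

Because the nils are removed before differencing, the subtraction never sees a nil, which is one place the original version would throw on data with gaps.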

chrisn 20:04:56

https://cnuernber.github.io/dtype-next/tech.v3.datatype.gradient.html#var-diff1d https://cnuernber.github.io/dtype-next/tech.v3.datatype.functional.html#var-mean If the data is a column then you can get a bitmap of the missing indexes via https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-missing. It appears you want to fill in the first missing value with an extrapolated value: the average difference of the existing values, added to the last known value. You may want to try the dataset https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-induction pathway, which should work for this.
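The row-by-row idea can be sketched in plain Clojure with reduce, where each missing value is estimated only from the values before it, so there is no lookahead. This is not the tech.v3.dataset induction API; the names next-estimate and fill-without-lookahead are hypothetical. It also assumes that previously filled values feed into later estimates, and that a single known value is carried forward unchanged.

```clojure
(defn next-estimate
  "Estimate the next value from what has been seen so far: last value
  plus the mean consecutive difference. With one value, carry it
  forward; with none, return nil."
  [seen]
  (let [xs (vec (remove nil? seen))]
    (cond
      (empty? xs)      nil
      (= 1 (count xs)) (peek xs)
      :else            (+ (peek xs)
                          (/ (reduce + (map - (rest xs) xs))
                             (double (dec (count xs))))))))

(defn fill-without-lookahead
  "Walk the data left to right, replacing each nil with an estimate
  built only from earlier (possibly already filled) values."
  [data]
  (reduce (fn [seen x]
            (conj seen (if (some? x) x (next-estimate seen))))
          []
          data))

;; usage
(fill-without-lookahead [1.0 2.0 nil nil 5.0])
```

The estimate is recomputed from the whole prefix at every gap, which is quadratic in the worst case; for long columns a running sum of differences would be the obvious refinement.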

Cameron Kingsbury 21:04:48

Thanks so much for the response!! I decided to give the clojure-only route a try last night after seeing , and now it is just a matter of piecing together everything from that outer wrapper and the tech.v3.dataset stuff. Simultaneously getting into time-series means that I'm probably reproducing a lot of things or taking the wrong approach.

Cameron Kingsbury 22:04:34

Here is my somewhat goofy approach to filling missing values using a rolling mean:

(require
 '[tech.v3.dataset :as tds]
 '[tech.v3.dataset.rolling :as roll]
 '[scicloj.ml.dataset :as ds])

;; fill with rolling mean
(ds/replace-missing
 unemp-rand-missing
 :unemp-rate
 :value
 (-> unemp-rand-missing
     (roll/rolling
      {:window-type :fixed
       :window-size 4
       :relative-window-position :left}
      {:mean (roll/mean :unemp-rate)})
     (ds/select-rows
      (tds/missing
       (:unemp-rate
        unemp-rand-missing)))
     :mean))

chrisn 20:04:40

Interesting. Honestly if it all works for you I think it is great. Lots of ways to get that done 🙂.

Cameron Kingsbury 21:04:30

There are already some interesting pieces of R I'd like to figure out in this domain, for example all.dates <- seq(from = start.date, to = end.date, by = "months") (how do they even standardize here?), and also the concept of "rolling joins" on time series indices. I'm using clojure.java-time, with YearMonth in one learning case. Given only a year and a month, that feels like the best practice for preserving the granularity of the original data. Maybe it is best to also create a timestamp version with defaults, to enable functions with built-in timestamp aggregation/manipulation? Are there any libraries focused on time series for this stack?
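One way to get a month-granular sequence similar in spirit to the R seq(...) call is to iterate java.time.YearMonth directly through interop. This is a minimal sketch: month-seq, all-dates, and the example start and end months are hypothetical, and it uses plain java.time rather than the clojure.java-time wrapper. The second def shows a date-based version that defaults every month to its first day.

```clojure
(import '(java.time YearMonth))

(defn month-seq
  "Inclusive sequence of YearMonth values from start to end, one per month."
  [^YearMonth start ^YearMonth end]
  (->> (iterate (fn [^YearMonth ym] (.plusMonths ym 1)) start)
       (take-while (fn [^YearMonth ym] (not (.isAfter ym end))))))

;; month-granular index (example bounds are made up)
(def all-dates
  (month-seq (YearMonth/of 2021 11) (YearMonth/of 2022 2)))

;; date-based version, defaulting to the first day of each month
(def all-dates-as-local-date
  (map (fn [^YearMonth ym] (.atDay ym 1)) all-dates))
```

Keeping YearMonth as the index preserves the original granularity, while the LocalDate version is there for functions that expect a full date.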