#data-science
2024-01-14
vonadz 15:01:33

Is the variable window in TMD/rolling meant for datasets that have missing values? I'm trying to understand whether it's the conventional solution for a problem I'm facing: I have a dataset of city production values for different time periods. Some cities have 2017-01 through 2022-10, while others are missing a month here or there, or only have a couple of months in total. There are around 27k cities, and I'm trying to figure out how to accurately say "this city had an X% change in production since last month and a Y% change since 12 months ago" for each city.

Daniel Slutsky 16:01:16

Here is an old experiment doing something like that: https://github.com/scicloj/sci-fu/tree/main/projects/index-experiments We used a tree-based index to quickly fetch, for every given row, all relevant rows of the last few days (or in general, of a certain time window). If you find this direction helpful, I can document it a bit better in a couple of days (and make sure it works with the current stack of libraries, etc.).

Daniel Slutsky 16:01:42

(Probably, the https://cnuernber.github.io/dtype-next/tech.v3.datatype.rolling.html#var-variable-rolling-window-ranges you mentioned should provide a faster solution, but I haven't tried it with dates, etc.)

chrisn 17:01:11

Grouping by city would be my first go-to for this problem. If missing is always interpreted as zero, then I might handle that in the query function applied to the result of the group-by.

chrisn 18:01:45

but maybe group-by [city month]

vonadz 18:01:19

What I'm doing is grouping by city, then doing a rolling function on the values in the group-by (2 for previous month, 13 for a year ago). I'm using this approach because I want the values for every time period. I haven't come up with a good way to guarantee last month or a year ago though. I don't think the variable window function does what I want, but still need to test something. I might resort to just generating 0 values for the time periods I'm missing.
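To make the "guarantee last month or a year ago" problem concrete, here is a minimal sketch in plain Clojure that avoids rolling windows entirely and looks up the exact calendar month instead. It is not the TMD approach discussed in this thread. The row shape (`:city`, `:month` as a "yyyy-MM" string, `:production`) and the helper names `pct-change` and `add-changes` are assumptions made for illustration.

```clojure
(import '(java.time YearMonth))

(defn pct-change
  "Percent change from old to new. Returns nil when either value is
  missing or when old is zero."
  [old new]
  (when (and old new (not (zero? old)))
    (/ (* 100.0 (- new old)) old)))

(defn add-changes
  "Hypothetical helper. rows is a seq of maps with :city, :month
  (a \"yyyy-MM\" string) and :production. Adds :mom-pct (versus the
  previous calendar month) and :yoy-pct (versus 12 months earlier).
  A missing comparison month yields nil rather than a neighbouring row."
  [rows]
  (let [;; index production by [city year-month] for exact lookups
        by-key (into {}
                     (map (fn [{:keys [city month production]}]
                            [[city (YearMonth/parse month)] production]))
                     rows)]
    (map (fn [{:keys [city month production] :as row}]
           (let [^YearMonth ym (YearMonth/parse month)]
             (assoc row
                    :mom-pct (pct-change (by-key [city (.minusMonths ym 1)])
                                         production)
                    :yoy-pct (pct-change (by-key [city (.minusMonths ym 12)])
                                         production))))
         rows)))
```

Because the lookup key is the calendar month itself, a gap in a city's history produces nil for that comparison instead of silently comparing against whichever row happens to sit one or twelve positions earlier.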

vonadz 19:01:43

Actually looks like the rolling function with the variable window works exactly how I need it to.

Daniel Slutsky 19:01:50

That is great to know 🙏

vonadz 19:01:25

Hmm actually, never mind. It seems there's an issue. I have this:

(-> (ds/->dataset [{:test 1 :time-period "2021-01-01"}
                   {:test 2 :time-period "2021-02-01"}
                   {:test 3 :time-period "2021-10-01"}
                   {:test 4 :time-period "2021-11-01"}
                   {:test 5 :time-period "2021-12-01"}
                   {:test 6 :time-period "2022-01-01"}
                   {:test 7 :time-period "2022-02-01"}]
                  {:parse-fn {:time-period :local-date}})
    (ds-rolling/rolling
     {:window-type :variable
      :column-name :time-period
      :units :months
      :window-size 13
      :relative-window-position :right}
     {:rolling
      {:column-name :time-period
       :reducer #(vec (take 100 %))}}))
This throws a casting error saying that String can't be cast to Number, and I'm not really sure why. Originally I had :time-period as plain ints, so 202101, 202102, 202103 (I'm not interested in the day). That seemed to work for values within a single year, but because of the numeric gap between 202112 and 202201, the window function doesn't treat those as two consecutive months and cuts the window off early.
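The year-boundary problem with the integer encoding can be seen with plain arithmetic: 202201 minus 202112 is 89, not 1. One possible workaround, assuming the integer encoding is kept, is to convert each value to a running month count first. The helper name `yyyymm->month-index` below is hypothetical and not part of any library mentioned here.

```clojure
(defn yyyymm->month-index
  "Maps an integer such as 202112 to a count of months
  (year * 12 + zero-based month), so that consecutive calendar
  months always differ by exactly 1."
  [yyyymm]
  (+ (* 12 (quot yyyymm 100))
     (dec (rem yyyymm 100))))

;; Distance between December 2021 and January 2022:
(- 202201 202112)                        ; raw ints: 89
(- (yyyymm->month-index 202201)
   (yyyymm->month-index 202112))         ; month index: 1
```

With an index like this, a numeric window of 13 spans the same number of calendar months whether or not it crosses a year boundary.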

Daniel Slutsky 07:01:15

I'll look in a few hours and try to propose something.

vonadz 08:01:47

Wow, thanks 🙂 I'm hacking on it most of today as well

vonadz 09:01:55

@U066L8B18 no need, I'm an idiot. There was a typo in my dataset initialization: I had :parse-fn instead of :parser-fn, so :time-period wasn't being parsed as a date. Now everything seems to work fine.

Daniel Slutsky 12:01:09

Ohh great, makes sense 🙏