#data-science
2023-11-21

Drew Verlee 00:11:47

What's the name of a graph that contains only nodes that are part of a cycle?

val_waeselynck 09:12:55

@U0DJ4T5U1 only just saw this! AFAICT, if what you mean is "any vertex of the graph is contained in some cycle subgraph", then this is equivalent to saying: the block decomposition of the graph contains only blocks of at least 3 vertices. (A block is a maximal non-separable subset of the edges; I'm assuming a simple undirected graph, i.e. no parallel edges.) EDIT: nope, sorry, that's wrong. The property is more like: "the blocks of at least 3 vertices cover all vertices". For example, in two triangles joined by a bridge, every vertex lies on a cycle, yet the bridge itself is a block of only 2 vertices. I don't know that there is a name for that kind of graph.

val_waeselynck 09:12:10

Proof sketch:
1. If every vertex is in a cycle, then that cycle is a subgraph of a block, which therefore has at least 3 vertices.
2. If a vertex is in a block of at least 3 vertices, then it has at least 2 incident edges in that block, and these lie on a common cycle (by non-separability).

👀 1
Drew Verlee 08:12:24

Interesting observation, val. Thanks. FWIW I figured out how to do the thing. Graph logic helped, but so did just doing the problem by hand like 50 times lol.

Benjamin 12:11:44

With PyTorch and CUDA, can I represent future data (a tensor)? I would like to load some data concurrently while my main CPU work is enqueuing work on the default stream. At time t1, the default stream should incorporate the loaded data (say the loaded data was prepared on a separate CUDA stream) into the current working tensor. I know I can synchronize between the default stream and the load stream with CUDA events, but I don't have the loaded tensor at hand when I build the default-stream pipeline on the host. Hence my desire for a tensor promise or some such.

vonadz 14:11:19

Anyone have suggestions on an idiomatic way to rank values across multiple datasets (tech.v3.dataset)? Each dataset is a unique city, with one row per month between 2017 and 2023 holding the average electricity rate and bill values (the means are 12-month rolling averages that I added via a window function, which was nice).

|      :rate |        :bill | :time-period | :city-id | :state-id |   :bill-mean | :rate-mean |
|-----------:|-------------:|-------------:|---------:|----------:|-------------:|-----------:|
| 0.12869940 | 120.47459840 |       201701 |     9434 |        50 | 120.47459840 | 0.12869940 |
| 0.13724871 |  90.26271599 |       201702 |     9434 |        50 | 117.95694153 | 0.12941184 |
| 0.14096601 |  94.16050992 |       201703 |     9434 |        50 | 115.76410083 | 0.13043406 |
I want to rank each city for each time period across all cities nationally, with something like Postgres' rank function. So far I've implemented a rank function that achieves that, but it requires me to break each dataset out into rows (ds/rows) and do it via maps. I was wondering if people have accomplished something similar using built-in dataset features.

genmeblog 15:11:38

There is a by-rank function in tablecloth (https://scicloj.github.io/tablecloth/#Other), though it sorts a dataset by the assigned rank.
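A minimal sketch of calling it, based on the examples in the linked docs (the exact signature, the rank-predicate style, and the :desc? option are all assumptions taken from those docs, so double-check against the current API):

(require '[tablecloth.api :as tc])

(def ds (tc/dataset {:city ["a" "b" "c"]
                     :rate [0.12 0.14 0.13]}))

;; rows whose :rate has rank 0, i.e. the highest rate
;; (ranking is descending by default per the docs)
(tc/by-rank ds :rate zero?)

;; ascending ranking instead
(tc/by-rank ds :rate zero? {:desc? false})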

vonadz 15:11:13

Ah cool, thanks. Not sure how useful it is, but I'll play around with it.

Harold 18:11:48

Perhaps something like this (assuming I haven't misunderstood your intention):

user> (require '[tech.v3.dataset :as ds])
nil
user> (def ds1 (ds/->dataset {:period (range 3) :v (repeatedly 3 rand)}
                             {:dataset-name :ds1}))
#'user/ds1
user> ds1
:ds1 [3 2]:

| :period |         :v |
|--------:|-----------:|
|       0 | 0.72157336 |
|       1 | 0.60308133 |
|       2 | 0.95952875 |
user> (def ds2 (ds/->dataset {:period (range 3) :v (repeatedly 3 rand)}
                             {:dataset-name :ds2}))
#'user/ds2
user> (def ds3 (ds/->dataset {:period (range 3) :v (repeatedly 3 rand)}
                             {:dataset-name :ds3}))
#'user/ds3
user> ds2
:ds2 [3 2]:

| :period |         :v |
|--------:|-----------:|
|       0 | 0.24824334 |
|       1 | 0.71106105 |
|       2 | 0.38359701 |
user> ds3
:ds3 [3 2]:

| :period |         :v |
|--------:|-----------:|
|       0 | 0.50819464 |
|       1 | 0.16713668 |
|       2 | 0.49927642 |
user> (-> (reduce (fn [eax ds]
                    (ds/concat eax (assoc ds :ds (ds/dataset-name ds))))
                  (ds/empty-dataset)
                  [ds1 ds2 ds3])
          (ds/group-by-column :period)
          (update-vals (fn [ds]
                         (-> (ds/sort-by-column ds :v >)
                             (assoc :rank (range))))))
{0 _unnamed [3 4]:

| :period |         :v |  :ds | :rank |
|--------:|-----------:|------|------:|
|       0 | 0.72157336 | :ds1 |     0 |
|       0 | 0.50819464 | :ds3 |     1 |
|       0 | 0.24824334 | :ds2 |     2 |
, 1 _unnamed [3 4]:

| :period |         :v |  :ds | :rank |
|--------:|-----------:|------|------:|
|       1 | 0.71106105 | :ds2 |     0 |
|       1 | 0.60308133 | :ds1 |     1 |
|       1 | 0.16713668 | :ds3 |     2 |
, 2 _unnamed [3 4]:

| :period |         :v |  :ds | :rank |
|--------:|-----------:|------|------:|
|       2 | 0.95952875 | :ds1 |     0 |
|       2 | 0.49927642 | :ds3 |     1 |
|       2 | 0.38359701 | :ds2 |     2 |
}
Now, if :v were integers and ties were possible, you'd need a slightly more clever way to assign the ranks (like Postgres' rank() does), but it looks like your data are floating point. hth (:
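If ties ever do show up, here's a minimal sketch of a tie-aware, Postgres-style rank over an already-sorted sequence (a hypothetical helper, not something in TMD):

(defn pg-rank
  "Postgres-style rank for a sorted sequence: ties share the rank of the
  first element of their group, and the next distinct value skips ahead."
  [xs]
  (first
   (reduce (fn [[ranks prev prev-rank] [i x]]
             (if (= x prev)
               [(conj ranks prev-rank) x prev-rank]
               [(conj ranks (inc i)) x (inc i)]))
           [[] ::none 0]
           (map-indexed vector xs))))

(pg-rank [9 7 7 3]) ;; => [1 2 2 4]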

vonadz 20:11:47

Oh, this is a pretty cool way of doing it. The values are floats, but they can be the same, and are pretty often; mostly a result of how the data is calculated. I do like your approach though, and I'll use it as inspiration going forward. Thanks a bunch! 🙂

Harold 21:11:26

You got it - interested to see what you come up with for rank; it could be something that belongs in TMD, so feel free to share if you like.

vonadz 18:01:09

@UJ7RSSWDU I came up with something like the below. It ranks with a :max-style tie strategy: all tied items share the last rank of their group (i.e. values 1 2 2 3 get ranks 1 3 3 4 respectively), like Postgres' rank() except that Postgres gives ties the first rank of their group (1 2 2 4).

(defn rank-ds
  "Adds rank-key to dataset: the rank of each row's sort-key value under
  comparer, where tied values share the greatest rank of their group."
  [dataset sort-key ^java.util.Comparator comparer rank-key]
  (let [sorted-ds (ds/sort-by-column dataset sort-key comparer)
        ;; value -> row indexes; assumes iteration follows first
        ;; occurrence, i.e. sort order
        grouped-data (ds/group-by-column->indexes sorted-ds sort-key)
        ;; value -> rank, accumulating group sizes so that ties take the
        ;; last rank of their group (values 1 2 2 3 => ranks 1 3 3 4)
        rank-map (into {} (reduce (fn [sum curr]
                                    (let [k (first curr)
                                          v (second curr)
                                          previous-rank (or (second (last sum)) 0)
                                          curr-rank (+ previous-rank (count v))]
                                      (conj sum [k curr-rank])))
                                  []
                                  grouped-data))]
    (ds/column-map sorted-ds rank-key #(get rank-map %) [sort-key])))
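A quick hypothetical check of the tie behaviour (toy data, ascending comparer):

(rank-ds (ds/->dataset {:v [1 2 2 3]}) :v < :v-rank)
;; the :v-rank column comes out as 1 3 3 4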

vonadz 19:01:20

What I would still like to figure out, though, is a better way to handle the situation where I need to rank across a bunch of different groups (every combination of time period, fuel category, and state). The current performance is pretty dismal, and I only have about 400k rows.

(defn add-production-state-rank
  [dataset]
  (let [time-periods (ds-column/unique (ds/column dataset :time-period))
        fuel-categories (ds-column/unique (ds/column dataset :category))
        state-ids (ds-column/unique (ds/column dataset :state-id))]
    (ds-join/pd-merge
     dataset
     (apply
      ds/concat
      (doall
       ;; NB: this filters the entire dataset once per combination of
       ;; time period, fuel category, and state - the hot spot
       (for [time-period time-periods
             fuel-category fuel-categories
             state-id state-ids]
         (as-> dataset uds
           (ds/filter uds
                      #(and (= state-id (:state-id %))
                            (= time-period (:time-period %))
                            (= fuel-category (:category %))))
           (rank-ds uds :production > :production-state-rank)
           ;; keep empty combinations as-is, otherwise trim to join keys
           (if (zero? (second (ds/shape uds)))
             uds
             (ds/select-columns uds [:city-id
                                     :time-period
                                     :category
                                     :production-state-rank]))))))
     {:on [:city-id :time-period :category]})))

genmeblog 19:01:47

This is how it can be done in tablecloth, in just three lines:

genmeblog 19:01:50

(def exdata (tc/dataset {:cat1 (repeatedly 100 #(rand-nth [:a :b :c :c]))
                         :cat2 (repeatedly 100 #(rand-nth ["A" "B" "A"]))
                         :value (repeatedly 100 #(rand-nth (range 10)))}))

exdata
;; => _unnamed [100 3]:
;;    | :cat1 | :cat2 | :value |
;;    |-------|-------|-------:|
;;    |    :c |     A |      6 |
;;    |    :a |     B |      1 |
;;    |    :b |     A |      2 |
;;    |    :b |     B |      3 |
;;    |    :c |     B |      9 |
;;    |    :a |     A |      6 |
;;    |    :c |     A |      3 |
;;    |    :c |     A |      3 |
;;    |    :c |     A |      8 |
;;    |    :a |     A |      5 |
;;    |   ... |   ... |    ... |
;;    |    :a |     B |      2 |
;;    |    :c |     B |      5 |
;;    |    :b |     A |      0 |
;;    |    :a |     A |      3 |
;;    |    :b |     A |      0 |
;;    |    :c |     B |      6 |
;;    |    :b |     A |      9 |
;;    |    :b |     A |      5 |
;;    |    :b |     B |      4 |
;;    |    :b |     A |      5 |
;;    |    :c |     A |      0 |


(-> exdata
    (tc/group-by [:cat1 :cat2])
    (tc/add-column :rank #(tablecloth.api.utils/rank (% :value) :max))
    (tc/ungroup))
;; => _unnamed [100 4]:
;;    | :cat1 | :cat2 | :value | :rank |
;;    |-------|-------|-------:|------:|
;;    |    :c |     A |      6 |    20 |
;;    |    :c |     A |      3 |    10 |
;;    |    :c |     A |      3 |    10 |
;;    |    :c |     A |      8 |    26 |
;;    |    :c |     A |      9 |    30 |
;;    |    :c |     A |      8 |    26 |
;;    |    :c |     A |      2 |     7 |
;;    |    :c |     A |      9 |    30 |
;;    |    :c |     A |      5 |    17 |
;;    |    :c |     A |      7 |    23 |
;;    |   ... |   ... |    ... |   ... |
;;    |    :a |     A |      4 |     8 |
;;    |    :a |     A |      9 |    15 |
;;    |    :a |     A |      0 |     1 |
;;    |    :a |     A |      5 |    10 |
;;    |    :a |     A |      7 |    13 |
;;    |    :a |     A |      7 |    13 |
;;    |    :a |     A |      4 |     8 |
;;    |    :a |     A |      2 |     3 |
;;    |    :a |     A |      3 |     5 |
;;    |    :a |     A |      1 |     2 |
;;    |    :a |     A |      3 |     5 |

genmeblog 19:01:43

tablecloth.api.utils/rank is 0-based, so you need to add 1 afterwards.

genmeblog 19:01:30

(tablecloth.api.utils/rank [1 2 2 3] :max) ;; => (0 2 2 3)
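So, adapting the earlier grouped snippet to 1-based ranks might look like this (a sketch, same :max tie strategy):

(-> exdata
    (tc/group-by [:cat1 :cat2])
    (tc/add-column :rank #(map inc (tablecloth.api.utils/rank (% :value) :max)))
    (tc/ungroup))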

genmeblog 19:01:20

the doc for tablecloth.api.utils/rank is as follows (if you need other tie strategies):

Sample ranks. See R docs.
  Rank uses 0-based indexing.

  Possible tie strategies: `:average`, `:first`, `:last`, `:random`, `:min`, `:max`, `:dense`.
  `:dense` is the same as in `data.table::frank` from R.
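For instance, with the same input as above, :dense should collapse the gaps (assuming the same 0-based convention):

(tablecloth.api.utils/rank [1 2 2 3] :dense) ;; => (0 1 1 2)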

vonadz 19:01:47

@U1EP3BZ3Q jesus, you have to be kidding me haha. How did I miss that? Thank you so much. There goes a day though haha.

😁 1
genmeblog 19:01:03

The truth is that rank is not exposed anywhere in the docs (yet). It will probably land in the columns API (in progress).

vonadz 09:01:50

Fair enough, but I still shouldn't have missed the grouping / ungrouping functionality

vonadz 09:01:10

This is such a life saver though, it's amazing. I owe you a beverage

vonadz 11:01:25

@U1EP3BZ3Q are you aware of any implementation of rolling window functions in tablecloth? I searched the source code on github and couldn't find any, but wanted to double check since I missed the rank function.

vonadz 17:01:21

Thanks! Yeah currently using TMD for it, just wanted to make sure tablecloth didn't have anything.

genmeblog 18:01:30

I'll probably add it to the TC API in the next release. I have some functionality relating to kernel filtering/smoothing in fastmath which might be worth porting.

❤️ 1
vonadz 20:01:06

That would be awesome. It would be great DX to be able to run the rolling functions across grouped items, like you demonstrated with rank.

👍 1
Harold 17:01:18

Cool! 😎

vonadz 15:01:07

In case this is interesting for anyone in the future, I implemented the rolling window with group-by like this:

(defn add-rolling-column
  "Adds new-column to dataset by applying reducer over a rolling window of
  existing-column, computed independently within each group-by-keys group."
  [dataset
   group-by-keys
   window-size
   relative-window-position
   existing-column
   new-column
   reducer]
  (-> dataset
      (tc/group-by group-by-keys)
      ;; work on the grouped dataset itself: each row's :data entry is one
      ;; sub-dataset, so map the rolling window over those sub-datasets
      (tc/without-grouping->
       (tc/map-columns :data
                       :data
                       (fn [das]
                         (ds-rolling/rolling
                          das
                          {:window-type :fixed
                           :window-size window-size
                           :relative-window-position relative-window-position}
                          {new-column {:column-name existing-column
                                       :reducer reducer}}))))
      (tc/ungroup)))
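A hypothetical call, assuming ds-rolling is tech.v3.dataset.rolling and dfn is tech.v3.datatype.functional: a trailing 12-row mean of :rate within each city (using :left as the relative window position for a trailing window is my assumption about the TMD rolling API, so verify against its docs):

(add-rolling-column dataset [:city-id] 12 :left :rate :rate-mean dfn/mean)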