data-science

pwrflx 2025-02-10T10:15:11.226819Z

hi! Is there an initiative similar to cuDF in Python? (GPU accelerated dataframe)

rgm 2025-02-10T23:44:47.522179Z

Hi everybody ... I have another tech.ml.dataset question. I do not appear to really understand what's going on with ds/row-map. I'm trying to derive missing values in one column based on the contents of another (details in thread).

Harold 2025-02-20T16:56:14.329989Z

> I have a code convention I try to stick to with ->> and -> where I'm not allowed to change the shape mid-thread

Keeping this stuff simple is definitely a good idea (especially when we expect someone to need to read the code later). One area where this pops up for me a lot is with ds/group-by-*, which produces a map instead of a dataset, and I find that I'm often -> into ds/group-by-* and then want to ->> from there... Most commonly, I actually just do this, and nest them. Maybe the nesting sort of 'follows' the change of shape?

👍 1
rgm 2025-02-20T18:37:06.486809Z

I find my own style has evolved to extracting either a let binding or a function once the (-> ,,, (->> ,,,)) gets going, based on past regrets.

👍 1
rgm 2025-02-20T18:38:09.854999Z

or, (as-> x $ ,,,) is a catch-all signal for "here be cleverness; consider some more coffee first"

rgm 2025-02-17T17:43:00.851649Z

ohhh, ok, yeah ... this is not the first time I've confused myself by not lifting my brain up into thinking about whole datasets at once

rgm 2025-02-17T17:44:50.239939Z

I find it pretty easy to accidentally flip over from column-thinking (ds) into row-thinking (seq of maps)

👍 1
rgm 2025-02-17T17:46:13.100199Z

I have a code convention I try to stick to with ->> and -> where I'm not allowed to change the shape mid-thread

rgm 2025-02-17T17:47:26.244929Z

I'm mulling one where maybe I just don't thread datasets and instead it has to be (ds/bind-> ,,,) as a signal to be more careful

Harold 2025-02-12T17:07:31.212459Z

Thought about this a little more and the 'surprising' behavior ultimately makes sense. I put some comments in the issue: https://github.com/techascent/tech.ml.dataset/issues/452

If anyone has ideas about how to improve the row-map docs, there's probably room for improvement there.

rgm 2025-02-10T23:54:38.468899Z

I had understood a dataset as something I could thread and run row-map on in multiple passes. But it seems like I've accidentally set up a "last pass wins" situation ... I'd have expected the other article numbers to fill in but they stay blank (see 2 in code here). Also, I'm trying to understand these prn statements ... it makes sense to me that each row-map is running 1x per row, so 9 statements, but I'm struggling to understand what I'm doing that's blowing away previously added data.

rgm 2025-02-10T23:57:28.658849Z

hm, never mind ... it seems that returning the whole row gives me the behaviour I want. Now I'm trying to figure out where I got the idea that I could return just a patch from ds/row-map

Harold 2025-02-11T00:39:26.842799Z

Perhaps some of the confusion is coming from the fact that backfill-description sometimes returns nil (!)...

rgm 2025-02-11T00:42:15.035699Z

could be ... where it got odd and confusing is that I had another row-map function in a when (so it could return nil), BUT the patch that it returned didn't collide with any existing column names, i.e. it added columns instead of mutating ones that existed before the row-map

rgm 2025-02-11T00:43:07.082789Z

oh here's where I got the idea of just returning the columns that I was interested in changing

rgm 2025-02-11T00:43:16.772249Z

anyway, I can assoc in; no big deal

Harold 2025-02-11T00:43:47.746579Z

In general, you don't need to return a map with all the keys:

user> (-> (ds/->dataset {:a [1 2 3]})
          (ds/row-map (fn [{:keys [a]}]
                        {:b (inc a)})))
_unnamed [3 2]:

| :a | :b |
|---:|---:|
|  1 |  2 |
|  2 |  3 |
|  3 |  4 |

user> (-> (ds/->dataset {:a [1 2 3]})
          (ds/row-map (fn [{:keys [a]}]
                        (when (= 2 a)
                          {:b (inc a)}))))
_unnamed [3 2]:

| :a | :b |
|---:|---:|
|  1 |    |
|  2 |  3 |
|  3 |    |

rgm 2025-02-11T00:44:41.826949Z

right, and both of those examples are adding a :b column where none existed before the row-map

Harold 2025-02-11T00:45:13.560949Z

Ah! Perhaps this is the surprising behavior:

user> (-> (ds/->dataset {:a [1 2 3]})
          (ds/row-map (fn [{:keys [a]}]
                        (when (= 2 a)
                          {:a 0 :b (inc a)}))))
_unnamed [3 2]:

| :a | :b |
|---:|---:|
|    |    |
|  0 |  3 |
|    |    |

rgm 2025-02-11T00:45:38.472369Z

yes!

Harold 2025-02-11T00:46:00.554779Z

hm, happens even if it's the first row:

user> (-> (ds/->dataset {:a [1 2 3]})
          (ds/row-map (fn [{:keys [a]}]
                        (when (= 1 a)
                          {:a 0 :b (inc a)}))))
_unnamed [3 2]:

| :a | :b |
|---:|---:|
|  0 |  2 |
|    |    |
|    |    |

rgm 2025-02-11T00:47:27.768799Z

I could be wrong but it seems like the distinction is in touching the existing column, versus adding a new one?

Harold 2025-02-11T00:47:28.654459Z

Unsure if this is satisfying:

user> (-> (ds/->dataset {:a [1 2 3]})
          (ds/row-map (fn [{:keys [a] :as row}]
                        (if (= 1 a)
                          {:a 0 :b (inc a)}
                          row))))
_unnamed [3 2]:

| :a | :b |
|---:|---:|
|  0 |  2 |
|  2 |    |
|  3 |    |

Harold 2025-02-11T00:48:01.024129Z

Seems to be 'touching the existing column sometimes and returning nil sometimes'...

rgm 2025-02-11T00:48:42.600389Z

well, it seems like returning a complete row unconditionally gets me over some very confusing behaviour, maybe at some hard to measure performance cost

rgm 2025-02-11T00:49:06.982979Z

but my datasets are in the 10k-100k lines range here so it's fine

Harold 2025-02-11T00:49:56.621829Z

yeah, row-map is fast, and returning an unchanged row is likely to be fast as well.

rgm 2025-02-11T00:50:27.353949Z

maybe it all just mixes badly with missing values

Harold 2025-02-11T00:52:06.436729Z

it could be a bug, we can dig into it: https://github.com/techascent/tech.ml.dataset/issues/452

Thanks for bringing it up!

🙏 1
rgm 2025-02-11T00:52:27.952619Z

yw! Thanks for talking it through with me

🙇 1
Harold 2025-02-11T00:52:45.713889Z

glad there's a sensible workaround as well.

rgm 2025-02-11T00:53:29.795189Z

(love the lib, by the by ... been replacing a bunch of brittle docjure with 1/3 to 1/2 the LOC by keeping things as datasets as long as possible)

👍 1
Harold 2025-02-11T00:55:03.714349Z

Nice! To me this is always one of the selling points - once things are up and running it's easy to consume a variety of formats. Years ago we wrote a lot of one-off stuff to ingest excel over and over, now (like you're saying) we do it a lot more quickly and reliably.

👍 1
rgm 2025-02-11T00:59:11.164649Z

perhaps of interest: I was delighted to find that I can rename these columns with the stacked headers using a column header map of e.g. {:arrangement "2 Screw" :spacing-vertical 16.0} or {:arrangement "2 Screw" :spacing-vertical 24.0}, and then it's a fairly straightforward mapcat to unroll the row into one map per cell.

Harold 2025-02-11T01:01:18.211649Z

That sounds rad - we've also found it very powerful to have the ability to do Object columns with arbitrary clojure data in them. The fact that .nippy serialization can maintain these kinds of arrangements with acceptable freeze/thaw times gives us flexibility that isn't frequently seen in other systems/languages.

🌟 1
rgm 2025-02-11T01:02:19.761399Z

all hail value equality semantics

🟰 1